Module 2: Subsample alignment

This module requires two rules: one to generate randomly subsampled data, and another to align these data. We will develop the first as a module, and then combine it with a provided module for alignment. These rules must apply to each seed specified in the seeds.txt file from the previous step, and the analysis for each seed is expected to be deposited in its own folder.

Let’s start with the configuration file. The NEXTFLOW pipeline defines some parameters for this analysis. Here, we will simply copy those across, while adding our input_namespace and output_namespace parameters.

The file config/config.yaml should therefore look like this:

input_namespace: in
output_namespace: out
params:
  n_random: 50
  master_fasta: resources/beta.fasta
  master_metadata: resources/beta.metadata.tsv

The collation of workflow parameters under the params heading has no special meaning and can be removed if desired, but can help to maintain structure and readability, especially when workflows increase in complexity.
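
To make the effect of the params heading concrete, the sketch below mirrors config/config.yaml as a plain Python dictionary, which is how the configuration appears to the Snakefile through the config variable. The flat_config variant is a hypothetical alternative layout, shown only for comparison; it is not part of this tutorial's files.

```python
# Nested layout, mirroring config/config.yaml above.
config = {
    "input_namespace": "in",
    "output_namespace": "out",
    "params": {
        "n_random": 50,
        "master_fasta": "resources/beta.fasta",
        "master_metadata": "resources/beta.metadata.tsv",
    },
}

# Hypothetical flat layout: the same settings without the params heading.
flat_config = {
    "input_namespace": "in",
    "output_namespace": "out",
    "n_random": 50,
    "master_fasta": "resources/beta.fasta",
    "master_metadata": "resources/beta.metadata.tsv",
}

# Only the lookup path differs between the two layouts.
n_nested = config["params"]["n_random"]
n_flat = flat_config["n_random"]
print(n_nested == n_flat)
```

Whichever layout is chosen, the lookups in the Snakefile have to match it; the rest of this tutorial uses the nested form.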

This analysis also requires two auxiliary files. The first (beta.fasta) can only be obtained with access to GISAID, but the second (beta.metadata.tsv) is available through the vocpl GitHub repository. Create a subsample_alignment/resources folder and copy these files into that folder now.

Random subsampling

Let us examine the random subsampling rule. Written as raw shell commands, it consists of three lines:

head -n1 resources/beta.metadata.tsv > results/out/subsample_metadata.tsv
shuf -n 50 resources/beta.metadata.tsv >> results/out/subsample_metadata.tsv
tail -n +2 results/out/subsample_metadata.tsv | cut -f1 > results/out/subsample_ids.tsv
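
For readers less familiar with these shell tools, the following is a rough Python equivalent of the three commands, offered as a sketch rather than as part of the workflow. The function name subsample_metadata and the seed argument are hypothetical additions. One deliberate difference: shuf draws from every line of the file, so the header row can end up among the sampled lines, whereas this sketch samples only from the rows after the header.

```python
import random


def subsample_metadata(master_path, metadata_out, ids_out, n_random, seed=None):
    """Write the header plus n_random random rows, and the first column of those rows."""
    rng = random.Random(seed)  # seed is a hypothetical addition for reproducibility

    with open(master_path) as handle:
        lines = handle.read().splitlines()
    header, records = lines[0], lines[1:]

    # Sample without replacement, capped at the number of available records.
    chosen = rng.sample(records, min(n_random, len(records)))

    # Equivalent of the head and shuf steps: header first, then the sampled rows.
    with open(metadata_out, "w") as handle:
        handle.write("\n".join([header] + chosen) + "\n")

    # Equivalent of the tail and cut steps: first tab-separated field of each sampled row.
    with open(ids_out, "w") as handle:
        for line in chosen:
            handle.write(line.split("\t")[0] + "\n")

    return chosen
```

A call such as subsample_metadata("resources/beta.metadata.tsv", "results/out/subsample_metadata.tsv", "results/out/subsample_ids.tsv", 50) would mirror the fixed filenames used above, assuming the results/out folder already exists.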

We want to substitute dynamic names in place of fixed names, making use of Snakemake’s input, output and params directives. We would then obtain:

head -n1 {input.master_metadata} > {output.subsample_metadata}
shuf -n {params.n_random} {input.master_metadata} >> {output.subsample_metadata}
tail -n +2 {output.subsample_metadata} | cut -f1 > {output.subsample_ids}
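
To see that the templated commands reduce to the fixed ones, the substitution can be imitated with Python's built-in str.format. This is only an illustration: Snakemake performs the substitution itself when it runs a rule, and the SimpleNamespace objects below are hypothetical stand-ins for the input, output and params objects it builds from the rule's directives.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the objects Snakemake derives from a rule's directives.
inputs = SimpleNamespace(master_metadata="resources/beta.metadata.tsv")
outputs = SimpleNamespace(
    subsample_metadata="results/out/subsample_metadata.tsv",
    subsample_ids="results/out/subsample_ids.tsv",
)
params = SimpleNamespace(n_random=50)

# The templated commands, exactly as written above.
template = (
    "head -n1 {input.master_metadata} > {output.subsample_metadata}\n"
    "shuf -n {params.n_random} {input.master_metadata} >> {output.subsample_metadata}\n"
    "tail -n +2 {output.subsample_metadata} | cut -f1 > {output.subsample_ids}"
)

# str.format resolves attribute lookups such as {input.master_metadata}.
commands = template.format(input=inputs, output=outputs, params=params)
print(commands)
```

With these stand-in values, the printed commands are the same three lines as the fixed-name version.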

Now, let’s place this set of commands inside a Snakemake rule and provide the necessary directives:

rule random_subsample_ids_metadata:
    input:
        master_metadata = expand(
            srcdir("../{master_metadata}"),
            master_metadata=config["params"]["master_metadata"],
        ),
    output:
        subsample_ids = expand(
            "results/{outdir}/subsample_ids.tsv",
            outdir=config["output_namespace"],
        ),
        subsample_metadata = expand(
            "results/{outdir}/subsample_metadata.tsv",
            outdir=config["output_namespace"],
        ),
    params:
        n_random=config["params"]["n_random"],
    shell:
        """
        head -n1 {input.master_metadata} > {output.subsample_metadata}
        shuf -n {params.n_random} {input.master_metadata} >> {output.subsample_metadata}
        tail -n +2 {output.subsample_metadata} | cut -f1 > {output.subsample_ids}
        """

The params.n_random entry is the simplest to understand, since it reads directly from the configuration file (config["params"]["n_random"]). We define both subsample_ids and subsample_metadata as named outputs for convenience; they specify the output filenames subsample_ids.tsv and subsample_metadata.tsv respectively, both deposited in the folder "results/{outdir}". Note that outdir refers to config["output_namespace"] here, so this is the destination folder that GRAPEVNE will provide. The resource file master_metadata (again, named for convenience) is specified with srcdir since we are providing the file with our analysis (see the previous tutorial step if this is unclear). It is worth taking a moment to ensure that you are comfortable with this rule before continuing.
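
If the role of expand in the rule is unclear, the sketch below imitates its behaviour for the simple case used here. The function expand_sketch is a hypothetical stand-in written for this illustration, not Snakemake's implementation; it fills a pattern with every combination of the supplied values and treats a plain string as a single value.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill a pattern with every combination of the given wildcard values."""
    # Treat a plain string as one value rather than iterating over its characters.
    values = {
        name: [value] if isinstance(value, str) else list(value)
        for name, value in wildcards.items()
    }
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]


# Stand-in for the relevant part of the configuration.
config = {"output_namespace": "out"}

paths = expand_sketch(
    "results/{outdir}/subsample_ids.tsv",
    outdir=config["output_namespace"],
)
print(paths)  # ['results/out/subsample_ids.tsv']
```

Because each call in the rule supplies a single value, each named output resolves to a list containing one path inside the output namespace folder.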

We have now not only converted this three-line instruction into a valid Snakemake rule, but at the same time made it GRAPEVNE compliant, simply by specifying input_namespace and output_namespace in the configuration and ensuring that all relevant input and output folders reference them in the Snakefile. Instead of building out the particulars of this rule further, let us combine this module with some others that have already been built to produce our desired subsample_alignment module.

You will note that, so far, neither of the modules that we have written depends on input from any other module. Instead, both depend only on pre-existing files that are provided as resources with those rules. In the next step we will look at input namespaces for situations where we receive input from one or more other modules.