Data dependent rules

Data dependent rules#

In this section we will discuss file dependencies, and file-content dependencies. Note that these are not Phyloflow-specific issues, and instead reflect more general Snakemake considerations that we feel it is important to be aware of when building modular workflows.

As stated previously, a valid Snakemake rule typically consists of an input directive, an output directive, and a shell (or run) directive. Within the Snakemake rule these directives tend to appear in this order. However, this ordering can be deceptive as it implies that the presence of input files drives the generation of output files. But, Snakemake is a build system which works by determining which inputs are require to build a set of target ouputs (not the other way around!).

To conceptualise this more clearly, considered the following file structure:

|- a.txt
|- b.txt
|- c.txt

If we wanted to copy the contents of these files to a set of similarly named .backup files, then we could conceptually write the following Snakefile:

rule echo:
    input:
        "{name}.txt"
    output:
        "{name}.txt.backup"
    shell:
        "echo {input} > {output}"

However, when attempting to execute this pipeline Snakemake will produce the following error "WorkflowError: Target rules may not contain wildcards.". This is because Snakemake determines which rules to run based on a set of file targets (as opposed to processing a series of instructions on some input data [which is more analogous to NEXTFLOW channels]).

The simplest way to resolve this issue is to provide a default rule (the first rule in a Snakefile is the default… by default). If we therefore add a target rule at the top of our file:

rule target:
    input:
        "a.txt.backup",
        "b.txt.backup",
        "c.txt.backup",

then we are stating our desired targets, and Snakemake is then able to make use of the echo rule (with its wildcards) to derive a ruleset that can produce a.txt.backup, given our rule (echo) and the fact that a.txt exists in the current folder. However, without this target directive, Snakemake doesn’t know what it’s target is, so it cannot derive a ruleset to create it.

There are some more elegant ways to write this, such as specifying the names in a list (i.e. expand("{name}.txt.backup", name=['a', 'b', 'c'])), but they all suffer from the same issue - they are not data-dependent, i.e. you must know the names of the desired files before you launch your pipeline and provide them as your targets.

This would seem a serious impediment, especially if we are interested in building modular workflows where depedencies cannot be known beforehand. However, in practise simple workflows rely on consistent naming conventions and are unaffected, and in-fact functionality does exist within Snakemake to overcome this problem for more complex workflows, namely through the use of checkpoints. Briefly, checkpoints allow data-dependent execution rules to exist by re-evaluating the rule-chain at the end of these rules. This allows targets to be regenerated once rules complete. The use of checkpoints requires some finese, but we will demonstrate their utility using our seeds.txt file.

Checkpoints with seeds.txt#

Given our seeds.txt file, we want to perform an analysis on each item in this list. But this is a data-dependency - we do not know the contents of the file before the workflow runs. To overcome this issue we introduce a checkpoint into our workflow and produce target rules based on the output of those checkpoints.

To demonstrate their utility, here is a truncated and annotated Snakefile for a more all-encompassing subsample_alignment module.

configfile: "config/config.yaml"
outdir = config["output_namespace"]

# The 'checkpoint' rule is executed once seeds.txt is available.
# We make a copy of the file in this case as only the output
# from the checkpoint is accessible later on...
checkpoint get_seeds:
    input:
        expand("results/{indir}/seeds.txt",
               indir=config["input_namespace"])
    output:
        expand("results/{outdir}/seeds.txt",
               outdir=outdir)
    shell:
        "cp {input} {output}"

# Standard Python function to read the contents of a file and
# return its contents as a list. This will be a list of seeds
# in our case. We could also have written this as a lambda
# function within a rule, but keeping it separate is clearer.
def read_seeds_file(wildcards):
    with open(checkpoints.get_seeds.get().output[0], "r") as file:
        return file.read().splitlines()

# This rule uses the seeds list from read_seeds_file()
# to create a (data-dependent) list of file dependencies
rule target:
    input:
        expand(
            "results/{outdir}/s{key}/subsample_aln.fasta",
            outdir=outdir,
            key=read_seeds_file,
        )
    default_target: True

# This rule tells Snakemake how to generate the new target files
# and could be specified with other input dependencies as required,
# leading to the formation of rule chains.
rule subsample_alignment:
    output:
        expand(
            "results/{outdir}/s{{key}}/subsample_aln.fasta",
            outdir=outdir,
        ),
    # `inputs`, `params` and `shell` directives removed for brevity.
    # Instead, we just echo the seed name into the output file as a demo
    shell:
        "echo {{key}} {output}"

The key here is that the Python function read_seeds_file depends on (and has access to) the output(s) of the checkpoint rule. Execution will wait until this succeeds, at which point read_seeds_file is run, which provides build targets for the rule target. In our example, those build rules specify which seeds (or ‘key’ wildcards as they are represented above) are generated. Without this arrangement Snakemake would not know which seeds to analyse.