Validation checks#
There are many forms of validation can be applied to a workflow. These range from assessing whether modules are compatibile with one another, to data-depedent validation procedures. The latter (data-depedent validation) can generally only be assessed when data become available in the workflow, and as such are best handled by Utility modules that are designed to verify file-formats, file-contents, etc.
Module compatibility on the other hand, can be assessed during construction of the workflows directly in the GRAPEVNE Builder. This is achieved by checking which depedencies are being fullfilled - and which are not - for a particular module.
We can demonstrate this process by making use of the Download and Filter
modules from the Builder Tutorial set.
Open the GRAPEVNE Builder, or clear the current graph (performing either action
will automatically delete any previous Test builds). Ensure that your
repository is set to Directory Listing (github) and kraemer-lab/vneyard,
then click Get Module List. Once the modules load, filter the list to include
only those belonging to the Tutorial Builder group.
Now, drag the Download module into the main graph building area, but this
time leave the default values as they are. Now, drag a Filter module into
the graph and connect the Out port of the Download module to the In port
of the Filter module.
Notice that when you draw this connection, the Filter module will turn grey
for several second and, in this case will then turn red
A red module indicates that its dependencies are not being met given its current inputs. However, we constructed exactly this graph earlier without issue, so what is the problem?
The problem lies in the fact that the Filter module is (by default) set to
look for (and read) from a file called data.csv. As the Download module is
currently set to retrieve the Snakemake logo (a .png file), this dependency
is not being fullfilled, and so this is indicated by the module turning red.
To resolve the problem, in this case, we want to change the output filename
from the Download module so that it remains compatible with the Filter
module. To do this, click on the Download module and change the filename
parameter (config-params-filename) to data.csv. Note that the module remains
red for the time being.
Module dependencies are assessed when new links are made between modules - but
this process is quite computationally intensive, so to re-run the validation
check it is necessary to open the module of interest (by clickng on it; click
on the Filter module now),
and then click on the Validate button that appears at the top of the
information panel. Do this and you should see the module first turn grey
(indicating that the validation check has started), and then turn Blue,
indicating that all is well and the module dependencies are being met through
its inputs.
Note
In practise, these checks are implemented (in the background) by attempting to build a workflow graph for the specific module in isolation. This produces a list of dependencies that are not currently being fullfilled. We then build another workflow graph, this time including its parent connections (its immediate, first-order inputs). If those modules fullfill all of the original missing dependencies (acknowledging that they will report their own [separate] dependencies that may be missing), then we know that the target module has its dependencies fullfilled. By applying this philosophy iteratively throughout the graph, we can assess whether each module is being provided with appropriate data and can highlight the specific modules that are not.
It is also worth noting that although the modules are now seen as compatible by
GRAPEVNE, we did not actually change the url of the file that we are downloading,
therefore if we ran this workflow now we would receive an error since the
Filter module would attempt to open data.csv as a comma-separated-variable
file, whereas it is infact just a renamed png image. This highlights the
fundamental difference between
a dependency that we can assess during construction (basic syntax and
requirements checking), versus a data-dependent check, that requires the
workflow to run in order to load and validate data files, to ensure that the
data being received is of the correct type and in the correct format.
We could in-fact place a very simple data-validation module into our pipeline
in-between the Download module and the Filter module. Or (as would perhaps
be more fitting in this example) we could apply data validation checks during
the reading of the csv file itself, as part of the Filter module. Either
option is available depending on the requirements of the situation.