Basic snakemake tutorial

With snakemake you can run a workflow that automatically executes the steps needed to process your data. The workflow is pre-defined in workflow/Snakefile and can be customized using the provided configuration file.

The workflow searches the specified path for dataset folders and looks for TIFF files in each of them. Each dataset is processed in parallel, and the results are saved in the specified output folder, called derivatives.
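As an illustration, the layout could look like this (folder and file names here are placeholders; the exact paths depend on your configuration):

/path/to/data/
    dataset_01/
        recording.tif
    dataset_02/
        recording.tif

/path/to/derivatives/
    dataset_01/
        suite2p/
            plane_0/
                F.npy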

Why use snakemake?

Snakemake is a powerful workflow management system that lets you run complex data analysis pipelines reproducibly and efficiently. For each rule (a step in the workflow, for instance running suite2p), Snakemake checks whether the output files already exist and are up to date. If they are not, it runs the rule and creates the output files. This way you can rerun only the parts of the workflow that need updating, without re-executing the entire analysis pipeline each time.
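Concretely, because each rule declares its inputs and outputs, Snakemake can compare their timestamps to decide what needs rerunning. A minimal sketch, assuming the hypothetical paths from above:

# touching an input makes it newer than the downstream outputs,
# so the next run re-executes only the rules that depend on it
touch /path/to/data/dataset_01/recording.tif
snakemake --jobs 1 all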

Dry Run

A dry run is a simulation of the workflow that shows you what would happen if you ran it, without actually executing any commands. This is useful for checking that everything is set up correctly before running the workflow. The output is a DAG, i.e. a directed acyclic graph, that shows the dependencies between the different rules in the workflow. You can also see which files will be created and which rules will be executed. For rules to be linked together, input and output names must match: rule A is linked to rule B if the output of rule A is the input of rule B.
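For example, in the hypothetical sketch below, rule suite2p is linked to rule convert because it consumes the file that convert produces (rule names, paths, and shell commands are placeholders, not the actual workflow rules):

# illustrative rules only; all names are placeholders
rule convert:
    input:
        "data/{dataset}/raw.bin"
    output:
        "data/{dataset}/recording.tif"
    shell:
        "convert_to_tiff {input} {output}"  # placeholder command

rule suite2p:
    input:
        "data/{dataset}/recording.tif"  # matches the output of rule convert
    output:
        "derivatives/{dataset}/suite2p/plane_0/F.npy"
    shell:
        "run_suite2p {input} {output}"  # placeholder command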

To preview the workflow without running it:

snakemake --jobs 1 all --dry-run

all is not a keyword but the conventional name of the default target rule: it is usually the first rule in the Snakefile and requests all final outputs, so asking for it runs the entire workflow. The --jobs argument specifies the number of jobs to run in parallel; here we run one job at a time, and you can increase the number to run multiple jobs at once. --dry-run can be abbreviated to -n, and is often combined with -p (as -np) to also print the shell commands that would be executed.
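A conventional all rule simply lists every final output as its input, for example (a hypothetical sketch; the dataset names are placeholders):

# DATASETS would normally be discovered from the data folder
DATASETS = ["dataset_01", "dataset_02"]

rule all:
    input:
        expand("derivatives/{dataset}/suite2p/plane_0/F.npy", dataset=DATASETS)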

To actually run the workflow, skip the --dry-run argument:

snakemake --jobs 5 all

If you wish to force the re-execution of a given rule, use the --forcerun argument followed by the name of the rule you want to rerun. For example, to rerun the rule suite2p:

snakemake --jobs 5 all --forcerun suite2p

You can also rerun a specific dataset by specifying the output of interest:

snakemake --jobs 1 /path/to/derivatives/dataset_name/suite2p/plane_0/F.npy

This will trigger every step needed to create the file F.npy in the specified dataset folder.

Once you have tested the workflow locally, you can add further arguments to submit the jobs to a cluster. If you are using SLURM, run:

snakemake --executor slurm --jobs 5 all
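Per-job resources can also be forwarded to SLURM; for instance (the partition name and values here are placeholders, and the resource names follow the SLURM executor plugin):

snakemake --executor slurm --jobs 5 --default-resources slurm_partition=compute mem_mb=8000 runtime=60 all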

Other useful arguments are:

  • --latency-wait: wait a given number of seconds for output files to appear after a job finishes; useful on networked filesystems where files can take a moment to become visible.

  • --rerun-incomplete: rerun any jobs whose previous run was interrupted and left incomplete output files.

  • --unlock: remove the lock that a crashed or killed run can leave on the working directory.
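On a shared or networked filesystem you might combine these, for example:

snakemake --jobs 5 --latency-wait 60 --rerun-incomplete all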