Snakemake Tutorial - GRIOTE · 09/20/2016 Snakemake Tutorial 9 Snakefle (1/2) import glob, ntpath...
Transcript of Snakemake Tutorial - GRIOTE · 09/20/2016 Snakemake Tutorial 9 Snakefle (1/2) import glob, ntpath...
09/20/2016 Snakemake Tutorial 1
Snakemake TutorialSnakemake Tutorial
09/20/2016 Snakemake Tutorial 2
SummarySummary● Context
● Common uses (4 steps)
● Specifc uses (2 steps)
● Still in progress
09/20/2016 Snakemake Tutorial 3
● An easy scripts scheduler in python
● Can be used on computing grid
● Can be used with virtual environments
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 4
Context Common uses Specific uses Still in progress
● What for?
For example, you need to compute the 1500 fastq files of the Lanner experiment in a succession of parallelised treatments.
09/20/2016 Snakemake Tutorial 5
Context Common uses Specific uses Still in progress
● Te pipeline we will buildRule 'unzip'
Rule 'miniQC'
Rule 'fastqc'
Fichier ERR1307260
Fichier ERR1307264
Rule 'unzip'
Rule 'all'
Rule 'fastqc'
... ...
Fichier ERR130726...
Rule 'htseq'
Rule 'tophat' Rule 'htseq'
Rule 'tophat'
... ...
09/20/2016 Snakemake Tutorial 6
● Prerequisites
conda installed
access to the BiRD grid
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 7
● 1st Step
– Virtual environment creation– Basic snakefile creation (first part)
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 8
● Environment (shell commands)
conda create -n myTutoEnv python=3.5 snakemake fastQC bioconductor-deseq2 -c bioconda
qlogin
source activate myTutoEnv
touch snakefile
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 9
● Snakefle (1/2)import glob, ntpath
inFiles = set() ## set of all input files (only file names)
## extract file names from pathsinPaths = glob.glob( "/sandbox/ylelievre/singlecell/tuto/fastq/*.fastq.gz" ) for p in inPaths: inFiles.add( os.path.basename(p).replace(".fastq.gz", "") )
##############
Context Common uses Specific uses Still in progress
Basename extractionsfrom the files to be
computed
09/20/2016 Snakemake Tutorial 10
● Snakefle (2/2)rule all: input: I1 = expand("fastQC/{filename}_fastqc.zip",filename=inFiles)
##############
rule fastqc: input: "/sandbox/ylelievre/singlecell/tuto/fastq/{afile}.fastq.gz" output: "fastQC/{afile}_fastqc.zip" shell: "fastqc -o fastQC {input}"
Context Common uses Specific uses Still in progress
The 1st rule (all) definesthe target of the pipeline
The subsequent rules define the differentsteps to reach the target (here just one step)
09/20/2016 Snakemake Tutorial 11
● Launch the pipeline (shell command)
snakemake -p -n -s mySnakefileName
Context Common uses Specific uses Still in progress
-p : print the commands executed
-n : dryrun
-s : if the Snakefile name is not 'Snakefile'
09/20/2016 Snakemake Tutorial 12
● What can be observed ?– The rules are executed sequentially (SGE not used)– The 5 rules fastqc are executed before the rule all– There is not a specific order between the fastqc rules
Context Common uses Specific uses Still in progress
Rule 'fastqc'
Rule 'fastqc'
Rule 'all'...
Fichier ERR1307260
Fichier ERR1307265
Fichier ERR130726...
09/20/2016 Snakemake Tutorial 13
● Preparation of the next step
– Delete a zip file (ERR1307264) from the directory ./fastQC
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 14
● 2nd Step
– Basic snakefile creation (second part)
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 15
● Snakefle updaterule all: input: I1 = expand("fastQC/{filename}_unzip_fastqc.done",filename=inFiles)
##############...
rule unzip_fastqc: input: zips = "fastQC/{zipfile}_fastqc.zip" output: touch("fastQC/{zipfile}_unzip_fastqc.done") shell: "unzip -o {input.zips} -d ./fastQC"
Context Common uses Specific uses Still in progress
Input modified
Rule unzip_fastqc added
09/20/2016 Snakemake Tutorial 16
● Launch the pipeline (shell command)
snakemake -p
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 17
● What can be observed ?– All the rules fastqc are not necessarily executed before
the unzip rules
Context Common uses Specific uses Still in progress
Rule 'unzip'
Rule 'unzip'Rule 'all'
...
Rule 'fastqc'
Fichier ERR1307260
Fichier ERR1307263
Fichier ERR130726...
Fichier ERR1307264 Rule 'unzip'
09/20/2016 Snakemake Tutorial 18
● Preparation of the next step
– Delete 2 zip files (ERR1307263 and ERR1307264) from the directory ./fastQC
– Delete 2 done files (ERR1307262 and ERR1307264) from the directory ./fastQC
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 19
● 3rd Step
– Basic snakefile creation (second part)– Missing files and snakemake
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 20
● Snakefle updaterule all: input: "QC/miniQC.done"
##############...
rule miniQC: input: I1 = expand("fastQC/{filename}_unzip_fastqc.done",filename=inFiles) output: touch("QC/miniQC.done") shell: "./miniQC.sh"
Context Common uses Specific uses Still in progress
Input modified
Rule miniQC added
09/20/2016 Snakemake Tutorial 21
● Launch the pipeline (shell command)
snakemake -p
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 22
● What can be observed ?
Context Common uses Specific uses Still in progress
Rule 'unzip' Rule 'miniQC'
Rule 'fastqc'
Fichier ERR1307260
Fichier ERR1307263
Fichier ERR1307262
Fichier ERR1307264 Rule 'unzip'
Fichier ERR1307261
Rule 'all'
09/20/2016 Snakemake Tutorial 23
● Preparation of the next step
– Delete 2 zip files (ERR1307263 and ERR1307264) from the directory ./fastQC
– Delete 2 done files (ERR1307263 and ERR1307264) from the directory ./fastQC
– Delete the miniQC.done file from the directory ./QC
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 24
● 4th Step
– Snakefile with a config file
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 25
● Snakefle with a confg fle (1/3)import glob, ntpath
configfile: "config_tuto.json"
inFiles = set() ## set of all input files (only file names)
## extract file names from pathsinPaths = glob.glob( config["FASTQ_PATH"]+"*.fastq.gz" )for p in inPaths: inFiles.add( os.path.basename(p).replace(".fastq.gz", "") )
##############
Context Common uses Specific uses Still in progress
Load the config file
Call of the fieldFASTQ_PATH
from the json file
09/20/2016 Snakemake Tutorial 26
● Snakefle with a confg fle (2/3)rule all: input: config["ANALYSIS_PATH"]+"QC/miniQC.done"
##############rule fastqc: input: config["FASTQ_PATH"]+"{afile}.fastq.gz" output: config["ANALYSIS_PATH"]+"fastQC/{afile}_fastqc.zip" params: analysis_path=config["ANALYSIS_PATH"] shell: "fastqc -o {params.analysis_path}fastQC {input}"
Context Common uses Specific uses Still in progress
params: is required in order toinject config field into the shell script
09/20/2016 Snakemake Tutorial 27
● Snakefle with a confg fle (3/3)rule unzip_fastqc: input: zips = config["ANALYSIS_PATH"]+"fastQC/{zipfile}_fastqc.zip" output: touch(config["ANALYSIS_PATH"]+"fastQC/{zipfile}_unzip_fastqc.done") params: analysis_path=config["ANALYSIS_PATH"] shell: "unzip -o {input.zips} -d {params.analysis_path}fastQC"
rule miniQC: input: expand(config["ANALYSIS_PATH"]+"fastQC/{filename}_unzip_fastqc.done",filename=inFiles) output: touch(config["ANALYSIS_PATH"]+"QC/miniQC.done") params: fastq_type=config["FASTQ_TYPE"] shell: "./miniQC.sh {params.fastq_type}"
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 28
● Launch the pipeline (shell command)
snakemake -p
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 29
● What can be observed ?
Context Common uses Specific uses Still in progress
Rule 'unzip'
Rule 'miniQC'
Rule 'fastqc'
Fichier ERR1307260
Fichier ERR1307263
Fichier ERR1307262
Fichier ERR1307264 Rule 'unzip'
Fichier ERR1307261
Rule 'all'
Rule 'fastqc'
09/20/2016 Snakemake Tutorial 30
● Preparation of the next step
– Delete all zip files from the directory ./fastQC– Delete all done files from the directory ./fastQC– Delete the miniQC.done file from the directory ./QC– Shell commands :
source deactivateexit
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 31
Context Common uses Specific uses Still in progress
● 5th Step
– Snakemake and computing grid– Snakemake and wrappers
09/20/2016 Snakemake Tutorial 32
Context Common uses Specific uses Still in progress
Use of the option –cluster to perform snakemake on SGE
● Environment (shell commands)
source activate myTutoEnvmkdir logssnakemake -p --cluster "qsub -o ./logs/ -e ./logs/" --jobs 50
-o and -e define the repository where SGE send the output and error
--jobs define the number of threads allocated
09/20/2016 Snakemake Tutorial 33
● What can be observed ? (1/2)If you were lucky, you would observe this!
Rule 'unzip'
Rule 'miniQC'
Rule 'fastqc'
Fichier ERR1307260
Fichier ERR1307264 Rule 'unzip'
Rule 'all'
Rule 'fastqc'
... ...Fichier ERR130726...
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 34
● What can be observed ? (2/2)Except for the persons who are doing this tutorial directly in their home directory (…), your jobs should be in error because your current working directory has not been passed to the computing grid.
More over, your virtual environment variables, also, have not been transmitted to the grid.
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 35
Context Common uses Specific uses Still in progress
We will use a wrapper in order to transmit the environment variables
● Environment (shell commands)
ctrl ctouch myGridWrapper.sh
09/20/2016 Snakemake Tutorial 36
Context Common uses Specific uses Still in progress
Transmit the virtual environment variables on the node
● myGridWrapper.sh (shell commands)
#$ -S /bin/bash#$ -cwd#$ -V
{exec_job}
Transmit the current working directory
Execute the script
09/20/2016 Snakemake Tutorial 37
Context Common uses Specific uses Still in progress
● Shell commands
snakemake -p --cluster "qsub -o ./logs/ -e ./logs/" --jobs 50 --jobscript myGridWrapper.sh
09/20/2016 Snakemake Tutorial 38
● What can be observed ?Missed again!
You should get an error like 'OSError: Missing files after 5 seconds:'
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 39
Context Common uses Specific uses Still in progress
● Shell commands (try again)ctrl csnakemake -p --latency-wait 60 --cluster "qsub -o ./logs/ -e
./logs/" --jobs 50 --jobscript myGridWrapper.sh
Snakemake needs some latency to properly work with a grid engine
09/20/2016 Snakemake Tutorial 40
● What can be observed ?Finally!!!!
Rule 'unzip'
Rule 'miniQC'
Rule 'fastqc'
Fichier ERR1307260
Fichier ERR1307264 Rule 'unzip'
Rule 'all'
Rule 'fastqc'
... ...Fichier ERR130726...
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 41
Context Common uses Specific uses Still in progress
● 6th Step
– Snakemake and virtual environmentS
09/20/2016 Snakemake Tutorial 42
● Let's play with virtual environments!Often, we need to use antinomic version of tools in a same pipeline.
One solution is the use of virtual environments.
Example : use snakmake with python 3 and tophat2 with python 2
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 43
Context Common uses Specific uses Still in progress
Preparation of a virtual environment which is compliant with tophat and htseqcount
● Environment (shell commands)
conda create --name myTophatTutoEnv python=2.7.8 tophat htseq pysam=0.8.3 -c bioconda
09/20/2016 Snakemake Tutorial 44
● Snakefle with a confg fle (1/2)rule all: input: I1 = config["ANALYSIS_PATH"]+"QC/miniQC.done", I2 = expand(config["ANALYSIS_PATH"]+"counts/{filename}.counts",filename=inFiles)
rule tophat2_SE: input: config["FASTQ_PATH"]+"{bamname}.fastq.gz" output: config["ANALYSIS_PATH"]+"BAM/{bamname}/accepted_hits.bam" params: cpu=config["TOPHAT_CPU"], gene=config["GENE_REFERENCE"], bowtie=config["BOWTIE_INDEX"], analysis_path=config["ANALYSIS_PATH"] shell: """ source activate myTophatTutoEnv tophat2 -p {params.cpu} -G {params.gene} -o {params.analysis_path}BAM/{wildcards.bamname} {params.bowtie} {input} """
Update rule all
Context Common uses Specific uses Still in progress
Add the following rules
09/20/2016 Snakemake Tutorial 45
● Snakefle with a confg fle (2/2)
rule htseqcount: input: config["ANALYSIS_PATH"]+"BAM/{htseqname}/accepted_hits.bam" output: config["ANALYSIS_PATH"]+"counts/{htseqname}.counts" params: gene=config["GENE_REFERENCE"], analysis_path=config["ANALYSIS_PATH"] shell: """ source activate myTophatTutoEnv htseq-count -f bam -r pos -i gene_name -s no {input} {params.gene} > {params.analysis_path}counts/{wildcards.htseqname}.counts """
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 46
● Preparation of the next step
– Delete all zip files from the directory ./fastQC– Delete all done files from the directory ./fastQC– Delete the miniQC.done file from the directory ./QC
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 47
Context Common uses Specific uses Still in progress
● Shell commands
snakemake -p --latency-wait 60 --cluster "qsub -o ./logs/ -e ./logs/" --jobs 50 --jobscript myGridWrapper.sh
09/20/2016 Snakemake Tutorial 48
● What can be observed ?Rule 'unzip'
Rule 'miniQC'
Rule 'fastqc'
Fichier ERR1307260
Fichier ERR1307264
Rule 'unzip'
Rule 'all'
Rule 'fastqc'
... ...
Fichier ERR130726...
Context Common uses Specific uses Still in progress
Rule 'htseq'
Rule 'tophat' Rule 'htseq'
Rule 'tophat'
... ...
09/20/2016 Snakemake Tutorial 49
● A young technology
● Problem with the NFS fle system (solved with BeeGFS)
Context Common uses Specific uses Still in progress
09/20/2016 Snakemake Tutorial 50
Tank you for your perseverance!