/ - Diff - BIOI2 Formations - Forge Logicielle I2BC


Revision 387c5cc5

Added by Chloé QUIGNOT about 1 year ago

ID 387c5cc55ec074d6673b009468bd206bea931bea
Parent 428015d2
Child d1a8897a

add a few comments to readme for advanced demo, add lambda function to 2 rules, specify full paths in config

     rule fastqc:
         input:
             "data/{sample}.fastq.gz",
             lambda wc: config["samples"][wc.sample]
         output:
             report(
                 multiext("result/fastqc/{sample}", "_fastqc.zip", "_fastqc.html"),
-...
     rule bowtie2:
         input:
             fq="data/{sample}.fastq.gz",
             fq=lambda wc: config["samples"][wc.sample],
             idxFile=rules.bowtieIndex.output,
         output:
             bam=pipe("result/bowtie/{sample}.sam"),

     samples:
         SRR3099585_chr18: SRR3099585_chr18.fastq.gz
         SRR3099586_chr18: SRR3099586_chr18.fastq.gz
         SRR3099587_chr18: SRR3099587_chr18.fastq.gz
         SRR3105697_chr18: SRR3105697_chr18.fastq.gz
         SRR3105698_chr18: SRR3105698_chr18.fastq.gz
         SRR3105699_chr18: SRR3105699_chr18.fastq.gz
         SRR3099585_chr18: Data/SRR3099585_chr18.fastq.gz
         SRR3099586_chr18: Data/SRR3099586_chr18.fastq.gz
         SRR3099587_chr18: Data/SRR3099587_chr18.fastq.gz
         SRR3105697_chr18: Data/SRR3105697_chr18.fastq.gz
         SRR3105698_chr18: Data/SRR3105698_chr18.fastq.gz
         SRR3105699_chr18: Data/SRR3105699_chr18.fastq.gz
     references:
         annot: "references/O.tauri_annotation.gff"
         fasta: "references/O.tauri_genome.fna"
         annot: "Data/O.tauri_annotation.gff"
         fasta: "Data/O.tauri_genome.fna"
     bowtie:
         idx: "references/bowtie2index/Otauri"
         idx: "bowtie2index/Otauri"
         idxthreads: 1
         extra: ""
         alignthreads: 8

     # Using the Snakefile
     # Snakemake pipeline presentation
     To use this Snakemake pipeline (tested with snakemake version 8.4.6) on the I2BC cluster:
     The Snakefile was reformatted with the snakefmt tool (v0.10.0).
     The code tree:
     ## The code
     The pipeline's `code` folder contains the following file tree:
     ```text
     ├── Snakefile
     ├── profile
     │   └── config.yaml
-...
         ├── fastqc.rst
         ├── featureCounts.rst
         └── multiqc.rst
     ```
     The Snakefile is the file that contains all of Snakemake's rules. The configuration file `profile/config.yaml`
     is a way of specifying options that you would otherwise add to the snakemake command line. The `report` folder
     contains a set of hand-written report templates for each tool or method that was used and that will later
     be incorporated into a final report file generated by Snakemake.
     NB: this Snakefile was reformatted using the snakefmt tool (v0.10.0) to comply with the commonly
     accepted (and encouraged) Snakemake formatting standards.
     ## The required data (to download)
     This pipeline runs on single-end bulk RNA-seq data and outputs quality reports on the samples provided as well as tables of
     gene expression counts per sample.
     Example input files can be found under [this link](https://doi.org/10.5281/zenodo.8340293) (same as in Exercises 1A, 1B and 1C, see [this page](https://bioi2.i2bc.paris-saclay.fr/training/snakemake/exercises)):
     ```text
     Data/
     ├── O.tauri_annotation.gff
     ├── O.tauri_genome.fna
     ├── SRR3099585_chr18.fastq.gz
     ├── SRR3099586_chr18.fastq.gz
     ├── SRR3099587_chr18.fastq.gz
     ├── SRR3105697_chr18.fastq.gz
     ├── SRR3105698_chr18.fastq.gz
     └── SRR3105699_chr18.fastq.gz
     ```
     They contain 6 fastq files (chromosome 18 from SRR309958[5,6,7] and SRR310569[7,8,9]), as well as
     the genome fasta of their reference species *O. tauri* and its annotation.
     # Usage
     **Requirements:**
     - tools: snakemake, fastqc, bowtie, featureCounts, multiqc
     - data: fastq files and their corresponding reference genome
     **Before you run the pipeline:**
     - edit the line `cd /path/to/the/project/` in the `runRnaseq.sh` file so that it points to your actual project folder, e.g. `/path/to/snakemake_examples/exercise2_advanced/`
     - edit the configuration file `configfile.yaml` so that the paths to the samples and reference files are correct and Snakemake can find them (a good habit is to use absolute paths)
     For example, if my working directories look like this:
     ```text
     /home/john.doe/snakemake_tutorial/Data/
     ├── O.tauri_annotation.gff
     ├── O.tauri_genome.fna
     ├── SRR3099585_chr18.fastq.gz
     ├── SRR3099586_chr18.fastq.gz
     ├── SRR3099587_chr18.fastq.gz
     ├── SRR3105697_chr18.fastq.gz
     ├── SRR3105698_chr18.fastq.gz
     └── SRR3105699_chr18.fastq.gz
     ```
     and
     ```text
     /home/john.doe/snakemake_examples/exercise2_advanced/code/
     ├── Snakefile
     ├── profile
     │   └── config.yaml
     └── report
         ├── alignPipeline.rst
         ├── bowtie2.rst
         ├── fastqc.rst
         ├── featureCounts.rst
         └── multiqc.rst
     ```
     Then my configuration file should look like this (for example):
     ```text
     samples:
         SRR3099585_chr18: /home/john.doe/snakemake_tutorial/Data/SRR3099585_chr18.fastq.gz
         SRR3099586_chr18: /home/john.doe/snakemake_tutorial/Data/SRR3099586_chr18.fastq.gz
         SRR3099587_chr18: /home/john.doe/snakemake_tutorial/Data/SRR3099587_chr18.fastq.gz
         SRR3105697_chr18: /home/john.doe/snakemake_tutorial/Data/SRR3105697_chr18.fastq.gz
         SRR3105698_chr18: /home/john.doe/snakemake_tutorial/Data/SRR3105698_chr18.fastq.gz
         SRR3105699_chr18: /home/john.doe/snakemake_tutorial/Data/SRR3105699_chr18.fastq.gz
     references:
         annot: "/home/john.doe/snakemake_tutorial/Data/O.tauri_annotation.gff"
         fasta: "/home/john.doe/snakemake_tutorial/Data/O.tauri_genome.fna"
     bowtie:
         idx: "/home/john.doe/snakemake_examples/exercise2_advanced/result/bowtie2index/Otauri"
         idxthreads: 1
         extra: ""
         alignthreads: 8
     featureCounts:
         extra: '-t "gene" -g "ID" -s "2"'
     tempDir: "/tmp"
     ```
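The advice about absolute paths can be checked before submitting the job. The sketch below is only an illustration: the `config` dict is a hypothetical excerpt of `configfile.yaml` written directly in Python (so that no YAML library is needed), and `relative_paths` is a helper name invented here, not part of the pipeline.

```python
import os.path

# Hypothetical excerpt of configfile.yaml, written as a Python dict.
# One sample path is left relative on purpose to show what gets flagged.
config = {
    "samples": {
        "SRR3099585_chr18": "/home/john.doe/snakemake_tutorial/Data/SRR3099585_chr18.fastq.gz",
        "SRR3099586_chr18": "Data/SRR3099586_chr18.fastq.gz",
    },
    "references": {
        "annot": "/home/john.doe/snakemake_tutorial/Data/O.tauri_annotation.gff",
        "fasta": "/home/john.doe/snakemake_tutorial/Data/O.tauri_genome.fna",
    },
}


def relative_paths(cfg):
    """Return 'section.key' labels for every path that is not absolute."""
    found = []
    for section in ("samples", "references"):
        for key, path in cfg.get(section, {}).items():
            if not os.path.isabs(path):
                found.append(f"{section}.{key}")
    return found


print(relative_paths(config))
```

On a Unix-like system, any entry listed here would be resolved relative to the directory from which Snakemake is launched, which is the situation absolute paths avoid.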
     **Running the pipeline:**
     Outputs will be generated in the current working directory (i.e. the directory in which you
     are when you run the Snakemake command), within a `result` folder.
     To use this Snakemake pipeline (tested with snakemake version 8.4.6) on the I2BC cluster, submit the small bash script provided:
     ```bash
     qsub -V -l walltime=4:00:00 -l select=1:ncpus=1:mem=100mb runRnaseq.sh
     ```
     If you have a look at the runRnaseq.sh script, you will see that we run snakemake twice: the first time to run the pipeline and
     generate the output files; the second time to create the final report using the `--report myFinalReport.html` option.
     # Output
     ```text
     result/
     ├── benchmark
     ├── bowtie
     ├── fastqc
     ├── featureCounts
     ├── logs
     ├── multiqc_data
     └── multiqc_report.html
     ```
     The Snakefile defines all the rules, the profile contains options that are often used on the command line, and the report folder contains the information that will be written to the report.
     To run this code you also need data. They are available on Zenodo (https://zenodo.org/records/3997237). They are single-end RNA-seq data.
     - Put the data in a directory: data
     - The annotations are in the references folder, but this name can be changed through the configuration file (configfile.yaml)
     - The results will be written to the directory: result
     - In the runRnaseq.sh file, edit the path to the project folder.
     Here is an example layout of the data folder, which also shows the results:
     ├── references
     │   ├── O.tauri_annotation.gff
     │   ├── O.tauri_genome.fna
     │   └── bowtie2index
     ├── configfile.yaml
     ├── data
     │   ├── SRR3099585_chr18.fastq.gz
     │   ├── SRR3099586_chr18.fastq.gz
     │   ├── SRR3099587_chr18.fastq.gz
     │   ├── SRR3105697_chr18.fastq.gz
     │   ├── SRR3105698_chr18.fastq.gz
     │   └── SRR3105699_chr18.fastq.gz
     ├── result
     │   ├── benchmark
     │   ├── bowtie
     │   ├── fastqc
     │   ├── featureCounts
     │   ├── logs
     │   ├── multiqc_data
     │   └── multiqc_report.html
     └── runSnakemake.sh
     The command line: qsub -V -l walltime=4:00:00 -l select=1:ncpus=1:mem=100mb runRnaseq.sh
     To obtain the report, rerun the same snakemake command with the added option: --report myFinalReport.html
     # Functions used
     ## General function
     ## General functions
     - envmodules: specifies the environments used by each rule (https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules)
     - wildcard_constraints: checks the values that wildcards can take (https://snakemake.readthedocs.io/en/stable/tutorial/additional_features.html#constraining-wildcards)
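As a rough illustration of what a wildcard constraint does, the sketch below tests candidate values against a regular expression. The pattern `SRR\d+_chr\d+` is an assumption chosen to fit the sample names in this example; the constraint actually used in the Snakefile is not shown here, and `satisfies` is a hypothetical helper.

```python
import re

# Assumed constraint for the `sample` wildcard (not taken from the Snakefile).
SAMPLE_CONSTRAINT = r"SRR\d+_chr\d+"


def satisfies(value, constraint=SAMPLE_CONSTRAINT):
    """Return True if the whole value matches the constraint regex."""
    return re.fullmatch(constraint, value) is not None


for candidate in ("SRR3099585_chr18", "SRR3099585_chr18_fastqc", "notes"):
    print(candidate, satisfies(candidate))
```

Only the first candidate passes, so a constraint of this kind would keep the `sample` wildcard from absorbing suffixes such as `_fastqc`.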
     ## rule fastqc
     - Use of a function instead of a file name for the input (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#input-functions)
     - Use of the report function to create a report after the script has run (https://snakemake.readthedocs.io/en/stable/snakefiles/reporting.html#reports)
     - multiext: a variant of the expand function (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-multiext-function)
-...
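To make the `multiext` call in rule fastqc more concrete, here is a plain-Python stand-in that produces the same kind of list. `multiext_sketch` is a name invented for this illustration and is not Snakemake's implementation; it only shows the idea of one prefix combined with several extensions.

```python
def multiext_sketch(prefix, *extensions):
    """Plain-Python stand-in: return one path per extension, sharing a prefix."""
    return [prefix + ext for ext in extensions]


# Same arguments as the multiext(...) call in rule fastqc.
paths = multiext_sketch("result/fastqc/{sample}", "_fastqc.zip", "_fastqc.html")
print(paths)
```

The `{sample}` wildcard is left untouched in the resulting paths, which is what lets the rule still apply to any sample.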
     ## rule bowtie2 and samtoolsSort
     - Use of a function instead of a file name for the input (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#input-functions)
     - pipe: replaces the use of a shell pipe and splits the different elements of the command line into several rules (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#piped-output)
     - config.get: sets a default value if it is missing from the configuration file
     - config.get: sets a default value if it is missing from the configuration file (NB: this is a Python method that works here because config is a dictionary)
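The input function and `config.get` can both be tried outside Snakemake, since they are ordinary Python. In this sketch the `config` dict is a hypothetical excerpt of the configuration, `SimpleNamespace` stands in for the wildcards object that Snakemake would pass to the function, and the `"/tmp"` default is an assumption made for the example.

```python
from types import SimpleNamespace

# Hypothetical config excerpt; `tempDir` is left out on purpose.
config = {
    "samples": {
        "SRR3099585_chr18": "Data/SRR3099585_chr18.fastq.gz",
    },
}

# Same shape as the input function added to the rules:
# it receives the wildcards and returns the matching path from the config.
get_fastq = lambda wc: config["samples"][wc.sample]

# Stand-in for the wildcards object of a job whose sample is SRR3099585_chr18.
wc = SimpleNamespace(sample="SRR3099585_chr18")
print(get_fastq(wc))

# dict.get returns the second argument when the key is absent.
temp_dir = config.get("tempDir", "/tmp")
print(temp_dir)
```

Because the lookup is keyed on the sample name, the fastq files no longer need to follow a fixed `data/{sample}.fastq.gz` layout: the configuration file decides where each one lives.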
     ## rule extractCounts
-...
     ## rule matrixCounts
     - You can use as many wildcards as you like. This rule uses 2 wildcards.
     - You can use as many wildcards as you like. This rule uses 2 wildcards (sample and colExtra).
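The following sketch shows, in plain Python, how two wildcards can be recovered from a single file name. The pattern `result/featureCounts/{sample}.{colExtra}.txt` and the file name are hypothetical, since the real output pattern of rule matrixCounts is not shown here; `pattern_to_regex` is a helper written for this illustration only.

```python
import re

# Hypothetical output pattern with two wildcards.
PATTERN = "result/featureCounts/{sample}.{colExtra}.txt"


def pattern_to_regex(pattern):
    """Turn each {name} into a named group matching any non-empty text."""
    # re.split with a capture group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex)


match = pattern_to_regex(PATTERN).fullmatch(
    "result/featureCounts/SRR3099585_chr18.counts.txt"
)
print(match.groupdict())
```

Each named group plays the role of one wildcard, so a single target file name yields a value for both `sample` and `colExtra`.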
