Revision 428015d2

Added by Chloe Quignot 7 months ago

add README and exercise 0 improved after 1C

README.md
# Examples and solutions of the Snakemake BIOI2 training session
Instructions are on the BIOI2 website: https://bioi2.i2bc.paris-saclay.fr/training/snakemake/
To download this repository, open a terminal and type:
```bash
git clone https://forge.i2bc.paris-saclay.fr/git/bioi2_formations/snakemake_examples.git
```
## Organisation of this repository
```text
├── README.md
├── exercise0
├── exercise0_improved_after_1A
├── exercise0_improved_after_1B
├── exercise0_improved_after_1C
└── demo_advanced
```
### Exercise 0
The example Snakemake pipeline to execute in [Exercise 0](https://bioi2.i2bc.paris-saclay.fr/training/snakemake/exercises/exercise-0-objective) is in the `exercise0` folder.
The **exercise0_improved_after_1X** folders are examples of improvements to the initial Snakefile after applying what you've learnt in Exercises 1A, 1B and 1C. We advise you to have a look at them once you've finished the aforementioned exercises.
### Exercise 2
The `demo_advanced` folder is an example solution for Exercise 2, illustrating several different syntax constructs that you could encounter in Snakefiles.
## Executing the Snakefiles
If you're in the folder that contains the Snakefile (and if the Snakefile is named Snakefile), you can just type:
```bash
snakemake --cores 1
```
If you'd like to specify the Snakefile on the command line (because it's not in your current directory or because it's named differently):
```bash
snakemake --cores 1 -s /path/to/snakefile.smk
```
exercise0_improved_after_1C/Snakefile
import yaml

# Load the list of UniProt accessions from samples.yaml
with open('samples.yaml', 'r') as file:
    content = yaml.safe_load(file)
samples = content['samples']

rule targets:
    input:
        expand("fasta/{sample}.fasta", sample=samples),
        "fusionFasta/allSequences.fasta",
        "mafft/mafft_res.fasta",

# Update 1: add the threads directive to all rules specifying
#           the maximum number of threads/CPUs/processors to
#           use per rule
# Update 2: add the resources directive to all rules specifying
#           the maximum amount of memory, walltime etc. to
#           use per rule
rule loadData:
    output:
        "fasta/{sample}.fasta",
    params:
        dirFasta = "fasta",
    log:
        stdout = "logs/{sample}_wget.stdout",
        stderr = "logs/{sample}_wget.stderr",
    threads: 1
    resources:
        mem="1gb",
        time_min="00:05:00",
    shell:
        """
        wget --output-file {log.stderr} \
             --directory-prefix {params.dirFasta} \
             https://www.uniprot.org/uniprot/{wildcards.sample}.fasta > {log.stdout}
        """

rule fusionFasta:
    input:
        expand("fasta/{sample}.fasta", sample=samples),
    output:
        "fusionFasta/allSequences.fasta",
    log:
        "logs/fusionData.stderr",
    threads: 1
    resources:
        mem="1gb",
        time_min="00:05:00",
    shell:
        """
        cat {input} > {output} 2> {log}
        """

# Update 3: add the envmodules directive to rules that use
#           non-standard tools such as mafft so that Snakemake
#           automatically "activates" the tool on the cluster
#           NB: use "module avail" to see the right syntax
rule mafft:
    input:
        "fusionFasta/allSequences.fasta",
    output:
        "mafft/mafft_res.fasta",
    log:
        "logs/whichMafft.txt",
    threads: 1
    resources:
        mem="1gb",
        time_min="00:05:00",
    envmodules:
        "nodes/mafft-7.475"
    shell:
        """
        mafft {input} > {output} 2> {log}
        """

# Update 4: add a profile configuration file (profile/config.yaml)
#           to specify options there instead of on the command
#           line at execution
exercise0_improved_after_1C/profile/config.yaml
# cluster-specific options (for PBSpro environment):
jobs: 6
executor: cluster-generic
cluster-generic-submit-cmd: "qsub -l ncpus={threads} -l mem={resources.mem} -l walltime={resources.time_min}"
cluster-generic-cancel-cmd: "qdel"
# set default resources for each job (1 CPU, 1 Gb, 2 h) if not specified otherwise:
default-resources: [threads=1, mem="1Gb", time_min="02:00:00"]
# software option:
software-deployment-method: env-modules
# to avoid typing -p every time:
printshellcmds: true
# do not stop the whole workflow when a single job fails:
keep-going: true
# retry failed jobs up to 3 times:
restart-times: 3
# wait up to 180 s for output files in case of filesystem latency on the cluster:
latency-wait: 180
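For each submitted job, Snakemake fills the `{threads}` and `{resources.*}` placeholders in `cluster-generic-submit-cmd` with that job's values. A hypothetical illustration of the substitution for one `loadData` job (values taken from the rule's `threads` and `resources` directives):

```shell
# Illustration only: emulate the placeholder substitution Snakemake
# performs on the submit command for a job with threads=1,
# mem=1gb and time_min=00:05:00.
threads=1
mem="1gb"
time_min="00:05:00"
echo "qsub -l ncpus=${threads} -l mem=${mem} -l walltime=${time_min}"
# → qsub -l ncpus=1 -l mem=1gb -l walltime=00:05:00
```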
exercise0_improved_after_1C/readme_runSnake.txt
To run the pipeline, connect to a cluster node, then:
- load the Snakemake environment:
    module load snakemake/snakemake-8.4.6
    module load nodes/mafft-7.475
- run the program: move into this folder and type:
    snakemake --cores 1
exercise0_improved_after_1C/samples.yaml
samples: ["P01325", "P01308"]
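The Snakefile reads this file with `yaml.safe_load`. A quick sketch of what that yields for the content above (assuming PyYAML is installed):

```python
import yaml

# Parse the same YAML content the Snakefile reads from samples.yaml.
content = yaml.safe_load('samples: ["P01325", "P01308"]')
samples = content["samples"]
print(samples)
# → ['P01325', 'P01308']
```

Adding an accession to the `samples` list is therefore all that is needed to make the pipeline fetch and align another sequence.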