Pipelines¶
Download and index genome files¶
This pipeline is for the indexing of genomes once they have been loaded into the VRE. It indexes each new genome with Bowtie2, BWA and GEM. These indexes can then be used by the other pipelines.
Running from the command line¶
Parameters¶
- taxon_id : int
- Species taxonomic ID
- assembly : str
- Genomic assembly ID
- genome : str
- Location of the genome FASTA file
Returns¶
- Bowtie2 index files
- BWA index files
- GEM index file
Example¶
When running the pipeline on a local machine without COMPSs:
    python process_genome.py \
        --config tests/json/config_genome_indexer.json \
        --in_metadata tests/json/input_genome_indexer.json \
        --out_metadata tests/json/output_genome_indexer.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_genome.py \
        --config tests/json/config_genome_indexer.json \
        --in_metadata tests/json/input_genome_indexer.json \
        --out_metadata tests/json/output_genome_indexer.json
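Both invocations above pass the same three JSON options and differ only in the launcher and the `--local` flag. A minimal sketch of assembling them as argument lists follows. The helper names (`pipeline_args`, `local_command`, `compss_command`) are hypothetical and not part of the pipeline code, and the `config_<stem>.json` naming pattern is assumed from the genome indexer example only; other pipelines on this page use different file names.

```python
"""Sketch: build the local and COMPSs command lines as argument lists."""


def pipeline_args(stem, json_dir="tests/json"):
    """Return the three JSON options shared by both invocations.

    Assumes the config_<stem>/input_<stem>/output_<stem> naming used by
    the genome indexer example; this is not a rule for every pipeline.
    """
    return [
        "--config", "{}/config_{}.json".format(json_dir, stem),
        "--in_metadata", "{}/input_{}.json".format(json_dir, stem),
        "--out_metadata", "{}/output_{}.json".format(json_dir, stem),
    ]


def local_command(script, stem):
    """Command for a local run without COMPSs (note the --local flag)."""
    return ["python", script] + pipeline_args(stem) + ["--local"]


def compss_command(script, stem, site_packages, library_path):
    """Command for a run through runcompss, using the flags shown above."""
    return [
        "runcompss",
        "--lang=python",
        "--library_path=" + library_path,
        "--pythonpath=" + site_packages,
        "--log_level=debug",
        script,
    ] + pipeline_args(stem)


local = local_command("process_genome.py", "genome_indexer")
compss = compss_command(
    "process_genome.py",
    "genome_indexer",
    site_packages="/<pyenv_virtenv_dir>/lib/python2.7/site-packages/",  # placeholder kept from the example
    library_path="${HOME}/bin",  # expand with os.path.expandvars before use without a shell
)
print(" ".join(local))
print(" ".join(compss))
```

Keeping the commands as lists makes it straightforward to hand them to `subprocess.run` once the placeholder paths have been filled in.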
Methods¶
class process_genome.process_genome(configuration=None)¶
Workflow to download and pre-index a given genome.
run(input_files, metadata, output_files)¶
Main run function for the indexing of genome assembly FASTA files. The pipeline builds the Bowtie2, BWA and GEM indexes, ready for use in pipelines that rely on alignment.
Parameters: - input_files (dict) –
- genome : str
- List of file locations
- metadata (dict) –
- genome : dict
- Required meta data
- output_files (dict) –
- bwa_index : str
- Location of the BWA index archive files
- bwt_index : str
- Location of the Bowtie2 index archive file
- gem_index : str
- Location of the GEM index file
- genome_gem : str
- Location of the FASTA file generated for the GEM indexing step
Returns: - outputfiles (dict) – List of locations for the output index files
- output_metadata (dict) – Metadata about each of the files
BioBamBam Alignment Filtering¶
This pipeline filters sequencing artifacts from aligned reads.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- filtered : file
- Filtered bam file
Example¶
REQUIREMENT - Needs an aligned bam file
When running the pipeline on a local machine without COMPSs:
    python process_biobambam.py \
        --config tests/json/config_biobambam.json \
        --in_metadata tests/json/input_biobambam.json \
        --out_metadata tests/json/output_biobambam.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_biobambam.py \
        --config tests/json/config_biobambam.json \
        --in_metadata tests/json/input_biobambam.json \
        --out_metadata tests/json/output_biobambam.json
Methods¶
class process_biobambam.process_biobambam(configuration=None)¶
Functions for filtering FastQ alignments with BioBamBam.
run(input_files, metadata, output_files)¶
Main run function for filtering FastQ aligned reads using BioBamBam.
Parameters: - input_files (dict) –
Location of the initial input files required by the workflow
- bam : str
- Location of BAM file
- metadata (dict) –
Input file meta data associated with their roles
bam : str
- output_files (dict) –
Output file locations
filtered : str
Returns: output_files (dict) – Output file locations associated with their roles, for the output
- filtered : str
Filtered version of the bam file
output_metadata (dict) – Output metadata for the associated files in output_files
filtered : Metadata
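The `run` methods on this page take an `input_files` dict and a `metadata` dict that are both keyed by role (here, `bam`). A small sketch of checking that the two dicts agree before calling a pipeline follows. The function name `missing_metadata_roles`, the file path, and the metadata value are hypothetical placeholders, not part of the pipeline API.

```python
"""Sketch: confirm every input role has a matching metadata entry."""


def missing_metadata_roles(input_files, metadata):
    """Return the input roles that have no metadata entry, sorted by name."""
    return sorted(set(input_files) - set(metadata))


# Hypothetical values for the single role used by the filtering pipeline.
input_files = {"bam": "/path/to/aligned.bam"}
metadata = {"bam": "placeholder metadata for the bam role"}

missing = missing_metadata_roles(input_files, metadata)
if missing:
    raise ValueError("No metadata for input roles: " + ", ".join(missing))
print("All input roles have metadata")
```

The same check applies unchanged to the other pipelines, since they all associate files and metadata through role names.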
Bowtie2 Alignment¶
This pipeline aligns FASTQ reads to a given indexed genome. The pipeline can handle single-end and paired-end reads.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bam : file
- Aligned reads in bam file
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine without COMPSs:
    python process_align_bowtie.py \
        --config tests/json/config_bowtie2.json \
        --in_metadata tests/json/input_bowtie2.json \
        --out_metadata tests/json/output_bowtie2.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_align_bowtie.py \
        --config tests/json/config_bowtie2_single.json \
        --in_metadata tests/json/input_bowtie2_single_metadata.json \
        --out_metadata tests/json/output_bowtie2_single.json

    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_align_bowtie.py \
        --config tests/json/config_bowtie2_paired.json \
        --in_metadata tests/json/input_bowtie2_paired_metadata.json \
        --out_metadata tests/json/output_bowtie2_paired.json
Methods¶
class process_align_bowtie.process_bowtie(configuration=None)¶
Functions for aligning FastQ files with Bowtie2.
run(input_files, metadata, output_files)¶
Main run function for aligning FastQ reads with Bowtie2.
Currently this can only handle a single data file and a single background file.
Parameters: - input_files (dict) –
Location of the initial input files required by the workflow
- genome : str
- Genome FASTA file
- index : str
- Location of the Bowtie2 archived index files
- loc : str
- Location of the FASTQ reads files
- fastq2 : str
- [OPTIONAL] Location of the FASTQ reads file for paired end data
- metadata (dict) –
Input file meta data associated with their roles
genome : str index : str loc : str fastq2 : str
- output_files (dict) –
Output file locations
- bam : str
- Output bam file location
Returns: output_files (dict) – Output file locations associated with their roles, for the output
- bam : str
Aligned FASTQ short read file locations
output_metadata (dict) – Output metadata for the associated files in output_files
bam : Metadata
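Because `fastq2` is optional, the same `input_files` structure covers single-end and paired-end runs. The following sketch shows one way to build the dict for both cases. The helper names and file paths are hypothetical; only the role names (`genome`, `index`, `loc`, `fastq2`) come from the parameter list above.

```python
"""Sketch: build Bowtie2 input_files for single-end or paired-end reads."""


def bowtie2_input_files(genome, index, loc, fastq2=None):
    """Return the input_files dict, adding fastq2 only for paired-end data."""
    input_files = {"genome": genome, "index": index, "loc": loc}
    if fastq2 is not None:
        input_files["fastq2"] = fastq2
    return input_files


def is_paired_end(input_files):
    """A run is treated as paired-end when the optional fastq2 role is present."""
    return "fastq2" in input_files


# Hypothetical file locations.
single = bowtie2_input_files("genome.fa", "genome.bt2.tar.gz", "reads_1.fastq")
paired = bowtie2_input_files(
    "genome.fa", "genome.bt2.tar.gz", "reads_1.fastq", fastq2="reads_2.fastq"
)
print(is_paired_end(single), is_paired_end(paired))
```

Leaving the `fastq2` key out entirely, rather than setting it to `None`, is a design choice of this sketch; whether the pipeline accepts a `None` value is not stated in the documentation.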
BSgenome Builder¶
This pipeline builds a BSgenome R package from a genome assembly FASTA file.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bsgenome : file
- BSgenome index
- genome_2bit : file
- Compressed representation of the genome required for generating the index
- chrom_size : file
- Location of the chrom.size file
- seed_file : file
- Configuration file for generating the BSgenome R package
Example¶
When running the pipeline on a local machine without COMPSs:
    python process_bsgenome.py \
        --config tests/json/config_bsgenome.json \
        --in_metadata tests/json/input_bsgenome.json \
        --out_metadata tests/json/output_bsgenome.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_bsgenome.py \
        --config tests/json/config_bsgenome.json \
        --in_metadata tests/json/input_bsgenome.json \
        --out_metadata tests/json/output_bsgenome.json
Methods¶
class process_bsgenome.process_bsgenome(configuration=None)¶
Workflow to build the BSgenome R package for a given genome.
run(input_files, metadata, output_files)¶
Main run function for building the BSgenome R package from a genome assembly FASTA file.
Parameters: - input_files (dict) –
- genome : str
- Location of the FASTA input file
- metadata (dict) –
- genome : dict
- Required meta data
- output_files (dict) –
- BSgenome : str
- Location of the BSgenome R package
Returns: - outputfiles (dict) – List of locations for the output index files
- output_metadata (dict) – Metadata about each of the files
BS Seeker2 Indexer¶
This pipeline generates the genome index files required by BS Seeker2.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- index : file
- BS Seeker2 index
Example¶
When running the pipeline on a local machine without COMPSs:
    python process_bs_seeker_index.py \
        --config tests/json/config_wgbs_index.json \
        --in_metadata tests/json/input_wgbs_index_metadata.json \
        --out_metadata tests/json/output_wgbs_index.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_bs_seeker_index.py \
        --config tests/json/config_wgbs_index.json \
        --in_metadata tests/json/input_wgbs_index_metadata.json \
        --out_metadata tests/json/output_wgbs_index.json
Methods¶
class process_bs_seeker_index.process_bs_seeker_index(configuration=None)¶
Functions for generating the index files required by BS Seeker2.
run(input_files, metadata, output_files)¶
Main run function for generating the index files required by BS Seeker2.
Parameters: - input_files (dict) –
List of strings for the locations of files. These should include:
- genome_fa : str
- Genome assembly in FASTA
- metadata (dict) –
Input file meta data associated with their roles
genome : str
- output_files (dict) –
Output file locations
- index : str
- Output BS Seeker2 index file location
Returns: output_files – Output file locations associated with their roles, for the output
index : str
Return type: dict
BS Seeker2 Aligner¶
This pipeline aligns FASTQ paired end reads using BS Seeker2 and Bowtie2.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bam : file
- Aligned Bam file
- bai : file
- Aligned Bam index file
Example¶
When running the pipeline on a local machine without COMPSs:
    python process_bs_seeker_aligner.py \
        --config tests/json/config_wgbs_align.json \
        --in_metadata tests/json/input_wgbs_align_metadata.json \
        --out_metadata tests/json/output_wgbs_align.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_bs_seeker_aligner.py \
        --config tests/json/config_wgbs_align.json \
        --in_metadata tests/json/input_wgbs_align_metadata.json \
        --out_metadata tests/json/output_wgbs_align.json
Methods¶
class process_bs_seeker_aligner.process_bs_seeker_aligner(configuration=None)¶
Functions for downloading and processing whole genome bisulfite sequencing (WGBS) files. Files are filtered, aligned and analysed for points of methylation.
run(input_files, metadata, output_files)¶
This pipeline processes paired-end FASTQ files to identify methylated regions within the genome.
Parameters: - input_files (dict) –
List of strings for the locations of files. These should include:
- genome_fa : str
- Genome assembly in FASTA
- fastq1 : str
- Location for the first filtered FASTQ file for single or paired end reads
- fastq2 : str
- Location for the second filtered FASTQ file if paired end reads
- index : str
- Location of the index file
- metadata (dict) –
Input file meta data associated with their roles
genome_fa : dict fastq1 : dict fastq2 : dict index : dict
- output_files (dict) – bam : str bai : str
Returns: bam|bai – Location of the alignment bam file and the associated index
Return type: str
Bisulfite Sequencing Filter¶
This pipeline processes FASTQ files to filter out duplicate reads.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- fastq1_filtered|fastq1_filtered : str
- Locations of the filtered FASTQ files from which alignments were made
- fastq2_filtered|fastq2_filtered : str
- Locations of the filtered FASTQ files from which alignments were made
Example¶
When running the pipeline on a local machine without COMPSs:
    python process_bs_seeker_filter.py \
        --config tests/json/config_wgbs_filter.json \
        --in_metadata tests/json/input_wgbs_filter_metadata.json \
        --out_metadata tests/json/output_metadata.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_bs_seeker_filter.py \
        --config tests/json/config_wgbs_filter.json \
        --in_metadata tests/json/input_wgbs_filter_metadata.json \
        --out_metadata tests/json/output_metadata.json
Methods¶
class process_bs_seeker_filter.process_bsFilter(configuration=None)¶
Functions for filtering FASTQ files. Files are filtered for removal of duplicate reads. Low quality reads in qseq file can also be filtered.
run(input_files, metadata, output_files)¶
This pipeline processes FASTQ files to filter duplicate entries.
Parameters: - input_files (dict) –
List of strings for the locations of files. These should include:
- fastq1 : str
- Location for the first FASTQ file for single or paired end reads
- fastq2 : str
- Location for the second FASTQ file if paired end reads [OPTIONAL]
- metadata (dict) –
Input file meta data associated with their roles
fastq1 : str
- fastq2 : str
- [OPTIONAL]
- output_files (dict) –
fastq1_filtered : str
- fastq2_filtered : str
- [OPTIONAL]
Returns: - fastq1_filtered|fastq1_filtered (str) – Locations of the filtered FASTQ files from which alignments were made
- fastq2_filtered|fastq2_filtered (str) – Locations of the filtered FASTQ files from which alignments were made
BS Seeker2 Methylation Peak Caller¶
BWA Alignment - bwa aln¶
This pipeline aligns FASTQ reads to a given indexed genome. The pipeline can handle single-end and paired-end reads.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bam : file
- Aligned reads in bam file
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine without COMPSs:
    python process_align_bwa.py \
        --config tests/json/config_chipseq.json \
        --in_metadata tests/json/input_chipseq.json \
        --out_metadata tests/json/output_chipseq.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_align_bwa.py \
        --config tests/json/config_bwa_aln_single.json \
        --in_metadata tests/json/input_bwa_aln_single_metadata.json \
        --out_metadata tests/json/output_bwa_aln_single.json

    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_align_bwa.py \
        --config tests/json/config_bwa_aln_paired.json \
        --in_metadata tests/json/input_bwa_aln_paired_metadata.json \
        --out_metadata tests/json/output_bwa_aln_paired.json
Methods¶
class process_align_bwa.process_bwa(configuration=None)¶
Functions for aligning FastQ files with BWA ALN.
run(input_files, metadata, output_files)¶
Main run function for aligning FastQ reads with BWA ALN.
Parameters: - input_files (dict) –
Location of the initial input files required by the workflow
- genome : str
- Genome FASTA file
- index : str
- Location of the BWA archived index files
- loc : str
- Location of the FASTQ reads files
- fastq2 : str
- [OPTIONAL] Location of the FASTQ reads file for paired end data
- metadata (dict) –
Input file meta data associated with their roles
genome : str index : str loc : str fastq2 : str
- output_files (dict) –
Output file locations
- bam : str
- Output bam file location
Returns: output_files (dict) – Output file locations associated with their roles, for the output
- bam : str
Aligned FASTQ short read file locations
output_metadata (dict) – Output metadata for the associated files in output_files
bam : Metadata
BWA Alignment - bwa mem¶
This pipeline aligns FASTQ reads to a given indexed genome. The pipeline can handle single-end and paired-end reads.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bam : file
- Aligned reads in bam file
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine without COMPSs:
    python process_align_bwa_mem.py \
        --config tests/json/config_chipseq.json \
        --in_metadata tests/json/input_chipseq.json \
        --out_metadata tests/json/output_chipseq.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_align_bwa_mem.py \
        --config tests/json/config_bwa_mem_single.json \
        --in_metadata tests/json/input_bwa_mem_single_metadata.json \
        --out_metadata tests/json/output_bwa_mem_single.json

    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_align_bwa_mem.py \
        --config tests/json/config_bwa_mem_paired.json \
        --in_metadata tests/json/input_bwa_mem_paired_metadata.json \
        --out_metadata tests/json/output_bwa_mem_paired.json
Methods¶
class process_align_bwa_mem.process_bwa_mem(configuration=None)¶
Functions for aligning FastQ files with BWA MEM.
run(input_files, metadata, output_files)¶
Main run function for aligning FastQ data with BWA MEM.
Parameters: - input_files (dict) –
Location of the initial input files required by the workflow
- genome : str
- Genome FASTA file
- index : str
- Location of the BWA archived index files
- loc : str
- Location of the FASTQ reads files
- fastq2 : str
- [OPTIONAL] Location of the FASTQ reads file for paired end data
- metadata (dict) –
Input file meta data associated with their roles
genome : str index : str loc : str fastq2 : str
- output_files (dict) –
Output file locations
- bam : str
- Output bam file location
Returns: output_files (dict) – Output file locations associated with their roles, for the output
- bam : str
Aligned FASTQ short read file locations
output_metadata (dict) – Output metadata for the associated files in output_files
bam : Metadata
ChIP-Seq Analysis¶
This pipeline can process FASTQ to identify protein-DNA binding sites.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bed : file
- Bed files with the locations of transcription factor binding sites within the genome
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine without COMPSs:
    python process_chipseq.py \
        --config tests/json/config_chipseq.json \
        --in_metadata tests/json/input_chipseq.json \
        --out_metadata tests/json/output_chipseq.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_chipseq.py \
        --config tests/json/config_chipseq.json \
        --in_metadata tests/json/input_chipseq.json \
        --out_metadata tests/json/output_chipseq.json
Methods¶
class process_chipseq.process_chipseq(configuration=None)¶
Functions for processing ChIP-seq FastQ files. Files are then aligned, filtered and analysed for peak calling.
run(input_files, metadata, output_files)¶
Main run function for processing ChIP-seq FastQ data. The pipeline aligns the FASTQ files to the genome using BWA. MACS 2 is then used for peak calling to identify transcription factor binding sites within the genome.
Currently this can only handle a single data file and a single background file.
Parameters: - input_files (dict) –
Location of the initial input files required by the workflow
- genome : str
- Genome FASTA file
- index : str
- Location of the BWA archived index files
- loc : str
- Location of the FASTQ reads files
- fastq2 : str
- Location of the paired end FASTQ file [OPTIONAL]
- bg_loc : str
- Location of the background FASTQ reads files [OPTIONAL]
- fastq2_bg : str
- Location of the paired end background FASTQ reads files [OPTIONAL]
- metadata (dict) –
Input file meta data associated with their roles
genome : str index : str
- bg_loc : str
- [OPTIONAL]
- output_files (dict) –
Output file locations
bam [, “bam_bg”] : str filtered [, “filtered_bg”] : str narrow_peak : str summits : str broad_peak : str gapped_peak : str
Returns: output_files (dict) – Output file locations associated with their roles, for the output
- bam [, “bam_bg”] : str
Aligned FASTQ short read file [ and aligned background file] locations
- filtered [, “filtered_bg”] : str
Filtered versions of the respective bam files
- narrow_peak : str
Results files in bed6+4 format
- summits : str
Results files in bed4+1 format
- broad_peak : str
Results files in bed6+3 format
- gapped_peak : str
Results files in bed12+3 format
output_metadata (dict) – Output metadata for the associated files in output_files
bam [, “bam_bg”] : Metadata filtered [, “filtered_bg”] : Metadata narrow_peak : Metadata summits : Metadata broad_peak : Metadata gapped_peak : Metadata
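The output roles listed above depend on whether a background sample is supplied: `bam_bg` and `filtered_bg` appear only alongside the optional `bg_loc` input. The sketch below derives the expected set of output roles from the inputs. The function name `chipseq_output_roles` and the file paths are hypothetical, and the rule that `bg_loc` alone triggers the background outputs is an assumption read from the bracketed notation in the parameter list.

```python
"""Sketch: list the output roles expected from a ChIP-seq run."""

# Peak files named in the Returns section.
PEAK_ROLES = ["narrow_peak", "summits", "broad_peak", "gapped_peak"]


def chipseq_output_roles(input_files):
    """Return output roles, adding the *_bg roles when a background is given."""
    roles = ["bam", "filtered"]
    if "bg_loc" in input_files:
        roles += ["bam_bg", "filtered_bg"]
    return roles + PEAK_ROLES


# Hypothetical inputs without and with a background sample.
without_bg = {"genome": "genome.fa", "index": "genome.bwa.tar.gz", "loc": "chip.fastq"}
with_bg = dict(without_bg, bg_loc="input_control.fastq")

print(chipseq_output_roles(without_bg))
print(chipseq_output_roles(with_bg))
```

A list like this can be compared against the keys of the `output_files` dict before a run to catch a missing or misspelled role early.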
iDamID-Seq Analysis¶
This pipeline can process FASTQ to identify protein-DNA binding sites.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bigwig : file
- Bigwig file of the binding profile of transcription factors
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine without COMPSs:
    python process_damidseq.py \
        --config tests/json/config_idamidseq.json \
        --in_metadata tests/json/input_idamidseq.json \
        --out_metadata tests/json/output_idamidseq.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_damidseq.py \
        --config tests/json/config_idamidseq.json \
        --in_metadata tests/json/input_idamidseq.json \
        --out_metadata tests/json/output_idamidseq.json
Methods¶
class process_damidseq.process_damidseq(configuration=None)¶
Functions for processing iDamID-seq FastQ files. Files are then aligned, filtered and analysed for peak calling.
run(input_files, metadata, output_files)¶
Main run function for processing DamID-seq FastQ data. The pipeline aligns the FASTQ files to the genome using BWA. iDEAR is then used for peak calling to identify transcription factor binding sites within the genome.
Currently this can only handle a single data file and a single background file.
Parameters: - input_files (dict) –
Location of the initial input files required by the workflow
- genome : str
- Genome FASTA file
- index : str
- Location of the BWA archived index files
- fastq_1 : str
- Location of the FASTQ reads files
- fastq_2 : str
- Location of the FASTQ repeat reads files
- bg_fastq_1 : str
- Location of the background FASTQ reads files
- bg_fastq_2 : str
- Location of the background FASTQ repeat reads files
- metadata (dict) –
Input file meta data associated with their roles
genome : str index : str fastq_1 : str fastq_2 : str bg_fastq_1 : str bg_fastq_2 : str
- output_files (dict) –
Output file locations
bam [, “bam_bg”] : str filtered [, “filtered_bg”] : str
Returns: output_files (dict) – Output file locations associated with their roles, for the output
- bam [, “bam_bg”] : str
Aligned FASTQ short read file [ and aligned background file] locations
- filtered [, “filtered_bg”] : str
Filtered versions of the respective bam files
- bigwig : str
Location of the bigwig peaks
output_metadata (dict) – Output metadata for the associated files in output_files
bam [, “bam_bg”] : Metadata filtered [, “filtered_bg”] : Metadata bigwig : Metadata
iNPS¶
This pipeline processes BAM files to identify nucleosome positions.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bed : file
- Bed files with the locations of nucleosome binding sites within the genome
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine without COMPSs:
    python process_iNPS.py \
        --config tests/json/config_inps.json \
        --in_metadata tests/json/input_iNPS_metadata.json \
        --out_metadata tests/json/output_iNPS.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_iNPS.py \
        --config tests/json/config_inps.json \
        --in_metadata tests/json/input_iNPS_metadata.json \
        --out_metadata tests/json/output_iNPS.json
Methods¶
class process_iNPS.process_iNPS(configuration=None)¶
Functions for the improved nucleosome positioning algorithm (iNPS). BAM files are analysed for peaks for nucleosome positioning.
run(input_files, metadata, output_files)¶
This pipeline processes bam files to identify nucleosome regions within the genome and generates bed files.
Parameters: - input_files (dict) – bam_file : str Location of the aligned sequences in bam format
- output_files (dict) – peak_bed : str Location of the collated bed file of nucleosome peak calls
Returns: peak_bed – Location of the collated bed file of nucleosome peak calls
Return type: str
MACS2 Analysis¶
Transcription factor binding site peak caller for ChIP-seq data.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
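- narrow_peak, summits, broad_peak, gapped_peak : file
- Peak files with the locations of transcription factor binding sites within the genome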
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine without COMPSs:
    python process_macs2.py \
        --config tests/json/config_macs2.json \
        --in_metadata tests/json/input_macs2.json \
        --out_metadata tests/json/output_macs2.json \
        --local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_macs2.py \
        --config tests/json/config_macs2_single.json \
        --in_metadata tests/json/input_macs2_metadata.json \
        --out_metadata tests/json/output_macs2.json

    runcompss \
        --lang=python \
        --library_path=${HOME}/bin \
        --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
        --log_level=debug \
        process_macs2.py \
        --config tests/json/config_macs2_bgd_paired.json \
        --in_metadata tests/json/input_macs2_bgd_paired_metadata.json \
        --out_metadata tests/json/output_macs2_bgd.json
Methods¶
class process_macs2.process_macs2(configuration=None)¶
Functions for calling peaks on aligned ChIP-seq reads with MACS 2.
run(input_files, metadata, output_files)¶
Main run function for calling peaks on aligned ChIP-seq reads. MACS 2 is used for peak calling to identify transcription factor binding sites within the genome.
Currently this can only handle a single data file and a single background file.
Parameters: - input_files (dict) –
Location of the initial input files required by the workflow
- bam : str
- Location of the aligned reads file
- bam_bg : str
- Location of the background aligned FASTQ reads file [OPTIONAL]
- metadata (dict) –
Input file meta data associated with their roles
bam : str
- bam_bg : str
- [OPTIONAL]
- output_files (dict) –
Output file locations
narrow_peak : str summits : str broad_peak : str gapped_peak : str
Returns: output_files (dict) – Output file locations associated with their roles, for the output
- narrow_peak : str
Results files in bed6+4 format
- summits : str
Results files in bed4+1 format
- broad_peak : str
Results files in bed6+3 format
- gapped_peak : str
Results files in bed12+3 format
output_metadata (dict) – Output metadata for the associated files in output_files
narrow_peak : Metadata summits : Metadata broad_peak : Metadata gapped_peak : Metadata
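The REQUIREMENT notes in the sections above imply a run order: alignment needs the genome indexes, BioBamBam filtering needs an aligned bam file, and MACS2 takes a bam file as input. The sketch below encodes those dependencies and derives an order with the standard-library `graphlib` module. Placing the filtering step before MACS2 is an assumption that mirrors the align, filter, call-peaks sequence described for the ChIP-seq workflow; the `requires` mapping is illustrative and not part of the pipeline code.

```python
"""Sketch: derive a run order from the pipeline requirements."""
from graphlib import TopologicalSorter  # available from Python 3.9

# Each script maps to the set of scripts whose output it needs first.
requires = {
    "process_genome.py": set(),
    "process_align_bwa.py": {"process_genome.py"},
    "process_biobambam.py": {"process_align_bwa.py"},
    "process_macs2.py": {"process_biobambam.py"},  # assumed: peaks called on filtered reads
}

# static_order() yields every script after all of its prerequisites.
run_order = list(TopologicalSorter(requires).static_order())
for step, script in enumerate(run_order, start=1):
    print(step, script)
```

Because the dependencies form a single chain, the resulting order is unique; with branching dependencies `TopologicalSorter` would return one of several valid orders.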
MNase-Seq Analysis¶
This pipeline can process FASTQ to identify nucleosome binding sites.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bed : file
- Bed files with the locations of nucleosome binding sites within the genome
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine without COMPSs:
python process_mnaseseq.py \
--config tests/json/config_mnaseseq.json \
--in_metadata tests/json/input_mnaseseq.json \
--out_metadata tests/json/output_mnaseseq.json \
--local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
runcompss \
--lang=python \
--library_path=${HOME}/bin \
--pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
--log_level=debug \
process_mnaseseq.py \
--config tests/json/config_mnaseseq.json \
--in_metadata tests/json/input_mnaseseq.json \
--out_metadata tests/json/output_mnaseseq.json
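The pipelines that take --config, --in_metadata and --out_metadata share the same command line shape, so the local invocation can be sketched as a small Python helper. The function name build_pipeline_command is hypothetical; the flags are the ones used in the examples in this document, and the snippet only assembles the argument list without running anything.

```python
def build_pipeline_command(script, config, in_metadata, out_metadata, local=True):
    """Assemble the argument list for one of the pipeline scripts.

    Hypothetical helper: it mirrors the command line examples in this
    document and does not call into the pipeline code itself.
    """
    cmd = [
        "python", script,
        "--config", config,
        "--in_metadata", in_metadata,
        "--out_metadata", out_metadata,
    ]
    if local:
        # --local is used when running without COMPSs
        cmd.append("--local")
    return cmd


cmd = build_pipeline_command(
    "process_mnaseseq.py",
    "tests/json/config_mnaseseq.json",
    "tests/json/input_mnaseseq.json",
    "tests/json/output_mnaseseq.json",
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.check_call(cmd) from the standard library to launch the pipeline from a script.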
Methods¶
-
class
process_mnaseseq.
process_mnaseseq
(configuration=None)[source]¶ Functions for downloading and processing MNase-seq FastQ files. Files are downloaded from the European Nucleotide Archive (ENA), then aligned, filtered and analysed for peak calling
-
run
(input_files, metadata, output_files)[source]¶ Main run function for processing MNase-Seq FastQ data. Pipeline aligns the FASTQ files to the genome using BWA. iNPS is then used for peak calling to identify nucleosome position sites within the genome.
Parameters: - files_ids (list) – List of file locations
- metadata (list) – Required meta data
Returns: outputfiles – List of locations for the output bam, bed and tsv files
Return type: list
-
RNA-Seq Analysis¶
This pipeline can process FASTQ to quantify the level of expression of cDNAs.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- bed : file
- WIG file with the levels of expression for genes
Example¶
When running the pipeline on a local machine without COMPSs:
python process_rnaseq.py \
--config tests/json/config_rnaseq.json \
--in_metadata tests/json/input_rnaseq.json \
--out_metadata tests/json/output_rnaseq.json \
--local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
runcompss \
--lang=python \
--library_path=${HOME}/bin \
--pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
--log_level=debug \
process_rnaseq.py \
--config tests/json/config_rnaseq.json \
--in_metadata tests/json/input_rnaseq.json \
--out_metadata tests/json/output_rnaseq.json
Methods¶
-
class
process_rnaseq.
process_rnaseq
(configuration=None)[source]¶ Functions for downloading and processing RNA-seq FastQ files. Files are downloaded from the European Nucleotide Archive (ENA), then they are mapped to quantify the amount of cDNA
-
run
(input_files, metadata, output_files)[source]¶ Main run function for processing RNA-Seq FastQ data. Pipeline aligns the FASTQ files to the genome using Kallisto. Kallisto is then also used to quantify the levels of expression.
Parameters: - input_files (dict) – List of file locations (genome FASTA, FASTQ_01, FASTQ_02 (for paired ends))
- metadata (list) – Required meta data
- output_files (list) – List of output file locations
Returns: outputfiles – List of locations for the output bam, bed and tsv files
Return type: list
Returns: - outputfiles (dict) – List of locations for the output index files
- output_metadata (dict) – Metadata about each of the files
-
TrimGalore¶
This pipeline can process FASTQ files to trim bases of poor quality and adapter contamination.
Running from the command line¶
Parameters¶
- config : str
- Configuration JSON file
- in_metadata : str
- Location of input JSON metadata for files
- out_metadata : str
- Location of output JSON metadata for files
Returns¶
- fastq_trimmed : file
- Location of a FASTQ file containing the sequences after trimming of poor quality bases or adapter contamination
A full description of the Trim Galore files can be found at https://github.com/FelixKrueger/TrimGalore
Example¶
When running the pipeline on a local machine without COMPSs:
python process_trim_galore.py \
--config tests/json/config_trimgalore.json \
--in_metadata tests/json/input_trimgalore_metadata.json \
--out_metadata tests/json/output_trimgalore.json \
--local
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
runcompss \
--lang=python \
--library_path=${HOME}/bin \
--pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
--log_level=debug \
process_trim_galore.py \
--config tests/json/config_trimgalore.json \
--in_metadata tests/json/input_trimgalore_metadata.json \
--out_metadata tests/json/output_trimgalore.json
Methods¶
-
class
process_trim_galore.
process_trim_galore
(configuration=None)[source]¶ Functions for filtering FASTQ files. Files are filtered for removal of duplicate reads. Low quality reads in qseq file can also be filtered.
-
run
(input_files, metadata, output_files)[source]¶ This pipeline processes FASTQ files to trim low quality base calls and adapter sequences
Parameters: - input_files (dict) –
List of strings for the locations of files. These should include:
- fastq : str
- Location for the first FASTQ file for single or paired end reads
- metadata (dict) –
Input file meta data associated with their roles
- output_files (dict) –
- fastq_trimmed : str
Returns: fastq_trimmed – Locations of the filtered FASTQ files from which trimmings were made
Return type: str
-
Whole Genome Bisulphite Sequencing Analysis¶
Hi-C Analysis¶
This pipeline can process paired end FASTQ files to identify structural interactions that occur so that the genome can fold into the confines of the nucleus
Running from the command line¶
Parameters¶
- genome : str
- Location of the genome FASTA file
- genome_gem : str
- Location of the genome GEM index file
- taxon_id : int
- Species taxonomic ID
- assembly : str
- Genomic assembly ID
- file1 : str
- Location of FASTQ file 1
- file2 : str
- Location of FASTQ file 2
- resolutions : str
- Comma separated list of resolutions to calculate the matrix for. [DEFAULT : 1000000,10000000]
- enzyme_name : str
- Name of the enzyme used to digest the genome (example ‘MboI’)
- window_type : str
- iter | frag. Analysis windowing type to use
- windows1 : str
- FASTQ sampling window sizes to use for the first paired end FASTQ file, the default is to use [[1,25], [1,50], [1,75], [1,100]]. This would be represented as 1,25,50,75,100 as input for this variable
- windows2 : str
- FASTQ sampling window sizes to use for the second paired end FASTQ file, the default is to use [[1,25], [1,50], [1,75], [1,100]]. This would be represented as 1,25,50,75,100 as input for this variable
- normalized : int
- 1 | 0. Determines whether the counts of alignments should be normalized
- tag : str
- Name for the experiment output files to use
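The comma separated encoding of windows1, windows2 and resolutions can be illustrated with a short sketch. It assumes, based on the example given above, that the first value is the shared start position and every later value is a window end; the function names parse_windows and parse_resolutions are hypothetical and are not taken from the pipeline code.

```python
def parse_windows(value):
    """Expand a string such as "1,25,50,75,100" into [start, end] pairs.

    Assumption: the first number is the start shared by every window and
    each following number is the end of one window.
    """
    numbers = [int(item) for item in value.split(",")]
    if len(numbers) < 2:
        raise ValueError("Expected a start and at least one end position")
    start = numbers[0]
    return [[start, end] for end in numbers[1:]]


def parse_resolutions(value):
    """Convert a string such as "1000000,10000000" into a list of ints."""
    return [int(item) for item in value.split(",")]


print(parse_windows("1,25,50,75,100"))
print(parse_windows("1,100"))
print(parse_resolutions("1000000,10000000"))
```

Under this reading, the value 1,100 used in the command line examples corresponds to a single window covering positions 1 to 100.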
Returns¶
- Adjacency List : file
- HDF5 Adjacency Array : file
Example¶
REQUIREMENT - Needs the indexing step to be run first
When running the pipeline on a local machine:
python process_hic.py \
--genome /<dataset_dir>/Homo_sapiens.GRCh38.fasta \
--genome_gem /<dataset_dir>/Homo_sapiens.GRCh38.gem \
--assembly GCA_000001405.25 \
--taxon_id 9606 \
--file1 /<dataset_dir>/<file_name>_1.fastq \
--file2 /<dataset_dir>/<file_name>_2.fastq \
--resolutions 1000000,10000000 \
--enzyme_name MboI \
--windows1 1,100 \
--windows2 1,100 \
--normalized 1 \
--tag Human.SRR1658573 \
--window_type frag
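To make the documented types and allowed values concrete, the following is an illustrative argparse sketch that accepts the flags listed in the Parameters section. It is not the parser used by process_hic.py: which arguments are required and the default chosen for --normalized are assumptions made for this example, while the defaults for --resolutions and the window arguments follow the parameter descriptions.

```python
import argparse


def make_hic_parser():
    """Build an illustrative parser for the documented Hi-C flags.

    Assumption: required/optional status and the --normalized default are
    guesses for this sketch, not taken from the pipeline source.
    """
    parser = argparse.ArgumentParser(description="Hi-C arguments (sketch)")
    parser.add_argument("--genome", required=True)
    parser.add_argument("--genome_gem", required=True)
    parser.add_argument("--taxon_id", type=int, required=True)
    parser.add_argument("--assembly", required=True)
    parser.add_argument("--file1", required=True)
    parser.add_argument("--file2", required=True)
    parser.add_argument("--resolutions", default="1000000,10000000")
    parser.add_argument("--enzyme_name", required=True)
    parser.add_argument("--window_type", choices=["iter", "frag"], required=True)
    parser.add_argument("--windows1", default="1,25,50,75,100")
    parser.add_argument("--windows2", default="1,25,50,75,100")
    parser.add_argument("--normalized", type=int, choices=[0, 1], default=1)
    parser.add_argument("--tag", required=True)
    return parser


args = make_hic_parser().parse_args([
    "--genome", "/<dataset_dir>/Homo_sapiens.GRCh38.fasta",
    "--genome_gem", "/<dataset_dir>/Homo_sapiens.GRCh38.gem",
    "--assembly", "GCA_000001405.25",
    "--taxon_id", "9606",
    "--file1", "/<dataset_dir>/<file_name>_1.fastq",
    "--file2", "/<dataset_dir>/<file_name>_2.fastq",
    "--enzyme_name", "MboI",
    "--windows1", "1,100",
    "--windows2", "1,100",
    "--tag", "Human.SRR1658573",
    "--window_type", "frag",
])
print(args.taxon_id, args.window_type, args.normalized, args.resolutions)
```

Declaring choices for --window_type and --normalized means an unsupported value is rejected when the arguments are parsed, before any alignment work starts.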
When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):
runcompss \
--lang=python \
--library_path=${HOME}/bin \
--pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
--log_level=debug \
process_hic.py \
--taxon_id 9606 \
--genome /<dataset_dir>/.Human.GCA_000001405.22_gem.fasta \
--assembly GRCh38 \
--file1 /<dataset_dir>/Human.SRR1658573_1.fastq \
--file2 /<dataset_dir>/Human.SRR1658573_2.fastq \
--genome_gem /<dataset_dir>/Human.GCA_000001405.22_gem.fasta.gem \
--enzyme_name MboI \
--resolutions 10000,100000 \
--windows1 1,100 \
--windows2 1,100 \
--normalized 1 \
--tag Human.SRR1658573 \
--window_type frag
Methods¶
-
class
process_hic.
process_hic
(configuration=None)[source]¶ Functions for processing paired end Hi-C FastQ files to identify structural interactions within the genome
-
run
(input_files, metadata, output_files)[source]¶ Main run function for processing Hi-C FastQ data. Pipeline aligns the paired end FASTQ files to the genome using the GEM index. The alignments are then used to calculate the interaction matrix at each of the requested resolutions.
Parameters: - files_ids (list) – List of file locations
- metadata (list) – Required meta data
- output_files (list) – List of output file locations
Returns: outputfiles – List of locations for the output bam, bed and tsv files
Return type: list
-