Pipelines

Download and index genome files

This pipeline is for the indexing of genomes once they have been loaded into the VRE. It indexes each new genome with Bowtie2, BWA and GEM. These indexes can then be used by the other pipelines.

Running from the command line

Parameters

taxon_id : int
Species taxonomic ID
assembly : str
Genomic assembly ID
genome : str
Location of the genome FASTA file

Returns

Bowtie2 index files
BWA index files
GEM index file

Example

When running the pipeline on a local machine without COMPSs:

python process_genome.py                              \
   --config tests/json/config_genome_indexer.json \
   --in_metadata tests/json/input_genome_indexer.json \
   --out_metadata tests/json/output_genome_indexer.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                               \
   --lang=python                       \
   --library_path=${HOME}/bin          \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug \
   process_genome.py \
      --config tests/json/config_genome_indexer.json \
      --in_metadata tests/json/input_genome_indexer.json \
      --out_metadata tests/json/output_genome_indexer.json
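
Both invocations above take the same three JSON arguments, and the local run adds `--local`. As a rough sketch, the local command line can be assembled in Python as shown below. The helper name `build_command` is hypothetical and is not part of the pipeline code; only the flags shown in the examples are used.

```python
import shlex


def build_command(script, config, in_metadata, out_metadata, local=True):
    """Assemble the argument list for a pipeline script.

    Hypothetical helper: it only uses the flags shown in the examples.
    """
    command = [
        "python", script,
        "--config", config,
        "--in_metadata", in_metadata,
        "--out_metadata", out_metadata,
    ]
    if local:
        # --local is what the examples pass when running without COMPSs
        command.append("--local")
    return command


command = build_command(
    "process_genome.py",
    "tests/json/config_genome_indexer.json",
    "tests/json/input_genome_indexer.json",
    "tests/json/output_genome_indexer.json",
)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the directory that contains the script.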

Methods

class process_genome.process_genome(configuration=None)[source]

Workflow to download and pre-index a given genome

run(input_files, metadata, output_files)[source]

Main run function for the indexing of genome assembly FASTA files. The pipeline builds Bowtie2, BWA and GEM indexes, ready for use in pipelines that rely on alignment.

Parameters:
  • input_files (dict) –
    genome : str
    List of file locations
  • metadata (dict) –
    genome : dict
    Required meta data
  • output_files (dict) –
    bwa_index : str
    Location of the BWA index archive files
    bwt_index : str
    Location of the Bowtie2 index archive file
    gem_index : str
    Location of the GEM index file
    genome_gem : str
    Location of the FASTA file generated for the GEM indexing step
Returns:

  • outputfiles (dict) – List of locations for the output index files
  • output_metadata (dict) – Metadata about each of the files
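
As an illustration of the dictionaries described above, the sketch below builds `input_files` and `output_files` keyed by the documented roles. All paths and the `expected_roles` check are assumptions made for this example. The call to the real class is left commented out because it needs the pipeline code and the indexing tools to be installed.

```python
# Roles documented for process_genome.run(); all paths are hypothetical.
input_files = {"genome": "data/genome.fasta"}

output_files = {
    "bwa_index": "data/genome.fasta.bwa.tar.gz",
    "bwt_index": "data/genome.fasta.bt2.tar.gz",
    "gem_index": "data/genome.gem.fasta.gem.gz",
    "genome_gem": "data/genome.gem.fasta",
}

# Check that every documented output role has been given a location.
expected_roles = {"bwa_index", "bwt_index", "gem_index", "genome_gem"}
missing = expected_roles - set(output_files)
print("missing output roles:", sorted(missing))

# With the pipeline code importable, the call would follow the documented
# signature (metadata is a dict keyed by the same roles as input_files):
# from process_genome import process_genome
# handle = process_genome(configuration={})
# files, meta = handle.run(input_files, metadata, output_files)
```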

BioBamBam Alignment Filtering

This pipeline filters sequencing artifacts from aligned reads.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

filtered : file
Filtered bam file

Example

REQUIREMENT - Needs an aligned bam file

When running the pipeline on a local machine without COMPSs:

python process_biobambam.py \
   --config tests/json/config_biobambam.json \
   --in_metadata tests/json/input_biobambam.json \
   --out_metadata tests/json/output_biobambam.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_biobambam.py         \
      --config tests/json/config_biobambam.json \
      --in_metadata tests/json/input_biobambam.json \
      --out_metadata tests/json/output_biobambam.json

Methods

class process_biobambam.process_biobambam(configuration=None)[source]

Functions for filtering FastQ alignments with BioBamBam.

run(input_files, metadata, output_files)[source]

Main run function for filtering FastQ aligned reads using BioBamBam.

Parameters:
  • input_files (dict) –

    Location of the initial input files required by the workflow

    bam : str
    Location of BAM file
  • metadata (dict) –

    Input file meta data associated with their roles

    bam : str

  • output_files (dict) –

    Output file locations

    filtered : str

Returns:

  • output_files (dict) – Output file locations associated with their roles, for the output

    filtered : str

    Filtered version of the bam file

  • output_metadata (dict) – Output metadata for the associated files in output_files

    filtered : Metadata
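
Because this pipeline needs an aligned bam file, its `bam` input is typically the `bam` output of one of the alignment pipelines. The sketch below shows that hand-over using the documented role names. The file names and the naming rule for the filtered file are assumptions made for illustration.

```python
# Output of an alignment step, keyed by role (path is hypothetical).
aligner_output = {"bam": "reads.bam"}

# The filter pipeline takes the same role as its input ...
input_files = {"bam": aligner_output["bam"]}

# ... and writes to the documented `filtered` role.
output_files = {
    "filtered": input_files["bam"].replace(".bam", ".filtered.bam")
}
print(output_files["filtered"])
```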

Bowtie2 Alignment

This pipeline aligns FASTQ reads to a given indexed genome. The pipeline can handle single-end and paired-end reads.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bam : file
Aligned reads in bam file

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine without COMPSs:

python process_align_bowtie.py \
   --config tests/json/config_bowtie2.json \
   --in_metadata tests/json/input_bowtie2.json \
   --out_metadata tests/json/output_bowtie2.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_align_bowtie.py         \
      --config tests/json/config_bowtie2_single.json \
      --in_metadata tests/json/input_bowtie2_single_metadata.json \
      --out_metadata tests/json/output_bowtie2_single.json

The same command with the paired-end test files:

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_align_bowtie.py         \
      --config tests/json/config_bowtie2_paired.json \
      --in_metadata tests/json/input_bowtie2_paired_metadata.json \
      --out_metadata tests/json/output_bowtie2_paired.json

Methods

class process_align_bowtie.process_bowtie(configuration=None)[source]

Functions for aligning FastQ files with Bowtie2

run(input_files, metadata, output_files)[source]

Main run function for aligning FastQ reads with Bowtie2.

Currently this can only handle a single data file and a single background file.

Parameters:
  • input_files (dict) –

    Location of the initial input files required by the workflow

    genome : str
    Genome FASTA file
    index : str
    Location of the Bowtie2 archived index files
    loc : str
    Location of the FASTQ reads files
    fastq2 : str
    [OPTIONAL] Location of the FASTQ reads file for paired end data
  • metadata (dict) –

    Input file meta data associated with their roles

    genome : str
    index : str
    loc : str
    fastq2 : str

  • output_files (dict) –

    Output file locations

    bam : str
    Output bam file location
Returns:

  • output_files (dict) – Output file locations associated with their roles, for the output

    bam : str

    Aligned FASTQ short read file locations

  • output_metadata (dict) – Output metadata for the associated files in output_files

    bam : Metadata
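
The only difference between a single-end and a paired-end run is the optional `fastq2` role. A minimal sketch of building `input_files` for both cases follows. `alignment_inputs` is a hypothetical helper and the file names are placeholders.

```python
def alignment_inputs(genome, index, loc, fastq2=None):
    """Build the input_files dict using the roles listed above.

    Hypothetical helper: fastq2 is only added for paired-end data,
    matching its [OPTIONAL] marker.
    """
    input_files = {"genome": genome, "index": index, "loc": loc}
    if fastq2 is not None:
        input_files["fastq2"] = fastq2
    return input_files


single = alignment_inputs("genome.fasta", "genome.bt2.tar.gz", "reads.fastq")
paired = alignment_inputs(
    "genome.fasta", "genome.bt2.tar.gz", "reads_1.fastq", "reads_2.fastq"
)
print(sorted(single))
print(sorted(paired))
```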

BSgenome Builder

This pipeline builds a BSgenome R package, and the files needed to generate it, from a genome FASTA file.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bsgenome : file
BSgenome index
genome_2bit : file
Compressed representation of the genome required for generating the index
chrom_size : file
Location of the chrom.size file
seed_file : file
Configuration file for generating the BSgenome R package

Example

When running the pipeline on a local machine without COMPSs:

python process_bsgenome.py                            \
   --config tests/json/config_bsgenome.json \
   --in_metadata tests/json/input_bsgenome.json \
   --out_metadata tests/json/output_bsgenome.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_bsgenome.py         \
      --config tests/json/config_bsgenome.json \
      --in_metadata tests/json/input_bsgenome.json \
      --out_metadata tests/json/output_bsgenome.json

Methods

class process_bsgenome.process_bsgenome(configuration=None)[source]

Workflow to build a BSgenome R package for a given genome

run(input_files, metadata, output_files)[source]

Main run function for building the BSgenome R package from a genome assembly FASTA file.

Parameters:
  • input_files (dict) –
    genome : str
    Location of the FASTA input file
  • metadata (dict) –
    genome : dict
    Required meta data
  • output_files (dict) –
    BSgenome : str
    Location of the BSgenome R package
Returns:

  • outputfiles (dict) – List of locations for the output index files
  • output_metadata (dict) – Metadata about each of the files

BS Seeker2 Indexer

This pipeline generates the index files required by BS Seeker2 from a genome FASTA file.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

index : file
BS Seeker2 index

Example

When running the pipeline on a local machine without COMPSs:

python process_bs_seeker_index.py                            \
   --config tests/json/config_wgbs_index.json \
   --in_metadata tests/json/input_wgbs_index_metadata.json \
   --out_metadata tests/json/output_wgbs_index.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_bs_seeker_index.py         \
      --config tests/json/config_wgbs_index.json \
      --in_metadata tests/json/input_wgbs_index_metadata.json \
      --out_metadata tests/json/output_wgbs_index.json

Methods

class process_bs_seeker_index.process_bs_seeker_index(configuration=None)[source]

Functions for generating the index files required by BS Seeker2

run(input_files, metadata, output_files)[source]

Main run function for generating the index files required by BS Seeker2.

Parameters:
  • input_files (dict) –

    List of strings for the locations of files. These should include:

    genome_fa : str
    Genome assembly in FASTA
  • metadata (dict) –

    Input file meta data associated with their roles

    genome : str

  • output_files (dict) –

    Output file locations

    index : str
    Output BS Seeker2 index file location
Returns:

output_files – Output file locations associated with their roles, for the output

index : str

Return type:

dict

BS Seeker2 Aligner

This pipeline aligns FASTQ paired end reads using BS Seeker2 and Bowtie2.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bam : file
Aligned Bam file
bai : file
Aligned Bam index file

Example

When running the pipeline on a local machine without COMPSs:

python process_bs_seeker_aligner.py                            \
   --config tests/json/config_wgbs_align.json \
   --in_metadata tests/json/input_wgbs_align_metadata.json \
   --out_metadata tests/json/output_wgbs_align.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_bs_seeker_aligner.py         \
      --config tests/json/config_wgbs_align.json \
      --in_metadata tests/json/input_wgbs_align_metadata.json \
      --out_metadata tests/json/output_wgbs_align.json

Methods

class process_bs_seeker_aligner.process_bs_seeker_aligner(configuration=None)[source]

Functions for downloading and processing whole genome bisulfite sequencing (WGBS) files. Files are filtered, aligned and analysed for points of methylation.

run(input_files, metadata, output_files)[source]

This pipeline processes paired-end FASTQ files to identify methylated regions within the genome.

Parameters:
  • input_files (dict) –

    List of strings for the locations of files. These should include:

    genome_fa : str
    Genome assembly in FASTA
    fastq1 : str
    Location for the first filtered FASTQ file for single or paired end reads
    fastq2 : str
    Location for the second filtered FASTQ file if paired end reads
    index : str
    Location of the index file
  • metadata (dict) –

    Input file meta data associated with their roles

    genome_fa : dict
    fastq1 : dict
    fastq2 : dict
    index : dict

  • output_files (dict) –
    bam : str
    bai : str
Returns:

bam|bai – Location of the alignment bam file and the associated index

Return type:

str

Bisulfite Sequencing Filter

This pipeline processes FASTQ files to filter out duplicate reads.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

fastq1_filtered : str
Location of the filtered first FASTQ file
fastq2_filtered : str
Location of the filtered second FASTQ file (paired-end reads only)

Example

When running the pipeline on a local machine without COMPSs:

python process_bs_seeker_filter.py                   \
   --config tests/json/config_wgbs_filter.json \
   --in_metadata tests/json/input_wgbs_filter_metadata.json \
   --out_metadata tests/json/output_metadata.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                                  \
   --lang=python                           \
   --library_path=${HOME}/bin              \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug                       \
   process_bs_seeker_filter.py                         \
      --config tests/json/config_wgbs_filter.json \
      --in_metadata tests/json/input_wgbs_filter_metadata.json \
      --out_metadata tests/json/output_metadata.json

Methods

class process_bs_seeker_filter.process_bsFilter(configuration=None)[source]

Functions for filtering FASTQ files. Files are filtered for removal of duplicate reads. Low quality reads in qseq file can also be filtered.

run(input_files, metadata, output_files)[source]

This pipeline processes FASTQ files to filter duplicate entries

Parameters:
  • input_files (dict) –

    List of strings for the locations of files. These should include:

    fastq1 : str
    Location for the first FASTQ file for single or paired end reads
    fastq2 : str
    Location for the second FASTQ file if paired end reads [OPTIONAL]
  • metadata (dict) –

    Input file meta data associated with their roles

    fastq1 : str

    fastq2 : str
    [OPTIONAL]
  • output_files (dict) –

    fastq1_filtered : str

    fastq2_filtered : str
    [OPTIONAL]
Returns:

  • fastq1_filtered (str) – Location of the filtered first FASTQ file
  • fastq2_filtered (str) – Location of the filtered second FASTQ file (paired-end reads only)
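
To make "filtering duplicate entries" concrete, the sketch below keeps the first read for each distinct sequence. It is a simplified stand-in written for illustration and is not the filter the pipeline actually runs. The function name and the in-memory record layout are assumptions.

```python
def filter_duplicate_reads(records):
    """Keep the first record for each distinct read sequence.

    `records` is an iterable of (header, sequence, quality) tuples.
    Simplified illustration only, not the pipeline's own filter.
    """
    seen = set()
    unique = []
    for header, sequence, quality in records:
        if sequence in seen:
            continue  # an identical sequence was already kept
        seen.add(sequence)
        unique.append((header, sequence, quality))
    return unique


# Made-up reads for illustration.
reads = [
    ("@read1", "ACGT", "IIII"),
    ("@read2", "ACGT", "IIII"),  # same sequence as read1
    ("@read3", "TTGA", "IIII"),
]
filtered = filter_duplicate_reads(reads)
print([header for header, _, _ in filtered])
```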

BS Seeker2 Methylation Peak Caller

BWA Alignment - bwa aln

This pipeline aligns FASTQ reads to a given indexed genome. The pipeline can handle single-end and paired-end reads.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bam : file
Aligned reads in bam file

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine without COMPSs:

python process_align_bwa.py                            \
   --config tests/json/config_bwa_aln_single.json \
   --in_metadata tests/json/input_bwa_aln_single_metadata.json \
   --out_metadata tests/json/output_bwa_aln_single.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_align_bwa.py         \
      --config tests/json/config_bwa_aln_single.json \
      --in_metadata tests/json/input_bwa_aln_single_metadata.json \
      --out_metadata tests/json/output_bwa_aln_single.json

The same command with the paired-end test files:

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_align_bwa.py         \
      --config tests/json/config_bwa_aln_paired.json \
      --in_metadata tests/json/input_bwa_aln_paired_metadata.json \
      --out_metadata tests/json/output_bwa_aln_paired.json

Methods

class process_align_bwa.process_bwa(configuration=None)[source]

Functions for aligning FastQ files with BWA ALN

run(input_files, metadata, output_files)[source]

Main run function for aligning FastQ reads with BWA ALN.

Parameters:
  • input_files (dict) –

    Location of the initial input files required by the workflow

    genome : str
    Genome FASTA file
    index : str
    Location of the BWA archived index files
    loc : str
    Location of the FASTQ reads files
    fastq2 : str
    [OPTIONAL] Location of the FASTQ reads file for paired end data
  • metadata (dict) –

    Input file meta data associated with their roles

    genome : str
    index : str
    loc : str
    fastq2 : str

  • output_files (dict) –

    Output file locations

    bam : str
    Output bam file location
Returns:

  • output_files (dict) – Output file locations associated with their roles, for the output

    bam : str

    Aligned FASTQ short read file locations

  • output_metadata (dict) – Output metadata for the associated files in output_files

    bam : Metadata

BWA Alignment - bwa mem

This pipeline aligns FASTQ reads to a given indexed genome. The pipeline can handle single-end and paired-end reads.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bam : file
Aligned reads in bam file

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine without COMPSs:

python process_align_bwa_mem.py                        \
   --config tests/json/config_bwa_mem_single.json \
   --in_metadata tests/json/input_bwa_mem_single_metadata.json \
   --out_metadata tests/json/output_bwa_mem_single.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_align_bwa_mem.py         \
      --config tests/json/config_bwa_mem_single.json \
      --in_metadata tests/json/input_bwa_mem_single_metadata.json \
      --out_metadata tests/json/output_bwa_mem_single.json

The same command with the paired-end test files:

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_align_bwa_mem.py         \
      --config tests/json/config_bwa_mem_paired.json \
      --in_metadata tests/json/input_bwa_mem_paired_metadata.json \
      --out_metadata tests/json/output_bwa_mem_paired.json

Methods

class process_align_bwa_mem.process_bwa_mem(configuration=None)[source]

Functions for aligning FastQ files with BWA MEM

run(input_files, metadata, output_files)[source]

Main run function for aligning FastQ data with BWA MEM.

Parameters:
  • input_files (dict) –

    Location of the initial input files required by the workflow

    genome : str
    Genome FASTA file
    index : str
    Location of the BWA archived index files
    loc : str
    Location of the FASTQ reads files
    fastq2 : str
    [OPTIONAL] Location of the FASTQ reads file for paired end data
  • metadata (dict) –

    Input file meta data associated with their roles

    genome : str
    index : str
    loc : str
    fastq2 : str

  • output_files (dict) –

    Output file locations

    bam : str
    Output bam file location
Returns:

  • output_files (dict) – Output file locations associated with their roles, for the output

    bam : str

    Aligned FASTQ short read file locations

  • output_metadata (dict) – Output metadata for the associated files in output_files

    bam : Metadata

ChIP-Seq Analysis

This pipeline can process FASTQ to identify protein-DNA binding sites.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bed : file
Bed files with the locations of transcription factor binding sites within the genome

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine without COMPSs:

python process_chipseq.py                            \
   --config tests/json/config_chipseq.json \
   --in_metadata tests/json/input_chipseq.json \
   --out_metadata tests/json/output_chipseq.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_chipseq.py         \
      --config tests/json/config_chipseq.json \
      --in_metadata tests/json/input_chipseq.json \
      --out_metadata tests/json/output_chipseq.json

Methods

class process_chipseq.process_chipseq(configuration=None)[source]

Functions for processing ChIP-seq FastQ files. Files are then aligned, filtered and analysed for peak calling.

run(input_files, metadata, output_files)[source]

Main run function for processing ChIP-seq FastQ data. Pipeline aligns the FASTQ files to the genome using BWA. MACS 2 is then used for peak calling to identify transcription factor binding sites within the genome.

Currently this can only handle a single data file and a single background file.

Parameters:
  • input_files (dict) –

    Location of the initial input files required by the workflow

    genome : str
    Genome FASTA file
    index : str
    Location of the BWA archived index files
    loc : str
    Location of the FASTQ reads files
    fastq2 : str
    Location of the paired end FASTQ file [OPTIONAL]
    bg_loc : str
    Location of the background FASTQ reads files [OPTIONAL]
    fastq2_bg : str
    Location of the paired end background FASTQ reads files [OPTIONAL]
  • metadata (dict) –

    Input file meta data associated with their roles

    genome : str
    index : str

    bg_loc : str
    [OPTIONAL]
  • output_files (dict) –

    Output file locations

    bam [, “bam_bg”] : str
    filtered [, “filtered_bg”] : str
    narrow_peak : str
    summits : str
    broad_peak : str
    gapped_peak : str

Returns:

  • output_files (dict) – Output file locations associated with their roles, for the output

    bam [, “bam_bg”] : str

    Aligned FASTQ short read file [ and aligned background file] locations

    filtered [, “filtered_bg”] : str

    Filtered versions of the respective bam files

    narrow_peak : str

    Results files in bed6+4 format

    summits : str

    Results files in bed4+1 format

    broad_peak : str

    Results files in bed6+3 format

    gapped_peak : str

    Results files in bed12+3 format

  • output_metadata (dict) – Output metadata for the associated files in output_files

    bam [, “bam_bg”] : Metadata
    filtered [, “filtered_bg”] : Metadata
    narrow_peak : Metadata
    summits : Metadata
    broad_peak : Metadata
    gapped_peak : Metadata
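
The bracketed `_bg` roles above are only present when a background sample is supplied. A small sketch of the resulting set of output roles follows. `chipseq_output_roles` is a hypothetical helper written for illustration and is not part of the pipeline.

```python
def chipseq_output_roles(has_background):
    """Return the output roles listed above for a ChIP-seq run.

    Hypothetical helper: the "_bg" roles are included only when a
    background sample is supplied.
    """
    roles = ["bam", "filtered"]
    if has_background:
        roles += ["bam_bg", "filtered_bg"]
    roles += ["narrow_peak", "summits", "broad_peak", "gapped_peak"]
    return roles


print(chipseq_output_roles(has_background=False))
print(chipseq_output_roles(has_background=True))
```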

iDamID-Seq Analysis

This pipeline can process FASTQ to identify protein-DNA binding sites.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bigwig : file
Bigwig file of the binding profile of transcription factors

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine without COMPSs:

python process_damidseq.py                            \
   --config tests/json/config_idamidseq.json \
   --in_metadata tests/json/input_idamidseq.json \
   --out_metadata tests/json/output_idamidseq.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_damidseq.py         \
      --config tests/json/config_idamidseq.json \
      --in_metadata tests/json/input_idamidseq.json \
      --out_metadata tests/json/output_idamidseq.json

Methods

class process_damidseq.process_damidseq(configuration=None)[source]

Functions for processing iDamID-seq FastQ files. Files are then aligned, filtered and analysed for peak calling.

run(input_files, metadata, output_files)[source]

Main run function for processing DamID-seq FastQ data. Pipeline aligns the FASTQ files to the genome using BWA. iDEAR is then used for peak calling to identify transcription factor binding sites within the genome.

Currently this can only handle a single data file and a single background file.

Parameters:
  • input_files (dict) –

    Location of the initial input files required by the workflow

    genome : str
    Genome FASTA file
    index : str
    Location of the BWA archived index files
    fastq_1 : str
    Location of the FASTQ reads files
    fastq_2 : str
    Location of the FASTQ repeat reads files
    bg_fastq_1 : str
    Location of the background FASTQ reads files
    bg_fastq_2 : str
    Location of the background FASTQ repeat reads files
  • metadata (dict) –

    Input file meta data associated with their roles

    genome : str
    index : str
    fastq_1 : str
    fastq_2 : str
    bg_fastq_1 : str
    bg_fastq_2 : str

  • output_files (dict) –

    Output file locations

    bam [, “bam_bg”] : str
    filtered [, “filtered_bg”] : str

Returns:

  • output_files (dict) – Output file locations associated with their roles, for the output

    bam [, “bam_bg”] : str

    Aligned FASTQ short read file [ and aligned background file] locations

    filtered [, “filtered_bg”] : str

    Filtered versions of the respective bam files

    bigwig : str

    Location of the bigwig peaks

  • output_metadata (dict) – Output metadata for the associated files in output_files

    bam [, “bam_bg”] : Metadata
    filtered [, “filtered_bg”] : Metadata
    bigwig : Metadata
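
Before launching a run it can help to check that every input role has a location. The sketch below assumes all six input roles listed above are required, since none is marked optional. `missing_roles` is a hypothetical helper and the paths are placeholders.

```python
DAMIDSEQ_INPUT_ROLES = (
    "genome", "index", "fastq_1", "fastq_2", "bg_fastq_1", "bg_fastq_2",
)


def missing_roles(input_files, required=DAMIDSEQ_INPUT_ROLES):
    """List the documented roles that are absent from input_files."""
    return [role for role in required if role not in input_files]


# Hypothetical, incomplete set of inputs.
input_files = {
    "genome": "genome.fasta",
    "index": "genome.fasta.bwa.tar.gz",
    "fastq_1": "sample_1.fastq",
    "fastq_2": "sample_2.fastq",
}
print(missing_roles(input_files))
```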

iNPS

This pipeline processes BAM files to identify nucleosome positions.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bed : file
Bed files with the locations of nucleosome binding sites within the genome

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine without COMPSs:

python process_iNPS.py                           \
   --config tests/json/config_inps.json \
   --in_metadata tests/json/input_iNPS_metadata.json \
   --out_metadata tests/json/output_iNPS.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                                        \
   --lang=python                                 \
   --library_path=${HOME}/bin                    \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug                             \
   process_iNPS.py                           \
       --config tests/json/config_inps.json \
       --in_metadata tests/json/input_iNPS_metadata.json \
       --out_metadata tests/json/output_iNPS.json

Methods

class process_iNPS.process_iNPS(configuration=None)[source]

Functions for improved nucleosome positioning algorithm (iNPS). Bam Files are analysed for peaks for nucleosome positioning

run(input_files, metadata, output_files)[source]

This pipeline processes bam files to identify nucleosome regions within the genome and generates bed files.

Parameters:
  • input_files (dict) –
    bam_file : str
    Location of the aligned sequences in bam format
  • output_files (dict) –
    peak_bed : str
    Location of the collated bed file of nucleosome peak calls
Returns:

peak_bed – Location of the collated bed file of nucleosome peak calls

Return type:

str

MACS2 Analysis

Transcription factor binding site peak caller for ChIP-seq data

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

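narrow_peak : file
Narrow peak calls
summits : file
Peak summits
broad_peak : file
Broad peak calls
gapped_peak : file
Gapped peak calls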

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine without COMPSs:

python process_macs2.py                                \
   --config tests/json/config_macs2.json \
   --in_metadata tests/json/input_macs2.json \
   --out_metadata tests/json/output_macs2.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_macs2.py         \
      --config tests/json/config_macs2_single.json \
      --in_metadata tests/json/input_macs2_metadata.json \
      --out_metadata tests/json/output_macs2.json

The same command with the background, paired-end test files:

runcompss                     \
   --lang=python              \
   --library_path=${HOME}/bin \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug          \
   process_macs2.py         \
      --config tests/json/config_macs2_bgd_paired.json \
      --in_metadata tests/json/input_macs2_bgd_paired_metadata.json \
      --out_metadata tests/json/output_macs2_bgd.json
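The flags used in the examples above can be modelled with a small argument parser. This is only a sketch using the standard argparse module; the function name build_parser is hypothetical and the pipeline's own parser may differ.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the examples above (a sketch)."""
    parser = argparse.ArgumentParser(
        description="Illustrative launcher; not the pipeline's own parser")
    parser.add_argument("--config", required=True,
                        help="Configuration JSON file")
    parser.add_argument("--in_metadata", required=True,
                        help="Location of input JSON metadata for files")
    parser.add_argument("--out_metadata", required=True,
                        help="Location of output JSON metadata for files")
    # --local is a switch: pass it when running without COMPSs.
    parser.add_argument("--local", action="store_true",
                        help="Run on a local machine without COMPSs")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args([
    "--config", "tests/json/config_macs2.json",
    "--in_metadata", "tests/json/input_macs2.json",
    "--out_metadata", "tests/json/output_macs2.json",
    "--local",
])
print(args.config, args.local)
```

Treating --local as a boolean switch means the same argument list works for both run modes, with the flag simply omitted under runcompss.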

Methods

class process_macs2.process_macs2(configuration=None)[source]

Functions for processing ChIP-seq FastQ files. Files are then aligned, filtered and analysed for peak calling.

run(input_files, metadata, output_files)[source]

Main run function for processing ChIP-seq FastQ data. Pipeline aligns the FASTQ files to the genome using BWA. MACS2 is then used for peak calling to identify transcription factor binding sites within the genome.

Currently this can only handle a single data file and a single background file.

Parameters:
  • input_files (dict) –

    Location of the initial input files required by the workflow

    bam : str
    Location of the aligned reads file
    bam_bg : str
    Location of the background aligned FASTQ reads file [OPTIONAL]
  • metadata (dict) –

    Input file meta data associated with their roles

    bam : str

    bam_bg : str
    [OPTIONAL]
  • output_files (dict) –

    Output file locations

    narrow_peak : str
    summits : str
    broad_peak : str
    gapped_peak : str

Returns:

  • output_files (dict) – Output file locations associated with their roles

    narrow_peak : str

    Results files in bed6+4 format

    summits : str

    Results files in bed4+1 format

    broad_peak : str

    Results files in bed6+3 format

    gapped_peak : str

    Results files in bed12+3 format

  • output_metadata (dict) – Output metadata for the associated files in output_files

    narrow_peak : Metadata
    summits : Metadata
    broad_peak : Metadata
    gapped_peak : Metadata
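As a rough illustration of the dictionaries described above, the following sketch assembles input_files and output_files using the documented keys. The helper name build_macs2_files, the prefix argument and the output file names are hypothetical; only the dictionary keys come from the parameter list.

```python
import os


def build_macs2_files(bam, out_dir, bam_bg=None, prefix="sample"):
    """Assemble input_files and output_files dicts with the documented keys.

    The file names under out_dir are placeholders chosen for this example.
    """
    input_files = {"bam": bam}
    if bam_bg is not None:
        # bam_bg is documented as optional, so only add it when supplied.
        input_files["bam_bg"] = bam_bg

    output_files = {
        "narrow_peak": os.path.join(out_dir, prefix + "_peaks.narrowPeak"),
        "summits": os.path.join(out_dir, prefix + "_summits.bed"),
        "broad_peak": os.path.join(out_dir, prefix + "_peaks.broadPeak"),
        "gapped_peak": os.path.join(out_dir, prefix + "_peaks.gappedPeak"),
    }
    return input_files, output_files


inputs, outputs = build_macs2_files("reads.bam", "results", bam_bg="bg.bam")
print(sorted(inputs), sorted(outputs))
```

Leaving bam_bg out of the dictionary when no background is given keeps the optional role explicit instead of passing an empty value.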

MNase-Seq Analysis

This pipeline can process FASTQ files to identify nucleosome binding sites.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bed : file
Bed files with the locations of nucleosome binding sites within the genome

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine without COMPSs:

python process_mnaseseq.py                           \
   --config tests/json/config_mnaseseq.json \
   --in_metadata tests/json/input_mnaseseq.json \
   --out_metadata tests/json/output_mnaseseq.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                                        \
   --lang=python                                 \
   --library_path=${HOME}/bin                    \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug                             \
   process_mnaseseq.py                           \
      --config tests/json/config_mnaseseq.json \
      --in_metadata tests/json/input_mnaseseq.json \
      --out_metadata tests/json/output_mnaseseq.json

Methods

class process_mnaseseq.process_mnaseseq(configuration=None)[source]

Functions for downloading and processing MNase-seq FastQ files. Files are downloaded from the European Nucleotide Archive (ENA), then aligned, filtered and analysed for peak calling.

run(input_files, metadata, output_files)[source]

Main run function for processing MNase-Seq FastQ data. Pipeline aligns the FASTQ files to the genome using BWA. iNPS is then used for peak calling to identify nucleosome position sites within the genome.

Parameters:
  • files_ids (list) – List of file locations
  • metadata (list) – Required meta data
Returns:

outputfiles – List of locations for the output bam, bed and tsv files

Return type:

list

RNA-Seq Analysis

This pipeline can process FASTQ files to quantify the level of expression of cDNAs.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

bed : file
WIG file with the levels of expression for genes

Example

When running the pipeline on a local machine without COMPSs:

python process_rnaseq.py                                       \
   --config tests/json/config_rnaseq.json \
   --in_metadata tests/json/input_rnaseq.json \
   --out_metadata tests/json/output_rnaseq.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                                        \
   --lang=python                                 \
   --library_path=${HOME}/bin                    \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug                             \
   process_rnaseq.py                             \
      --config tests/json/config_rnaseq.json \
      --in_metadata tests/json/input_rnaseq.json \
      --out_metadata tests/json/output_rnaseq.json

Methods

class process_rnaseq.process_rnaseq(configuration=None)[source]

Functions for downloading and processing RNA-seq FastQ files. Files are downloaded from the European Nucleotide Archive (ENA), then they are mapped to quantify the amount of cDNA.

run(input_files, metadata, output_files)[source]

Main run function for processing RNA-Seq FastQ data. Pipeline maps the FASTQ reads to the cDNA sequences using Kallisto. Kallisto is then also used to quantify the levels of expression.

Parameters:
  • input_files (dict) – File locations (genome FASTA, FASTQ_01, FASTQ_02 (for paired ends))
  • metadata (list) – Required meta data
  • output_files (list) – List of output file locations
Returns:

outputfiles – List of locations for the output bam, bed and tsv files

Return type:

list

Returns:

  • outputfiles (dict) – List of locations for the output index files
  • output_metadata (dict) – Metadata about each of the files

TrimGalore

This pipeline can process FASTQ files to trim bases of poor quality and adapter contamination.

Running from the command line

Parameters

config : str
Configuration JSON file
in_metadata : str
Location of input JSON metadata for files
out_metadata : str
Location of output JSON metadata for files

Returns

fastq_trimmed : file
Location of a FASTQ file containing the sequences after trimming of poor quality bases and adapter contamination

A full description of the Trim Galore files can be found at https://github.com/FelixKrueger/TrimGalore

Example

When running the pipeline on a local machine without COMPSs:

python process_trim_galore.py                   \
   --config tests/json/config_trimgalore.json \
   --in_metadata tests/json/input_trimgalore_metadata.json \
   --out_metadata tests/json/output_trimgalore.json \
   --local

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                                  \
   --lang=python                           \
   --library_path=${HOME}/bin              \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug                       \
   process_trim_galore.py                         \
      --config tests/json/config_trimgalore.json \
      --in_metadata tests/json/input_trimgalore_metadata.json \
      --out_metadata tests/json/output_trimgalore.json

Methods

class process_trim_galore.process_trim_galore(configuration=None)[source]

Functions for filtering FASTQ files. Files are filtered for removal of duplicate reads. Low quality reads in qseq file can also be filtered.

run(input_files, metadata, output_files)[source]

This pipeline processes FASTQ files to trim low quality base calls and adapter sequences.

Parameters:
  • input_files (dict) –

    List of strings for the locations of files. These should include:

    fastq : str
    Location for the first FASTQ file for single or paired end reads
  • metadata (dict) –

    Input file meta data associated with their roles
  • output_files (dict) –

    fastq_trimmed : str
Returns:

fastq_trimmed – Locations of the filtered FASTQ files from which trimmings were made

Return type:

str
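To give a feel for what trimming low quality base calls means, here is a deliberately simplified sketch that removes bases from the 3' end of a read while their Phred score is below a threshold. It is not the algorithm used by Trim Galore; the function name, the threshold of 20 and the Phred+33 offset are assumptions made for this illustration.

```python
def trim_low_quality_3prime(sequence, quality, min_phred=20, offset=33):
    """Drop trailing bases whose Phred score is below min_phred.

    sequence and quality are the second and fourth lines of a FASTQ record.
    Quality characters are assumed to be Phred+33 encoded.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")

    end = len(sequence)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
print(trim_low_quality_3prime("ACGTACGT", "IIIII###"))
# → ('ACGTA', 'IIIII')
```

Real trimmers also handle adapter matching and use more robust quality scoring, so this only conveys the basic idea.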

Whole Genome Bisulfite Sequencing Analysis

Hi-C Analysis

This pipeline can process paired end FASTQ files to identify structural interactions that occur so that the genome can fold into the confines of the nucleus.

Running from the command line

Parameters

genome : str
Location of the genomes FASTA file
genome_gem : str
Location of the genome GEM index file
taxon_id : int
Species taxonomic ID
assembly : str
Genomic assembly ID
file1 : str
Location of FASTQ file 1
file2 : str
Location of FASTQ file 2
resolutions : str
Comma separated list of resolutions to calculate the matrix for. [DEFAULT : 1000000,10000000]
enzyme_name : str
Name of the enzyme used to digest the genome (example ‘MboI’)
window_type : str
iter | frag. Analysis windowing type to use
windows1 : str
FASTQ sampling window sizes to use for the first paired end FASTQ file, the default is to use [[1,25], [1,50], [1,75], [1,100]]. This would be represented as 1,25,50,75,100 as input for this variable
windows2 : str
FASTQ sampling window sizes to use for the second paired end FASTQ file, the default is to use [[1,25], [1,50], [1,75], [1,100]]. This would be represented as 1,25,50,75,100 as input for this variable
normalized : int
1 | 0. Determines whether the counts of alignments should be normalized
tag : str
Name for the experiment output files to use
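The comma separated encodings of windows1, windows2 and resolutions described above can be expanded as in the sketch below. The helper names parse_windows and parse_resolutions are hypothetical, and the pipeline may parse these values differently; the sketch only follows the stated convention that the first number is the shared start and each later number is a window end.

```python
def parse_windows(spec):
    """Expand '1,25,50,75,100' into [[1, 25], [1, 50], [1, 75], [1, 100]]."""
    values = [int(value) for value in spec.split(",")]
    if len(values) < 2:
        raise ValueError("need a start and at least one end position")
    start, ends = values[0], values[1:]
    return [[start, end] for end in ends]


def parse_resolutions(spec):
    """Expand '1000000,10000000' into a list of integer resolutions."""
    return [int(value) for value in spec.split(",")]


print(parse_windows("1,25,50,75,100"))
# → [[1, 25], [1, 50], [1, 75], [1, 100]]
print(parse_windows("1,100"))
# → [[1, 100]]
print(parse_resolutions("1000000,10000000"))
# → [1000000, 10000000]
```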

Returns

Adjacency List : file
HDF5 Adjacency Array : file

Example

REQUIREMENT - Needs the indexing step to be run first

When running the pipeline on a local machine:

python process_hic.py                                   \
   --genome /<dataset_dir>/Homo_sapiens.GRCh38.fasta    \
   --genome_gem /<dataset_dir>/Homo_sapiens.GRCh38.gem  \
   --assembly GCA_000001405.25                          \
   --taxon_id 9606                                      \
   --file1 /<dataset_dir>/<file_name>_1.fastq           \
   --file2 /<dataset_dir>/<file_name>_2.fastq           \
   --resolutions 1000000,10000000                       \
   --enzyme_name MboI                                   \
   --windows1 1,100                                     \
   --windows2 1,100                                     \
   --normalized 1                                       \
   --tag Human.SRR1658573                            \
   --window_type frag

When using a local version of the [COMPSs virtual machine](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/):

runcompss                        \
   --lang=python                 \
   --library_path=${HOME}/bin    \
   --pythonpath=/<pyenv_virtenv_dir>/lib/python2.7/site-packages/ \
   --log_level=debug             \
   process_hic.py                \
      --taxon_id 9606            \
      --genome /<dataset_dir>/Human.GCA_000001405.22_gem.fasta \
      --assembly GRCh38          \
      --file1 /<dataset_dir>/Human.SRR1658573_1.fastq \
      --file2 /<dataset_dir>/Human.SRR1658573_2.fastq \
      --genome_gem /<dataset_dir>/Human.GCA_000001405.22_gem.fasta.gem \
      --enzyme_name MboI         \
      --resolutions 10000,100000 \
      --windows1 1,100           \
      --windows2 1,100           \
      --normalized 1             \
      --tag Human.SRR1658573     \
      --window_type frag
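When the Hi-C pipeline is launched from another script, the argument list can be assembled and checked before it is run. The sketch below is an assumption-laden illustration: build_hic_command and HIC_FLAGS are hypothetical names, every documented parameter is treated as required, and only the value constraints stated in the parameter list (iter | frag, 1 | 0) are checked.

```python
# Flags taken from the parameter list for process_hic.py.
HIC_FLAGS = [
    "genome", "genome_gem", "assembly", "taxon_id", "file1", "file2",
    "resolutions", "enzyme_name", "windows1", "windows2", "normalized",
    "tag", "window_type",
]


def build_hic_command(params):
    """Return an argument list for process_hic.py from a dict of parameters."""
    missing = [name for name in HIC_FLAGS if name not in params]
    if missing:
        raise ValueError("Missing parameters: " + ", ".join(missing))
    if params["window_type"] not in ("iter", "frag"):
        raise ValueError("window_type must be 'iter' or 'frag'")
    if str(params["normalized"]) not in ("0", "1"):
        raise ValueError("normalized must be 1 or 0")

    command = ["python", "process_hic.py"]
    for name in HIC_FLAGS:
        # Each parameter becomes a '--name value' pair.
        command += ["--" + name, str(params[name])]
    return command
```

The returned list can be passed to subprocess.run without shell quoting concerns, since each flag and value is a separate element.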

Methods

class process_hic.process_hic(configuration=None)[source]

Functions for processing Hi-C paired end FastQ files to identify structural interactions within the genome.

run(input_files, metadata, output_files)[source]

Main run function for processing Hi-C FastQ data. The paired end FASTQ files are aligned to the genome using the GEM index, and the interactions are then summarised as adjacency matrices at the requested resolutions.

Parameters:
  • files_ids (list) – List of file locations
  • metadata (list) – Required meta data
  • output_files (list) – List of output file locations
Returns:

outputfiles – List of locations for the output bam, bed and tsv files

Return type:

list