Tools for processing FastQ files¶

File Validation¶

Pipelines and functions assessing the quality of input files.

FastQC¶

class tool.validate_fastqc.fastqcTool(configuration=None)[source]¶

Tool for assessing the quality of reads in a FastQ file

run(input_files, input_metadata, output_files)[source]¶

Tool for assessing the quality of reads in a FastQ file

Parameters:

input_files (dict) – fastq : str. List of file locations
metadata (dict) – fastq : dict. Required meta data
output_files (dict) – report : str. Location of the HTML report

Returns:	array – First element is a list of the index files. Second element is a list of the matching metadata
Return type:	list

validate(**kwargs)[source]¶

FastQC Validator

Parameters:	FastQC_file (str) – Location of the FastQ file report_loc (str) – Location of the output report file

TrimGalore¶

class tool.trimgalore.trimgalore(configuration=None)[source]¶

Tool for trimming FASTQ reads that are of low quality

static get_trimgalore_params(params)[source]¶

Function to handle the extraction of command-line parameters

Parameters:	params (dict) –
Returns:
Return type:	list
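As an illustration of what extracting command-line parameters from a dict can look like, here is a minimal sketch. It is not the library's implementation: the function name `params_to_cli` and the dict keys (`tg_quality`, `tg_length`, `tg_paired`, `tg_phred33`) are hypothetical. Only the flags themselves (`--quality`, `--length`, `--paired`, `--phred33`) are Trim Galore command-line options.

```python
def params_to_cli(params, value_flags, switch_flags):
    """Convert a parameter dict into a flat list of command-line arguments."""
    command = []
    for key in sorted(params):  # sorted for a reproducible argument order
        value = params[key]
        if key in value_flags:
            # Option that takes a value, e.g. ["--length", "20"]
            command.extend([value_flags[key], str(value)])
        elif key in switch_flags and value:
            # Boolean switch, added only when the value is truthy
            command.append(switch_flags[key])
        # Keys that are not recognised are ignored
    return command


# Hypothetical dict keys mapped onto Trim Galore command-line options.
VALUE_FLAGS = {"tg_quality": "--quality", "tg_length": "--length"}
SWITCH_FLAGS = {"tg_paired": "--paired", "tg_phred33": "--phred33"}

example_args = params_to_cli(
    {"tg_length": 20, "tg_quality": 25, "tg_paired": True, "other": "x"},
    VALUE_FLAGS,
    SWITCH_FLAGS,
)
print(example_args)
```

Returning a list rather than a single string means the arguments can be passed straight to `subprocess` without shell quoting.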

run(input_files, input_metadata, output_files)[source]¶

The main function to run TrimGalore to remove low quality and very short reads. TrimGalore uses CutAdapt and FASTQC for the analysis.

Parameters:

input_files (dict) –

fastq1 : string

Location of the FASTQ file

fastq2 : string

[OPTIONAL] Location of the paired end FASTQ file
metadata (dict) – Matching metadata for the input FASTQ files

Returns:

output_files (dict) –

fastq1_trimmed : str

Location of the trimmed FASTQ file

fastq2_trimmed : str

[OPTIONAL] Location of a trimmed paired end FASTQ file
output_metadata (dict) – Matching metadata for the output files
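The optional `fastq2` entry is what separates a single-end run from a paired-end one. The sketch below shows one way a caller might inspect the `input_files` dict and work out which output keys to expect. The helper `plan_trimming` is hypothetical and not part of the documented class.

```python
def plan_trimming(input_files):
    """Return the trimming mode and the output keys expected for it."""
    if not input_files.get("fastq1"):
        raise ValueError("input_files must contain a 'fastq1' location")

    # A second FASTQ file is optional; its presence implies paired-end data.
    paired = bool(input_files.get("fastq2"))

    output_keys = ["fastq1_trimmed"]
    if paired:
        output_keys.append("fastq2_trimmed")

    mode = "paired" if paired else "single"
    return mode, output_keys


single_plan = plan_trimming({"fastq1": "reads.fastq"})
paired_plan = plan_trimming({"fastq1": "reads_1.fastq", "fastq2": "reads_2.fastq"})
```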

trimgalore_paired(**kwargs)[source]¶

Trims and removes low quality subsections and reads from paired-end FASTQ files

Parameters:

fastq_file_in (str) – Location of the input fastq file
fastq_file_out (str) – Location of the output fastq file
params (dict) – Parameters to use in TrimGalore

Returns:	Indicator of the success of the function
Return type:	bool

trimgalore_single(**kwargs)[source]¶

Trims and removes low quality subsections and reads from a single-ended FASTQ file

Parameters:

fastq_file_in (str) – Location of the input fastq file
fastq_file_out (str) – Location of the output fastq file
params (dict) – Parameters to use in TrimGalore

Returns:	Indicator of the success of the function
Return type:	bool

trimgalore_version(**kwargs)[source]¶

Trims and removes low quality subsections and reads from a single-ended FASTQ file

Parameters:

fastq_file_in (str) – Location of the input fastq file
fastq_file_out (str) – Location of the output fastq file
params (dict) – Parameters to use in TrimGalore

Returns:	Indicator of the success of the function
Return type:	bool

Indexers¶

Bowtie 2¶

class tool.bowtie_indexer.bowtieIndexerTool(configuration=None)[source]¶

Tool for running indexers over a genome FASTA file

bowtie2_indexer(**kwargs)[source]¶

Bowtie2 Indexer

Parameters:	file_loc (str) – Location of the genome assembly FASTA file idx_loc (str) – Location of the output index file

run(input_files, input_metadata, output_files)[source]¶

Tool for generating assembly aligner index files for use with the Bowtie 2 aligner

Parameters:	input_files (list) – List with a single str element with the location of the genome assembly FASTA file metadata (list) –
Returns:	array – First element is a list of the index files. Second element is a list of the matching metadata
Return type:	list

BSgenome Index¶

class tool.forge_bsgenome.bsgenomeTool(configuration=None)[source]¶

Tool for building BSgenome index files

bsgenome_creater(**kwargs)[source]¶

Make BSgenome index files. Uses an R script that wraps the required code.

Parameters:

genome (str) –
circo_chrom (str) – Comma separated list of chromosome ids that are circular in the genome
seed_file_param (dict) – Parameters required for the function to build the seed file
genome_2bit (str) –
chrom_size (str) –
seed_file (str) –
bsgenome (str) –

static genome_to_2bit(genome, genome_2bit)[source]¶

Generate the 2bit genome file from a FASTA file

Parameters:

genome (str) – Location of the FASTA genome file
genome_2bit (str) – Location of the 2bit genome file

Returns:	True if successful, False if not.
Return type:	bool

static get_chrom_size(genome_2bit, chrom_size, circ_chrom)[source]¶

Generate the chrom.size file and identify the available chromosomes in the 2bit file.

Parameters:

genome_2bit (str) – Location of the 2bit genome file
chrom_size (str) – Location to save the chrom.size file to
circ_chrom (list) – List of chromosomes that are known to be circular

Returns:

If successful 2 lists – [0] : List of the linear chromosomes in the 2bit file [1] : List of circular chromosomes in the 2bit file
Returns (False, False) if there is an IOError
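The split into linear and circular chromosomes can be sketched as follows. This assumes the chrom.size file holds one tab-separated `name<TAB>length` pair per line, which is the usual UCSC layout. The helper names `parse_chrom_sizes` and `split_by_topology` are illustrative only and are not the library's own code.

```python
def parse_chrom_sizes(text):
    """Parse chrom.size content into a list of (name, length) tuples."""
    sizes = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        sizes.append((name, int(length)))
    return sizes


def split_by_topology(sizes, circ_chrom):
    """Return (linear, circular) chromosome name lists, preserving file order."""
    circular_ids = set(circ_chrom)
    linear = [name for name, _ in sizes if name not in circular_ids]
    circular = [name for name, _ in sizes if name in circular_ids]
    return linear, circular


example_sizes = parse_chrom_sizes("chr1\t1000\nchr2\t800\nchrM\t16\n")
example_split = split_by_topology(example_sizes, ["chrM", "chrPlasmid"])
```

Names listed as circular but absent from the file (here `chrPlasmid`) are simply not reported, so the second list only contains chromosomes that are actually present.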

run(input_files, input_metadata, output_files)[source]¶

The main function to build the BSgenome index files for a given genome.

Parameters:

input_files (list) – List of input file locations, including the genome FASTA file
metadata (dict) –

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

BS-Seeker2 Indexer¶

class tool.bs_seeker_indexer.bssIndexerTool(configuration=None)[source]¶

Script from BS-Seeker2 for building the index for alignment. In this case it uses Bowtie2.

bss_build_index(**kwargs)[source]¶

Function to submit the FASTA file for the reference sequence and build the required index file used by the aligner.

Parameters:

fasta_file (str) – Location of the genome FASTA file
aligner (str) – Aligner to use by BS-Seeker2. Currently only bowtie2 is available in this build
aligner_path (str) – Location of the aligner's binary file
bss_path – Location of the BS-Seeker2 libraries
idx_out (str) – Location of the output compressed index file

Returns:	bam_out – Location of the output bam alignment file
Return type:	str

static get_bss_index_params(params)[source]¶

Function to handle the extraction of command-line parameters and format them for use in the BS-Seeker2 indexer

Parameters:	params (dict) –
Returns:
Return type:	list

run(input_files, input_metadata, output_files)[source]¶

Tool for indexing the genome assembly using BS-Seeker2. In this case it is using Bowtie2

Parameters:	input_files (list) – FASTQ file metadata (list) –
Returns:	array – Location of the filtered FASTQ file
Return type:	list

BWA¶

class tool.bwa_indexer.bwaIndexerTool(configuration=None)[source]¶

Tool for running indexers over a genome FASTA file

bwa_indexer(**kwargs)[source]¶

BWA Indexer

Parameters:

file_loc (str) – Location of the genome assembly FASTA file
idx_out (str) – Location of the output index file

Returns:
Return type:	bool

run(input_files, input_metadata, output_files)[source]¶

Function to run the BWA indexer over a genome assembly FASTA file to generate the matching index for use with the aligner

Parameters:

input_files (dict) – List containing the location of the genome assembly FASTA file
meta_data (dict) –
output_files (dict) – List of output files generated

Returns:

output_files (dict) –

index : str

Location of the index file defined in the input parameters
output_metadata (dict) –

index : Metadata

Metadata relating to the index file

GEM¶

class tool.gem_indexer.gemIndexerTool(configuration=None)[source]¶

Tool for running indexers over a genome FASTA file

gem_indexer(**kwargs)[source]¶

GEM Indexer

Parameters:	genome_file (str) – Location of the genome assembly FASTA file idx_loc (str) – Location of the output index file

run(input_files, input_metadata, output_files)[source]¶

Tool for generating assembly aligner index files for use with the GEM indexer

Parameters:	input_files (list) – List with a single str element with the location of the genome assembly FASTA file input_metadata (list) –
Returns:	array – First element is a list of the index files. Second element is a list of the matching metadata
Return type:	list

Kallisto¶

class tool.kallisto_indexer.kallistoIndexerTool(configuration=None)[source]¶

Tool for running indexers over a genome FASTA file

kallisto_indexer(**kwargs)[source]¶

Kallisto Indexer

Parameters:	file_loc (str) – Location of the cDNA FASTA file for a genome idx_loc (str) – Location of the output index file

run(input_files, input_metadata, output_files)[source]¶

Tool for generating assembly aligner index files for use with Kallisto

Parameters:

input_files (list) – FASTA file location with all the cDNA sequences for a given genome
input_metadata (list) –

Returns:	array – First element is a list of the index files. Second element is a list of the matching metadata
Return type:	list

Aligners¶

Bowtie2¶

class tool.bowtie_aligner.bowtie2AlignerTool(configuration=None)[source]¶

Tool for aligning sequence reads to a genome using Bowtie2

bowtie2_aligner_paired(**kwargs)[source]¶

Bowtie2 Aligner - Paired End

Parameters:

genome_file_loc (str) – Location of the genomic fasta
read_file_loc1 (str) – Location of the FASTQ file
read_file_loc2 (str) – Location of the FASTQ file
bam_loc (str) – Location of the output aligned bam file
bt2_1_file (str) – Location of the <genome>.1.bt2 index file
bt2_2_file (str) – Location of the <genome>.2.bt2 index file
bt2_3_file (str) – Location of the <genome>.3.bt2 index file
bt2_4_file (str) – Location of the <genome>.4.bt2 index file
bt2_rev1_file (str) – Location of the <genome>.rev.1.bt2 index file
bt2_rev2_file (str) – Location of the <genome>.rev.2.bt2 index file
aln_params (dict) – Alignment parameters

Returns:	bam_loc – Location of the output file
Return type:	str

bowtie2_aligner_single(**kwargs)[source]¶

Bowtie2 Aligner - Single End

Parameters:

genome_file_loc (str) – Location of the genomic fasta
read_file_loc1 (str) – Location of the FASTQ file
bam_loc (str) – Location of the output aligned bam file
bt2_1_file (str) – Location of the <genome>.1.bt2 index file
bt2_2_file (str) – Location of the <genome>.2.bt2 index file
bt2_3_file (str) – Location of the <genome>.3.bt2 index file
bt2_4_file (str) – Location of the <genome>.4.bt2 index file
bt2_rev1_file (str) – Location of the <genome>.rev.1.bt2 index file
bt2_rev2_file (str) – Location of the <genome>.rev.2.bt2 index file
aln_params (dict) – Alignment parameters

Returns:	bam_loc – Location of the output file
Return type:	str

static get_aln_params(params, paired=False)[source]¶

Function to handle the extraction of command-line parameters and format them for use in the Bowtie2 aligner

Parameters:	params (dict) – paired (bool) – Indicate if the parameters are paired-end specific. [DEFAULT=False]
Returns:
Return type:	list

run(input_files, input_metadata, output_files)[source]¶

The main function to align FASTQ reads to a genome using Bowtie2

Parameters:

input_files (dict) – File 0 is the genome file location, file 1 is the FASTQ file
metadata (dict) –
output_files (dict) –

Returns:

output_files (dict) – First element is a list of output_bam_files, second element is the matching meta data
output_metadata (dict)

untar_index(**kwargs)[source]¶

Extracts the Bowtie2 index files from the genome index tar file.

Parameters:

genome_file_name (str) – Location string of the genome fasta file
genome_idx (str) – Location of the Bowtie2 index file
bt2_1_file (str) – Location of the <genome>.1.bt2 index file
bt2_2_file (str) – Location of the <genome>.2.bt2 index file
bt2_3_file (str) – Location of the <genome>.3.bt2 index file
bt2_4_file (str) – Location of the <genome>.4.bt2 index file
bt2_rev1_file (str) – Location of the <genome>.rev.1.bt2 index file
bt2_rev2_file (str) – Location of the <genome>.rev.2.bt2 index file

Returns:	Boolean indicating if the task was successful
Return type:	bool
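A generic way to pull index files out of a tar archive uses only the Python standard library, as in the sketch below. The function `extract_index_files` and its `suffix` parameter are hypothetical, and the documented method may lay out its files differently. The sketch only illustrates the idea of unpacking the `.bt2` members of an archive into a working directory.

```python
import os
import shutil
import tarfile


def extract_index_files(tar_path, dest_dir, suffix=".bt2"):
    """Extract the members of a tar archive whose names end with `suffix`."""
    extracted = []
    with tarfile.open(tar_path, "r:*") as archive:  # "r:*" detects compression
        for member in archive.getmembers():
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            # Write under dest_dir using only the base name, so that a
            # member path cannot escape the destination directory.
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with archive.extractfile(member) as source, open(target, "wb") as handle:
                shutil.copyfileobj(source, handle)
            extracted.append(target)
    return sorted(extracted)
```

Flattening member paths to their base names is a deliberate choice here: it keeps the extracted files in one directory and guards against archive entries with `..` in their paths.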

BWA - ALN¶

class tool.bwa_aligner.bwaAlignerTool(configuration=None)[source]¶

Tool for aligning sequence reads to a genome using BWA

bwa_aligner_paired(**kwargs)[source]¶

BWA ALN Aligner - Paired End

Parameters:

genome_file_loc (str) – Location of the genomic fasta
read_file_loc1 (str) – Location of the FASTQ file
read_file_loc2 (str) – Location of the FASTQ file
bam_loc (str) – Location of the output aligned bam file
amb_file (str) – Location of the amb index file
ann_file (str) – Location of the ann index file
bwt_file (str) – Location of the bwt index file
pac_file (str) – Location of the pac index file
sa_file (str) – Location of the sa index file
aln_params (dict) – Alignment parameters

Returns:	bam_loc – Location of the output file
Return type:	str

bwa_aligner_single(**kwargs)[source]¶

BWA ALN Aligner - Single Ended

Parameters:

genome_file_loc (str) – Location of the genomic fasta
read_file_loc (str) – Location of the FASTQ file
bam_loc (str) – Location of the output aligned bam file
amb_file (str) – Location of the amb index file
ann_file (str) – Location of the ann index file
bwt_file (str) – Location of the bwt index file
pac_file (str) – Location of the pac index file
sa_file (str) – Location of the sa index file
aln_params (dict) – Alignment parameters

Returns:	bam_loc – Location of the output file
Return type:	str

static get_aln_params(params)[source]¶

Function to handle the extraction of command-line parameters and format them for use in the BWA ALN aligner

Parameters:	params (dict) –
Returns:
Return type:	list

run(input_files, input_metadata, output_files)[source]¶

The main function to align FASTQ reads to a genome using BWA

Parameters:

input_files (dict) – File 0 is the genome file location, file 1 is the FASTQ file
metadata (dict) –
output_files (dict) –

Returns:

output_files (dict) – First element is a list of output_bam_files, second element is the matching meta data
output_metadata (dict)

untar_index(**kwargs)[source]¶

Extracts the BWA index files from the genome index tar file.

Parameters:

genome_file_name (str) – Location string of the genome fasta file
genome_idx (str) – Location of the BWA index file
amb_file (str) – Location of the amb index file
ann_file (str) – Location of the ann index file
bwt_file (str) – Location of the bwt index file
pac_file (str) – Location of the pac index file
sa_file (str) – Location of the sa index file

Returns:	Boolean indicating if the task was successful
Return type:	bool

BWA - MEM¶

class tool.bwa_mem_aligner.bwaAlignerMEMTool(configuration=None)[source]¶

Tool for aligning sequence reads to a genome using BWA

bwa_aligner_paired(**kwargs)[source]¶

BWA MEM Aligner - Paired End

Parameters:

genome_file_loc (str) – Location of the genomic fasta
read_file_loc1 (str) – Location of the FASTQ file
read_file_loc2 (str) – Location of the FASTQ file
bam_loc (str) – Location of the output aligned bam file
amb_file (str) – Location of the amb index file
ann_file (str) – Location of the ann index file
bwt_file (str) – Location of the bwt index file
pac_file (str) – Location of the pac index file
sa_file (str) – Location of the sa index file
mem_params (dict) – Alignment parameters

Returns:	bam_loc – Location of the output file
Return type:	str

bwa_aligner_single(**kwargs)[source]¶

BWA MEM Aligner - Single Ended

Parameters:

genome_file_loc (str) – Location of the genomic fasta
read_file_loc (str) – Location of the FASTQ file
bam_loc (str) – Location of the output aligned bam file
amb_file (str) – Location of the amb index file
ann_file (str) – Location of the ann index file
bwt_file (str) – Location of the bwt index file
pac_file (str) – Location of the pac index file
sa_file (str) – Location of the sa index file
mem_params (dict) – Alignment parameters

Returns:	bam_loc – Location of the output file
Return type:	str

static get_mem_params(params)[source]¶

Function to handle the extraction of command-line parameters and format them for use in the BWA MEM aligner

Parameters:	params (dict) –
Returns:
Return type:	list

run(input_files, input_metadata, output_files)[source]¶

The main function to align FASTQ reads to a genome using BWA MEM

Parameters:

input_files (dict) – File 0 is the genome file location, file 1 is the FASTQ file
metadata (dict) –
output_files (dict) –

Returns:

output_files (dict) – First element is a list of output_bam_files, second element is the matching meta data
output_metadata (dict)

untar_index(**kwargs)[source]¶

Extracts the BWA index files from the genome index tar file.

Parameters:

genome_file_name (str) – Location string of the genome fasta file
genome_idx (str) – Location of the BWA index file
amb_file (str) – Location of the amb index file
ann_file (str) – Location of the ann index file
bwt_file (str) – Location of the bwt index file
pac_file (str) – Location of the pac index file
sa_file (str) – Location of the sa index file

Returns:	Boolean indicating if the task was successful
Return type:	bool

BS-Seeker2 Aligner¶

class tool.bs_seeker_aligner.bssAlignerTool(configuration=None)[source]¶

Script from BS-Seeker2 for aligning reads to the genome. In this case it uses Bowtie2.

bs_seeker_aligner(**kwargs)[source]¶

Alignment of the paired ends to the reference genome

Generates bam files for the alignments

This is performed by running the external program rather than reimplementing the code from the main function to make it easier when it comes to updating the changes in BS-Seeker2

Parameters:

input_fastq1 (str) – Location of paired end FASTQ file 1
input_fastq2 (str) – Location of paired end FASTQ file 2
aligner (str) – Aligner to use
aligner_path (str) – Location of the aligner
genome_fasta (str) – Location of the genome FASTA file
genome_idx (str) – Location of the tar.gz genome index file
bam_out (str) – Location of the aligned bam file

Returns:	bam_out – Location of the BAM file generated during the alignment.
Return type:	file

bs_seeker_aligner_single(**kwargs)[source]¶

Alignment of single-end reads to the reference genome

Generates bam files for the alignments

This is performed by running the external program rather than reimplementing the code from the main function to make it easier when it comes to updating the changes in BS-Seeker2

Parameters:

input_fastq1 (str) – Location of paired end FASTQ file 1
input_fastq2 (str) – Location of paired end FASTQ file 2
aligner (str) – Aligner to use
aligner_path (str) – Location of the aligner
genome_fasta (str) – Location of the genome FASTA file
genome_idx (str) – Location of the tar.gz genome index file
bam_out (str) – Location of the aligned bam file

Returns:	bam_out – Location of the BAM file generated during the alignment.
Return type:	file

static get_aln_params(params, paired=False)[source]¶

Function to handle the extraction of command-line parameters and format them for use in the Bowtie2 aligner

Parameters:	params (dict) – paired (bool) – Indicate if the parameters are paired-end specific. [DEFAULT=False]
Returns:
Return type:	list

run(input_files, input_metadata, output_files)[source]¶

Tool for aligning FASTQ reads to the genome assembly using BS-Seeker2. In this case it is using Bowtie2

Parameters:

input_files (list) – FASTQ file
output_files (list) – Results files.
metadata (list) –

Returns:	array – Location of the filtered FASTQ file
Return type:	list

run_aligner(genome_idx, bam_out, script, params)[source]¶

Run the aligner

Parameters:

genome_idx (str) – Location of the genome index archive
bam_out (str) – Location of the output bam file
script (str) – Location of the BS Seeker2 aligner script
params (list) – Parameter list for the aligner

Returns:	True if the function completed successfully
Return type:	bool

Filters¶

BioBamBam Filter¶

class tool.biobambam_filter.biobambam(configuration=None)[source]¶

Tool to sort and filter bam files

biobambam_filter_alignments(**kwargs)[source]¶

Sorts and filters the bam file.

It is important that all duplicate alignments have been removed. This can be run as an intermediate step, but should always be run as a check to ensure that the files are sorted and duplicates have been removed.

Parameters:

bam_file_in (str) – Location of the input bam file
bam_file_out (str) – Location of the output bam file
tmp_dir (str) – Tmp location for intermediate files during the sorting

Returns:	bam_file_out – Location of the output bam file
Return type:	str

run(input_files, input_metadata, output_files)[source]¶

The main function to run the BioBamBam filter to remove duplicates and spurious reads from the bam file before analysis.

Parameters:

input_files (dict) – List of input bam file locations where 0 is the bam data file
metadata (dict) – Matching meta data for the input files
output_files (dict) – List of output file locations

Returns:

output_files (dict) – Filtered bam file.
output_metadata (dict) – List of matching metadata dict objects

BS-Seeker2 Filter¶

class tool.bs_seeker_filter.filterReadsTool(configuration=None)[source]¶

Script from BS-Seeker2 for filtering FASTQ files to remove repeats

bss_seeker_filter(**kwargs)[source]¶

This is optional, but removes reads that can be problematic for the alignment of whole genome datasets.

If performing RRBS then this step can be skipped

This is a function that is installed as part of the BS-Seeker installation process.

Parameters:	infile (str) – Location of the FASTQ file
Returns:	outfile – Location of the filtered FASTQ file
Return type:	str

run(input_files, input_metadata, output_files)[source]¶

Tool for filtering duplicate entries from FASTQ files using BS-Seeker2

Parameters:	input_files (list) – FASTQ file input_metadata (list) –
Returns:	array – Location of the filtered FASTQ file
Return type:	list

Peak Calling¶

BS-Seeker2 Methylation Caller¶

iDEAR¶

class tool.idear.idearTool(configuration=None)[source]¶

Tool for peak calling for iDamID-seq data

idear_peak_calling(**kwargs)[source]¶

Make iDamID-seq peak calls. These are saved as bed files that can then be displayed on genome browsers. Uses an R script that wraps the iDEAR protocol.

Parameters:

sample_name (str) –
bg_name (str) –
sample_bam_tar_file (str) – Location of the aligned sequences in bam format
bg_bam_tar_file (str) – Location of the aligned background sequences in bam format
species (str) – Species name for the alignments
assembly (str) – Assembly used for the aligned sequences
peak_bed (str) – Location of the peak bed file

Returns:	peak_bed – Location of the collated bed file
Return type:	str

run(input_files, input_metadata, output_files)[source]¶

The main function to run iDEAR for peak calling over a given BAM file and matching background BAM file.

Parameters:

input_files (list) – List of input bam file locations where 0 is the bam data file and 1 is the matching background bam file
metadata (dict) –

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

iNPS¶

class tool.inps.inps(configuration=None)[source]¶

Tool for peak calling for MNase-seq data

inps_peak_calling(**kwargs)[source]¶

Convert Bam to Bed then make nucleosome peak calls. These are saved as bed files that can then be displayed on genome browsers.

Parameters:	bam_file (str) – Location of the aligned sequences in bam format peak_bed (str) – Location of the collated bed file of nucleosome peak calls
Returns:	peak_bed – Location of the collated bed file of nucleosome peak calls
Return type:	str

run(input_files, input_metadata, output_files)[source]¶

The main function to run iNPS for peak calling over a given BAM file and matching background BAM file.

Parameters:

input_files (list) – List of input bam file locations where 0 is the bam data file and 1 is the matching background bam file
metadata (dict) –

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

Kallisto Quantification¶

class tool.kallisto_quant.kallistoQuantificationTool(configuration=None)[source]¶

Tool for quantifying RNA-seq alignments to calculate expression levels of genes within a genome.

kallisto_quant_paired(**kwargs)[source]¶

Kallisto quantifier for paired end RNA-seq data

Parameters:

idx_loc (str) – Location of the output index file
fastq_file_loc_01 (str) – Location of the FASTQ sequence file
fastq_file_loc_02 (str) – Location of the paired FASTQ sequence file

Returns:	wig_file_loc – Location of the wig file containing the levels of expression
Return type:	loc

kallisto_quant_single(**kwargs)[source]¶

Kallisto quantifier for single end RNA-seq data

Parameters:	idx_loc (str) – Location of the output index file fastq_file_loc (str) – Location of the FASTQ sequence file
Returns:	wig_file_loc – Location of the wig file containing the levels of expression
Return type:	loc

kallisto_tsv2bed(**kwargs)[source]¶

So that the TSV file can be viewed within the genome browser it is handy to convert the file to a BigBed file

kallisto_tsv2gff(**kwargs)[source]¶

So that the TSV file can be viewed within the genome browser it is handy to convert the file to a GFF file

static load_gff_ensembl(gff_file)[source]¶

Function to extract all of the genes and their locations from a GFF file generated by Ensembl

static load_gff_ucsc(gff_file)[source]¶

Function to extract all of the genes and their locations from a GFF file generated by UCSC
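Extracting gene locations from a GFF file can be sketched as below. The sketch assumes GFF3-style lines with nine tab-separated columns, a feature type of `gene`, and an `ID=` entry in the attributes column. The function name `load_gene_locations` and the returned dict layout are hypothetical rather than the library's own. GFF coordinates are 1-based and inclusive, so the start is shifted by one to give the 0-based, half-open coordinates that BED files use.

```python
def load_gene_locations(gff_lines):
    """Map gene ID -> location dict for every 'gene' feature in GFF3 lines."""
    genes = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        gene_id = attributes.get("ID")
        if gene_id is None:
            continue
        genes[gene_id] = {
            "chromosome": fields[0],
            "start": int(fields[3]) - 1,  # 1-based GFF start -> 0-based BED start
            "end": int(fields[4]),        # inclusive GFF end == exclusive BED end
            "strand": fields[6],
        }
    return genes


example_genes = load_gene_locations([
    "##gff-version 3",
    "1\tensembl\tgene\t1000\t2000\t.\t+\t.\tID=gene:G1;Name=abc",
    "1\tensembl\tmRNA\t1000\t2000\t.\t+\t.\tID=transcript:T1;Parent=gene:G1",
])
```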

run(input_files, input_metadata, output_files)[source]¶

Tool for calculating the level of expression

Parameters:

input_files (list) – Kallisto index file for the FASTQ file for the experimental alignments
input_metadata (list) –

Returns:	array – First element is a list of the index files. Second element is a list of the matching metadata
Return type:	list

static seq_read_stats(file_in)[source]¶

Calculate the mean and standard deviation of the read lengths in a FASTQ file

Parameters:	file_in (str) – Location of a FASTQ file
Returns:	mean : Mean length of sequenced strands std : Standard deviation of lengths of sequenced strands
Return type:	dict
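The calculation can be sketched as follows, assuming the common four-line FASTQ record layout (header, sequence, `+`, qualities) and a population standard deviation. The source does not say whether a population or sample standard deviation is used, so that choice is an assumption, as is the function name `read_length_stats`.

```python
import math


def read_length_stats(fastq_lines):
    """Return the mean and standard deviation of read lengths."""
    # In a four-line FASTQ record the sequence is the second line.
    lengths = [
        len(line.strip())
        for index, line in enumerate(fastq_lines)
        if index % 4 == 1
    ]
    if not lengths:
        return {"mean": 0.0, "std": 0.0}

    mean = sum(lengths) / len(lengths)
    # Population variance: divide by N rather than N - 1.
    variance = sum((length - mean) ** 2 for length in lengths) / len(lengths)
    return {"mean": mean, "std": math.sqrt(variance)}


example_stats = read_length_stats([
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGTAC", "+", "IIIIII",
])
```

Passing an open file handle works as well as a list, since the function only iterates over lines.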

MACS2¶

class tool.macs2.macs2(configuration=None)[source]¶

Tool for peak calling for ChIP-seq data

static get_macs2_params(params)[source]¶

Function to handle the extraction of command-line parameters and format them for use in MACS2

Parameters:	params (dict) –
Returns:
Return type:	list

macs2_peak_calling(**kwargs)[source]¶

Function to run MACS2 for peak calling on aligned sequence files and normalised against a provided background set of alignments.

Parameters:

name (str) – Name to be used to identify the files
bam_file (str) – Location of the aligned FASTQ files as a bam file
bai_file (str) – Location of the bam index file
bam_file_bgd (str) – Location of the aligned FASTQ files as a bam file representing background values for the cell
bai_file_bgd (str) – Location of the background bam index file
narrowpeak (str) – Location of the output narrowpeak file
summits_bed (str) – Location of the output summits bed file
broadpeak (str) – Location of the output broadpeak file
gappedpeak (str) – Location of the output gappedpeak file
chromosome (str) – If the tool is to be run over a single chromosome the matching chromosome name should be specified. If None then the whole bam file is analysed

Returns:

narrowPeak (file) – BED6+4 file - ideal for transcription factor binding site identification
summitPeak (file) – BED4+1 file - Contains the peak summit locations for every peak
broadPeak (file) – BED6+3 file - ideal for histone binding site identification
gappedPeak (file) – BED12+3 file - Contains a merged set of the broad and narrow peak files
Definitions for each of these files come from the MACS2 documentation at https://github.com/taoliu/MACS
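To make the BED6+4 layout concrete, the sketch below parses one narrowPeak line into named fields. The ten columns follow the standard narrowPeak definition: the six BED columns plus signal value, p-value, q-value and summit offset. The `NarrowPeak` tuple and `parse_narrowpeak_line` are illustrative names and not part of this tool.

```python
from collections import namedtuple

NarrowPeak = namedtuple(
    "NarrowPeak",
    [
        "chrom", "start", "end", "name", "score", "strand",  # BED6
        "signal_value", "p_value", "q_value", "summit_offset",  # +4
    ],
)


def parse_narrowpeak_line(line):
    """Parse a single tab-separated narrowPeak (BED6+4) line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("narrowPeak lines have exactly 10 columns")
    return NarrowPeak(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        score=int(fields[4]),
        strand=fields[5],
        signal_value=float(fields[6]),
        p_value=float(fields[7]),   # stored as -log10(p-value)
        q_value=float(fields[8]),   # stored as -log10(q-value)
        summit_offset=int(fields[9]),  # offset of the summit from `start`
    )


example_peak = parse_narrowpeak_line(
    "chr1\t100\t250\tpeak_1\t87\t.\t5.5\t12.3\t9.1\t60\n"
)
```

The absolute summit position is `start + summit_offset`, which is what the summits bed file reports for each peak.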

macs2_peak_calling_nobgd(**kwargs)[source]¶

Function to run MACS2 for peak calling on aligned sequence files without a background dataset for normalisation.

Parameters:

name (str) – Name to be used to identify the files
bam_file (str) – Location of the aligned FASTQ files as a bam file
bai_file (str) – Location of the bam index file
narrowpeak (str) – Location of the output narrowpeak file
summits_bed (str) – Location of the output summits bed file
broadpeak (str) – Location of the output broadpeak file
gappedpeak (str) – Location of the output gappedpeak file
chromosome (str) – If the tool is to be run over a single chromosome the matching chromosome name should be specified. If None then the whole bam file is analysed

Returns:

narrowPeak (file) – BED6+4 file - ideal for transcription factor binding site identification
summitPeak (file) – BED4+1 file - Contains the peak summit locations for every peak
broadPeak (file) – BED6+3 file - ideal for histone binding site identification
gappedPeak (file) – BED12+3 file - Contains a merged set of the broad and narrow peak files
Definitions for each of these files come from the MACS2 documentation at https://github.com/taoliu/MACS

run(input_files, input_metadata, output_files)[source]¶

The main function to run MACS 2 for peak calling over a given BAM file and matching background BAM file.

Parameters:

input_files (dict) – List of input bam file locations where 0 is the bam data file and 1 is the matching background bam file
metadata (dict) –

Returns:

output_files (dict) – List of locations for the output files.
output_metadata (dict) – List of matching metadata dict objects

Hi-C Parsing¶

The following tools are split out from the Hi-C pipelines, which were built to use the TADbit library.

FASTQ mapping¶

class tool.tb_full_mapping.tbFullMappingTool[source]¶

Tool for mapping fastq paired end files to the GEM index files

run(input_files, input_metadata, output_files)[source]¶

The main function to map the FASTQ files to the GEM file over different window sizes ready for alignment

Parameters:

input_files (list) –

gem_file : str

Location of the genome GEM index file

fastq_file_bgd : str

Location of the FASTQ file
metadata (dict) –

windows : list

List of lists with the window sizes to be computed

enzyme_name : str

Restriction enzyme used [OPTIONAL]

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

tb_full_mapping_frag(**kwargs)[source]¶

Function to map the FASTQ files to the GEM file based on fragments derived from the restriction enzyme that was used.

Parameters:

gem_file (str) – Location of the genome GEM index file
fastq_file_bgd (str) – Location of the FASTQ file
enzyme_name (str) – Restriction enzyme name (MboI)
windows (list) – List of lists with the window sizes to be computed
window_file (str) – Location of the first window index file

Returns:

window_file (str) – Location of the window index file

tb_full_mapping_iter(**kwargs)[source]¶

Function to map the FASTQ files to the GEM file over different window sizes ready for alignment

Parameters:

gem_file (str) – Location of the genome GEM index file
fastq_file_bgd (str) – Location of the FASTQ file
windows (list) – List of lists with the window sizes to be computed
window1 (str) – Location of the first window index file
window2 (str) – Location of the second window index file
window3 (str) – Location of the third window index file
window4 (str) – Location of the fourth window index file

Returns:

window1 (str) – Location of the first window index file
window2 (str) – Location of the second window index file
window3 (str) – Location of the third window index file
window4 (str) – Location of the fourth window index file

Map Parsing¶

class tool.tb_parse_mapping.tbParseMappingTool[source]¶

Tool for parsing the mapped reads and generating the list of paired ends that have a match at both ends.

run(input_files, input_metadata, output_files)[source]¶

The main function to map the aligned reads and return the matching pairs. Parsing of the mappings can be either iterative or fragment based. If it is iterative, the locations of 4 output file windows for each end of the paired-end reads need to be provided. If it is fragment based, only 2 window locations need to be provided, along with an enzyme name.

Parameters:

input_files (list) –

genome_file : str

Location of the genome FASTA file

window1_1 : str

Location of the first window index file

window1_2 : str

Location of the second window index file

window1_3 : str

[OPTIONAL] Location of the third window index file

window1_4 : str

[OPTIONAL] Location of the fourth window index file

window2_1 : str

Location of the first window index file

window2_2 : str

Location of the second window index file

window2_3 : str

[OPTIONAL] Location of the third window index file

window2_4 : str

[OPTIONAL] Location of the fourth window index file
metadata (dict) –

windows : list

List of lists with the window sizes to be computed

enzyme_name : str

Restriction enzyme name

mapping : list

The mapping function used. The options are iter or frag.

Returns:

output_files (list) – List of locations for the output files.
output_metadata (dict) – Dict of matching metadata dict objects

Example

Iterative:

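from tool import tb_parse_mapping

genome_file = 'genome.fasta'

# Split the paths into components; they are re-joined with '/' below
root_name_1 = "/tmp/data/expt_source_1".split("/")
root_name_2 = "/tmp/data/expt_source_2".split("/")
windows = [[1,25], [1,50], [1,75], [1,100]]

windows1 = []
windows2 = []

for w in windows:
    # Window bounds are ints, so convert them before concatenating
    tail = "_full_" + str(w[0]) + "-" + str(w[1]) + ".map"
    windows1.append('/'.join(root_name_1) + tail)
    windows2.append('/'.join(root_name_2) + tail)

# Order: genome file, windows for read end 1, windows for read end 2
files = [genome_file] + windows1 + windows2

tpm = tb_parse_mapping.tbParseMappingTool()
metadata = {'enzyme_name' : 'MboI', 'mapping' : ['iter', 'iter'], 'expt_name' : 'test'}
# run() is documented above with a third argument, output_files;
# supply the output locations there if your version requires it
tpm_files, tpm_meta = tpm.run(files, metadata)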

Fragment based mapping:

from tool import tb_parse_mapping

genome_file = 'genome.fasta'

# Split the paths into components; they are re-joined with '/' below
root_name_1 = "/tmp/data/expt_source_1".split("/")
root_name_2 = "/tmp/data/expt_source_2".split("/")
windows = [[1,100]]

# Window bounds are ints, so convert them before concatenating
start = str(windows[0][0])
end   = str(windows[0][1])

window1_1 = '/'.join(root_name_1) + "_full_" + start + "-" + end + ".map"
window1_2 = '/'.join(root_name_1) + "_frag_" + start + "-" + end + ".map"

window2_1 = '/'.join(root_name_2) + "_full_" + start + "-" + end + ".map"
window2_2 = '/'.join(root_name_2) + "_frag_" + start + "-" + end + ".map"

files = [
    genome_file,
    window1_1, window1_2,
    window2_1, window2_2,
]

tpm = tb_parse_mapping.tbParseMappingTool()
metadata = {'enzyme_name' : 'MboI', 'mapping' : ['frag', 'frag'], 'expt_name' : 'test'}
# run() is documented above with a third argument, output_files;
# supply the output locations there if your version requires it
tpm_files, tpm_meta = tpm.run(files, metadata)
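The two examples differ only in how the window file names are assembled. A minimal sketch of a helper that builds the ordered input list for either mode is shown here. The function name `build_parse_mapping_inputs` is hypothetical and not part of this package, and the `_full_` and `_frag_` file name patterns are taken from the examples above rather than from any guaranteed naming scheme.

```python
def build_parse_mapping_inputs(genome_file, root_1, root_2, windows, mapping="iter"):
    """Return [genome_file, read-end-1 windows..., read-end-2 windows...]."""
    if not windows:
        raise ValueError("at least one window is required")

    if mapping == "iter":
        # One "_full_" file per window size
        tails = ["_full_{}-{}.map".format(start, end) for start, end in windows]
    elif mapping == "frag":
        # A "_full_" and a "_frag_" file for the first window only
        start, end = windows[0]
        tails = [
            "_full_{}-{}.map".format(start, end),
            "_frag_{}-{}.map".format(start, end),
        ]
    else:
        raise ValueError("mapping must be 'iter' or 'frag'")

    return (
        [genome_file]
        + [root_1 + tail for tail in tails]
        + [root_2 + tail for tail in tails]
    )
```

With four windows in iterative mode this gives nine entries (the genome file plus four windows per read end); in fragment mode it gives five.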

tb_parse_mapping_frag(**kwargs)[source]¶

Function to map the aligned reads and return the matching pairs

Parameters:

genome_seq (dict) – Object containing the sequence of each of the chromosomes
enzyme_name (str) – Name of the enzyme used to digest the genome
window1_full (str) – Location of the first window index file
window1_frag (str) – Location of the second window index file
window2_full (str) – Location of the first window index file
window2_frag (str) – Location of the second window index file
reads (str) – Location of the reads that have a matching location at both ends of the paired reads

Returns:

reads (str) – Location of the intersection of mapped reads that have matching reads in both paired-end files

tb_parse_mapping_iter(**kwargs)[source]¶

Function to map the aligned reads and return the matching pairs

Parameters:

genome_seq (dict) – Object containing the sequence of each of the chromosomes
enzyme_name (str) – Name of the enzyme used to digest the genome
window1_1 (str) – Location of the first window index file
window1_2 (str) – Location of the second window index file
window1_3 (str) – Location of the third window index file
window1_4 (str) – Location of the fourth window index file
window2_1 (str) – Location of the first window index file
window2_2 (str) – Location of the second window index file
window2_3 (str) – Location of the third window index file
window2_4 (str) – Location of the fourth window index file
reads (str) – Location of the reads that have a matching location at both ends of the paired reads

Returns:

reads (str) – Location of the intersection of mapped reads that have matching reads in both paired-end files

Filter Aligned Reads¶

class tool.tb_filter.tbFilterTool(configuration=None)[source]¶

Tool for filtering out experimental artifacts from the aligned data

run(input_files, input_metadata, output_files)[source]¶

The main function to filter the reads to remove experimental artifacts

Parameters:

input_files (list) –

reads : str

Location of the reads that have a matching location at both ends of the paired reads
metadata (dict) –

conservative : bool

Level of filtering to apply [DEFAULT : True]

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

tb_filter(**kwargs)[source]¶

Function to filter out experimental artifacts

Parameters:

reads (str) – Location of the reads that have a matching location at both ends of the paired reads
filtered_reads_file (str) – Location of the filtered reads
conservative (bool) – Level of filtering [DEFAULT : True]

Returns:

filtered_reads (str) – Location of the filtered reads

Identify TADs and Compartments¶

class tool.tb_segment.tbSegmentTool[source]¶

Tool for finding tads and compartments in an adjacency matrix

run(input_files, input_metadata, output_files)[source]¶

The main function to predict TAD sites and compartments for a given resolution from the Hi-C matrix

Parameters:

input_files (list) –

bamin : str

Location of the tadbit bam paired reads

biases : str

Location of the pickle hic biases
metadata (dict) –

resolution : int

Resolution of the Hi-C

workdir : str

Location of working directory

ncpus : int

Number of cpus to use

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

tb_segment(**kwargs)[source]¶

Function to find tads and compartments in the Hi-C matrix

Parameters:

bamin (str) – Location of the tadbit bam paired reads
biases (str) – Location of the pickle hic biases
resolution (int) – Resolution of the Hi-C
callers (str) – 1 for TAD calling, 2 for compartment calling
workdir (str) – Location of working directory
ncpus (int) – Number of cpus to use

Returns:

compartments (str) – Location of tsv file with compartment definition
tads (str) – Location of tsv file with tad definition
filtered_bins (str) – Location of filtered_bins png

Normalize paired end reads file¶

class tool.tb_normalize.tbNormalizeTool[source]¶

Tool for normalizing an adjacency matrix

run(input_files, input_metadata, output_files)[source]¶

The main function for the normalization of the Hi-C matrix to a given resolution

Parameters:

input_files (list) –

bamin : str

Location of the tadbit bam paired reads
metadata (dict) –

normalization: str

normalization(s) to apply. Order matters. Choices: [Vanilla, oneD]

resolution : str

Resolution of the Hi-C

min_perc : str

Lower percentile above which bins are considered good.

max_perc : str

Upper percentile below which bins are considered good.

workdir : str

Location of working directory

ncpus : str

Number of cpus to use

min_count : str

minimum number of reads mapped to a bin (recommended value could be 2500). If set this option overrides the perc_zero

fasta: str

Location of the fasta file with genome sequence, to compute GC content and number of restriction sites per bin. Required for oneD normalization

mappability: str

Location of the file with mappability, required for oneD normalization

rest_enzyme: str

For oneD normalization. Name of the restriction enzyme used to do the Hi-C experiment

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

tb_normalize(**kwargs)[source]¶

Function to normalize the Hi-C matrix to a given resolution

Parameters:

bamin (str) – Location of the tadbit bam paired reads
normalization (str) – normalization(s) to apply. Order matters. Choices: [Vanilla, oneD]
resolution (str) – Resolution of the Hi-C
min_perc (str) – Lower percentile above which bins are considered good.
max_perc (str) – Upper percentile below which bins are considered good.
workdir (str) – Location of working directory
ncpus (str) – Number of cpus to use
min_count (str) – minimum number of reads mapped to a bin (recommended value could be 2500). If set this option overrides the perc_zero
fasta (str) – Location of the fasta file with genome sequence, to compute GC content and number of restriction sites per bin. Required for oneD normalization
mappability (str) – Location of the file with mappability, required for oneD normalization
rest_enzyme (str) – For oneD normalization. Name of the restriction enzyme used to do the Hi-C experiment

Returns:

hic_biases (str) – Location of HiC biases pickle file
interactions (str) – Location of interaction decay vs genomic distance pdf
filtered_bins (str) – Location of filtered_bins png
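Because fasta, mappability and rest_enzyme are only required for oneD normalization, it can help to check the metadata before launching a run. The following is a minimal sketch of such a check. The helper name `missing_normalization_keys` is hypothetical and not part of this package, and it assumes that several normalizations, if given as one string, are separated by whitespace.

```python
# Keys the parameter list above marks as required for oneD normalization
ONED_REQUIRED_KEYS = ("fasta", "mappability", "rest_enzyme")


def missing_normalization_keys(metadata):
    """Return the metadata keys that are missing for the requested normalization."""
    method = metadata.get("normalization")
    if not method:
        return ["normalization"]

    # Assumption: multiple methods are whitespace separated, e.g. "Vanilla oneD"
    methods = method.split() if isinstance(method, str) else list(method)

    missing = []
    if "oneD" in methods:
        missing.extend(key for key in ONED_REQUIRED_KEYS if not metadata.get(key))
    return missing
```

An empty list means the metadata is complete for the chosen normalization; otherwise the returned names can be reported to the user before any work is started.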

Extract binned matrix from paired end reads file¶

class tool.tb_bin.tbBinTool[source]¶

Tool for binning an adjacency matrix

run(input_files, input_metadata, output_files)[source]¶

The main function to bin the Hi-C matrix to a given resolution

Parameters:

input_files (list) –

bamin : str

Location of the tadbit bam paired reads

biases : str

Location of the pickle hic biases
input_metadata (dict) –

resolution : int

Resolution of the Hi-C

coord1 : str

Coordinate of the region to retrieve. By default the whole genome is used. The argument can be either a single chromosome name or a coordinate in the form: “-c chr3:110000000-120000000”

coord2 : str

Coordinate of a second region to retrieve the matrix in the intersection with the first region.

norm : str

[[‘raw’]] normalization(s) to apply. Order matters. Choices: [norm, decay, raw]

workdir : str

Location of working directory

ncpus : int

Number of cpus to use

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

tb_bin(**kwargs)[source]¶

Function to bin the Hi-C matrix to a given resolution

Parameters:

bamin (str) – Location of the tadbit bam paired reads
biases (str) – Location of the pickle hic biases
resolution (int) – Resolution of the Hi-C
coord1 (str) – Coordinate of the region to retrieve. By default the whole genome is used. The argument can be either a single chromosome name or a coordinate in the form: “-c chr3:110000000-120000000”
coord2 (str) – Coordinate of a second region to retrieve the matrix in the intersection with the first region.
norm (list) – [[‘raw’]] normalization(s) to apply. Order matters. Choices: [norm, decay, raw]
workdir (str) – Location of working directory
ncpus (int) – Number of cpus to use

Returns:

hic_contacts_matrix_raw (str) – Location of HiC raw matrix in text format
hic_contacts_matrix_nrm (str) – Location of HiC normalized matrix in text format
hic_contacts_matrix_raw_fig (str) – Location of HiC raw matrix in png format
hic_contacts_matrix_norm_fig (str) – Location of HiC normalized matrix in png format
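The coord1 and coord2 values take either a bare chromosome name or a chromosome with a start-end range. A minimal sketch of a parser for that form is given here. The helper name `parse_coord` is hypothetical and not part of this package, and it assumes the value is passed without the leading "-c" flag shown in the description.

```python
import re

# "chr3" or "chr3:110000000-120000000"
_COORD_RE = re.compile(r"^(?P<chrom>[^:\s]+)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_coord(coord):
    """Return (chromosome, start, end); start and end are None for a whole chromosome."""
    match = _COORD_RE.match(coord.strip())
    if match is None:
        raise ValueError("unrecognised coordinate: %r" % coord)

    chrom = match.group("chrom")
    if match.group("start") is None:
        return chrom, None, None

    start = int(match.group("start"))
    end = int(match.group("end"))
    if start > end:
        raise ValueError("start is greater than end in %r" % coord)
    return chrom, start, end
```

Rejecting malformed values early keeps a typo in the region from surfacing only after the binning step has started.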

Save Matrix to HDF5 File¶

class tool.tb_save_hdf5_matrix.tbSaveAdjacencyHDF5Tool[source]¶

Tool for saving the Hi-C adjacency list into an HDF5 file

run(input_files, output_files, metadata=None)[source]¶

The main function to save the adjacency list from Hi-C into an HDF5 index file at the defined resolutions.

Parameters:

input_files (list) –

adj_list : str

Location of the adjacency list

hdf5_file : str

Location of the HDF5 output matrix file
metadata (dict) –

resolutions : list

Levels of resolution for the adjacency list to be saved at

assembly : str

Assembly of the aligned sequences

normalized : bool

Whether the dataset should be normalised before saving

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

tb_matrix_hdf5(**kwargs)[source]¶

Function to save the Hi-C matrix into an HDF5 file

This has to be run sequentially as it is not possible for multiple streams to write to the same HDF5 file. This is a run-once-and-leave operation. There also needs to be a check that no other process is writing to the HDF5 file at the same time. This should be done at the staging and unstaging level to prevent the file from being written to by multiple processes and generating conflicts.

This needs to include attributes for the chromosomes for each resolution - See the mg-rest-adjacency hdf5_reader for further details about the requirement. This prevents the need for secondary storage details outside of the HDF5 file.

Parameters:

hic_data (hic_data) – Hi-C data object
hdf5_file (str) – Location of the HDF5 output matrix file
resolution (int) – Resolution to read the Hi-C adjacency list at
chromosomes (list) – List of lists of the chromosome names and their size in the order that they are presented for indexing

Returns:

hdf5_file (str) – Location of the HDF5 output matrix file
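One possible way to check that no other process is writing to the HDF5 file is a lock file created next to it. The following is a minimal sketch of that idea using only the standard library. The `exclusive_writer` context manager and the ".lock" suffix are hypothetical and not part of this package, and the sketch does not handle stale locks left behind by a crashed process.

```python
import os
from contextlib import contextmanager


@contextmanager
def exclusive_writer(target_path):
    """Hold a lock file for target_path while the body of the with-block runs."""
    lock_path = target_path + ".lock"
    # O_CREAT | O_EXCL makes creation fail with FileExistsError
    # if another process already holds the lock
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        # Record the owning process id to help with debugging
        os.write(fd, str(os.getpid()).encode())
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A caller would wrap the write in `with exclusive_writer(hdf5_file): ...` and treat a `FileExistsError` as a signal that another writer is active.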

Generate TAD Predictions¶

class tool.tb_generate_tads.tbGenerateTADsTool[source]¶

Tool for taking the adjacency lists and predicting TADs

run(input_files, output_files, metadata=None)[source]¶

The main function to predict TAD sites for a given resolution from the Hi-C matrix

Parameters:

input_files (list) –

adj_list : str

Location of the adjacency list
metadata (dict) –

resolutions : list

Levels of resolution for the adjacency list to be saved at

assembly : str

Assembly of the aligned sequences

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

tb_generate_tads(**kwargs)[source]¶

Function to predict TAD sites for a given resolution from the Hi-C matrix

Parameters:

expt_name (str) – Location of the adjacency list
matrix_file (str) – Location of the HDF5 output matrix file
resolution (int) – Resolution to read the Hi-C adjacency list at
tad_file (str) – Location of the output TAD file

Returns:

tad_file (str) – Location of the output TAD file

tb_hic_chr(**kwargs)[source]¶: Get the list of chromosomes in the adjacency list

tb_merge_tad_files(**kwargs)[source]¶: Merge 2 TAD adjacency list files

Generate 3D models from binned interaction matrix¶

class tool.tb_model.tbModelTool[source]¶

Tool for generating 3D models from a binned interaction matrix

run(input_files, input_metadata, output_files)[source]¶

The main function for generating 3D models from the normalized Hi-C matrix at a given resolution

Parameters:

input_files (list) –

hic_contacts_matrix_norm : str

Location of the tab-separated normalized matrix
metadata (dict) –

optimize_only: bool

True to run the optimization only, False to also compute the models and stats

gen_pos_chrom_name : str

Coordinates of the genomic region to model.

resolution : str

Resolution of the Hi-C

gen_pos_begin : int

Genomic coordinate from which to start modeling.

gen_pos_end : int

Genomic coordinate where to end modeling.

num_mod_comp : int

Number of models to compute for each optimization step.

num_mod_comp : int

Number of models to keep.

max_dist : str

Range of numbers for optimal maxdist parameter, e.g. 400:1000:100; or just a single number, e.g. 800; or a list of numbers, e.g. 400 600 800 1000.

upper_bound : int

Range of numbers for optimal upfreq parameter, e.g. 0:1.2:0.3; or just a single number, e.g. 0.8; or a list of numbers, e.g. 0.1 0.3 0.5 0.9.

lower_bound : int

Range of numbers for optimal low parameter, e.g. -1.2:0:0.3; or just a single number, e.g. -0.8; or a list of numbers, e.g. -0.1 -0.3 -0.5 -0.9.

cutoff : str

Range of numbers for optimal cutoff distance. Cutoff is computed based on the resolution. This cutoff distance is calculated taking as reference the diameter of a modeled particle in the 3D model, e.g. 1.5:2.5:0.5; or just a single number, e.g. 2; or a list of numbers, e.g. 2 2.5.

workdir : str

Location of working directory

ncpus : str

Number of cpus to use

Returns:

output_files (list) – List of locations for the output files.
output_metadata (list) – List of matching metadata dict objects

tb_model(**kwargs)[source]¶

Function to generate 3D models from the normalized Hi-C matrix at a given resolution

Parameters:

optimize_only (bool) – True to run the optimization only, False to also compute the models and stats
hic_contacts_matrix_norm (str) – Location of the tab-separated normalized matrix
resolution (str) – Resolution of the Hi-C
gen_pos_chrom_name (str) – Coordinates of the genomic region to model.
gen_pos_begin (int) – Genomic coordinate from which to start modeling.
gen_pos_end (int) – Genomic coordinate where to end modeling.
num_mod_comp (int) – Number of models to compute for each optimization step.
num_mod_comp – Number of models to keep.
max_dist (str) – Range of numbers for optimal maxdist parameter, e.g. 400:1000:100; or just a single number, e.g. 800; or a list of numbers, e.g. 400 600 800 1000.
upper_bound (int) – Range of numbers for optimal upfreq parameter, e.g. 0:1.2:0.3; or just a single number, e.g. 0.8; or a list of numbers, e.g. 0.1 0.3 0.5 0.9.
lower_bound (int) – Range of numbers for optimal low parameter, e.g. -1.2:0:0.3; or just a single number, e.g. -0.8; or a list of numbers, e.g. -0.1 -0.3 -0.5 -0.9.
cutoff (str) – Range of numbers for optimal cutoff distance. Cutoff is computed based on the resolution. This cutoff distance is calculated taking as reference the diameter of a modeled particle in the 3D model, e.g. 1.5:2.5:0.5; or just a single number, e.g. 2; or a list of numbers, e.g. 2 2.5.
workdir (str) – Location of working directory
ncpus (str) – Number of cpus to use

Returns:

tadkit_models (str) – Location of TADkit json file
modeling_stats (str) – Location of the folder with the modeling files and stats
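The max_dist, upper_bound, lower_bound and cutoff parameters each accept three notations: a start:stop:step range, a single number, or a space-separated list. A minimal sketch of how such a value could be expanded into explicit numbers is given here. The helper name `expand_range_param` is hypothetical and not part of this package, and treating the stop value as inclusive is an assumption, not something the parameter descriptions state.

```python
from decimal import Decimal


def expand_range_param(value):
    """Expand 'start:stop:step', a single number, or a space-separated list to floats."""
    text = str(value).strip()

    if ":" in text:
        parts = text.split(":")
        if len(parts) != 3:
            raise ValueError("expected start:stop:step, got %r" % text)
        # Decimal avoids accumulating floating point error for steps such as 0.3
        start, stop, step = (Decimal(part) for part in parts)
        if step <= 0 or start > stop:
            raise ValueError("invalid range %r" % text)

        values = []
        current = start
        while current <= stop:  # assumption: the stop value is included
            values.append(float(current))
            current += step
        return values

    # Single number or space-separated list
    return [float(token) for token in text.split()]
```

Under the inclusive-stop assumption, "400:1000:100" expands to seven values from 400 to 1000, while "800" and "400 600 800 1000" are returned as one and four values respectively.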