Tools for processing FastQ files¶
File Validation¶
Pipelines and functions assessing the quality of input files.
FastQC¶
-
class
tool.validate_fastqc.
fastqcTool
(configuration=None)[source]¶ Tool for assessing the quality of reads in a FastQ file
-
run
(input_files, input_metadata, output_files)[source]¶ Tool for assessing the quality of reads in a FastQ file
Parameters: - input_files (dict) –
- fastq : str
- List of file locations
- metadata (dict) –
- fastq : dict
- Required meta data
- output_files (dict) –
- report : str
- Location of the HTML report
Returns: array – First element is a list of the index files. Second element is a list of the matching metadata
Return type: list
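The three arguments to run can be sketched as plain dictionaries. This is only an illustration: the file paths, the metadata content and the check_keys helper are hypothetical, and only the key names fastq and report come from the parameter list above.

```python
# Hypothetical argument dictionaries for a FastQC run; all paths are placeholders.
input_files = {"fastq": "/data/sample.fastq"}
input_metadata = {"fastq": {"file_type": "fastq"}}  # assumed metadata content
output_files = {"report": "/data/sample_fastqc.html"}


def check_keys(mapping, required):
    """Return the sorted list of required keys that are missing from mapping."""
    return sorted(set(required) - set(mapping))


# An empty list means every documented key is present.
missing = check_keys(input_files, ["fastq"]) + check_keys(output_files, ["report"])
print(missing)
```

Checking the keys before calling the tool gives an early, readable failure instead of a KeyError inside the run.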
-
TrimGalore¶
-
class
tool.trimgalore.
trimgalore
(configuration=None)[source]¶ Tool for trimming FASTQ reads that are of low quality
-
static
get_trimgalore_params
(params)[source]¶ Function to handle the extraction of command-line parameters
Parameters: params (dict) –
Return type: list
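A minimal sketch of how a params dict could be turned into a list of command-line tokens. The configuration key names (tg_quality and so on) and the params_to_command_line helper are assumptions made for illustration and may not match the keys the tool reads; the options themselves (--quality, --length, --stringency, --phred33, --paired) are Trim Galore flags.

```python
# Hypothetical mapping of configuration keys to Trim Galore options.
# Options in VALUE_OPTIONS take a value; those in FLAG_OPTIONS are switches.
VALUE_OPTIONS = {
    "tg_quality": "--quality",
    "tg_length": "--length",
    "tg_stringency": "--stringency",
}
FLAG_OPTIONS = {
    "tg_phred33": "--phred33",
    "tg_paired": "--paired",
}


def params_to_command_line(params):
    """Translate a params dict into a flat list of command-line tokens."""
    command_params = []
    for key, option in VALUE_OPTIONS.items():
        if key in params:
            # Options that take a value contribute two tokens.
            command_params += [option, str(params[key])]
    for key, option in FLAG_OPTIONS.items():
        if params.get(key):
            # Boolean switches contribute a single token when truthy.
            command_params.append(option)
    return command_params


# Keys that are not in either mapping are ignored.
example = params_to_command_line({"tg_quality": 20, "tg_phred33": True, "unused": 1})
print(example)
```

Returning a list rather than a single string keeps the tokens ready for subprocess calls without shell quoting.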
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to run TrimGalore to remove low quality and very short reads. TrimGalore uses CutAdapt and FASTQC for the analysis.
Parameters: - input_files (dict) –
- fastq1 : string
- Location of the FASTQ file
- fastq2 : string
- [OPTIONAL] Location of the paired end FASTQ file
- metadata (dict) – Matching metadata for the input FASTQ files
Returns: output_files (dict) –
- fastq1_trimmed : str
Location of the trimmed FASTQ file
- fastq2_trimmed : str
[OPTIONAL] Location of a trimmed paired end FASTQ file
output_metadata (dict) – Matching metadata for the output files
-
trimgalore_paired
(**kwargs)[source]¶ Trims and removes low quality subsections and reads from paired-end FASTQ files
Parameters: - fastq_file_in (str) – Location of the input fastq file
- fastq_file_out (str) – Location of the output fastq file
- params (dict) – Parameters to use in TrimGalore
Returns: Indicator of the success of the function
Return type: bool
-
trimgalore_single
(**kwargs)[source]¶ Trims and removes low quality subsections and reads from a single-ended FASTQ file
Parameters: - fastq_file_in (str) – Location of the input fastq file
- fastq_file_out (str) – Location of the output fastq file
- params (dict) – Parameters to use in TrimGalore
Returns: Indicator of the success of the function
Return type: bool
-
trimgalore_version
(**kwargs)[source]¶ Trims and removes low quality subsections and reads from a single-ended FASTQ file
Parameters: - fastq_file_in (str) – Location of the input fastq file
- fastq_file_out (str) – Location of the output fastq file
- params (dict) – Parameters to use in TrimGalore
Returns: Indicator of the success of the function
Return type: bool
-
Indexers¶
Bowtie 2¶
-
class
tool.bowtie_indexer.
bowtieIndexerTool
(configuration=None)[source]¶ Tool for running indexers over a genome FASTA file
-
bowtie2_indexer
(**kwargs)[source]¶ Bowtie2 Indexer
Parameters: - file_loc (str) – Location of the genome assembly FASTA file
- idx_loc (str) – Location of the output index file
-
run
(input_files, input_metadata, output_files)[source]¶ Tool for generating assembly aligner index files for use with the Bowtie 2 aligner
Parameters: - input_files (list) – List with a single str element with the location of the genome assembly FASTA file
- metadata (list) –
Returns: array – First element is a list of the index files. Second element is a list of the matching metadata
Return type: list
-
BSgenome Index¶
-
class
tool.forge_bsgenome.
bsgenomeTool
(configuration=None)[source]¶ Tool for making BSgenome index files
-
bsgenome_creater
(**kwargs)[source]¶ Make BSgenome index files. Uses an R script that wraps the required code.
Parameters: - genome (str) –
- circo_chrom (str) – Comma separated list of chromosome ids that are circular in the genome
- seed_file_param (dict) – Parameters required for the function to build the seed file
- genome_2bit (str) –
- chrom_size (str) –
- seed_file (str) –
- bsgenome (str) –
-
static
genome_to_2bit
(genome, genome_2bit)[source]¶ Generate the 2bit genome file from a FASTA file
Parameters: - genome (str) – Location of the FASTA genome file
- genome_2bit (str) – Location of the 2bit genome file
Returns: True if successful, False if not.
Return type: bool
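The True/False contract described above can be sketched with a thin wrapper around an external converter. This sketch assumes the UCSC faToTwoBit command-line utility is used for the conversion, which the text above does not state; the fasta_to_2bit name is hypothetical.

```python
import subprocess


def fasta_to_2bit(genome, genome_2bit):
    """Convert a FASTA file to 2bit; return True on success, False otherwise."""
    # Assumed converter: the UCSC faToTwoBit utility (input FASTA, output 2bit).
    command = ["faToTwoBit", genome, genome_2bit]
    try:
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing binary; CalledProcessError a non-zero exit.
        return False
    return True
```

Catching both exception types means a missing binary and a failed conversion are reported the same way, as False.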
-
static
get_chrom_size
(genome_2bit, chrom_size, circ_chrom)[source]¶ Generate the chrom.size file and identify the available chromosomes in the 2Bit file.
Parameters: - genome_2bit (str) – Location of the 2bit genome file
- chrom_size (str) – Location to save the chrom.size file to
- circ_chrom (list) – List of chromosomes that are known to be circular
Returns: - If successful 2 lists – [0] : List of the linear chromosomes in the 2bit file [1] : List of circular chromosomes in the 2bit file
- Returns (False, False) if there is an IOError
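The split into linear and circular chromosomes can be sketched as follows. This assumes the chrom.size file holds tab-separated name and length records, one per line; the split_chromosomes helper and the example records are illustrative only.

```python
def split_chromosomes(chrom_size_lines, circ_chrom):
    """Split chromosome names into linear and circular lists.

    chrom_size_lines holds "name<TAB>length" records, the layout assumed
    here for a chrom.size file.
    """
    linear, circular = [], []
    for line in chrom_size_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name = line.split("\t")[0]
        if name in circ_chrom:
            circular.append(name)
        else:
            linear.append(name)
    return linear, circular


records = ["chr1\t248956422", "chr2\t242193529", "chrM\t16569"]
linear, circular = split_chromosomes(records, ["chrM"])
print(linear, circular)
```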
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to run iNPS for peak calling over a given BAM file and matching background BAM file.
Parameters: - input_files (list) – List of input bam file locations where 0 is the bam data file and 1 is the matching background bam file
- metadata (dict) –
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
BS-Seeker2 Indexer¶
-
class
tool.bs_seeker_indexer.
bssIndexerTool
(configuration=None)[source]¶ Script from BS-Seeker2 for building the index for alignment. In this case it uses Bowtie2.
-
bss_build_index
(**kwargs)[source]¶ Function to submit the FASTA file for the reference sequence and build the required index file used by the aligner.
Parameters: - fasta_file (str) – Location of the genome FASTA file
- aligner (str) – Aligner to use by BS-Seeker2. Currently only bowtie2 is available in this build
- aligner_path (str) – Location of the aligner's binary file
- bss_path – Location of the BS-Seeker2 libraries
- idx_out (str) – Location of the output compressed index file
Returns: bam_out – Location of the output bam alignment file
Return type: str
-
BWA¶
-
class
tool.bwa_indexer.
bwaIndexerTool
(configuration=None)[source]¶ Tool for running indexers over a genome FASTA file
-
bwa_indexer
(**kwargs)[source]¶ BWA Indexer
Parameters: - file_loc (str) – Location of the genome assembly FASTA file
- idx_out (str) – Location of the output index file
Returns: Return type: bool
-
run
(input_files, input_metadata, output_files)[source]¶ Function to run the BWA over a genome assembly FASTA file to generate the matching index for use with the aligner
Parameters: - input_files (dict) – List containing the location of the genome assembly FASTA file
- meta_data (dict) –
- output_files (dict) – List of output files generated
Returns: output_files (dict) –
- index : str
Location of the index file defined in the input parameters
output_metadata (dict) –
- index : Metadata
Metadata relating to the index file
-
GEM¶
-
class
tool.gem_indexer.
gemIndexerTool
(configuration=None)[source]¶ Tool for running indexers over a genome FASTA file
-
gem_indexer
(**kwargs)[source]¶ GEM Indexer
Parameters: - genome_file (str) – Location of the genome assembly FASTA file
- idx_loc (str) – Location of the output index file
-
run
(input_files, input_metadata, output_files)[source]¶ Tool for generating assembly aligner index files for use with the GEM indexer
Parameters: - input_files (list) – List with a single str element with the location of the genome assembly FASTA file
- input_metadata (list) –
Returns: array – First element is a list of the index files. Second element is a list of the matching metadata
Return type: list
-
Kallisto¶
-
class
tool.kallisto_indexer.
kallistoIndexerTool
(configuration=None)[source]¶ Tool for running indexers over a genome FASTA file
-
kallisto_indexer
(**kwargs)[source]¶ Kallisto Indexer
Parameters: - file_loc (str) – Location of the cDNA FASTA file for a genome
- idx_loc (str) – Location of the output index file
-
run
(input_files, input_metadata, output_files)[source]¶ Tool for generating assembly aligner index files for use with Kallisto
Parameters: - input_files (list) – FASTA file location with all the cDNA sequences for a given genome
- input_metadata (list) –
Returns: array – First element is a list of the index files. Second element is a list of the matching metadata
Return type: list
-
Aligners¶
Bowtie2¶
-
class
tool.bowtie_aligner.
bowtie2AlignerTool
(configuration=None)[source]¶ Tool for aligning sequence reads to a genome using Bowtie2
-
bowtie2_aligner_paired
(**kwargs)[source]¶ Bowtie2 Aligner - Paired End
Parameters: - genome_file_loc (str) – Location of the genomic fasta
- read_file_loc1 (str) – Location of the FASTQ file
- read_file_loc2 (str) – Location of the FASTQ file
- bam_loc (str) – Location of the output aligned bam file
- bt2_1_file (str) – Location of the <genome>.1.bt2 index file
- bt2_2_file (str) – Location of the <genome>.2.bt2 index file
- bt2_3_file (str) – Location of the <genome>.3.bt2 index file
- bt2_4_file (str) – Location of the <genome>.4.bt2 index file
- bt2_rev1_file (str) – Location of the <genome>.rev.1.bt2 index file
- bt2_rev2_file (str) – Location of the <genome>.rev.2.bt2 index file
- aln_params (dict) – Alignment parameters
Returns: bam_loc – Location of the output file
Return type: str
-
bowtie2_aligner_single
(**kwargs)[source]¶ Bowtie2 Aligner - Single End
Parameters: - genome_file_loc (str) – Location of the genomic fasta
- read_file_loc1 (str) – Location of the FASTQ file
- bam_loc (str) – Location of the output aligned bam file
- bt2_1_file (str) – Location of the <genome>.1.bt2 index file
- bt2_2_file (str) – Location of the <genome>.2.bt2 index file
- bt2_3_file (str) – Location of the <genome>.3.bt2 index file
- bt2_4_file (str) – Location of the <genome>.4.bt2 index file
- bt2_rev1_file (str) – Location of the <genome>.rev.1.bt2 index file
- bt2_rev2_file (str) – Location of the <genome>.rev.2.bt2 index file
- aln_params (dict) – Alignment parameters
Returns: bam_loc – Location of the output file
Return type: str
-
static
get_aln_params
(params, paired=False)[source]¶ Function to handle the extraction of command-line parameters and format them for use in the aligner for Bowtie2
Parameters: - params (dict) –
- paired (bool) – Indicate if the parameters are paired-end specific. [DEFAULT=False]
Return type: list
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to align FASTQ reads to a genome using Bowtie2
Parameters: - input_files (dict) – File 0 is the genome file location, file 1 is the FASTQ file
- metadata (dict) –
- output_files (dict) –
Returns: - output_files (dict) – First element is a list of output_bam_files, second element is the matching meta data
- output_metadata (dict)
-
untar_index
(**kwargs)[source]¶ Extracts the Bowtie2 index files from the genome index tar file.
Parameters: - genome_file_name (str) – Location string of the genome fasta file
- genome_idx (str) – Location of the Bowtie2 index file
- bt2_1_file (str) – Location of the <genome>.1.bt2 index file
- bt2_2_file (str) – Location of the <genome>.2.bt2 index file
- bt2_3_file (str) – Location of the <genome>.3.bt2 index file
- bt2_4_file (str) – Location of the <genome>.4.bt2 index file
- bt2_rev1_file (str) – Location of the <genome>.rev.1.bt2 index file
- bt2_rev2_file (str) – Location of the <genome>.rev.2.bt2 index file
Returns: Boolean indicating if the task was successful
Return type: bool
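Extracting the six index files from an archive can be sketched with the standard tarfile module. The archive layout (a folder named after the genome holding the .bt2 files) and the extract_index helper are assumptions; only the six Bowtie2 index suffixes are taken from the parameter list above.

```python
import os
import shutil
import tarfile
import tempfile

BT2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]


def extract_index(genome_idx, genome_name, out_dir, suffixes):
    """Copy the named index files out of a tar archive into out_dir.

    Returns True only if every expected file was found and written.
    """
    wanted = {genome_name + suffix for suffix in suffixes}
    extracted = set()
    try:
        with tarfile.open(genome_idx) as tar:
            for member in tar.getmembers():
                base = os.path.basename(member.name)
                if member.isfile() and base in wanted:
                    # Stream the member to disk under its flattened name.
                    source = tar.extractfile(member)
                    with open(os.path.join(out_dir, base), "wb") as target:
                        shutil.copyfileobj(source, target)
                    extracted.add(base)
    except (OSError, tarfile.TarError):
        return False
    return extracted == wanted


with tempfile.TemporaryDirectory() as work_dir:
    # Build a small stand-in archive with empty placeholder index files.
    src_dir = os.path.join(work_dir, "genome")
    os.mkdir(src_dir)
    for suffix in BT2_SUFFIXES:
        open(os.path.join(src_dir, "genome" + suffix), "w").close()
    archive = os.path.join(work_dir, "genome.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname="genome")

    out_dir = os.path.join(work_dir, "out")
    os.mkdir(out_dir)
    ok = extract_index(archive, "genome", out_dir, BT2_SUFFIXES)
    extracted_files = sorted(os.listdir(out_dir))

print(ok, extracted_files)
```

Matching on the full file name rather than on the suffix alone avoids confusing genome.rev.1.bt2 with genome.1.bt2, since the former also ends in .1.bt2.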
-
BWA - ALN¶
-
class
tool.bwa_aligner.
bwaAlignerTool
(configuration=None)[source]¶ Tool for aligning sequence reads to a genome using BWA
-
bwa_aligner_paired
(**kwargs)[source]¶ BWA ALN Aligner - Paired End
Parameters: - genome_file_loc (str) – Location of the genomic fasta
- read_file_loc1 (str) – Location of the FASTQ file
- read_file_loc2 (str) – Location of the FASTQ file
- bam_loc (str) – Location of the output aligned bam file
- amb_file (str) – Location of the amb index file
- ann_file (str) – Location of the ann index file
- bwt_file (str) – Location of the bwt index file
- pac_file (str) – Location of the pac index file
- sa_file (str) – Location of the sa index file
- aln_params (dict) – Alignment parameters
Returns: bam_loc – Location of the output file
Return type: str
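A paired-end BWA ALN alignment usually involves one bwa aln call per read file followed by bwa sampe to pair the results. The sketch below only builds the command lists and does not run them. The bwa_aln_paired_commands helper, the .sai naming and passing aln_params as an already formatted token list are assumptions, not a description of how the tool itself invokes BWA.

```python
def bwa_aln_paired_commands(genome_file_loc, read_file_loc1, read_file_loc2,
                            aln_params=None):
    """Build the command lists for a paired-end BWA ALN alignment."""
    aln_params = list(aln_params or [])
    # Intermediate suffix array index files, one per read file (assumed naming).
    sai1 = read_file_loc1 + ".sai"
    sai2 = read_file_loc2 + ".sai"
    return [
        ["bwa", "aln"] + aln_params + ["-f", sai1, genome_file_loc, read_file_loc1],
        ["bwa", "aln"] + aln_params + ["-f", sai2, genome_file_loc, read_file_loc2],
        # sampe pairs the two sets of alignments and writes SAM to stdout.
        ["bwa", "sampe", genome_file_loc, sai1, sai2, read_file_loc1, read_file_loc2],
    ]


commands = bwa_aln_paired_commands("genome.fa", "r1.fastq", "r2.fastq", ["-q", "5"])
for command in commands:
    print(" ".join(command))
```

The SAM output of the final step would still need converting to BAM, for example with samtools, to produce the bam_loc file.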
-
bwa_aligner_single
(**kwargs)[source]¶ BWA ALN Aligner - Single Ended
Parameters: - genome_file_loc (str) – Location of the genomic fasta
- read_file_loc (str) – Location of the FASTQ file
- bam_loc (str) – Location of the output aligned bam file
- amb_file (str) – Location of the amb index file
- ann_file (str) – Location of the ann index file
- bwt_file (str) – Location of the bwt index file
- pac_file (str) – Location of the pac index file
- sa_file (str) – Location of the sa index file
- aln_params (dict) – Alignment parameters
Returns: bam_loc – Location of the output file
Return type: str
-
static
get_aln_params
(params)[source]¶ Function to handle the extraction of command-line parameters and format them for use in the aligner for BWA ALN
Parameters: params (dict) –
Return type: list
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to align FASTQ reads to a genome using BWA
Parameters: - input_files (dict) – File 0 is the genome file location, file 1 is the FASTQ file
- metadata (dict) –
- output_files (dict) –
Returns: - output_files (dict) – First element is a list of output_bam_files, second element is the matching meta data
- output_metadata (dict)
-
untar_index
(**kwargs)[source]¶ Extracts the BWA index files from the genome index tar file.
Parameters: - genome_file_name (str) – Location string of the genome fasta file
- genome_idx (str) – Location of the BWA index file
- amb_file (str) – Location of the amb index file
- ann_file (str) – Location of the ann index file
- bwt_file (str) – Location of the bwt index file
- pac_file (str) – Location of the pac index file
- sa_file (str) – Location of the sa index file
Returns: Boolean indicating if the task was successful
Return type: bool
-
BWA - MEM¶
-
class
tool.bwa_mem_aligner.
bwaAlignerMEMTool
(configuration=None)[source]¶ Tool for aligning sequence reads to a genome using BWA
-
bwa_aligner_paired
(**kwargs)[source]¶ BWA MEM Aligner - Paired End
Parameters: - genome_file_loc (str) – Location of the genomic fasta
- read_file_loc1 (str) – Location of the FASTQ file
- read_file_loc2 (str) – Location of the FASTQ file
- bam_loc (str) – Location of the output aligned bam file
- amb_file (str) – Location of the amb index file
- ann_file (str) – Location of the ann index file
- bwt_file (str) – Location of the bwt index file
- pac_file (str) – Location of the pac index file
- sa_file (str) – Location of the sa index file
- mem_params (dict) – Alignment parameters
Returns: bam_loc – Location of the output file
Return type: str
-
bwa_aligner_single
(**kwargs)[source]¶ BWA MEM Aligner - Single Ended
Parameters: - genome_file_loc (str) – Location of the genomic fasta
- read_file_loc (str) – Location of the FASTQ file
- bam_loc (str) – Location of the output aligned bam file
- amb_file (str) – Location of the amb index file
- ann_file (str) – Location of the ann index file
- bwt_file (str) – Location of the bwt index file
- pac_file (str) – Location of the pac index file
- sa_file (str) – Location of the sa index file
- mem_params (dict) – Alignment parameters
Returns: bam_loc – Location of the output file
Return type: str
-
static
get_mem_params
(params)[source]¶ Function to handle the extraction of command-line parameters and format them for use in the aligner for BWA MEM
Parameters: params (dict) –
Return type: list
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to align FASTQ reads to a genome using BWA
Parameters: - input_files (dict) – File 0 is the genome file location, file 1 is the FASTQ file
- metadata (dict) –
- output_files (dict) –
Returns: - output_files (dict) – First element is a list of output_bam_files, second element is the matching meta data
- output_metadata (dict)
-
untar_index
(**kwargs)[source]¶ Extracts the BWA index files from the genome index tar file.
Parameters: - genome_file_name (str) – Location string of the genome fasta file
- genome_idx (str) – Location of the BWA index file
- amb_file (str) – Location of the amb index file
- ann_file (str) – Location of the ann index file
- bwt_file (str) – Location of the bwt index file
- pac_file (str) – Location of the pac index file
- sa_file (str) – Location of the sa index file
Returns: Boolean indicating if the task was successful
Return type: bool
-
BS-Seeker2 Aligner¶
-
class
tool.bs_seeker_aligner.
bssAlignerTool
(configuration=None)[source]¶ Script from BS-Seeker2 for aligning reads to a genome. In this case it uses Bowtie2.
-
bs_seeker_aligner
(**kwargs)[source]¶ Alignment of the paired ends to the reference genome
Generates bam files for the alignments
This is performed by running the external program rather than reimplementing the code from the main function to make it easier when it comes to updating the changes in BS-Seeker2
Parameters: - input_fastq1 (str) – Location of paired end FASTQ file 1
- input_fastq2 (str) – Location of paired end FASTQ file 2
- aligner (str) – Aligner to use
- aligner_path (str) – Location of the aligner
- genome_fasta (str) – Location of the genome FASTA file
- genome_idx (str) – Location of the tar.gz genome index file
- bam_out (str) – Location of the aligned bam file
Returns: bam_out – Location of the BAM file generated during the alignment.
Return type: file
-
bs_seeker_aligner_single
(**kwargs)[source]¶ Alignment of the paired ends to the reference genome
Generates bam files for the alignments
This is performed by running the external program rather than reimplementing the code from the main function to make it easier when it comes to updating the changes in BS-Seeker2
Parameters: - input_fastq1 (str) – Location of paired end FASTQ file 1
- input_fastq2 (str) – Location of paired end FASTQ file 2
- aligner (str) – Aligner to use
- aligner_path (str) – Location of the aligner
- genome_fasta (str) – Location of the genome FASTA file
- genome_idx (str) – Location of the tar.gz genome index file
- bam_out (str) – Location of the aligned bam file
Returns: bam_out – Location of the BAM file generated during the alignment.
Return type: file
-
static
get_aln_params
(params, paired=False)[source]¶ Function to handle the extraction of command-line parameters and format them for use in the aligner for Bowtie2
Parameters: - params (dict) –
- paired (bool) – Indicate if the parameters are paired-end specific. [DEFAULT=False]
Return type: list
-
run
(input_files, input_metadata, output_files)[source]¶ Tool for indexing the genome assembly using BS-Seeker2. In this case it is using Bowtie2
Parameters: - input_files (list) – FASTQ file
- output_files (list) – Results files.
- metadata (list) –
Returns: array – Location of the filtered FASTQ file
Return type: list
-
run_aligner
(genome_idx, bam_out, script, params)[source]¶ Run the aligner
Parameters: - genome_idx (str) – Location of the genome index archive
- bam_out (str) – Location of the output bam file
- script (str) – Location of the BS Seeker2 aligner script
- params (list) – Parameter list for the aligner
Returns: True if the function completed successfully
Return type: bool
-
Filters¶
BioBamBam Filter¶
-
class
tool.biobambam_filter.
biobambam
(configuration=None)[source]¶ Tool to sort and filter bam files
-
biobambam_filter_alignments
(**kwargs)[source]¶ Sorts and filters the bam file.
It is important that all duplicate alignments have been removed. This can be run as an intermediate step, but should always be run as a check to ensure that the files are sorted and duplicates have been removed.
Parameters: - bam_file_in (str) – Location of the input bam file
- bam_file_out (str) – Location of the output bam file
- tmp_dir (str) – Tmp location for intermediate files during the sorting
Returns: bam_file_out – Location of the output bam file
Return type: str
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to run BioBAMBAMfilter to remove duplicates and spurious reads from the FASTQ files before analysis.
Parameters: - input_files (dict) – List of input bam file locations where 0 is the bam data file
- metadata (dict) – Matching meta data for the input files
- output_files (dict) – List of output file locations
Returns: - output_files (dict) – Filtered bam file.
- output_metadata (dict) – List of matching metadata dict objects
-
BS-Seeker2 Filter¶
-
class
tool.bs_seeker_filter.
filterReadsTool
(configuration=None)[source]¶ Script from BS-Seeker2 for filtering FASTQ files to remove repeats
-
bss_seeker_filter
(**kwargs)[source]¶ This is optional, but removes reads that can be problematic for the alignment of whole genome datasets.
If performing RRBS then this step can be skipped
This is a function that is installed as part of the BS-Seeker installation process.
Parameters: infile (str) – Location of the FASTQ file
Returns: outfile – Location of the filtered FASTQ file
Return type: str
-
Trim Galore¶
See the TrimGalore entry under File Validation for the tool.trimgalore.trimgalore class reference.
Peak Calling¶
BS-Seeker2 Methylation Caller¶
iDEAR¶
-
class
tool.idear.
idearTool
(configuration=None)[source]¶ Tool for peak calling for iDamID-seq data
-
idear_peak_calling
(**kwargs)[source]¶ Make iDamID-seq peak calls. These are saved as bed files that can then be displayed on genome browsers. Uses an R script that wraps the iDEAR protocol.
Parameters: - sample_name (str) –
- bg_name (str) –
- sample_bam_tar_file (str) – Location of the aligned sequences in bam format
- bg_bam_tar_file (str) – Location of the aligned background sequences in bam format
- species (str) – Species name for the alignments
- assembly (str) – Assembly used for the aligned sequences
- peak_bed (str) – Location of the peak bed file
Returns: peak_bed – Location of the collated bed file
Return type: str
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to run iDEAR for peak calling over a given BAM file and matching background BAM file.
Parameters: - input_files (list) – List of input bam file locations where 0 is the bam data file and 1 is the matching background bam file
- metadata (dict) –
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
iNPS¶
-
class
tool.inps.
inps
(configuration=None)[source]¶ Tool for peak calling for MNase-seq data
-
inps_peak_calling
(**kwargs)[source]¶ Convert BAM to BED, then make nucleosome peak calls. These are saved as bed files that can then be displayed on genome browsers.
Parameters: - bam_file (str) – Location of the aligned sequences in bam format
- peak_bed (str) – Location of the collated bed file of nucleosome peak calls
Returns: peak_bed – Location of the collated bed file of nucleosome peak calls
Return type: str
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to run iNPS for peak calling over a given BAM file and matching background BAM file.
Parameters: - input_files (list) – List of input bam file locations where 0 is the bam data file and 1 is the matching background bam file
- metadata (dict) –
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
Kallisto Quantification¶
-
class
tool.kallisto_quant.
kallistoQuantificationTool
(configuration=None)[source]¶ Tool for quantifying RNA-seq alignments to calculate expression levels of genes within a genome.
-
kallisto_quant_paired
(**kwargs)[source]¶ Kallisto quantifier for paired end RNA-seq data
Parameters: - idx_loc (str) – Location of the output index file
- fastq_file_loc_01 (str) – Location of the FASTQ sequence file
- fastq_file_loc_02 (str) – Location of the paired FASTQ sequence file
Returns: wig_file_loc – Location of the wig file containing the levels of expression
Return type: loc
-
kallisto_quant_single
(**kwargs)[source]¶ Kallisto quantifier for single end RNA-seq data
Parameters: - idx_loc (str) – Location of the output index file
- fastq_file_loc (str) – Location of the FASTQ sequence file
Returns: wig_file_loc – Location of the wig file containing the levels of expression
Return type: loc
-
kallisto_tsv2bed
(**kwargs)[source]¶ Converts the TSV file to a BigBed file so that it can be viewed within a genome browser
-
kallisto_tsv2gff
(**kwargs)[source]¶ Converts the TSV file to a GFF file so that it can be viewed within a genome browser
-
static
load_gff_ensembl
(gff_file)[source]¶ Function to extract all of the genes and their locations from a GFF file generated by Ensembl
-
static
load_gff_ucsc
(gff_file)[source]¶ Function to extract all of the genes and their locations from a GFF file generated by UCSC
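The conversion from a Kallisto abundance table to browser-friendly intervals can be sketched as below. It assumes the standard abundance.tsv columns (target_id, length, eff_length, est_counts, tpm) and a hypothetical locations dictionary of the kind a GFF loader might return. The abundance_to_bed helper and the choice to store the rounded TPM, capped at 1000, in the BED score column are illustrative decisions, not the tool's own behaviour.

```python
import csv
import io


def abundance_to_bed(tsv_handle, locations):
    """Turn Kallisto abundance rows into BED6 lines for located targets."""
    bed_lines = []
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        location = locations.get(row["target_id"])
        if location is None:
            continue  # no known genomic location for this target
        chrom, start, end, strand = location
        # BED scores are integers in 0-1000, so the TPM is rounded and capped.
        score = min(1000, round(float(row["tpm"])))
        bed_lines.append(
            "\t".join([chrom, str(start), str(end), row["target_id"], str(score), strand])
        )
    return bed_lines


abundance = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1500\t1321.0\t42\t12.7\n"
    "tx2\t900\t721.0\t0\t0\n"
    "tx3\t800\t621.0\t5\t3.2\n"
)
# Hypothetical target locations: (chromosome, start, end, strand).
locations = {"tx1": ("chr1", 100, 1600, "+"), "tx2": ("chr2", 500, 1400, "-")}
bed_lines = abundance_to_bed(abundance, locations)
print("\n".join(bed_lines))
```

The resulting text BED would still need sorting and a tool such as bedToBigBed to become a BigBed file.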
-
run
(input_files, input_metadata, output_files)[source]¶ Tool for calculating the level of expression
Parameters: - input_files (list) – Kallisto index file for the FASTQ file for the experimental alignments
- input_metadata (list) –
Returns: array – First element is a list of the index files. Second element is a list of the matching metadata
Return type: list
-
MACS2¶
-
class
tool.macs2.
macs2
(configuration=None)[source]¶ Tool for peak calling for ChIP-seq data
-
static
get_macs2_params
(params)[source]¶ Function to handle the extraction of command-line parameters and format them for use with MACS2
Parameters: params (dict) –
Return type: list
-
macs2_peak_calling
(**kwargs)[source]¶ Function to run MACS2 for peak calling on aligned sequence files and normalised against a provided background set of alignments.
Parameters: - name (str) – Name to be used to identify the files
- bam_file (str) – Location of the aligned FASTQ files as a bam file
- bai_file (str) – Location of the bam index file
- bam_file_bgd (str) – Location of the aligned FASTQ files as a bam file representing background values for the cell
- bai_file_bgd (str) – Location of the background bam index file
- narrowpeak (str) – Location of the output narrowpeak file
- summits_bed (str) – Location of the output summits bed file
- broadpeak (str) – Location of the output broadpeak file
- gappedpeak (str) – Location of the output gappedpeak file
- chromosome (str) – If the tool is to be run over a single chromosome the matching chromosome name should be specified. If None then the whole bam file is analysed
Returns: - narrowPeak (file) – BED6+4 file - ideal for transcription factor binding site identification
- summitPeak (file) – BED4+1 file - Contains the peak summit locations for every peak
- broadPeak (file) – BED6+3 file - ideal for histone binding site identification
- gappedPeak (file) – BED12+3 file - Contains a merged set of the broad and narrow peak files
- Definitions for each of these files come from the MACS2 documentation at https://github.com/taoliu/MACS
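The BED6+4 layout of a narrowPeak file can be sketched as a small parser: six standard BED columns followed by signal value, p-value, q-value and the summit offset relative to the start. The NarrowPeak tuple, the parse_narrowpeak_line helper and the example record are illustrative only.

```python
from collections import namedtuple

# The six BED columns followed by the four narrowPeak-specific columns.
NarrowPeak = namedtuple(
    "NarrowPeak",
    [
        "chrom", "start", "end", "name", "score", "strand",
        "signal_value", "p_value", "q_value", "summit_offset",
    ],
)


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak record into a NarrowPeak tuple."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("expected 10 columns, found %d" % len(fields))
    return NarrowPeak(
        fields[0], int(fields[1]), int(fields[2]), fields[3], int(fields[4]),
        fields[5], float(fields[6]), float(fields[7]), float(fields[8]),
        int(fields[9]),
    )


# Made-up example record, not real MACS2 output.
peak = parse_narrowpeak_line("chr1\t1000\t1500\tsample_peak_1\t250\t.\t8.5\t12.3\t9.8\t230")
# The summit column is an offset from the start of the peak.
summit_position = peak.start + peak.summit_offset
print(peak.chrom, summit_position)
```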
-
macs2_peak_calling_nobgd
(**kwargs)[source]¶ Function to run MACS2 for peak calling on aligned sequence files without a background dataset for normalisation.
Parameters: - name (str) – Name to be used to identify the files
- bam_file (str) – Location of the aligned FASTQ files as a bam file
- bai_file (str) – Location of the bam index file
- narrowpeak (str) – Location of the output narrowpeak file
- summits_bed (str) – Location of the output summits bed file
- broadpeak (str) – Location of the output broadpeak file
- gappedpeak (str) – Location of the output gappedpeak file
- chromosome (str) – If the tool is to be run over a single chromosome the matching chromosome name should be specified. If None then the whole bam file is analysed
Returns: - narrowPeak (file) – BED6+4 file - ideal for transcription factor binding site identification
- summitPeak (file) – BED4+1 file - Contains the peak summit locations for every peak
- broadPeak (file) – BED6+3 file - ideal for histone binding site identification
- gappedPeak (file) – BED12+3 file - Contains a merged set of the broad and narrow peak files
- Definitions for each of these files come from the MACS2 documentation at https://github.com/taoliu/MACS
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to run MACS 2 for peak calling over a given BAM file and matching background BAM file.
Parameters: - input_files (dict) – List of input bam file locations where 0 is the bam data file and 1 is the matching background bam file
- metadata (dict) –
Returns: - output_files (dict) – List of locations for the output files.
- output_metadata (dict) – List of matching metadata dict objects
-
Hi-C Parsing¶
The following tools have been split out of the Hi-C pipelines, which were built to use the TADbit library.
FASTQ mapping¶
-
class
tool.tb_full_mapping.
tbFullMappingTool
[source]¶ Tool for mapping fastq paired end files to the GEM index files
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to map the FASTQ files to the GEM file over different window sizes ready for alignment
Parameters: - input_files (list) –
- gem_file : str
- Location of the genome GEM index file
- fastq_file_bgd : str
- Location of the FASTQ file
- metadata (dict) –
- windows : list
- List of lists with the window sizes to be computed
- enzyme_name : str
- Restriction enzyme used [OPTIONAL]
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
tb_full_mapping_frag
(**kwargs)[source]¶ Function to map the FASTQ files to the GEM file based on fragments derived from the restriction enzyme that was used.
Parameters: - gem_file (str) – Location of the genome GEM index file
- fastq_file_bgd (str) – Location of the FASTQ file
- enzyme_name (str) – Restriction enzyme name (MboI)
- windows (list) – List of lists with the window sizes to be computed
- window_file (str) – Location of the first window index file
Returns: window_file – Location of the window index file
Return type: str
-
tb_full_mapping_iter
(**kwargs)[source]¶ Function to map the FASTQ files to the GEM file over different window sizes ready for alignment
Parameters: - gem_file (str) – Location of the genome GEM index file
- fastq_file_bgd (str) – Location of the FASTQ file
- windows (list) – List of lists with the window sizes to be computed
- window1 (str) – Location of the first window index file
- window2 (str) – Location of the second window index file
- window3 (str) – Location of the third window index file
- window4 (str) – Location of the fourth window index file
Returns: - window1 (str) – Location of the first window index file
- window2 (str) – Location of the second window index file
- window3 (str) – Location of the third window index file
- window4 (str) – Location of the fourth window index file
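The relationship between the `windows` list and the window index files can be sketched as follows. This assumes the `_full_<start>-<end>.map` naming pattern used in the parsing example further down this page; the helper name `window_file_names` is hypothetical and not part of the tool.

```python
def window_file_names(fastq_root, windows):
    """Return one window index file name per [start, end] window."""
    return [
        "{}_full_{}-{}.map".format(fastq_root, start, end)
        for start, end in windows
    ]


# Four windows, matching the window1..window4 parameters of the iterative mapping.
windows = [[1, 25], [1, 50], [1, 75], [1, 100]]
names = window_file_names("/tmp/data/expt_source_1", windows)
for name in names:
    print(name)
```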
-
Map Parsing¶
-
class
tool.tb_parse_mapping.
tbParseMappingTool
[source]¶ Tool for parsing the mapped reads and generating the list of paired ends that have a match at both ends.
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to map the aligned reads and return the matching pairs. Parsing of the mappings can be either iterative or fragment based. If it is iterative, then the locations of 4 output window files for each end of the paired-end reads need to be provided. If it is fragment based, then only 2 window locations need to be provided for each end, along with an enzyme name.
Parameters: - input_files (list) –
- genome_file : str
- Location of the genome FASTA file
- window1_1 : str
- Location of the first window index file
- window1_2 : str
- Location of the second window index file
- window1_3 : str
- [OPTIONAL] Location of the third window index file
- window1_4 : str
- [OPTIONAL] Location of the fourth window index file
- window2_1 : str
- Location of the first window index file
- window2_2 : str
- Location of the second window index file
- window2_3 : str
- [OPTIONAL] Location of the third window index file
- window2_4 : str
- [OPTIONAL] Location of the fourth window index file
- metadata (dict) –
- windows : list
- List of lists with the window sizes to be computed
- enzyme_name : str
- Restriction enzyme name
- mapping : list
- The mapping function used. The options are iter or frag.
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (dict) – Dict of matching metadata dict objects
Example
Iterative:
    from tool import tb_parse_mapping

    genome_file = 'genome.fasta'
    root_name_1 = "/tmp/data/expt_source_1".split("/")
    root_name_2 = "/tmp/data/expt_source_2".split("/")
    windows = [[1, 25], [1, 50], [1, 75], [1, 100]]

    # Build one window index file name per window for each end of the pair
    windows1 = []
    windows2 = []
    for w in windows:
        tail = "_full_" + str(w[0]) + "-" + str(w[1]) + ".map"
        windows1.append('/'.join(root_name_1) + tail)
        windows2.append('/'.join(root_name_2) + tail)

    files = [genome_file] + windows1 + windows2

    tpm = tb_parse_mapping.tbParseMappingTool()
    metadata = {'enzyme_name': 'MboI', 'mapping': ['iter', 'iter'], 'expt_name': 'test'}
    # NOTE: run() is documented above with a third argument, output_files
    tpm_files, tpm_meta = tpm.run(files, metadata)
Fragment based mapping:
    from tool import tb_parse_mapping

    genome_file = 'genome.fasta'
    root_name_1 = "/tmp/data/expt_source_1".split("/")
    root_name_2 = "/tmp/data/expt_source_2".split("/")
    windows = [[1, 100]]
    start = str(windows[0][0])
    end = str(windows[0][1])

    # One full and one fragment window file for each end of the pair
    window1_1 = '/'.join(root_name_1) + "_full_" + start + "-" + end + ".map"
    window1_2 = '/'.join(root_name_1) + "_frag_" + start + "-" + end + ".map"
    window2_1 = '/'.join(root_name_2) + "_full_" + start + "-" + end + ".map"
    window2_2 = '/'.join(root_name_2) + "_frag_" + start + "-" + end + ".map"

    files = [genome_file, window1_1, window1_2, window2_1, window2_2]

    tpm = tb_parse_mapping.tbParseMappingTool()
    metadata = {'enzyme_name': 'MboI', 'mapping': ['frag', 'frag'], 'expt_name': 'test'}
    # NOTE: run() is documented above with a third argument, output_files
    tpm_files, tpm_meta = tpm.run(files, metadata)
-
tb_parse_mapping_frag
(**kwargs)[source]¶ Function to map the aligned reads and return the matching pairs
Parameters: - genome_seq (dict) – Object containing the sequence of each of the chromosomes
- enzyme_name (str) – Name of the enzyme used to digest the genome
- window1_full (str) – Location of the first window index file
- window1_frag (str) – Location of the second window index file
- window2_full (str) – Location of the first window index file
- window2_frag (str) – Location of the second window index file
- reads (str) – Location of the reads that have a matching location at both ends of the paired reads
Returns: reads – Location of the intersection of mapped reads that have matching reads in both paired-end files
Return type: str
-
tb_parse_mapping_iter
(**kwargs)[source]¶ Function to map the aligned reads and return the matching pairs
Parameters: - genome_seq (dict) – Object containing the sequence of each of the chromosomes
- enzyme_name (str) – Name of the enzyme used to digest the genome
- window1_1 (str) – Location of the first window index file
- window1_2 (str) – Location of the second window index file
- window1_3 (str) – Location of the third window index file
- window1_4 (str) – Location of the fourth window index file
- window2_1 (str) – Location of the first window index file
- window2_2 (str) – Location of the second window index file
- window2_3 (str) – Location of the third window index file
- window2_4 (str) – Location of the fourth window index file
- reads (str) – Location of the reads that have a matching location at both ends of the paired reads
Returns: reads – Location of the intersection of mapped reads that have matching reads in both paired-end files
Return type: str
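The idea of keeping only reads that are mapped at both ends can be illustrated with a toy in-memory sketch. This is not how TADbit implements the step; the function name `reads_mapped_at_both_ends`, the read IDs and the positions are all made up for illustration.

```python
def reads_mapped_at_both_ends(end1_hits, end2_hits):
    """Keep read IDs that have a mapped position for both ends of the pair."""
    shared = sorted(set(end1_hits) & set(end2_hits))
    return {read_id: (end1_hits[read_id], end2_hits[read_id]) for read_id in shared}


# Hypothetical mapped positions, keyed by read ID, for each end of the pair.
end1 = {"read_a": ("chr1", 1200), "read_b": ("chr2", 560)}
end2 = {"read_b": ("chr2", 91000), "read_c": ("chr3", 42)}

pairs = reads_mapped_at_both_ends(end1, end2)
print(pairs)
```

Only `read_b` survives here, because it is the only read with a mapping in both inputs.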
-
Filter Aligned Reads¶
-
class
tool.tb_filter.
tbFilterTool
(configuration=None)[source]¶ Tool for filtering out experimental artifacts from the aligned data
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to filter the reads to remove experimental artifacts
Parameters: - input_files (list) –
- reads : str
- Location of the reads that have a matching location at both ends of the paired reads
- metadata (dict) –
- conservative : bool
- Level of filtering to apply [DEFAULT : True]
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
tb_filter
(**kwargs)[source]¶ Function to filter out experimental artifacts
Parameters: - reads (str) – Location of the reads that have a matching location at both ends of the paired reads
- filtered_reads_file (str) – Location of the filtered reads
- conservative (bool) – Level of filtering [DEFAULT : True]
Returns: filtered_reads – Location of the filtered reads
Return type: str
-
Identify TADs and Compartments¶
-
class
tool.tb_segment.
tbSegmentTool
[source]¶ Tool for finding TADs and compartments in an adjacency matrix
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to predict TAD sites and compartments for a given resolution from the Hi-C matrix
Parameters: - input_files (list) –
- bamin : str
- Location of the tadbit bam paired reads
- biases : str
- Location of the pickle hic biases
- metadata (dict) –
- resolution : int
- Resolution of the Hi-C
- workdir : str
- Location of working directory
- ncpus : int
- Number of cpus to use
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
tb_segment
(**kwargs)[source]¶ Function to find TADs and compartments in the Hi-C matrix
Parameters: - bamin (str) – Location of the tadbit bam paired reads
- biases (str) – Location of the pickle hic biases
- resolution (int) – Resolution of the Hi-C
- callers (str) – 1 for TAD calling, 2 for compartment calling
- workdir (str) – Location of working directory
- ncpus (int) – Number of cpus to use
Returns: - compartments (str) – Location of tsv file with compartment definition
- tads (str) – Location of tsv file with tad definition
- filtered_bins (str) – Location of filtered_bins png
-
Normalize paired end reads file¶
-
class
tool.tb_normalize.
tbNormalizeTool
[source]¶ Tool for normalizing an adjacency matrix
-
run
(input_files, input_metadata, output_files)[source]¶ The main function for the normalization of the Hi-C matrix to a given resolution
Parameters: - input_files (list) –
- bamin : str
- Location of the tadbit bam paired reads
- metadata (dict) –
- normalization: str
- normalization(s) to apply. Order matters. Choices: [Vanilla, oneD]
- resolution : str
- Resolution of the Hi-C
- min_perc : str
- Lower percentile from which to consider bins as good.
- max_perc : str
- Upper percentile up to which to consider bins as good.
- workdir : str
- Location of working directory
- ncpus : str
- Number of cpus to use
- min_count : str
- Minimum number of reads mapped to a bin (a recommended value could be 2500). If set, this option overrides perc_zero
- fasta: str
- Location of the fasta file with genome sequence, to compute GC content and number of restriction sites per bin. Required for oneD normalization
- mappability: str
- Location of the file with mappability, required for oneD normalization
- rest_enzyme: str
- For oneD normalization. Name of the restriction enzyme used to do the Hi-C experiment
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
tb_normalize
(**kwargs)[source]¶ Function to normalize the Hi-C matrix to a given resolution
Parameters: - bamin (str) – Location of the tadbit bam paired reads
- normalization (str) – normalization(s) to apply. Order matters. Choices: [Vanilla, oneD]
- resolution (str) – Resolution of the Hi-C
- min_perc (str) – Lower percentile from which to consider bins as good.
- max_perc (str) – Upper percentile up to which to consider bins as good.
- workdir (str) – Location of working directory
- ncpus (str) – Number of cpus to use
- min_count (str) – Minimum number of reads mapped to a bin (a recommended value could be 2500). If set, this option overrides perc_zero
- fasta (str) – Location of the fasta file with genome sequence, to compute GC content and number of restriction sites per bin. Required for oneD normalization
- mappability (str) – Location of the file with mappability, required for oneD normalization
- rest_enzyme (str) – For oneD normalization. Name of the restriction enzyme used to do the Hi-C experiment
Returns: - hic_biases (str) – Location of HiC biases pickle file
- interactions (str) – Location of interaction decay vs genomic distance pdf
- filtered_bins (str) – Location of filtered_bins png
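Because oneD normalization needs the `fasta`, `mappability` and `rest_enzyme` values while Vanilla does not, it can be useful to check the metadata before launching the tool. The following is a minimal sketch of such a check; the helper name `missing_normalization_inputs` and the example paths are hypothetical, and the tool itself may validate its inputs differently.

```python
# Keys documented above as required for oneD normalization.
ONED_REQUIRED = ("fasta", "mappability", "rest_enzyme")


def missing_normalization_inputs(metadata):
    """Return the keys that oneD normalization needs but the metadata lacks."""
    if "oneD" not in metadata.get("normalization", ""):
        return []
    return [key for key in ONED_REQUIRED if not metadata.get(key)]


# Illustrative metadata only; the file path is a placeholder.
metadata = {"normalization": "oneD", "resolution": "100000", "fasta": "genome.fasta"}
print(missing_normalization_inputs(metadata))
```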
-
Extract binned matrix from paired end reads file¶
-
class
tool.tb_bin.
tbBinTool
[source]¶ Tool for binning an adjacency matrix
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to bin the Hi-C matrix to a given resolution
Parameters: - input_files (list) –
- bamin : str
- Location of the tadbit bam paired reads
- biases : str
- Location of the pickle hic biases
- input_metadata (dict) –
- resolution : int
- Resolution of the Hi-C
- coord1 : str
- Coordinate of the region to retrieve. By default the whole genome is used. The argument can be either a single chromosome name or a coordinate in the form: “-c chr3:110000000-120000000”
- coord2 : str
- Coordinate of a second region to retrieve the matrix in the intersection with the first region.
- norm : str
- [[‘raw’]] normalization(s) to apply. Order matters. Choices: [norm, decay, raw]
- workdir : str
- Location of working directory
- ncpus : int
- Number of cpus to use
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
tb_bin
(**kwargs)[source]¶ Function to bin the Hi-C matrix to a given resolution
Parameters: - bamin (str) – Location of the tadbit bam paired reads
- biases (str) – Location of the pickle hic biases
- resolution (int) – Resolution of the Hi-C
- coord1 (str) – Coordinate of the region to retrieve. By default the whole genome is used. The argument can be either a single chromosome name or a coordinate in the form: “-c chr3:110000000-120000000”
- coord2 (str) – Coordinate of a second region to retrieve the matrix in the intersection with the first region.
- norm (list) – [[‘raw’]] normalization(s) to apply. Order matters. Choices: [norm, decay, raw]
- workdir (str) – Location of working directory
- ncpus (int) – Number of cpus to use
Returns: - hic_contacts_matrix_raw (str) – Location of HiC raw matrix in text format
- hic_contacts_matrix_nrm (str) – Location of HiC normalized matrix in text format
- hic_contacts_matrix_raw_fig (str) – Location of HiC raw matrix in png format
- hic_contacts_matrix_norm_fig (str) – Location of HiC normalized matrix in png format
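The two accepted coordinate forms (a bare chromosome name, or `chrom:start-end`) and their relation to the resolution can be sketched as below. The helper name `parse_coord` is hypothetical, and converting positions to bin indices by floor division is an assumption made for illustration rather than TADbit's documented behaviour.

```python
import re

# Either "chr3" or "chr3:110000000-120000000".
COORD_PATTERN = re.compile(r"^(?P<chrom>[^:]+)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_coord(coord, resolution):
    """Return (chrom, first_bin, last_bin); bins are None for a whole chromosome."""
    match = COORD_PATTERN.match(coord.strip())
    if match is None:
        raise ValueError("unrecognised coordinate: {!r}".format(coord))
    chrom = match.group("chrom")
    if match.group("start") is None:
        return chrom, None, None
    start, end = int(match.group("start")), int(match.group("end"))
    if end < start:
        raise ValueError("region end is before region start")
    # Assumed binning convention: position // resolution.
    return chrom, start // resolution, end // resolution


region = parse_coord("chr3:110000000-120000000", 100000)
print(region)
```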
-
Save Matrix to HDF5 File¶
-
class
tool.tb_save_hdf5_matrix.
tbSaveAdjacencyHDF5Tool
[source]¶ Tool for saving the Hi-C adjacency list into an HDF5 file
-
run
(input_files, output_files, metadata=None)[source]¶ The main function to save the adjacency list from Hi-C into an HDF5 index file at the defined resolutions.
Parameters: - input_files (list) –
- adj_list : str
- Location of the adjacency list
- hdf5_file : str
- Location of the HDF5 output matrix file
- metadata (dict) –
- resolutions : list
- Levels of resolution for the adjacency list to be saved at
- assembly : str
- Assembly of the aligned sequences
- normalized : bool
- Whether the dataset should be normalised before saving
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
tb_matrix_hdf5
(**kwargs)[source]¶ Function to save the Hi-C matrix into an HDF5 file
This has to be run sequentially as it is not possible for multiple streams to write to the same HDF5 file. This is a run once and leave operation. There also needs to be a check that no other process is writing to the HDF5 file at the same time. This should be done at the staging and unstaging level to prevent the file from being written to by multiple processes and generating conflicts.
This needs to include attributes for the chromosomes for each resolution - See the mg-rest-adjacency hdf5_reader for further details about the requirement. This prevents the need for secondary storage details outside of the HDF5 file.
Parameters: - hic_data (hic_data) – Hi-C data object
- hdf5_file (str) – Location of the HDF5 output matrix file
- resolution (int) – Resolution to read the Hi-C adjacency list at
- chromosomes (list) – List of lists of the chromosome names and their size in the order that they are presented for indexing
Returns: hdf5_file – Location of the HDF5 output matrix file
Return type: str
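One way to enforce the single-writer requirement described above is a lock file created next to the HDF5 file. The sketch below uses only the Python standard library; the context manager name `exclusive_writer` and the `.lock` suffix are hypothetical, and this is not how the tool itself performs the check.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def exclusive_writer(hdf5_file):
    """Hold a lock file for the duration of a write to hdf5_file."""
    lock_path = hdf5_file + ".lock"
    # O_EXCL makes creation fail with FileExistsError if the lock already exists.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)


# Demonstration in a temporary directory: a second writer is refused.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "matrix.hdf5")
    with exclusive_writer(target):
        try:
            with exclusive_writer(target):
                second_writer_blocked = False
        except FileExistsError:
            second_writer_blocked = True

print(second_writer_blocked)
```

A lock file left behind by a crashed process would need manual removal, so a production check would likely also record the owning process or a timestamp.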
-
Generate TAD Predictions¶
-
class
tool.tb_generate_tads.
tbGenerateTADsTool
[source]¶ Tool for taking the adjacency lists and predicting TADs
-
run
(input_files, output_files, metadata=None)[source]¶ The main function to predict TAD sites for a given resolution from the Hi-C matrix
Parameters: - input_files (list) –
- adj_list : str
- Location of the adjacency list
- metadata (dict) –
- resolutions : list
- Levels of resolution for the adjacency list to be saved at
- assembly : str
- Assembly of the aligned sequences
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
tb_generate_tads
(**kwargs)[source]¶ Function to predict TAD sites for a given resolution from the Hi-C matrix
Parameters: - expt_name (str) – Location of the adjacency list
- matrix_file (str) – Location of the HDF5 output matrix file
- resolution (int) – Resolution to read the Hi-C adjacency list at
- tad_file (str) – Location of the output TAD file
Returns: tad_file – Location of the output TAD file
Return type: str
-
Generate 3D models from binned interaction matrix¶
-
class
tool.tb_model.
tbModelTool
[source]¶ Tool for generating 3D models from a binned interaction matrix
-
run
(input_files, input_metadata, output_files)[source]¶ The main function for generating 3D models from the normalized Hi-C matrix at a given resolution
Parameters: - input_files (list) –
- hic_contacts_matrix_norm : str
- Location of the tab-separated normalized matrix
- metadata (dict) –
- optimize_only: bool
- True to run only the optimization, False to compute the models and stats
- gen_pos_chrom_name : str
- Coordinates of the genomic region to model.
- resolution : str
- Resolution of the Hi-C
- gen_pos_begin : int
- Genomic coordinate from which to start modeling.
- gen_pos_end : int
- Genomic coordinate where to end modeling.
- num_mod_comp : int
- Number of models to compute for each optimization step.
- num_mod_comp : int
- Number of models to keep.
- max_dist : str
- Range of numbers for the optimal maxdist parameter, e.g. 400:1000:100; or just a single number, e.g. 800; or a list of numbers, e.g. 400 600 800 1000.
- upper_bound : int
- Range of numbers for the optimal upfreq parameter, e.g. 0:1.2:0.3; or just a single number, e.g. 0.8; or a list of numbers, e.g. 0.1 0.3 0.5 0.9.
- lower_bound : int
- Range of numbers for the optimal low parameter, e.g. -1.2:0:0.3; or just a single number, e.g. -0.8; or a list of numbers, e.g. -0.1 -0.3 -0.5 -0.9.
- cutoff : str
- Range of numbers for the optimal cutoff distance. The cutoff is computed based on the resolution, taking the diameter of a modeled particle in the 3D model as the reference. Given as a range, e.g. 1.5:2.5:0.5; or just a single number, e.g. 2; or a list of numbers, e.g. 2 2.5.
- workdir : str
- Location of working directory
- ncpus : str
- Number of cpus to use
Returns: - output_files (list) – List of locations for the output files.
- output_metadata (list) – List of matching metadata dict objects
-
tb_model
(**kwargs)[source]¶ Function to generate 3D models from the normalized Hi-C matrix at a given resolution
Parameters: - optimize_only (bool) – True to run only the optimization, False to compute the models and stats
- hic_contacts_matrix_norm (str) – Location of the tab-separated normalized matrix
- resolution (str) – Resolution of the Hi-C
- gen_pos_chrom_name (str) – Coordinates of the genomic region to model.
- gen_pos_begin (int) – Genomic coordinate from which to start modeling.
- gen_pos_end (int) – Genomic coordinate where to end modeling.
- num_mod_comp (int) – Number of models to compute for each optimization step.
- num_mod_comp – Number of models to keep.
- max_dist (str) – Range of numbers for the optimal maxdist parameter, e.g. 400:1000:100; or just a single number, e.g. 800; or a list of numbers, e.g. 400 600 800 1000.
- upper_bound (int) – Range of numbers for the optimal upfreq parameter, e.g. 0:1.2:0.3; or just a single number, e.g. 0.8; or a list of numbers, e.g. 0.1 0.3 0.5 0.9.
- lower_bound (int) – Range of numbers for the optimal low parameter, e.g. -1.2:0:0.3; or just a single number, e.g. -0.8; or a list of numbers, e.g. -0.1 -0.3 -0.5 -0.9.
- cutoff (str) – Range of numbers for the optimal cutoff distance. The cutoff is computed based on the resolution, taking the diameter of a modeled particle in the 3D model as the reference. Given as a range, e.g. 1.5:2.5:0.5; or just a single number, e.g. 2; or a list of numbers, e.g. 2 2.5.
- workdir (str) – Location of working directory
- ncpus (str) – Number of cpus to use
Returns: - tadkit_models (str) – Location of TADkit json file
- modeling_stats (str) – Location of the folder with the modeling files and stats
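The three accepted forms for `max_dist`, `upper_bound`, `lower_bound` and `cutoff` (a `start:end:step` range, a single number, or a space-separated list) can be expanded into explicit values as sketched below. The helper name `expand_range` is hypothetical, and treating the end of the range as inclusive is an assumption; the tool may interpret the range differently.

```python
import math


def expand_range(spec):
    """Expand 'start:end:step', a single number, or a space-separated list."""
    spec = spec.strip()
    if ":" in spec:
        start, end, step = (float(part) for part in spec.split(":"))
        if step <= 0 or end < start:
            raise ValueError("expected start <= end and a positive step")
        # Small epsilon guards against floating point error in the division.
        count = int(math.floor((end - start) / step + 1e-9)) + 1
        # Assumption: the end value is included when it falls on a step.
        return [round(start + i * step, 10) for i in range(count)]
    return [float(part) for part in spec.split()]


print(expand_range("400:1000:100"))
print(expand_range("0:1.2:0.3"))
print(expand_range("800"))
print(expand_range("400 600 800 1000"))
```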
-