Utility Functions¶
Common Functions¶
The following functions are used across multiple tools to transform data when required.
Alignment Utilities¶
-
class
tool.aligner_utils.
alignerUtils
[source]¶ Functions for downloading and processing N-seq FastQ files. Functions provided allow for the downloading and indexing of the genome assemblies.
-
static
bowtie2_align_reads
(genome_file, bam_loc, params, reads_file_1, reads_file_2=None)[source]¶ Map the reads to the genome using Bowtie2.
Parameters: - genome_file (str) – Location of the assembly file in the file system
- bam_loc (str) – Location of the output file
- params – Additional parameters passed to the aligner
- reads_file_1 (str) – Location of the reads file in the file system
- reads_file_2 (str) – [OPTIONAL] Location of the second reads file for paired-end data
-
bowtie2_untar_index
(genome_name, tar_file, bt2_1_file, bt2_2_file, bt2_3_file, bt2_4_file, bt2_rev1_file, bt2_rev2_file)[source]¶ Extracts the Bowtie2 index files from the genome index tar file.
Parameters: - genome_name (str) – Location string of the genome fasta file
- tar_file (str) – Location of the Bowtie2 index tar file
- bt2_1_file (str) – Location of the .1.bt2 index file
- bt2_2_file (str) – Location of the .2.bt2 index file
- bt2_3_file (str) – Location of the .3.bt2 index file
- bt2_4_file (str) – Location of the .4.bt2 index file
- bt2_rev1_file (str) – Location of the .rev.1.bt2 index file
- bt2_rev2_file (str) – Location of the .rev.2.bt2 index file
Returns: Boolean indicating if the task was successful
Return type: bool
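The untar step can be illustrated with a minimal sketch built on Python's standard tarfile module. The helper name untar_bowtie2_index, the out_dir argument and the assumption that the archive holds files named <genome_name>.1.bt2 and so on are all hypothetical; this is not the code used by alignerUtils.

```python
import os
import shutil
import tarfile

# Suffixes that Bowtie2 gives to the six files of a (small) index.
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def untar_bowtie2_index(genome_name, tar_file, out_dir):
    """Hypothetical helper: unpack tar_file into out_dir and check the index.

    Returns True only if all six expected index files exist afterwards.
    """
    with tarfile.open(tar_file) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links and other special members
            # Write by base name only so that members cannot escape out_dir.
            target = os.path.join(out_dir, os.path.basename(member.name))
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)

    expected = [os.path.join(out_dir, genome_name + sfx) for sfx in BT2_SUFFIXES]
    return all(os.path.isfile(path) for path in expected)
```

Returning a boolean mirrors the documented return value, so a caller can decide whether the index needs to be rebuilt.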
-
static
bowtie_index_genome
(genome_file)[source]¶ Create an index of the genome FASTA file with Bowtie2. These are saved alongside the assembly file.
Parameters: genome_file (str) – Location of the assembly file in the file system
-
bwa_aln_align_reads_paired
(genome_file, reads_file_1, reads_file_2, bam_loc, params)[source]¶ Map the paired-end reads to the genome using BWA.
Parameters: - genome_file (str) – Location of the assembly file in the file system
- reads_file_1 (str) – Location of the first reads file in the file system
- reads_file_2 (str) – Location of the second reads file in the file system
- bam_loc (str) – Location of the output file
- params – Additional parameters passed to the aligner
-
bwa_aln_align_reads_single
(genome_file, reads_file, bam_loc, params)[source]¶ Map the reads to the genome using BWA.
Parameters: - genome_file (str) – Location of the assembly file in the file system
- reads_file (str) – Location of the reads file in the file system
- bam_loc (str) – Location of the output file
-
static
bwa_index_genome
(genome_file)[source]¶ Create an index of the genome FASTA file with BWA. The index files are saved alongside the assembly file. If the index has already been generated, the locations of the existing files are returned.
Parameters: genome_file (str) – Location of the assembly file in the file system
Returns: - amb_file (str) – Location of the amb file
- ann_file (str) – Location of the ann file
- bwt_file (str) – Location of the bwt file
- pac_file (str) – Location of the pac file
- sa_file (str) – Location of the sa file
Example
    from tool.aligner_utils import alignerUtils

    # Index the assembly; the locations of the index files are returned
    au_handle = alignerUtils()
    indexes = au_handle.bwa_index_genome('/<data_dir>/human_GRCh38.fa.gz')
    print(indexes)
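The "return the existing files if the index is already there" behaviour can be sketched as follows. BWA writes its index next to the FASTA file by appending the suffixes amb, ann, bwt, pac and sa to the file name; the helper names bwa_index_locations and bwa_index_exists are hypothetical, and the exact paths used by bwa_index_genome may differ.

```python
import os

# Suffixes that `bwa index` appends to the name of the FASTA file.
BWA_SUFFIXES = ("amb", "ann", "bwt", "pac", "sa")


def bwa_index_locations(genome_file):
    """Hypothetical helper: map each BWA index suffix to its expected path."""
    return {sfx: genome_file + "." + sfx for sfx in BWA_SUFFIXES}


def bwa_index_exists(genome_file):
    """Return True if every expected index file is already on disk."""
    locations = bwa_index_locations(genome_file)
    return all(os.path.isfile(path) for path in locations.values())
```

A caller could run the indexer only when bwa_index_exists returns False and otherwise hand back bwa_index_locations directly.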
-
static
bwa_mem_align_reads
(genome_file, bam_loc, params, reads_file_1, reads_file_2=None)[source]¶ Map the reads to the genome using BWA.
Parameters: - genome_file (str) – Location of the assembly file in the file system
- bam_loc (str) – Location of the output file
- params – Additional parameters passed to the aligner
- reads_file_1 (str) – Location of the reads file in the file system
- reads_file_2 (str) – [OPTIONAL] Location of the second reads file for paired-end data
-
bwa_untar_index
(genome_name, tar_file, amb_file, ann_file, bwt_file, pac_file, sa_file)[source]¶ Extracts the BWA index files from the genome index tar file.
Parameters: - genome_name (str) – Location string of the genome fasta file
- tar_file (str) – Location of the BWA index tar file
- amb_file (str) – Location of the amb index file
- ann_file (str) – Location of the ann index file
- bwt_file (str) – Location of the bwt index file
- pac_file (str) – Location of the pac index file
- sa_file (str) – Location of the sa index file
Returns: Boolean indicating if the task was successful
Return type: bool
Bam Utilities¶
-
class
tool.bam_utils.
bamUtils
[source]¶ Tool for handling bam files
-
static
bam_copy
(bam_in, bam_out)[source]¶ Wrapper function to copy from one bam file to another
Parameters: - bam_in (str) – Location of the input bam file
- bam_out (str) – Location of the output bam file
-
static
bam_count_reads
(bam_file, aligned=False)[source]¶ Wrapper to count the number of (aligned) reads in a bam file
-
static
bam_filter
(bam_file, bam_file_out, filter_name)[source]¶ Wrapper for filtering out reads from a bam file
Parameters: - bam_file (str) – Location of the input bam file
- bam_file_out (str) – Location of the filtered output bam file
- filter_name (str) – Name of the filter to apply. One of:
- duplicate - Read is PCR or optical duplicate (1024)
- supplementary - Reads that are chimeric, fusion or non linearly aligned (2048)
- unmapped - Read is unmapped or not the primary alignment (260)
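The numbers in brackets above are SAM flag bitmasks. The sketch below shows how such a mask decides whether a read is dropped; the value 260 combines the "unmapped" bit (4) with the "not primary alignment" bit (256). The function name is_filtered_out is hypothetical, and the sketch assumes the "exclude a read if any masked bit is set" behaviour of samtools view -F rather than reproducing the code of bam_filter.

```python
# Individual SAM flag bits, as defined in the SAM specification.
UNMAPPED = 4
SECONDARY = 256
DUPLICATE = 1024
SUPPLEMENTARY = 2048

# Filter names from the documentation mapped to their bitmasks.
FILTER_MASKS = {
    "duplicate": DUPLICATE,
    "supplementary": SUPPLEMENTARY,
    "unmapped": UNMAPPED | SECONDARY,  # 4 + 256 = 260
}


def is_filtered_out(read_flag, filter_name):
    """Return True if a read with this flag would be removed by the filter."""
    return bool(read_flag & FILTER_MASKS[filter_name])
```

For example, a properly paired first mate (flag 99) passes the duplicate filter, while the same read marked as a duplicate (flag 99 + 1024) does not.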
-
static
bam_index
(bam_file, bam_idx_file)[source]¶ Wrapper for the pysam SAMtools index function
Parameters: - bam_file (str) – Location of the bam file that is to be indexed
- bam_idx_file (str) – Location of the bam index file (.bai)
-
static
bam_list_chromosomes
(bam_file)[source]¶ Wrapper to list the chromosome names that are present within the bam file
Parameters: bam_file (str) – Location of the bam file
Returns: List of the names of the chromosomes that are present in the bam file
Return type: list
-
static
bam_merge
(*args)[source]¶ Wrapper for the pysam SAMtools merge function
Parameters: - bam_file_1 (str) – Location of the bam file to merge into
- bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
-
static
bam_sort
(bam_file)[source]¶ Wrapper for the pysam SAMtools sort function
Parameters: bam_file (str) – Location of the bam file to sort
-
static
bam_split
(bam_file_in, bai_file, chromosome, bam_file_out)[source]¶ Wrapper to extract a single chromosome's worth of reads into a new bam file
Parameters: - bam_file_in (str) – Location of the input bam file
- bai_file (str) – Location of the bam index file. This needs to be in the same directory as the bam_file_in
- chromosome (str) – Name of the chromosome whose alignments are to be extracted
- bam_file_out (str) – Location of the output bam file
-
static
bam_stats
(bam_file)[source]¶ Wrapper for the pysam SAMtools flagstat function
Parameters: bam_file (str) – Location of the bam file
Returns: list –
- qc_passed : int
- qc_failed : int
- description : str
Return type: dict
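The three fields map naturally onto the lines printed by samtools flagstat, each of which has the form "<passed> + <failed> <description>". The following is a minimal parsing sketch under that assumption; parse_flagstat is a hypothetical name and the sample text used with it is illustrative, not real output from bam_stats.

```python
import re

# One flagstat line: "<qc_passed> + <qc_failed> <description>".
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Turn flagstat text into a list of dicts, one per recognised line."""
    results = []
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match is None:
            continue  # ignore blank or unexpected lines
        results.append({
            "qc_passed": int(match.group(1)),
            "qc_failed": int(match.group(2)),
            "description": match.group(3),
        })
    return results
```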
-
class
tool.bam_utils.
bamUtilsTask
[source]¶ Wrappers so that the functions above can be used as part of a @task within COMPSs, avoiding the files being copied around the infrastructure too many times
-
bam_copy
(**kwargs)[source]¶ Wrapper function to copy from one bam file to another
Parameters: - bam_in (str) – Location of the input bam file
- bam_out (str) – Location of the output bam file
-
bam_filter
(**kwargs)[source]¶ Wrapper for filtering out reads from a bam file
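Parameters: - bam_file (str) – Location of the input bam file
- bam_file_out (str) – Location of the filtered output bam file
- filter (str) – Name of the filter to apply. One of:
- duplicate - Read is PCR or optical duplicate (1024)
- unmapped - Read is unmapped or not the primary alignment (260)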
-
bam_index
(**kwargs)[source]¶ Wrapper for the pysam SAMtools index function
Parameters: - bam_file (str) – Location of the bam file that is to be indexed
- bam_idx_file (str) – Location of the bam index file (.bai)
-
bam_list_chromosomes
(**kwargs)[source]¶ Wrapper to get the list of chromosomes in a given bam file
Parameters: bam_file (str) – Location of the bam file
Returns: chromosome_list – List of the chromosomes in the bam file
Return type: list
-
bam_merge
(in_bam_job_files)[source]¶ Wrapper task taking any number of bam files and merging them into a single bam file.
Parameters: in_bam_job_files (list) – List of the locations of the separate bam files that are to be merged. The first file in the list will be taken as the output file name
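One way to merge an arbitrary number of files with fixed-size merge tasks is to work in rounds, as the sketch below plans out. It is only an illustration: plan_merge_rounds is a hypothetical helper, the batch size of 10 is borrowed from the largest wrapper (bam_merge_10), and the documentation does not state that bam_merge schedules its work this way. No files are touched; the function only returns the plan.

```python
def plan_merge_rounds(bam_files, batch_size=10):
    """Hypothetical planner: group files into batches for repeated merging.

    Every batch is merged into its first file, so only the first file of each
    batch is carried into the next round. Returns a list of rounds, where
    each round is a list of batches.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")

    rounds = []
    remaining = list(bam_files)
    while len(remaining) > 1:
        batches = [
            remaining[start:start + batch_size]
            for start in range(0, len(remaining), batch_size)
        ]
        rounds.append(batches)
        # The merged result of each batch lives in the batch's first file.
        remaining = [batch[0] for batch in batches]
    return rounds
```

With this plan the final merged data ends up in the first file of the input list, which matches the documented convention that the first file is taken as the output file name.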
-
bam_merge_10
(**kwargs)[source]¶ Wrapper for the pysam SAMtools merge function
Parameters: - bam_file_1 (str) – Location of the bam file to merge into
- bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_5 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_6 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_7 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_8 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_9 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_10 (str) – Location of the bam file that is to get merged into bam_file_1
-
bam_merge_2
(**kwargs)[source]¶ Wrapper for the pysam SAMtools merge function
Parameters: - bam_file_1 (str) – Location of the bam file to merge into
- bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
-
bam_merge_3
(**kwargs)[source]¶ Wrapper for the pysam SAMtools merge function
Parameters: - bam_file_1 (str) – Location of the bam file to merge into
- bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
-
bam_merge_4
(**kwargs)[source]¶ Wrapper for the pysam SAMtools merge function
Parameters: - bam_file_1 (str) – Location of the bam file to merge into
- bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1
-
bam_merge_5
(**kwargs)[source]¶ Wrapper for the pysam SAMtools merge function
Parameters: - bam_file_1 (str) – Location of the bam file to merge into
- bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1
- bam_file_5 (str) – Location of the bam file that is to get merged into bam_file_1
-
bam_paired_reads
(**kwargs)[source]¶ Wrapper for the pysam SAMtools view function to identify if a bam file contains paired end reads
Parameters: bam_file (str) – Location of the bam file that is to be checked
Returns: True if the bam file contains paired end reads
Return type: bool
-
bam_sort
(**kwargs)[source]¶ Wrapper for the pysam SAMtools sort function
Parameters: bam_file (str) – Location of the bam file to sort
-
FASTQ Functions¶
The following functions are used to manipulate FASTQ files.
Reading¶
The following functions provide easy access for iterating through the entries of single or paired FASTQ files.
-
class
tool.fastqreader.
fastqreader
[source]¶ Module for reading single end and paired end FASTQ files
-
createOutputFiles
(tag='')[source]¶ Create and open the file handles for the output files
Parameters: tag (str) – Tag to identify the output files (DEFAULT: ‘’)
-
eof
(side=1)[source]¶ Indicate if the end of the file has been reached
Parameters: side (int) – 1 or 2
-
incrementOutputFiles
()[source]¶ Increment the counter and create new files for splitting the original FastQ paired end files.
-
next
(side=1)[source]¶ Get the next read element for the specific FastQ file pair
Parameters: side (int) – 1 or 2 to get the element from the relevant end (DEFAULT: 1)
Returns: - id (str) – Sequence ID
- seq (str) – Called sequence
- add (str) – Plus sign
- score (str) – Base call score
Return type: dict
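The four keys correspond to the four lines of a FASTQ record. The following minimal sketch reads one record from an open handle into a dict of that shape; read_fastq_record is a hypothetical function, not the fastqreader implementation, and it assumes that the ID is kept exactly as it appears in the file (including the leading "@").

```python
import io


def read_fastq_record(handle):
    """Read the next four-line FASTQ record, or return None at end of file."""
    lines = [handle.readline().rstrip("\n") for _ in range(4)]
    if not lines[0]:
        return None  # nothing left to read
    return {
        "id": lines[0],     # sequence ID line
        "seq": lines[1],    # called sequence
        "add": lines[2],    # separator line starting with a plus sign
        "score": lines[3],  # base call scores
    }


# Usage with an in-memory file holding a single record.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
record = read_fastq_record(example)
print(record["id"], record["seq"])
```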
-
Splitting¶
This tool has been created to aid in splitting FASTQ files into manageable chunks for parallel processing. It is able to work on single and paired end files.
-
class
tool.fastq_splitter.
fastq_splitter
(configuration=None)[source]¶ Script for splitting up FASTQ files into manageable chunks
-
paired_splitter
(**kwargs)[source]¶ Function to divide the paired-end FastQ files into separate sub files of 1,000,000 sequences so that the aligner can run in parallel.
Parameters: - in_file1 (str) – Location of first paired end FASTQ file
- in_file2 (str) – Location of second paired end FASTQ file
- tag (str) – Tag used to identify the files (DEFAULT: tmp). Useful if this is getting run manually on a single machine multiple times to prevent collisions of file names
Returns: paired_files (list) – List of lists of paired-end files. Each sub list contains the two paired-end files for that subset.
-
run
(input_files, input_metadata, output_files)[source]¶ The main function to run the splitting of FASTQ files (single or paired) so that they can be aligned in a distributed manner
Parameters: - input_files (dict) – List of input fastq file locations
- input_metadata (dict) –
- output_files (dict) –
Returns: - output_file (str) – Location of compressed (.tar.gz) of the split FASTQ files
- output_names (list) – List of file names in the compressed file
-
single_splitter
(**kwargs)[source]¶ Function to divide the FastQ files into separate sub files of 1,000,000 sequences so that the aligner can run in parallel.
Parameters: - in_file1 (str) – Location of the FASTQ file
- tag (str) – Tag used to identify the files (DEFAULT: tmp). Useful if this is getting run manually on a single machine multiple times to prevent collisions of file names
Returns: List of the files that have been generated.
Return type: list
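The chunking idea behind the splitter can be sketched with the standard library alone. The function split_fastq, its out_dir argument and the output naming scheme <tag>_<n>.fastq are hypothetical choices for this illustration; only the default of 1,000,000 sequences per file and the tmp tag come from the description above. Each chunk is held in memory before being written, which keeps the sketch short but is not tuned for very large chunk sizes.

```python
import itertools
import os


def split_fastq(in_file, out_dir, tag="tmp", chunk_size=1000000):
    """Hypothetical splitter: write chunk_size records per output file."""
    out_files = []
    with open(in_file) as handle:
        for index in itertools.count():
            # Each FASTQ record spans exactly four lines.
            lines = list(itertools.islice(handle, 4 * chunk_size))
            if not lines:
                break  # input exhausted
            out_path = os.path.join(out_dir, "%s_%d.fastq" % (tag, index))
            with open(out_path, "w") as out:
                out.writelines(lines)
            out_files.append(out_path)
    return out_files
```

For paired-end data the same loop could be run over both files in lockstep so that each pair of sub files covers the same reads.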
-
Entry Functions¶
The following functions allow for manipulating FASTQ files.
-
class
tool.fastq_utils.
fastqUtils
[source]¶ Set of methods to help with the management of FastQ files.
-
static
fastq_match_paired_ends
(fastq_1, fastq_2)[source]¶ Take 2 fastq files and remove ends that don’t have a matching pair. Requires that the FastQ files are ordered correctly.
Mismatches can occur if a filtering step removes one of the paired ends.
Parameters: - fastq_1 (str) – Location of FastQ file
- fastq_2 (str) – Location of FastQ file
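The effect of matching paired ends can be illustrated on in-memory records. This sketch uses a set intersection of read IDs rather than walking two ordered files in step, so it is a simplified stand-in and not the method used by fastq_match_paired_ends. The function name match_paired_ends is hypothetical, and the sketch assumes that both mates carry exactly the same ID string.

```python
def match_paired_ends(records_1, records_2):
    """Keep only the records whose ID appears in both lists, preserving order."""
    shared = {rec["id"] for rec in records_1} & {rec["id"] for rec in records_2}
    kept_1 = [rec for rec in records_1 if rec["id"] in shared]
    kept_2 = [rec for rec in records_2 if rec["id"] in shared]
    return kept_1, kept_2
```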
-
static
fastq_randomise
(fastq, output=None)[source]¶ Randomise the order of reads within a FastQ file
Parameters: - fastq (str) – Location of the FastQ file to randomise
- output (str) – [OPTIONAL] Location of the output FastQ file. If left blank then the randomised output is saved to the same location as fastq
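Randomising must move whole records, never individual lines, so that each sequence stays with its own ID and scores. A minimal in-memory sketch is given below; randomise_records and its seed argument are hypothetical and are not part of fastqUtils.

```python
import random


def randomise_records(records, seed=None):
    """Return a shuffled copy of the records, leaving the input untouched."""
    rng = random.Random(seed)  # a seed makes the order reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled
```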
-
static