Utility Functions¶

Common Functions¶

The following functions are used across multiple tools to transform the data when required.

class tool.common.cd(newpath)[source]¶: Context manager for changing the current working directory
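
A context manager of this kind can be sketched as follows. This is a minimal illustration, not the code from tool.common: the name change_dir and the restore-on-exit behaviour are assumptions based on the one-line description above.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def change_dir(newpath):
    """Temporarily switch the working directory (hypothetical helper)."""
    previous = os.getcwd()
    os.chdir(os.path.expanduser(newpath))
    try:
        yield
    finally:
        # Always return to the original directory, even after an exception
        os.chdir(previous)


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_real = os.path.realpath(tmp_dir)
    with change_dir(tmp_dir):
        inside = os.path.realpath(os.getcwd())
restored = os.getcwd()
```

Restoring the directory in a finally clause means a failure inside the with block does not leave the process in an unexpected location.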

Alignment Utilities¶

class tool.aligner_utils.alignerUtils[source]¶

Functions for downloading and processing N-seq FastQ files. Functions provided allow for the downloading and indexing of the genome assemblies.

static bowtie2_align_reads(genome_file, bam_loc, params, reads_file_1, reads_file_2=None)[source]¶

Map the reads to the genome using Bowtie2.

Parameters:

genome_file (str) – Location of the assembly file in the file system
reads_file_1 (str) – Location of the reads file in the file system
reads_file_2 (str) – Location of the second reads file for paired-end data (optional)
bam_loc (str) – Location of the output file

bowtie2_untar_index(genome_name, tar_file, bt2_1_file, bt2_2_file, bt2_3_file, bt2_4_file, bt2_rev1_file, bt2_rev2_file)[source]¶

Extracts the Bowtie2 index files from the genome index tar file.

Parameters:

genome_name (str) – Location string of the genome fasta file
tar_file (str) – Location of the Bowtie2 index tar file
bt2_1_file (str) – Location of the .1.bt2 index file
bt2_2_file (str) – Location of the .2.bt2 index file
bt2_3_file (str) – Location of the .3.bt2 index file
bt2_4_file (str) – Location of the .4.bt2 index file
bt2_rev1_file (str) – Location of the .rev.1.bt2 index file
bt2_rev2_file (str) – Location of the .rev.2.bt2 index file

Returns:	Boolean indicating if the task was successful
Return type:	bool

static bowtie_index_genome(genome_file)[source]¶

Create an index of the genome FASTA file with Bowtie2. The index files are saved alongside the assembly file.

Parameters:	genome_file (str) – Location of the assembly file in the file system

bwa_aln_align_reads_paired(genome_file, reads_file_1, reads_file_2, bam_loc, params)[source]¶

Map the paired-end reads to the genome using BWA ALN.

Parameters:

genome_file (str) – Location of the assembly file in the file system
reads_file_1 (str) – Location of the first reads file in the file system
reads_file_2 (str) – Location of the second reads file in the file system
bam_loc (str) – Location of the output file

bwa_aln_align_reads_single(genome_file, reads_file, bam_loc, params)[source]¶

Map the reads to the genome using BWA.

Parameters:

genome_file (str) – Location of the assembly file in the file system
reads_file (str) – Location of the reads file in the file system
bam_loc (str) – Location of the output file

static bwa_index_genome(genome_file)[source]¶

Create an index of the genome FASTA file with BWA. The index files are saved alongside the assembly file. If the index has already been generated then the locations of the existing files are returned.

Parameters:	genome_file (str) – Location of the assembly file in the file system
Returns:

amb_file (str) – Location of the amb file
ann_file (str) – Location of the ann file
bwt_file (str) – Location of the bwt file
pac_file (str) – Location of the pac file
sa_file (str) – Location of the sa file

Example

from tool.aligner_utils import alignerUtils
au_handle = alignerUtils()

indexes = au_handle.bwa_index_genome('/<data_dir>/human_GRCh38.fa.gz')
print(indexes)
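
The "return the existing files if already indexed" behaviour can be approximated by checking for the five BWA index files next to the assembly. The sketch below is an assumption about how such a check might look, not the implementation in alignerUtils; the helper names are hypothetical, and it relies only on the fact that BWA writes its index as the assembly file name plus the suffixes .amb, .ann, .bwt, .pac and .sa.

```python
import os
import tempfile

BWA_INDEX_SUFFIXES = ("amb", "ann", "bwt", "pac", "sa")


def expected_bwa_index_files(genome_file):
    """Map each BWA index suffix to its expected path (hypothetical helper)."""
    return {
        suffix: "{}.{}".format(genome_file, suffix)
        for suffix in BWA_INDEX_SUFFIXES
    }


def bwa_index_is_complete(genome_file):
    """Return True only if every expected index file is present."""
    paths = expected_bwa_index_files(genome_file)
    return all(os.path.isfile(path) for path in paths.values())


# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as work_dir:
    genome = os.path.join(work_dir, "genome.fa")
    paths = expected_bwa_index_files(genome)
    before = bwa_index_is_complete(genome)
    for path in paths.values():
        open(path, "w").close()
    after = bwa_index_is_complete(genome)
```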

static bwa_mem_align_reads(genome_file, bam_loc, params, reads_file_1, reads_file_2=None)[source]¶

Map the reads to the genome using BWA MEM.

Parameters:

genome_file (str) – Location of the assembly file in the file system
reads_file_1 (str) – Location of the reads file in the file system
reads_file_2 (str) – Location of the second reads file for paired-end data (optional)
bam_loc (str) – Location of the output file

bwa_untar_index(genome_name, tar_file, amb_file, ann_file, bwt_file, pac_file, sa_file)[source]¶

Extracts the BWA index files from the genome index tar file.

Parameters:

genome_name (str) – Location string of the genome fasta file
tar_file (str) – Location of the BWA index tar file
amb_file (str) – Location of the amb index file
ann_file (str) – Location of the ann index file
bwt_file (str) – Location of the bwt index file
pac_file (str) – Location of the pac index file
sa_file (str) – Location of the sa index file

Returns:	Boolean indicating if the task was successful
Return type:	bool

static gem_index_genome(genome_file, index_name=None)[source]¶

Create an index of the genome FASTA file with GEM. The index files are saved alongside the assembly file.

Parameters:	genome_file (str) – Location of the assembly file in the file system

static replaceENAHeader(file_path, file_out)[source]¶: ENA sequence headers contain pipes as part of the stable_id. This function replaces the ENA stable_id with its final section, obtained by splitting the stable ID on the pipe.
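
The header rewrite can be illustrated on a single line. The function below is a hypothetical sketch rather than the code behind replaceENAHeader; it assumes the stable ID is the first whitespace-delimited token of a FASTA header line, and the accession in the usage note is made up for illustration.

```python
def strip_ena_stable_id(header_line):
    """Keep only the last pipe-separated section of the stable ID.

    Lines that are not FASTA headers are returned unchanged.
    """
    line = header_line.rstrip("\n")
    if not line.startswith(">"):
        return line
    parts = line[1:].split(None, 1)
    if not parts:
        return line
    # e.g. "ENA|AB000001|AB000001.1" -> "AB000001.1"
    short_id = parts[0].split("|")[-1]
    if len(parts) == 2:
        return ">{} {}".format(short_id, parts[1])
    return ">" + short_id


example = strip_ena_stable_id(">ENA|AB000001|AB000001.1 example description")
sequence = strip_ena_stable_id("ACGTACGT")
```

Here example holds ">AB000001.1 example description", while the sequence line passes through untouched.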

Bam Utilities¶

class tool.bam_utils.bamUtils[source]¶

Tool for handling bam files

static bam_copy(bam_in, bam_out)[source]¶

Wrapper function to copy from one bam file to another

Parameters:	bam_in (str) – Location of the input bam file bam_out (str) – Location of the output bam file

static bam_count_reads(bam_file, aligned=False)[source]¶: Wrapper to count the number of (aligned) reads in a bam file

static bam_filter(bam_file, bam_file_out, filter_name)[source]¶

Wrapper for filtering out reads from a bam file

Parameters:

bam_file (str) – Location of the input bam file
bam_file_out (str) – Location of the output bam file
filter_name (str) – One of:
duplicate – Read is PCR or optical duplicate (1024)
supplementary – Reads that are chimeric, fusion or non-linearly aligned (2048)
unmapped – Read is unmapped or not the primary alignment (260)
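
The numbers in brackets are SAM FLAG bit masks. A read is removed by a filter when its flag has any of the bits in the mask set, which is the same test that samtools view -F applies. The sketch below shows how the three values decompose. The constant and function names are illustrative assumptions, not part of bamUtils.

```python
# Individual SAM FLAG bits, as defined in the SAM specification
FLAG_UNMAPPED = 0x4          # 4
FLAG_SECONDARY = 0x100       # 256, not the primary alignment
FLAG_DUPLICATE = 0x400       # 1024
FLAG_SUPPLEMENTARY = 0x800   # 2048

# Hypothetical lookup from filter name to exclusion mask
FILTER_MASKS = {
    "duplicate": FLAG_DUPLICATE,
    "supplementary": FLAG_SUPPLEMENTARY,
    "unmapped": FLAG_UNMAPPED | FLAG_SECONDARY,  # 4 + 256 = 260
}


def is_filtered_out(read_flag, filter_name):
    """Return True if the read would be removed by the named filter."""
    return bool(read_flag & FILTER_MASKS[filter_name])


# Flag 99: paired, properly paired, mate reverse strand, first in pair
kept = is_filtered_out(99, "unmapped")
# Flag 1123: the same read additionally marked as a duplicate (99 + 1024)
dropped = is_filtered_out(1123, "duplicate")
```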

static bam_index(bam_file, bam_idx_file)[source]¶

Wrapper for the pysam SAMtools index function

Parameters:	bam_file (str) – Location of the bam file that is to be indexed bam_idx_file (str) – Location of the bam index file (.bai)

static bam_list_chromosomes(bam_file)[source]¶

Wrapper to list the chromosome names that are present within the bam file

Parameters:	bam_file (str) – Location of the bam file
Returns:	List of the names of the chromosomes that are present in the bam file
Return type:	list

static bam_merge(*args)[source]¶

Wrapper for the pysam SAMtools merge function

Parameters:	bam_file_1 (str) – Location of the bam file to merge into bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1

static bam_paired_reads(bam_file)[source]¶: Wrapper to test if a bam file contains paired end reads

static bam_sort(bam_file)[source]¶

Wrapper for the pysam SAMtools sort function

Parameters:	bam_file (str) – Location of the bam file to sort

static bam_split(bam_file_in, bai_file, chromosome, bam_file_out)[source]¶

Wrapper to extract a single chromosome's worth of reads into a new bam file

Parameters:

bam_file_in (str) – Location of the input bam file
bai_file (str) – Location of the bam index file. This needs to be in the same directory as the bam_file_in
chromosome (str) – Name of the chromosome whose alignments are to be extracted
bam_file_out (str) – Location of the output bam file

static bam_stats(bam_file)[source]¶

Wrapper for the pysam SAMtools flagstat function

Parameters:	bam_file (str) – Location of the bam file
Returns:	list – qc_passed : int qc_failed : int description : str
Return type:	dict

static bam_to_bed(bam_file, bed_file)[source]¶: Function for converting bam files to bed files

static check_header(bam_file)[source]¶

Wrapper around pysam SAMtools for checking if a bam file is sorted

Parameters:	bam_file (str) – Location of the bam file
Returns:	True if the file has been sorted
Return type:	bool

static sam_to_bam(sam_file, bam_file)[source]¶: Function for converting sam files to bam files

class tool.bam_utils.bamUtilsTask[source]¶

Wrappers so that the functions above can be used as part of a @task within COMPSs, avoiding the files being copied around the infrastructure too many times.

bam_copy(**kwargs)[source]¶

Wrapper function to copy from one bam file to another

Parameters:	bam_in (str) – Location of the input bam file bam_out (str) – Location of the output bam file

bam_filter(**kwargs)[source]¶

Wrapper for filtering out reads from a bam file

Parameters:

bam_file (str) – Location of the input bam file
bam_file_out (str) – Location of the output bam file
filter (str) – One of:
duplicate – Read is PCR or optical duplicate (1024)
unmapped – Read is unmapped or not the primary alignment (260)

bam_index(**kwargs)[source]¶

Wrapper for the pysam SAMtools index function

Parameters:	bam_file (str) – Location of the bam file that is to be indexed bam_idx_file (str) – Location of the bam index file (.bai)

bam_list_chromosomes(**kwargs)[source]¶

Wrapper to get the list of chromosomes in a given bam file

Parameters:	bam_file (str) – Location of the bam file
Returns:	chromosome_list – List of the chromosomes in the bam file
Return type:	list

bam_merge(in_bam_job_files)[source]¶

Wrapper task taking any number of bam files and merging them into a single bam file.

Parameters:	in_bam_job_files (list) – List of the locations of the separate bam files that are to be merged. The first file in the list will be taken as the output file name.

bam_merge_10(**kwargs)[source]¶

Wrapper for the pysam SAMtools merge function

Parameters:

bam_file_1 (str) – Location of the bam file to merge into
bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_5 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_6 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_7 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_8 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_9 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_10 (str) – Location of the bam file that is to get merged into bam_file_1

bam_merge_2(**kwargs)[source]¶

Wrapper for the pysam SAMtools merge function

Parameters:	bam_file_1 (str) – Location of the bam file to merge into bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1

bam_merge_3(**kwargs)[source]¶

Wrapper for the pysam SAMtools merge function

Parameters:

bam_file_1 (str) – Location of the bam file to merge into
bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1

bam_merge_4(**kwargs)[source]¶

Wrapper for the pysam SAMtools merge function

Parameters:

bam_file_1 (str) – Location of the bam file to merge into
bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1

bam_merge_5(**kwargs)[source]¶

Wrapper for the pysam SAMtools merge function

Parameters:

bam_file_1 (str) – Location of the bam file to merge into
bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1
bam_file_5 (str) – Location of the bam file that is to get merged into bam_file_1

bam_paired_reads(**kwargs)[source]¶

Wrapper for the pysam SAMtools view function to identify if a bam file contains paired end reads

Parameters:	bam_file (str) – Location of the bam file that is to be checked
Returns:	True if the bam file contains paired end reads
Return type:	bool

bam_sort(**kwargs)[source]¶

Wrapper for the pysam SAMtools sort function

Parameters:	bam_file (str) – Location of the bam file to sort

bam_stats(**kwargs)[source]¶

Wrapper for the pysam SAMtools flagstat function

Parameters:	bam_file (str) – Location of the bam file that is to be indexed bam_idx_file (str) – Location of the bam index file (.bai)

check_header(**kwargs)[source]¶

Wrapper around pysam SAMtools for checking if a bam file is sorted

Parameters:	bam_file_1 (str) – Location of the bam file to merge into bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1

FASTQ Functions¶

The following functions are ones that are used for the manipulation of FASTQ files.

Reading¶

The following functions provide easy access for iterating through the entries of single or paired-end FASTQ files.

class tool.fastqreader.fastqreader[source]¶

Module for reading single end and paired end FASTQ files

closeFastQ()[source]¶: Close file handles for the FastQ files.

closeOutputFiles()[source]¶: Close the output file handles

createOutputFiles(tag='')[source]¶

Create and open the file handles for the output files

Parameters:	tag (str) – Tag to identify the output files (DEFAULT: ‘’)

eof(side=1)[source]¶

Indicate if the end of the file has been reached

Parameters:	side (int) – 1 or 2

incrementOutputFiles()[source]¶: Increment the counter and create new files for splitting the original FastQ paired end files.

next(side=1)[source]¶

Get the next read element for the specific FastQ file pair

Parameters:	side (int) – 1 or 2 to get the element from the relevant end (DEFAULT: 1)
Returns:

id (str) – Sequence ID
seq (str) – Called sequence
add (str) – Plus sign
score (str) – Base call score

Return type:	dict
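
Each FASTQ entry spans four lines, which map onto the four keys of the returned dictionary. The sketch below shows that mapping on an in-memory file. It is an assumption about the shape of the result, not the fastqreader implementation; read_fastq_record is a hypothetical name, and whether the real reader keeps the leading @ in the ID is not stated in this document.

```python
import io


def read_fastq_record(handle):
    """Read one four-line FASTQ entry, or return None at end of file."""
    lines = [handle.readline().rstrip("\n") for _ in range(4)]
    if not lines[0]:
        return None
    return {
        "id": lines[0],     # sequence ID line
        "seq": lines[1],    # called sequence
        "add": lines[2],    # plus sign separator
        "score": lines[3],  # base call scores
    }


fastq_handle = io.StringIO("@read1\nACGT\n+\nIIII\n")
first = read_fastq_record(fastq_handle)
second = read_fastq_record(fastq_handle)
```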

openFastQ(file1, file2=None)[source]¶

Create file handles for reading the FastQ files

Parameters:	file1 (str) – Location of the first FASTQ file file2 (str) – Location of a paired end FASTQ file.

writeOutput(read, side=1)[source]¶

Writer to print the extracted lines

Parameters:

read (dict) – Read is the dictionary object returned from self.next()
side (int) – The side that the read has come from (DEFAULT: 1)

Returns:	False if a value other than 1 or 2 is entered for the side.
Return type:	bool

Splitting¶

This tool has been created to aid in splitting FASTQ files into manageable chunks for parallel processing. It is able to work on single and paired end files.

class tool.fastq_splitter.fastq_splitter(configuration=None)[source]¶

Script for splitting up FASTQ files into manageable chunks

paired_splitter(**kwargs)[source]¶

Function to divide the paired-end FastQ files into separate sub files of 1,000,000 sequences so that the aligner can run in parallel.

Parameters:

in_file1 (str) – Location of first paired end FASTQ file
in_file2 (str) – Location of second paired end FASTQ file
tag (str) – Tag used to identify the files (DEFAULT: tmp). Useful if this is getting run manually on a single machine multiple times to prevent collisions of file names

Returns:

paired_files (list) – List of lists of the paired-end files that have been generated. Each sub-list contains the two paired-end files for that subset.
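
The chunking logic can be sketched in memory as follows. This is an illustration under stated assumptions, not the splitter's own code: the helper names are hypothetical, records are represented as plain strings instead of four-line FASTQ entries, and the real tool writes each chunk to a file rather than holding it in a list.

```python
import itertools


def chunk_records(records, chunk_size=1000000):
    """Yield successive lists of at most chunk_size records."""
    iterator = iter(records)
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def split_paired(records_1, records_2, chunk_size=1000000):
    """Chunk both ends in step so each sub-list holds a matching pair."""
    chunks_1 = chunk_records(records_1, chunk_size)
    chunks_2 = chunk_records(records_2, chunk_size)
    return [[part_1, part_2] for part_1, part_2 in zip(chunks_1, chunks_2)]


reads_1 = ["r1/1", "r2/1", "r3/1", "r4/1", "r5/1"]
reads_2 = ["r1/2", "r2/2", "r3/2", "r4/2", "r5/2"]
pairs = split_paired(reads_1, reads_2, chunk_size=2)
```

With five reads and a chunk size of two, pairs contains three sub-lists, the last holding the single remaining read from each end.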

run(input_files, input_metadata, output_files)[source]¶

The main function to run the splitting of FASTQ files (single or paired) so that they can be aligned in a distributed manner

Parameters:

input_files (dict) – List of input fastq file locations
input_metadata (dict) –
output_files (dict) –

Returns:

output_file (str) – Location of the compressed archive (.tar.gz) of the split FASTQ files
output_names (list) – List of file names in the compressed file

single_splitter(**kwargs)[source]¶

Function to divide the FastQ files into separate sub files of 1,000,000 sequences so that the aligner can run in parallel.

Parameters:

in_file1 (str) – Location of first FASTQ file
tag (str) – Tag used to identify the files (DEFAULT: tmp). Useful if this is getting run manually on a single machine multiple times to prevent collisions of file names

Returns:

list – List of the files that have been generated.

Entry Functions¶

The following functions allow for manipulating FASTQ files.

class tool.fastq_utils.fastqUtils[source]¶

Set of methods to help with the management of FastQ files.

static fastq_match_paired_ends(fastq_1, fastq_2)[source]¶

Take 2 fastq files and remove ends that don’t have a matching pair. Requires that the FastQ files are ordered correctly.

Mismatches can occur if there is a filtering step that removes one of the paired ends.

Parameters:	fastq_1 (str) – Location of FastQ file fastq_2 (str) – Location of FastQ file
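
One way to match ordered files is a two-pointer walk that advances whichever side has the smaller read ID and keeps a record only when both IDs agree. The sketch below illustrates that idea on in-memory (id, sequence) tuples. It is not the method used by fastq_match_paired_ends, and it assumes that "ordered correctly" means both inputs are sorted by read ID.

```python
def match_paired_ends(reads_1, reads_2):
    """Keep only the reads whose ID appears in both sorted inputs."""
    matched_1, matched_2 = [], []
    i = j = 0
    while i < len(reads_1) and j < len(reads_2):
        id_1, id_2 = reads_1[i][0], reads_2[j][0]
        if id_1 == id_2:
            matched_1.append(reads_1[i])
            matched_2.append(reads_2[j])
            i += 1
            j += 1
        elif id_1 < id_2:
            i += 1  # read only present in the first file
        else:
            j += 1  # read only present in the second file
    return matched_1, matched_2


end_1 = [("a", "AC"), ("b", "GG"), ("d", "TT")]
end_2 = [("a", "TG"), ("c", "CC"), ("d", "AA")]
kept_1, kept_2 = match_paired_ends(end_1, end_2)
```

Reads b and c have no partner, so only a and d survive in both outputs.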

static fastq_randomise(fastq, output=None)[source]¶

Randomise the order of reads within a FastQ file

Parameters:

fastq (str) – Location of the FastQ file to randomise
output (str) – [OPTIONAL] Location of the output FastQ file. If left blank then the randomised output is saved to the same location as fastq
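
Randomising a FASTQ file has to move each four-line entry as a unit so that IDs, sequences and scores stay together. The sketch below shows this on a list of lines. The helper names and the seed parameter are assumptions made for illustration and are not part of fastqUtils.

```python
import random


def group_records(lines):
    """Group a flat list of FASTQ lines into four-line records."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]


def randomise_records(lines, seed=None):
    """Shuffle whole records and return the flattened lines."""
    records = group_records(lines)
    random.Random(seed).shuffle(records)
    return [line for record in records for line in record]


fastq_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "HHHH",
    "@read3", "CCAG", "+", "GGGG",
]
shuffled = randomise_records(fastq_lines, seed=1)
```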

static fastq_sort_file(fastq, output=None)[source]¶

Sort a FastQ file

Parameters:

fastq (str) – Location of the FastQ file to sort
output (str) – [OPTIONAL] Location of the output FastQ file. If left blank then the sorted output is saved to the same location as fastq