Utility Functions

Common Functions

The following functions are shared across multiple tools and are used to transform the data when required.

class tool.common.cd(newpath)[source]

Context manager for changing the current working directory
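
As an illustration of the pattern, here is a minimal sketch of a directory-changing context manager. The class name `cd_sketch` and the `demo` helper are hypothetical stand-ins written for this example. They are not the source of `tool.common.cd`, and the sketch assumes the previous working directory should be restored on exit.

```python
import os
import tempfile


class cd_sketch(object):
    """Hypothetical stand-in for a directory-changing context manager."""

    def __init__(self, newpath):
        self.newpath = os.path.expanduser(newpath)
        self.oldpath = None

    def __enter__(self):
        # Remember the starting directory so that it can be restored on exit
        self.oldpath = os.getcwd()
        os.chdir(self.newpath)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Return to the original directory, even if the block raised
        os.chdir(self.oldpath)
        return False


def demo():
    """Enter a temporary directory and report the directories seen."""
    start = os.getcwd()
    target = os.path.realpath(tempfile.mkdtemp())
    with cd_sketch(target):
        inside = os.path.realpath(os.getcwd())
    after = os.getcwd()
    return start, target, inside, after
```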

Alignment Utilities

class tool.aligner_utils.alignerUtils[source]

Functions for downloading and processing N-seq FastQ files. Functions provided allow for the downloading and indexing of the genome assemblies.

static bowtie2_align_reads(genome_file, bam_loc, params, reads_file_1, reads_file_2=None)[source]

Map the reads to the genome using Bowtie2.

Parameters:
  • genome_file (str) – Location of the assembly file in the file system
  • reads_file_1 (str) – Location of the reads file in the file system
  • reads_file_2 (str) – Location of the second reads file for paired-end data (optional)
  • bam_loc (str) – Location of the output file
bowtie2_untar_index(genome_name, tar_file, bt2_1_file, bt2_2_file, bt2_3_file, bt2_4_file, bt2_rev1_file, bt2_rev2_file)[source]

Extracts the Bowtie2 index files from the genome index tar file.

Parameters:
  • genome_name (str) – Location string of the genome fasta file
  • tar_file (str) – Location of the Bowtie2 index tar file
  • bt2_1_file (str) – Location of the .1.bt2 index file
  • bt2_2_file (str) – Location of the .2.bt2 index file
  • bt2_3_file (str) – Location of the .3.bt2 index file
  • bt2_4_file (str) – Location of the .4.bt2 index file
  • bt2_rev1_file (str) – Location of the .rev.1.bt2 index file
  • bt2_rev2_file (str) – Location of the .rev.2.bt2 index file
Returns:

Boolean indicating if the task was successful

Return type:

bool

static bowtie_index_genome(genome_file)[source]

Create an index of the genome FASTA file with Bowtie2. These are saved alongside the assembly file.

Parameters:genome_file (str) – Location of the assembly file in the file system
bwa_aln_align_reads_paired(genome_file, reads_file_1, reads_file_2, bam_loc, params)[source]

Map the reads to the genome using BWA.

Parameters:
  • genome_file (str) – Location of the assembly file in the file system
  • reads_file_1 (str) – Location of the first reads file in the file system
  • reads_file_2 (str) – Location of the second (paired-end) reads file in the file system
  • bam_loc (str) – Location of the output file
bwa_aln_align_reads_single(genome_file, reads_file, bam_loc, params)[source]

Map the reads to the genome using BWA.

Parameters:
  • genome_file (str) – Location of the assembly file in the file system
  • reads_file (str) – Location of the reads file in the file system
  • bam_loc (str) – Location of the output file

static bwa_index_genome(genome_file)[source]

Create an index of the genome FASTA file with BWA. The index files are saved alongside the assembly file. If the index has already been generated, the locations of the existing files are returned.

Parameters:genome_file (str) – Location of the assembly file in the file system
Returns:
  • amb_file (str) – Location of the amb file
  • ann_file (str) – Location of the ann file
  • bwt_file (str) – Location of the bwt file
  • pac_file (str) – Location of the pac file
  • sa_file (str) – Location of the sa file

Example

from tool.aligner_utils import alignerUtils
au_handle = alignerUtils()

indexes = au_handle.bwa_index_genome('/<data_dir>/human_GRCh38.fa.gz')
print(indexes)
static bwa_mem_align_reads(genome_file, bam_loc, params, reads_file_1, reads_file_2=None)[source]

Map the reads to the genome using BWA.

Parameters:
  • genome_file (str) – Location of the assembly file in the file system
  • reads_file_1 (str) – Location of the reads file in the file system
  • reads_file_2 (str) – Location of the second reads file for paired-end data (optional)
  • bam_loc (str) – Location of the output file
bwa_untar_index(genome_name, tar_file, amb_file, ann_file, bwt_file, pac_file, sa_file)[source]

Extracts the BWA index files from the genome index tar file.

Parameters:
  • genome_name (str) – Location string of the genome fasta file
  • tar_file (str) – Location of the BWA index tar file
  • amb_file (str) – Location of the amb index file
  • ann_file (str) – Location of the ann index file
  • bwt_file (str) – Location of the bwt index file
  • pac_file (str) – Location of the pac index file
  • sa_file (str) – Location of the sa index file
Returns:

Boolean indicating if the task was successful

Return type:

bool

static gem_index_genome(genome_file, index_name=None)[source]

Create an index of the genome FASTA file with GEM. These are saved alongside the assembly file.

Parameters:genome_file (str) – Location of the assembly file in the file system
static replaceENAHeader(file_path, file_out)[source]

The ENA header has pipes in the header as part of the stable_id. This function removes the ENA stable_id and replaces it with the final section after splitting the stable ID on the pipe.
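
The header rewrite described above can be sketched as follows. This is an illustration only: the helper names and the example header are hypothetical, and the sketch assumes FASTA-style header lines that start with `>`. It works on lines in memory, whereas `replaceENAHeader` takes file paths.

```python
def strip_ena_stable_id(line):
    """Return a header line with the pipe-delimited prefix removed."""
    if line.startswith(">") and "|" in line:
        # Keep only the final section after splitting on the pipe
        return ">" + line[1:].split("|")[-1]
    # Sequence lines and headers without pipes are left untouched
    return line


def rewrite_headers(lines):
    """Apply the header rewrite to every line of a FASTA file."""
    return [strip_ena_stable_id(line) for line in lines]


# Hypothetical header, used only to show the transformation
EXAMPLE = [">ENA|ABC123|ABC123.1 example sequence\n", "ACGT\n"]
print(rewrite_headers(EXAMPLE))
```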

Bam Utilities

class tool.bam_utils.bamUtils[source]

Tool for handling bam files

static bam_copy(bam_in, bam_out)[source]

Wrapper function to copy from one bam file to another

Parameters:
  • bam_in (str) – Location of the input bam file
  • bam_out (str) – Location of the output bam file
static bam_count_reads(bam_file, aligned=False)[source]

Wrapper to count the number of (aligned) reads in a bam file

static bam_filter(bam_file, bam_file_out, filter_name)[source]

Wrapper for filtering out reads from a bam file

Parameters:
  • bam_file (str) – Location of the input bam file
  • bam_file_out (str) – Location of the filtered output bam file
  • filter_name (str) –
    One of:
    duplicate - Read is PCR or optical duplicate (1024)
    supplementary - Reads that are chimeric, fusion or non linearly aligned (2048)
    unmapped - Read is unmapped or not the primary alignment (260)
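
The numbers in brackets are SAM flag bit masks; 260 is the sum of 4 (read unmapped) and 256 (not the primary alignment). The sketch below shows how such a mask can be tested against a read's flag. It assumes a read is dropped when any bit of the mask is set, which is how `samtools view -F` behaves. Whether `bam_filter` is implemented this way is an assumption, and the helper names are hypothetical.

```python
# SAM flag masks quoted in the filter descriptions
FILTER_MASKS = {
    "duplicate": 1024,
    "supplementary": 2048,
    "unmapped": 260,  # 4 (unmapped) + 256 (not primary alignment)
}


def is_filtered(flag, filter_name):
    """Return True if any bit of the named mask is set in the flag."""
    return (flag & FILTER_MASKS[filter_name]) != 0


def keep_reads(flags, filter_name):
    """Return the flags of the reads that survive the named filter."""
    return [flag for flag in flags if not is_filtered(flag, filter_name)]
```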
static bam_index(bam_file, bam_idx_file)[source]

Wrapper for the pysam SAMtools index function

Parameters:
  • bam_file (str) – Location of the bam file that is to be indexed
  • bam_idx_file (str) – Location of the bam index file (.bai)
static bam_list_chromosomes(bam_file)[source]

Wrapper to list the chromosome names that are present within the bam file

Parameters:bam_file (str) – Location of the bam file
Returns:List of the names of the chromosomes that are present in the bam file
Return type:list
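
For orientation, the chromosome names in a bam file come from the `@SQ` lines of its header, where the `SN` tag holds the sequence name. The sketch below extracts those names from header text. It is a plain-Python illustration with a hypothetical function name and does not show how `bam_list_chromosomes` reads the header.

```python
def list_chromosomes(header_text):
    """Collect the SN (sequence name) value of every @SQ header line."""
    names = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


# Hypothetical header used for illustration
HEADER = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:500\n"
print(list_chromosomes(HEADER))
```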
static bam_merge(*args)[source]

Wrapper for the pysam SAMtools merge function

Parameters:
  • bam_file_1 (str) – Location of the bam file to merge into
  • bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
static bam_paired_reads(bam_file)[source]

Wrapper to test if a bam file contains paired end reads

static bam_sort(bam_file)[source]

Wrapper for the pysam SAMtools sort function

Parameters:bam_file (str) – Location of the bam file to sort
static bam_split(bam_file_in, bai_file, chromosome, bam_file_out)[source]

Wrapper to extract a single chromosome's worth of reads into a new bam file

Parameters:
  • bam_file_in (str) – Location of the input bam file
  • bai_file (str) – Location of the bam index file. This needs to be in the same directory as the bam_file_in
  • chromosome (str) – Name of the chromosome whose alignments are to be extracted
  • bam_file_out (str) – Location of the output bam file
static bam_stats(bam_file)[source]

Wrapper for the pysam SAMtools flagstat function

Parameters:bam_file (str) – Location of the bam file
Returns:list – Each entry records qc_passed (int), qc_failed (int) and description (str)
Return type:dict
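
To show where fields such as qc_passed, qc_failed and description come from, the sketch below parses text in the `samtools flagstat` line format (`<passed> + <failed> <description>`). The function name and sample text are hypothetical, and the exact structure returned by `bam_stats` may differ from this list of dictionaries.

```python
import re

# One flagstat line looks like: "100 + 2 in total (QC-passed reads + ...)"
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Turn flagstat output into a list of dictionaries, one per line."""
    results = []
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a counts line
        results.append({
            "qc_passed": int(match.group(1)),
            "qc_failed": int(match.group(2)),
            "description": match.group(3),
        })
    return results
```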
static bam_to_bed(bam_file, bed_file)[source]

Function for converting bam files to bed files

static check_header(bam_file)[source]

Wrapper around pysam SAMtools for checking if a bam file is sorted

Parameters:bam_file (str) – Location of the bam file
Returns:True if the file has been sorted
Return type:bool
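
One way to decide whether a file is sorted is to inspect the `@HD` header line, whose `SO` tag records the sort order. The sketch below assumes that "sorted" means `SO:coordinate`. Both that interpretation and the function name are assumptions made for illustration, not a description of `check_header` itself.

```python
def header_is_sorted(header_text):
    """Return True if the @HD line declares coordinate sort order."""
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "@HD":
            return "SO:coordinate" in fields[1:]
    # No @HD line: the sort order is unknown, so report unsorted
    return False
```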
static sam_to_bam(sam_file, bam_file)[source]

Function for converting sam files to bam files

class tool.bam_utils.bamUtilsTask[source]

Wrappers so that the functions above can be used as part of a @task within COMPSs, avoiding the files being copied around the infrastructure too many times

bam_copy(**kwargs)[source]

Wrapper function to copy from one bam file to another

Parameters:
  • bam_in (str) – Location of the input bam file
  • bam_out (str) – Location of the output bam file
bam_filter(**kwargs)[source]

Wrapper for filtering out reads from a bam file

Parameters:
  • bam_file (str) – Location of the input bam file
  • bam_file_out (str) – Location of the filtered output bam file
  • filter (str) –
    One of:
    duplicate - Read is PCR or optical duplicate (1024)
    unmapped - Read is unmapped or not the primary alignment (260)
bam_index(**kwargs)[source]

Wrapper for the pysam SAMtools index function

Parameters:
  • bam_file (str) – Location of the bam file that is to be indexed
  • bam_idx_file (str) – Location of the bam index file (.bai)
bam_list_chromosomes(**kwargs)[source]

Wrapper to get the list of chromosomes in a given bam file

Parameters:bam_file (str) – Location of the bam file
Returns:chromosome_list – List of the chromosomes in the bam file
Return type:list
bam_merge(in_bam_job_files)[source]

Wrapper task taking any number of bam files and merging them into a single bam file.

Parameters:in_bam_job_files (list) – List of the locations of the separate bam files that are to be merged. The first file in the list will be taken as the output file name.
bam_merge_10(**kwargs)[source]

Wrapper for the pysam SAMtools merge function

Parameters:
  • bam_file_1 (str) – Location of the bam file to merge into
  • bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_5 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_6 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_7 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_8 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_9 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_10 (str) – Location of the bam file that is to get merged into bam_file_1
bam_merge_2(**kwargs)[source]

Wrapper for the pysam SAMtools merge function

Parameters:
  • bam_file_1 (str) – Location of the bam file to merge into
  • bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
bam_merge_3(**kwargs)[source]

Wrapper for the pysam SAMtools merge function

Parameters:
  • bam_file_1 (str) – Location of the bam file to merge into
  • bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
bam_merge_4(**kwargs)[source]

Wrapper for the pysam SAMtools merge function

Parameters:
  • bam_file_1 (str) – Location of the bam file to merge into
  • bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1
bam_merge_5(**kwargs)[source]

Wrapper for the pysam SAMtools merge function

Parameters:
  • bam_file_1 (str) – Location of the bam file to merge into
  • bam_file_2 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_3 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_4 (str) – Location of the bam file that is to get merged into bam_file_1
  • bam_file_5 (str) – Location of the bam file that is to get merged into bam_file_1
bam_paired_reads(**kwargs)[source]

Wrapper for the pysam SAMtools view function to identify if a bam file contains paired end reads

Parameters:bam_file (str) – Location of the bam file that is to be checked
Returns:True if the bam file contains paired end reads
Return type:bool
bam_sort(**kwargs)[source]

Wrapper for the pysam SAMtools sort function

Parameters:bam_file (str) – Location of the bam file to sort
bam_stats(**kwargs)[source]

Wrapper for the pysam SAMtools flagstat function

Parameters:
  • bam_file (str) – Location of the bam file
  • bam_idx_file (str) – Location of the bam index file (.bai)
check_header(**kwargs)[source]

Wrapper for checking if a bam file is sorted

Parameters:bam_file (str) – Location of the bam file

FASTQ Functions

The following functions are ones that are used for the manipulation of FASTQ files.

Reading

The following functions provide easy access for iterating through the entries within single or paired FASTQ files.

class tool.fastqreader.fastqreader[source]

Module for reading single end and paired end FASTQ files

closeFastQ()[source]

Close file handles for the FastQ files.

closeOutputFiles()[source]

Close the output file handles

createOutputFiles(tag='')[source]

Create and open the file handles for the output files

Parameters:tag (str) – Tag to identify the output files (DEFAULT: ‘’)
eof(side=1)[source]

Indicate if the end of the file has been reached

Parameters:side (int) – 1 or 2
incrementOutputFiles()[source]

Increment the counter and create new files for splitting the original FastQ paired end files.

next(side=1)[source]

Get the next read element for the specific FastQ file pair

Parameters:side (int) – 1 or 2 to get the element from the relevant end (DEFAULT: 1)
Returns:
  • id (str) – Sequence ID
  • seq (str) – Called sequence
  • add (str) – Plus sign
  • score (str) – Base call score
Return type:dict
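
The dictionary layout above maps directly onto the four lines of a FASTQ record. The sketch below reads records from any file-like object and yields one dictionary per read. It is a simplified, hypothetical illustration: it handles one file only and does not reproduce the `side` handling or file management of `fastqreader`.

```python
import io


def read_fastq(handle):
    """Yield one dict per four-line FASTQ record."""
    while True:
        read_id = handle.readline()
        if not read_id:
            break  # end of file reached
        yield {
            "id": read_id.rstrip("\n"),
            "seq": handle.readline().rstrip("\n"),
            "add": handle.readline().rstrip("\n"),
            "score": handle.readline().rstrip("\n"),
        }


# Two made-up reads used for illustration
SAMPLE = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n"

for record in read_fastq(io.StringIO(SAMPLE)):
    print(record["id"], record["seq"])
```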
openFastQ(file1, file2=None)[source]

Create file handles for reading the FastQ files

Parameters:
  • file1 (str) – Location of the first FASTQ file
  • file2 (str) – Location of a paired end FASTQ file.
writeOutput(read, side=1)[source]

Writer to print the extracted lines

Parameters:
  • read (dict) – Read is the dictionary object returned from self.next()
  • side (int) – The side that the read has come from (DEFAULT: 1)
Returns:

False if a value other than 1 or 2 is entered for the side.

Return type:

bool

Splitting

This tool has been created to aid in splitting FASTQ files into manageable chunks for parallel processing. It is able to work on single and paired end files.

class tool.fastq_splitter.fastq_splitter(configuration=None)[source]

Script for splitting up FASTQ files into manageable chunks

paired_splitter(**kwargs)[source]

Function to divide the paired-end FastQ files into separate sub files of 1000000 sequences so that the aligner can run in parallel.

Parameters:
  • in_file1 (str) – Location of first paired end FASTQ file
  • in_file2 (str) – Location of second paired end FASTQ file
  • tag (str) – DEFAULT = tmp Tag used to identify the files. Useful if this is getting run manually on a single machine multiple times to prevent collisions of file names
Returns:

  • paired_files (list) – List of lists of the paired-end files that have been generated. Each sub list contains the two paired-end files for that subset.
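
The chunking idea can be sketched in memory as follows. The function name is hypothetical and lists stand in for the output files; how `paired_splitter` names and writes its sub files is not shown here. The point of the sketch is that both ends are advanced in lockstep so that each pair of chunks holds the same reads.

```python
from itertools import islice


def chunk_pairs(reads_1, reads_2, chunk_size=1000000):
    """Yield (chunk_1, chunk_2) lists taken in lockstep from both ends."""
    iter_1, iter_2 = iter(reads_1), iter(reads_2)
    while True:
        chunk_1 = list(islice(iter_1, chunk_size))
        chunk_2 = list(islice(iter_2, chunk_size))
        if not chunk_1 and not chunk_2:
            break  # both inputs are exhausted
        yield chunk_1, chunk_2


# A tiny chunk size makes the behaviour easy to see
for pair in chunk_pairs(["a/1", "b/1", "c/1"], ["a/2", "b/2", "c/2"], chunk_size=2):
    print(pair)
```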

run(input_files, input_metadata, output_files)[source]

The main function to run the splitting of FASTQ files (single or paired) so that they can be aligned in a distributed manner

Parameters:
  • input_files (dict) – List of input fastq file locations
  • input_metadata (dict) –
  • output_files (dict) –
Returns:

  • output_file (str) – Location of compressed (.tar.gz) of the split FASTQ files
  • output_names (list) – List of file names in the compressed file

single_splitter(**kwargs)[source]

Function to divide the FastQ files into separate sub files of 1000000 sequences so that the aligner can run in parallel.

Parameters:
  • in_file1 (str) – Location of first FASTQ file
  • tag (str) – DEFAULT = tmp Tag used to identify the files. Useful if this is getting run manually on a single machine multiple times to prevent collisions of file names
Returns:

List of the FASTQ files that have been generated.

Entry Functions

The following functions allow for manipulating FASTQ files.

class tool.fastq_utils.fastqUtils[source]

Set of methods to help with the management of FastQ files.

static fastq_match_paired_ends(fastq_1, fastq_2)[source]

Take 2 fastq files and remove ends that don’t have a matching pair. Requires that the FastQ files are ordered correctly.

Mismatches can occur if there is a filtering step that removes one of the paired ends.

Parameters:
  • fastq_1 (str) – Location of FastQ file
  • fastq_2 (str) – Location of FastQ file
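
Because the files are required to be ordered, unmatched reads can be dropped in a single pass over both files. The sketch below shows the idea on two sorted lists of read IDs. It assumes that "ordered correctly" means sorted by read ID and that mates share an identical ID. Both are assumptions, and the function name is hypothetical.

```python
def match_paired_ends(ids_1, ids_2):
    """Return the IDs present in both sorted lists (two-pointer walk)."""
    matched = []
    i, j = 0, 0
    while i < len(ids_1) and j < len(ids_2):
        if ids_1[i] == ids_2[j]:
            matched.append(ids_1[i])
            i += 1
            j += 1
        elif ids_1[i] < ids_2[j]:
            i += 1  # read only present in the first file, drop it
        else:
            j += 1  # read only present in the second file, drop it
    return matched
```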
static fastq_randomise(fastq, output=None)[source]

Randomise the order of reads within a FastQ file

Parameters:
  • fastq (str) – Location of the FastQ file to randomise
  • output (str) – [OPTIONAL] Location of the output FastQ file. If left blank then the randomised output is saved to the same location as fastq
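
When randomising a FASTQ file, the unit that has to be shuffled is the whole four-line record, not the individual line. A minimal sketch on in-memory records follows. The function name and the `seed` parameter are hypothetical additions for reproducibility and are not part of `fastq_randomise`.

```python
import random


def randomise_records(records, seed=None):
    """Return a shuffled copy of a list of four-line FASTQ records."""
    shuffled = list(records)  # leave the input order untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled
```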
static fastq_sort_file(fastq, output=None)[source]

Sorting of a FastQ file

Parameters:
  • fastq (str) – Location of the FastQ file to sort
  • output (str) – [OPTIONAL] Location of the output FastQ file. If left blank then the sorted output is saved to the same location as fastq
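
Sorting follows the same record-level principle. The sketch below orders four-line records by their ID line. The sort key is an assumption, since the key used by `fastq_sort_file` is not stated here, and the function name is hypothetical.

```python
def sort_records(records):
    """Return the four-line FASTQ records ordered by their ID line."""
    return sorted(records, key=lambda record: record[0])
```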