Test Data for ChIP-seq pipeline

The following document is for the preparation of data set required for testing the ChIP-seq pipeline. The document has been written with macOS Sierra in mind, although many of the commands are cross platform (*nix) compliant.

You would need to have the tools listed in “Prerequisites” installed on your system. For more details on installing the tools for this pipeline please refer to

http://multiscale-genomics.readthedocs.io/projects/mg-process-fastq/en/latest/full_installation.html

If you already have certain packages installed feel free to skip over certain steps. Likewise the bin, lib and code directories are relative to the home dir; if this is not the case for your system then make the required changes when running these commands.

Prerequisites

  • BWA
  • MACS 2
  • Biobambam
  • Samtools

Data set for genome file

Filtering for required coverage

Download the genome file from

wget "http://www.ebi.ac.uk/ena/data/view/CM000663.2,CM000664.2,CM000665.2,CM000666.2,CM000667.2,CM000668.2,CM000669.2,CM000670.2,CM000671.2,CM000672.2,CM000673.2,CM000674.2,CM000675.2,CM000676.2,CM000677.2,CM000678.2,CM000679.2,CM000680.2,CM000681.2,CM000682.2,CM000683.2,CM000684.2,CM000685.2,CM000686.2,J01415.2&display=fasta&download=fasta&filename=entry.fasta" -O GCA_000001405.22.fasta

Checkout https://github.com/Multiscale-Genomics/mg-misc-scripts/blob/master/ChIPSeq_Scripts/extract_chromosomeForChIP.py and extract chromosome 22 from the above file using the following command.

python extract_chromosomeForChIP.py path/to/your/input/file path/to/output/file

Download the fastq file from

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR000/DRR000150/DRR000150.fastq.gz

Unzip this file.

gunzip DRR000150.fastq.gz

Index the fasta file

bwa index GCA_000001405.22.chr22.fa.fasta

Align the fastq file

bwa aln GCA_000001405.22.chr22.fa.fasta DRR000150.chr22.fastq >GCA_000001405.22.chr22.sai

And make the sam file

bwa samse GCA_000001405.22.chr22.fa.fasta GCA_000001405.22.chr22.sai DRR000150.chr22.fastq >GCA_000001405.22.chr22.sam

Sort the sam file

samtools sort GCA_000001405.22.chr22.sam >GCA_000001405.22.chr22.sorted.sam

Find the depths of coverage from the sorted file

samtools depth GCA_000001405.22.chr22.sorted.sam >GCA_000001405.22.chr22.dp

From the depth file, find regions with >= 70 depth, spanning over >=55 base pairs. You may get the script for this from https://github.com/Multiscale-Genomics/mg-misc-scripts/blob/master/ChIPSeq_Scripts/traverseForCoverageRegion_ChIP.py. Run it using:

python traverseForCoverageRegion_ChIP.py path/to/GCA_000001405.22.chr22.dp

Running this script would print the spanning regions. If it is a continuous region, you may only take the first starting base pair and the last ending base pair, as inputs for the next step. (Take out 1000 and add in 1000 to these respectively to get upstream and downstream spanning bases)

Extract the corresponding fasta sequence from the chromosome file for the positions retrieved from the above step. Checkout file from https://github.com/Multiscale-Genomics/mg-misc-scripts/blob/master/ChIPSeq_Scripts/extractChromosomalRegion.py and run using command:

python extractChromosomalRegion.py path/to/original/fasta/file path/to/output/file/for/region/macs2.Human.GCA_000001405.22.fasta starting_base_position ending_base_position

Making the Fastq file

Index the fasta file for the selected region

bwa index macs2.Human.GCA_000001405.22.fasta

Align the fastq file

bwa aln macs2.Human.GCA_000001405.22.fasta DRR000150.chr22.fastq >macs2.Human.GCA_000001405.22.sai

And make the sam file

bwa samse macs2.Human.GCA_000001405.22.fasta macs2.Human.GCA_000001405.22.sai DRR000150.chr22.fastq >macs2.Human.GCA_000001405.22.sam

Filter this sam file for the reads which aligned with chromosome 22 using the following command:

awk '$3 == chr22' macs2.Human.GCA_000001405.22.sam >macs2.Human.GCA_000001405.22.22.sam

From the filtered reads from the above output file, extract the corresponding entries in fastq file. You may do this using the file at :

https://github.com/Multiscale-Genomics/mg-misc-scripts/blob/master/ChIPSeq_Scripts/makeFastQFiles.py

and running it via command line:

python makeFastQFiles.py --samfile macs2.Human.GCA_000001405.22.22.sam --fastQfile /path/to/DRR000150.chr22.fastq --pathToOutput /path/to/save/output/fastq/file/to/ --fastqOut macs2.Human.DRR000150.22.fastq

The fastq file in the above step and fasta file macs2.Human.GCA_000001405.22.fasta together make up the data set for ChIP-seq pipeline