This guide walks you through the nitty-gritty of how to install each prerequisite package.
PEPATAC pipeline
To begin, we need to get the PEPATAC pipeline itself. The pipeline is hosted on GitHub. If you don't have git installed, follow the git installation instructions; a brief introduction to git is also available. To install PEPATAC, you can clone the repository over either SSH or HTTPS:
git clone git@github.com:databio/pepatac.git
git clone https://github.com/databio/pepatac.git
We'll use SSH in this example. From an open terminal, let's first create a directory we'll use to run through this guide:
mkdir pepatac_tutorial
Let's move into our newly created directory and create a few more folders that we'll use later.
cd pepatac_tutorial/
mkdir data
mkdir genomes
mkdir processed
mkdir templates
mkdir tools
cd tools/
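As a side note, the same directory layout can be built in a single pass. This is a minimal sketch, assuming a POSIX shell and that it is run from the directory that should contain pepatac_tutorial/, in place of the individual mkdir commands (not in addition to them from inside tools/):

```shell
# Create the tutorial directory tree in one pass.
# -p creates missing parents and does not fail if a directory already exists.
for d in data genomes processed templates tools; do
    mkdir -p "pepatac_tutorial/$d"
done

# List the result to confirm the layout.
ls pepatac_tutorial
```

Because of -p, re-running the loop is harmless if some of the folders are already there.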
Time to get PEPATAC!
git clone git@github.com:databio/pepatac.git
Success! If you had any issues, feel free to reach out to us with questions. Otherwise, let's move on to installing additional software.
You have two options for installing the software prerequisites: 1) use a container, in which case you need only either docker or singularity; or 2) install all prerequisites natively.
To use PEPATAC, we need the following software:
Python packages. The pipeline uses pypiper to run a single sample, looper to handle multi-sample projects (for either local or cluster computation), and pararead for parallel processing of sequence reads. For peak calling, the pipeline uses MACS2 as the default. You can do a user-specific install of these like this:
pip install --user numpy pandas piper loopercli \
pararead MACS2
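To confirm that the install worked, you can try importing each package. The helper below is a hypothetical sketch, not part of PEPATAC. It assumes that python3 is the interpreter pip installed into, and that the piper and loopercli packages are imported as pypiper and looper respectively:

```shell
# Hypothetical helper (not part of PEPATAC): report whether each Python module imports.
# Assumption: python3 is the interpreter that pip installed the packages into.
check_py_modules() {
    status=0
    for mod in "$@"; do
        if python3 -c "import $mod" >/dev/null 2>&1; then
            echo "ok:      $mod"
        else
            echo "missing: $mod"
            status=1
        fi
    done
    return "$status"
}

# Assumed import names: piper -> pypiper, loopercli -> looper.
check_py_modules numpy pandas pypiper looper pararead ||
    echo "Some Python packages still need installing."

# MACS2 is run as a command-line program, so look for its executable instead.
command -v macs2 >/dev/null 2>&1 || echo "macs2 not found on PATH"
```

With a --user install, executables such as macs2 typically land in ~/.local/bin, so a "not found" message may only mean that directory is missing from your PATH.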
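Required executables. We will need some common bioinformatics tools installed. The complete list (including optional tools) is specified in the tools section of the pipeline configuration file (pipelines/pepatac.yaml). The tools installed in this guide are bedtools, bowtie2, FastQC, samblaster, samtools, skewer, and a few UCSC utilities (bedGraphToBigWig, wigToBigWig, bigWigCat, and bedToBigBed).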
We'll install each of these pieces of software before moving forward. Let's start right at the beginning and install bedtools. We're going to install from source, but if you would prefer to install from a package manager, you can follow the instructions in the bedtools installation guide.
cd /path/to/pepatac_tutorial/tools/
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
rm bedtools-2.25.0.tar.gz
cd bedtools2
make
Now, let's add bedtools to our PATH environment variable. If environment variables are unfamiliar, it is worth reading an introduction to the concept first.
export PATH="$PATH:/path/to/pepatac_tutorial/tools/bedtools2/bin/"
Next, let's install bowtie2.
cd ../
wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.3.4.1/bowtie2-2.3.4.1-source.zip
unzip bowtie2-2.3.4.1-source.zip
rm bowtie2-2.3.4.1-source.zip
cd bowtie2-2.3.4.1
make
Again, let's add bowtie2 to our PATH environment variable:
export PATH="$PATH:/path/to/pepatac_tutorial/tools/bowtie2-2.3.4.1/"
Great! On to the next one. Let's get FastQC. As a reminder, you will need to have java installed to use FastQC. At the command prompt, type java -version and press enter; if you don't see an error, you should be alright. You'll need a version greater than 1.6 to work with FastQC. Read more in the FastQC installation instructions.
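The java check described above can also be scripted. This is a small sketch with a hypothetical helper name; it only reports what is on your PATH and does not compare version numbers:

```shell
# Hypothetical helper: report the Java runtime found on PATH, if any.
check_java() {
    if command -v java >/dev/null 2>&1; then
        # java prints its version banner on stderr, so redirect it to stdout.
        java -version 2>&1 | head -n 1
    else
        echo "java not found on PATH; install a Java runtime before using FastQC"
    fi
}

check_java
```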
cd ../
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.7.zip
unzip fastqc_v0.11.7.zip
rm fastqc_v0.11.7.zip
We also need to make the FastQC wrapper executable. To learn more about this, check out an introduction to chmod.
chmod 755 FastQC/fastqc
Add FastQC to our PATH environment variable:
export PATH="$PATH:/path/to/pepatac_tutorial/tools/FastQC/"
Now we'll get samblaster. For a full guide, check out the samblaster installation instructions.
git clone https://github.com/GregoryFaust/samblaster.git
cd samblaster/
make
export PATH="$PATH:/path/to/pepatac_tutorial/tools/samblaster/"
Next up, samtools.
cd ../
wget https://sourceforge.net/projects/samtools/files/samtools/1.9/samtools-1.9.tar.bz2
tar xvfj samtools-1.9.tar.bz2
rm samtools-1.9.tar.bz2
cd samtools-1.9
./configure
Alternatively, if you do not have the ability to install samtools to the default location, you can specify a different one by passing the --prefix=/install/destination/dir/ option to ./configure. See the samtools installation documentation to learn more about the --prefix option.
make
make install
As with our other tools, add samtools to our PATH environment variable:
export PATH="$PATH:/path/to/pepatac_tutorial/tools/samtools-1.9/"
Time to add skewer to the collection.
cd ../
wget https://downloads.sourceforge.net/project/skewer/Binaries/skewer-0.2.2-linux-x86_64
mv skewer-0.2.2-linux-x86_64 skewer
chmod 755 skewer
Finally, we need a few of the UCSC utilities. You can install the entire set of tools should you choose, but here we'll just grab the subset that we need.
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigCat
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
chmod 755 bedGraphToBigWig
chmod 755 wigToBigWig
chmod 755 bigWigCat
chmod 755 bedToBigBed
Add our tools/ directory to our PATH environment variable.
export PATH="$PATH:/path/to/pepatac_tutorial/tools/"
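Before moving on, it can help to confirm that every executable installed so far is actually reachable through PATH. The following is a minimal sketch using a hypothetical helper; the executable names are assumed from the install steps above (for example, fastqc for the FastQC wrapper):

```shell
# Hypothetical helper: print which of the named executables are on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            status=1
        fi
    done
    return "$status"
}

check_tools bedtools bowtie2 fastqc samblaster samtools skewer \
    bedGraphToBigWig wigToBigWig bigWigCat bedToBigBed ||
    echo "Re-check the export PATH lines for anything marked MISSING."
```

Anything reported as MISSING usually points to a PATH entry that still contains the /path/to/ placeholder or to a build step that did not finish.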
That should do it! Now we'll install some optional packages. Of course, these are not required, but for the purposes of this tutorial we're going to be completionists.
PEPATAC uses R to generate quality control plots. These are optional and the pipeline will run without them, but you would not get any QC plots. If you don't have R installed, you can follow the R installation instructions. We'll install and use the necessary packages in this example. The required packages are argparser, devtools, data.table, ggplot2, gplots, gtable, and scales from CRAN, GenomicRanges from Bioconductor, and pepr and GenomicDistributions from GitHub.
To install the needed packages, enter the following command:
Rscript -e "install.packages(c('argparser','devtools', 'data.table', \
'ggplot2', 'gplots', 'gtable', 'scales'), \
repos='http://cran.us.r-project.org/'); \
source('https://bioconductor.org/biocLite.R'); \
biocLite('GenomicRanges'); \
devtools::install_github(c('pepkit/pepr', 'databio/GenomicDistributions'))"
To extract files more quickly, PEPATAC can also use pigz in place of gzip if you have it installed. Let's go ahead and do that now. It's not required, but it can help speed everything up when you have many samples to process.
cd /path/to/pepatac_tutorial/tools/
wget http://zlib.net/pigz/pigz-2.4.tar.gz
tar xvfz pigz-2.4.tar.gz
rm pigz-2.4.tar.gz
cd pigz-2.4/
make
Don't forget to add this to your PATH too!
export PATH="$PATH:/path/to/pepatac_tutorial/tools/pigz-2.4/"
That's it! Everything we need to run PEPATAC to its full potential should be installed. If you are interested and have experience using containers, you can check out the alternate installation methods.
We also need to create some environment variables to help point looper to where we keep our data files and our tools. You may either set up the environment variables, as we're going to do now, or simply hard-code the necessary locations in the configuration files.
First, let's create a PROCESSED variable that represents the location where we want to save output.
export PROCESSED="/path/to/pepatac_tutorial/processed/"
Second, we'll create a variable named CODEBASE representing the root path to all our tools.
export CODEBASE="/path/to/pepatac_tutorial/tools/"
(Add these environment variables to your .bashrc or .profile so you don't have to repeat this step in every new shell.)
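One way to persist the variables without ending up with duplicate lines is sketched below. The helper name is hypothetical, and the demonstration writes to a scratch file so you can inspect the result first; pass "$HOME/.bashrc" (or your .profile) as the third argument once you are happy with it:

```shell
# Hypothetical helper: append 'export NAME="VALUE"' to a file unless that exact
# line is already present.
# usage: persist_export NAME VALUE RC_FILE
persist_export() {
    line="export $1=\"$2\""
    touch "$3"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$3" || echo "$line" >> "$3"
}

# Demonstration on a scratch file rather than the real .bashrc.
rc_file="$(mktemp)"
persist_export PROCESSED "/path/to/pepatac_tutorial/processed/" "$rc_file"
persist_export CODEBASE "/path/to/pepatac_tutorial/tools/" "$rc_file"
persist_export CODEBASE "/path/to/pepatac_tutorial/tools/" "$rc_file"  # repeat adds nothing
cat "$rc_file"
```

The scratch file ends up with exactly two export lines, because the repeated CODEBASE call finds its line already present.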
Fantastic! Now that we have the pipeline and its requirements installed, we're ready to get our reference genome(s).
Before we analyze anything, we also need a reference genome. PEPATAC uses refgenie genomes. For the purposes of this tutorial, we'll just download pre-built genomes. Follow the refgenie instructions if you'd like to build your own reference genome. First, let's change into our genomes/ folder.
cd /path/to/pepatac_tutorial/genomes/
wget http://big.databio.org/refgenomes/hg38.tgz
wget http://cloud.databio.org.s3.amazonaws.com/refgenomes/human_repeats_170502.tgz
wget http://cloud.databio.org.s3.amazonaws.com/refgenomes/rCRSd_170502.tgz
tar xvfz hg38.tgz
tar xvfz human_repeats_170502.tgz
tar xvfz rCRSd_170502.tgz
rm hg38.tgz
rm human_repeats_170502.tgz
rm rCRSd_170502.tgz
## 4: Point the pipeline to your Refgenie assemblies
Let's also create another environment variable that points to our genomes.
export GENOMES="/path/to/pepatac_tutorial/genomes/"
(Don't forget to add this to your .bashrc or .profile to ensure it persists.)
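Since PROCESSED, CODEBASE, and GENOMES all need to point at real directories, a quick check can catch a leftover /path/to/ placeholder early. This is a sketch with a hypothetical helper name, assuming a POSIX shell:

```shell
# Hypothetical helper: verify that the variable named by $1 is set and
# points to an existing directory.
check_dir_var() {
    # $1 is the NAME of the variable, so use eval to look up its value.
    eval "dir=\${$1:-}"
    if [ -z "$dir" ]; then
        echo "$1 is not set"
        return 1
    elif [ ! -d "$dir" ]; then
        echo "$1 points to a missing directory: $dir"
        return 1
    fi
    echo "$1 -> $dir"
}

for var in PROCESSED CODEBASE GENOMES; do
    check_dir_var "$var" || true
done
```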
To calculate TSS enrichments, you will need a TSS annotation file in your reference genome directory. If a pre-built version for your genome of interest isn't present, you can quickly create that file yourself. In the reference genome directory, run the following commands (shown here for hg38):
wget -O hg38_TSS_full.txt.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz && \
zcat hg38_TSS_full.txt.gz | \
awk '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}' | \
LC_COLLATE=C sort -k1,1 -k2,2n -u > hg38_TSS.tsv
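To see what the awk step does, here is the same transformation applied to two made-up rows laid out like UCSC's refGene table (column 3 is the chromosome, 4 the strand, 5 and 6 the transcript start and end, 13 the gene symbol). The rows and the function name are illustrative assumptions, not real annotations:

```shell
# Same awk and sort steps as above, wrapped in a function that reads stdin.
refgene_to_tss() {
    awk '{if($4=="+"){print $3"\t"$5"\t"$5"\t"$4"\t"$13}else{print $3"\t"$6"\t"$6"\t"$4"\t"$13}}' |
        LC_COLLATE=C sort -k1,1 -k2,2n -u
}

# Two made-up, tab-separated refGene-style rows: one per strand.
toy_tss="$(
    printf '0\tNM_A\tchr1\t+\t1000\t5000\t1200\t4800\t2\t1000,3000,\t2000,5000,\t0\tGENEA\tcmpl\tcmpl\t0,0,\n0\tNM_B\tchr1\t-\t7000\t9000\t7100\t8900\t1\t7000,\t9000,\t0\tGENEB\tcmpl\tcmpl\t0,\n' |
        refgene_to_tss
)"
printf '%s\n' "$toy_tss"
```

The plus-strand row is reported at its transcript start (1000) and the minus-strand row at its transcript end (9000), which is where transcription begins on that strand.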
We also have downloadable pre-built genome annotation files for hg38, hg19, mm10, and mm9 that you can use to annotate the reads and peaks. These files annotate 3' and 5' UTR, Exonic, Intronic, Intergenic, Promoter, and Promoter Flanking Regions of the corresponding genome as indicated in Ensembl or UCSC. Simply move the corresponding genome annotation file into the pepatac/anno folder. Once it is in the pepatac/anno folder, you don't need to do anything else, as the pipeline will look there automatically. Alternatively, you can use the --anno-name pipeline option to point directly to this file when running. You can also learn how to create a custom annotation file to calculate coverage using your own features of interest.
Alright! Time to set up the pipeline configuration files and run our sample.