Detailed installation instructions
This guide walks you through the nitty-gritty of how to install each prerequisite package.
1. Install required software
Python packages. The pipeline uses pypiper to run a single sample, looper to handle multi-sample projects (for either local or cluster computation), and pararead for parallel processing of sequence reads. For peak calling, the pipeline uses MACS2 by default. You can do a user-specific install using the requirements.txt file included in the pipeline directory:
pip install --user --upgrade -r requirements.txt
Required executables. We will need some common bioinformatics tools installed. The complete list (including optional tools) is specified in the tools section of the pipeline configuration file. The following tools are used by the pipeline:
- bedtools (v2.25.0+)
- bowtie2 (v2.2.9+)
- preseq (v2.0+)
- samblaster (v0.1.24+)
- samtools (v1.7)
- skewer (v0.1.126+)
- UCSC tools (v3.5.1)
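At any point during the installs, it can help to see which of these executables your shell can already find. The snippet below is a minimal sketch rather than part of the pipeline: check_tools is a hypothetical helper name, the three UCSC binaries are the ones this guide downloads, and the check only confirms that each command is on PATH. It does not verify the versions listed above.

```shell
# Report which executables are visible on the current PATH.
# check_tools is a hypothetical helper, not something shipped with the pipeline.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found    %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=$((missing + 1))
        fi
    done
    # The exit status is the number of missing tools (0 means all were found).
    return "$missing"
}

check_tools bedtools bowtie2 preseq samblaster samtools skewer \
    wigToBigWig bigWigCat bedToBigBed || echo "Some tools are not installed yet."
```

Re-running the last command after each install gives a quick picture of what is still left to do.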
bedtools
We'll install each of these pieces of software before moving forward. Let's start right at the beginning and install bedtools. We're going to install from source, but if you would prefer to install from a package manager, you can follow the instructions in the bedtools installation guide.
cd tools/
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.2/bedtools-2.29.2.tar.gz
tar -zxvf bedtools-2.29.2.tar.gz
rm bedtools-2.29.2.tar.gz
cd bedtools2
make
Now, let's add bedtools to our PATH environment variable. If you are unfamiliar with the concept of environment variables, look here to learn more.
export PATH="$PATH:/path/to/pepatac_tutorial/tools/bedtools2/bin/"
bowtie2
Next, let's install bowtie2.
cd ../
wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.4.1/bowtie2-2.4.1-source.zip
unzip bowtie2-2.4.1-source.zip
rm bowtie2-2.4.1-source.zip
cd bowtie2-2.4.1
make
Note: you may need to install libtbb-dev if make fails, e.g. using apt install libtbb-dev.
Again, let's add bowtie2 to our PATH environment variable:
export PATH="$PATH:/path/to/pepatac_tutorial/tools/bowtie2-2.4.1/"
preseq
The pipeline uses preseq to calculate library complexity. Check out the author's page for more instructions.
Note: If you receive the following error later in the tutorial:
preseq: error while loading shared libraries: libgsl.so.0: cannot open shared object file: No such file or directory
you may need to install libgsl-dev using apt install libgsl-dev and then do one of the following:
1. export LD_LIBRARY_PATH=/usr/local/lib
2. link libgsl.so.0 to an existing libgsl, e.g. libgsl.so.27
More info can be found here: https://www.gnu.org/software/gsl/doc/html/usage.html#shared-libraries
wget http://smithlabresearch.org/downloads/preseq_linux_v2.0.tar.bz2
tar xvfj preseq_linux_v2.0.tar.bz2
Add to PATH!
export PATH="$PATH:/path/to/pepatac_tutorial/tools/preseq_v2.0/"
samblaster
Now we'll get samblaster. For a full guide, check out the samblaster installation instructions.
git clone https://github.com/GregoryFaust/samblaster.git
cd samblaster/
make
export PATH="$PATH:/path/to/pepatac_tutorial/tools/samblaster/"
samtools
Next up, samtools.
wget https://github.com/samtools/samtools/releases/download/1.10/samtools-1.10.tar.bz2
tar xvfj samtools-1.10.tar.bz2
rm samtools-1.10.tar.bz2
cd samtools-1.10
./configure
Alternatively, if you cannot install samtools to the default location, you can specify a different one by passing the --prefix=/install/destination/dir/ option to ./configure. Learn more about the --prefix option here.
make
make install
As with our other tools, add samtools to our PATH environment variable:
export PATH="$PATH:/path/to/pepatac_tutorial/tools/samtools-1.10/"
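If you go the --prefix route mentioned above, the steps might look roughly like the following sketch. install_with_prefix is a hypothetical helper name and both paths in the usage comment are placeholders. It assumes the usual configure/make layout, in which the installed binaries end up in the bin/ subdirectory of the prefix, so that is the directory to add to PATH in that case.

```shell
# Configure, build, and install a source tree into a user-writable prefix.
# install_with_prefix is a hypothetical helper; both arguments are placeholders.
install_with_prefix() {
    src_dir=$1      # unpacked source directory, e.g. samtools-1.10
    prefix=$2       # destination directory you have write access to
    (
        # Run in a subshell so the caller's working directory is unchanged;
        # the && chain stops at the first step that fails.
        cd "$src_dir" &&
        ./configure --prefix="$prefix" &&
        make &&
        make install
    )
}

# Hypothetical usage (paths are placeholders):
# install_with_prefix samtools-1.10 "$HOME/samtools-install"
# export PATH="$PATH:$HOME/samtools-install/bin"
```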
skewer
Time to add skewer to the collection.
cd ../
wget https://downloads.sourceforge.net/project/skewer/Binaries/skewer-0.2.2-linux-x86_64
mv skewer-0.2.2-linux-x86_64 skewer
chmod 755 skewer
UCSC utilities
Finally, we need a few of the UCSC utilities. You can install the entire set of tools should you choose, but here we'll just grab the subset that we need.
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/wigToBigWig
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigCat
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
chmod 755 wigToBigWig
chmod 755 bigWigCat
chmod 755 bedToBigBed
Add our tools/ directory to our PATH environment variable.
export PATH="$PATH:/path/to/pepatac_tutorial/tools/"
That should do it! Now we'll install some optional packages. Of course, these are not required, but for the purposes of this tutorial we're going to be completionists.
2. Install optional software
PEPATAC uses R to generate quality control and read/peak annotation plots, so you'll need a working R installation if you want these outputs. We have packaged all the R code into a supporting package called PEPATACr. The PEPATACr package relies on a few additional packages, which can be installed at the command line as follows:
Note: if you get an error regarding devtools, try apt install r-cran-devtools before proceeding with the installation.
Note: if you receive an error for GenomicDistributionsData_0.0.2.tar.gz, download the file manually and install it directly using install.packages("local/path/to/GenomicDistributionsData_0.0.2.tar.gz", repos=NULL)
Rscript -e 'install.packages("argparser")'
Rscript -e 'install.packages("devtools")'
Rscript -e 'devtools::install_github("pepkit/pepr")'
Rscript -e 'install.packages("BiocManager")'
Rscript -e 'BiocManager::install("GenomicRanges")'
Rscript -e 'devtools::install_github("databio/GenomicDistributions")'
Rscript -e 'BiocManager::install(c("BSgenome", "GenomicFeatures", "ensembldb"))'
Rscript -e 'install.packages("http://big.databio.org/GenomicDistributionsData/GenomicDistributionsData_0.0.1.tar.gz", repos=NULL)'
Then, install the PEPATACr package. From the pepatac/ directory:
Rscript -e 'devtools::install(file.path("PEPATACr/"), dependencies=TRUE, repos="https://cloud.r-project.org/")'
Optionally, PEPATAC can mix and match tools for adapter removal, deduplication, and signal track generation. FastQC, if present, will be automatically run on input fastq files. seqOutBias can be used with the --sob argument to take into account mappability at a given read length and the Tn5 sequence bias, and to scale the sample signal tracks by the expected over observed cut frequency.
Optional tools:
- fastqc
- picard
- pigz (v2.3.4+)
- seqOutBias: necessitates the following UCSC tools
- trimmomatic
fastqc
You will need to have java installed to use FastQC. At the command prompt, you can type java -version, press enter, and if you don't see an error you should be alright. You'll need a version greater than 1.6 to work with FastQC. Read more from the FastQC installation instructions.
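The java check described above can also be scripted. This is only a small sketch: java_status is a hypothetical helper name, and it reports whatever banner java prints without comparing it against the minimum version.

```shell
# Print the first line of `java -version`, or a message if java is absent.
# java writes its version banner to stderr, hence the 2>&1.
java_status() {
    if command -v java >/dev/null 2>&1; then
        java -version 2>&1 | head -n 1
    else
        echo "java not found on PATH"
    fi
}

java_status
```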
cd /path/to/pepatac_tutorial/tools/
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
unzip fastqc_v0.11.9.zip
rm fastqc_v0.11.9.zip
We also need to make the FastQC wrapper executable. To learn more about this, check out this introduction to chmod.
chmod 755 FastQC/fastqc
Add FastQC to our PATH environment variable:
export PATH="$PATH:/path/to/pepatac_tutorial/tools/FastQC/"
picard
PEPATAC can alternatively use picard MarkDuplicates for duplicate identification and removal. Read the picard installation guide for more assistance.
wget https://github.com/broadinstitute/picard/releases/download/2.20.3/picard.jar
chmod +x picard.jar
Create an environment variable called $PICARD that points to the picard.jar file. Alternatively, update the pepatac.yaml file with the full path to the picard.jar file.
export PICARD="/path/to/pepatac_tutorial/tools/picard.jar"
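A quick way to confirm the variable is set correctly is to test that it points at a real file. The helper below is a sketch with a hypothetical name (check_picard). It only checks that the file exists, not that it is a working picard build.

```shell
# Confirm that $PICARD is set and points to an existing file.
# check_picard is a hypothetical helper name.
check_picard() {
    if [ -n "${PICARD:-}" ] && [ -f "$PICARD" ]; then
        echo "PICARD points to $PICARD"
    else
        echo "PICARD is unset or does not point to an existing file"
        return 1
    fi
}

check_picard || true
```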
pigz
To extract files more quickly, PEPATAC can also use pigz in place of gzip if you have it installed. Let's go ahead and do that now. It's not required, but it can help speed everything up when you have many samples to process.
cd /path/to/pepatac_tutorial/tools/
wget https://zlib.net/pigz/pigz-2.8.tar.gz
tar xvfz pigz-2.8.tar.gz
rm pigz-2.8.tar.gz
cd pigz-2.8/
make
Don't forget to add this to your PATH too!
export PATH="$PATH:/path/to/pepatac_tutorial/tools/pigz-2.8/"
That's it! Everything we need to run PEPATAC to its full potential should be installed. If you are interested and have experience using containers, you can check out the alternate installation methods.
3. Create environment variables
We also need to create some environment variables to help point looper to where we keep our data files and our tools. You may either set up the environment variables, as we're going to do now, or simply hard-code the necessary locations in the configuration files.
First, let's create a PROCESSED variable that represents the location where we want to save output.
export PROCESSED="/path/to/pepatac_tutorial/processed/"
Second, we'll create a variable named CODEBASE representing the root path to all our tools.
export CODEBASE="/path/to/pepatac_tutorial/tools/"
(Add these environment variables to your .bashrc or .profile so you don't have to repeat this step in every new shell.)
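One way to make the exports stick is to append them to your startup file, taking care not to add the same line twice. The following is a minimal sketch: persist_line is a hypothetical helper name, and it defaults to ~/.bashrc unless you pass a different file as the second argument.

```shell
# Append a line to a shell startup file only if it is not already there.
# persist_line is a hypothetical helper; the default target is ~/.bashrc.
persist_line() {
    line=$1
    rcfile=${2:-"$HOME/.bashrc"}
    touch "$rcfile"
    # -x matches whole lines, -F treats the line as a fixed string.
    if ! grep -qxF -- "$line" "$rcfile"; then
        printf '%s\n' "$line" >> "$rcfile"
    fi
}

# Example usage (commented out so nothing is changed by accident):
# persist_line 'export PROCESSED="/path/to/pepatac_tutorial/processed/"'
# persist_line 'export CODEBASE="/path/to/pepatac_tutorial/tools/"'
```

The same helper would work for the PATH exports from the earlier sections. New shells pick the lines up automatically; for the current shell, source the file or re-run the exports.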
Fantastic! Now that we have the pipeline and its requirements installed, we're ready to get our reference genome(s).
4. Download a reference genome
Before we analyze anything, we also need a reference genome. You can use our recommended approach, refgenie, or download the assets manually.
4a: Initialize refgenie and download assets
PEPATAC can utilize refgenie assets. Because assets are user-dependent, these files must still be available natively. Therefore, we need to install refgenie and initialize a refgenie config file. For example:
pip install refgenie
export REFGENIE=/path/to/your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE
Add the export REFGENIE line to your .bashrc or .profile to ensure it persists.
Next, pull the assets you need. Replace hg38 in the example below if you need to use a different genome assembly. If these assets are not available automatically for your genome of interest, then you'll need to build them. Download these required assets with this command:
refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb
refgenie build hg38/feat_annotation
PEPATAC also requires a bowtie2_index asset for any pre-alignment genomes:
refgenie pull rCRSd/fasta
refgenie pull rCRSd/bowtie2_index
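After pulling, you may want to confirm that each asset resolves to a local path. The sketch below uses refgenie seek for that. check_assets is a hypothetical helper name, and the sketch assumes that refgenie seek exits with a non-zero status when an asset is not available locally; if your refgenie version behaves differently, treat the output as a hint only.

```shell
# Print the local path of each requested asset, or flag it as unavailable.
# check_assets is a hypothetical helper; it relies on $REFGENIE being set
# to the genome config file created above.
check_assets() {
    status=0
    for asset in "$@"; do
        if path=$(refgenie seek -c "$REFGENIE" "$asset" 2>/dev/null); then
            printf '%s\t%s\n' "$asset" "$path"
        else
            printf '%s\tNOT AVAILABLE\n' "$asset"
            status=1
        fi
    done
    return "$status"
}

# Example usage (commented out; requires refgenie and a configured $REFGENIE):
# check_assets hg38/fasta hg38/bowtie2_index rCRSd/fasta rCRSd/bowtie2_index
```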
4b: Download assets manually
If you prefer not to use refgenie, you can also download and construct assets manually. The minimum required assets for a genome include:
- a FASTA file for the genome of interest
- a chromosome sizes file: a text file containing "chr" and "size" columns.
- a bowtie2 genome index.
Optional assets include:
- a TSS annotation file: a BED file containing "chr", "start", "end", "gene name", "score", and "strand" columns.
- a region blacklist: e.g. the ENCODE blacklist
- a genomic feature annotation file
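If you already have a FASTA file, the chromosome sizes file and the bowtie2 index can both be derived from it. The sketch below is one possible way to do so, not the pipeline's own procedure: fasta_chrom_sizes is a hypothetical helper, genome.fa and the output names are placeholders, and it assumes the sizes file is plain tab-separated text without a header row. If samtools is installed, running samtools faidx and keeping the first two columns of the resulting .fai file should give the same table.

```shell
# Print "name<TAB>length" for every sequence in a FASTA file.
# Only the first word of each header line is used as the sequence name.
fasta_chrom_sizes() {
    awk '
        /^>/ {
            # A new header: emit the previous record, then start counting again.
            if (name != "") printf "%s\t%d\n", name, len
            name = substr($1, 2); len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") printf "%s\t%d\n", name, len }
    ' "$1"
}

# Hypothetical usage; genome.fa and the output names are placeholders:
# fasta_chrom_sizes genome.fa > genome.chrom.sizes
# bowtie2-build genome.fa genome      # writes the index files with prefix "genome"
```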
5. Confirm installation
After setting up your environment to run PEPATAC, you can confirm which means of running the pipeline are now executable using the included checkinstall script. This can either be run directly from the pepatac/ repository:
./checkinstall
or from the web:
curl -sSL https://raw.githubusercontent.com/databio/pepatac/checkinstall | bash