Clone the pipeline using SSH:
git clone git@github.com:databio/pepatac.git
You have two options for software prerequisites: 1) use a container, or 2) install all prerequisites natively. If you want to use a container, you need only either docker or singularity; please see the instructions in how to run PEPATAC in a container. Otherwise, follow these instructions to install the requirements natively:
PEPATAC uses several packages under the hood. Make sure you're up-to-date with a user-specific install:
cd pepatac
pip install --user -r requirements.txt
We will need some common bioinformatics tools installed: bedtools (v2.25.0+), bowtie2 (v2.2.9+), fastqc (v0.11.5+), samblaster (v0.1.24+), samtools (v1.7+), skewer (v0.1.126+), UCSC tools (bedGraphToBigWig, wigToBigWig, bigWigCat, bedToBigBed), pigz (v2.3.4+). You should follow instructions to install each individual program. If you need help installing these, see the detailed installation instructions.
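A quick way to see which of these tools are still missing is to ask the shell whether each executable is on your PATH. The following is a minimal sketch, not part of PEPATAC: the helper name `find_missing_tools` is hypothetical, and it only checks that each program is present, not that its version meets the minimums listed above.

```shell
#!/usr/bin/env bash
# Executables named in the prerequisites list above.
required_tools="bedtools bowtie2 fastqc samblaster samtools skewer bedGraphToBigWig wigToBigWig bigWigCat bedToBigBed pigz"

# Hypothetical helper: print the name of every argument that is not on PATH.
find_missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# $required_tools is left unquoted on purpose so it splits into one word per tool.
missing=$(find_missing_tools $required_tools)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

Anything reported as missing still needs to be installed by following that program's own instructions.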
PEPATAC uses R to generate quality control and read/peak annotation plots, so you'll need a working R installation if you want these outputs. We have packaged all the R code into a supporting package called pepatacr. Install it with:
Rscript -e "install.packages('pepatacr')"
That's it! Everything we need to run PEPATAC to its full potential should be installed.
Whether using the container or native version, you will need to provide reference genome assemblies produced by refgenie. Any prealignments you want to use will also require refgenie assemblies. You may download pre-indexed references for common genomes, or you may index your own (see refgenie instructions). For this example, let's grab the hg38 genome to use as our primary assembly, and for prealignments we'll use rCRSd (Revised Cambridge Reference Sequence for human mtDNA) and human_repeats.
wget http://big.databio.org/refgenomes/hg38.tgz
wget http://big.databio.org/refgenomes/human_repeats_170502.tgz
wget http://big.databio.org/refgenomes/rCRSd_170502.tgz
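The downloads are gzipped tar archives, so they need to be unpacked before the pipeline can read them. The sketch below shows one way to do that; the helper name `unpack_archives` is hypothetical, and it makes no assumption about the folder names inside each archive, so check the result and rename folders if your configuration expects different names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack every .tgz archive in SRC_DIR into DEST_DIR.
# Usage: unpack_archives SRC_DIR DEST_DIR
unpack_archives() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    for archive in "$src_dir"/*.tgz; do
        # Skip the literal pattern when the glob matched no files.
        [ -e "$archive" ] || continue
        echo "Unpacking $archive into $dest_dir"
        tar -xzf "$archive" -C "$dest_dir"
    done
}
```

For example, `unpack_archives . "$HOME/genomes"` would unpack the archives in the current directory into a genomes folder of your choosing (the path here is only a placeholder).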
At this point, you could choose to extend PEPATAC by adding a few additional files into your refgenie assembly. For more details, see how to create a custom annotation file to explore using your own features of interest.
Once you've obtained assemblies for all genomes you wish to use, you must point the pipeline to where you store them. You can do this by adjusting the resources.genomes attribute in the pipeline config file. By default, this points to the shell variable $GENOMES, so all you have to do is set an environment variable to the location of your refgenie genomes.
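Setting the variable could look like the following. The path is only a placeholder (an assumption for illustration); substitute the folder that actually holds your refgenie assemblies.

```shell
# Placeholder location: replace with the folder that holds your refgenie genomes.
export GENOMES="$HOME/genomes"

# Confirm the value that child processes such as the pipeline will see.
echo "GENOMES is set to: $GENOMES"
```

An `export` only lasts for the current shell session, so add the line to your shell startup file (for example `~/.bashrc`) if you want it to persist.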
The pipeline at its core is just a Python script, and you can run it on the command line for a single sample (see command-line usage). You can also print the usage information on the command line by running pipelines/pepatac.py --help. You just need to pass a few command-line parameters to specify sample name, reference genome, input files, etc. Here's the basic command to run a small test example through the pipeline:
pipelines/pepatac.py --single-or-paired paired \
  --prealignments rCRSd human_repeats \
  --genome hg38 \
  --sample-name test1 \
  --input examples/data/test1_r1.fastq.gz \
  --input2 examples/data/test1_r2.fastq.gz \
  --genome-size hs \
  -O $HOME/pepatac_test
This example should take about 15 minutes to complete. See other example commands that use test data.
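Before launching a run, it can help to confirm that the genome folder and the input files are where the command expects them. The sketch below is one way to do that; `preflight` is a hypothetical helper, not something the pipeline provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a genome folder and input files exist.
# Usage: preflight GENOMES_DIR FILE...
# Returns 0 when everything is found and 1 otherwise.
preflight() {
    genomes_dir=$1
    shift
    status=0
    if [ -z "$genomes_dir" ] || [ ! -d "$genomes_dir" ]; then
        echo "genome folder not found: '$genomes_dir'" >&2
        status=1
    fi
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "input file not found: $f" >&2
            status=1
        fi
    done
    return $status
}
```

For the test example, this could be called from the pepatac folder as `preflight "$GENOMES" examples/data/test1_r1.fastq.gz examples/data/test1_r2.fastq.gz`, running the pipeline command only if it succeeds.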
This is just the beginning. For your next step, take a look at the user guides.
Any questions? Feel free to reach out to us. Otherwise, go analyze some ATAC-seq!