Run in a container.
A popular approach is to install all dependencies in a container and use that single container. This container can be used with either docker or singularity. You can run PEPATAC as an individual pipeline on a single sample using these containers by directly calling docker run or singularity exec. Or, you can rely on looper, which is already set up to run any pipeline in existing containers using the divvy templating system.
Running PEPATAC using a single, monolithic container.
1: Clone the PEPATAC pipeline
git clone https://github.com/databio/pepatac.git
2: Get genome assets
We recommend refgenie to manage all required and optional genome assets. However, PEPATAC can also accept file paths to any of the assets.
2a: Initialize refgenie and download assets
PEPATAC can use refgenie assets for alignment and annotation. Because assets are user-dependent, these files must still exist outside of a container system. We need to install refgenie and initialize a refgenie config file. For example:
pip install refgenie
export REFGENIE=/path/to/your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE
Add the export REFGENIE line to your .bashrc or .profile to ensure it persists.
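One way to make the export persistent without adding it twice is sketched below. The helper name persist_refgenie is hypothetical (it is not part of refgenie), and the sketch assumes a POSIX-style shell; it only appends the line when an identical line is not already present in the startup file.

```shell
# Hypothetical helper (not part of refgenie): append the REFGENIE export
# to a shell startup file only if that exact line is not already there.
persist_refgenie() {
    config_path="$1"                 # path to your genome_config.yaml
    rc_file="${2:-$HOME/.bashrc}"    # startup file to modify (defaults to ~/.bashrc)
    line="export REFGENIE=$config_path"
    touch "$rc_file"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$line" "$rc_file"; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}
```

For example, persist_refgenie /path/to/your_genome_folder/genome_config.yaml would update ~/.bashrc, and a second call with the same path would leave the file unchanged.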
Next, pull the assets you need. Replace hg38 in the example below if you need to use a different genome assembly. If these assets are not available automatically for your genome of interest, then you'll need to build them.
refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb
refgenie build hg38/feat_annotation
PEPATAC also requires a bowtie2_index asset for any pre-alignment genomes:
refgenie pull rCRSd/bowtie2_index
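If you switch assemblies often, it can help to generate the pull command rather than edit it by hand. The following is a minimal dry-run sketch: pull_command is a hypothetical helper that only prints the command, using the same asset names as the hg38 example above. Whether each asset is actually available for another assembly is not guaranteed.

```shell
# Hypothetical helper: print (but do not run) the refgenie pull command
# for a given genome, using the asset names from the hg38 example.
pull_command() {
    genome="$1"
    args=""
    for asset in fasta bowtie2_index refgene_anno ensembl_gtf ensembl_rb; do
        args="$args $genome/$asset"
    done
    echo "refgenie pull$args"
}

# Dry run for hg38: prints the same command shown in the example above.
pull_command hg38
```

Once the printed command looks right, you can copy it into your terminal to perform the actual download.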
2b: Download assets manually
If you prefer not to use refgenie, you can also download and construct assets manually. Again, because these are user-defined assets, they must exist outside of any container system. The minimum required assets for a genome include:
- a chromosome sizes file: a text file containing "chr" and "size" columns.
- a bowtie2 genome index.
Optional assets include:
- a TSS annotation file: a BED file containing "chr", "start", "end", "gene name", "score", and "strand" columns.
- a region blacklist: e.g. the ENCODE blacklist
- a genomic feature annotation file
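If you construct the chromosome sizes file yourself, a quick sanity check can catch formatting problems before the pipeline runs. The sketch below assumes the conventional layout of two tab-separated columns with no header line, which is an assumption on our part rather than something PEPATAC enforces in this form; check_chrom_sizes is a hypothetical helper name.

```shell
# Hypothetical helper: verify that every line of a chromosome sizes file has
# exactly two tab-separated fields and that the second is a whole number.
# Assumes no header line. Returns non-zero if any line is malformed.
check_chrom_sizes() {
    awk -F '\t' '
        NF != 2 || $2 !~ /^[0-9]+$/ {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, check_chrom_sizes /path/to/your.chrom.sizes prints each offending line and returns a non-zero status if the file does not match the expected layout.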
You can obtain the minimally required pre-constructed --chrom-sizes and --genome-index files from the refgenie servers. Refgenie uses algorithmically derived genome digests under the hood to unambiguously define genomes, and those digests are what you'll see in the example below when we manually download these assets: 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 is the digest for the human-readable alias "hg38", and 94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4 is the digest for "rCRSd."
wget -O hg38.fasta.tgz "http://rg.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta?tag=default"
wget -O hg38.bowtie2_index.tgz "http://rg.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bowtie2_index?tag=default"
wget -O rCRSd.bowtie2_index.tgz "http://refgenomes.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index?tag=default"
Then, extract these files:
tar xvf hg38.fasta.tgz
tar xvf hg38.bowtie2_index.tgz
tar xvf rCRSd.bowtie2_index.tgz
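The download URLs above follow a regular pattern: server, genome digest, asset name, and tag. As a sketch, a small function can assemble that URL for you. The name asset_url is hypothetical, and the sketch assumes the refgenomes.databio.org host and the "default" tag used in the commands above; other servers or tags may differ.

```shell
# Hypothetical helper: build an asset archive URL following the pattern
# used in the wget commands above. The host and the "default" tag are
# taken from those commands and are assumptions for any other server.
asset_url() {
    digest="$1"          # genome digest, e.g. the hg38 digest shown above
    asset="$2"           # asset name, e.g. fasta or bowtie2_index
    tag="${3:-default}"  # asset tag
    echo "http://refgenomes.databio.org/v3/assets/archive/$digest/$asset?tag=$tag"
}
```

You could then download with something like wget -O hg38.fasta.tgz "$(asset_url 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 fasta)", keeping the quotes so the shell does not try to interpret the ? in the URL.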
3. Pull the container image.
Docker: You can pull the docker databio/pepatac image from dockerhub like this:
docker pull databio/pepatac
Or build the image using the included Dockerfile (you can use a recipe in the included Makefile in the pepatac/ repository):
make docker
Singularity: You can download the singularity image or build it from the docker image using the Makefile:
make singularity
Now you'll need to tell the pipeline where you saved the singularity image. You can either create an environment variable called $SIMAGES that points to the folder where your image is stored, or you can tweak the pipeline_interface.yaml file so that the compute.singularity_image attribute is pointing to the right location on disk.
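Before running anything, you may want to confirm that $SIMAGES really points at a folder containing the image. The sketch below assumes the image file is named pepatac, as in the singularity exec example in this guide; find_image is a hypothetical helper, not something shipped with the pipeline.

```shell
# Hypothetical helper: print the image path if it exists under $SIMAGES,
# otherwise report the problem on stderr and return non-zero.
# Assumes the image file is named "pepatac" unless another name is given.
find_image() {
    if [ -z "${SIMAGES:-}" ]; then
        echo "SIMAGES is not set" >&2
        return 1
    fi
    image="$SIMAGES/${1:-pepatac}"
    if [ -f "$image" ]; then
        echo "$image"
    else
        echo "no image found at $image" >&2
        return 1
    fi
}
```

After something like export SIMAGES=/path/to/your_images_folder, calling find_image should print the full path to the image if everything is in place.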
6. Confirm installation
After setting up your environment to run PEPATAC using containers, you can confirm the pipeline is now executable with your container system using the included checkinstall script. This can either be run directly from the pepatac/ repository...
./checkinstall
or from the web:
curl -sSL https://raw.githubusercontent.com/databio/pepatac/master/checkinstall | bash
4. Run individual samples in a container
Individual jobs can be run in a container by simply running the pepatac.py command through docker run or singularity exec. You can run containers either on your local computer, or in an HPC environment, as long as you have docker or singularity installed. You will need to include any volumes that contain data required by the pipeline. For example, to utilize refgenie assets you'll need to ensure the volume containing those files is available. In the following example, we are including an environment variable ($GENOMES) which points to such a directory.
For example, run it locally in singularity like this:
singularity exec --bind $GENOMES $SIMAGES/pepatac pipelines/pepatac.py --help
With docker, you can use:
docker run --rm -it databio/pepatac pipelines/pepatac.py --help
Be sure to mount the volumes you need with --volume. If you're utilizing any environment variables (e.g. $REFGENIE), don't forget to include those in your docker command with the -e option.
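To illustrate how --volume and -e fit together, here is a dry-run sketch that only prints the docker command it would run. The function name docker_cmd is hypothetical, and the sketch assumes your refgenie config file lives somewhere under the directory you mount, so that the path in $REFGENIE is valid inside the container as well.

```shell
# Hypothetical helper: print (but do not run) a docker command that mounts
# a genome directory at the same path inside the container and passes the
# REFGENIE variable through with -e.
docker_cmd() {
    genomes_dir="$1"   # host directory holding the refgenie config and assets
    shift              # remaining arguments are the command to run
    echo docker run --rm -it \
        --volume "$genomes_dir:$genomes_dir" \
        -e "REFGENIE=$REFGENIE" \
        databio/pepatac "$@"
}
```

For instance, with REFGENIE set to a config file under your genome folder, docker_cmd /path/to/your_genome_folder pipelines/pepatac.py --help prints a complete command you can inspect before running it yourself.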
5. Running multiple samples in a container with looper
To run multiple samples in a container, you simply need to configure looper to use a container-compatible template. The looper documentation has instructions for running jobs in containers.
Container details
Using docker
The pipeline has been successfully run in both Linux and macOS environments. With docker, you need to bind mount the volume that contains the pipeline and your genome assets, and provide the container with the same environment variables your host environment is using.
In the first example, we're mounting our home directory (/home/jps3ag/), which contains the parent directories of our genome assets and of the pipeline itself. We'll also provide the pipeline with environment variables, such as $HOME.
Here's that example command in a Linux environment to run the test example through the pipeline (using the manually downloaded genome assets):
docker run --rm -it --volume /home/jps3ag/:/home/jps3ag/ \
-e HOME='/home/jps3ag/' \
databio/pepatac \
/home/jps3ag/src/pepatac/pipelines/pepatac.py --single-or-paired paired \
--prealignment-index rCRSd=/home/jps3ag/src/pepatac/default/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4 \
--genome hg38 \
--genome-index /home/jps3ag/src/pepatac/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
--chrom-sizes /home/jps3ag/src/pepatac/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
--sample-name test1 \
--input /home/jps3ag/src/pepatac/examples/data/test1_r1.fastq.gz \
--input2 /home/jps3ag/src/pepatac/examples/data/test1_r2.fastq.gz \
--genome-size hs \
-O $HOME/pepatac_test
In this second example, we'll perform the same command in a macOS environment using Docker for Mac.
This necessitates a few minor changes to run that same example:
- replace /home/ with the /Users/ format, e.g. --volume /Users/jps3ag/:/Users/jps3ag/
Remember to allocate sufficient memory (6-8GB should generally be adequate) in Docker for Mac.
docker run --rm -it --volume /Users/jps3ag/:/Users/jps3ag/ \
-e HOME="/Users/jps3ag/" \
databio/pepatac \
/Users/jps3ag/src/pepatac/pipelines/pepatac.py --single-or-paired paired \
--prealignment-index rCRSd=/Users/jps3ag/src/pepatac/default/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4 \
--genome hg38 \
--genome-index /Users/jps3ag/src/pepatac/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
--chrom-sizes /Users/jps3ag/src/pepatac/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
--sample-name test1 \
--input /Users/jps3ag/src/pepatac/examples/data/test1_r1.fastq.gz \
--input2 /Users/jps3ag/src/pepatac/examples/data/test1_r2.fastq.gz \
--genome-size hs \
-O $HOME/pepatac_test
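The only systematic difference between the Linux and macOS commands is the home directory prefix. If you keep paths in a script, a tiny translation function can handle this for you. This is a sketch only: to_mac_path is a hypothetical helper and it assumes your macOS username matches your Linux username.

```shell
# Hypothetical helper: rewrite a Linux-style home path (/home/...) to the
# macOS layout (/Users/...). Paths outside /home/ are printed unchanged.
to_mac_path() {
    case "$1" in
        /home/*) echo "/Users/${1#/home/}" ;;
        *)       echo "$1" ;;
    esac
}

# Example with the input file used in the commands above.
to_mac_path /home/jps3ag/src/pepatac/examples/data/test1_r1.fastq.gz
```

The example call prints /Users/jps3ag/src/pepatac/examples/data/test1_r1.fastq.gz, matching the --input path in the macOS command.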
Using singularity
First, build a singularity container from the docker image and create a running instance:
singularity build pepatac docker://databio/pepatac:latest
singularity instance start -B /home/jps3ag/:/home/jps3ag/ pepatac pepatac_instance
Second, run your command.
singularity exec instance://pepatac_instance \
/home/jps3ag/src/pepatac/pipelines/pepatac.py --single-or-paired paired \
--prealignment-index rCRSd=/home/jps3ag/src/pepatac/default/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4 \
--genome hg38 \
--genome-index /home/jps3ag/src/pepatac/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 \
--chrom-sizes /home/jps3ag/src/pepatac/default/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4.chrom.sizes \
--sample-name test1 \
--input /home/jps3ag/src/pepatac/examples/data/test1_r1.fastq.gz \
--input2 /home/jps3ag/src/pepatac/examples/data/test1_r2.fastq.gz \
--genome-size hs \
-O $HOME/pepatac_test
Third, close your instance when finished.
singularity instance stop pepatac_instance
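The three steps above (start, exec, stop) can be wrapped in one function so the instance is always stopped, even when the pipeline command fails. This is a minimal sketch: with_instance is a hypothetical helper, and the SINGULARITY variable is an assumption of this sketch that lets you substitute another executable, for example echo for a dry run.

```shell
# Hypothetical helper: start an instance, run one command in it, and always
# stop the instance afterwards. Set SINGULARITY=echo to print the commands
# instead of running them (the SINGULARITY variable exists only in this sketch).
with_instance() {
    image="$1"   # image file, e.g. pepatac
    name="$2"    # instance name, e.g. pepatac_instance
    bind="$3"    # bind specification, e.g. /home/jps3ag/:/home/jps3ag/
    shift 3      # remaining arguments are the command to run in the instance
    "${SINGULARITY:-singularity}" instance start -B "$bind" "$image" "$name" || return 1
    "${SINGULARITY:-singularity}" exec "instance://$name" "$@"
    rc=$?        # remember the exit code of the command itself
    "${SINGULARITY:-singularity}" instance stop "$name"
    return $rc
}
```

A dry run such as SINGULARITY=echo followed by with_instance pepatac pepatac_instance /home/jps3ag/:/home/jps3ag/ /home/jps3ag/src/pepatac/pipelines/pepatac.py --help shows the three commands in order without touching singularity.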