Genome assets

PEPATAC can use either manually constructed or refgenie-managed assets. Refgenie streamlines sample processing: once assets are built or pulled with refgenie, PEPATAC needs only a few arguments to use all of them. Pipeline assets include:

Required

| PEPATAC argument | refgenie asset name | Description |
|---|---|---|
| --genome-index | bowtie2_index | A genome index file constructed from bowtie2-build. |
| | bwa_index | A genome index file constructed from bwa index. Required when using bwa (optional) for alignment. |
| --chrom-sizes | With refgenie, this asset is built automatically when you build/pull the fasta asset. | A text file containing "chr" and "size" columns. |

Optional

| PEPATAC argument | refgenie asset name | Description |
|---|---|---|
| --prealignment-names | Human-readable genome alias(es) for refgenie-managed bowtie2_index asset(s). | A space-delimited list of genome names, e.g. ["rCRSd", "human_repeats"]. |
| --prealignment-index | bowtie2_index | A genome index file constructed from bowtie2-build. Used for manually pointing to prealignment genome indices when using bowtie2 (default) for alignment. |
| | bwa_index | A genome index file constructed from bwa index. Used for manually pointing to prealignment genome indices when using bwa for alignment. |
| --TSS-name | refgene_anno. Use refgenie build/pull with this asset to obtain the TSS annotation file. | Transcription start site (TSS) annotations, e.g. refGene.txt.gz. |
| --blacklist | blacklist | A region blacklist, e.g. the ENCODE blacklist. |
| --anno-name | feat_annotation | A BED-style file with "chr", "start", "end", "genomic feature name", "score" and "strand" columns. |
| --search-file | tallymer_index. The search_file is built from this refgenie asset. | File used to search an index of k-mers in the genome of the same size as input read lengths. Only required for the --sob argument (i.e. using seqOutBias for enzyme bias correction). |
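
When building a --chrom-sizes file by hand, a quick sanity check can catch formatting mistakes before the pipeline runs. The sketch below is not part of PEPATAC; check_chrom_sizes is a hypothetical helper, and it assumes the file has no header line and exactly two whitespace-separated fields per line (a name and a positive integer size).

```shell
# Hypothetical helper (not part of PEPATAC): succeed only if the file is
# non-empty and every line has exactly two whitespace-separated fields,
# the second being an all-digit, non-zero size.
check_chrom_sizes() {
    awk '
        NF != 2 || $2 !~ /^[0-9]+$/ || $2 + 0 == 0 { bad = 1; exit }
        END { exit (bad || NR == 0) }
    ' "$1"
}

# Usage:
#   check_chrom_sizes hg38.chrom.sizes && echo "chrom sizes file looks OK"
```

The check is deliberately loose: it does not compare chromosome names against the genome index, so a passing file can still be the wrong assembly.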

Using refgenie managed assets

PEPATAC can use refgenie assets. Because assets are user-dependent, these files must be available locally. You therefore need to install refgenie and initialize a refgenie config file. For example:

pip install refgenie
export REFGENIE=/path/to/your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE

Add the export REFGENIE line to your .bashrc or .profile to ensure it persists.
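
One way to add that line without duplicating it on repeated runs is sketched below. persist_refgenie is a hypothetical helper name, and the choice of rc file is left to you.

```shell
# Hypothetical helper: append the export line to an rc file only if an
# identical line is not already present, so re-running it is harmless.
persist_refgenie() {
    rc_file=$1
    config=$2
    line="export REFGENIE=\"$config\""
    touch "$rc_file"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$rc_file" || printf '%s\n' "$line" >> "$rc_file"
}

# Usage:
#   persist_refgenie "$HOME/.bashrc" /path/to/your_genome_folder/genome_config.yaml
```

Open a new shell (or source the rc file) afterwards so that the variable is set in your session.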

Next, pull the assets you need. Replace hg38 in the example below if you need to use a different genome assembly. If an asset is not available to pull for your genome of interest, you'll need to build it. For hg38, pull the standard assets and build feat_annotation like so:

refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb hg38/blacklist
refgenie build hg38/feat_annotation

PEPATAC also requires a bowtie2_index asset for any prealignment genomes:

refgenie pull rCRSd/fasta rCRSd/bowtie2_index human_repeats/fasta human_repeats/bowtie2_index

If you prefer bwa for alignment, you would use the refgenie bwa_index instead.
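
The pull command for prealignment genomes differs between aligners only in the index asset name, so it can be generated rather than typed out. This is a minimal sketch: prealignment_assets is a hypothetical helper, and it only prints the command as a dry run. Whether a given bwa_index asset is available to pull depends on the refgenie server; if it is not, you would need to build it.

```shell
# Hypothetical helper: print "genome/fasta genome/<index>" for each genome.
# $1 is the index asset name (bowtie2_index or bwa_index); the remaining
# arguments are genome names.
prealignment_assets() {
    index=$1
    shift
    out=""
    for genome in "$@"; do
        out="$out${out:+ }$genome/fasta $genome/$index"
    done
    printf '%s\n' "$out"
}

# Dry run: print the command instead of executing it.
echo refgenie pull $(prealignment_assets bwa_index rCRSd human_repeats)
```

Drop the leading echo to actually run the pull once the printed command looks right.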

Furthermore, you can learn more about using seqOutBias and the required tallymer_index here.

Example using refgenie managed assets

When using refgenie, you only need to provide the --genome and --prealignment-names arguments to give the pipeline every required index and optional annotation file that exists for those genomes. This means the TSS file, feature annotation file, and blacklist will all be used without needing to directly specify the paths to these files.

From the pepatac/ repository directory:

looper run examples/test_project/test_config_refgenie.yaml

Using manually managed assets

Assets may also be managed manually and specified directly to the pipeline. While this frees you from needing refgenie installed and initialized, it does require a few more arguments to be specified.

Custom blacklisted regions may be specified using --blacklist </path/to/your_blacklist.bed.gz>. The blacklist file must be a BED-formatted file. The refgenie blacklist asset uses the ENCODE blacklists by default.

The TSS annotation file may be specified using --TSS-name </path/to/your_TSS_annotations.bed>. This is also a BED-formatted file.

The feat_annotation asset may also be directly specified using --anno-name </path/to/your_custom_feature_annotations.bed.gz>. Read more about using custom reference data.
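
Since all three manually supplied files are BED-style, a rough structural check can be run before handing them to the pipeline. The sketch below is not a PEPATAC feature; check_bed is a hypothetical helper, and it assumes tab-separated fields with integer start and end coordinates in columns 2 and 3.

```shell
# Hypothetical helper: read BED text on stdin and fail on any data line that
# has fewer than three tab-separated fields, non-integer coordinates, or a
# start greater than its end.
check_bed() {
    awk -F '\t' '
        /^(#|track |browser )/ { next }   # skip comment and header lines
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
            bad = 1
            exit
        }
        END { exit (bad + 0) }
    '
}

# Usage (plain or gzipped input):
#   check_bed < your_TSS_annotations.bed
#   gzip -cd your_blacklist.bed.gz | check_bed
```

This only checks the first three columns; it does not verify the additional "genomic feature name", "score" and "strand" columns expected in a feature annotation file.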

Example using manually managed assets

Even when not using refgenie, you can still grab premade --chrom-sizes and --genome-index files from the refgenie servers. Refgenie uses algorithmically derived genome digests under the hood to unambiguously define genomes, and these digests appear in the download URLs in the example below. Here, 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 is the digest for the human-readable alias "hg38", and 94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4 is the digest for "rCRSd".

wget -O hg38.fasta.tgz 'http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta?tag=default'
wget -O hg38.bowtie2_index.tgz 'http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bowtie2_index?tag=default'

wget -O rCRSd.fasta.tgz 'http://refgenomes.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta?tag=default'
wget -O rCRSd.bowtie2_index.tgz 'http://refgenomes.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index?tag=default'
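
The download URLs above all follow one pattern: a genome digest, an asset name, and a tag. As a minimal sketch, the pattern can be wrapped in a small function. refgenie_archive_url is a hypothetical helper inferred only from the URLs shown here, so treat the URL layout as an assumption rather than a documented API contract.

```shell
# Hypothetical helper: build the archive URL for one asset from a genome
# digest, following the pattern of the URLs used in this section.
refgenie_archive_url() {
    digest=$1
    asset=$2
    tag=${3:-default}   # the examples in this section all use tag=default
    printf 'http://refgenomes.databio.org/v3/assets/archive/%s/%s?tag=%s\n' \
        "$digest" "$asset" "$tag"
}

# Digest for the "hg38" alias, as given in the text above.
HG38=2230c535660fb4774114bfa966a62f823fdb6d21acf138d4
refgenie_archive_url "$HG38" bowtie2_index
```

The printed URL can be passed to wget in quotes, for example wget -O hg38.bowtie2_index.tgz "$(refgenie_archive_url "$HG38" bowtie2_index)".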

Then, extract these files:

tar xvf hg38.fasta.tgz
tar xvf hg38.bowtie2_index.tgz
tar xvf rCRSd.fasta.tgz
tar xvf rCRSd.bowtie2_index.tgz

From the pepatac/ repository folder (using the manually downloaded genome assets):

looper run examples/test_project/test_config.yaml