Genome assets

PEPATAC can use either manually constructed or refgenie-managed assets. Refgenie streamlines sample processing: once assets are built by refgenie, PEPATAC needs only a few arguments to use all of them. Pipeline assets include:

Required

`PEPATAC` argument	`refgenie` asset name	Description
`--genome-index`	`bowtie2_index`	A genome index file constructed from `bowtie2-build`
	`bwa_index`	A genome index file constructed from `bwa index`. Required when using `bwa` (optional) for alignment.
`--chrom-sizes`	With `refgenie`, this asset is built automatically when you build/pull the `fasta` asset.	A text file containing "chr" and "size" columns.
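
When assets are managed manually, a chromosome sizes file can be derived from the genome FASTA itself. The sketch below shows one common way to do that with `awk`. The function name `chrom_sizes_from_fasta` and the file names are illustrative, and the sketch assumes that a headerless, tab-separated, two-column file is what `--chrom-sizes` expects.

```shell
# chrom_sizes_from_fasta is a hypothetical helper, not part of PEPATAC or refgenie.
# It prints "<sequence name><TAB><length>" for each record of an uncompressed FASTA.
chrom_sizes_from_fasta() {
    awk '
        /^>/ {
            # A new record starts: emit the previous one, if any.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # drop ">" and anything after the first space
            len = 0
            next
        }
        { len += length($0) }      # sequence lines add to the running length
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example with placeholder file names:
# chrom_sizes_from_fasta genome.fa > genome.chrom.sizes
```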

Optional

`PEPATAC` argument	`refgenie` asset name	Description
`--prealignment-names`	Human readable genome alias(es) for `refgenie` managed `bowtie2_index` asset(s).	A space-delimited list of genome names. e.g. ["rCRSd", "human_repeats"]
`--prealignment-index`	`bowtie2_index`	A genome index file constructed from `bowtie2-build`. Used for manually pointing to prealignment genome indices when using `bowtie2` (default) for alignment.
	`bwa_index`	A genome index file constructed from `bwa index`. Used for manually pointing to prealignment genome indices when using `bwa` for alignment.
`--TSS-name`	`refgene_anno` (building or pulling this asset with `refgenie` provides the TSS annotation file)	Transcription start site (TSS) annotations, e.g. refGene.txt.gz
`--blacklist`	`blacklist`	A region blacklist. e.g. the ENCODE blacklist
`--anno-name`	`feat_annotation`	A BED-style file with "chr", "start", "end", "genomic feature name", "score" and "strand" columns.
`--search-file`	`tallymer_index` (the `search_file` is built from this `refgenie` asset)	File used to search a genome index of k-mers whose length matches the input read length. Only required for the `--sob` argument (i.e. using `seqOutBias` for enzyme bias correction).

Using `refgenie` managed assets

PEPATAC can use refgenie assets. Because assets are user-dependent, these files must be available on your local system. You therefore need to install refgenie and initialize a refgenie config file. For example:

pip install refgenie
export REFGENIE=/path/to/your_genome_folder/genome_config.yaml
refgenie init -c $REFGENIE

Add the export REFGENIE line to your .bashrc or .profile to ensure it persists.
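
One way to add that line without duplicating it on repeated runs is sketched below. `persist_refgenie` is a hypothetical helper, not a refgenie command, and the profile path is whichever file your shell actually reads.

```shell
# persist_refgenie CONFIG_PATH PROFILE_FILE
# Hypothetical helper: appends the REFGENIE export to a shell profile
# only if that exact line is not already present.
persist_refgenie() {
    config_path="$1"   # path to genome_config.yaml
    profile="$2"       # e.g. "$HOME/.bashrc" or "$HOME/.profile"
    line="export REFGENIE=$config_path"
    touch "$profile"
    if ! grep -qxF "$line" "$profile"; then
        printf '%s\n' "$line" >> "$profile"
    fi
}

# Example (commented out so that pasting this snippet changes nothing):
# persist_refgenie /path/to/your_genome_folder/genome_config.yaml "$HOME/.bashrc"
```

Open a new shell, or source the profile file, for the variable to take effect.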

Next, pull the assets you need. Replace hg38 in the example below if you use a different genome assembly. If these assets are not available for download for your genome of interest, you'll need to build them. Pull the standard assets for hg38, then build the feat_annotation asset, like so:

refgenie pull hg38/fasta hg38/bowtie2_index hg38/refgene_anno hg38/ensembl_gtf hg38/ensembl_rb hg38/blacklist
refgenie build hg38/feat_annotation

PEPATAC also requires a bowtie2_index asset for any prealignment genomes:

refgenie pull rCRSd/fasta rCRSd/bowtie2_index human_repeats/fasta human_repeats/bowtie2_index

If you prefer bwa for alignment, you would use the refgenie bwa_index instead.
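
The commands in this section follow one pattern per genome, so they can be generated for a different assembly or aligner. The sketch below only prints the commands (a dry run) and does not call refgenie. `print_refgenie_commands` is a hypothetical helper, and the asset names are the ones used in this section.

```shell
# print_refgenie_commands INDEX_ASSET GENOME [PREALIGNMENT_GENOME...]
# INDEX_ASSET is bowtie2_index (default aligner) or bwa_index.
# Hypothetical helper: it prints commands and does not run them.
print_refgenie_commands() {
    index_asset="$1"
    genome="$2"
    shift 2
    echo "refgenie pull $genome/fasta $genome/$index_asset $genome/refgene_anno $genome/ensembl_gtf $genome/ensembl_rb $genome/blacklist"
    echo "refgenie build $genome/feat_annotation"
    # Each prealignment genome needs only its fasta and alignment index.
    for pre in "$@"; do
        echo "refgenie pull $pre/fasta $pre/$index_asset"
    done
}

print_refgenie_commands bowtie2_index hg38 rCRSd human_repeats
```

Review the printed commands before running them. Whether a given asset can be pulled or has to be built depends on what the refgenie server provides for that genome.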

The PEPATAC documentation on seqOutBias describes its use and the required tallymer_index in more detail.

Example using `refgenie` managed assets

When using refgenie, you only need to provide the --genome and --prealignment-names arguments; the pipeline then finds every required index and optional annotation file that exists for those genomes. This means the TSS file, feature annotation file, and blacklist are all used without you needing to specify their paths directly.

From the pepatac/ repository directory:

looper run examples/test_project/test_config_refgenie.yaml
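
Because this run depends on the REFGENIE environment variable, a quick pre-flight check can catch a missing or mistyped config path before the pipeline starts. This is a minimal sketch; `check_refgenie_config` is a hypothetical helper and only tests that the variable is set and points to an existing file.

```shell
# check_refgenie_config: hypothetical pre-flight check.
# Returns non-zero if REFGENIE is unset or does not point to a file.
check_refgenie_config() {
    if [ -z "${REFGENIE:-}" ]; then
        echo "REFGENIE is not set" >&2
        return 1
    fi
    if [ ! -f "$REFGENIE" ]; then
        echo "REFGENIE points to a missing file: $REFGENIE" >&2
        return 1
    fi
}

# Example (commented out so that pasting this snippet runs nothing):
# check_refgenie_config && looper run examples/test_project/test_config_refgenie.yaml
```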

Using manually managed assets

Assets may also be managed manually and specified directly to the pipeline. While this frees you from needing refgenie installed and initialized, it does require a few more arguments to be specified.

Custom blacklisted regions may be specified using --blacklist </path/to/your_blacklist.bed.gz>. The blacklist file only needs to be in BED format. The refgenie blacklist asset uses the ENCODE blacklists by default.

The TSS annotation file may be specified using --TSS-name </path/to/your_TSS_annotations.bed>. This file must also be in BED format.

The feat_annotation asset may also be directly specified using --anno-name </path/to/your_custom_feature_annotations.bed.gz>. Read more about using custom reference data.
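
Since the custom files above are BED files, a basic format check before a run can save a failed job. The sketch below tests only the minimum: at least three tab-separated columns, integer start and end, and start not greater than end. `check_bed` is a hypothetical helper, and the pipeline may impose further requirements that this check does not cover.

```shell
# check_bed: hypothetical helper that reads BED records on standard input.
# Exits non-zero if any record fails the minimal checks described above.
check_bed() {
    awk -F '\t' '
        /^(#|track|browser)/ { next }   # skip comment and header lines
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
            printf "line %d is not a valid BED record\n", NR > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    '
}

# Examples with placeholder paths (commented out):
# check_bed < /path/to/your_TSS_annotations.bed
# gzip -cd /path/to/your_blacklist.bed.gz | check_bed
```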

Example using manually managed assets

Even when not using refgenie, you can still grab premade --chrom-sizes and --genome-index files from the refgenie servers. Refgenie uses algorithmically derived genome digests under the hood to unambiguously define genomes, and those digests appear in the download URLs in the example below. Here, 2230c535660fb4774114bfa966a62f823fdb6d21acf138d4 is the digest for the human-readable alias "hg38", and 94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4 is the digest for "rCRSd".
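
The download URLs in this example share one pattern: server, genome digest, asset name, and tag. The sketch below builds such a URL from its parts. `refgenie_archive_url` is a hypothetical helper that only fills in the pattern used in this example and does not query the server.

```shell
# refgenie_archive_url DIGEST ASSET [TAG]
# Hypothetical helper: prints the archive URL for one asset of one genome.
refgenie_archive_url() {
    digest="$1"
    asset="$2"
    tag="${3:-default}"   # this example always uses the "default" tag
    printf 'http://refgenomes.databio.org/v3/assets/archive/%s/%s?tag=%s\n' "$digest" "$asset" "$tag"
}

hg38_digest=2230c535660fb4774114bfa966a62f823fdb6d21acf138d4
refgenie_archive_url "$hg38_digest" bowtie2_index
```

The printed URL can be passed to wget in quotes, so that the shell does not treat the `?` as a wildcard.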

wget -O hg38.fasta.tgz "http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta?tag=default"
wget -O hg38.bowtie2_index.tgz "http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/bowtie2_index?tag=default"

wget -O rCRSd.fasta.tgz "http://refgenomes.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/fasta?tag=default"
wget -O rCRSd.bowtie2_index.tgz "http://refgenomes.databio.org/v3/assets/archive/94e0d21feb576e6af61cd2a798ad30682ef2428bb7eabbb4/bowtie2_index?tag=default"

Then, extract these files:

tar xvf hg38.fasta.tgz
tar xvf hg38.bowtie2_index.tgz
tar xvf rCRSd.fasta.tgz
tar xvf rCRSd.bowtie2_index.tgz

From the pepatac/ repository folder (using the manually downloaded genome assets):

looper run examples/test_project/test_config.yaml
