Download raw data from SRA for use in PEPATAC

This guide walks you through downloading data from SRA that can go directly into PEPATAC.

1: Install geofetch

To download data from the Sequence Read Archive (SRA), we'll use some convenient companion software called geofetch, which can be installed from PyPI:

pip install --user --upgrade geofetch

2: Install NCBI SRA Toolkit

To use geofetch you'll need to have the NCBI SRA Toolkit installed (see complete SRA Toolkit documentation).

The following will perform a basic installation of the toolkit. For users without root access and for custom installation procedures check out the NCBI SRA toolkit wiki.

mkdir ncbi
cd ncbi/
git clone [email protected]:ncbi/sra-tools.git
git clone https://github.com/ncbi/ngs.git
git clone https://github.com/ncbi/ncbi-vdb.git
cd ngs/
./configure
make -C ngs-sdk
make -C ngs-java
make -C ngs-python
cd ../ncbi-vdb/
./configure
make
make install
cd ../ngs/      
make -C ngs-sdk install
make -C ngs-java install
make -C ngs-python install
cd ../sra-tools/
./configure
make
make install
cd ../../

Make sure you place sra-tools in your PATH.

export PATH="$PATH:/path/to/sra-tools/bin/"

So you only have to do this the first time through, add the updates to PATH to your .bashrc or .profile.

3: Download data

Now that all our requirements for downloading data are set. Let's actually get some ATAC-seq reads.

3.1: Get metadata, configuration files, and .sra files

To automatically download sample metadata and generate configuration files that will allow us to convert the .sra files into .bam files, use the following:

geofetch -i GSE### -m /path/to/metadata/folder -n PROJECT_NAME

3.2: Convert .sra files to .bam

Next we're going to convert those downloaded .sra files using looper. If you haven't installed looper, do that now before moving forward (see looper docs).

Looper requires a few variables and configuration files to work for a specific user. One of those is an environment variable called DIVCFG that points to the looper environment configuration file. For more detailed information regarding this file, check out the looper docs.

Create a compute_config.yaml file and edit this file for your own setup (see looper docs for more information).

Paste the following into compute_config.yaml and save your changes:

compute:
  default:
    submission_template: templates/localhost_template.sub
    submission_command: sh

Create an environment variable that points to this file:

export DIVCFG="/path/to/pepatac_tutorial/compute_config.yaml"

(Remember to add DIVCFG to your .bashrc or .profile to ensure it persists).

The looper environment configuration file points to submission template(s) in order to know how to run a sample or series of samples. You can read more about the DIVCFG configuration file and submission templates here. We're going to simply setup a local template for the purposes of this tutorial. You can also easily create templates for cluster or container use as well!

nano localhost_template.sub

Paste the following into the localhost_template.sub:

#!/bin/bash

echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`

{
{CODE}
} | tee {LOGFILE} --ignore-interrupts

We also need to create additional environment variables to help point looper to where we want to download and convert our .sra files. These variables are part of the configuration file that geofetch produced earlier in the metadata/ folder. You may either set the environment variables or you simply hard code the necessary locations in the configuration file.

Create a PROCESSED variable that represents the location where we want to save output:

export PROCESSED="/path/to/pepatac_tutorial/processed/"

Create a variable representing the location all our tools are stored named CODEBASE:

export CODEBASE="/path/to/pepatac_tutorial/tools/"

Create a variable representing the location we want to save our .sra files called SRARAW:

export SRARAW="/path/to/pepatac_tutorial/data/sra/"

(Add these environment variables to your .bashrc or .profile so you don't have to always do this step). Finally, convert the .sra files!

looper run /path/to/metadata/PROJECT_NAME/PROJECT_NAME_config.yaml \
  --sp sra_convert \
  --lump 10

Fantastic! Now we downloaded and converted a SRA file into .bam, which can go directly into PEPATAC.