ATAC-Seq Analysis


ATAC-Seq analysis overview

In combination with other BD Rhapsody™ assays, optional protocols and products enable the generation of sequencing single-nuclear ATAC-Seq fragments. When enabled, the BD Rhapsody™ Sequence Analysis Pipeline can use the reads from these libraries to evaluate open chromatin regions. The fragments are then analyzed to identify nucleosome-free regions, peak regions, transcription start site (TSS) enrichment, read and fragment counts, and putative cells. Prebuilt reference genomes for ATAC-Seq analysis are available for human and mouse.

Major ATAC-Seq pipeline steps

  1. Read Quality Filter
  2. Annotate I2 Cell Label
  3. Align R1 and R2 Reads
  4. Generate Fragments
  5. Call Peaks
  6. Generate Cell-by-Peak Matrix
  7. Putative Cell Calling
  8. Immune Cell Type Classification (Experimental))
  9. Dimensionality Reduction

Read Quality Filter

The following filtering criteria are applied to the I2, R1 and R2 reads:

Minimum I2 Read LengthMinimum Read 1 LengthMinimum Read 2 LengthMinimum Mean Base QualityI2 Single Nucleotide FrequencyR1 Single Nucleotide FrequencyR2 Single Nucleotide Frequency
433030200.550.80.8

The filtering is performed identically as previously described in the Filtering Criteria for processing RNA reads.

Annotate I2 Cell Label

We utilize the same method as previously described in the Cell label section for processing RNA reads.

The bead cell-label sequence is on the I2, and its read structure is as follows:

Diversity InsertCLS1Linker1CLS2Linker2CLS3UMICapture Sequence
None9bpAATG9bpCCAC9bpNNNNNNNNGTGGAGTCGTGATTATA

The pipeline supports I2 read in either forward or reverse direction.

Align R1 and R2 Reads

Read fragments that pass quality filters and have a valid cell label on I2 have their R1 and R2 read sequences aligned to a reference using bwa-mem2. The bwa-mem2 reference is prebuilt and provided in the Reference_Archive input.

Generate Fragments

The program sinto is used to process the alignment BAM file and identify the ATAC-Seq genomic fragments indicated by the aligned read pairs. Read pairs that indicate the same genomic fragments are treated as duplicate support for one genomic fragment, because duplication is more likely than two identical fragments in a diploid genome. However, sinto also does differentiate identical fragments that have different cell indexes, reporting the fragment separately for each associated cell index.

Call Peaks

In order to maximize the amount of information used by MACS2 when identifying regions of the genome with higher than background levels of transposase activity, a BED file containing each end of the fragments identified by sinto is generated and used as input to MACS2. The narrowPeak output of MACS2 is used in this pipeline, for greater precision.

Generate Cell-by-Peak Matrix

The R package Signac has built-in functionality that takes as input the identified peaks regions and the transposase sites (with associated cell indexes), and calculates the cell-by-peak count matrix. Cell indexes that do not have any associated transposase sites within peak regions are discarded.

Putative Cell Calling

The ATAC-Seq putative cell-calling algorithms are described in the Determine Putative Cells section.

Immune Cell Type Classification (Experimental)

The pipeline will automatically try to classify immune cell types based on the ATAC-Seq data. If the cells of the experiment are not human PBMCs, this result should be ignored. When putative call calling is performed using only ATAC-Seq data, immune cell types are determined using the ATAC-Seq data (see ATAC-Seq Cell Classification below). When putative cell calling is done with mRNA and ATAC-Seq data, cell types are determined by jointly using the two datasets (see Joint Cell Classification below). In both cases, cell classification takes the same approach used in annotating mRNA/Abseq cell types; the approach utilizes a machine learning model trained with human PBMCs and predicts a cell type for each cell index (also described in the TCR and BCR Analysis Cell analysis - type and quality section). However, both ATAC-Seq and joint cell classification need additional preprocessing steps before applying the machine learning model, as detailed below.

ATAC-Seq Cell Classification

A cell-by-peak matrix is converted into a cell-by-gene matrix using the GeneActivity function in Signac with the default parameters. The GeneActivity function counts the number of fragments overlapping with a promoter region and a gene body of each gene. The resulting ATAC-Seq cell-by-gene matrix is used to predict cell types.

Joint Cell Classification

When both mRNA and ATAC-Seq data are used to call putative cells, joint cell type classification is performed using an ATAC-Seq cell-by-gene matrix (described above), and an mRNA cell-by-genes matrix. The common genes are selected from both matrices. The two matrices are normalized to have the same total counts and then summed up to create a joint cell-by-gene matrix. This resulting joint matrix is used to predict cell types.

Dimensionality Reduction

To visualize cell indexes in low dimensional space, dimensionality reduction is performed using t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) methods, resulting in two separate sets of coordinates. Preprocessing steps and tSNE/UMAP calculation are both performed using Seurat and Signac. When putative cell calling is done using only ATAC-Seq data, dimensionality reduction is performed on the ATAC-Seq data (see ATAC-Seq Dimensionality Reduction below). When putative cell calling is done with mRNA and ATAC-Seq data, joint dimensionality reduction is performed on the two datasets (see Joint Dimensionality Reduction) below). For both tSNE and UMAP, when more than 100,000 putative cells exist, the cells are randomly subsampled down to 100,000 cells for the purposes of display on the pipeline report, and coordinates for all putative cells are not generated.

ATAC-Seq Dimensionality Reduction

The Term Frequency-Inverse Document Frequency (TF-IDF) method is applied to an ATAC-Seq cell-by-peak matrix, and peaks with less than 10 transposase sites in peaks are filtered out. Then Singular Value Decomposition (SVD) is performed, and the first SVD component is filtered out, because it usually has a high correlation with the number of transposase sites in peaks and could interfere with biologically-driven dimensionality reduction and clustering (also described in the Signac vignette). tSNE and UMAP are calculated using the second to 30th component of SVD.

Joint Dimensionality Reduction

Joint dimensionality reduction is performed using an ATAC-Seq cell-by-peak matrix and an mRNA cell-by-gene matrix. The ATAC-Seq cell-by-peak matrix follows the same preprocessing steps ((TF-IDF, peak filtering, and SVD) as described above in the ATAC-Seq Dimensionality Reduction section. The mRNA cell-by-gene matrix undergoes sctransform normalization and Principal Component Analysis (PCA). The nearest neighbor of each cell index is determined using the FindMultiModalNeighbors function from Seurat, using the second to 30th component of the SVD and the first 50 components of the PCA result. The resulting nearest-neighbor matrix is used to run tSNE/UMAP to get joint dimensionality reduction.