Determine Putative Cells
The BD Rhapsody™ Sequence Analysis Pipeline provides the following options to determine putative cells:
- Basic Algorithm for mRNA, AbSeq, and ATAC-Seq
- Refined Algorithm for mRNA and AbSeq
- Refined Algorithm for ATAC-Seq
Ideally, the number of unique cell labels detected by the BD Rhapsody™ Sequence Analysis Pipeline should be similar to the number of cells captured and amplified by the BD Rhapsody™ workflow. However, various processes throughout the workflow can introduce noise that contributes to extra cell labels seen during sequencing analysis, including:
- Hybridization of polyadenylated (polyA) oligonucleotides to non-cell beads residing in neighboring wells when the cell lysis step is too long
- Under-loading beads in BD Rhapsody™ cartridges, resulting in cells in a well without a bead, and thus the RNA from the cells diffusing to adjacent wells
- Low-level oligo contamination during bead synthesis
- Errors generated during the PCR amplification steps of the workflow
To distinguish cell labels associated with true putative cells from those associated with noise, the basic cell calling algorithm is used by default for all assays. When improved sensitivity or specificity of detecting putative cells is desired, refined algorithms are available for mRNA/AbSeq and ATAC-Seq.
Basic algorithm for putative cell identification using second derivative analysis
The principle of the basic cell calling algorithm is that cell labels from actual cell capture events should have many more reads associated with them than noise cell labels. For mRNA/Abseq data, the basic algorithm is performed using read counts from mRNA molecules (by default) or from Abseq. All reads associated with RSEC adjusted molecules from the selected bioproduct type (mRNA by default) are taken into account. For ATAC-Seq, the basic algorithm is performed using the number of transposase sites in peaks.
Demonstrated in the figure below, the number of mRNA/AbSeq reads of each cell is plotted on a log10-transformed cumulative curve, with cells sorted by the number of reads in descending order (left image). Additionally, the rate of change of the cumulative count is calculated with the second derivative (right image). In a typical experiment, a distinct inflection point is observed in the second derivative, indicating a division between signal cell labels and noise cell labels (red vertical line). The algorithm finds the minimum second derivative along the cumulative read curve as the inflection point. Cell labels to the left of the red vertical line are most likely derived from a cell capture event and are considered as signal. The remaining cell labels to the right of the red line are noise. This is the basic algorithm result and is the default cell calling algorithm.
If every cell in the sample is well represented by molecules from library preparation, there is only one inflection point. The number of reads of the putative cells is a single distribution well separated from the noise distribution.
There are situations, however, when a sample contains cells with a very wide range of number of molecules. If sub-populations of cells with high and low mRNA content are considerably large, multiple inflection points can be observed. Example scenarios include biological samples such as peripheral blood mononuclear cells (PBMCs) with myeloid cells being much larger and active carrying thousands of molecules compared to lymphocytes being smaller and less active carrying tens of molecules (A. below), or artificial mixtures of cell line cells and primary cells (B. below). Targeted panels with an uneven representation of genes expressed in different cell types of a sample can also lead to multiple inflection points.
Inflection points are considered valid if:
- The second derivative minimum corresponding to the inflection point is at least half as deep as the global minimum and is ≤–0.3
- The inflection point is within a dynamically determined range of number of putative cells. The lower threshold is
fixed at 25 cells. The upper threshold on the number of putative cells is:
- For V1 and Enhanced beads: 25% of the number of total cell labels with read counts >=10
- For Enhanced V2/V3 beads: 40% of the number of total cell labels with read counts >= 10
- The upper threshold will not go below 50,000 cells
The smoothing window of the second derivative curve increases until there are two valid inflection points. By default,
the valid inflection point corresponding to the larger cell number is deemed the better one. When the
Expected_Cell_Count
pipeline parameter is set, the selected inflection point is whichever one is nearest to the
expected cell count value. Usually this can be set to the number of cells loaded into the BD Rhapsody™ cartridge.
Refined algorithm for adjusting putative cell counts for mRNA and AbSeq
In some cases, the basic implementation of the second derivative analysis might include small numbers of false positive and false negative cell labels. If the refined algorithm is selected, additional refinement steps are run to attempt to identify these false positive and false negative cell labels.
Removing false positives
Consider the case where the chosen inflection point includes the populations of cell labels with wide ranges of number of reads per cell label. Then, the signal population with lower reads per cell label might also include noise cell labels derived from residual mRNA molecules from the cells with very high mRNA content. The number of reads associated with these noise cell labels derived from high-expressing cells can be indistinguishable from low-expressing cells, which have similar reads per cell.
Since these false positive cells can be hard to identify with reads alone, the relative expression profile of cell labels can be used to identify them. For example, a false positive cell label that is derived from a high mRNA-expressing, true positive cell label would likely have a similar expression profile but with a lower read signal. Therefore, a second derivative analysis is done on the most variable genes to identify these false positive cell labels.
The most variable gene expression is defined by a process similar to that described by Macosko, EZ, et al. (see References) :
-
Log-transform read counts of each gene within each cell to get the gene expression: log10 (count + 1).
-
Calculate the mean expression and dispersion (defined as variance/mean) for each gene.
-
Place genes into 20 bins based on their average expression.
-
Within each bin, calculate the mean and standard deviation of the dispersion measure of all genes, and then calculate the normalized dispersion measure of each gene using the following equation:
Normalized dispersion = (dispersion – mean)/(standard deviation)
-
Apply a cutoff value for the normalized dispersion to identify genes for which expression values are highly variable even when compared to genes with similar average expression.
A second derivative analysis is applied on variable gene sets defined by a different cutoff value for the normalized dispersion to derive the cell label filtered set B. For each dispersion cutoff, the noise cell labels are determined as A–B. For instance, for three cutoff values, noise cell labels are N1 = A–B1, N2 = A–B2, and N3 = A–B3, where the minus sign represents the set difference. The common noise cell labels detected among N1, N2, and N3 are subtracted from cell labels set A (those identified by the basic algorithm). The resultant set is denoted as cell label filtered set C = A – intersection(N1, N2, N3).
Recovering false negatives
Cells with low numbers of molecules might be missed by the basic implementation of the second derivative analysis algorithm, because a cell subset might express very few of the genes in the (Targeted assay) gene list. The cell labels carry a very low number of reads, and the size of the cell population is small enough that their cell labels do not form a distinct second inflection point. These cell labels might be mistaken as noise.
If there are genes specific to the false negative cell label subset (for example, marker genes), they can be identified by comparing the number of reads for each gene from all detected cell labels to those from cell labels deemed as signal. The assumption is that the relative abundance of reads for each gene from all of the noise cell labels should be no different than that from all of the cell labels considered as signal. If a specific cell subset is missed initially, there is a set of genes that appears as enriched in the noise cell labels in the basic implementation.
Left and right images below: Detecting genes enriched in noise as determined by the basic implementation of the second derivative analysis. Each dot represents a gene. This enriched set of genes is detected by the following steps:
- For each gene, calculate the total read counts from all detected cell labels and from cell labels in set C.
- Identify the genes that have the biggest discrepancy in representation by cell labels in set C versus all cell labels. This is done by plotting and finding the line of best fit to detect the genes with the largest residuals at least one standard deviation away from the median of residuals of all genes.
The two red dashed lines correspond to one standard deviation above and below the median (red solid line). In this example, 53 genes are enriched in the noise population.
The second derivative analysis algorithm is run again with this enriched set of genes. The recovered cell labels (cell label filtered set D) are combined with cell labels in set C to form set E. As a final cleanup step, cell labels carrying less than the dynamically determined minimum threshold number of molecules are removed. The number of cell labels in the final set is the number of putative cells from the refined algorithm.
Refined algorithm for adjusting putative cell counts for ATAC-Seq
In some cases, the basic implementation of the second derivative analysis might include small numbers of false positive and false negative cell labels. If the data quality is not ideal, it may be difficult to separate the signal from the noise using only the number of transposase sites in peaks. If the ATAC-Seq refined algorithm is selected, a Guassian Mixture Model (GMM) approach is used to determine putative cells using two metrics: the number of transposase sites in peaks and the fraction of transposase sites in peaks.
The ATAC-Seq refined algorithm has five main steps:
- Filter out low-quality cell labels using the fraction of transposase sites in peaks
- Filter out low-quality cell labels using the number of transposase sites in peaks
- Identify a putative cell cluster using the number of transposase sites in peaks - the first GMM
- Identify a putative cell cluster using the number of transposase sites in peaks and fraction of transposase sites in peaks - the second GMM
- Refine the boundary of the putative cell cluster
The first two steps filter out low-quality cells, the third step identifies a putative cell cluster using the number of transposase sites in peaks (the first GMM), the fourth step identifies a putative cell cluster using both the number of transposase sites in peaks and the fraction of transposase sites in peaks (the second GMM), and the last step refines the boundary of the putative cell cluster. The first two steps filter out low-quality cell labels using the fraction of transposase sites in peaks and the number of transposase sites in peaks, respectively. The filtering in steps 1 and 2 improve the fit of the first and second GMMs in steps 3 and 4 where the boundary of the putative cell cluster is determined. The last step further refines the boundary of the putative cell cluster, either by recovering false negative cell labels or by filtering out false positive cell labels.
Filter out low-quality cell labels using the fraction of transposase sites in peaks
The cell labels with a low fraction of transposase sites in peaks show low specificity in targeting open chromatin sites. Those cell labels might have been generated due to several reasons such as inefficient transposase activity or non-ideal cell conditions. The threshold for the fraction of transposase sites in peaks is 0.1. Any cell labels with a fraction of transposase sites in peaks below 0.1 are filtered out.
Filter out low-quality cell labels using the number of transposase sites in peaks
The cell labels with a low number of transposase sites in peaks show low sensitivity in the assay. The threshold for the minimum number of transposase sites in peaks is dynamically selected for each dataset as follows. The bottom and top 1% of the log10-transformed number of transposase sites in peaks are calculated and used as the lower and upper bounds of the dynamic threshold testing range. If the lower bound value is less than 1, meaning less than 10 transposase sites in peaks, the lower bound is set to 1. This adjustment is to avoid selecting an extremely small value for the threshold and to ensure that cell labels with low sensitivity are filtered out. Each possible threshold from the lower bound to the upper bound at a 0.1 interval is tested using the GMM approach. For each tested threshold, the cell labels having transposase sites in peaks below the threshold are filtered out, and the remaining cell labels are fitted into two clusters using the GMM approach. The difference of the GMM weights of the two clusters is calculated by subtracting the weight of the putative cell cluster (the cluster with a higher 90% confidence interval) from the weight of the non-cell cluster. If the 90% confidence intervals of the two clusters completely overlap, the tested threshold is considered invalid and is not selected for the final number of transposase sites in peaks threshold. After weight differences of all tested threshold are calculated, a stability score is calculated for each tested threshold by counting the number of times the weight difference value decreases in a row after that threshold.
For good-quality data (left figure below), the weight difference usually decreases as the tested threshold value increases, because more cell labels from the non-cell cluster are filtered out. Then, the weight difference starts to increase when the majority of non-cell labels are filtered out and therefore the GMM starts to detect two sub-clusters within the putative cell cluster. This trend results in the tested threshold with the highest weight difference having the highest stability score. The figures below show an ideal GMM fit (right figure) when low-quality cells are properly filtered out using the final number of transposase sites in peaks threshold determined (left figure, red vertical line).
However, low-quality data with lots of noisy cell labels shows a different trend: as the tested threshold value of the number of transposase sites in peaks increases, the weight difference initially increases and then later follows the trend of good-quality data (left figure below). That initial increase is due to the abundance of noisy cell labels which causes the GMM to miss the putative cell cluster and instead identify two clusters in the non-cell labels, as shown in the right figure below. However, after the majority of non-cell labels are filtered out, the GMM starts to detect a putative cell cluster, causing the weight difference to plateau and then decrease. When the majority of non-cell labels are filtered out, the weight difference increases, similar to the good-quality data shown above. In this case, as shown for the good-quality data above, the tested threshold with the highest weight difference has the highest stability score.
In most cases, the tested threshold with the highest weight difference has the highest stability score (first row of figures below; the red line for the final threshold using the stability score). However, if data is too sparse, then the tested threshold with the highest weight difference may not have the highest stability score. With sparse data, the weight difference of a non-cell cluster and a cell cluster in an ideal GMM fit is not as significant as with non-sparse data. Thus, using the weight difference alone does not provide a stable measure of detecting a whole non-cell cluster. In this case, using the stability score results in a more robust selection of the threshold of transposase sites in peaks (second row of figures below; the red line for the final threshold using the stability score and the gray dotted line for the threshold chosen using weight difference).
With the reasoning above, among the valid thresholds, the threshold with the highest stability score is selected for the threshold of the number of transposase sites in peaks (red vertical line in images above) and used to filter out the low-quality cell labels. If there are multiple tested thresholds with the same stability score, the threshold with the highest weight difference is chosen. The rationale is that a high stability score of a tested threshold means the weight of the non-cell label cluster decreases sequentially as the threshold value increases, and that suggests a reliable removal of non-cell labels when higher thresholds are tested. That means the threshold with the highest stability score reliably detects the non-cell label cluster as a whole. Keeping a whole non-cell label cluster (excluding low-quality cells) is ideal. Losing many non-cell labels could lead to undesirable GMM fitting in the downstream steps where both clusters could be fit to the putative cell cluster rather than fitting one cluster to the putative cell cluster and the other to the non-cell clusters. The non-ideal GMM fit could result in false negative cell labels that are missed. The weight difference is also a rough estimate that shows when both a non-cell cluster and a putative cell cluster are detectable without losing many non-cell labels. If the weight difference is small, many non-cell labels may be filtered out. If the weight difference is large, then the low-quality cell labels are filtered out while preserving many non-cell labels, leading to the desirable GMM fit in the downstream steps. When there are multiple tested thresholds with the same highest stability score and the same largest weight difference, the largest threshold is chosen.
Identify a putative cell cluster using the number of transposase sites in peaks - the first GMM
After filtering out low-quality cell labels, the first GMM is performed on all remaining cell labels using the number of transposase sites in peaks. Two clusters are identified (a putative cell cluster and a non-cell cluster), and a cluster assignment is determined for each cell label. When the putative cell cluster and the non-cell cluster do not have good separation, the cell labels at the lower end of the number of transposase sites in peaks are sometimes assigned to the putative cell cluster. This happens due to a high variance estimated for the putative cell cluster. To better define a putative cell cluster using the number of transposase sites in peaks, the maximum value of the number of transposase sites in peaks from the non-cell cluster is calculated and used as the threshold to identify the putative cell cluster. The cell labels in the putative cell cluster proceed to the next step, the second GMM.
Identify a putative cell cluster using the number of transposase sites in peaks and fraction of transposase sites in peaks - the second GMM
The cell labels in the putative cell cluster from the first GMM are evaluated using the second GMM, using both the number of transposase sites in peaks and fraction of transposase sites in peaks. The goal of the second GMM is to better define the boundary of the putative cell cluster in the two-dimensional space. The 90% confidence intervals (ellipses) of the two GMM clusters are determined and checked if they overlap. If the two confidence intervals overlap (left figure below), that indicates a lack of evidence that two distinct clusters exist and thus the second GMM result is ignored. In contrast, when the two intervals do not overlap (right figure below), that suggests that two distinct clusters exist and the boundary of the putative cell cluster is redefined; the cluster with a higher mean value for the number of transposase sites in peaks is considered a redefined putative cell cluster, and the cell labels assigned to that redefined putative cell cluster are identified as putative cells.
Refine the boundary of the putative cell cluster
This final refitting step is aimed at improving the boundary identification of the putative cell cluster, either by removing false positives or recovering false negatives. If the previous second GMM result is ignored, the putative cell calling results could contain a small amount of false positives. If the previous second GMM result is applied, the putative cell calling results could be missing a small amount of false negatives. To further refine the boundary of the putative cell cluster in the two-dimensional space, the non-cell labels and the putative cell labels are fitted into two separate two-dimensional Gaussian distributions. Here, the non-cell labels are the cell labels filtered out from the first and second GMM, not including low-quality cell labels. Using the two resulting Gaussian distributions, a log likelihood ratio of the Gaussian distribution from the putative cell labels to the Gaussian distribution from the non-cell labels is calculated for each cell label. The threshold of the log likelihood ratio is dynamically determined for each dataset using the Basic Cell Calling Algorithm, but instead of using reads or transposase sites in peaks, the log likelihood ratios are used. Also, here, the half depth criteria is not used to select the inflection point. The figure below shows the second derivative curve (left) and the histogram of log likelihood ratios (right) with the red line being the selected threshold from the basic algorithm.
When the result from the second GMM in the previous step is ignored, the cell calling results could contain a small amount of false positive cell labels. The false positive cell labels are defined as cell labels having a log likelihood ratio less than the dynamically determined threshold from above, and are filtered out in this step. The figure below shows the putative cell and non-cell clusters before the false positives are removed (left) and after (right).
In contrast, when the result from the second GMM in the previous step is used to filter out non-cell labels, the cell calling results could be missing a small amout of false negative cell labels. The false negative cell labels are defined as cell labels having a log likelihood ratio greater than or equal to the dynamically determined threshold from above, and are recovered in this step. The figure below shows the putative cell and non-cell clusters before the false negatives are recovered (left) and after (right).
This step improves the robustness of the putative cell calling result, especially when the result from the second GMM in the previous step had a minimal overlap or a minimal gap between the two confidence intervals. This improved robustness comes from the log odds ratio threshold being dynamically chosen. If there was a minimal overlap in the confidence intervals from the second GMM, then the second GMM result was ignored, and the log likelihood ratio threshold chosen here would remove false positive cell labels. If there was a minimal gap between the two confidence intervals from the second GMM, then the second GMM result was used to filter out non-cell labels, and the log likelihood ratio threshold chosen here would recover false negative cell labels.
Identify Protein Aggregates from AbSeq Read Counts
When identifying putative cells using the AbSeq read counts, the basic implementation may include a small number of
false positive cell labels due to protein aggregates. Putative cells identified with high expression across most AbSeq
targets are considered protein aggregates. The protein aggregate status for each putative cell can be found in the
<sample_name>_Protein_Aggregates_Experimental.csv
file. The cell label is marked True if it is considered a potential
protein aggregate and False if not.