Read Quality Filter


Read overlap detection

First, read 1 and read 2 are tested for overlap so that read 1 content can be removed from read 2. This prevents downstream mis-alignment and mis-assembly caused by cell label sequences present in read 2. An overlap detection percent metric is also calculated, which may help troubleshoot the PCR cleanup and library preparation steps. The overlap step does not remove any read pairs from subsequent steps.

Read 1 artifacts are removed from read 2 with the following steps:

  • Read 1 and read 2 are compared with a modified Knuth-Morris-Pratt substring search algorithm that allows for a variable number of mismatches. By default, the maximum mismatch rate is 9% and the minimum overlap length is 25 bases. Read 1 is scanned right to left against the reverse complement of read 2. The offset closest to the end of the reverse complement of read 2 that has the lowest number of mismatches (below the maximum mismatch rate threshold) is taken as the best-fit overlap.
  • The merged read is then split back into a read pair according to the bead-specific R1 minimum length (described in Annotate cell label and UMI). The bases from the beginning of the merged read up to the R1 minimum length plus the length of the bead capture sequence are assigned to read 1, and the remaining bases are assigned to read 2.
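
The two steps above can be illustrated with a minimal Python sketch. It is not the pipeline's implementation. It replaces the modified Knuth-Morris-Pratt search with a brute-force scan, and it assumes that the overlap is between the 3' end of read 1 and the start of the reverse complement of read 2. The function names, the tie-breaking rule, the way the merged read is built, and the `capture_length` parameter are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: brute-force overlap search, not the modified KMP search.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_overlap(read1, read2, min_overlap=25, max_mismatch_rate=0.09):
    """Return (overlap_length, mismatches) for the best overlap, or None.

    Assumption: the overlap is the suffix of read 1 aligned against the
    prefix of the reverse complement of read 2.
    """
    rc2 = reverse_complement(read2)
    best = None
    for length in range(min_overlap, min(len(read1), len(rc2)) + 1):
        mismatches = sum(a != b for a, b in zip(read1[-length:], rc2[:length]))
        if mismatches > max_mismatch_rate * length:
            continue  # above the maximum mismatch rate
        # Assumed tie-break: keep the first candidate with the fewest mismatches.
        if best is None or mismatches < best[1]:
            best = (length, mismatches)
    return best


def split_merged_read(merged, r1_min_length, capture_length):
    """Split a merged read at R1 minimum length plus capture sequence length.

    Both parts are returned in the orientation of the merged read; whether
    read 2 is reverse-complemented again afterwards is not covered here.
    """
    cut = r1_min_length + capture_length
    return merged[:cut], merged[cut:]
```

In this sketch, a merged read could be formed as `read1 + reverse_complement(read2)[overlap_length:]` before calling `split_merged_read`; this construction is also an assumption.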

Read trimming

Then, read 1 and read 2 are trimmed in these ways:

Read 1:

  1. Remove sequence longer than necessary for cell label and UMI identification (length kept depends on bead version)
  2. Remove bead sequence - 5' TCR/BCR primer: ACAGGAAACTCATGGTGCGT

Read 2:

  1. Trim poor-quality bases from the 3' end where quality scores fall below 20, using the BWA quality trimming algorithm
  2. Remove bead sequence - 5' TCR/BCR primer: ACAGGAAACTCATGGTGCGT and template switch oligo (TSO): TATGCGTAGTAGGTA or GTGGAGTCGTGATTATA
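
For reference, the BWA quality trimming algorithm used in the first read 2 step can be sketched as follows. This is a minimal Python rendering of the published BWA approach, not the pipeline's own code. The function name is hypothetical, quality scores are assumed to be already decoded to integer Phred values, and BWA's minimum retained read length is omitted.

```python
def bwa_trim_length(quals, threshold=20):
    """Return the number of bases to keep from the 5' end of a read.

    Walking from the 3' end, accumulate (threshold - quality) and keep the
    position where this running sum is largest. Stop as soon as the sum
    becomes negative, because earlier bases are then good enough to keep.
    """
    running = 0
    best = 0
    keep = len(quals)  # default: nothing is trimmed
    for i in range(len(quals) - 1, -1, -1):
        running += threshold - quals[i]
        if running < 0:
            break
        if running > best:
            best = running
            keep = i
    return keep


# Usage: trim a sequence and its qualities to the same length.
seq = "ACGTA"
quals = [30, 30, 30, 10, 10]
n = bwa_trim_length(quals)
trimmed_seq, trimmed_quals = seq[:n], quals[:n]
```

Because the running sum is used rather than a per-base cutoff, a single good base surrounded by poor bases near the 3' end is still trimmed.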

Filtering criteria

Finally, the following filtering criteria are applied to each read pair:

| Bead | Minimum Read 1 Length | Minimum Read 2 Length | Minimum Mean Base Quality | R1 Single Nucleotide Frequency | R2 Single Nucleotide Frequency |
|---|---|---|---|---|---|
| Original V1 | 60 | 40 | 20 | 0.55 | 0.8 |
| Enhanced 3' | 43 | 40 | 20 | 0.55 | 0.8 |
| Enhanced TCR/BCR | 63 | 40 | 20 | 0.55 | 0.8 |

  • Read length: If the length of the R1 read is less than the bead specific R1 minimum length (described in Annotate cell label and UMI) or the R2 read is <40 bp, the R1/R2 read pair is dropped.
  • Mean base quality score of the read: If the mean base quality score of either the R1 read or the R2 read is <20, the read pair is dropped.
  • Highest Single Nucleotide Frequency (SNF) observed across the bases of the read: If the SNF is ≥0.55 for the R1 read or the SNF is ≥0.80 for the R2 read, the read pair is dropped. This criterion removes reads with low complexity such as strings of identical bases and tandem repeats.

The thresholds for each filter are determined empirically.

Read pairs are tested against the filters in the following order: read length, single nucleotide frequency, and mean base quality. A read pair that fails one filter is removed and is not tested against subsequent filters.
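
The filtering criteria and their order can be summarized in a short Python sketch. It is a hedged illustration rather than the pipeline's code. The function and constant names are hypothetical, quality scores are assumed to be integer Phred values, and SNF is interpreted here as the fraction of the read made up of its most common base.

```python
from collections import Counter

# Bead-specific minimum R1 lengths, taken from the table in this section.
R1_MIN_LENGTH = {"Original V1": 60, "Enhanced 3'": 43, "Enhanced TCR/BCR": 63}
R2_MIN_LENGTH = 40
MIN_MEAN_QUALITY = 20
R1_MAX_SNF = 0.55
R2_MAX_SNF = 0.80


def single_nucleotide_frequency(seq):
    """Fraction of the read made up of its most common base (assumed definition)."""
    return max(Counter(seq).values()) / len(seq) if seq else 0.0


def failed_filter(r1_seq, r1_quals, r2_seq, r2_quals, bead):
    """Return the name of the first filter the read pair fails, or None if it passes."""
    # 1. Read length
    if len(r1_seq) < R1_MIN_LENGTH[bead] or len(r2_seq) < R2_MIN_LENGTH:
        return "read_length"
    # 2. Single nucleotide frequency
    if (single_nucleotide_frequency(r1_seq) >= R1_MAX_SNF
            or single_nucleotide_frequency(r2_seq) >= R2_MAX_SNF):
        return "single_nucleotide_frequency"
    # 3. Mean base quality of either read
    if (sum(r1_quals) / len(r1_quals) < MIN_MEAN_QUALITY
            or sum(r2_quals) / len(r2_quals) < MIN_MEAN_QUALITY):
        return "mean_base_quality"
    return None
```

Returning the name of the first failed filter mirrors the rule that a read pair removed by one filter is not tested against the later ones.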