Annotate R1 Cell Label and UMI
R1 structure
The quality-filtered R1 reads are analyzed to identify the cell label sequences (CLS), common linker sequences (L), and Unique Molecular Identifier (UMI) sequence.
Read 1 structure by bead version:
Bead | TCR/BCR Handle | Diversity Insert | CLS1 | Linker1 | CLS2 | Linker2 | CLS3 | UMI | Capture Sequence |
---|---|---|---|---|---|---|---|---|---|
Original V1 | None | None | 9bp | ACTGGCCTGCGA | 9bp | GGTAGCGGTGACA | 9bp | NNNNNNNN | 18 dT |
Enhanced 3' | None | None, A, GT, or TCA | 9bp | GTGA | 9bp | GACA | 9bp | NNNNNNNN | 25 dT |
Enhanced TCR/BCR | ACAGGAAACTCATGGTGCGT | None | 9bp | AATG | 9bp | CCAC | 9bp | NNNNNNNN | TATGCGTAGTAGGTATG or GTGGAGTCGTGATTATA |
Cell label
Cell label information is captured by bases in three sections (CLS1, CLS2, CLS3) along each R1 read. Two common linker sequences (L1, L2) separate the three CLSs; their presence reflects how the capture oligonucleotide probes on the beads are constructed. By design, each CLS takes one of either 96 or 384 predefined sequences (depending on bead version), and any two predefined sequences differ by a Hamming distance of at least four bases and an edit distance of at least two bases. A cell label is defined by the unique combination of predefined sequences in the three CLSs. Thus, the maximum possible number of cell labels is either 96³ (884,736) or 384³ (56,623,104). In the final data tables, the three-part cell label is converted to a single integer index between 1 and 384³.
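The conversion from a three-part label to a single integer can be sketched as a base-384 positional encoding. This is an illustrative assumption only: the text above gives the index range but not the formula, so the ordering (CLS1 as the most significant part) and the function names `cell_label_to_index` and `index_to_cell_label` are hypothetical.

```python
def cell_label_to_index(i1, i2, i3, n_per_section=384):
    """Map three 1-based CLS indices to one 1-based integer.

    Assumption: CLS1 is the most significant "digit". The real
    pipeline may order or offset the parts differently.
    """
    for i in (i1, i2, i3):
        if not 1 <= i <= n_per_section:
            raise ValueError(f"section index {i} out of range 1..{n_per_section}")
    return (i1 - 1) * n_per_section ** 2 + (i2 - 1) * n_per_section + i3


def index_to_cell_label(index, n_per_section=384):
    """Invert cell_label_to_index under the same assumption."""
    if not 1 <= index <= n_per_section ** 3:
        raise ValueError("index out of range")
    rest, r3 = divmod(index - 1, n_per_section)
    r1, r2 = divmod(rest, n_per_section)
    return r1 + 1, r2 + 1, r3 + 1
```

Under this scheme the smallest label (1, 1, 1) maps to 1 and the largest (384, 384, 384) maps to 384³, which matches the stated index range.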
Reads are first checked for perfect matches in all three pre-designed CLS sequences at the expected locations, and reads with perfect matches are kept.
The remaining reads are subjected to another round of filtering to recover reads with base substitutions, insertions, and deletions caused by sequencing errors, PCR errors, or errors in oligonucleotide synthesis.
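The two-pass idea above can be illustrated for a single CLS section. This is a minimal sketch, not the pipeline's implementation: it recovers single-base substitutions only (no insertions or deletions), the function `match_cls` is hypothetical, and the three whitelist sequences are made-up placeholders rather than real bead sequences.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def match_cls(observed, whitelist, max_mismatches=1):
    """Return the 1-based whitelist index for an observed CLS, or None.

    Pass 1: exact lookup. Pass 2: accept a whitelist entry within
    max_mismatches substitutions, but only if exactly one qualifies.
    """
    lookup = {seq: i for i, seq in enumerate(whitelist, start=1)}
    if observed in lookup:
        return lookup[observed]
    hits = [
        i for seq, i in lookup.items()
        if len(seq) == len(observed) and hamming(observed, seq) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical 9 bp whitelist (placeholders, not real CLS sequences).
toy_whitelist = ["ACGTACGTA", "TGCATGCAT", "GGGGCCCCA"]
```

Because the predefined sequences are at least four substitutions apart, a read with one substitution lies within distance one of at most one whitelist entry, so the second pass is unambiguous in this setting.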
UMI
By design, the UMI is a string of eight random bases immediately downstream of CLS3. For reads with insertions or deletions within the CLSs, the UMI sequence is taken as the eight bases immediately following the end of the identified CLS3.
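As a rough illustration, the UMI position can be derived from the segment lengths in the Read 1 structure table. The sketch below assumes the Original V1 layout (no diversity insert, 12 bp and 13 bp linkers); the helper names `split_r1_v1` and `umi_after_cls3` and the example read are hypothetical.

```python
V1_LINKER1 = "ACTGGCCTGCGA"   # 12 bp, from the structure table
V1_LINKER2 = "GGTAGCGGTGACA"  # 13 bp, from the structure table
CLS_LEN = 9
UMI_LEN = 8


def split_r1_v1(read):
    """Slice an indel-free Original V1 R1 read into its named segments."""
    layout = (
        ("cls1", CLS_LEN), ("linker1", len(V1_LINKER1)),
        ("cls2", CLS_LEN), ("linker2", len(V1_LINKER2)),
        ("cls3", CLS_LEN), ("umi", UMI_LEN),
    )
    parts, pos = {}, 0
    for name, length in layout:
        parts[name] = read[pos:pos + length]
        pos += length
    return parts


def umi_after_cls3(read, cls3_end):
    """Take the eight bases following the identified end of CLS3.

    cls3_end is the 0-based position just past CLS3; it shifts when
    an insertion or deletion occurs upstream. Returns None if the
    read is too short to hold a full UMI.
    """
    umi = read[cls3_end:cls3_end + UMI_LEN]
    return umi if len(umi) == UMI_LEN else None


# Made-up read: placeholder CLSs, real V1 linkers, then UMI and poly(dT).
example_read = ("AAAAAAAAA" + V1_LINKER1 + "CCCCCCCCC" + V1_LINKER2
                + "GGGGGGGGG" + "ACGTACGT" + "T" * 18)
parts = split_r1_v1(example_read)
```

For an indel-free V1 read, CLS3 ends at position 52, so the UMI occupies bases 52 to 59 (0-based). With a one-base deletion inside CLS3, the same rule applied at position 51 still recovers the full UMI.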
Cell label sequences and utility functions
Cell label structure, cell label sequences, bead sequences, and Python utility functions are available for download here:
The single integer cell index and corresponding assembled R1 bead sequence are available in FASTA format for each bead version:
- Original V1 beads: Rhapsody_cellBarcodeV1_IndexToSequence.fasta.zip
- Enhanced beads: Rhapsody_cellBarcodeEnh_IndexToSequence.fasta.zip
- Enhanced V2/V3 beads: Rhapsody_cellBarcodeEnhV2_IndexToSequence.fasta.zip