Annotate R1 Cell Label and UMI


R1 structure

The quality-filtered R1 reads are analyzed to identify the cell label sequences (CLS), common linker sequences (L), and Unique Molecular Identifier (UMI) sequence.

Read 1 structure by bead version:

Bead | TCR/BCR Handle | Diversity Insert | CLS1 | Linker1 | CLS2 | Linker2 | CLS3 | UMI | Capture Sequence
Original V1 | None | None | 9bp | ACTGGCCTGCGA | 9bp | GGTAGCGGTGACA | 9bp | NNNNNNNN | 18 dT
Enhanced 3' | None | None, A, GT, or TCA | 9bp | GTGA | 9bp | GACA | 9bp | NNNNNNNN | 25 dT
Enhanced TCR/BCR | ACAGGAAACTCATGGTGCGT | None | 9bp | AATG | 9bp | CCAC | 9bp | NNNNNNNN | TATGCGTAGTAGGTATG or GTGGAGTCGTGATTATA
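
As a rough illustration of the layout above, the following sketch slices an Original V1 read into its segments using fixed offsets (9 bp CLSs, the 12 bp and 13 bp linkers, and the 8 bp UMI). The function name `split_v1_read`, the returned dictionary, and the example read are hypothetical and are not taken from rhapsody_cell_label.py; the sketch also assumes the read contains no insertions or deletions.

```python
# Segment definitions for the Original V1 bead, read off the R1 structure table.
V1_LINKER1 = "ACTGGCCTGCGA"   # 12 bases
V1_LINKER2 = "GGTAGCGGTGACA"  # 13 bases
CLS_LEN = 9
UMI_LEN = 8


def split_v1_read(r1):
    """Slice an indel-free Original V1 R1 read into named segments.

    Hypothetical helper for illustration only.
    """
    layout = [
        ("CLS1", CLS_LEN),
        ("Linker1", len(V1_LINKER1)),
        ("CLS2", CLS_LEN),
        ("Linker2", len(V1_LINKER2)),
        ("CLS3", CLS_LEN),
        ("UMI", UMI_LEN),
    ]
    segments = {}
    pos = 0
    for name, length in layout:
        segments[name] = r1[pos:pos + length]
        pos += length
    # Whatever follows the UMI is the capture sequence (poly-dT for V1).
    segments["Capture"] = r1[pos:]
    return segments


# Made-up read assembled from placeholder CLSs, the real linkers, and a toy UMI.
example = (
    "AAAAAAAAA" + V1_LINKER1 + "CCCCCCCCC" + V1_LINKER2
    + "GGGGGGGGG" + "ACGTACGT" + "T" * 18
)
parts = split_v1_read(example)
```

For the other bead versions the same idea would apply with their shorter linkers, but the Enhanced 3' bead's variable diversity insert (0 to 3 bases) shifts every downstream offset, so fixed slicing alone is not sufficient there.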

Cell label

The cell label is encoded by three sections (CLS1, CLS2, CLS3) along each R1 read. Two common linker sequences (L1, L2) separate the three CLSs; their presence reflects the way the capture oligonucleotide probes on the beads are constructed. By design, each CLS takes one of either 96 or 384 predefined sequences (depending on bead version), and within each set any two sequences are separated by a Hamming distance of at least four bases and an edit distance of at least two. A cell label is defined by the unique combination of predefined sequences in the three CLSs. Thus, the maximum possible number of cell labels is either 96^3 (884,736) or 384^3 (56,623,104). In the final data tables, the three-part cell label is converted to a single integer index between 1 and 384^3.
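
The arithmetic in this paragraph can be sketched as follows. The Hamming distance function is standard, but `combined_index` is only one plausible way to fold three 1-based section indices into a single 1-based integer (a mixed-radix scheme). The actual mapping used by the pipeline is not described here, so treat that function and its name as assumptions.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def max_cell_labels(n_per_section):
    """Three independent sections, each with n_per_section choices."""
    return n_per_section ** 3


def combined_index(i1, i2, i3, n_per_section):
    """Hypothetical mixed-radix mapping of 1-based section indices
    (i1, i2, i3) to a single 1-based cell index. Assumed scheme only."""
    return (i1 - 1) * n_per_section ** 2 + (i2 - 1) * n_per_section + i3


print(max_cell_labels(96))    # 884736
print(max_cell_labels(384))   # 56623104
print(combined_index(384, 384, 384, 384))  # 56623104, the largest index
```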

Reads are first checked for perfect matches to the predefined sequences in all three CLSs at the expected locations; reads with perfect matches are kept.

The remaining reads are subjected to another round of filtering to recover reads with base substitutions, insertions, and deletions caused by sequencing errors, PCR errors, or errors in oligonucleotide synthesis.
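
A minimal sketch of this two-pass idea for a single CLS is shown below. It covers substitutions only: because the predefined sequences are at least four bases apart in Hamming distance, an observed sequence with one substitution can be within distance one of at most one predefined sequence. Recovering insertions and deletions would need an edit-distance comparison and shifted offsets, which this sketch does not attempt. The function `match_cls` and the toy whitelist are hypothetical; the pipeline's actual recovery algorithm is not specified here.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def match_cls(observed, whitelist):
    """Return the predefined sequence matching `observed`, or None.

    Hypothetical helper: pass 1 accepts a perfect match, pass 2 accepts
    a unique predefined sequence exactly one substitution away.
    """
    # Pass 1: perfect match.
    if observed in whitelist:
        return observed
    # Pass 2: tolerate a single substitution.
    close = [
        s for s in whitelist
        if len(s) == len(observed) and hamming(observed, s) == 1
    ]
    return close[0] if len(close) == 1 else None


# Toy whitelist (made-up sequences, pairwise Hamming distance >= 4).
toy_whitelist = {"AAAAAAAAA", "CCCCAAAAA", "GGGGCCCCA"}

print(match_cls("AAAAAAAAA", toy_whitelist))  # AAAAAAAAA (perfect match)
print(match_cls("AAAATAAAA", toy_whitelist))  # AAAAAAAAA (one substitution)
print(match_cls("TTTTTTTTT", toy_whitelist))  # None (no candidate)
```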

UMI

By design, the UMI is a string of eight random bases immediately downstream of CLS3. For reads with insertions or deletions within the CLSs, the UMI is taken as the eight bases immediately following the end of the identified CLS3.
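
The rule above amounts to anchoring the UMI on wherever CLS3 was found to end, instead of on a fixed read position. A small sketch, assuming a hypothetical `cls3_end` offset (0-based, exclusive) reported by whatever step identified CLS3:

```python
UMI_LEN = 8


def extract_umi(r1, cls3_end):
    """Return the eight bases following the identified end of CLS3.

    `cls3_end` is a hypothetical 0-based exclusive offset supplied by the
    cell label matching step. Returns None if the read is too short.
    """
    umi = r1[cls3_end:cls3_end + UMI_LEN]
    return umi if len(umi) == UMI_LEN else None


# Made-up Original V1 style read: 52 bases of label and linkers, UMI, poly-T.
read = "A" * 52 + "ACGTACGT" + "T" * 18
print(extract_umi(read, 52))      # ACGTACGT

# The same read with one base deleted upstream: CLS3 now ends at offset 51.
print(extract_umi(read[1:], 51))  # ACGTACGT
```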

Cell label sequences and utility functions

Cell label structure, cell label sequences, bead sequences, and Python utility functions are available for download here:

rhapsody_cell_label.py.txt

The single integer cell index and corresponding assembled R1 bead sequence are available in FASTA format for each bead version: