Annotate R1 Cell Label and UMI

R1 structure

The quality-filtered R1 reads are analyzed to identify the cell label sequences (CLS), the common linker sequences (L), and the Unique Molecular Identifier (UMI) sequence.

Read 1 structure by bead version:

Bead Strand	CC PCR Handle	Diversity Insert	CLS1	Linker1	CLS2	Linker2	CLS3	UMI	Capture Sequence
Original V1	None	None	9bp	ACTGGCCTGCGA	9bp	GGTAGCGGTGACA	9bp	NNNNNNNN	18 dT
Enhanced dT 3'	None	None, A, GT, or TCA	9bp	GTGA	9bp	GACA	9bp	NNNNNNNN	25 dT
Enhanced CC TCR/BCR	ACAGGAAACTCATGGTGCGT	None	9bp	AATG	9bp	CCAC	9bp	NNNNNNNN	TATGCGTAGTAGGTATG or GTGGAGTCGTGATTATA

The cell label is encoded by bases in three sections (CLS1, CLS2, CLS3) along each R1 read. Two common linker sequences (L1, L2) separate the three CLSs; their presence reflects how the capture oligonucleotide probes on the beads are constructed. By design, each CLS is one of either 96 or 384 predefined sequences (depending on bead version), and the predefined sequences are separated from one another by a Hamming distance of at least three bases and an edit distance of at least two bases. A cell label is defined by the unique combination of predefined sequences in the three CLSs. Thus, the maximum possible number of cell labels is either 96³ (884,736) or 384³ (56,623,104). In the final data tables, the three-part cell label is converted to a single integer index between 1 and 384³.
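As an illustration of the design constraints above, the following minimal Python sketch checks the minimum pairwise Hamming distance of a whitelist and maps a three-part label to a single integer. The whitelist `TOY_CLS` is made up for the example (it is not a real Rhapsody whitelist), and the positional encoding in `cell_index` is an assumption; the actual index assignment is defined by the downloadable utility file and FASTA files listed at the end of this section.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(seqs) -> int:
    """Smallest Hamming distance over all pairs in a whitelist."""
    return min(hamming(a, b) for a, b in combinations(seqs, 2))


def cell_index(i1: int, i2: int, i3: int, n: int) -> int:
    """HYPOTHETICAL mapping: three 0-based CLS positions -> 1-based integer.

    n is the number of predefined sequences per CLS (96 or 384).
    The real pipeline's mapping may differ.
    """
    for i in (i1, i2, i3):
        if not 0 <= i < n:
            raise ValueError("CLS position out of range")
    return (i1 * n + i2) * n + i3 + 1


# Toy 9 bp whitelist for illustration only (NOT real Rhapsody sequences).
TOY_CLS = ["AAAAAAAAA", "AAACCCAAA", "CCCAAACCC", "GGGTTTGGG"]

if __name__ == "__main__":
    print(min_pairwise_hamming(TOY_CLS))   # 3
    print(cell_index(0, 0, 0, 384))        # 1
    print(cell_index(383, 383, 383, 384))  # 56623104, i.e. 384**3
```

Under this assumed encoding, every combination of three CLS positions maps to a distinct integer in the range 1 to n³.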

Reads are first checked for perfect matches in all three pre-designed CLS sequences at the expected locations, and reads with perfect matches are kept.
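The perfect-match step can be sketched as a set lookup at fixed offsets. The example below assumes the Original V1 layout from the table above (9 bp CLSs, 12 bp Linker1, 13 bp Linker2, no leading handle or diversity insert). The whitelist and the function name `perfect_match` are hypothetical and are not taken from the pipeline's code.

```python
CLS_LEN = 9
LINKER1 = "ACTGGCCTGCGA"   # Original V1 Linker1 (12 bp), from the table
LINKER2 = "GGTAGCGGTGACA"  # Original V1 Linker2 (13 bp), from the table

# Expected 0-based start offsets of each CLS in an Original V1 read.
CLS1_START = 0
CLS2_START = CLS1_START + CLS_LEN + len(LINKER1)  # 21
CLS3_START = CLS2_START + CLS_LEN + len(LINKER2)  # 43

# Toy whitelist for illustration only (NOT real Rhapsody sequences).
TOY_CLS = {"AAAAAAAAA", "AAACCCAAA", "CCCAAACCC", "GGGTTTGGG"}


def perfect_match(read, cls1_set, cls2_set, cls3_set):
    """Return (CLS1, CLS2, CLS3) if all three match exactly, else None."""
    parts = tuple(
        read[start:start + CLS_LEN]
        for start in (CLS1_START, CLS2_START, CLS3_START)
    )
    if parts[0] in cls1_set and parts[1] in cls2_set and parts[2] in cls3_set:
        return parts
    return None


if __name__ == "__main__":
    read = ("AAACCCAAA" + LINKER1 + "CCCAAACCC" + LINKER2
            + "GGGTTTGGG" + "ACGTACGT" + "T" * 18)
    print(perfect_match(read, TOY_CLS, TOY_CLS, TOY_CLS))
    # ('AAACCCAAA', 'CCCAAACCC', 'GGGTTTGGG')
```

For the Enhanced bead versions, the offsets would need to account for the shorter linkers and for any diversity insert or PCR handle that precedes CLS1.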

The remaining reads are subjected to another round of filtering to recover reads with base substitutions, insertions, and deletions caused by sequencing errors, PCR errors, or errors in oligonucleotide synthesis.
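Because the predefined sequences differ from one another by at least three bases, a CLS carrying a single substitution still has exactly one nearest whitelist entry. The sketch below shows that idea for substitutions only; it does not handle insertions or deletions, and it is an illustrative assumption rather than the pipeline's actual recovery algorithm. The whitelist and the function name `correct_substitution` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_substitution(observed, whitelist, max_mismatches=1):
    """Return the unique whitelist entry within max_mismatches, else None.

    With a minimum pairwise Hamming distance of 3, at most one entry can
    lie within 1 mismatch of any observed sequence.
    """
    hits = [w for w in whitelist if hamming(observed, w) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Toy whitelist for illustration only (NOT real Rhapsody sequences).
TOY_CLS = ["AAAAAAAAA", "AAACCCAAA", "CCCAAACCC", "GGGTTTGGG"]

if __name__ == "__main__":
    print(correct_substitution("AAACCGAAA", TOY_CLS))  # AAACCCAAA
    print(correct_substitution("TTTTTTTTT", TOY_CLS))  # None
```

A linear scan is adequate for whitelists of 96 or 384 entries; recovering indels would additionally require an edit-distance comparison and tracking of the shifted CLS boundaries.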

UMI

By design, the UMI is a string of eight random bases (a randomer) immediately downstream of CLS3. For reads with insertions or deletions within the CLSs, the UMI sequence is taken as the eight bases immediately following the end of the identified CLS3.
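A minimal sketch of the UMI extraction rule, assuming the caller already knows where the identified CLS3 ends in the read. The constant `V1_CLS3_END` is derived from the Original V1 layout in the table (9 + 12 + 9 + 13 + 9 = 52); the function name `extract_umi` is hypothetical.

```python
UMI_LEN = 8

# Expected 0-based end of CLS3 in an indel-free Original V1 read:
# CLS1 (9) + Linker1 (12) + CLS2 (9) + Linker2 (13) + CLS3 (9) = 52.
V1_CLS3_END = 52


def extract_umi(read: str, cls3_end: int = V1_CLS3_END):
    """Return the 8 bases right after CLS3, or None if the read is too short.

    cls3_end is the 0-based position just past the identified CLS3; it
    shifts right after an insertion and left after a deletion in the CLSs.
    """
    umi = read[cls3_end:cls3_end + UMI_LEN]
    return umi if len(umi) == UMI_LEN else None


if __name__ == "__main__":
    upstream = "N" * 52  # placeholder for CLS1-L1-CLS2-L2-CLS3
    print(extract_umi(upstream + "ACGTACGT" + "T" * 18))  # ACGTACGT
    # One inserted base upstream moves the UMI start from 52 to 53.
    print(extract_umi("N" + upstream + "ACGTACGT" + "T" * 18, cls3_end=53))
    # ACGTACGT
```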

Cell label sequences and utility functions

Cell label structure, cell label sequences, bead sequences, and Python utility functions are available for download here:

rhapsody_cell_label.py.txt

The single integer cell index and corresponding assembled R1 bead sequence are available in FASTA format for each bead version:

Original V1 beads: Rhapsody_cellBarcodeV1_IndexToSequence.fasta.zip
Enhanced beads: Rhapsody_cellBarcodeEnh_IndexToSequence.fasta.zip
Enhanced V2/V3 beads: Rhapsody_cellBarcodeEnhV2_IndexToSequence.fasta.zip