Reference Files
Introduction
For targeted mRNA assays, FASTA reference files are used to store the sequences of gene targets.
For whole transcriptome assays (WTA), the reference files archive is a compressed tarball that contains the STAR index files and the GTF transcriptome annotation corresponding to the species of cells used in the BD® WTA experiment.
For ATAC-Seq or Multiomic ATAC-Seq (WTA+ATAC-Seq) assays, the reference files archive is a compressed tarball that contains everything described above for a WTA assay, plus an index for bwa-mem2, a text file listing the mitochondrial contig names, and the Transcription Factor Motif PFM file (if one was provided during reference archive generation).
The AbSeq Reference is a FASTA file for BD® AbSeq Ab-Oligos used in a BD Rhapsody™ experiment.
If additional transgene sequences are used in the experiment, an additional FASTA file containing the sequences can be used as the Supplemental Reference.
Obtaining pre-designed targeted mRNA panels, WTA, or Multiomic WTA+ATAC-Seq reference files
Obtain the targeted FASTA references from the Seven Bridges demo project, or by contacting BD Biosciences customer support at scomix@bdscomix.bd.com.
For WTA assays, obtain a pre-built reference genome archive file for human or mouse from the Seven Bridges demo project, or by downloading from the following link: bd-rhapsody-public.s3-website-us-east-1.amazonaws.com/Rhapsody-WTA/
For ATAC-Seq and Multiomic WTA+ATAC-Seq assays, obtain a pre-built reference genome archive file for human or mouse from the Seven Bridges demo project, or by downloading from the following link: bd-rhapsody-public.s3-website-us-east-1.amazonaws.com/Rhapsody-WTA-ATAC/
Pre-built WTA reference gene biotypes
The GTF file in the pre-built WTA reference archive has been preprocessed to contain only the following gene types:
protein_coding, protein_coding_LOF, lncRNA, lincRNA, antisense, IG_LV_gene, IG_V_gene, IG_V_pseudogene, IG_D_gene, IG_J_gene, IG_J_pseudogene, IG_C_gene, IG_C_pseudogene, TR_V_gene, TR_V_pseudogene, TR_D_gene, TR_J_gene, TR_J_pseudogene, and TR_C_gene
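To make the gene-type restriction concrete, the Python sketch below reads the biotype attribute from a single GTF line and tests it against the list above. It is not the code used to build the pre-built archive. The names `ALLOWED_BIOTYPES`, `gtf_biotype`, and `is_kept` are hypothetical, and the sketch assumes standard GTF attribute syntax (`key "value";`) with the biotype stored as either `gene_type` or `gene_biotype`.

```python
import re

# Gene types copied from the list above for the pre-built WTA reference.
ALLOWED_BIOTYPES = {
    "protein_coding", "protein_coding_LOF", "lncRNA", "lincRNA", "antisense",
    "IG_LV_gene", "IG_V_gene", "IG_V_pseudogene", "IG_D_gene", "IG_J_gene",
    "IG_J_pseudogene", "IG_C_gene", "IG_C_pseudogene", "TR_V_gene",
    "TR_V_pseudogene", "TR_D_gene", "TR_J_gene", "TR_J_pseudogene", "TR_C_gene",
}

# Matches gene_type "x" or gene_biotype "x" in the attribute column.
_BIOTYPE_RE = re.compile(r'\bgene_(?:bio)?type "([^"]+)"')


def gtf_biotype(line):
    """Return the gene_type/gene_biotype value of one GTF line, or None."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) < 9:
        return None  # comment line or malformed record
    match = _BIOTYPE_RE.search(fields[8])
    return match.group(1) if match else None


def is_kept(line):
    """True if the line carries a biotype that appears in the allowed list."""
    return gtf_biotype(line) in ALLOWED_BIOTYPES
```

Running `is_kept` over the gene lines of a GTF gives a quick estimate of how many genes would remain under the same restriction.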
Designing custom Targeted mRNA panels
BD Biosciences can design custom targeted mRNA panels from a list of genes that you provide. To request a custom panel, contact BD Biosciences customer support at scomix@bdscomix.bd.com.
AbSeq reference files
If your experiment contains BD® AbSeq Ab-Oligos, you are required to have an AbSeq reference file. To prepare the AbSeq reference file, you can use the BD AbSeq Panel Generator (abseq-ref-gen.genomics.bd.com) or follow the instructions below.
- Download the FASTA file containing all of the BD Ab-Oligo (AbO) sequences from bd-rhapsody-public.s3-website-us-east-1.amazonaws.com/AbSeq-references/BDAbSeq_allReference_latest.fasta.
- Use a text editor such as Microsoft® Notepad or TextEdit to delete the sequence header and sequence pairs that will not be used in the experiment. Do not use a word processor such as Microsoft® Word, which can add unintended special characters to the file.
- Ensure that the AbSeq reference file follows these rules:
  - File extension is .fa or .fasta
  - Two-line FASTA format. Format example:
    >CD103|ITGAE|AHS0001|pAbO
    AAATAGTATCGAGCGTAGTTAAGTTGCGTAGCCGTT
    >CD161:DX12|KLRB1|AHS0002|pAbO
    GTTATGGTTGTCGGTAGAGTATCGTGTTGCGTTAGT
Note: BD Biosciences uses this format for its sequence header: <AntibodyName>|<GeneSymbol>|<SeqID>|pAbO.
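After hand-editing, the rules above can be checked with a short script. The following Python sketch is one possible check, not a BD tool. The function name `check_abseq_fasta` is hypothetical. The sketch assumes that sequences use only uppercase A/C/G/T/N, that the file should be plain ASCII, and that blank lines can be ignored, none of which is stated explicitly above.

```python
import re

# Header layout described above: >AntibodyName|GeneSymbol|SeqID|pAbO
HEADER_RE = re.compile(r"^>[^|]+\|[^|]+\|[^|]+\|pAbO$")
# Assumption: sequences are written with uppercase A/C/G/T/N only.
SEQ_RE = re.compile(r"^[ACGTN]+$")


def check_abseq_fasta(filename, text):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not filename.endswith((".fa", ".fasta")):
        problems.append("extension is not .fa or .fasta")
    if not text.isascii():
        # Word processors can introduce characters such as curly quotes.
        problems.append("file contains non-ASCII characters")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2:
        problems.append("odd number of non-empty lines; expected header/sequence pairs")
    # Walk the file two lines at a time: header, then its sequence.
    for i in range(0, len(lines) - 1, 2):
        record = i // 2 + 1
        header, seq = lines[i], lines[i + 1]
        if not HEADER_RE.match(header):
            problems.append(f"record {record}: unexpected header {header!r}")
        if not SEQ_RE.match(seq):
            problems.append(f"record {record}: sequence line is not plain A/C/G/T/N")
    return problems
```

For a file on disk, this could be called as `check_abseq_fasta(p.name, p.read_text())` with `p = pathlib.Path("my_panel.fasta")`, where the file name is only an example.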
Building a custom WTA only or Multiomic WTA+ATAC-Seq reference archive
The WTA reference archive is a tar.gz file with the following internal structure:
BD_Rhapsody_Reference_Files/ # top level folder
    star_index/ # sub-folder containing STAR index
        [files created with STAR --runMode genomeGenerate]
    GTF for gene-transcript-annotation e.g. "gencode.v49.primary_assembly.annotation.gtf"
The WTA+ATAC-Seq reference archive is a tar.gz file with the following internal structure:
BD_Rhapsody_Reference_Files/ # top level folder
    star_index/ # sub-folder containing STAR index
        [files created with STAR --runMode genomeGenerate]
    GTF for gene-transcript-annotation e.g. "gencode.v49.primary_assembly.annotation.gtf"
    mitochondrial_contigs.txt # mitochondrial contigs in the reference genome, one contig name per line (e.g., chrMT or chrM)
    JASPAR2024_CORE_vertebrates_non-redundant_pfms_jaspar.pfm # Transcription Factor Motif PFM file from JASPAR
    bwa-mem2_index/ # sub-folder containing bwa-mem2 index
        [files created with bwa-mem2 index]
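A quick way to confirm that an archive matches the layouts above is to list its members and look for the expected components. The Python sketch below uses only the standard `tarfile` module. The function names are hypothetical, and the sketch assumes that the GTF file name ends in `.gtf` (as in the example name above) and sits under the top-level folder. It does not check the optional PFM file.

```python
import tarfile

TOP = "BD_Rhapsody_Reference_Files"


def archive_members(path):
    """List the member paths of a .tar.gz reference archive."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()


def _relative(name):
    """Path relative to the top-level folder, or None if outside it."""
    if name.startswith("./"):
        name = name[2:]
    name = name.strip("/")
    if not name.startswith(TOP + "/"):
        return None
    return name[len(TOP) + 1:]


def missing_components(names, atac=False):
    """Return the expected components that are absent from the member list."""
    rel = [r for r in map(_relative, names) if r]
    missing = []
    if not rel:
        missing.append(TOP + "/")
    if not any(r.startswith("star_index/") for r in rel):
        missing.append("star_index/ contents")
    if not any(r.endswith(".gtf") for r in rel):
        missing.append("GTF annotation")
    if atac:
        # Extra components of the WTA+ATAC-Seq layout.
        if not any(r.startswith("bwa-mem2_index/") for r in rel):
            missing.append("bwa-mem2_index/ contents")
        if "mitochondrial_contigs.txt" not in rel:
            missing.append("mitochondrial_contigs.txt")
    return missing
```

For example, `missing_components(archive_members("testrefhuman49.tar.gz"), atac=True)` would return an empty list for an archive that contains every component checked here.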
The same docker image used for running the BD Rhapsody™ Sequence Analysis Pipeline can be used for generating a WTA only or WTA+ATAC-Seq reference archive with the following steps:
- Go to bitbucket.org/CRSwDev/cwl and download the Extra_Utilities file: make_rhap_reference_<version>.cwl
- Gather a matching genome sequence in FASTA format and a GTF with gene, transcript, and exon annotations, for example from gencodegenes.org. Chromosome names need to match exactly between the FASTA file and the GTF. These features of the GTF are important:
  - Each gene and exon line in the GTF must have a "gene_id" and a "gene_name" attribute. If the "gene_name" attribute is not unique for each "gene_id" attribute, it will be modified to include the gene location. The "gene_name" attribute will be used as the identifier for bioproducts in the pipeline output. It is fine if the "gene_name" is the same as the "gene_id".
  - The value of the "strand" column of these lines should be "+" or "-", and must not be ".".
  - These lines should include a "gene_type" or a "gene_biotype" attribute, or they will be filtered out by the pipeline. Some GTF files do not include these attributes on non-gene features, so be sure to check the input.
  - By default, the make_rhap_reference tool will remove any gene lines in which the "gene_type" or "gene_biotype" attribute is not in the following list: "protein_coding", "protein_coding_LOF", "lncRNA", "lincRNA", "antisense", "IG_LV_gene", "IG_V_gene", "IG_V_pseudogene", "IG_D_gene", "IG_J_gene", "IG_J_pseudogene", "IG_C_gene", "IG_C_pseudogene", "TR_V_gene", "TR_V_pseudogene", "TR_D_gene", "TR_J_gene", "TR_J_pseudogene", or "TR_C_gene". This filtering can be turned off with the Filtering_off parameter.
  - If ATAC analysis is intended, the "gene" features should also have associated "transcript" features.
- If you are creating a Reference Archive for ATAC-Seq datasets, and you want Transcription Factor Motif analysis to be run by the pipeline, also gather a JASPAR-formatted text file of Transcription Factor Motif Position Frequency Matrices (PFMs) appropriate for your organism, such as one of the clades available from the JASPAR site. If you exclude this file, the Rhapsody pipeline will not run Transcription Factor Motif analysis, and some ATAC analysis in Cellismo will not be possible (e.g., differential analysis using motifs).
- Run cwl-runner as in the following example:
  cwl-runner make_rhap_reference_3.0.cwl --Genome_fasta GRCh38.primary_assembly.genome.fa --Gtf gencode.v49.primary_assembly.annotation.gtf --Transcription_Factor_Motif_PFM JASPAR2024_CORE_vertebrates_non-redundant_pfms_jaspar.txt --Filter_PARs --Archive_prefix testrefhuman49
  The resulting testrefhuman49.tar.gz file can be used for the Reference_Archive input of the BD Rhapsody™ Sequence Analysis Pipeline. By default, the combined WTA+ATAC-Seq reference is created. To create a WTA only reference archive, pass the --WTA_only flag, for example:
  cwl-runner make_rhap_reference_3.0.cwl --Genome_fasta GRCh38.primary_assembly.genome.fa --Gtf gencode.v49.primary_assembly.annotation.gtf --Filter_PARs --Archive_prefix testrefhuman49 --WTA_only
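Before running the reference build, it can save time to check the FASTA and GTF inputs against the requirements listed in the steps above. The Python sketch below covers a subset of them: matching contig names, the "gene_id" and "gene_name" attributes, the strand value, and the presence of a biotype attribute on gene and exon lines. It is not part of make_rhap_reference. The function names `fasta_contigs` and `check_gtf` are hypothetical, and the sketch assumes that the contig name is the first whitespace-delimited token of each FASTA header.

```python
import re

# Captures key "value" pairs from the GTF attribute column.
_ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def fasta_contigs(fasta_lines):
    """Contig names from FASTA headers: text after '>' up to the first whitespace."""
    return {
        ln[1:].split()[0]
        for ln in fasta_lines
        if ln.startswith(">") and ln[1:].split()
    }


def check_gtf(gtf_lines, contigs):
    """Return (line_number, message) pairs for lines that break the checked rules."""
    problems = []
    for num, line in enumerate(gtf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append((num, "expected 9 tab-separated columns"))
            continue
        chrom, feature, strand = fields[0], fields[2], fields[6]
        if chrom not in contigs:
            problems.append((num, f"contig {chrom!r} not in FASTA"))
        if feature not in ("gene", "exon"):
            continue  # the remaining rules apply to gene and exon lines
        attrs = dict(_ATTR_RE.findall(fields[8]))
        for key in ("gene_id", "gene_name"):
            if key not in attrs:
                problems.append((num, f"missing {key}"))
        if strand not in ("+", "-"):
            problems.append((num, f"strand is {strand!r}"))
        if "gene_type" not in attrs and "gene_biotype" not in attrs:
            problems.append((num, "missing gene_type/gene_biotype"))
    return problems
```

Both functions accept any iterable of lines, so open file handles can be passed directly, for example `check_gtf(open(gtf_path), fasta_contigs(open(fasta_path)))`, where `gtf_path` and `fasta_path` are placeholders for your own files.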