Extra Utilities
The local version of the BD Rhapsody™ Sequence Analysis Pipeline comes with several useful utilities.
These utilities can be run in the same way as the main Sequence Analysis Pipeline, using cwl-runner and Docker. They
use the same Docker image as the main Sequence Analysis Pipeline: bdgenomics/rhapsody. See Local Server
Setup for installation instructions.
Inputs are provided either in a YML specification file or on the command line. CWL documents for these utilities are in
the same location as the main pipeline CWL (versioned folders):
https://bitbucket.org/CRSwDev/cwl.
Make Rhapsody Reference
make_rhap_reference_[version].cwl
Create a new WTA Reference Archive for use as an input to the Rhapsody Sequence Analysis Pipeline.
Inputs:
-
Genome_fasta:
Required. File path to the reference genome file in FASTA or FASTA.GZ format.
-
Gtf:
Required. File path to the transcript annotation file in GTF or GTF.GZ format. The Sequence Analysis Pipeline requires the 'gene_name' or 'gene_id' attribute to be set on each gene and exon feature. Gene and exon feature lines must have the same attribute, and exons must have a corresponding gene with the same value. For TCR/BCR assays, the TCR or BCR gene segments must have the 'gene_type' or 'gene_biotype' attribute set, and the value should begin with 'TR' or 'IG', respectively.
-
Extra_sequences:
Optional. File path to additional sequences in FASTA format to use when building the STAR index (e.g. transgenes or CRISPR guide barcodes). GTF lines for these sequences will be automatically generated and combined with the main GTF.
-
Mitochondrial_Contigs:
Optional. Names of the Mitochondrial contigs in the provided Reference Genome. Fragments originating from contigs other than these are identified as 'nuclear fragments' in the analysis pipeline.
-
Transcription_Factor_Motif_PFM:
Optional. Text file of Transcription Factor Motif position frequency matrices in JASPAR format. The pre-built BD reference archive files use the JASPAR2024_CORE_vertebrates_non-redundant_pfms_jaspar.txt file for both Human and Mouse. You can browse the list of all files here: https://jaspar.elixir.no/download/data/2024/CORE/
-
Disable_Biotype_Filtering:
Optional. [True/False] By default, the input GTF files are filtered based on the gene_type/gene_biotype attribute, using biotypes defined by Gencode/Ensembl. If you have already pre-filtered the input annotation files or wish to turn off the filtering, set this option to True. With filtering enabled, GTF features with the following attribute values are kept:
protein_coding, protein_coding_LOF, lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97), IG_LV_gene, IG_V_gene, IG_V_pseudogene, IG_D_gene, IG_J_gene, IG_J_pseudogene, IG_C_gene, IG_C_pseudogene, TR_V_gene, TR_V_pseudogene, TR_D_gene, TR_J_gene, TR_J_pseudogene, TR_C_gene
-
Disable_Readthrough_Filtering:
Optional. [True/False] By default genes with only readthrough transcripts are removed. Any readthrough_transcript feature is also removed if its parent gene overlaps with another gene that meets the biotype requirement. Please set this option to True to disable this behaviour.
-
Filter_PARs:
Optional. [True/False] Default: False. Applies only to a Human build 38 (GRCh38) reference. If enabled, features in the two pseudoautosomal regions (PARs) on the Y chromosome are removed.
-
Archive_prefix:
Optional. String. A prefix base name for the result compressed archive file. The default value is constructed based on the input reference files.
-
Maximum_threads:
Optional. Integer. The maximum number of threads to use. By default, all available cores are used.
-
Extra_STAR_params:
Optional. String. Parameters to pass directly to the STAR genomeGenerate process. Useful for very large or very small genomes. Example: "--limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11"
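The Gtf requirements listed above can be checked before building a reference. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, it reads only quoted GTF attributes, and it checks a single attribute key (gene_name by default) across gene and exon features.

```python
import re

# Matches quoted GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(field):
    """Return the quoted attributes of a GTF attribute column as a dict."""
    return dict(ATTR_RE.findall(field))

def check_gtf_lines(lines, key="gene_name"):
    """Collect problems with `key` on gene and exon features (hypothetical helper)."""
    problems = []
    gene_values = set()
    exon_values = []  # (line number, attribute value)
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] not in ("gene", "exon"):
            continue  # only gene and exon features are checked
        value = parse_attributes(columns[8]).get(key)
        if value is None:
            problems.append(f"line {number}: {columns[2]} feature lacks {key}")
        elif columns[2] == "gene":
            gene_values.add(value)
        else:
            exon_values.append((number, value))
    # Every exon value must also appear on some gene feature.
    for number, value in exon_values:
        if value not in gene_values:
            problems.append(f"line {number}: exon {key} '{value}' has no matching gene")
    return problems
```

For a real annotation, the lines could come from an open file handle (gzip.open(path, "rt") for GTF.GZ); an empty result means no problems were found for the chosen key.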
Example command:
cwl-runner make_rhap_reference_3.0.cwl --Genome_fasta GRCh38.primary_assembly.genome.fa --Gtf gencode.v49.primary_assembly.annotation.gtf --Archive_prefix testrefhuman49 --WTA_only --Filter_PARs
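Since inputs can also be given in a YML specification file, the documented inputs from the command above could be written to such a file instead. The sketch below is illustrative only: render_inputs is a hypothetical helper, file inputs are written as CWL File objects (class/location), and only inputs listed in this section are included.

```python
def render_inputs(files, scalars):
    """Build YML text for a CWL inputs file (hypothetical helper).

    `files` maps input names to file paths; `scalars` maps input names
    to plain values (bool, int, or str).
    """
    lines = []
    for name, path in files.items():
        lines.append(f"{name}:")
        lines.append("  class: File")
        lines.append(f'  location: "{path}"')
    for name, value in scalars.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = str(value)
        else:
            rendered = f'"{value}"'
        lines.append(f"{name}: {rendered}")
    return "\n".join(lines) + "\n"

yml_text = render_inputs(
    files={
        "Genome_fasta": "GRCh38.primary_assembly.genome.fa",
        "Gtf": "gencode.v49.primary_assembly.annotation.gtf",
    },
    scalars={"Archive_prefix": "testrefhuman49", "Filter_PARs": True},
)
print(yml_text)
```

Saving the printed text to a file and passing that file to cwl-runner after the CWL document name would then replace the individual command-line options, assuming this utility accepts a YML file in the same way as the other utilities.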
File structure of the resulting reference archive:
BD_Rhapsody_Reference_Files/
    star_index/
        [files created with star genomeGenerate]
    [filtered/non-filtered transcriptome annotation].gtf
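To preview which features of an annotation would survive the default biotype filtering described under Disable_Biotype_Filtering, the kept values can be applied to GTF lines directly. This is a rough sketch that approximates only the attribute-value check, not the pipeline's actual filter; the function name, the legacy switch, and the choice to drop lines without a biotype attribute are assumptions.

```python
import re

# Attribute values kept by default, as listed under Disable_Biotype_Filtering.
KEPT_BIOTYPES = {
    "protein_coding", "protein_coding_LOF", "lncRNA",
    "IG_LV_gene", "IG_V_gene", "IG_V_pseudogene", "IG_D_gene", "IG_J_gene",
    "IG_J_pseudogene", "IG_C_gene", "IG_C_pseudogene",
    "TR_V_gene", "TR_V_pseudogene", "TR_D_gene", "TR_J_gene",
    "TR_J_pseudogene", "TR_C_gene",
}
# Older annotations (Gencode < v31/M22, Ensembl 97) use these instead of lncRNA.
LEGACY_LNCRNA_BIOTYPES = {"lincRNA", "antisense"}

# Matches either gene_type "..." or gene_biotype "..."
BIOTYPE_RE = re.compile(r'gene_(?:bio)?type "([^"]*)"')

def passes_biotype_filter(gtf_line, legacy=False):
    """Return True if the line's gene biotype is in the kept set (hypothetical helper)."""
    match = BIOTYPE_RE.search(gtf_line)
    if match is None:
        return False  # assumption: lines without a biotype attribute are dropped
    allowed = KEPT_BIOTYPES | LEGACY_LNCRNA_BIOTYPES if legacy else KEPT_BIOTYPES
    return match.group(1) in allowed
```

Counting the lines for which this returns True gives a quick estimate of how much of an annotation the default filter would retain.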
PhiX contamination detection
PhiXContamination_[version].cwl
Check a FASTQ file for PhiX contamination by aligning its reads to the PhiX genome with Bowtie2.
Inputs:
-
Fastq:
Required. File path to a single FASTQ file to check for PhiX contamination.
-
Threads:
Optional. Integer. The number of threads to use. By default, all available cores are used.
Example command:
cwl-runner PhiXContamination_2.0.cwl --Fastq MyRhapsodyLibrary_R1.fastq.gz --Threads 8
Example result:
36508493 reads; of these:
36508493 (100.00%) were unpaired; of these:
36503405 (99.99%) aligned 0 times
5088 (0.01%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.01% overall alignment rate
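When checking many libraries, the overall alignment rate can be pulled out of this summary programmatically. A minimal sketch, assuming the summary text has been captured as a string in the format shown above; the function name is hypothetical.

```python
import re

# The last line of the summary reports the overall rate as a percentage.
RATE_RE = re.compile(r"([\d.]+)% overall alignment rate")

def overall_alignment_rate(summary_text):
    """Return the overall alignment rate (in percent) from the summary text."""
    match = RATE_RE.search(summary_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found")
    return float(match.group(1))

summary = """36508493 reads; of these:
36508493 (100.00%) were unpaired; of these:
36503405 (99.99%) aligned 0 times
5088 (0.01%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.01% overall alignment rate
"""
rate = overall_alignment_rate(summary)
print(f"{rate:.2f}% of reads aligned to PhiX")
```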
Annotate Cell Label and UMI only
AnnotateCellLabelUMI_[version].cwl
Given pairs of R1/R2 FASTQ files from Rhapsody libraries, annotate only the cell label and UMI from R1 and write them into the header of the corresponding R2 read.
Format of result FASTQ:
@OriginalHeader;cell_index;UMI
[R2Sequence]
+
[R2Quality]
Inputs:
-
Reads:
Required. Comma-separated list of FASTQ file paths.
-
Maximum_Threads:
Optional. Integer. The maximum number of threads to use. By default, all available cores are used.
Example YML input specification file [inputs.yml]:
Reads:
 - class: File
   location: "test/mySample_R1_.fastq.gz"
 - class: File
   location: "test/mySample_R2_.fastq.gz"
Maximum_Threads: 8
Example command:
cwl-runner AnnotateCellLabelUMI_2.0.cwl inputs.yml
Example result in mySample_R2.annotated.fastq.gz:
@M04277:241:000000000-B4VBL:1:1101:9821:2660;8144695;ATGCACGC
TGCCCTCAACGACCACTTTGTCAAGCTCATTTCCTGGTATGACAACGAATTTGGCTACAGCAACAGGGTGGTGGAC
+
CCCCCGGFFGGGGGDGGGGGGGGFGGGGGGGGGGGFEGEGFGGDCEF@:FGGGGGGGGGGG?FGCFGGGEFGGGGG
@M04277:241:000000000-B4VBL:1:1101:22673:2660;11066516;GCGACACA
ATTTTTAATACACCTGCTTCACGTCCCTATGTTGGGAAGTCCATATTTGTCTGCTTTTCTTGCAGCATCATTTCCT
+
CCCCCGGGFGGGD8C@C<EFFE@@C,@FFFCFFAFGGGGCGGGGGGGGDFGGGGFA<FGGFFGFGGGGGGGEGGGG
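Downstream scripts can recover the cell label and UMI by splitting each header on its last two semicolons, following the '@OriginalHeader;cell_index;UMI' format described above. The sketch below uses hypothetical function names and keeps cell_index as a string; the sequence and quality of the second record are abbreviated for illustration.

```python
from collections import Counter
from itertools import islice

def split_annotated_header(header):
    """Split '@OriginalHeader;cell_index;UMI' into its three parts."""
    # Split from the right so semicolons in the original header are preserved.
    original, cell_index, umi = header.rstrip("\n").rsplit(";", 2)
    if original.startswith("@"):
        original = original[1:]
    return original, cell_index, umi

def count_reads_per_cell(lines):
    """Count reads per cell_index from FASTQ lines (four lines per record)."""
    counts = Counter()
    # Take every fourth line so quality strings starting with '@' are never parsed.
    for header in islice(lines, 0, None, 4):
        _, cell_index, _ = split_annotated_header(header)
        counts[cell_index] += 1
    return counts

records = [
    "@M04277:241:000000000-B4VBL:1:1101:9821:2660;8144695;ATGCACGC\n",
    "TGCCCTCAACGACCACTTTGTCAAGCTCATTTCCTGGTATGACAACGAATTTGGCTACAGCAACAGGGTGGTGGAC\n",
    "+\n",
    "CCCCCGGFFGGGGGDGGGGGGGGFGGGGGGGGGGGFEGEGFGGDCEF@:FGGGGGGGGGGG?FGCFGGGEFGGGGG\n",
    "@M04277:241:000000000-B4VBL:1:1101:22673:2660;11066516;GCGACACA\n",
    "ATTT\n",  # abbreviated sequence
    "+\n",
    "CCCC\n",  # abbreviated quality
]
print(count_reads_per_cell(records))
```

For a real result file, the same function can be given a text handle opened with gzip.open(path, "rt") instead of the in-memory list.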