Extra Utilities


The local version of the BD Rhapsody™ Sequence Analysis Pipeline comes with several useful utilities, each described below.

These utilities are run in the same way as the main Sequence Analysis Pipeline, using cwl-runner and Docker, and they use the same Docker image (bdgenomics/rhapsody). See Local Server Setup for installation instructions. Inputs are provided either in a YML specification file or on the command line. The CWL documents for these utilities are in the same versioned folders as the main pipeline CWL: https://bitbucket.org/CRSwDev/cwl.

Make Rhapsody Reference

make_rhap_reference_[version].cwl

Create a new WTA Reference Archive for use as an input to the Rhapsody Sequence Analysis Pipeline.

Inputs:

  • Genome_fasta:

    Required. File path to the reference genome file in FASTA or FASTA.GZ format.

  • Gtf:

    Required. File path to the transcript annotation files in GTF or GTF.GZ format. The Sequence Analysis Pipeline requires the 'gene_name' or 'gene_id' attribute to be set on each gene and exon feature. Gene and exon feature lines must have the same attribute, and exons must have a corresponding gene with the same value. For TCR/BCR assays, the TCR or BCR gene segments must have the 'gene_type' or 'gene_biotype' attribute set, and the value should begin with 'TR' or 'IG', respectively.

  • Extra_sequences:

    Optional. File path to additional sequences in FASTA format to use when building the STAR index (e.g. transgenes or CRISPR guide barcodes). GTF lines for these sequences will be automatically generated and combined with the main GTF.

  • Filtering_off:

    Optional. [True/False] By default, the input GTF files are filtered based on the gene_type/gene_biotype attribute, using the biotypes defined by Gencode/Ensembl. If you have already pre-filtered the input annotation files, or otherwise wish to turn off the filtering, set this option to True. When filtering is on, GTF features with the following attribute values are kept:

    protein_coding, lncRNA (lincRNA and antisense for Gencode < v31/M22/Ensembl97), IG_LV_gene, IG_V_gene, IG_V_pseudogene, IG_D_gene, IG_J_gene, IG_J_pseudogene, IG_C_gene, IG_C_pseudogene, TR_V_gene, TR_V_pseudogene, TR_D_gene, TR_J_gene, TR_J_pseudogene, TR_C_gene

  • Archive_prefix:

    Optional. String. A prefix base name for the result compressed archive file. The default value is constructed based on the input reference files.

  • Maximum_threads:

    Optional. Integer. The maximum number of threads to use. By default, all available cores are used.

  • Extra_STAR_params:

    Optional. String. Parameters to pass directly to the STAR genomeGenerate process. Useful for very large or very small genome sizes. Example: "--limitGenomeGenerateRAM 48000 --genomeSAindexNbases 11"
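
The Gtf requirements and the default biotype filter described above can be spot-checked on individual GTF lines before building a reference. The following Python sketch is not the pipeline's own implementation. The names parse_attributes and check_line are hypothetical, and the kept-biotype set is copied from the Filtering_off description (it includes lincRNA and antisense, which apply only to older Gencode/Ensembl releases).

```python
import re

# Biotypes kept by the default filter, copied from the Filtering_off description.
# lincRNA and antisense are only relevant for Gencode < v31/M22/Ensembl97.
KEPT_BIOTYPES = {
    "protein_coding", "lncRNA", "lincRNA", "antisense",
    "IG_LV_gene", "IG_V_gene", "IG_V_pseudogene", "IG_D_gene", "IG_J_gene",
    "IG_J_pseudogene", "IG_C_gene", "IG_C_pseudogene",
    "TR_V_gene", "TR_V_pseudogene", "TR_D_gene", "TR_J_gene",
    "TR_J_pseudogene", "TR_C_gene",
}

# Matches quoted attributes such as: gene_id "ENSG00000000001"
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(field):
    """Parse the ninth GTF column into a dict (first value wins for repeated keys)."""
    attrs = {}
    for key, value in ATTR_RE.findall(field):
        attrs.setdefault(key, value)
    return attrs


def check_line(line):
    """Return (has_gene_label, biotype_is_kept) for a gene or exon line.

    Returns None for comments, malformed lines, and other feature types.
    """
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] not in ("gene", "exon"):
        return None
    attrs = parse_attributes(cols[8])
    has_label = "gene_name" in attrs or "gene_id" in attrs
    biotype = attrs.get("gene_type", attrs.get("gene_biotype"))
    return has_label, biotype in KEPT_BIOTYPES
```

A line for which the first value is False lacks both 'gene_name' and 'gene_id'; a line for which the second value is False would be dropped by the default filter as described above.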

Example command:

cwl-runner make_rhap_reference_2.0.cwl --Genome_fasta GRCh38.primary_assembly.genome.fa.gz --Gtf gencode.v42.primary_assembly.annotation.gtf.gz
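
When the command is launched from a script, the argument list can be assembled programmatically. The sketch below is illustrative only: build_reference_command is a hypothetical helper, the archive prefix value is made up, and it assumes that optional inputs such as Maximum_threads and Archive_prefix are passed as --Name value pairs in the same way as --Genome_fasta and --Gtf in the example command.

```python
import shlex


def build_reference_command(cwl_path, genome_fasta, gtf, **optional_inputs):
    """Assemble a cwl-runner argument list for the Make Rhapsody Reference utility."""
    command = ["cwl-runner", cwl_path, "--Genome_fasta", genome_fasta, "--Gtf", gtf]
    # Assumption: each optional input maps to a "--<InputName> <value>" pair.
    for name, value in optional_inputs.items():
        command += [f"--{name}", str(value)]
    return command


command = build_reference_command(
    "make_rhap_reference_2.0.cwl",
    "GRCh38.primary_assembly.genome.fa.gz",
    "gencode.v42.primary_assembly.annotation.gtf.gz",
    Maximum_threads=8,
    Archive_prefix="GRCh38_gencode42",  # hypothetical prefix
)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where cwl-runner and Docker are installed.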

File structure of the resulting reference archive:

BD_Rhapsody_Reference_Files/
   star_index/
       [files created with star genomeGenerate]
   [filtered/non-filtered transcriptome annotation].gtf

PhiX contamination detection

PhiXContamination_[version].cwl

Check a FASTQ file for PhiX contamination by aligning the reads to the PhiX genome with Bowtie2.

Inputs:

  • Fastq:

    Required. File path to a single FASTQ file to check for PhiX contamination.

  • Threads:

    Optional. Integer. The number of threads to use. By default, all available cores are used.

Example command:

cwl-runner PhiXContamination_2.0.cwl --Fastq MyRhapsodyLibrary_R1.fastq.gz --Threads 8

Example result:

36508493 reads; of these:
  36508493 (100.00%) were unpaired; of these:
    36503405 (99.99%) aligned 0 times
    5088 (0.01%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
0.01% overall alignment rate
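
A summary in this format can be reduced to a few numbers for reporting. The following Python sketch assumes the summary text has already been captured as a string; parse_bowtie2_summary is a hypothetical helper, and the regular expressions are written only against the unpaired-read layout shown in the example result.

```python
import re


def parse_bowtie2_summary(text):
    """Extract total reads, PhiX-aligned reads, and overall rate from a summary."""
    total = re.search(r"^(\d+) reads; of these:", text, re.M)
    rate = re.search(r"^([\d.]+)% overall alignment rate", text, re.M)
    # Sum the reads that aligned once or more than once (unpaired layout only).
    aligned_counts = re.findall(
        r"^\s*(\d+) \([\d.]+%\) aligned (?:exactly 1 time|>1 times)", text, re.M
    )
    if total is None or rate is None:
        raise ValueError("text does not look like an alignment summary")
    return {
        "total_reads": int(total.group(1)),
        "aligned_reads": sum(int(n) for n in aligned_counts),
        "overall_rate_percent": float(rate.group(1)),
    }


summary = """36508493 reads; of these:
  36508493 (100.00%) were unpaired; of these:
    36503405 (99.99%) aligned 0 times
    5088 (0.01%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
0.01% overall alignment rate
"""

result = parse_bowtie2_summary(summary)
```

Here the reads counted as aligned are those that matched the PhiX genome, so a higher rate indicates more PhiX reads in the FASTQ file.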

Annotate Cell Label and UMI only

AnnotateCellLabelUMI_[version].cwl

Given pairs of R1/R2 FASTQ files from Rhapsody libraries, annotate only the cell label and UMI from R1 and write them into the header of the corresponding R2 read.

Format of result FASTQ:

@OriginalHeader;cell_index;UMI
[R2Sequence]
+
[R2Quality]

Inputs:

  • Reads:

    Required. Comma-separated list of FASTQ file paths.

  • Maximum_Threads:

    Optional. Integer. The maximum number of threads to use. By default, all available cores are used.

Example YML input specification file [inputs.yml]:

Reads:
 - class: File
   location: "test/mySample_R1_.fastq.gz"
 - class: File
   location: "test/mySample_R2_.fastq.gz"

Maximum_Threads: 8

Example command:

cwl-runner AnnotateCellLabelUMI_2.0.cwl inputs.yml

Example result in mySample_R2.annotated.fastq.gz:

@M04277:241:000000000-B4VBL:1:1101:9821:2660;8144695;ATGCACGC
TGCCCTCAACGACCACTTTGTCAAGCTCATTTCCTGGTATGACAACGAATTTGGCTACAGCAACAGGGTGGTGGAC
+
CCCCCGGFFGGGGGDGGGGGGGGFGGGGGGGGGGGFEGEGFGGDCEF@:FGGGGGGGGGGG?FGCFGGGEFGGGGG
@M04277:241:000000000-B4VBL:1:1101:22673:2660;11066516;GCGACACA
ATTTTTAATACACCTGCTTCACGTCCCTATGTTGGGAAGTCCATATTTGTCTGCTTTTCTTGCAGCATCATTTCCT
+
CCCCCGGGFGGGD8C@C<EFFE@@C,@FFFCFFAFGGGGCGGGGGGGGDFGGGGFA<FGGFFGFGGGGGGGEGGGG
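
Downstream scripts can recover the cell index and UMI by splitting each header on its last two semicolons. The following Python sketch assumes the @OriginalHeader;cell_index;UMI format shown above and a gzip-compressed, four-lines-per-record FASTQ file; split_annotated_header and iter_annotated_records are hypothetical helper names.

```python
import gzip
from itertools import islice


def split_annotated_header(header_line):
    """Split '@OriginalHeader;cell_index;UMI' into its three parts (as strings)."""
    header = header_line.rstrip("\n")
    if header.startswith("@"):
        header = header[1:]
    # Split from the right so semicolons in the original header are preserved.
    original, cell_index, umi = header.rsplit(";", 2)
    return original, cell_index, umi


def iter_annotated_records(path):
    """Yield (original_header, cell_index, umi, sequence, quality) per record."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            original, cell_index, umi = split_annotated_header(record[0])
            yield (
                original,
                cell_index,
                umi,
                record[1].rstrip("\n"),
                record[3].rstrip("\n"),
            )
```

For example, iterating over iter_annotated_records("mySample_R2.annotated.fastq.gz") yields one tuple per annotated R2 read.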