Rejuvenome leverages single nuclei RNA-seq as a workhorse assay to profile the effects of anti-aging interventions in mice throughout the body. Given that, it is imperative for us to find cost-effective solutions for library preparation and sequencing.
In a later post we will discuss our ongoing comparison between Parse Biosciences and 10x Genomics kits for library preparation; here we will discuss our decision to choose BGI as our main sequencing provider.
Early on in the ideation stage of Rejuvenome we learned that BGI sequencing offered data quality comparable to Illumina’s at a fraction of the cost. We decided to set up a pilot in which we would compare the same cDNA library sequenced on both an Illumina and a BGI platform.
Our prior expectation was that there would be no major issues: There were already published comparisons between Illumina and BGI sequencers, and they all found that data quality is satisfactory:
“To our knowledge, this study is the first to utilize MGISEQ-2000 platform for scRNA-seq, and the first to compare sequence performance for the widely used 10× Chromium platform against Illumina platforms. Our comprehensive benchmarking utilizes data from over 70 000 cells, and shows that the MGISEQ-2000 has to be highly comparable performance across a range of modalities to the Illumina NextSeq 500 and NovaSeq 6000 platforms at equal read depth, while being more cost effective. For single cell RNA-sequencing-specific metrics, such as read quality, cell detection and RNA molecule detection, we found the Illumina NovaSeq 6000 and BGI MGISEQ-2000 platforms generated highly comparable data, and similar observations were made between the Illumina NextSeq 500 and MGISEQ-2000 platforms.” [emphasis ours]
Senabouth, Anne, et al. “Comparative performance of the BGI and Illumina sequencing technology for single-cell RNA-sequencing.” NAR genomics and bioinformatics 2.2 (2020)
“These results are in line with our results in that the performances of the MGISEQ-2000 and DNBSEQ series were similar to those of Illumina NGS systems. Overall, we concluded that the three platforms (NovaSeq 6000, MGISEQ-2000, and DNBSEQ-T7) are highly concordant, and that MGISEQ-2000 and DNGSEQ-T7 can be fully compatible alternatives to NovaSeq 6000 in WGS analysis.” [emphasis ours]
Jeon, Sol A., et al. “Comparison between MGI and Illumina sequencing platforms for whole genome sequencing.” Genes & Genomics 43.7 (2021)
“Our study is the first to utilize BGISEQ-500 platform for scRNA-seq. Our comprehensive benchmarking of performance metrics utilizes two scRNA-seq protocols (SMARTer and Smart-seq2), multiple spike-ins (ERCC alone, ERCC+SIRV), two different cell lines (mESCs, K562s), and two technologies (Fluidigm C1, plate-based) across Illumina HiSeq and BGISEQ-500 platform. Utilizing 468 single K562 and mESCs and matched 1297 single-cell libraries, we observe BGISEQ-500 to be highly comparable in sensitivity, accuracy, and reproducibility to Illumina platform, while being considerably more cost-effective.” [emphasis ours]
Natarajan, Kedar Nath, et al. “Comparative analysis of sequencing technologies for single-cell transcriptomics.” Genome biology 20.1 (2019)
We could find no available sources that pointed to potential downsides of the BGI platforms. A recently announced sequencing technology also promises substantially lower costs, but it has documented caveats that impact data quality. This is not the case with BGI, which we found to be comparable to Illumina in terms of quality.
Rejuvenome’s BGI pilot
To compare the two technologies, we prepared two cDNA sub-libraries using ParseBio’s Whole Transcriptome Mini library preparation kit. Note that this is Parse’s v1 kit, and that they recently released an improved v2 kit.
Both sub-libraries contained cells from the same four samples. As the purpose of our pilot was to test different sequencing technologies, we decided to use existing cell lines already available at the Buck Institute. The libraries we sequenced contain IMR90 (human lung fibroblasts) and primary cardiac fibroblasts, and each was sequenced on both a NovaSeq 6000 and a BGI DNBSEQ-G400. The G400 yields fewer reads per lane than the NovaSeq; BGI has another sequencer, the DNBSEQ-T7, that is capable of more reads per lane and is comparable to the NovaSeq. As a result, we got more reads on the Illumina libraries than on the BGI ones. Hence, for an apples-to-apples comparison, the Illumina libraries were subsampled to the same number of reads as the BGI libraries. Another difference between the two runs is that the NovaSeq libraries were run with longer reads (2x150bp) than the BGI libraries (2x100bp), so we trimmed the NovaSeq reads to the same length as the BGI reads. In the analysis that follows this introduction we compare untrimmed Illumina, trimmed Illumina, and BGI.
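The subsampling and trimming described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say which tool was used for the pilot, and the function names, the fixed seed, and the in-memory record format are all hypothetical. The one point the sketch is meant to show is that read pairs must be subsampled together so that R1 and R2 stay matched.

```python
import random
from typing import List, Tuple

# A FASTQ record as (header, sequence, qualities); a pair is (R1, R2).
# This in-memory representation is an assumption made for illustration.
Record = Tuple[str, str, str]
Pair = Tuple[Record, Record]


def trim_record(rec: Record, length: int) -> Record:
    """Keep only the first `length` bases and their quality scores."""
    header, seq, qual = rec
    return header, seq[:length], qual[:length]


def subsample_pairs(pairs: List[Pair], n: int, seed: int = 0) -> List[Pair]:
    """Draw n read pairs without replacement, keeping R1 and R2 together."""
    if n >= len(pairs):
        return list(pairs)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(pairs)), n))
    return [pairs[i] for i in keep]


def match_to_reference_run(pairs: List[Pair], n_target: int, length: int) -> List[Pair]:
    """Subsample to n_target pairs, then trim both mates to `length` bases."""
    chosen = subsample_pairs(pairs, n_target)
    return [(trim_record(r1, length), trim_record(r2, length)) for r1, r2 in chosen]
```

For real runs with hundreds of millions of reads, a streaming tool would be used instead of holding the pairs in memory; the logic stays the same.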
Comparing two different sequencers using two different read lengths can be confusing, so for an apples-to-apples comparison we will compare costs per gigabase.
| Sequencer | Max theoretical gigabases per lane | Max PE reads per lane (theoretical) | PE reads per lane (our libraries) | Max PE reads per lane (BGI pilots)** | Cost per lane (public pricing) | Cost per gigabase* |
|---|---|---|---|---|---|---|
| Illumina NovaSeq 6000 (S1, 2x150bp, UC Davis) | 750 | 1.6B | 1.37B-1.5B | NA | $3.8k | $10.2 |
| Illumina NovaSeq 6000 (S4, 2x150bp, UC Davis) | 750 | 5B | NA | NA | $7.6k | $10.2 |
| Illumina NovaSeq 6000 (S4, 2x100bp, BGI) | 500 | 5B | NA | NA | NA | NA |
| BGI DNBSEQ-G400 (FCL, 2x100bp) | 90 | 450M | 252M | 521M | NA | NA* |
| BGI DNBSEQ-T7 (2x150bp) | 800 | 5B | NA | 3.8B | NA | NA |
Illumina prices from UC Davis Sequencing Core
* There is no publicly available pricing information for BGI sequencing and our own pricing is under an NDA. We can say that it is very competitive with Illumina sequencing, and this was a key reason for our decision to go with BGI.
** After our technical pilot, BGI spent some time optimizing their sequencing pipeline for Parse libraries, greatly improving the usable data they get out of a sequencing run. You can read more about this in the appendix at the end.
Pricing from different providers will depend on the specific needs, number of samples, desired sequencing depth, and academic or research nonprofit status. Supplemental table S2 in Senabouth et al. (2020) has some cost information that might be useful. Note that although their sequencer was an MGISEQ-2000 and ours was a DNBSEQ-G400, these are actually the same instrument.
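As a rough illustration of the arithmetic behind a cost-per-gigabase comparison, the sketch below computes lane yield from read pairs and read length, and cost per gigabase from lane price and yield. The function names are hypothetical and the inputs are taken from the table above. The S4 figure comes out slightly below the $10.2 listed; that small gap may be due to rounding of the lane price, which is an assumption on our part.

```python
def gigabases_per_lane(read_pairs: float, read_length_bp: int) -> float:
    """Yield in gigabases for a paired-end run: pairs x 2 mates x read length."""
    return read_pairs * 2 * read_length_bp / 1e9


def cost_per_gigabase(cost_per_lane_usd: float, gigabases: float) -> float:
    """Lane price divided by lane yield."""
    return cost_per_lane_usd / gigabases


# DNBSEQ-G400 FCL lane from the table: 450M pairs at 2x100bp.
g400_gb = gigabases_per_lane(450e6, 100)   # 90.0 gigabases

# NovaSeq S4 lane from the table: $7.6k for 750 gigabases.
s4_cost = cost_per_gigabase(7600, 750)     # about 10.13 USD per gigabase
```

Normalizing by gigabases rather than by reads is what makes runs with different read lengths comparable on cost.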
The analysis below was performed by Tobias Fehlmann (Senior Bioinformatician at the Astera Institute) on the libraries prepared by Serban Ciotlos (Buck Institute, soon joining the Rejuvenome team).
Data was processed with the Parse Biosciences Pipeline v0.9.6p, FASTQC v0.11.9 and QualiMap v2.2.2d.
The Illumina libraries were sequenced at a depth of 1.37B and 1.50B reads, while we obtained 550M and 252M reads for the BGI libraries. The Illumina samples were run on one lane of a NovaSeq each, whereas the BGI libraries were run on two lanes (L3) and one lane (L7) of a DNBSEQ-G400.
Above is the library structure used by Parse. R1 reads are on the left, R2 on the right. R1 contains the cDNA and R2 contains all three barcodes and the UMI (10bp randomer), in reverse order. The first bases of R2 are therefore the UMI followed by the round 3 barcode, etc. Sequencing quality, shown below, is good at each position for the cDNA reads. Both platforms show decreased quality values for the subsequence corresponding to the linker sequence, although the NovaSeq libraries seem to be more affected.
[Figure panels: Illumina trimmed and subsampled]
Read quality per barcode, UMI and cDNA
The Q30 values are high for all barcodes and the UMI, with slightly better Q30 values obtained with BGI. The largest difference between both platforms can be observed for cDNA Q30 values, which are over 4% higher for BGI in comparison with Illumina reads of the same length.
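For readers unfamiliar with the metric, Q30 is the fraction of bases whose Phred quality score is at least 30, meaning an estimated error probability of at most 1 in 1,000. A minimal sketch of the calculation follows, assuming Phred+33 encoded FASTQ quality strings; the function name is hypothetical and this is not the code used by FASTQC or the Parse pipeline.

```python
from typing import Iterable


def q30_fraction(quality_strings: Iterable[str], offset: int = 33) -> float:
    """Fraction of bases with Phred quality >= 30 (Phred+33 encoding assumed)."""
    total = 0
    passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            # The quality score is the ASCII code minus the encoding offset.
            if ord(ch) - offset >= 30:
                passing += 1
    return passing / total if total else 0.0
```

To report Q30 separately for barcodes, UMI and cDNA, the same function can be applied to the corresponding slices of each quality string.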
Barcode read statistics
The number of reads with valid barcodes is similar for both platforms, slightly higher with BGI for L3 and slightly higher with Illumina for L7. Still, the number of barcodes lost is relatively high, leaving only ~60% of reads for further processing. Most detected barcodes correspond to perfect sequence matches; however, about 10% have an edit distance of 2 and are thus discarded by the Parse pipeline. This makes sense: the barcodes span only 8 bases, so a sequence at an edit distance of two is unlikely to be a true match.
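The idea of accepting near-matches but rejecting barcodes at distance 2 can be illustrated with a small sketch. It makes several assumptions: it uses Hamming distance (substitutions only) as a simplification of edit distance, the whitelist in the usage example is made up, and the function names are hypothetical. The Parse pipeline's actual matching logic is not described in this post.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed: str, whitelist: Iterable[str], max_dist: int = 1) -> Optional[str]:
    """Return the unique whitelist barcode within max_dist, else None."""
    best = None
    best_dist = max_dist + 1
    ambiguous = False
    for candidate in whitelist:
        d = hamming(observed, candidate)
        if d < best_dist:
            best, best_dist, ambiguous = candidate, d, False
        elif d == best_dist and d <= max_dist:
            # Two whitelist entries are equally close: no safe assignment.
            ambiguous = True
    if best is None or ambiguous:
        return None
    return best


# Hypothetical 8bp whitelist, for illustration only.
WHITELIST = ["AACGTGAT", "AAACATCG", "ATGCCTAA"]
exact = correct_barcode("AACGTGAT", WHITELIST)      # perfect match
one_off = correct_barcode("AACGTGAA", WHITELIST)    # one mismatch, rescued
two_off = correct_barcode("AACGTGTA", WHITELIST)    # two mismatches, discarded
```

With only 8 bases per barcode, two mismatches already change a quarter of the sequence, which is why such reads are dropped rather than assigned.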
The Parse pipeline allows up to 3 multi-mapping positions by default, in contrast to Cell Ranger, which removes reads mapping to more than 1 gene. The number of transcripts in the Parse pipeline is equivalent to the number of UMIs in the Cell Ranger pipeline. As expected, the largest portion of reads with valid barcodes maps uniquely to a position in the genome (about 60%). Permitting up to 3 multi-mapping positions allows rescuing about 3% of the reads, but most multi-mapping reads are discarded, amounting to a total of ~25% of valid reads. The majority of the reads mapping to the genome also map to the transcriptome. Low mapping rates to the transcriptome could, for example, indicate DNA contamination.
After deduplication, 156M and 81M transcripts remain for BGI, with similar values for the sub-sampled Illumina libraries. For the untouched Illumina libraries, 261M and 304M transcripts remain. We observe that the fraction of transcripts after deduplication is higher for the BGI libraries, which is expected since fewer reads were sequenced and the likelihood of encountering PCR duplicates is therefore lower.
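Deduplication can be thought of as collapsing reads that share the same cell barcode, gene and UMI into a single transcript. The sketch below shows that idea in its simplest form. It assumes reads have already been reduced to (cell barcode, gene, UMI) tuples, and it ignores UMI sequencing errors, which real pipelines typically handle; it is not the Parse pipeline's implementation, and the names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Read = Tuple[str, str, str]  # (cell barcode, gene, UMI)


def deduplicate(reads: Iterable[Read]) -> Tuple[int, int]:
    """Return (total reads, unique transcripts) after collapsing PCR duplicates."""
    counts = Counter(reads)
    n_reads = sum(counts.values())
    n_transcripts = len(counts)
    return n_reads, n_transcripts


def transcript_fraction(reads: Iterable[Read]) -> float:
    """Fraction of reads that remain as distinct transcripts after deduplication."""
    n_reads, n_transcripts = deduplicate(reads)
    return n_transcripts / n_reads if n_reads else 0.0
```

At lower depth, each molecule is less likely to be sampled more than once, so this fraction rises, consistent with the higher fraction observed for the BGI libraries.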
Gene read coverage
The read coverage profiles are similar between BGI and Illumina and show that highly expressed genes are relatively well covered through their entire length, with slight enrichments at the beginning and end.
We observe that the proportion of reads mapping to exons is similar between both technologies, with on average 48.4% (±2.2%) mapping to exons with BGI and 48.1% (±2.3%) with Illumina.
The number of called cells is very similar between both technologies and varies between 668 and 1907 cells. A total of 9,494 cells were detected in the BGI sequenced sub-libraries, while 9,542 cells were detected in the Illumina libraries and 9,580 cells in the sub-sampled and trimmed Illumina libraries. This is very close to the 10k cells we expected to recover. Interestingly, the number of called cells is not higher for the more deeply sequenced Illumina libraries; the extra depth only results in a higher number of reads per cell.
We observe that the called cells are mostly identical between both technologies. The largest differences can be observed for the quiescent cells, with between 26 and 54 unique cells for each.
The comparison between the BGI libraries and the complete Illumina libraries shows a similar picture, with mostly no uniquely detected cells in one technology. The largest exception is L3_D12, for which the BGI library detects 52 unique cells, while only 2 unique cells are detected for the Illumina library.
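The overlap comparison above amounts to set operations on the barcodes of the called cells. A minimal sketch, assuming each called cell is identified by a string made from its combined barcodes (the identifier format and the function name are hypothetical):

```python
from typing import Dict, Iterable


def compare_called_cells(cells_a: Iterable[str], cells_b: Iterable[str]) -> Dict[str, int]:
    """Count cells called in both runs and cells unique to each run."""
    a, b = set(cells_a), set(cells_b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }
```

Because both runs sequence the same library, the same barcode combination refers to the same physical cell, which is what makes this direct comparison possible.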
The sequencing saturation is over 40% for the Illumina (untrimmed) libraries and 25% and 10% for the BGI libraries. The saturation curves are nearly identical between both technologies and show that the number of detected genes is starting to plateau at about 100-200k reads per cell.
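Sequencing saturation is commonly defined as one minus the ratio of unique transcripts to reads, and a saturation curve is obtained by recomputing this (and the number of detected genes) on random subsamples of the data. The sketch below assumes that definition and assumes reads are available as (cell barcode, gene, UMI) tuples; whether the Parse pipeline computes saturation in exactly this way is not stated in this post, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Read = Tuple[str, str, str]  # (cell barcode, gene, UMI)


def saturation(reads: Sequence[Read]) -> float:
    """1 - unique transcripts / reads; 0 means every read is a new molecule."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


def saturation_curve(
    reads: Sequence[Read], fractions: Sequence[float], seed: int = 0
) -> List[Tuple[int, float, int]]:
    """For each subsampling fraction, return (depth, saturation, genes detected)."""
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    points = []
    for frac in fractions:
        # A prefix of a shuffled list is a uniform random subsample.
        subset = shuffled[: int(len(shuffled) * frac)]
        genes = {gene for _, gene, _ in subset}
        points.append((len(subset), saturation(subset), len(genes)))
    return points
```

A curve whose gene count flattens while saturation keeps rising indicates that additional depth mostly re-sequences molecules that were already observed.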
On average a median of 7,167 genes were detected per cell with the complete Illumina library, while 4,807 genes were detected with BGI. Similarly, an average median of 34,022 transcripts per cell was available with the complete Illumina library, in comparison to 15,031 transcripts with BGI.
Genes per cell
A direct comparison between BGI samples and subsampled Illumina samples of the number of genes per cell shows that the number of genes detected for each cell is very similar and outliers are rare.
Counts per cell
Similar to observations made for the number of genes per cell, the number of transcripts per cell is very similar between both technologies as well.
Given prior results in the literature, our own pilot, and the pricing we have access to, we have chosen BGI as our provider of sequencing services.
Appendix: BGI post-pilot improvements in sequencing quality
The data quality one gets from a sequencer is not a mere function of the original cDNA libraries and the sequencer itself. Earlier we remarked that we only got 252M PE reads per lane out of our G400 run with BGI. This is good enough for a technical pilot, but it does not use the instrument to its full potential. To show what would be possible with some further optimization, BGI spent some time and their own resources to diagnose our first run and increase throughput for future pilots.
BGI tried three different modifications to the original run. First, they varied the amount of in-lane control added to the sample to help with the low base diversity present in the Parse libraries. This is similar, but not identical, to the phiX control sometimes used with Illumina sequencers (see here for more). Whereas the original run used 20% in-lane control, 30% yielded an increased number of reads per flow cell lane (275M vs 252M originally).
More importantly, BGI updated the firmware of the G400 (the updated version is called “ECR 5.2”), which increases data quality (and reduces the need for in-lane control) in unbalanced libraries. They tried a run of four of their own samples (PE-100) in one flow cell (4 lanes) without control. As a result, they got 500M reads per lane, substantially more than the 450M theoretical maximum for the instrument, which is great news!
Lastly, BGI ran Parse libraries on their NovaSeq-equivalent DNBSEQ-T7, using 30% in-lane control. Excluding the in-lane control, they were able to obtain 3,878M PE reads. This was without updating the firmware of the T7, so we expect this to improve further in the future.
We intend to run most of our samples on the T7, as it offers a cost advantage relative to the G400, and we will be running more samples on it soon. We may release an updated report with the results of that experiment here as well.