Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data

Science & Nature

Data availability

Sequencing data from our experiments, along with all relevant metadata, was uploaded to SRA, accession PRJNA905430 (ref. 55). All other datasets analyzed in this study are publicly available. The college dormitory dataset25 used in Fig. 1 and Extended Data Figs. 3–5 is available from the European Nucleotide Archive (ENA), accession ERP115809, and Qiita41, study ID 12470. The marine sediments dataset, used in Extended Data Fig. 3a,b, is available from Qiita41, study ID 11922. The fish microbiome dataset42, used in Extended Data Fig. 3c,d, is available from ENA, accession PRJEB54736, and Qiita41, study ID 13414. The Earth Microbiome Project soil dataset43, used in Extended Data Fig. 3e,f, is available from ENA, accession PRJEB42019, and Qiita41, study ID 13114. The office dataset44, used in Extended Data Fig. 3g,h, is available from ENA, accession PRJEB13115, and Qiita41, study ID 10423. The Central Park soil dataset45, used in Extended Data Fig. 3i,j, is available from ENA, accession PRJEB6614, and Qiita41, study ID 2104. The gut metagenomic dataset46, used in Extended Data Fig. 3k,l, is available from ENA, accession PRJEB50408, and Qiita41, study ID 13692. The negative controls dataset used in Fig. 1 and Extended Data Figs. 3a–f, 4 and 5 is available from Qiita41, study ID 12019; the one used in Extended Data Fig. 3g,h,k,l is available from ENA, accession PRJEB40903, and Qiita41, study ID 12201; and the one used in Extended Data Fig. 3i,j is available from ENA, accession PRJEB25617, and Qiita41, study ID 10333. The well-to-well leakage dataset32 is available from ENA, accession ERP115213. The plasma cfDNA data20 is available from ENA, accessions ERP119598, ERP119596 and ERP119597; and Qiita41, study IDs 12667, 12691 and 12692. The tumor microbiome dataset18 is available from SRA, accession PRJNA624822. The processed data was obtained from Supplementary Table 2 in ref. 18.

Code availability

SCRuB is available at https://github.com/Shenhav-and-Korem-labs/SCRuB (ref. 56) and requires R (≥3.6.3), glmnet57 (4.1-4) and torch (1.3.1). A Code Ocean capsule replicating all analyses in this paper is available at https://codeocean.com/capsule/5737862/tree/v1 (ref. 58), with source code also available at https://github.com/Shenhav-and-Korem-labs/SCRuB_analysis. Both use tidyverse59 (0.7.2) and XGBoost60 (1.5.0). The decontamination pipeline used by Nejman et al.18 is available from Zenodo at https://doi.org/10.5281/zenodo.3740536, and the prediction pipeline used by Poore et al.20 is available at https://github.com/biocore/tcga.

References

  1. Salter, S. J. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).

  2. Weyrich, L. S. et al. Laboratory contamination over time during low-biomass sample analysis. Mol. Ecol. Resour. 19, 982–996 (2019).

  3. Kim, D. et al. Optimizing methods and dodging pitfalls in microbiome research. Microbiome 5, 52 (2017).

  4. Eisenhofer, R. et al. Contamination in low microbial biomass microbiome studies: issues and recommendations. Trends Microbiol. 27, 105–117 (2019).

  5. Weiss, S. et al. Tracking down the sources of experimental contamination in microbiome studies. Genome Biol. 15, 564 (2014).

  6. Aagaard, K. et al. The placenta harbors a unique microbiome. Sci. Transl. Med. 6, 237ra65 (2014).

  7. Parnell, L. A. et al. Microbial communities in placentas from term normal pregnancy exhibit spatially variable profiles. Sci. Rep. 7, 11200 (2017).

  8. Seferovic, M. D. et al. Visualization of microbes by 16S in situ hybridization in term and preterm placentas without intraamniotic infection. Am. J. Obstet. Gynecol. 221, 146.e1–146.e23 (2019).

  9. de Goffau, M. C. et al. Human placenta has no microbiome but can contain potential pathogens. Nature 572, 329–334 (2019).

  10. Leiby, J. S. et al. Lack of detection of a human placenta microbiome in samples from preterm and term deliveries. Microbiome 6, 196 (2018).

  11. Kuperman, A. A. et al. Deep microbial analysis of multiple placentas shows no evidence for a placental microbiome. BJOG 127, 159–169 (2020).

  12. Sinha, R., Abnet, C. C., White, O., Knight, R. & Huttenhower, C. The microbiome quality control project: baseline study design and future directions. Genome Biol. 16, 276 (2015).

  13. Edmonds, K. & Williams, L. The role of the negative control in microbiome analyses. FASEB J. 31, 940.3 (2017).


  14. Schierwagen, R. et al. Trust is good, control is better: technical considerations in blood microbiome analysis. Gut 69, 1362–1363 (2020).

  15. de Goffau, M. C. et al. Recognizing the reagent microbiome. Nat. Microbiol. 3, 851–853 (2018).

  16. van der Horst, J. et al. Sterile paper points as a bacterial DNA-contamination source in microbiome profiles of clinical samples. J. Dent. 41, 1297–1301 (2013).

  17. Olomu, I. N. et al. Elimination of ‘kitome’ and ‘splashome’ contamination results in lack of detection of a unique placental microbiome. BMC Microbiol. 20, 157 (2020).

  18. Nejman, D. et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria. Science 368, 973–980 (2020).

  19. Pinto-Ribeiro, I. et al. Evaluation of the use of formalin-fixed and paraffin-embedded archive gastric tissues for microbiota characterization using next-generation sequencing. Int. J. Mol. Sci. 21, 1096 (2020).

  20. Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574 (2020).

  21. Wang, J. et al. Translocation of vaginal microbiota is involved in impairment and protection of uterine health. Nat. Commun. 12, 4191 (2021).

  22. Lam, S. Y. et al. Technical challenges regarding the use of formalin-fixed paraffin embedded (FFPE) tissue specimens for the detection of bacterial alterations in colorectal cancer. BMC Microbiol. 21, 297 (2021).

  23. Allali, I. et al. Gut microbiome compositional and functional differences between tumor and non-tumor adjacent tissues from cohorts from the US and Spain. Gut Microbes 6, 161–172 (2015).

  24. Marotz, C. et al. SARS-CoV-2 detection status associates with bacterial community composition in patients and the hospital environment. Microbiome 9, 132 (2021).

  25. Richardson, M., Gottel, N., Gilbert, J. A. & Lax, S. Microbial similarity between students in a common dormitory environment reveals the forensic potential of individual microbial signatures. mBio 10, e01054-19 (2019).

  26. Chen, Q.-L. et al. Rare microbial taxa as the major drivers of ecosystem multifunctionality in long-term fertilized soils. Soil Biol. Biochem. 141, 107686 (2020).

  27. Smirnova, E., Huzurbazar, S. & Jafari, F. PERFect: PERmutation Filtering test for microbiome data. Biostatistics 20, 615–631 (2019).

  28. Davis, N. M., Proctor, D. M., Holmes, S. P., Relman, D. A. & Callahan, B. J. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6, 226 (2018).

  29. McKnight, D. T. et al. microDecon: a highly accurate read‐subtraction tool for the post‐sequencing removal of contamination in metabarcoding studies. Environ. DNA 1, 14–25 (2019).

  30. Shenhav, L. et al. FEAST: fast expectation-maximization for microbial source tracking. Nat. Methods 16, 627–632 (2019).

  31. Knights, D. et al. Bayesian community-wide culture-independent microbial source tracking. Nat. Methods 8, 761–763 (2011).

  32. Minich, J. J. et al. Quantifying and understanding well-to-well contamination in microbiome research. mSystems 4, e00186-19 (2019).

  33. Lou, Y. C. et al. Using strain-resolved analysis to identify contamination in metagenomics data. Preprint at bioRxiv https://doi.org/10.1101/2022.01.16.476537 (2022).

  34. An, U. et al. STENSL: Microbial Source Tracking with ENvironment SeLection. mSystems 7, e0099521 (2022).

  35. Bolyen, E. et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).

  36. Karstens, L. et al. Controlling for contaminants in low-biomass 16S rRNA gene sequencing experiments. mSystems 4, e00290-19 (2019).

  37. Flores, R. et al. Collection media and delayed freezing effects on microbial composition of human stool. Microbiome 3, 33 (2015).

  38. Adams, R. I., Bateman, A. C., Bik, H. M. & Meadow, J. F. Microbiota of the indoor environment: a meta-analysis. Microbiome 3, 49 (2015).

  39. Lou, Y. C. et al. Infant gut strain persistence is associated with maternal origin, phylogeny, and traits including surface adhesion and iron acquisition. Cell Rep. Med. 2, 100393 (2021).

  40. Hornung, B. V. H., Zwittink, R. D. & Kuijper, E. J. Issues and current standards of controls in microbiome research. FEMS Microbiol. Ecol. 95, fiz045 (2019).

  41. Gonzalez, A. et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat. Methods 15, 796–798 (2018).

  42. Minich, J. J. et al. Host biology, ecology and the environment influence microbial biomass and diversity in 101 marine fish species. Nat. Commun. 13, 6978 (2022).

  43. Shaffer, J. P. et al. Standardized multi-omics of Earth’s microbiomes reveals microbial and metabolite diversity. Nat. Microbiol. 7, 2128–2150 (2022).

  44. Chase, J. et al. Geography and location are the primary drivers of office microbiome composition. mSystems 1, e00022-16 (2016).

  45. Ramirez, K. S. et al. Biogeographic patterns in below-ground diversity in New York City’s Central Park are similar to those observed globally. Proc. Biol. Sci. 281, 20141988 (2014).

  46. Hanes, D. et al. The gastrointestinal and microbiome impact of a resistant starch blend from potato, banana, and apple fibers: a randomized clinical trial using smart caps. Front. Nutr. 9, 987216 (2022).

  47. Shaffer, J. P. et al. A comparison of DNA/RNA extraction protocols for high-throughput sequencing of microbial communities. Biotechniques 70, 149–159 (2021).

  48. Ruiz-Calderon, J. F. et al. Walls talk: microbial biogeography of homes spanning urbanization. Sci. Adv. 2, e1501061 (2016).

  49. Robin, X. et al. pROC: an open-source package for R and S to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).

  50. Callahan, B. J. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).

  51. Annavajhala, M. K. et al. Oral and gut microbial diversity and immune regulation in patients with HIV on antiretroviral therapy. mSphere 5, e00798-19 (2020).

  52. Graspeuntner, S., Loeper, N., Künzel, S., Baines, J. F. & Rupp, J. Selection of validated hypervariable regions is crucial in 16S-based microbiota studies of the female genital tract. Sci. Rep. 8, 9678 (2018).

  53. Herlemann, D. P. et al. Transitions in bacterial communities along the 2000 km salinity gradient of the Baltic Sea. ISME J. 5, 1571–1579 (2011).

  54. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).

  55. Austin, G. I. et al. Contamination benchmark using human-derived samples. NCBI https://www.ncbi.nlm.nih.gov/bioproject/PRJNA905430 (2022).

  56. Austin, G. I., Shenhav, L. & Korem, T. SCRuB. GitHub https://github.com/Shenhav-and-Korem-labs/SCRuB (2023).

  57. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).

  58. Shenhav, L., Korem, T. & Austin, G. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Code Ocean https://doi.org/10.24433/CO.2307706.v1 (2023).

  59. Wickham, H. et al. Welcome to the tidyverse. J. Open Source Softw. 4, 1686 (2019).

  60. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Krishnapuram, B. et al.) 785–794 (ACM, 2016).

Acknowledgements

We thank members of the Korem group for useful discussions. We are grateful to G. D. Poore, C. Martino, R. Knight, R. Straussman and I. Livyatan for assistance with analyzing and interpreting data from their studies, and to R. Straussman and I. Livyatan for helpful comments on the paper. More generally, we thank all authors and participants involved in generating the data used in this study. The study was supported by the Center for Studies in Physics and Biology at Rockefeller University (L.S.), the Program for Mathematical Genomics at Columbia University (T.K.), the CIFAR Azrieli Global Scholarship in the Humans & the Microbiome Program (T.K.), R01HD106017 (T.K.) and R01CA245894 (A.-C.U.).

Author information

Author notes

  1. These authors contributed equally: Liat Shenhav, Tal Korem.

Authors and Affiliations

  1. Department of Computer Science, Columbia University, New York, NY, USA

    George I. Austin & Itsik Pe’er

  2. Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA

    George I. Austin, Yoli Meydan, Itsik Pe’er & Tal Korem

  3. Division of Infectious Diseases, Columbia University Irving Medical Center, New York, NY, USA

    Heekuk Park, Dwayne Seeram & Anne-Catrin Uhlemann

  4. Department of Dermatology, Columbia University Irving Medical Center, New York, NY, USA

    Tanya Sezin & Angela M. Christiano

  5. Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA

    Yue Clare Lou

  6. Department of Surgery, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA

    Brian A. Firek & Michael J. Morowitz

  7. Department of Earth and Planetary Science, University of California, Berkeley, CA, USA

    Jillian F. Banfield

  8. Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA, USA

    Jillian F. Banfield

  9. Innovative Genomics Institute, University of California, Berkeley, CA, USA

    Jillian F. Banfield

  10. Chan Zuckerberg Biohub, San Francisco, CA, USA

    Jillian F. Banfield

  11. Department of Genetics and Development, Columbia University Irving Medical Center, New York, NY, USA

    Angela M. Christiano

  12. Data Science Institute, Columbia University, New York, NY, USA

    Itsik Pe’er

  13. Center for Studies in Physics and Biology, Rockefeller University, New York, NY, USA

    Liat Shenhav

  14. Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA

    Tal Korem

  15. CIFAR Azrieli Global Scholars program, CIFAR, Toronto, Canada

    Tal Korem

Contributions

G.I.A. wrote SCRuB, and designed and conducted all computational analyses. H.K. designed and conducted all experiments. Y.M. assisted with analyses. D.S. contributed to experiments. T.S. collected samples. A.M.C. supervised sample collection. A.-C.U. supervised all experiments. Y.C.L., B.F., M.M. and J.F.B. assisted in obtaining, analyzing and interpreting data from their study. L.S. and T.K. conceived and designed the study, designed analysis, jointly supervised the study and contributed equally to this work. G.I.A., I.P., L.S. and T.K. interpreted the results and wrote the paper.

Corresponding authors

Correspondence to Liat Shenhav or Tal Korem.

Ethics declarations

Competing interests

A.-C.U. has received research funding from Merck that is unrelated to this study. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Empirical validation of the source-tracking assumption in data from Nejman et al.18.

The source-tracking assumption30,31,34 in the context of contamination stipulates that taxa present together in a contamination source will be introduced together to other samples, and in similar proportions as in the contamination source. We demonstrate this empirically using data from Nejman et al.18. a, The average relative abundance of each ASV (y-axis) across samples from the Netherlands Cancer Institute, plotted against the abundance of the same ASV across negative controls from the same batch (x-axis; ‘No Template Controls’ in Nejman et al.18), separated into ‘high’ and ‘low’ contamination based on SCRuB’s prediction (contamination parameter p > 0.5 and p ≤ 0.5, respectively). Consistent with the source-tracking assumption, taxa present together in a contamination source are introduced together to the samples, and in similar proportions, resulting in a clear positive correlation between the relative abundance of the taxa that are shared between samples and controls (Pearson R = 0.99, P < 10^−20 and R = 0.082, P = 0.037 for high and low contamination, respectively). As expected, this correlation varies with respect to SCRuB’s predicted contamination in the samples: samples predicted to have high contamination (blue) have a slope of 0.97, while those predicted to have low contamination have a slope of 0.057. b,c, Same as (a) for samples predicted to have the highest (b) and lowest (c) contamination. Pearson R is displayed for panels with >3 shared taxa. Correlation was very high for highly contaminated samples (Pearson R > 0.9, P < 10^−4 for all).
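The comparison in panel a can be illustrated with a short calculation over the taxa shared between controls and samples. The following is a minimal Python sketch rather than code from the SCRuB repository: the function name, the dictionary inputs and the use of an ordinary least-squares slope are assumptions made for illustration, as the legend does not state how the slopes were fitted.

```python
import math


def pearson_and_slope(control_abundance, sample_abundance):
    """Return (Pearson R, least-squares slope) over taxa shared by controls and samples.

    Both arguments are hypothetical inputs mapping a taxon ID to its mean
    relative abundance; a taxon is 'shared' if it is nonzero in both.
    """
    shared = sorted(
        t for t in control_abundance
        if control_abundance[t] > 0 and sample_abundance.get(t, 0) > 0
    )
    if len(shared) < 2:
        raise ValueError("need at least two shared taxa")
    x = [control_abundance[t] for t in shared]  # x-axis: negative controls
    y = [sample_abundance[t] for t in shared]   # y-axis: samples
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("abundances must vary across shared taxa")
    return sxy / math.sqrt(sxx * syy), sxy / sxx
```

Under the source-tracking assumption, heavily contaminated samples should give a correlation and a slope both close to 1, because contaminant taxa arrive in the same proportions as in the controls.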

Extended Data Fig. 2 Description of our simulation framework.

A visualization of the simulation framework used to benchmark different decontamination methods. We implemented our simulation with the 3 outlined steps: a, We generate a dataset with 88–94 samples, 2, 4 or 8 controls, and a contamination source from an unrelated study, assumed to be biologically distinct from the samples of interest. All samples are then assigned locations across the plate. b, We add well-to-well leakage to the controls, and contamination from the shared source to the samples of interest (Methods). c, We run decontamination using one of several methods (Methods). The decontaminated dataset is evaluated against the ground truth noncontaminated taxonomic compositions using the Jensen-Shannon divergence.
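The evaluation step in c can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, each sample is assumed to be a list of per-taxon read counts in a fixed taxon order, and a base-2 logarithm is assumed so that the divergence lies between 0 and 1 (the legend does not specify the base).

```python
import math
from statistics import median


def _normalize(counts):
    """Convert per-taxon counts into relative abundances."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def jensen_shannon_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two taxonomic profiles."""
    if len(p_counts) != len(q_counts):
        raise ValueError("profiles must cover the same taxa")
    p = _normalize(p_counts)
    q = _normalize(q_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def median_jsd(ground_truth, decontaminated):
    """Median divergence across paired samples of one simulated dataset."""
    return median(
        jensen_shannon_divergence(t, d)
        for t, d in zip(ground_truth, decontaminated)
    )
```

A value of 0 means the decontaminated profile matches the ground truth exactly, and a value of 1 means the two profiles share no taxa.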

Extended Data Fig. 3 SCRuB outperforms alternative decontamination methods under in silico simulations of diverse environments and data types.

a–l, Same as Fig. 1c, d, but for simulations based on data from 16S amplicon sequencing of tropical marine sediments (Qiita41 study ID 11922; a,b); 16S amplicon sequencing of multiple body sites from southern California fish42 (c,d); 16S amplicon sequencing of soil from the Earth Microbiome Project43 (e,f); ITS sequencing of office samples44 (g,h); 18S amplicon sequencing of soil from Central Park, New York45 (i,j); and human gut metagenomic sequencing46 (k,l). N = 120 simulations per panel. Across almost all simulation scenarios and environments SCRuB outperforms alternative decontamination approaches. Contamination levels were fixed to 5% for the simulations in panels b, d, f, h, j, and l. Box line, median; box, IQR; whiskers, 1.5*IQR; *, one-sided Wilcoxon signed-rank P < 10^−4 for comparison between SCRuB and marked method (see Supplementary Table 1 for exact P values).

Extended Data Fig. 4 SCRuB is robust to evaluation metrics and simulation parameters.

a–d, Same as Fig. 1c, d, box and swarm plot (line, median; box, IQR; whiskers, 1.5*IQR) showing the mean (a,b) and standard deviation (c,d) of the Jensen-Shannon divergence (JSD) between the ground truth of each experiment and its decontamination output. SCRuB performs similarly when evaluated using mean JSD, and displays stable standard deviation. e,f, Same as Fig. 1c, d, but with controls placed along the edge of a plate rather than randomly. Similar to Fig. 1c, d, SCRuB outperforms alternative methods under all parameters except no decontamination and microDecon with 50% well-to-well leakage levels. g, Shown are the results from Fig. 1d with well-to-well leakage levels of 5%, stratified by the number of controls (N = 10 experiments per set). SCRuB outperforms alternative decontamination methods regardless of the number of controls (one-sided Wilcoxon signed-rank P < 10^−3 for all, P = 0.0029 vs. microDecon with one control). h, Same as Fig. 1d, showing also results from SCRuB running without sample location, and thus without accounting for well-to-well leakage. While SCRuB outperforms SCRuB without sample locations in all simulations (P < 10^−4 for all), SCRuB without sample locations still outperforms alternative decontamination methods in many settings. *, one-sided Wilcoxon signed-rank P < 10^−3 (panel g) or P < 10^−4 (otherwise) for comparison between SCRuB (panels a–g) or SCRuB without sample locations (panel h) and the marked method (see Supplementary Table 1 for exact P values). * is on the bottom if the marked method has better performance.

Extended Data Fig. 5 SCRuB is robust to sequencing depth.

Shown are results from in silico simulations under our model (Methods). a, Comparison between experiments in which the read counts of all samples were set to either 1,000, 5,000, 10,000, or 25,000 reads, under contamination and well-to-well leakage levels of 5%. With the exception of the depth of 1,000 reads, SCRuB outperformed the alternative methods in all simulations (one-sided Wilcoxon signed-rank P < 10^−3 for all). At a depth of 1,000 reads, SCRuB had comparable performance to decontam (P = 0.19), and significantly outperformed the rest (P < 0.01 for all). b, For each experiment, the mean read depth was set to 10,000, the standard deviation to 2,500, and the contamination and well-to-well leakage levels to 5%. We divided the samples from each experiment into four groups, Q1–Q4, based on the within-experiment quantile to which the read depth of each sample belonged. Within all groups, SCRuB outperformed alternative decontamination methods (P < 10^−3 for all), demonstrating that SCRuB has consistent performance within an experiment with varying read depths. c, Results from experiments with a mean read depth of 10,000, standard deviation of 0, 500, 2,500 or 7,500, and contamination and well-to-well leakage levels of 5%. Across all standard deviations, SCRuB outperformed competing methods, demonstrating that it is robust to variability in read coverage across experiments. Box line, median; box, IQR; box whiskers, 1.5*IQR; *, one-sided Wilcoxon signed-rank P < 0.01 for comparison between SCRuB and marked method (see Supplementary Table 1 for exact P values).
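The grouping used in panel b can be sketched as below. This is a hedged Python illustration only: the function name is hypothetical, and the quartile cut points follow the default ('exclusive') method of Python's `statistics.quantiles`, which may differ from the quantile definition used in the original analysis.

```python
from bisect import bisect_left
from statistics import quantiles


def depth_quartile_groups(read_depths):
    """Label each sample Q1-Q4 by the within-experiment quartile of its read depth."""
    cuts = quantiles(read_depths, n=4)  # three cut points: Q1/Q2, Q2/Q3, Q3/Q4
    # A depth exactly equal to a cut point is assigned to the lower group.
    return ["Q%d" % (bisect_left(cuts, depth) + 1) for depth in read_depths]
```

Performance can then be summarized separately within each group to check that it does not depend on where a sample falls in the depth distribution.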

Extended Data Fig. 6 SCRuB correctly handles unrelated controls.

a, Venn diagram illustrating the taxa removed by each decontamination method, where a removed taxon is defined as one with an aggregate sum greater than zero in the observed data and an aggregate sum of zero in the decontaminated data. When presented with unrelated controls, SCRuB removed far fewer taxa than microDecon and either version of decontam, and the majority of taxa removed by SCRuB were also removed by microDecon and decontam (LB). b, Box and swarm plots (line, median; box, IQR; whiskers, 1.5*IQR) showing the median Jensen-Shannon divergence per simulation between simulated samples before and after decontamination with an unrelated control (Methods), across 50 simulated datasets of 88 samples and 8 negative controls. SCRuB is robust to non-informative controls, producing taxonomic compositions that are very close to the original, and significantly closer than alternative methods (one-sided Wilcoxon signed-rank P = 4×10^−10, P = 8.8×10^−10 and P = 3.8×10^−10 between SCRuB and microDecon, decontam or decontam (LB), respectively).
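The definition of a removed taxon in panel a translates directly into a set computation. The following Python sketch is illustrative only: the function name and the input layout (a dictionary mapping each taxon to its per-sample counts) are assumptions, not part of any of the compared tools.

```python
def removed_taxa(observed, decontaminated):
    """Return the taxa present in the observed data but absent after decontamination.

    Each argument maps a taxon ID to a list of per-sample counts. A taxon is
    'removed' if its counts sum to more than zero before decontamination and
    to exactly zero afterwards (a taxon missing from the output counts as zero).
    """
    return {
        taxon
        for taxon, counts in observed.items()
        if sum(counts) > 0 and sum(decontaminated.get(taxon, [])) == 0
    }
```

Applying this to the output of each method gives the sets whose overlaps a Venn diagram would display.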

Extended Data Fig. 7 SCRuB correctly accounts for well-to-well leakage.

a, Similar to Fig. 2f, showing the Jensen-Shannon divergence (y-axis) between the ground truth taxonomic composition, as defined by the experimental design of Minich et al.32 (Methods), and the taxonomic composition of the unprocessed dataset (‘No decontamination’), or the dataset following decontamination by various methods (x-axis), and displayed separately for the 31 distinct low-prevalence (left) and 90 high-prevalence (right) monocultures. For low-prevalence samples, SCRuB produced estimates that were significantly more similar to the ground truth compared to microDecon, decontam, decontam (LB), and to a restrictive approach (one-sided Wilcoxon P < 10^−4 in all cases). For the high-prevalence samples, SCRuB performed comparably to decontam and microDecon (P = 0.93, P = 0.12, respectively) and outperformed no decontamination, restrictive, and decontam (LB) (P = 10^−8, P = 8.7×10^−17 and P = 1.3×10^−4, respectively). b–f, A simulation of a more complicated well-to-well leakage experiment, in which each taxon was placed in two monocultures instead of one. To simulate such a scenario, we randomly chose pairs of taxa, and then reassigned all reads assigned to one taxon across the experiment to the other, ‘focal’, taxon. For example, Minich et al. placed E. coli in well C10 (c), resulting in well-to-well leakage (d). We randomly selected well C3, containing a Corynebacterium species, and reassigned all Corynebacterium reads to E. coli (e). We then ran SCRuB on this simulated data, and evaluated the relative abundance of E. coli in its original well (b, f). We performed this 100 times, and examined the relative abundance of the focal taxon in its original well (b). In all cases, SCRuB accurately handled well-to-well leakage in this more complex scenario and avoided removing the taxon belonging to the focal monoculture.
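The read-reassignment step in panels b–f can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and the plate representation (a dictionary mapping each well to per-taxon read counts) are hypothetical and are not taken from the SCRuB analysis code.

```python
def reassign_reads(plate, source_taxon, focal_taxon):
    """Move every read of source_taxon to focal_taxon in all wells of a plate.

    plate maps a well ID (for example 'C10') to a dict of taxon -> read count.
    A new plate is returned; the input is left unchanged.
    """
    merged = {}
    for well, counts in plate.items():
        new_counts = dict(counts)
        moved = new_counts.pop(source_taxon, 0)
        new_counts[focal_taxon] = new_counts.get(focal_taxon, 0) + moved
        new_counts[source_taxon] = 0  # keep the taxon column, now empty
        merged[well] = new_counts
    return merged


def relative_abundance(counts, taxon):
    """Fraction of a well's reads assigned to the given taxon."""
    total = sum(counts.values())
    return counts.get(taxon, 0) / total if total else 0.0
```

After reassignment the focal taxon appears as a monoculture in two wells, so a method that mistakes it for leakage from one well would wrongly remove it from the other.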

Extended Data Fig. 8 SCRuB correctly infers well-to-well leakage into negative controls in a metagenomic study of infant and maternal microbiomes.

a, The plate design used by Lou et al.33,39, which included a negative control placed in the corner of each extraction plate. Through a strain-level analysis, Lou et al. identified well-to-well leakage into certain negative controls. b, When running SCRuB on each plate, using the MAG abundances of each sample (Methods), we identified well-to-well leakage into the negative control in two of the four plates that included a negative control. c, SCRuB’s predictions of well-to-well leakage were consistent with an assessment based on the results of Lou et al.’s strain-level analysis (Methods).

Extended Data Fig. 9 Well-to-well leakage is more prominent during DNA extraction.

a,b, Plate layout during DNA extraction (a) and library preparation (b) of experiment 2 (Fig. 3a). Ten controls were included in the DNA extraction stage (triangles), and an additional seven in the library preparation stage (hexagons); a pair of each was placed away from other samples (‘far samples’, purple). c, Box and swarm plot (line, median; box, IQR; whiskers, 1.5*IQR) showing the Jensen-Shannon divergence (y-axis) between human-derived samples adjacent to DNA extraction and library preparation controls and the various controls of each processing stage, stratified by near and far controls (far controls in purple in a,b), and calculated from ‘raw’ taxonomic compositions, without any decontamination. Samples are more similar to near than far controls, demonstrating well-to-well leakage occurring during both DNA extraction and library preparation. Samples are also more similar to near extraction controls than to near library controls, suggesting that well-to-well leakage is more prominent during DNA extraction. P, two-sided Mann-Whitney U; N, number of pairwise distances between relevant samples.

Extended Data Fig. 10 SCRuB improves prediction of melanoma and treatment response.

a–f, Receiver operating characteristic (ROC) curves evaluating the pairwise classification accuracy of gradient boosted decision trees on data from patients with lung cancer, prostate cancer, melanoma, and controls, using data from Poore et al.20. Compared to alternative decontamination methods, SCRuB offers classification accuracy that is on par or improved, and improved accuracy compared to the original analyses in all cases. See Supplementary Table 1 for P values comparing between methods. Shaded area, 95% confidence interval. g, A Venn diagram enumerating the number of taxa completely removed by each decontamination method applied to the tumor microbiome data from Nejman et al.18. SCRuB removed fewer taxa than alternative methods.

Supplementary information

Reporting Summary

Supplementary Tables

Supplementary Table 1: Exact P values displayed in figures. Supplementary Table 2: Experimental metadata and plate layouts of experiments performed. Refers to experiments described in Fig. 3. Supplementary Table 3: V1–V2 reads in control samples. The number of reads from the V1–V2 regions found in each of the samples from the experiments with human-derived samples (Fig. 3a; Methods). Samples with NA had no reads following DADA2 processing.

About this article


Cite this article

Austin, G.I., Park, H., Meydan, Y. et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data.
Nat Biotechnol (2023). https://doi.org/10.1038/s41587-023-01696-w

George I. Austin
