Associate Professor San Diego State University, United States
Introduction: Only 479 entities are officially recognized as carcinogens across regulatory databases (IARC, NTP, EPA, ECHA, and OSHA). Regulatory agencies rely on manual curation and are burdened by the exponential growth of biomedical literature. We introduce CarD-T, an LLM framework that automates literature review for carcinogen curation and adds probabilistic analysis of likely carcinogenicity. This transformer-based approach combines machine learning with Bayesian modeling to efficiently process scientific publications, extract potential carcinogenic entities, and analyze temporal trends in evidence shifts.
Materials and Methods: CarD-T uses a Named Entity Recognition (NER) ELECTRA LLM trained on carcinogen-specific contexts from PubMed literature. The novelty of our NER approach over other LLM methods lies in the ability to index identified carcinogens to source documents, circumventing hallucinations. Training data consisted of abstracts referencing IARC Group 1 and Group 2A carcinogens. We implemented a novel AI-hybridized Context-Derived TF-IDF approach for noise reduction and synonym consolidation. For entities with conflicting evidence, we developed Probabilistic Carcinogen Denomination (PCarD), applying Bayesian Negative Binomial Regression to analyze temporal evidence shifts in scientific discourse.
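The idea of indexing extracted entities back to their source documents, then down-weighting terms that appear everywhere, can be illustrated with a minimal sketch. This is not the CarD-T implementation: the document IDs, mentions, and function names below are hypothetical, and a plain inverse document frequency (IDF) score stands in for the Context-Derived TF-IDF step, whose details are not given here.

```python
import math
from collections import defaultdict

# Hypothetical NER output: (document ID, entity mention) pairs.
# The IDs and mentions are invented for illustration only.
mentions = [
    ("doc1", "benzene"),
    ("doc1", "cancer"),
    ("doc2", "Benzene"),
    ("doc2", "cancer"),
    ("doc3", "asbestos"),
    ("doc3", "cancer"),
]

def build_entity_index(pairs):
    """Map each normalized entity to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, mention in pairs:
        # Case folding is a crude stand-in for synonym consolidation.
        index[mention.strip().lower()].add(doc_id)
    return dict(index)

def idf_scores(index, n_docs):
    """Plain IDF: entities found in every document score 0."""
    return {entity: math.log(n_docs / len(docs)) for entity, docs in index.items()}

index = build_entity_index(mentions)
scores = idf_scores(index, n_docs=3)

# Keep only entities that are not ubiquitous across the corpus.
kept = {entity: sorted(index[entity]) for entity, s in scores.items() if s > 0}
```

Because every retained entity carries the list of documents it came from, each nominee can be traced to its evidence, which is the property that guards against unsupported output.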
Results, Conclusions, and Discussions: CarD-T identifies 100% of IARC carcinogens with just 60% of the training data and discovers ~1,600 potential new carcinogens across diverse categories. Compared with GPT-4, our framework achieves superior recall (0.85 vs 0.76) with comparable precision (0.89 vs 0.90). We identified 554 entities with both supporting and opposing evidence, resolving 76 disputed cases through temporal evidence modeling. Novel nominees include COVID-19, microplastics, and schizophrenia. Our analysis documents a fundamental 25-year research paradigm shift from chemical to biological and environmental carcinogenesis. Chemical compounds decreased (45% to 35%), while biological agents (20% to 28%) and environmental factors (12% to 18%) increased. The open-source design runs on consumer hardware, democratizing carcinogen surveillance capabilities previously limited to specialized institutions and enabling rapid responses to emerging threats through continuous, real-time monitoring of scientific discourse.
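To put the precision and recall trade-off on a single scale, the reported figures can be combined into an F1 score (the harmonic mean of precision and recall). The F1 values are not reported in this abstract; the sketch below only derives them arithmetically from the numbers above, and the function name is illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported above.
card_t_f1 = f1(precision=0.89, recall=0.85)
gpt4_f1 = f1(precision=0.90, recall=0.76)

print(f"CarD-T F1: {card_t_f1:.2f}")
print(f"GPT-4 F1:  {gpt4_f1:.2f}")
```

Under this derived metric, the recall advantage outweighs the 0.01 precision gap, giving roughly 0.87 for CarD-T against roughly 0.82 for GPT-4.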
Acknowledgements: This work was supported by NIH grants NIMHD U54MD012397 and NCI U54CA285117. The CarD-T framework is available at https://huggingface.co/jimnoneill/CarD-T. We thank Mr. Prasad Kothari for an introduction to NLP and text analytics. J.O. acknowledges support from the SDSU ACCEL program and SDSU HealthLINK center.