P.K. Lashmet Chair Professor Rensselaer Polytechnic Institute Troy, New York, United States
Introduction: Cardiovascular disease (CVD) is the leading cause of death worldwide, and early detection is critical for improving outcomes. Although chest X-rays are widely available and inexpensive, they are rarely used for CVD screening because manual interpretation is complex. There is a growing need for an automated method to detect CVD from chest X-rays. However, preparing large-scale, high-quality medical datasets for model training through manual labeling is time-consuming and often infeasible [1]. In this study, we introduce a multi-step approach based on natural language processing (NLP) and large language models (LLMs) to identify CVD cases from radiology reports. By analyzing the reports associated with chest X-rays, we generate a labeled chest X-ray dataset that can be used to train diagnostic models capable of identifying CVD directly from chest X-rays [2].
Materials and Methods: We use the publicly available MIMIC-CXR database [3], which contains 227,835 radiographic studies with corresponding radiology reports, as of August 2025. First, we use keywords to search for obvious CVD cases with the BM25 (Okapi Best Matching) algorithm [4] and use the matching reports as anchors. Next, we convert all reports into vector representations with a state-of-the-art transformer-based text embedding model, the Qwen3 embedding model [5], and search for additional CVD reports that are semantically similar to the anchors by computing the cosine similarity between each report and the anchors in the embedding space. Both BM25 and the semantic similarity measurement produce a score for each report. Cutoff thresholds are set empirically, and all reports scoring above the threshold are labeled as high-confidence candidate CVD cases. In the last step, an LLM reviews the candidate CVD cases via zero-shot classification to exclude false positives. Cases confirmed by the LLM are labeled as positive CVD cases; the rest, including cases left out in the previous steps, are labeled as negative. The table below shows example findings. These labels are then transferred to the corresponding images, yielding a semi-automatically labeled image dataset that can be used for future training of a large-scale vision model.
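The retrieval and label-transfer steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the BM25 parameters (k1 = 1.5, b = 0.75), the non-negative IDF variant, the whitespace tokenizer, the cutoff values, and all function and variable names are hypothetical, and the two-dimensional toy vectors stand in for real Qwen3 embeddings. The LLM zero-shot confirmation step is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer (assumption; real reports need more care).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document against the query terms."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_candidates(reports, embeddings, query, bm25_cutoff, sim_cutoff):
    """Return (anchor indices, candidate indices) for the LLM review step."""
    scores = bm25_scores(query, reports)
    anchors = {i for i, s in enumerate(scores) if s >= bm25_cutoff}
    candidates = set(anchors)
    for i, emb in enumerate(embeddings):
        if i in anchors:
            continue
        # A report becomes a candidate if it is close to any anchor.
        if any(cosine(emb, embeddings[a]) >= sim_cutoff for a in anchors):
            candidates.add(i)
    return anchors, candidates


def transfer_labels(report_labels, image_to_report):
    """Copy each report-level label to every image linked to that report."""
    return {img: report_labels[rep] for img, rep in image_to_report.items()}


# Toy usage with made-up reports and stand-in embeddings.
reports = [
    "cardiomegaly with pulmonary edema",
    "enlarged cardiac silhouette and vascular congestion",
    "no acute cardiopulmonary process",
]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
anchors, candidates = select_candidates(
    reports, embeddings, "cardiomegaly edema", bm25_cutoff=1.0, sim_cutoff=0.9
)
report_labels = {i: int(i in candidates) for i in range(len(reports))}
image_labels = transfer_labels(report_labels, {"img_a": 0, "img_b": 1, "img_c": 2})
```

In this toy run the second report contains none of the query keywords, so BM25 alone would miss it; it is picked up only because its stand-in embedding lies close to the anchor, which is the role the semantic similarity step plays in the pipeline.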
Results, Conclusions, and Discussions: Using the multi-step approach, we labeled patients in the MIMIC-CXR dataset as either having or not having CVD. The outcome is a fully labeled image dataset, which will serve as the training foundation for a future vision model capable of diagnosing CVD directly from chest X-rays. By replacing manual annotation with largely automated NLP- and LLM-based labeling, this method substantially reduces the time and effort needed to curate high-quality medical training data. The labeling accuracy will be validated against a manually reviewed subset.
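One way the planned validation against a manually reviewed subset could be quantified is sketched below. The abstract does not specify which metrics will be reported, so precision, recall, accuracy, and Cohen's kappa are assumptions chosen for illustration, and the function name and the label lists are hypothetical toy data, not results.

```python
def agreement_metrics(predicted, reference):
    """Compare binary pipeline labels with manual review labels (1 = CVD)."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    tn = sum(1 for p, r in pairs if p == 0 and r == 0)
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "kappa": kappa,
    }


# Toy usage: six reports labeled by the pipeline and by a reviewer.
pipeline_labels = [1, 1, 0, 0, 1, 0]
manual_labels = [1, 0, 0, 0, 1, 1]
metrics = agreement_metrics(pipeline_labels, manual_labels)
```

Reporting kappa alongside accuracy is useful here because CVD-positive reports are likely a minority class, so raw accuracy alone can look high even when the positive labels are unreliable.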
The primary goal of this research is to generate a large, accurately labeled dataset of chest X-rays for training a large-scale vision model that classifies CVD on chest X-rays. By combining keyword-based NLP retrieval, semantic similarity measurement, and an LLM as a judge, we can label chest X-rays efficiently and accurately without full manual review. Once trained on this image dataset, the vision model can be used to automatically detect cardiovascular disease from new chest X-rays, offering a scalable and accessible tool for early screening in clinical and low-resource settings.
References:
1. Talukder, M.A., et al., HXAI-ML: A hybrid explainable artificial intelligence-based machine learning model for cardiovascular heart disease detection. Results in Engineering, 2025, 25, 104370.
2. Chao, H., et al., Deep Learning Predicts Cardiovascular Disease Risks from Lung Cancer Screening Low Dose Computed Tomography. Nature Communications, 2021, 12, 2963.
3. Goldberger, A., et al., PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online], 2000, 101(23), pp. e215–e220. RRID:SCR_007345.
4. Robertson, S., et al., The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 2009, 3(4), pp. 333-389.
5. Zhang, Y., et al., Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint, 2025, arXiv:2506.05176.