Grothey, Bastian (ORCID: 0000-0002-0883-6481), Odenkirchen, Jan, Brkic, Adnan, Schömig-Markiefka, Birgid (ORCID: 0000-0003-1893-8796), Quaas, Alexander (ORCID: 0000-0002-3537-6011), Büttner, Reinhard (ORCID: 0000-0001-8806-4786) and Tolkach, Yuri (ORCID: 0000-0001-5239-2841) (2025). Comprehensive testing of large language models for extraction of structured data in pathology. Communications Medicine, 5 (1). ISSN 2730-664X
PDF: s43856-025-00808-8.pdf (4MB), made available under the Creative Commons Attribution license.
Abstract
Background: Pathology departments generate large volumes of unstructured data as free-text diagnostic reports. Converting these reports into structured formats for analytics or artificial intelligence projects requires substantial manual effort by specialized personnel. While recent studies show promise in using advanced language models for structuring pathology data, they primarily rely on proprietary models, raising cost and privacy concerns. Additionally, important aspects such as prompt engineering and model quantization for deployment on consumer-grade hardware remain unaddressed.

Methods: We created a dataset of 579 annotated pathology reports in German and English versions. Six language models (proprietary: GPT-4; open-source: Llama2 13B, Llama2 70B, Llama3 8B, Llama3 70B, and Qwen2.5 7B) were evaluated for their ability to extract eleven key parameters from these reports. Additionally, we investigated model performance across different prompt engineering strategies and model quantization techniques to assess practical deployment scenarios.

Results: Here we show that open-source language models extract structured data from pathology reports with high precision, matching the accuracy of the proprietary GPT-4 model. The precision varies significantly across models and configurations, depending on the specific prompt engineering strategies and quantization methods used during model deployment.

Conclusions: Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments. Our publicly available bilingual dataset serves as both a benchmark and a resource for future research.
| Item Type: | Article |
| Creators: | Odenkirchen, Jan (Email: UNSPECIFIED, ORCID: UNSPECIFIED, ORCID Put Code: UNSPECIFIED); Brkic, Adnan (Email: UNSPECIFIED, ORCID: UNSPECIFIED, ORCID Put Code: UNSPECIFIED) |
| URN: | urn:nbn:de:hbz:38-792734 |
| Identification Number: | 10.1038/s43856-025-00808-8 |
| Journal or Publication Title: | Communications Medicine |
| Volume: | 5 |
| Number: | 1 |
| Date: | 31 March 2025 |
| ISSN: | 2730-664X |
| Language: | English |
| Faculty: | Faculty of Medicine |
| Divisions: | Faculty of Medicine > Anatomie; Faculty of Medicine > Pathologie und Neuropathologie > Institut für Pathologie |
| Subjects: | Medical sciences Medicine |
| Open Access Funders: | Publikationsfonds UzK |
| Refereed: | Yes |
| URI: | http://kups.ub.uni-koeln.de/id/eprint/79273 |