Fine-grained semantic indexing of biomedical texts with language models
Authorship
M.G.L.
Bachelor’s Degree in Informatics Engineering
Defense date
02.20.2025 17:00
Summary
This Final Degree Project (TFG) addresses the semantic indexing of biomedical texts with large language models (LLMs), aiming to improve access to biomedical information through the automated assignment of MeSH descriptors. The proposed method consists of several stages. First, the MeSH ontology, obtained through BioPortal, is preprocessed. Next, biomedical abstracts previously indexed with coarse-grained labels are selected for semantic refinement. The method applies a zero-shot prompting strategy with the LLaMa3 model, developing and optimizing several prompt configurations to improve classification. Combining the most effective strategies in an ensemble significantly improved the system's performance. Finally, the model is evaluated with standard metrics (precision, recall and F-measure) to analyze its performance and determine its viability for biomedical indexing tasks. The results show that LLaMa3 outperforms traditional weakly supervised methods in precision, recall and F-measure, making it an effective alternative for biomedical indexing. However, challenges remain in computational efficiency and scalability, especially when the method is applied to large volumes of data. Analyzing the assigned labels made it possible to identify performance patterns and define strategies to improve the quality of semantic indexing. To address the efficiency challenges, semantic search over vector databases is explored as a possible computational optimization. The results of this approach, however, did not reach the expected indexing quality, suggesting that the threshold settings and the representation of semantic context need further adjustment. In conclusion, this work validates the potential of generative language models for biomedical indexing and highlights the importance of optimizing their performance and scalability for large volumes of data. These findings lay the foundation for future research aimed at improving the efficiency and accuracy of semantic indexing systems in biomedicine.
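As an illustration of the ensemble and evaluation steps named in the summary, the following is a minimal sketch and not the project's actual code. It assumes that each prompt strategy returns a set of labels per abstract, that the ensemble is a simple majority vote, and that the metrics are micro-averaged; the summary specifies none of these details. The function names and the placeholder labels ("A", "B", ...) are hypothetical.

```python
from collections import Counter


def majority_vote(label_sets, min_votes=2):
    """Keep the labels proposed by at least `min_votes` prompt strategies.

    Assumption: the ensemble rule is a plain vote count; the summary does
    not say how the strategies were actually combined.
    """
    votes = Counter(label for labels in label_sets for label in set(labels))
    return {label for label, n in votes.items() if n >= min_votes}


def micro_prf(predicted, gold):
    """Micro-averaged precision, recall and F-measure over per-document label sets."""
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical outputs of three prompt strategies for two abstracts.
doc1 = majority_vote([{"A", "B"}, {"A", "C"}, {"A", "B"}])
doc2 = majority_vote([{"D"}, {"E"}, {"D", "E"}])

# Hypothetical gold-standard labels for the same two abstracts.
gold = [{"A", "B", "C"}, {"D"}]
precision, recall, f_measure = micro_prf([doc1, doc2], gold)
print(precision, recall, f_measure)
```

Raising `min_votes` trades recall for precision, which is one simple way to explore the balance between the two metrics reported in the evaluation.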
Supervision
TABOADA IGLESIAS, MARÍA JESÚS (Tutor)
Examination committee
TABOADA IGLESIAS, MARÍA JESÚS (Student’s tutor)