Trustworthiness in Medical Product Question Answering by Large Language Models
Date:
I presented a poster at the KDD 2024 Workshop on GenAI Evaluation in Barcelona, corresponding to the paper “Trustworthiness in medical product question answering by large language models”. The work introduces a claim-level evaluation framework to assess whether large language models provide medically accurate and label-consistent answers when responding to questions about prescription drugs and medical products.
The poster summarized our methodology: decomposing LLM outputs into atomic claims, classifying each claim against FDA product label sections, and evaluating factual support using the full labeling documents. The approach combines large-scale synthetic question generation with a multi-stage LLM evaluation pipeline to surface contradictions, unsupported claims, and off-label recommendations. This framework highlights the importance of fine-grained verification when deploying LLMs in medical and health-related contexts.
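To make the pipeline concrete, here is a minimal sketch of the claim-level data structures and verification step. All names (`Claim`, `Verdict`, `verify_claim`) and the substring-based check are hypothetical illustrations; in the actual framework, an LLM judge compares each claim against the full label text rather than matching keywords.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """Possible outcomes when checking a claim against the product label."""
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    CONTRADICTED = "contradicted"


@dataclass
class Claim:
    """One atomic claim extracted from an LLM answer."""
    text: str
    label_section: str  # e.g. "Indications and Usage", "Warnings"


def verify_claim(claim: Claim, label_sections: dict[str, str]) -> Verdict:
    """Toy verifier: the real pipeline uses an LLM judge over the full
    labeling document; here we only do naive substring containment."""
    section_text = label_sections.get(claim.label_section, "").lower()
    if claim.text.lower() in section_text:
        return Verdict.SUPPORTED
    # Naive contradiction check: a negated form of the claim appears verbatim
    if ("not " + claim.text.lower()) in section_text:
        return Verdict.CONTRADICTED
    return Verdict.UNSUPPORTED


# Example: two claims checked against one (fabricated) label section
label = {"Indications and Usage": "Indicated for hypertension in adults."}
claims = [
    Claim("indicated for hypertension", "Indications and Usage"),
    Claim("safe for children", "Indications and Usage"),
]
verdicts = [verify_claim(c, label) for c in claims]
```

The value of claim-level decomposition is that a single answer can receive a mixed verdict: one claim supported, another unsupported or off-label, which an answer-level score would hide.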

