Detecting sensitive medical responses in general purpose large language models

Date:

I presented a poster at the Machine Learning for Health Symposium (ML4H) 2024 in Vancouver, corresponding to the paper Detecting sensitive medical responses in general purpose large language models. The work investigates how to identify sensitive or potentially harmful medical responses produced by general-purpose large language models.

The presentation summarized our framework for detecting unsafe medical behaviors using synthetic prompt generation, persona-based augmentation, and LLM-powered response classification. The methodology, detailed in the ML4H paper, combines large-scale synthetic red-teaming with a fine-tuned Flan-T5 evaluator to identify responses that provide unvetted medical advice or omit necessary referrals to licensed professionals. This approach exposes critical safety gaps in generalist LLMs and demonstrates how synthetic data pipelines can surface failures that traditional manual red-teaming cannot reach.