Video Avatars & Multimodal Experiences

A set of real-time and near-real-time video avatar experiences that combine speech, vision, and structured interaction flows. The work explored how embodied, multimodal interfaces can support more natural, persistent, and task-oriented interactions between users and AI systems.

The work included building and evaluating end-to-end prototypes that integrate speech recognition, low-latency response generation, expressive video avatars, and orchestration logic for turn-taking, interruption handling, and contextual grounding. I assessed multiple architectural approaches and technologies, identifying trade-offs across realism, latency, scalability, and deployment constraints.
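
As an illustration of the turn-taking and interruption-handling layer, the sketch below shows one way such an orchestration loop could be structured. It is a minimal asyncio prototype under assumed interfaces, not the production pipeline: `TranscriptEvent`, `respond`, and `speak` are hypothetical stand-ins for the real speech-recognition, response-generation, and avatar-rendering services.

```python
# Minimal turn-taking / barge-in sketch. All component interfaces
# (TranscriptEvent, respond(), speak()) are hypothetical stand-ins for
# the actual ASR, response-generation, and avatar-rendering services.
import asyncio
from dataclasses import dataclass


@dataclass
class TranscriptEvent:
    text: str
    is_final: bool  # True once the recognizer commits the utterance


async def respond(utterance: str):
    """Placeholder for low-latency response generation (e.g. a streaming model call)."""
    for token in f"(reply to: {utterance})".split():
        await asyncio.sleep(0.05)  # simulate token-by-token streaming
        yield token


async def speak(tokens) -> None:
    """Placeholder for driving the avatar's speech and lip-sync from a token stream."""
    async for token in tokens:
        print(token, end=" ", flush=True)
    print()


async def orchestrate(events: "asyncio.Queue[TranscriptEvent]") -> None:
    """Single-conversation loop: start a reply on each final transcript,
    and cancel the in-flight reply if the user speaks again (barge-in)."""
    current_turn = None
    while True:
        event = await events.get()
        if current_turn and not current_turn.done():
            current_turn.cancel()  # user spoke over the avatar: interrupt the reply
        if event.is_final:
            current_turn = asyncio.create_task(speak(respond(event.text)))


async def main() -> None:
    events: asyncio.Queue[TranscriptEvent] = asyncio.Queue()
    orchestrator = asyncio.create_task(orchestrate(events))
    # Simulate two user turns, the second one interrupting the first reply.
    await events.put(TranscriptEvent("tell me about the weather", is_final=True))
    await asyncio.sleep(0.12)
    await events.put(TranscriptEvent("actually, never mind", is_final=True))
    await asyncio.sleep(1.0)
    orchestrator.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```

The key design point the sketch captures is that a user utterance both starts a new turn and cancels any reply still being rendered, which is what keeps the avatar from talking over the user; the real systems layered contextual grounding and more nuanced interruption policies on top of this basic loop.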