AWE USA 2026
When deployed in unconstrained real-world environments, modern mixed reality (MR) experiences can fail in subtle but consequential ways: virtual objects may drift or occlude critical real-world elements; spatial audio cues may contradict visual content; MR-based guidance may suggest actions that are impossible, unsafe, or societally unacceptable. Such failures are highly context-dependent: MR content that is appropriate in one setting becomes unsafe in another, and classical signal processing or conventional ML approaches alone cannot detect them.

Recent large multimodal models (LMMs) enable semantic, context-aware monitoring of immersive systems, going far beyond what was previously feasible. By jointly reasoning over video streams captured by MR devices, rendered virtual content, spatial audio, and scene context, these models can identify misaligned overlays, unsafe occlusions, inconsistent audio-visual cues, and policy violations in running MR applications. In this talk, we present an LMM-based MR experience evaluation pipeline that combines foundation-scale multimodal models with auxiliary ML components to analyze continuous multimodal streams and trigger corrective actions when problematic content is detected.

A central technical challenge in LMM-based MR evaluation is achieving sufficiently low latency for integration into live MR systems. We describe a system architecture and inference strategy that enable responsive, near real-time multimodal monitoring while maintaining the breadth of semantic reasoning provided by large models.

The methodology presented in this talk received the Best Paper Award at the 2025 IEEE International Symposium on Mixed and Augmented Reality (IEEE ISMAR). This research is funded by an NSF AI Institute, a DARPA Young Faculty Award, and a DARPA Director's Fellowship.
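As a rough illustration of the kind of monitoring loop described above, the sketch below shows how such a pipeline might be wired together. It is a minimal, hypothetical example, not the system presented in the talk: the names (MRFrame, fast_prefilter, query_lmm, apply_correction) are placeholders assumed for illustration.

```python
# Minimal sketch of an LMM-based MR monitoring loop (illustrative only).
# All names below are hypothetical placeholders, not the actual system.

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class MRFrame:
    """One synchronized slice of the multimodal stream."""
    camera_image: bytes        # passthrough video from the MR device
    rendered_overlay: bytes    # virtual content composited this frame
    audio_window: bytes        # recent spatial-audio buffer
    scene_context: str         # e.g., "kitchen", "factory floor"


def fast_prefilter(frame: MRFrame) -> bool:
    """Cheap auxiliary check (stand-in for a small ML model) that decides
    whether a frame is worth sending to the expensive LMM at all."""
    # Placeholder: always escalate; a real filter might score occlusion
    # area, overlay drift, or audio-visual mismatch before escalating.
    return True


def query_lmm(frame: MRFrame) -> Optional[str]:
    """Hypothetical call to a large multimodal model; returns a description
    of the detected problem, or None if the frame looks acceptable."""
    # A real implementation would batch frames and use a latency-optimized
    # inference path; stubbed out here.
    return None


def apply_correction(issue: str) -> None:
    """Placeholder corrective action, e.g., hiding an overlay or warning the user."""
    print(f"corrective action triggered: {issue}")


def monitor(stream: Iterable[MRFrame]) -> None:
    """Continuously screen the MR experience and react to flagged content."""
    for frame in stream:
        if not fast_prefilter(frame):
            continue                   # skip frames the cheap check clears
        issue = query_lmm(frame)       # semantic, context-aware evaluation
        if issue is not None:
            apply_correction(issue)
```

In this sketch, the cheap pre-filter stands in for the auxiliary ML components, gating which frames reach the expensive LMM call; such gating is one common way to keep end-to-end latency within interactive bounds.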