World Model Failure Classification and Anomaly Detection for Autonomous Inspection

¹Stanford University, ²Field AI

📌 ICRA 2026

Abstract

Autonomous inspection robots for monitoring industrial sites can reduce the costs and risks associated with human-led inspection. However, obtaining accurate readings can be challenging due to occlusions, limited viewpoints, or unexpected environmental conditions. We propose a hybrid framework that unifies supervised failure classification with anomaly detection, enabling classification of each inspection task as a success, a known failure, or an anomaly (i.e., an out-of-distribution, or OOD, case). Our approach uses a world model backbone with compressed video inputs. This policy-agnostic, distribution-free framework classifies each task using two decision functions whose thresholds are set by conformal prediction (CP), and it reaches a decision before a human observer recognizes the outcome. We evaluate the framework on gauge inspection feeds collected from office and industrial sites and demonstrate real-time deployment on a Boston Dynamics Spot. Experiments show over 90% accuracy in distinguishing between successes, failures, and OOD cases, with classifications occurring earlier than those of a human observer. These results highlight the potential for robust, anticipatory failure detection in autonomous inspection tasks, or for use as a feedback signal during model training to assess and improve the quality of training data.

Approach

Our framework unifies supervised failure classification with anomaly detection to provide early, reliable recognition of inspection outcomes. At a high level, we train separate models on successful and failed gauge readings, then use conformal prediction to set statistically principled thresholds on their outputs. By combining the models’ decisions, the system can distinguish between success, known failure, and out-of-distribution cases in real time, enabling timely corrective actions and more robust autonomous inspection.

Decision Functions

Our framework relies on two complementary decision functions: one trained on successes and one on failures. Each produces a score for the trajectory, and by comparing those scores to calibrated thresholds, the system determines whether the outcome is a success, a known failure, or out-of-distribution (see below).

Decision Function Pipeline
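
As a rough illustration of how two thresholded decision functions could be combined into a three-way label, the sketch below may help. It is not the authors' implementation. It assumes that each score is a nonconformity score (lower means the trajectory looks more like that model's training data), and the tie-break used when both models accept a trajectory is a hypothetical choice. The names `Outcome` and `classify` are illustrative only.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    KNOWN_FAILURE = "known_failure"
    OOD = "out_of_distribution"


def classify(success_score: float, failure_score: float,
             success_threshold: float, failure_threshold: float) -> Outcome:
    """Combine two thresholded decision functions into one label.

    Assumption: scores are nonconformity scores, so a score at or below
    its threshold means the trajectory conforms to that model.
    """
    fits_success = success_score <= success_threshold
    fits_failure = failure_score <= failure_threshold

    if not fits_success and not fits_failure:
        # Neither model recognizes the trajectory: treat it as an anomaly.
        return Outcome.OOD
    if fits_success and not fits_failure:
        return Outcome.SUCCESS
    if fits_failure and not fits_success:
        return Outcome.KNOWN_FAILURE

    # Both models accept the trajectory. Hypothetical tie-break: prefer the
    # model whose score sits further below its own threshold.
    success_margin = success_threshold - success_score
    failure_margin = failure_threshold - failure_score
    if success_margin >= failure_margin:
        return Outcome.SUCCESS
    return Outcome.KNOWN_FAILURE


# Illustrative values only; real scores would come from the trained models.
label = classify(success_score=0.12, failure_score=0.87,
                 success_threshold=0.30, failure_threshold=0.35)
print(label.value)
```

Keeping the combination rule separate from the models means the same rule applies whatever backbone produces the scores, which matches the policy-agnostic framing above.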

Design Steps

Train a success model and a failure model, each forming its own decision function.
Use the decision functions to assign scores to calibration trajectories.
Calibrate conformal prediction thresholds from these scores to set cutoff values (see below).
At runtime, apply both decision functions with their thresholds, and combine the outputs to classify the trajectory.

Calibration Pipeline
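
The calibration step in the design steps above can be sketched with the standard split conformal quantile. The page does not give the exact formula, so this is an assumption about the general technique rather than the authors' procedure. The function name `conformal_threshold`, the miscoverage level `alpha`, and the example scores are all hypothetical. Under the usual exchangeability assumption, a new score drawn from the same distribution as the calibration scores falls at or below the returned threshold with probability at least 1 - alpha.

```python
import math
from typing import Sequence


def conformal_threshold(calibration_scores: Sequence[float],
                        alpha: float) -> float:
    """Split conformal threshold for nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for that rank.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    scores = sorted(calibration_scores)
    n = len(scores)
    if n == 0:
        raise ValueError("at least one calibration score is required")

    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return scores[rank - 1]


# Hypothetical calibration scores, one per held-out trajectory.
success_cal_scores = [0.11, 0.25, 0.08, 0.19, 0.31, 0.14, 0.22, 0.17, 0.27]
failure_cal_scores = [0.40, 0.33, 0.52, 0.29, 0.47, 0.36, 0.44, 0.38, 0.50]

# One threshold per decision function, each from its own calibration set.
success_threshold = conformal_threshold(success_cal_scores, alpha=0.2)
failure_threshold = conformal_threshold(failure_cal_scores, alpha=0.2)
```

Because the threshold depends only on the ranks of the calibration scores, no assumption about the shape of the score distribution is needed, which is what makes the approach distribution-free.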

Experiments

Demo: Real-time hardware deployment

We tested our framework both in simulation and on hardware. First, we ran experiments on inspection videos collected from office and industrial sites to evaluate how well the decision functions and calibrated thresholds distinguish success, failure, and out-of-distribution cases. Then, we validated the approach in a hardware deployment on a Boston Dynamics Spot with a PTZ camera, where the system classified gauge readings in real time and triggered corrective actions based on the outcome.