Omni-Modal Trust & Verification

Detecting tampered media across image, video, and audio with explainable reasoning.

Omni-modal VLMs are trained to verify media across image, video, and audio by detecting semantic inconsistencies, temporal artifacts, and audio-visual mismatches. The models output chain-of-thought reasoning so that each decision is explainable, and they generalize to attacks unseen at training time.

The thread connecting this to the rest of my work: trust models decay because attackers move. The right architectural answer is a model that reasons about why a piece of content is or isn’t trustworthy, so that an update revises the reasoning instead of retraining a black-box classifier.

The closest published work is AuthGuard (Shen et al., 2026), which uses language-guided commonsense reasoning for deepfake detection and reports AUC gains of 6.15% on DFDC and 16.68% on DF40. Earlier deepfake work used self-consistency learning (Zhao et al., 2021) as the basis for source-feature inconsistency detection. The model-diagnosis-and-correction framework (Chen et al., 2025) closes the loop on this thread: when the model errs, an automated system localizes the cause via attribute editing and synthesizes counterfactual training data to fix it.
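
For readers less familiar with the metric behind those AUC numbers, here is a minimal sketch of how ROC AUC can be computed for a fake-vs-real detector and how a gain between two detectors is read off. The function name `roc_auc`, the labels, and both score lists are made-up illustrations, not data or code from any of the cited papers, and the sketch takes no position on whether the reported gains are absolute or relative.

```python
def roc_auc(labels, scores):
    """ROC AUC via its rank interpretation: the probability that a randomly
    chosen fake (label 1) scores higher than a randomly chosen real (label 0).
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = tampered, 0 = authentic.
labels = [1, 1, 1, 0, 0, 0]
baseline_scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.7]  # hypothetical baseline detector
improved_scores = [0.9, 0.6, 0.8, 0.5, 0.3, 0.7]  # hypothetical improved detector

baseline_auc = roc_auc(labels, baseline_scores)
improved_auc = roc_auc(labels, improved_scores)
gain = improved_auc - baseline_auc  # absolute difference in AUC

print(f"baseline={baseline_auc:.4f} improved={improved_auc:.4f} gain={gain:.4f}")
```

Because AUC depends only on how fakes rank against reals, it is insensitive to the decision threshold, which is one reason it is a common choice for comparing detectors across datasets.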

References

2026

  1. AuthGuard: Generalizable Deepfake Detection via Language Guidance
    Guangyu Shen, Zhihua Li, Xiang Xu, and 6 more authors
    In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

2025

  1. Model Diagnosis and Correction via Linguistic and Implicit Attribute Editing
    Xuanbai Chen, Xiang Xu, Zhihua Li, and 4 more authors
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2021

  1. Learning Self-Consistency for Deepfake Detection
    Tianchen Zhao, Xiang Xu, Mingze Xu, and 3 more authors
    In IEEE/CVF International Conference on Computer Vision (ICCV), 2021
    Oral