End-to-end alignment for multi-modal foundation models.
An end-to-end alignment pipeline for multi-modal foundation models, combining multi-modal red-teaming, distributed evaluation with LLM-as-judge scoring, and adversarial post-training using preference optimization. It reduces attack success rate by an order of magnitude across text and vision while preserving task quality.
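As a rough illustration of the headline metric, the sketch below computes attack success rate per modality from judge verdicts. The function name, the `(modality, succeeded)` record shape, and the toy verdicts are assumptions made for illustration, not the pipeline's actual interface or data.

```python
from collections import defaultdict


def attack_success_rate(verdicts):
    """Return the fraction of successful attacks per modality.

    `verdicts` is an iterable of (modality, succeeded) pairs, where
    `succeeded` is True when the judge marks the attack as having
    elicited a policy-violating response. This record shape is a
    hypothetical stand-in for whatever the real evaluation emits.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for modality, succeeded in verdicts:
        totals[modality] += 1
        hits[modality] += int(succeeded)
    return {modality: hits[modality] / totals[modality] for modality in totals}


# Toy verdicts, invented purely to show the call; not measured results.
toy_verdicts = [("text", True), ("text", False), ("vision", False), ("vision", False)]
rates = attack_success_rate(toy_verdicts)
print(rates)
```

Running the same function on verdicts collected before and after post-training gives the two rates whose ratio is the reduction factor.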
The thing the literature underweights: alignment in production is a moving target. Static reward models decay against adapting attackers, which is why I treat alignment as a continual learning problem and pair this pipeline with the self-evolving agent for continuous updates.
The principle: a one-shot RLHF pass is the start of alignment, not the end. Production alignment lives in the loop. The adversarial component connects to the sharpness-aware optimization work for transferable attacks (Ye et al., 2024) and the principles-of-design work for remote anti-spoofing systems (Xu et al., 2024).
@inproceedings{ye2024sharpness,
  title={Sharpness-Aware Optimization for Real-World Adversarial Attacks for Diverse Compute Platforms with Enhanced Transferability},
  author={Ye, Muchao and Xu, Xiang and Zhang, Qin and Wu, Jon},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year={2024},
}
Xiang Xu, Tianchen Zhao, Zheng Zhang, Zhihua Li, Jon Wu, Alessandro Achille, and Mani Srivastava. Principles of Designing Robust Remote Face Anti-Spoofing Systems. arXiv preprint arXiv:2406.03684, 2024.
@article{xu2024principles,
  title={Principles of Designing Robust Remote Face Anti-Spoofing Systems},
  author={Xu, Xiang and Zhao, Tianchen and Zhang, Zheng and Li, Zhihua and Wu, Jon and Achille, Alessandro and Srivastava, Mani},
  journal={arXiv preprint arXiv:2406.03684},
  year={2024},
}