Super Co-Alignment

End-to-end alignment for multi-modal foundation models.

The pipeline combines three stages: multi-modal red-teaming, distributed evaluation with LLM-as-judge scoring, and adversarial post-training using preference optimization. It reduces attack success rate by an order of magnitude across text and vision while preserving task quality.
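To make the headline metric concrete, here is a minimal sketch of how attack success rate could be computed per modality from judge verdicts. This is an illustration under stated assumptions, not the pipeline's actual code: the record fields `modality` and `judged_harmful` are hypothetical names, and the judge is assumed to emit one boolean verdict per attack attempt.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {modality: fraction of attacks the judge marked as successful}.

    Each record is a dict with two hypothetical fields:
      "modality"       -> e.g. "text" or "vision"
      "judged_harmful" -> True if the judge ruled the attack succeeded
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for rec in records:
        totals[rec["modality"]] += 1
        if rec["judged_harmful"]:
            successes[rec["modality"]] += 1
    # Every modality in `totals` has at least one record, so no division by zero.
    return {m: successes[m] / totals[m] for m in totals}


# Toy verdicts, made up purely for illustration.
toy_verdicts = [
    {"modality": "text", "judged_harmful": True},
    {"modality": "text", "judged_harmful": False},
    {"modality": "vision", "judged_harmful": False},
    {"modality": "vision", "judged_harmful": False},
]
per_modality = attack_success_rate(toy_verdicts)
```

Reporting the rate per modality rather than pooled keeps a regression in one modality from hiding behind an improvement in the other.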

The thing the literature underweights: alignment in production is a moving target. Static reward models decay against adapting attackers, which is why I treat alignment as a continual learning problem and pair this pipeline with the self-evolving agent for continuous updates.
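One way to picture the moving target is a rolling monitor that watches attack success rate on fresh red-team traffic and flags when it drifts past a tolerance. The sketch below is hypothetical: the class name `ASRMonitor`, the window size, and the threshold are illustrative assumptions, not values or components from the actual system.

```python
from collections import deque


class ASRMonitor:
    """Tracks attack success rate over the most recent `window` judged attacks."""

    def __init__(self, window=100, threshold=0.05):
        # deque with maxlen silently drops the oldest outcome once full.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, attack_succeeded):
        self.outcomes.append(bool(attack_succeeded))

    def rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_update(self):
        # Only flag once the window is full, to avoid noisy early triggers.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.rate() > self.threshold
```

In a continual setup, a `needs_update()` signal like this would be what hands control back to the post-training stage, instead of waiting for a scheduled retrain.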

The principle: a one-shot RLHF pass is the start of alignment, not the end. Production alignment lives in the loop. The adversarial component builds on two earlier pieces of work: sharpness-aware optimization for transferable adversarial attacks (Ye et al., 2024) and design principles for robust remote face anti-spoofing systems (Xu et al., 2024).

References

2024

  1. Sharpness-Aware Optimization for Real-World Adversarial Attacks for Diverse Compute Platforms with Enhanced Transferability
    Muchao Ye, Xiang Xu, Qin Zhang, and 1 more author
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024
  2. Principles of Designing Robust Remote Face Anti-Spoofing Systems
    Xiang Xu, Tianchen Zhao, Zheng Zhang, and 4 more authors
    arXiv preprint arXiv:2406.03684, 2024