Hybrid Attention/Recurrent FM

A hybrid VLM that replaces 25% of its attention layers with linear-time recurrent layers (GKA), so that those layers run with constant memory and constant per-token latency on long-context agentic workloads. It is trained with progressive distillation from the dense teacher and fine-tuned on vision-language and tool-calling tasks.

It achieves near-parity with the dense model on VL benchmarks, with a 1.3–1.5× inference speedup at long contexts. Open question: where the right boundary between attention and recurrence lies in a VLM. The current 25% mix was chosen empirically; a principled answer is still unclear.
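To make the 25% mix concrete, here is a minimal sketch of how a layer schedule and its KV-cache footprint could be computed. Everything in it is an assumption for illustration: the function names, the 32-layer depth, the even spacing of recurrent layers, and the head counts and sizes are hypothetical and are not taken from the actual model. It only shows that attention layers carry a cache that grows with context length, while recurrent layers are counted as holding a fixed-size state.

```python
def hybrid_layer_pattern(num_layers: int, recurrent_fraction: float = 0.25) -> list[str]:
    """Label each layer 'attention' or 'recurrent'.

    Assumption: recurrent layers are spaced evenly through the stack.
    The real placement is not specified in the text above.
    """
    num_recurrent = round(num_layers * recurrent_fraction)
    pattern = ["attention"] * num_layers
    if num_recurrent == 0:
        return pattern
    stride = num_layers / num_recurrent
    for k in range(num_recurrent):
        # Put a recurrent layer at the end of each group of `stride` layers.
        pattern[int((k + 1) * stride) - 1] = "recurrent"
    return pattern


def kv_cache_bytes(
    pattern: list[str],
    context_len: int,
    num_kv_heads: int = 8,    # hypothetical value
    head_dim: int = 128,      # hypothetical value
    bytes_per_elem: int = 2,  # e.g. 16-bit storage
) -> int:
    """KV-cache size: only attention layers store keys and values per token."""
    num_attention = pattern.count("attention")
    # Factor 2 accounts for storing both keys and values.
    return 2 * num_attention * context_len * num_kv_heads * head_dim * bytes_per_elem


dense = ["attention"] * 32
hybrid = hybrid_layer_pattern(32, recurrent_fraction=0.25)
ratio = kv_cache_bytes(hybrid, 100_000) / kv_cache_bytes(dense, 100_000)
print(hybrid.count("recurrent"), ratio)
```

Under these assumptions the cache shrinks in proportion to the share of attention layers removed, so `recurrent_fraction` is the knob the open question above is about: raising it saves more memory, at whatever accuracy cost the replaced layers incur.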

The architecture work builds on the codebook-anchored adaptation thread (Wu et al., 2026), which showed how to decouple visual encoders from language backbones, and on continual-learning work, which determines how often such an architecture needs to be updated in production.

References

2026

CVPR

Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Jason Wu, Tianchen Zhao, Chang Liu, and 7 more authors

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026


@inproceedings{wu2026decoupling,
  title = {Decoupling Vision and Language: Codebook Anchored Visual Adaptation},
  author = {Wu, Jason and Zhao, Tianchen and Liu, Chang and Cai, Jiarui and Zhang, Zheng and Li, Zhuowei and Singh, Aaditya and Xu, Xiang and Srivastava, Mani and Wu, Jonathan},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2026},
}