LVIDA: Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs

Xuange Zhang1† Dengjie Li2† Bo Liu1 Zenghao Bao2 Yao Zhou2
Baisong Yang2 Zhongying Liu2 Yujie Zhong2* Tongtong Yuan1*

1Beijing University of Technology, CN   2Meituan Inc., CN

Paper | Code

Comprehensive Comparison of LVIDA against Baseline LVLMs Across Different LLM Decoders.
Left: FLOPs comparison across different models, showing a 90% reduction in computation for models under 3B parameters and an 88% reduction for 7B models.
Right: Performance comparison across 7 benchmarks, demonstrating that LVIDA preserves the capability of the original models.

Abstract
Benefiting from recent advances in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) achieve strong performance across a wide range of scenarios. However, their excessive computational complexity limits their use in practical applications. We argue that a main computational bottleneck is the involvement of redundant vision sequences in model computation, an observation drawn from a reassessment of how efficiently vision and language information is transmitted through the language decoder of LVLMs. We therefore propose a novel vision-language interaction mechanism called Layer-wise Vision Injection with Disentangled Attention (LVIDA). In LVIDA, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language sequence at specific stages within each language decoder layer. Strikingly, this design significantly reduces computational complexity with minimal performance loss: LVIDA achieves approximately a 10× reduction in the computational cost of the language decoder across multiple LVLMs while maintaining comparable performance.
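
To make the mechanism concrete, the following is a minimal PyTorch sketch of one such decoder layer. All names (e.g., DisentangledAttentionLayer) and sizes are illustrative assumptions, not the released implementation: language queries attend over the concatenated vision-language sequence, but only the language hidden states are updated and carried to the next layer.

    import torch
    import torch.nn as nn

    class DisentangledAttentionLayer(nn.Module):
        """Sketch of one decoder layer: language attends to [vision; language];
        only the language sequence propagates to the next layer."""

        def __init__(self, d_model: int = 1024, n_heads: int = 16):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm_attn = nn.LayerNorm(d_model)
            self.norm_ffn = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, lang: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
            # Language queries attend over the concatenated [vision; language] tokens,
            # so vision information is injected without the vision tokens themselves
            # being carried forward through the decoder.
            keys_values = torch.cat([vision, lang], dim=1)
            attn_out, _ = self.attn(self.norm_attn(lang), keys_values, keys_values)
            lang = lang + attn_out
            # Only the language sequence passes through the FFN and on to the next layer.
            lang = lang + self.ffn(self.norm_ffn(lang))
            return lang

    # Example: 728 vision tokens injected into a 64-token language sequence.
    layer = DisentangledAttentionLayer()
    out = layer(torch.randn(1, 64, 1024), torch.randn(1, 728, 1024))  # -> (1, 64, 1024)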

Framework
Model Structure
Comparison of the Vanilla Model and LVIDA Architectures. Left: Overall structure of the traditional vanilla model. Middle: Overall structure of LVIDA. Right: Details of Disentangled Attention. Layer-wise Vision Injection with Disentangled Attention (LVIDA) is designed to reduce computational overhead while maintaining LVLM performance. After the vision and language features are fused through Disentangled Attention, the vision sequence no longer participates in forward propagation within the language decoder, substantially decreasing the overall computational load.
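
This token flow can also be illustrated with PyTorch's stock TransformerDecoder, shown below as a rough functional analogue rather than the paper's implementation: a fixed memory (the projected vision sequence) is cross-attended at every layer, while only the target (language) sequence propagates through self-attention and the FFN. LVIDA's attention design differs in detail; the sizes below are illustrative.

    import torch
    import torch.nn as nn

    d_model, n_heads, n_layers = 1024, 16, 4
    layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
    decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    vision = torch.randn(1, 728, d_model)    # projected vision features (fixed side input)
    language = torch.randn(1, 64, d_model)   # language embeddings (propagated sequence)

    # Vision is injected at every layer via cross-attention ("memory"), but only the
    # 64 language tokens flow through self-attention and the FFN of each layer.
    out = decoder(tgt=language, memory=vision)
    print(out.shape)  # torch.Size([1, 64, 1024])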

Results
Performance and Computational Efficiency Comparison
Comprehensive Comparison of LVIDA and Baseline Models. The vision-to-language (V:L) input token ratio in these LVLMs is 728:64. The table reports computational efficiency (measured by FLOPs, time to first token, and peak memory consumption) and performance, highlighting the efficiency of LVIDA while maintaining comparable results.
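
As a back-of-the-envelope illustration of where the savings come from at the 728:64 ratio above: per-layer decoder cost is dominated by terms linear in the number of propagated tokens (projections, FFN) plus a quadratic attention term. The constants below are assumed, and vision-side key/value projections are ignored, so this is not the paper's FLOPs accounting.

    d = 4096            # hypothetical hidden size
    ffn_mult = 4        # hypothetical FFN expansion factor
    n_vis, n_lang = 728, 64

    def layer_flops(n_prop: int, n_kv: int) -> float:
        """Rough FLOPs of one decoder layer.
        n_prop: tokens that propagate through the projections and FFN.
        n_kv:   tokens visible as keys/values in attention."""
        proj = 4 * n_prop * d * d              # Q, K, V, O projections
        attn = 2 * n_prop * n_kv * d           # QK^T plus attention-weighted sum of V
        ffn = 2 * n_prop * d * (ffn_mult * d)  # two FFN matmuls
        return 2 * (proj + attn + ffn)         # each multiply-accumulate counted as 2 FLOPs

    vanilla = layer_flops(n_vis + n_lang, n_vis + n_lang)  # all 792 tokens propagate
    lvida = layer_flops(n_lang, n_vis + n_lang)            # only 64 language tokens propagate
    print(f"estimated per-layer reduction: {vanilla / lvida:.1f}x")
    # ~12x with these assumed constants, the same order as the ~10x reported above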
Visualisation

Citation
If you find our work useful in your research or applications, please cite our paper: