Our evaluation setup involves the following key components:
- (a) Needle Sub-Image: The target sub-image to be retrieved from the haystack based on the given caption.
- (b) Haystack Image Inputs: The long-context visual inputs consist of M images, each stitched from N × N sub-images (a minimal stitching sketch follows this list).
- (c) Text Inputs (Instructions and Caption): Detailed instructions to the MLLMs, followed by a caption describing the needle (in this example, sub-image 20).
- (d) LLM Outputs: The answers from different MLLMs, indicating their ability to accurately locate the needle in the haystack based on the given caption. The expected output specifies the index, row, and column of the matching sub-image. The results illustrate the comparative performance of various models: GPT-4o correctly predicts the exact location of the needle; Gemini 1.5 Pro correctly predicts only the image index of the needle; other API models predict incorrect locations; and open-source models often produce outputs in the wrong format.
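
For concreteness, here is a minimal sketch of how a single haystack image could be stitched from N × N equally sized sub-images. The function name `stitch_haystack_image` and the row-major tiling order are illustrative assumptions, not the benchmark's released construction code.

```python
# Minimal sketch: tile n x n equally sized sub-images into one haystack image.
# The function name and row-major ordering are assumptions for illustration only.
from PIL import Image

def stitch_haystack_image(sub_images, n):
    """Paste n * n PIL sub-images (row-major) onto a single canvas."""
    assert len(sub_images) == n * n, "expected exactly n * n sub-images"
    w, h = sub_images[0].size
    canvas = Image.new("RGB", (n * w, n * h))
    for idx, sub in enumerate(sub_images):
        row, col = divmod(idx, n)              # row-major placement
        canvas.paste(sub, (col * w, row * h))  # upper-left corner of this tile
    return canvas
```

With N = 4, each haystack image packs 16 sub-images, so M = 10 such images correspond to 160 candidate sub-images, matching the setting discussed below.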
Our findings underscore a considerable performance gap between models and, through negative samples (haystacks that contain no needle), reveal the hallucination problem in state-of-the-art MLLMs. For example, we find that:
- There is still a large performance gap between state-of-the-art API-based models and their open-source counterparts.
- Accuracy drops significantly with more images in the haystacks, even for state-of-the-art API-based MLLMs such as Claude 3 Opus and Gemini 1.0 Pro.
- All models (including Claude 3 Opus, Gemini 1.5 Pro, and GPT-4V) perform poorly in MMNeedle settings with sub-images (e.g., N × N = 2 × 2 = 4 sub-images in Fig. 1); this is true even for the best model, GPT-4o, whose accuracy drops from 97.00% for M = 10 images without sub-images (i.e., equivalent to 10 images in the haystack) to 26.90% for M = 10 images with N × N = 4 × 4 = 16 sub-images for each image (equivalent to 160 images in the haystack). See more results in Sec. 4.
The highlights of our MMNeedle benchmark include:
- Comprehensive Dataset. Our dataset ensures sufficient samples for each setting, totaling 40,000 images, 560,000 captions, and 280,000 needle-haystack pairs.
- Diverse Settings. Our benchmark covers diverse settings with varying context lengths, single and multiple needles, as well as positive and negative samples, among others (details in Sec. 3).
- Coarse-to-Fine Evaluation Metrics. We establish a set of evaluation metrics, including “existence accuracy”, “index accuracy”, and “exact accuracy”, to holistically evaluate MLLMs at the sequence, image, and sub-image levels (see the sketch after this list; details in Sec. 3.5).
- Wide Coverage. Our evaluation covers both state-of-the-art API-based and state-of-the-art open-source MLLMs, shedding light on their long-context capabilities.
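
The sketch below illustrates how these coarse-to-fine accuracies could be computed from parsed model outputs. The `Location` tuple, the helper `coarse_to_fine_accuracy`, and the normalization over all samples are simplifications for illustration; Sec. 3.5 gives the formal definitions used in the benchmark.

```python
# Illustrative sketch of sequence-, image-, and sub-image-level accuracy.
# The Location tuple and normalization over all samples are assumptions, not
# the benchmark's exact implementation (see Sec. 3.5 for the formal metrics).
from typing import NamedTuple, Optional, Sequence

class Location(NamedTuple):
    exists: bool          # does the answer claim a matching sub-image exists?
    index: Optional[int]  # image index within the haystack sequence
    row: Optional[int]    # sub-image row inside the stitched image
    col: Optional[int]    # sub-image column inside the stitched image

def coarse_to_fine_accuracy(preds: Sequence[Location],
                            golds: Sequence[Location]) -> dict:
    """Compare predictions to ground truth at three levels of granularity."""
    n = len(golds)
    existence = sum(p.exists == g.exists for p, g in zip(preds, golds)) / n
    index = sum(g.exists and p.index == g.index
                for p, g in zip(preds, golds)) / n
    exact = sum(g.exists and p.index == g.index
                and p.row == g.row and p.col == g.col
                for p, g in zip(preds, golds)) / n
    return {"existence_acc": existence, "index_acc": index, "exact_acc": exact}
```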
```bibtex
@misc{wang2024multimodal,
      title={Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models},
      author={Hengyi Wang and Haizhou Shi and Shiwei Tan and Weiyi Qin and Wenyuan Wang and Tunyu Zhang and Akshay Nambi and Tanuja Ganu and Hao Wang},
      year={2024},
      eprint={2406.11230},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```