Qwen-VL: Alibaba's Versatile Vision-Language Model Outperforms GPT-4V

An in-depth look at Qwen-VL, Alibaba's powerful vision-language model that surpasses GPT-4V and other models on various benchmarks, with a guide on running it locally.

Alibaba has recently introduced Qwen-VL, a series of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Built upon the foundation of Qwen-LM, Qwen-VL has been endowed with visual capabilities through a meticulously designed visual receptor, input-output interface, 3-stage training pipeline, and multilingual multimodal cleaned corpus.

Key Features and Capabilities of Qwen-VL

Qwen-VL accepts images, text, and bounding boxes as input, and outputs text and bounding boxes. It supports multilingual conversations in English, Chinese, and other languages, and can process multiple images in a conversation. Qwen-VL also supports high-resolution images up to millions of pixels and various aspect ratios.
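To make the bounding-box interface concrete: the Qwen-VL report describes grounded outputs as the referred phrase wrapped in <ref></ref> tags followed by a box written as <box>(x1,y1),(x2,y2)</box>, with coordinates normalized to a 0-1000 grid. The sketch below shows how such an answer can be parsed back into pixel coordinates; the answer string, image size, and coordinates are illustrative, not taken from a real model run:

```python
import re

# Illustrative grounded answer in Qwen-VL's output format: the located phrase sits in
# <ref></ref> and the box corners in <box></box>, normalized to a 0-1000 grid.
grounded_answer = "<ref>the dog</ref><box>(120,340),(505,880)</box>"

# Convert the normalized box back to pixel coordinates for a 1280x960 image.
match = re.search(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>", grounded_answer)
x1, y1, x2, y2 = (int(v) for v in match.groups())
width, height = 1280, 960
pixel_box = (x1 * width / 1000, y1 * height / 1000, x2 * width / 1000, y2 * height / 1000)
print(pixel_box)  # (153.6, 326.4, 646.4, 844.8)
```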

The model demonstrates strong visual reasoning, text recognition, and few-shot learning capabilities. It can accurately identify and describe elements within images, provide detailed background information, answer questions, and analyze complex visual content. Qwen-VL also handles reasoning-heavy tasks such as mathematical problem solving and in-depth interpretation of charts and graphs.

One of the standout features of Qwen-VL is its ability to engage in multimodal conversations. Users can provide a combination of text and images as input, and the model will generate relevant responses based on the context of the conversation. This enables more natural and intuitive interactions between humans and AI, as the model can understand and respond to visual cues in addition to textual prompts.

Qwen-VL's multilingual support is another significant advantage. The model has been trained on a diverse corpus of data in multiple languages, allowing it to understand and generate responses in languages such as English, Chinese, and others. This makes Qwen-VL a valuable tool for cross-cultural communication and global applications.

Benchmark Performance

Qwen-VL has achieved impressive results on various benchmarks, outperforming existing open-source large vision-language models (LVLMs) and even rivaling larger models like GPT-4V and Gemini Ultra.

On the VQAv2, OKVQA, and GQA benchmarks, Qwen-VL achieves accuracies of 79.5%, 58.6%, and 59.3% respectively, surpassing recent LVLMs. Qwen-VL-Max performs on par with Gemini Ultra and GPT-4V on various multimodal tasks, while Qwen-VL-Plus and Qwen-VL-Max significantly outperform previous best results from open-source models.

| Model | DocVQA | ChartQA | TextVQA | MMMU | MM-Bench-CN |
|---|---|---|---|---|---|
| Gemini Pro | 88.1% | 74.1% | 74.6% | 45.2% | 74.3% |
| Gemini Ultra | 90.9% | 80.8% | 82.3% | 53.0% | - |
| GPT-4V | 88.4% | 78.5% | 78.0% | 49.9% | 73.9% |
| Qwen-VL-Plus | 91.4% | 78.1% | 78.9% | 43.3% | 68.0% |
| Qwen-VL-Max | 93.1% | 79.8% | 79.5% | 51.0% | 75.1% |

Notably, Qwen-VL-Max outperforms both GPT-4V from OpenAI and Gemini from Google in tasks related to Chinese question answering and Chinese text comprehension. This highlights the model's strong performance in handling Chinese language tasks, making it a valuable resource for applications targeting Chinese-speaking users.

In addition to the benchmarks mentioned above, Qwen-VL has also demonstrated impressive results on other tasks such as image captioning, visual grounding, and visual reasoning. For example, on the Flickr30k dataset for image captioning, Qwen-VL achieves a BLEU-4 score of 41.2, surpassing the previous state-of-the-art models.

On the RefCOCO dataset for visual grounding, Qwen-VL attains an accuracy of 87.5%, outperforming existing models by a significant margin. This showcases the model's ability to accurately locate and identify objects within images based on textual descriptions.

Furthermore, Qwen-VL has shown strong performance on visual reasoning tasks such as the NLVR2 dataset, which requires the model to determine the truthfulness of a statement based on the provided image. Qwen-VL achieves an accuracy of 85.7% on this task, demonstrating its capability to reason about the relationships between objects and their attributes in images.

These benchmark results highlight the versatility and robustness of Qwen-VL across a wide range of vision-language tasks. The model's ability to excel in both English and Chinese tasks, as well as its strong performance on multimodal benchmarks, sets it apart from other vision-language models and positions it as a powerful tool for various applications.

Running Qwen-VL Locally

To run Qwen models locally, you can use the Ollama platform (Ollama's model library hosts the Qwen language models; the full Qwen-VL vision-language checkpoints can be run via Hugging Face Transformers, as shown later in this section). Here's a step-by-step guide:

  1. Install Ollama on your device. Download it from ollama.com, or on Linux run the official install script:

    curl -fsSL https://ollama.com/install.sh | sh
  2. Pull and run a Qwen model, choosing the size that fits your hardware (0.5B to 72B are available):

    ollama run qwen:7b
  3. Alternatively, you can import your own GGUF files by registering them with a Modelfile (the name my-qwen below is just a placeholder):

    echo 'FROM path/to/your/model.gguf' > Modelfile
    ollama create my-qwen -f Modelfile
    ollama run my-qwen
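Once a model has been pulled, you can also talk to it programmatically through the official Ollama Python client (installed with pip install ollama). A minimal sketch, assuming the Ollama server is running locally and qwen:7b has already been pulled:

```python
# Minimal sketch using the official Ollama Python client (pip install ollama).
# Assumes the Ollama server is running and `ollama run qwen:7b` has pulled the model.
import ollama

response = ollama.chat(
    model="qwen:7b",
    messages=[{"role": "user", "content": "Give me a one-sentence summary of Qwen-VL."}],
)
print(response["message"]["content"])
```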

Here's a sample Python snippet for interacting with the Qwen-VL-Chat model through Hugging Face Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the chat-tuned Qwen-VL checkpoint from Hugging Face.
# trust_remote_code=True is required because the model ships its own modeling code.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Greeting Qwen with no conversational history
response, history = model.chat(tokenizer, "Hello Qwen!", history=None)
print("Qwen:", response)

# Passing along the history for context
response, history = model.chat(tokenizer, "Any thoughts on the meaning of life, the universe, and everything?", history=history)
print("Qwen:", response)

# Providing an image and a question: images are attached by building the query
# with the tokenizer's from_list_format helper
query = tokenizer.from_list_format([
    {"image": "path/to/your/image.jpg"},
    {"text": "What objects can you see in this image?"},
])
response, history = model.chat(tokenizer, query, history=history)
print("Qwen:", response)

In the above code snippet, we first load the tokenizer and the chat-tuned Qwen-VL checkpoint ("Qwen/Qwen-VL-Chat") from Hugging Face using the transformers library. The trust_remote_code=True flag is needed because Qwen-VL ships its own modeling and tokenizer code.

To interact with the model, we use the chat method, which takes the tokenizer, a query (plain text, or text combined with one or more images), and the conversation history as arguments. The model generates a response based on the provided input and returns it along with the updated conversation history.

We can start a conversation by greeting Qwen without any prior history. The model will generate a response based on the greeting. We can then pass along the conversation history to maintain context in subsequent interactions.

To provide an image as input, we build the query with the tokenizer's from_list_format helper, passing the image path (or URL) alongside the question. Qwen-VL then generates a response based on the visual content and the question.
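The Qwen-VL-Chat model card also describes a tokenizer helper for visualizing grounded answers (the <ref> and <box> tags mentioned earlier). A minimal sketch, reusing the model, tokenizer, and history objects from the snippet above; the prompt and output filename are illustrative:

```python
# Ask the model to locate an object; grounded answers embed <ref>/<box> tags.
response, history = model.chat(tokenizer, "Draw a box around the dog in the image.", history=history)
print("Qwen:", response)

# Helper shipped with the Qwen-VL-Chat tokenizer (loaded via trust_remote_code):
# it draws the predicted box on the most recent image in the conversation.
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image is not None:
    image.save("grounded_output.jpg")
else:
    print("No bounding box found in the response.")
```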

Qwen-VL is also accessible via Hugging Face, ModelScope, API, and other platforms, making it convenient for researchers and developers to leverage its powerful capabilities.
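For the hosted Qwen-VL-Plus and Qwen-VL-Max models, the API route goes through Alibaba Cloud's DashScope service. A rough sketch, assuming the dashscope Python SDK (pip install dashscope) and an API key exported as DASHSCOPE_API_KEY; consult the DashScope documentation for the exact request and response schema:

```python
# Rough sketch of calling Qwen-VL-Plus through Alibaba Cloud's DashScope API.
# Assumes `pip install dashscope` and DASHSCOPE_API_KEY set in the environment.
from dashscope import MultiModalConversation

messages = [{
    "role": "user",
    "content": [
        # Publicly reachable image URL (placeholder); local files may need a file:// prefix.
        {"image": "https://example.com/your-image.jpg"},
        {"text": "What objects can you see in this image?"},
    ],
}]

response = MultiModalConversation.call(model="qwen-vl-plus", messages=messages)
print(response)  # Inspect the full response object; see the DashScope docs for its schema.
```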

Potential Applications and Impact

Qwen-VL's impressive performance and versatility open up a wide range of potential applications across industries. It can enhance multimodal AI systems with advanced visual understanding, enable more natural human-computer interaction via images and text, and power new applications in areas like visual search, image analysis, and more.

For example, Qwen-VL can be used to develop intelligent image retrieval systems that allow users to search for images based on natural language queries. By understanding the content and context of images, Qwen-VL can provide more accurate and relevant search results compared to traditional keyword-based image search engines.

In the field of e-commerce, Qwen-VL can be applied to enhance product recommendations and personalization. By analyzing product images and user preferences, the model can suggest visually similar or complementary products to customers, improving their shopping experience and increasing sales.

Qwen-VL can also be utilized in the development of intelligent virtual assistants and chatbots. By integrating visual understanding capabilities, these assistants can provide more contextually relevant responses and engage in more natural conversations with users. For instance, a user could send an image of a product they are interested in, and the virtual assistant could provide information, reviews, and recommendations based on the visual content.

In the realm of education, Qwen-VL can be employed to create interactive learning materials and assessments. The model can generate questions and explanations based on educational images, diagrams, and charts, making learning more engaging and effective for students.

Moreover, Qwen-VL has the potential to revolutionize the way we interact with and consume visual media. With its ability to understand and describe images, the model can be used to generate automatic captions, summaries, and translations for images and videos. This can greatly enhance accessibility for visually impaired individuals and bridge language barriers in global communication.

As Alibaba continues to refine and expand the capabilities of Qwen-VL, we can expect it to make significant contributions to the field of vision-language AI. With its strong performance, ease of access, and potential for driving innovation, Qwen-VL is poised to become a key player in the development of multimodal AI systems.

In conclusion, Qwen-VL represents a major milestone in the advancement of vision-language models. Its exceptional performance across benchmarks, its versatility, and its easy accessibility make it a powerful tool for researchers, developers, and businesses alike. As the field of multimodal AI continues to evolve and as developers explore its wide-ranging applications, Qwen-VL is well-positioned to play a significant role in shaping that future.
