WizardLM 2: Microsoft's Next Generation of State-of-the-Art Large Language Models

Microsoft has recently introduced and open-sourced WizardLM 2, their next generation of state-of-the-art large language models (LLMs). This new family includes three cutting-edge models: WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B, which have shown improved performance in complex chat, multilingual, reasoning, and agent capabilities.

The Evolution of WizardLM

WizardLM 2 is the latest milestone in Microsoft's effort to scale up LLM post-training. Over the past year, the company has iterated on the training of the Wizard series, starting with work on empowering large language models to follow complex instructions and then extending the approach to code and math reasoning scenarios. As a result, Evol-Instruct and instruction- and process-supervised reinforcement learning (RLEIF) have become fundamental technologies for the GenAI community.

WizardLM 2 Models

The WizardLM 2 family consists of three models:

  1. WizardLM-2 8x22B: Microsoft's most advanced model and, in their internal evaluation, the best open-source LLM for highly complex tasks.
  2. WizardLM-2 70B: This model reaches top-tier reasoning capabilities and is the first choice in its size category.
  3. WizardLM-2 7B: The fastest model, achieving performance comparable to leading open-source models that are 10 times larger.

Method Overview

As sources of human-generated data become increasingly exhausted, Microsoft believes that data carefully created by AI, and models supervised by AI, will be the sole path toward more powerful AI. To achieve this, they have built a fully AI-powered synthetic training system.

Data Pre-Processing

The data pre-processing pipeline consists of the following steps:

  1. Data Analysis: This step helps to understand the distribution of different attributes in the new source data.
  2. Weighted Sampling: The distribution of the best training data does not always match the natural distribution of human chat corpora, so the weights of various attributes in the training data are adjusted based on empirical results.
  3. Progressive Learning: Unlike the common practice of using all data for one-time training, Microsoft found that training stage-by-stage on different data partitions achieves better results with less data. Both ideas are sketched below.
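
Microsoft has not released this pipeline's code, so the following is only a minimal sketch of weighted sampling and stage-wise partitioning; the attribute names, weights, and the `train_one_stage` placeholder are all invented for illustration.

```python
import random

# Hypothetical training examples tagged with an "attribute" such as domain.
# All attribute names and weights are invented for illustration.
corpus = [
    {"attribute": "code", "text": "..."},
    {"attribute": "math", "text": "..."},
    {"attribute": "chat", "text": "..."},
    {"attribute": "chat", "text": "..."},
]

# Weighted sampling: up-weight attributes that are under-represented in
# natural chat corpora but valuable for training.
weights = {"code": 3.0, "math": 3.0, "chat": 1.0}

def weighted_sample(corpus, weights, k):
    """Draw k examples, biased by per-attribute weights."""
    w = [weights[ex["attribute"]] for ex in corpus]
    return random.choices(corpus, weights=w, k=k)

# Progressive learning: split the data into stages and train on one
# partition at a time rather than on everything at once.
def make_stages(corpus, n_stages):
    """Partition the corpus into n_stages roughly equal slices."""
    return [corpus[i::n_stages] for i in range(n_stages)]

for stage, partition in enumerate(make_stages(corpus, n_stages=2)):
    batch = weighted_sample(partition, weights, k=2)
    print(f"stage {stage}: sampled {len(batch)} examples")
    # train_one_stage(model, batch)  # placeholder for the real training step
```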

Evol Lab

The Evol Lab is responsible for generating more diverse and complex [instruction, response] pairs. It consists of two main components:

  1. Evol-Instruct: This method enables various agents to automatically generate high-quality instructions.
  2. Evol-Answer: Guiding the model to generate and rewrite responses over multiple rounds improves their logic, correctness, and affinity. A sketch of the combined loop follows this list.
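
Microsoft has not published the exact evolution prompts used for WizardLM 2, so this is only a minimal sketch of the Evol-Instruct/Evol-Answer loop; the `llm` function and both prompt templates are hypothetical stand-ins.

```python
# Minimal sketch of the Evol-Instruct / Evol-Answer loop. `llm` is a
# hypothetical stand-in for any chat-model completion call.
def llm(prompt: str) -> str:
    """Replace with a real model API call."""
    return f"<model output for: {prompt[:40]}...>"

EVOLVE_INSTRUCTION = (
    "Rewrite the following instruction to make it more complex, e.g. by "
    "adding constraints or requiring multi-step reasoning, while keeping "
    "it answerable:\n\n{instruction}"
)

EVOLVE_ANSWER = (
    "Here is an instruction and a draft answer. Rewrite the answer to "
    "improve its logic and correctness:\n\nInstruction: {instruction}\n"
    "Draft answer: {answer}"
)

def evolve_pair(seed_instruction: str, rounds: int = 2):
    """Evolve one [instruction, response] pair over several rounds."""
    instruction = seed_instruction
    for _ in range(rounds):
        # Evol-Instruct: make the instruction more diverse and complex.
        instruction = llm(EVOLVE_INSTRUCTION.format(instruction=instruction))
    # Evol-Answer: generate a response, then rewrite it to improve it.
    answer = llm(instruction)
    answer = llm(EVOLVE_ANSWER.format(instruction=instruction, answer=answer))
    return instruction, answer

print(evolve_pair("Explain binary search."))
```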

AI Align AI (AAA)

AI Align AI (AAA) is a framework that brings together WizardLM and various state-of-the-art models to co-teach and improve each other. It consists of two main components:

  1. Co-Teaching: The models engage in simulated chat, quality judging, improvement suggestions, and skill-gap closing to teach and improve each other; one plausible shape of this loop is sketched after the list.
  2. Self-Teaching: WizardLM can generate new evolution training data for supervised learning and preference data for reinforcement learning via active learning from itself.
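
The announcement describes AAA only at a high level; the sketch below shows one plausible shape of a co-teaching round, and every prompt and helper in it is invented for illustration rather than taken from Microsoft's implementation.

```python
# One plausible co-teaching round (all prompts and helpers are invented).
# Each argument is a callable taking a prompt string and returning a reply.
def co_teach_round(student, teacher, judge, instruction):
    student_answer = student(instruction)
    teacher_answer = teacher(instruction)

    # Quality judging: ask a judge model which answer is better and what
    # the weaker one should fix.
    verdict = judge(
        f"Instruction: {instruction}\n"
        f"Answer A: {student_answer}\nAnswer B: {teacher_answer}\n"
        "Which answer is better, and what should the weaker one improve?"
    )

    # Improvement suggestions: the student revises its answer using the
    # feedback, closing the skill gap against the stronger model.
    revised = student(
        f"Instruction: {instruction}\nYour answer: {student_answer}\n"
        f"Feedback: {verdict}\nRewrite your answer to address the feedback."
    )
    return revised  # candidate supervised / preference training data
```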

Learning

The learning process involves three main steps:

  1. Supervised Learning: The models are first trained on labeled data.
  2. Stage-DPO: For more effective offline reinforcement learning, the preference data is split into different slices, and the model is improved progressively, stage by stage; a sketch of this pattern follows the list.
  3. RLEIF: This approach combines an instruction quality reward model (IRM) with a process supervision reward model (PRM) to achieve more precise correctness in online reinforcement learning.
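
Stage-DPO itself is not open-sourced; the sketch below pairs the standard DPO loss (Rafailov et al., 2023) with the stage-by-stage slicing the description suggests. The training-loop structure in the comments is assumed, not taken from Microsoft's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss (Rafailov et al., 2023). Stage-DPO, as described,
    applies offline preference optimization over successive data slices
    instead of over the whole preference set at once."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy check with random log-probabilities:
lp = torch.randn(4)
print(dpo_loss(lp, lp - 1.0, lp, lp))

# Stage-wise structure (schematic):
# for stage_slice in preference_slices:      # one slice per stage
#     for batch in stage_slice:
#         loss = dpo_loss(*log_probs(policy, ref_model, batch))
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```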

WizardLM 2 Capabilities

To evaluate the performance of WizardLM 2, Microsoft conducted both human and automatic evaluations, comparing their models against diverse baselines. The results show that WizardLM 2 is highly competitive with leading proprietary models and consistently outperforms all existing state-of-the-art open-source models.

Human Preferences Evaluation

In a blind pairwise comparison, WizardLM 2 models were evaluated against baselines using a complex and challenging set of real-world instructions. The results showed that:

  1. WizardLM-2 8x22B is only slightly behind GPT-4-1106-preview and significantly stronger than Command R Plus and GPT-4-0314.
  2. WizardLM-2 70B is better than GPT-4-0613, Mistral-Large, and Qwen1.5-72B-Chat.
  3. WizardLM-2 7B is comparable with Qwen1.5-32B-Chat and surpasses Qwen1.5-14B-Chat and Starling-LM-7B-beta.

MT-Bench

Microsoft also adopted the automatic MT-Bench evaluation framework, which uses GPT-4 as the judge, to assess the performance of their models. The results show that WizardLM-2 8x22B is highly competitive with the most advanced proprietary models such as GPT-4-Turbo and Claude-3, while WizardLM-2 7B and WizardLM-2 70B are the top-performing models among the leading baselines at the 7B to 70B scale.

Usage

The model weights of WizardLM-2 8x22B and WizardLM-2 7B are shared on Hugging Face, and WizardLM-2 70B and a demo of all the models will be available in the coming days. To guarantee generation quality, users should strictly use the system prompts provided by Microsoft.

WizardLM-2 adopts the prompt format from Vicuna and supports multi-turn conversation. The prompt should be as follows:

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: Hi
ASSISTANT: Hello.
USER: Who are you?
ASSISTANT: I am WizardLM.
...
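
For illustration, here is a minimal sketch of running the 7B model with this prompt format via Hugging Face Transformers. The repository id microsoft/WizardLM-2-7B is an assumption; check the official release for the exact identifier.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed for illustration; verify against the official release.
model_id = "microsoft/WizardLM-2-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build the Vicuna-style multi-turn prompt exactly as shown above.
system = ("A chat between a curious user and an artificial intelligence "
          "assistant. The assistant gives helpful, detailed, and polite "
          "answers to the user's questions.")
prompt = f"{system} USER: Who are you? ASSISTANT:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```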

Microsoft also provides WizardLM-2 inference demo code in their GitHub repository.

In conclusion, WizardLM 2 represents a significant advancement in large language models, showcasing improved performance in complex chat, multilingual, reasoning, and agent capabilities. By leveraging AI-powered synthetic training systems and innovative learning techniques, Microsoft has pushed the boundaries of what open-source language models can achieve.
