

RedPajama-Data-V2: The Game-Changer for Open-Source Large Language Models


Explore RedPajama-Data-V2, the largest open dataset for training LLMs. With 30 trillion tokens, extensive quality annotations, and a commitment to open-source development, RedPajama is set to transform the landscape of AI and democratize access to cutting-edge language models.

Introduction

In the rapidly evolving landscape of artificial intelligence, the development of powerful large language models (LLMs) has been largely dominated by commercial entities. However, the open-source community has been making significant strides in democratizing access to cutting-edge AI technology. Among the numerous initiatives driving this change, the RedPajama project stands out as a beacon of innovation and collaboration.

RedPajama, a joint effort by Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute, aims to create leading, fully open-source LLMs that rival their proprietary counterparts. The project's latest milestone, RedPajama-Data-V2, is set to transform the way open-source LLMs are trained by providing an unprecedented dataset of 30 trillion tokens.


RedPajama-Data-V2: A Game-Changer for Open-Source LLMs

RedPajama-Data-V2 is a massive web dataset specifically designed for training LLMs. It includes over 100 billion text documents sourced from 84 CommonCrawl snapshots, processed using the CCNet pipeline. Out of these documents, 30 billion come with pre-computed quality signals, and 20 billion are deduplicated.

The dataset covers five languages: English, French, Spanish, German, and Italian. The number of documents and tokens for the annotated and deduplicated head_middle part of the dataset is as follows:

Language | # Documents | Estimated Token Count (deduplicated)
English  | 14.5B       | 20.5T
German   | 1.9B        | 3.0T
French   | 1.6B        | 2.7T
Spanish  | 1.8B        | 2.8T
Italian  | 0.9B        | 1.5T
Total    | 20.8B       | 30.4T

The inclusion of 40+ pre-computed quality annotations sets RedPajama-Data-V2 apart from other datasets. These annotations allow users to further filter and weigh the data according to their specific needs, providing unprecedented flexibility and customization options.

RedPajama-Data-V2 is currently the largest public dataset specifically designed for LLM training, with over 100 billion raw documents and more than 30 trillion tokens remaining after deduplication and filtering, spread across the five languages listed above.

In comparison, other prominent open datasets are far smaller: The Pile (used to train the EleutherAI models) contains roughly 0.3 trillion tokens, the original RedPajama-1T dataset about 1.2 trillion, and Llama 2 was trained on about 2 trillion tokens. RedPajama-Data-V2's scale is unmatched, providing an abundance of high-quality training data.

Why RedPajama-Data-V2 is So Good

What truly sets RedPajama-Data-V2 apart is the inclusion of over 40 pre-computed quality annotations for each document. These annotations cover various categories like perplexity scores, classifier predictions, natural language metrics, content toxicity indicators, and more.

  • High Quality: Having these quality signals allows researchers to easily filter and weight the dataset according to their specific needs. Other datasets typically apply fixed heuristics, limiting downstream customization. RedPajama-Data-V2 provides unprecedented flexibility to create tailored subsets optimized for different LLM applications.

  • Fully Open Source: The RedPajama project is fully open-source, with all data processing scripts available on GitHub and the dataset hosted on HuggingFace. This transparency enables the community to understand, reproduce, and build upon the dataset.

  • Reduces the Cost of LLM Training: The scale and quality of RedPajama-Data-V2 have the potential to significantly reduce the computational cost of training powerful LLMs. With a vast pool of informative data to draw from, models can reach strong performance with fewer parameters and less compute.

Setting Up RedPajama-Data-V2

To get started with RedPajama-Data-V2, you can load the sample dataset using the following Python code:

from datasets import load_dataset
 
ds = load_dataset("togethercomputer/RedPajama-Data-V2", name="sample")
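
Beyond the sample, the Hugging Face dataset card also describes loading specific partitions, snapshots, and languages directly through load_dataset. The argument names below (name, partition, snapshots, languages) follow the dataset card at the time of writing; treat them as assumptions and verify them against the card for your version of the datasets library:

from datasets import load_dataset

# Stream one snapshot/language/partition combination instead of downloading
# the full multi-terabyte configuration up front.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="default",
    partition="head_middle",
    snapshots=["2023-06"],
    languages=["en"],
    streaming=True,
)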

You can also download the data directly. To get a specific combination of partition x snapshot_id x language, first fetch the corresponding URL listing (here, the minhash listings) and then download the files you need:

wget "https://data.together.xyz/redpajama-data-v2/v1.0.0/urls/minhash-urls.txt" -O "minhash-urls.txt"
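
Once a listing is on disk, a minimal sketch for narrowing it down to one snapshot, language, and partition might look like the following. It assumes each URL embeds the snapshot (e.g. "2023-06"), the language code, and the partition name in its path, which you should verify against the actual contents of the listing:

# Hypothetical filter over the downloaded listing; adjust the substrings to match
# the real URL layout in minhash-urls.txt.
wanted_snapshot, wanted_lang, wanted_part = "2023-06", "en", "head"

with open("minhash-urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

selected = [u for u in urls if wanted_snapshot in u and f"{wanted_lang}_{wanted_part}" in u]
print(f"{len(selected)} of {len(urls)} files match")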

Running the RedPajama-Data-V2 Pipeline

The RedPajama-Data-V2 pipeline consists of three main steps:

  1. Preparing artifacts
  2. Computing quality signals
  3. Deduplication

Step 1: Preparing Artifacts

This step creates the artifacts used in subsequent steps: building the quality classifiers, training bag-of-ngrams generative models for importance-weight computation, fetching the list of bad words from the LDNOOBW repo, and fetching the most recent list of blacklisted URLs from the UT1 blacklist.

To create the artifacts, set the environment variables in the config file and run:

bash scripts/run_prep_artifacts.sh \
  --config configs/rp_v2.0.conf \
  --listings /path/to/listings/file.txt \
  --max_workers 32

Step 2: Computing Quality Signals

The second step computes the quality signals, including the minhash signatures for fuzzy deduplication. Set the environment variables in the config file and run:

bash scripts/apptainer_run_quality_signals.sh \
  --config configs/rp_v2.0.conf \
  --dump_id "2022-49" \
  --input_base_uri "file:///path/to/data/root" \
  --output_base_uri "file:///path/to/output/data/root" \
  --max_docs -1

Step 3: Deduplication

The third step involves exact and fuzzy deduplication. For exact deduplication using a Bloom filter, run:

python3 app/src/bloomfilter.py \
  --listings /path/to/listings/file.txt \
  --input_base_uri "s3://path/to/ccnet/data" \
  --output_dir "/path/to/output" \
  --s3_profile "..." \
  --endpoint_url "..." \
  --parallel_readers 32 \
  --batch_size 10 \
  --capacity "..." \
  --error_rate "..."
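
The script above wraps the project's own implementation. Purely as an illustration of the underlying idea: a Bloom filter keeps a compact bit array and k hash functions, so membership tests can return false positives but never false negatives, which is enough to drop exact duplicates cheaply at this scale. A minimal, self-contained sketch (not the pipeline's code):

import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter; real pipelines size this from capacity/error_rate."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two independent digests.
        h1 = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Keep a document only if its content has not been seen before.
seen = BloomFilter(num_bits=10_000_000, num_hashes=7)
unique_docs = []
for doc in ["hello world", "foo bar", "hello world"]:
    if not seen.might_contain(doc):
        seen.add(doc)
        unique_docs.append(doc)
print(len(unique_docs))  # 2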

For fuzzy deduplication with locality sensitive hashing, run:

bash scripts/apptainer_run_lsh.sh \
  --config configs/rp_v2.0.conf \
  --dump_id "2022-49" \
  --input_base_uri "file:///path/to/data/root" \
  --output_dir "/path/to/output" \
  --similarity "<similarity_threshold>" \
  --listings "/minhash/listings/file.txt" \
  --max_docs -1
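
Conceptually, fuzzy deduplication hashes each document's shingle set with MinHash and uses locality-sensitive hashing to bucket documents whose estimated Jaccard similarity exceeds the chosen threshold. The pipeline has its own implementation; the sketch below only illustrates the idea using the third-party datasketch library:

from datasketch import MinHash, MinHashLSH

def shingles(text: str, n: int = 3):
    # Character n-grams are one common unit for near-duplicate detection.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",  # near duplicate of "a"
    "c": "completely different content about language models",
}

# Bucket documents whose estimated Jaccard similarity exceeds the threshold.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for key, text in docs.items():
    lsh.insert(key, minhash(text))

print(lsh.query(minhash(docs["a"])))  # expected to include both "a" and "b"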

RedPajama-Data-V2 Quality Signals

RedPajama-Data-V2 includes a comprehensive set of quality signals, which can be categorized into the following groups:

  • CCNet: Signals derived from the CCNet pipeline, such as perplexity score, language identification, and document length.
  • ML Heuristics: Signals based on machine learning models, such as classifiers for Wikipedia-like pages and importance resampling.
  • Natural Language: Signals related to the content's linguistic properties, such as word count, sentence count, and fraction of unique words.
  • Repetitiveness: Signals measuring the repetitiveness of the content, such as the fraction of characters in duplicate n-grams.
  • Toxicity: Signals indicating potentially toxic content, such as the presence of words from the LDNOOBW blocklist and categories from the UT1 blacklist.
  • Deduplication: Minhash signatures for fuzzy deduplication at various Jaccard similarity thresholds.

A detailed list of all quality signals can be found in the RedPajama-Data-V2 documentation.
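
In practice, building a custom subset can be as simple as streaming the data and keeping only documents that pass your chosen thresholds. The sketch below is illustrative: the signal name "ccnet_perplexity", the field names, the [start, end, score] span format, and the threshold are all assumptions to check against the documentation for your version of the dataset:

import json
from datasets import load_dataset

# Stream the sample configuration so nothing large is downloaded up front.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2", name="sample", split="train", streaming=True
)

MAX_PERPLEXITY = 500.0  # illustrative threshold

def keep(example):
    # quality_signals may be stored as a JSON string mapping signal names to
    # [start, end, score] spans; both the key and the format are assumptions here.
    signals = example["quality_signals"]
    if isinstance(signals, str):
        signals = json.loads(signals)
    spans = signals.get("ccnet_perplexity") or []
    return bool(spans) and spans[0][2] < MAX_PERPLEXITY

filtered = ds.filter(keep)
for example in filtered.take(3):
    print(example["doc_id"])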

The Future of RedPajama and Open-Source LLMs

RedPajama-Data-V2 marks a significant milestone in the development of open-source LLMs. By providing a vast, high-quality dataset with extensive annotations, the project aims to lower the barriers to entry for researchers and organizations seeking to build powerful language models.

The RedPajama team envisions expanding the dataset with additional quality annotations, such as contamination annotations, topic modeling, and document categorization. They actively encourage community involvement in suggesting and developing new annotations to further enhance the dataset's utility.

In parallel with the dataset development, Together is building open models based on RedPajama-Data-V2. These models will be fully open-source and commercially viable, providing a clean-room, drama-free alternative to existing LLMs. The project also offers assistance to companies and organizations interested in building custom models using a combination of open and proprietary data.

Conclusion

RedPajama-Data-V2 represents a significant step forward in the democratization of AI technology. By providing a fully open-source, high-quality dataset for training LLMs, the project empowers researchers, developers, and organizations to create powerful language models without the limitations imposed by proprietary APIs.

As the RedPajama project continues to grow and evolve, it holds the potential to reshape the landscape of AI, fostering innovation, collaboration, and accessibility. With the support and involvement of the AI community, RedPajama is well-positioned to become a catalyst for the next generation of LLMs and beyond.
