Accelerating Transformer Inference with ctransformers
Introduction
In recent years, Transformer-based language models have revolutionized natural language processing, enabling breakthroughs in tasks like language generation, question answering, and text classification. However, these models are often extremely large, with billions or even trillions of parameters, making them computationally expensive to run, especially on CPU.
ctransformers is a Python library that aims to make deploying these large language models more efficient and accessible. It provides Python bindings for Transformer models implemented in optimized C/C++ code, leveraging techniques like quantization and the use of AVX instructions to significantly speed up inference on CPU hardware.
With ctransformers, it's possible to load and run models like GPT-2, GPT-J, GPT-NeoX, Llama, and more with just a few lines of Python code. The library offers a simple, unified interface for various models, integration with the Hugging Face Hub and LangChain framework, and access to low-level APIs for more fine-grained control.
What is ctransformers?
Under the hood, ctransformers utilizes the GGML library, which is a tensor library focused on running ML models on CPU. GGML provides efficient implementations of common operations like matrix multiplication, especially for quantized data types. By combining GGML with model-specific optimized kernels, ctransformers is able to achieve impressive performance.
One of the most exciting applications of ctransformers is the ability to run large open-source models like Llama 2 on consumer hardware. This opens up possibilities for cost-effective and environmentally friendly deployment of large language models, making them more accessible to a wider range of users and applications.
In this article, we'll dive into the technical details of ctransformers, exploring its features, performance characteristics, and API. We'll walk through code examples showing how to load models, generate text, and integrate with LangChain. Finally, we'll discuss the implications and potential of efficient CPU inference for the future of NLP and AI.
Some key features of ctransformers:
- Unified interface for loading and running various models
- Support for running models from Hugging Face Hub
- Integration with LangChain framework
- Access to low-level C API for more control
- Optimized CPU inference using AVX instructions
Installation of ctransformers
To install ctransformers, simply use pip:
pip install ctransformers
For GPU support, install with the CT_CUBLAS environment variable set:
CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers
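With a cuBLAS-enabled build, some of the model's layers can be offloaded to the GPU at load time. Here is a minimal sketch, assuming the gpu_layers option available in recent ctransformers releases (the repo name is just an example):
from ctransformers import AutoModelForCausalLM
# Sketch: offload the first 50 layers to the GPU (requires the cuBLAS build).
llm = AutoModelForCausalLM.from_pretrained(
    'TheBloke/Llama-2-7B-GGML',
    model_type='llama',
    gpu_layers=50
)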
Basic Usage of ctransformers
The main class for loading and running models is AutoModelForCausalLM. Here's how to load a model:
from ctransformers import AutoModelForCausalLM
# Load from local file
llm = AutoModelForCausalLM.from_pretrained('path/to/ggml-model.bin', model_type='gpt2')
# Load from Hugging Face Hub
llm = AutoModelForCausalLM.from_pretrained('marella/gpt-2-ggml')
The model_type argument specifies the type of model being loaded. Options include gpt2, gptj, gpt_neox, dolly-v2, starcoder, and others.
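When a Hugging Face repo contains several GGML files (for example, different quantization levels), a specific file can be selected with the model_file argument. The filename below is only illustrative; check the repo's file listing for the actual name:
from ctransformers import AutoModelForCausalLM
# Pick a specific quantized file from the repo (filename is an example).
llm = AutoModelForCausalLM.from_pretrained(
    'marella/gpt-2-ggml',
    model_file='ggml-model.bin'
)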
To generate text, simply call the model:
output = llm("AI is going to")
print(output)
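Generation parameters can also be passed per call, and setting stream=True returns the text incrementally as it is produced (parameter names follow the ctransformers README):
# Override generation settings for a single call
output = llm("AI is going to", max_new_tokens=50, temperature=0.8)
# Stream the output piece by piece instead of waiting for the full text
for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)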
For more control, there is a generator interface:
tokens = llm.tokenize("AI is going to")
for token in llm.generate(tokens):
    print(llm.detokenize(token))
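For full step-by-step control, the LLM object also exposes lower-level methods such as eval, sample, and is_eos_token. The loop below is a rough sketch of manual decoding; the method names come from the ctransformers API but should be verified against the version you have installed:
tokens = llm.tokenize("AI is going to")
llm.eval(tokens)                  # run the prompt through the model
for _ in range(20):
    token = llm.sample()          # sample the next token id
    if llm.is_eos_token(token):   # stop at end-of-sequence
        break
    print(llm.detokenize(token), end="", flush=True)
    llm.eval([token])             # feed the sampled token back in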
LangChain Integration with ctransformers
ctransformers provides a wrapper to use models with the LangChain framework:
from ctransformers.langchain import CTransformers
llm = CTransformers(model='marella/gpt-2-ggml')
# Use with LangChain primitives
from langchain import PromptTemplate, LLMChain
template = """Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=['question'])
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is AI?"
print(llm_chain.run(question))
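The wrapper also accepts generation settings through a config dictionary, with keys matching the ctransformers configuration options:
# Pass generation settings to the LangChain wrapper via a config dict
llm = CTransformers(
    model='marella/gpt-2-ggml',
    config={'max_new_tokens': 256, 'temperature': 0.5}
)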
Running Llama Models
ctransformers can run the open-source Llama models in GGML format. Here's an example using the Llama 2 model:
from ctransformers import AutoModelForCausalLM
model_id = "TheBloke/Llama-2-7B-GGML"
config = {
    'max_new_tokens': 256,
    'repetition_penalty': 1.1,
    'temperature': 0.1
}
llm = AutoModelForCausalLM.from_pretrained(
    model_id,
    model_type="llama",
    **config  # generation settings are passed as keyword arguments
)
prompt = "Write a poem to help me remember the first 10 elements on the periodic table"
output = llm(prompt)
print(output)
This loads the 7B parameter Llama 2 model converted to GGML format and generates a poem from the given prompt.
The config dictionary collects generation parameters such as the maximum number of new tokens, the repetition penalty, and the sampling temperature, which are passed to the model as keyword arguments.
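Other options from the same configuration can be set in the same way, for instance the number of CPU threads or the context length. A small sketch, assuming the threads and context_length options are supported by your ctransformers version:
# Sketch: use 8 CPU threads and a 2048-token context window
llm = AutoModelForCausalLM.from_pretrained(
    model_id,
    model_type="llama",
    threads=8,
    context_length=2048,
    **config
)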
Conclusion
ctransformers provides an easy and efficient way to run large language models on CPU using optimized C/C++ implementations under the hood. With a simple Python API, integration with Hugging Face Hub and LangChain, and support for a variety of models, it's a powerful tool for building applications powered by Transformers.
The ability to run models like Llama 2 on CPU with reasonable performance opens up new possibilities for cost-effective and environmentally friendly deployment of large language models. As the ecosystem around open-source Transformer models continues to grow, libraries like ctransformers will play an important role in making them accessible and practical to use.