Scalable Question Answering Over Large Documents with LangChain and Vertex AI PaLM

Published on 4/30/2024

This article explores how to build a scalable question answering system for large documents by combining the LangChain framework with Google's Vertex AI PaLM API.

Introduction to Question Answering Over Large Documents with LLMs

Question answering (QA) is a key natural language processing task that aims to automatically answer questions posed by humans in natural language. While large language models (LLMs) like PaLM have shown impressive QA capabilities, they are limited by the amount of context that can fit within their token limit (typically a few thousand tokens). This presents a challenge for QA over large documents that may span many pages.

In this article, we'll explore how to build a scalable QA system for large documents by combining the LangChain framework with Google's Vertex AI PaLM API. We'll cover several methods including:

Stuffing - Pushing the full document as context
Map-Reduce - Splitting documents into chunks and processing in parallel
Refine - Iteratively refining an answer over document chunks
Similarity Search - Using vector embeddings to find relevant chunks

We'll compare the strengths and limitations of each approach. The full code is available in this Colab notebook.

Let's compare metrics for each method on our 50-page sample document:

Method	Relevant Docs	LLM Calls	Total Tokens	Answer Quality
Stuffing	3 pages	1	8432	Good
Map-Reduce	50 pages	51	63019	Okay
Refine	50 pages	50	71209	Good
Similarity Search	4 pages	5	5194	Great

The similarity search approach finds a high-quality answer with roughly 10x fewer pages, LLM calls, and tokens than the full-document methods (map-reduce and refine). This gap would likely widen further on larger datasets.
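The "roughly 10x" figure can be checked directly from the table. The short sketch below only restates the table's numbers and divides them; the dictionary layout and the function name are illustrative choices, not part of the original notebook.

```python
# Figures copied from the comparison table above.
metrics = {
    "Stuffing": {"pages": 3, "llm_calls": 1, "tokens": 8432},
    "Map-Reduce": {"pages": 50, "llm_calls": 51, "tokens": 63019},
    "Refine": {"pages": 50, "llm_calls": 50, "tokens": 71209},
    "Similarity Search": {"pages": 4, "llm_calls": 5, "tokens": 5194},
}


def reduction_factor(baseline: str, candidate: str, key: str) -> float:
    """Return how many times smaller the candidate's value is than the baseline's."""
    return metrics[baseline][key] / metrics[candidate][key]


# Compare similarity search against the two full-document methods.
for baseline in ("Map-Reduce", "Refine"):
    for key in ("pages", "llm_calls", "tokens"):
        factor = reduction_factor(baseline, "Similarity Search", key)
        print(f"{baseline} vs Similarity Search, {key}: {factor:.1f}x")
```

Every ratio lands between about 10x and 14x, which is where the rounded "10x" comes from. Stuffing is excluded because it only ever sees 3 pages.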

Step 1. Set Up LangChain for Question Answering Over Large Documents

First, install the required dependencies including the Vertex AI SDK, LangChain, and ChromaDB:

!pip install google-cloud-aiplatform langchain==0.0.323 chromadb==0.3.26 pypdf

Import the key libraries:

from langchain.document_loaders import PyPDFLoader
from langchain.llms import VertexAI
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate  # used for every prompt below
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import VertexAIEmbeddings

Load the PaLM text model and embeddings model:

vertex_llm_text = VertexAI(model_name="text-bison@001")
vertex_embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@001")

Step 2. Loading Documents

For this example, we'll use a PDF whitepaper on MLOps. Download it and load the text using PyPDFLoader:

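import urllib.request
pdf_url = "https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf"
pdf_file = "practitioners_guide_to_mlops_whitepaper.pdf"  # local path for the download
urllib.request.urlretrieve(pdf_url, pdf_file)  # download the PDF before loading it
pdf_loader = PyPDFLoader(pdf_file)
pages = pdf_loader.load_and_split()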

This splits the PDF into pages which we can use as the base documents.

Step 3. Stuffing Documents

The simplest approach is to stuff the full document text into the context window of the LLM. Set up a prompt template:

prompt_template = """Answer the question as precisely as possible using the provided context.
If the answer is not contained in the context, say "answer not available in context" \n\n
Context: \n {context} \n
Question: \n {question} \n
Answer:
"""
 
prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

Load a stuffing QA chain:

stuff_chain = load_qa_chain(vertex_llm_text, chain_type="stuff", prompt=prompt)

Then run it on a question:

question = "What is Experimentation?"
stuff_answer = stuff_chain(
    {"input_documents": pages[7:10], "question": question}, return_only_outputs=True
)
print(stuff_answer["output_text"])

This works but is limited by the context size the model can handle (a few thousand tokens). Stuffing the rest of the 50-page document exceeds this limit:

try:
    print(stuff_chain(
        {"input_documents": pages[7:], "question": question},
        return_only_outputs=True))
except Exception as e:
    # The request fails because the combined context exceeds the model's input limit.
    print(f"Stuffing failed because the context is too large for the model: {e}")
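To make the limit concrete, here is a minimal sketch of what stuffing amounts to: join every page into one context string, fill the prompt, and check the size before calling the model. It uses plain Python rather than LangChain. The ratio of 4 characters per token and the 8,000-token budget are assumptions for illustration, not values from this article or the Vertex AI documentation, and the helper names are hypothetical.

```python
PROMPT_TEMPLATE = (
    "Answer the question as precisely as possible using the provided context.\n"
    "Context:\n{context}\n"
    "Question:\n{question}\n"
    "Answer:"
)

# Assumptions for illustration only: ~4 characters per token and an
# 8,000-token input budget. Check your model's documentation for real values.
CHARS_PER_TOKEN = 4
MAX_INPUT_TOKENS = 8000


def build_stuffed_prompt(page_texts: list[str], question: str) -> str:
    """Concatenate every page into one context string and fill the template."""
    context = "\n\n".join(page_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(page_texts: list[str], question: str) -> bool:
    """Return True if the stuffed prompt stays within the assumed token budget."""
    prompt = build_stuffed_prompt(page_texts, question)
    return estimate_tokens(prompt) <= MAX_INPUT_TOKENS


# Toy pages of about 3,000 characters each stand in for real PDF pages.
toy_pages = [f"Page {i} " + "lorem ipsum " * 250 for i in range(50)]
print(fits_in_context(toy_pages[:3], "What is Experimentation?"))  # a few pages
print(fits_in_context(toy_pages, "What is Experimentation?"))      # all 50 pages
```

A guard like this lets you fall back to one of the chunked strategies instead of relying on the API call to fail.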

Step 4. Map-Reduce

To scale to larger documents, we can split them into chunks, run QA on each chunk, then aggregate the results. LangChain provides a map-reduce chain to handle this.

First define separate question and combine prompts:

question_prompt_template = """
Answer the question as precisely as possible using the provided context. \n\n
Context: \n {context} \n
Question: \n {question} \n  
Answer:
"""
question_prompt = PromptTemplate(
    template=question_prompt_template, input_variables=["context", "question"]
)
 
combine_prompt_template = """Given the extracted content and the question, create a final answer.  
If the answer is not contained in the context, say "answer not available in context". \n\n
Summaries: \n {summaries} \n
Question: \n {question} \n
Answer:  
"""
combine_prompt = PromptTemplate(
    template=combine_prompt_template, input_variables=["summaries", "question"]
)

Load the map-reduce chain specifying the question and combine prompts:

map_reduce_chain = load_qa_chain(
    vertex_llm_text, 
    chain_type="map_reduce",
    return_intermediate_steps=True,
    question_prompt=question_prompt,
    combine_prompt=combine_prompt,
)

Run it on the full document set:

map_reduce_outputs = map_reduce_chain({"input_documents": pages, "question": question})

This runs QA on each page individually, then combines the results in a final step. We can inspect the intermediate results:

for doc, out in zip(
    map_reduce_outputs["input_documents"], map_reduce_outputs["intermediate_steps"]
):
    print(f"Page: {doc.metadata['page']}")
    print(f"Answer: {out}")

The map-reduce approach scales to large documents and provides some insight into where the information is coming from. However, information can sometimes be lost in the final combine step.
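The control flow behind this step can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name, the shortened prompt strings, and the CountingFakeLLM stand-in are all hypothetical, and a real chain may issue extra calls if the partial answers themselves need to be collapsed to fit the context.

```python
from typing import Callable, List


def map_reduce_qa(chunks: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Answer the question per chunk (map), then merge the answers (reduce)."""
    # Map step: one independent LLM call per chunk.
    partial_answers = [
        llm(f"Context:\n{chunk}\nQuestion:\n{question}\nAnswer:")
        for chunk in chunks
    ]
    # Reduce step: one final LLM call over all partial answers.
    summaries = "\n".join(partial_answers)
    return llm(f"Summaries:\n{summaries}\nQuestion:\n{question}\nAnswer:")


class CountingFakeLLM:
    """Hypothetical stand-in for a real model; it only counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer #{self.calls}"


fake_llm = CountingFakeLLM()
chunks = [f"text of page {i}" for i in range(50)]
final_answer = map_reduce_qa(chunks, "What is Experimentation?", fake_llm)
print(fake_llm.calls)  # 50 map calls + 1 reduce call = 51
```

The call count matches the 51 LLM calls reported for map-reduce in the comparison table. Because the map calls do not depend on each other, they can run in parallel.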

Step 5. Refine

The refine approach aims to mitigate information loss by iteratively refining an answer. It starts with an initial answer on the first chunk, then refines it with each subsequent chunk.

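Define two prompts: an initial question prompt that produces a first answer from the first chunk, and a refine prompt that incorporates the existing answer and new context. Both use the context_str variable, which is the name the refine chain uses for the document text:

initial_question_prompt_template = """
Answer the question as precisely as possible using the provided context only. \n\n
Context: \n {context_str} \n
Question: \n {question} \n
Answer:
"""
initial_question_prompt = PromptTemplate(
    input_variables=["context_str", "question"],
    template=initial_question_prompt_template,
)

refine_prompt_template = """
The original question is: \n {question} \n
The provided answer is: \n {existing_answer}\n
Refine the existing answer if needed with the following context: \n {context_str} \n
Given the extracted content and the question, create a final answer.
If the answer is not contained in the context, say "answer not available in context". \n\n
"""
refine_prompt = PromptTemplate(
    input_variables=["question", "existing_answer", "context_str"],
    template=refine_prompt_template,
)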

Load a refine chain:

refine_chain = load_qa_chain(
    vertex_llm_text,
    chain_type="refine", 
    return_intermediate_steps=True,
    question_prompt=initial_question_prompt,
    refine_prompt=refine_prompt,
)

Run it on the full document:

refine_outputs = refine_chain({"input_documents": pages, "question": question})

Inspect the intermediate steps to see the answer get refined:

for doc, out in zip(
    refine_outputs["input_documents"], refine_outputs["intermediate_steps"]
):
    print(f"Page: {doc.metadata['page']}")  
    print(f"Answer: {out}")

The refine approach helps preserve information across the full document. But it still requires processing the entire document linearly.
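The sequential nature of refine is easiest to see in a plain-Python sketch. As with any simplified illustration, this is not LangChain's implementation: the function name, the shortened prompts, and the RecordingFakeLLM stand-in are hypothetical.

```python
from typing import Callable, List


def refine_qa(chunks: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Build an answer from the first chunk, then refine it chunk by chunk."""
    if not chunks:
        raise ValueError("refine_qa needs at least one chunk")
    # Initial answer from the first chunk only.
    answer = llm(f"Context:\n{chunks[0]}\nQuestion:\n{question}\nAnswer:")
    # Each later call sees the previous answer plus one new chunk, so the
    # calls must run in order and cannot be parallelized.
    for chunk in chunks[1:]:
        answer = llm(
            f"The original question is:\n{question}\n"
            f"The provided answer is:\n{answer}\n"
            f"Refine the existing answer if needed with the following context:\n{chunk}\n"
        )
    return answer


class RecordingFakeLLM:
    """Hypothetical stand-in for a real model; it records every prompt."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"answer #{len(self.prompts)}"


fake_llm = RecordingFakeLLM()
chunks = [f"text of page {i}" for i in range(50)]
final_answer = refine_qa(chunks, "What is Experimentation?", fake_llm)
print(len(fake_llm.prompts))  # one call per chunk = 50
```

This accounts for the 50 LLM calls in the comparison table: one per page, with no separate combine step. The cost of carrying the running answer into every prompt also helps explain why refine uses more total tokens than map-reduce.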

Step 6. Similarity Search

For improved efficiency, we can first use embeddings to find just the most relevant chunks for a given question. This avoids needing to process the full document.

Create a vector index of the document chunks using ChromaDB:

vector_index = Chroma.from_documents(pages, vertex_embeddings).as_retriever()

Retrieve the most relevant chunks for the question:

docs = vector_index.get_relevant_documents(question)

Run the map-reduce chain on just these relevant chunks:

map_reduce_embeddings_outputs = map_reduce_chain(
    {"input_documents": docs, "question": question}
)
print(map_reduce_embeddings_outputs["output_text"])

This finds a high-quality answer while processing only a small subset of the full document. In the comparison above, similarity search gave the best balance of accuracy and efficiency.
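Under the hood, retrieval ranks chunks by how close their embedding vectors are to the question's vector. The sketch below shows that ranking with cosine similarity in plain Python. The word-count "embedding" is a toy substitute for a real embedding model, the function names are hypothetical, and the sample sentences are invented for illustration rather than quoted from the whitepaper.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, int]:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(chunks: List[str], question: str, k: int = 4) -> List[str]:
    """Return the k chunks most similar to the question, best match first."""
    question_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(embed(chunk), question_vector),
        reverse=True,
    )
    return ranked[:k]


# Invented toy chunks, not text from the whitepaper.
chunks = [
    "experimentation is the core activity of the ml development phase",
    "model serving deploys the model to production",
    "continuous training automates retraining pipelines",
    "data management covers datasets and feature stores",
    "experimentation lets data scientists prototype model architectures",
]
print(top_k(chunks, "what is experimentation", k=2))
```

Only the two chunks that mention experimentation are returned, so the QA chain would run on those instead of on every chunk. A vector store such as Chroma performs the same kind of ranking over model-generated embeddings, typically with an index so it does not have to score every chunk.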

Conclusion

In this article, we demonstrated several approaches to question answering over large documents using LangChain and Vertex AI PaLM. While simple stuffing can work for small documents, map-reduce and refine approaches are needed to scale to larger data.

However, the most efficient and effective method is to first use vector similarity search to find only the most relevant passages for a given question. This minimizes the amount of text the LLM needs to process while still producing high quality answers.

The combination of similarity search, LangChain's QA chains, and powerful LLMs like PaLM enables building scalable question answering systems over large document collections. You can get started with the full code in this notebook.
