Claude-ception: Teaching Claude to Prompt Engineer Itself with LangChain

This article demonstrates an innovative approach to automated prompt engineering using the powerful Claude 3 AI model from Anthropic and the LangChain framework. The key idea is to leverage Claude's own prompt engineering capabilities to iteratively improve prompts for a given task through a feedback loop.

Introduction

Claude, the powerful AI assistant from Anthropic, has demonstrated remarkable language understanding and generation capabilities. The recently released Claude 3 model, especially the Opus variant, excels at prompt engineering - the art of designing effective prompts to elicit high-quality outputs from language models.

This opens up an exciting possibility: what if we could leverage Claude's prompt engineering prowess to iteratively improve its own prompts? By having Claude 3 analyze the quality of outputs generated from an initial prompt and suggest improvements, we could create an automated feedback loop to optimize prompts for a given task.

In this article, we'll walk through how to implement this self-improving prompt strategy using Claude 3 Opus and LangChain, a popular framework for building applications with language models. We'll apply it to the task of summarizing academic papers in the engaging style of AI researcher Elvis Saravia (@omarsar0 on Twitter).

The Prompt Engineering Workflow

Alex Albert (@alexalbert__ on Twitter) recently shared an effective workflow for prompt engineering with Claude 3 Opus:

  1. Write an initial prompt for the task
  2. Generate a test set of inputs to evaluate the prompt
  3. Run the prompt on the test cases
  4. Manually review and grade the outputs
  5. Feed the graded examples back into Claude 3 Opus and ask it to revise the prompt
  6. Repeat

Will Hinthorn (@WHinthorn) and Ross Lance Martin (@rlancemartin) demonstrate how to streamline this process using LangSmith:

  1. Create a dataset of test cases
  2. Annotate the generated outputs with feedback
  3. Pass the feedback to Claude 3 Opus to rewrite the prompt
  4. Run this as an iterative improvement loop

Let's see how to implement this approach for the task of summarizing academic papers in the style of Elvis Saravia's excellent Twitter threads.

Step 1: Load Papers into a Dataset

First, we select a few papers that Elvis has tweeted about and load them using the ArxivLoader from LangChain:

from langchain_community.document_loaders import ArxivLoader
 
# arXiv IDs of papers Elvis has tweeted about
ids = ["2403.05313", "2403.04121", "2402.15809"]
 
docs = []
for paper_id in ids:
    # Load the full text of each paper from arXiv
    doc = ArxivLoader(query=paper_id, load_max_docs=1).load()
    docs.extend(doc)
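
As a quick sanity check, we can confirm each paper loaded before building the dataset (the `Title` metadata key follows ArxivLoader's usual convention, so treat the exact keys as an assumption):

for doc in docs:
    # Each Document carries the paper's metadata alongside its full text
    print(doc.metadata.get("Title"), "-", len(doc.page_content), "chars")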

We then add the paper text to a LangSmith dataset:

from langsmith import Client
 
client = Client()
 
ds_name = "Tweet Generator"  
ds = client.create_dataset(dataset_name=ds_name)
client.create_examples(
    inputs=[{"paper": doc.page_content} for doc in docs], dataset_id=ds.id
)

Step 2: Test with an Initial Prompt

Next, we write a reasonable starting prompt for Claude 3 Opus to generate paper summary tweets:

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
 
chat = ChatAnthropic(temperature=0, model_name="claude-3-opus-20240229")
 
system = (
    "<role> You are an assistant that generates Tweets to distill / summarize"
    " an academic paper or open source project. It should be" 
    " well crafted but avoid gimicks or over-reliance on buzzwords. </role>"
)
human = "Here is a paper to convert into a Tweet: <paper> {paper} </paper>"
current_prompt_str = system + human
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])
 
tweet_generator = prompt | chat

We can test it on an example paper:

tweet_example = tweet_generator.invoke({"paper": docs[0].page_content})
print(tweet_example.content)

This generates a decent initial tweet summary of the paper.

Step 3: Run on the Dataset

To evaluate the prompt more thoroughly, we run it on our full dataset of papers:

# The result includes the name of the LangSmith project that collects these runs,
# which we'll reference below when reviewing and annotating the outputs
res = client.run_on_dataset(
    dataset_name=ds_name,
    llm_or_chain_factory=tweet_generator,
)

Step 4: Manually Evaluate the Outputs

We can use LangSmith's annotation queue feature to manually review and provide feedback on the generated tweets:

# Create a queue and add every run from the test project for manual review
q = client.create_annotation_queue(name="Tweet Generator")
client.add_runs_to_annotation_queue(
    q.id,
    run_ids=[
        r.id
        # execution_order=1 restricts the listing to top-level runs (one per example)
        for r in client.list_runs(project_name=res["project_name"], execution_order=1)
    ],
)

To guide the evaluation, we can refer to some of Elvis's actual paper summary tweets as examples of the target style and content.

Step 5: Improve the Prompt

After annotating the outputs, we pull in the structured feedback:

formatted_feedback = get_formatted_feedback(res["project_name"])
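
The `get_formatted_feedback` helper isn't spelled out above; a minimal sketch, assuming the reviewer notes are stored as feedback on each run and we simply pair them with the model's predictions, could look like this:

# Hypothetical helper (not shown in the original post): pair each generated
# tweet with the reviewer comments recorded via the annotation queue.
def get_formatted_feedback(project_name: str) -> list:
    formatted = []
    for run in client.list_runs(project_name=project_name, execution_order=1):
        comments = [
            f.comment
            for f in client.list_feedback(run_ids=[run.id])
            if f.comment
        ]
        if not comments:
            continue
        # The exact output key depends on the chain; "output" is an assumption here
        prediction = run.outputs.get("output", run.outputs)
        formatted.append(
            f"<prediction>\n{prediction}\n</prediction>\n"
            f"<feedback>\n{'; '.join(comments)}\n</feedback>"
        )
    return formatted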

We then feed this back into Claude 3 Opus using a prompt optimization template from LangChain Hub:

from langchain import hub
from langchain_core.output_parsers import StrOutputParser
 
optimizer_prompt = hub.pull("rlm/prompt-optimizer-tweet-drafts") 
optimizer = optimizer_prompt | chat | StrOutputParser() | extract_new_prompt
new_prompt_str = optimizer.invoke(
    {
        "current_prompt": current_prompt_str,
        "annotated_predictions": "\n\n".join(formatted_feedback).strip(),  
    }
)
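
Here, `extract_new_prompt` is a small post-processing step; a plausible version, assuming the optimizer template asks Claude to wrap its revised prompt in `<prompt>` tags, is:

# Hypothetical post-processor: pull the rewritten prompt out of Claude's reply,
# assuming it is wrapped in <prompt>...</prompt> tags.
def extract_new_prompt(text: str) -> str:
    return text.split("<prompt>")[1].split("</prompt>")[0].strip()

Plain Python functions like this are automatically wrapped as runnables when composed into the chain with `|`.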

This generates an improved prompt that incorporates the feedback to better match Elvis's writing style and thread structure.

Step 6: Evaluate the New Prompt

Finally, we can run the updated prompt on our dataset again to check for improvements:

new_prompt = ChatPromptTemplate.from_messages([("user", new_prompt_str)])
tweet_generator_v2 = new_prompt | chat
 
updated_results = client.run_on_dataset(
    dataset_name=ds_name,
    llm_or_chain_factory=tweet_generator_v2, 
)
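
To put the two prompt versions side by side, we can list the top-level runs from each test project with the same LangSmith client (a quick inspection sketch; in practice the LangSmith comparison view is more convenient):

for label, project in [("v1", res["project_name"]), ("v2", updated_results["project_name"])]:
    print(f"\n=== {label}: {project} ===")
    for run in client.list_runs(project_name=project, execution_order=1):
        print(run.outputs)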

Comparing the outputs of the original and revised prompts shows that the feedback loop with Claude 3 Opus was effective in tuning the prompt to produce more engaging, Elvis-style paper summaries.

For example, here are tweets generated by the initial and improved prompts for a new test paper:

Initial Prompt:

New survey on using large language models (LLMs) for tabular data tasks like prediction, generation, and question answering. Covers key techniques like serialization, prompt engineering, and benchmarks. Identifies opportunities and challenges.

Improved Prompt:

Unlocking LLMs' Potential for Tabular Data 🔑

This comprehensive survey explores how large language models can be applied to tabular data for tasks like prediction, generation, and understanding.

Key techniques covered:

  • Serializing tables into LLM-readable formats
  • Table manipulations to fit context length
  • Prompt engineering tricks
  • Building end-to-end systems

The paper provides a taxonomy of datasets, metrics, and methods for each application area. It also discusses current limitations and future research directions.

LLMs show great promise for working with structured data when combined with the right preprocessing steps and prompting strategies. This opens up exciting possibilities for more intelligent and automated systems to analyze tabular data.

What other innovative applications of LLMs to structured data do you foresee? 🤔

The revised tweet does a better job breaking down the key points, adds emojis for visual interest, and ends with a thought-provoking question to engage the audience - all characteristic elements of Elvis's style.

Conclusion

By combining the prompt engineering capabilities of Claude 3 Opus with the LangChain and LangSmith frameworks, we were able to create an automated feedback loop to progressively optimize a prompt for summarizing papers in a particular writing style.

This demonstrates a powerful general approach for tuning language models to perform a task according to certain specifications or to emulate a target style. The same technique could be applied to automatically improve prompts for a wide variety of other text generation tasks.

As language models continue to advance, we can expect to see more sophisticated prompt engineering tools and workflows to maximize their potential. Frameworks like LangChain will play a key role in making this accessible to a broader range of developers and opening up exciting new applications.
