Agent Frameworks Comparison
Side-by-side comparison of LangChain, LlamaIndex, AutoGen, CrewAI, Haystack, and Semantic Kernel for building LLM-powered applications and agent systems. Covers strengths, weaknesses, and when to pick each.
Agent Frameworks Comparison — Decision Matrix
What it is
This page compares the six most-deployed open-source frameworks for building LLM applications and multi-agent systems in 2026: LangChain, LlamaIndex, AutoGen, CrewAI, Haystack, and Semantic Kernel. The goal is to help you pick — not to pick for you. The right answer depends on your runtime (Python, .NET, Node), the dominant workload (RAG, agent loops, multi-stage workflows), and the team's tolerance for breaking-change churn.
Where useful, the page also references DSPy (a prompt-compilation framework, not a typical agent framework), MCP-based custom servers, and naked SDK code with the model provider's own tool-calling API.
For deeper material on each framework, see the linked pages.
When / why to read this
- You are starting a new LLM application and want a defensible framework choice.
- You inherited a LangChain or LlamaIndex codebase and are considering migration.
- You are integrating LLM features into an existing .NET or Java app.
- You want to compare a current framework against the alternatives before committing to a six-month roadmap.
The frameworks at a glance
| Framework | Primary language | First commit | Maintained by | Core abstraction |
|---|---|---|---|---|
| LangChain | Python, JS | Oct 2022 | LangChain Inc. | LCEL pipe-composed runnables |
| LlamaIndex | Python | Nov 2022 | LlamaIndex Inc. | Indexes, query engines, agents |
| AutoGen | Python, .NET | Mar 2023 | Microsoft Research | Conversable agents, group chats |
| CrewAI | Python | Oct 2023 | crewAI Inc. | Role + goal + backstory + Crew |
| Haystack | Python | Nov 2019 | deepset | Pipeline DAG of typed components |
| Semantic Kernel | Python, .NET, Java | Mar 2023 | Microsoft | Kernel + plugins + functions |
High-level positioning
- LangChain: the kitchen-sink framework. Most integrations, biggest community, fastest-moving API. Best when you want every model, vector store, and tool already wrapped, and you can tolerate occasional churn.
- LlamaIndex: the data framework. Sharper than LangChain on document-centric RAG (indexes, query engines, knowledge graphs, multi-document reasoning). Lighter on general agents.
- AutoGen: the multi-agent conversation framework. Strong for "two agents review each other" patterns, code generation + execution loops, and structured multi-agent debates.
- CrewAI: the role-based crew framework. Optimised for long-horizon, multi-step workflows where each step has a distinct expert "agent". YAML-friendly.
- Haystack: the pipeline framework. Explicit DAG of typed components, YAML-serialisable, deployable via hayhooks. Production-friendly for RAG with strong evaluation primitives.
- Semantic Kernel: Microsoft's polyglot SDK. The right answer when you are in the .NET ecosystem; also a strong Python option when you want clean dependency injection and OpenTelemetry tracing.
Decision matrix — by use case
| You want… | First pick | Strong second |
|---|---|---|
| RAG over a folder of PDFs | LlamaIndex | Haystack |
| Production RAG with strict eval | Haystack | LlamaIndex |
| Multi-document, multi-hop reasoning | LlamaIndex | Haystack |
| Tool-using agent with provider tool calling | LangChain | Semantic Kernel |
| Multi-agent code-review loop | AutoGen | CrewAI |
| Long workflow with role specialists (research → write → QA) | CrewAI | AutoGen |
| LLM features inside a .NET app | Semantic Kernel | LangChain.NET |
| Java-based stack | Semantic Kernel (Java) | LangChain4j |
| Tracing, evals, datasets across vendors | LangChain (+ LangSmith) | Haystack |
| Optimising prompts against a metric | DSPy | LangChain (manual) |
| Cross-LLM-client tool servers | MCP (FastMCP, mcp-go) | n/a |
| Embedded in a single small CLI | Provider SDK directly | LangChain |
| Streaming chat UI | LangChain or Semantic Kernel | LlamaIndex |
| Knowledge-graph RAG | LlamaIndex | Haystack |
| Hybrid retrieval (BM25 + dense + rerank) | Haystack | LlamaIndex |
Feature matrix — capabilities
| Capability | LangChain | LlamaIndex | AutoGen | CrewAI | Haystack | Semantic Kernel |
|---|---|---|---|---|---|---|
| Python | yes | yes | yes | yes | yes | yes |
| .NET / C# | partial | no | yes | no | no | yes |
| Java | community | no | community | no | community | yes |
| JS / TS | yes | yes (alpha) | partial | no | partial | partial |
| Provider tool calling | yes | yes | yes | yes | yes | yes |
| Streaming | yes | yes | yes | yes | yes | yes |
| Async-first | yes | yes | yes | partial | yes | yes |
| Built-in retrievers | many | many | few | basic | many | basic |
| Built-in vector stores | 50+ | 30+ | n/a | n/a | 20+ | 10+ |
| Multi-agent group chat | partial | partial | yes | yes | partial | preview |
| Planner / router | LCEL routing | RouterQueryEngine | GroupChatManager | Process.hierarchical | Branch component | deprecated planners |
| Code execution sandbox | tool | tool | yes (Docker, local) | tool | tool | tool |
| Memory abstraction | yes | yes | yes | yes | yes | yes |
| Evaluation primitives | LangSmith | basic | n/a | basic | yes (built-in) | basic |
| YAML / file config | partial | partial | partial | yes | yes | yes (.prompty) |
| OpenTelemetry | via LangSmith / OTel | community | yes | community | community | yes (built-in) |
| MCP client support | yes | yes | yes | community | community | yes |
| MCP server SDK alignment | yes | yes | yes | community | community | yes |
Strengths and weaknesses — one paragraph each
LangChain
Strengths: by far the largest integration catalogue — every embedding model, vector store, document loader, and observability tool has a LangChain wrapper. LCEL composes pipelines with a uniform streaming/async/batch interface. LangSmith gives near-zero-effort tracing and dataset evaluation across LangChain and non-LangChain code.
Weaknesses: breaking changes are routine; imports migrate between releases. Heavy abstraction surface means tracing the actual prompt sent to the LLM often requires chain.get_prompts() or LangSmith. The "everything is a Runnable" philosophy can make simple tasks look complicated.
LlamaIndex
Strengths: the sharpest abstraction for "load → index → query" over your documents. Indexes go beyond plain vector search — keyword, knowledge graph, summary, document hierarchies. Query engines decompose multi-hop questions automatically. Better default chunking and ingestion pipelines than LangChain.
Weaknesses: agent surface is thinner than LangChain or AutoGen. The Settings global is convenient but bites teams that swap models mid-process. JS port lags behind Python by months.
AutoGen
Strengths: the cleanest mental model for multi-agent conversations. ConversableAgent, UserProxyAgent, and CodeExecutorAgent map directly to recognisable roles. Code execution with Docker isolation works out of the box. The v0.4 rewrite introduced a typed, async, distributed runtime suitable for serverless and queue-based deployments.
Weaknesses: v0.4 broke compatibility with v0.2 (pyautogen), so older tutorials are misleading. Group chats can loop or stall without careful termination conditions. Token costs from chatty multi-agent runs surprise teams.
CrewAI
Strengths: ergonomic role/goal/backstory model maps naturally to business workflows ("a researcher, a writer, an editor"). Sequential and hierarchical processes cover most workflow shapes. YAML-driven config lets non-engineers tune the crew.
Weaknesses: when a "crew" doesn't quite fit your workflow, customising is fiddly. Smaller integration catalogue than LangChain. Strong opinions about agent design that can constrain implementation patterns.
Haystack
Strengths: explicit pipeline DAG with typed sockets is the easiest framework to review in PRs. First-class evaluation primitives (ContextRelevance, Faithfulness, SAS). YAML serialisation + hayhooks deploys pipelines as REST endpoints without extra glue. 2.x rewrite is clean and stable.
Weaknesses: smaller ecosystem of pre-built integrations than LangChain or LlamaIndex. Less focus on the "agent loop" pattern — possible, but not the primary metaphor. .NET and JS support are community efforts.
Semantic Kernel
Strengths: the right answer in .NET shops — dependency injection, OpenTelemetry, async, and Microsoft.Extensions.AI all align. Plugins map cleanly to existing .NET service architectures. .prompty file format is portable across SK, the Azure AI Foundry, and external tools. Strong agent and workflow stories arriving in 1.x.
Weaknesses: Python SDK trails .NET in features and stability. Naming conventions ("kernel functions", "skills" historically) take some adjustment. Smaller community than LangChain.
When to use no framework
Sometimes the right answer is no framework at all.
- The Anthropic / OpenAI / Google SDKs already do streaming, tool calling, retries, and async. If your application is "call model with tools" and that is it, a 200-line direct-SDK implementation is easier to maintain than a LangChain chain.
- For tool servers consumed by multiple clients, use MCP directly (see MCP frameworks) — that is the framework.
- For prompt-quality work where the metric matters more than the orchestration, use DSPy to compile prompts, then expose the result through any of the frameworks above.
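To make the "call model with tools" case concrete, here is a minimal sketch of the loop a direct-SDK implementation has to own. It assumes a stand-in `fake_model` function in place of a real provider call, and the message and reply shapes are simplified for illustration; they are not any provider's actual wire format.

```python
import json
from typing import Callable, Dict, List

# Hypothetical tool registry: plain functions keyed by the name the model sees.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}

def fake_model(messages: List[dict]) -> dict:
    """Stand-in for a provider SDK call (an assumption, not a real API).

    Requests the tool once, then answers using the tool result.
    """
    last = messages[-1]
    if last["role"] == "tool":
        return {"type": "text", "text": f"Forecast: {last['content']}"}
    return {"type": "tool_call", "name": "get_weather",
            "arguments": json.dumps({"city": "Oslo"})}

def run_agent(question: str, model=fake_model, max_rounds: int = 5) -> str:
    """Call the model, execute requested tools, repeat until it returns text."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_rounds):
        reply = model(messages)
        if reply["type"] == "text":
            return reply["text"]
        # Run the requested tool and feed the result back to the model.
        args = json.loads(reply["arguments"])
        result = TOOLS[reply["name"]](**args)
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("model did not finish within max_rounds")

print(run_agent("What is the weather in Oslo?"))  # Forecast: Sunny in Oslo
```

Swapping `fake_model` for a real SDK call mostly means translating between these simplified dicts and the provider's own tool-call objects; the loop itself stays the same size.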
Migration sketches
Patterns observed in real migrations. Most are achievable in a week of focused work for a small RAG project.
LangChain → Haystack
- ChatModel → OpenAIChatGenerator / OpenAIGenerator.
- PromptTemplate → PromptBuilder(template=...) (Jinja2 instead of f-string).
- Retriever → InMemoryEmbeddingRetriever (or per-store retriever).
- LCEL pipe prompt | model | parser → Pipeline.add_component + connect().
- AgentExecutor → Agent from haystack.components.agents plus tools.
LlamaIndex → LangChain
- VectorStoreIndex.as_query_engine() → custom LCEL chain with retriever + prompt + model.
- Settings.llm → constructor-level ChatModel instances; no global state.
- query_engine.query(q) → chain.invoke({"question": q}).
CrewAI → AutoGen
- Agent → AssistantAgent with focused system_message.
- Task description → first user message to the relevant agent.
- Crew(process=sequential) → chain of agent.run_stream(task=...) calls.
- Process.hierarchical → RoundRobinGroupChat or SelectorGroupChat with a manager agent.
Semantic Kernel → LangChain (rarely needed but asked about)
- @kernel_function → @tool decorator.
- Kernel-registered plugins → tools bound to model via model.bind_tools([...]).
- .prompty files → ChatPromptTemplate.from_messages.
- FunctionChoiceBehavior.Auto() → create_tool_calling_agent with AgentExecutor.
Performance and operational considerations
| Concern | What to ask | Notes |
|---|---|---|
| Tokens per task | "How many LLM calls per user request?" | AutoGen/CrewAI multi-agent loops can be 5–20×; budget accordingly. |
| Cold start | "How long until first token after process start?" | LangChain and LlamaIndex can add 1–3 s at import time; SK and Haystack are leaner; a raw SDK is fastest. |
| Memory footprint | "What does the framework load at import time?" | LangChain imports a lot transitively; lazy-import provider subpackages where you can. |
| Observability | "Can I see the exact prompt sent to the LLM?" | LangSmith (LangChain), built-in tracing (SK), pipeline.draw() (Haystack). |
| Determinism | "Can I replay a failed run with the same inputs?" | All frameworks let you pin temperature to 0, which reduces but does not eliminate output variation; only LangSmith and SK ship reproducible run records by default. |
| Sandbox safety | "Where does generated code execute?" | AutoGen Docker executor is the cleanest; everywhere else use a sandbox you bring yourself. |
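The cold-start and memory-footprint questions above can be checked empirically before committing. A rough sketch, using only the standard library: it times an import and counts how many modules it pulls in. The helper name `import_cost` is made up for this example, and `json` is a placeholder module; substitute the framework package you are evaluating.

```python
import importlib
import sys
import time
from typing import Tuple

def import_cost(module_name: str) -> Tuple[float, int]:
    """Return (seconds, newly loaded module count) for importing module_name."""
    before = set(sys.modules)
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    return elapsed, len(set(sys.modules) - before)

# "json" is only a placeholder; substitute your framework's package name.
seconds, loaded = import_cost("json")
print(f"json: {seconds * 1000:.1f} ms, {loaded} new modules")
```

Run it in a fresh interpreter, since a module that is already imported reports close to zero. For a per-module breakdown, `python -X importtime -c "import <package>"` gives finer detail.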
Real-world recipes
Recipe — chunk-and-evaluate harness portable across frameworks
A pipeline-agnostic eval set ({"question": ..., "ground_truth": ...}) plus a metric function (semantic_match(predicted, gold) -> bool) lets you A/B any of these frameworks against the same questions. Run each framework's pipeline once per question, store predictions in predictions/<framework>.jsonl, then compute the metric and report a markdown table.
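import json
from pathlib import Path

def evaluate(pred_file: str, gold_file: str, metric) -> float:
    """Fraction of predictions that the metric accepts against the gold answers."""
    preds = [json.loads(l) for l in Path(pred_file).read_text().splitlines() if l.strip()]
    gold = [json.loads(l) for l in Path(gold_file).read_text().splitlines() if l.strip()]
    if len(preds) != len(gold):
        raise ValueError("prediction and gold files must have the same number of rows")
    # Gold rows follow the eval-set shape above: {"question": ..., "ground_truth": ...}.
    return sum(metric(p["answer"], g["ground_truth"]) for p, g in zip(preds, gold)) / len(gold)

# semantic_match is the metric function you supply.
for name in ["langchain", "llamaindex", "haystack", "semantic-kernel"]:
    score = evaluate(f"predictions/{name}.jsonl", "gold/qa.jsonl", semantic_match)
    print(f"{name:18s} {score:.2%}")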
Output:
langchain 78.50%
llamaindex 82.00%
haystack 81.00%
semantic-kernel 79.50%
Numbers are illustrative — always run on your own dataset before deciding.
Recipe — shared MCP tool layer across frameworks
Write business tools once as an MCP server; consume from whichever framework is best per workload.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("acme-tools")

@mcp.tool()
def get_customer(customer_id: str) -> dict:
    """Look up a customer by ID."""
    ...

if __name__ == "__main__":
    mcp.run()
Then in any of LangChain (via langchain-mcp-adapters), Semantic Kernel (via semantic-kernel-mcp), or AutoGen (via the official MCP client) consume the tools without rewriting them. See the MCP frameworks page for client wiring.
Recipe — gradually replacing LangChain with raw SDK
- Identify a chain that does only prompt | model | parser; these are 80% of LangChain code in many apps.
- Replace with a 30-line function that calls the provider SDK directly.
- Move retries, structured outputs, and tracing to dedicated libraries (tenacity, Pydantic, OpenTelemetry).
- Keep LangChain only for chains that genuinely use multi-step LCEL behaviour.
The result is fewer dependencies, fewer breaking changes per release, and clearer error messages.
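A sketch of what such a replacement function might look like, using only the standard library. The `complete` callable is an assumption standing in for your provider SDK call, and the hand-rolled retry loop stands in for tenacity; the JSON reply shape is also made up for the example.

```python
import json
import time
from typing import Callable, Optional

def answer_json(question: str, complete: Callable[[str], str],
                retries: int = 3, backoff_s: float = 0.5) -> dict:
    """Format a prompt, call the model, parse JSON; retry on malformed replies."""
    prompt = (
        "Reply with a JSON object containing an 'answer' key.\n"
        f"Question: {question}"
    )
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            raw = complete(prompt)        # model step
            parsed = json.loads(raw)      # parser step
            if not isinstance(parsed, dict) or "answer" not in parsed:
                raise ValueError("reply is missing the 'answer' key")
            return parsed
        except ValueError as exc:         # JSONDecodeError subclasses ValueError
            last_error = exc
            if attempt < retries - 1:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"no valid reply after {retries} attempts") from last_error

def flaky_stub() -> Callable[[str], str]:
    """Hypothetical model: fails once with non-JSON, then returns valid JSON."""
    replies = iter(["not json", '{"answer": "42"}'])
    return lambda prompt: next(replies)

print(answer_json("What is six times seven?", flaky_stub(), backoff_s=0.0))
# {'answer': '42'}
```

Injecting `complete` keeps the function testable without network access, which is most of what the chain abstraction was providing in the simple case.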
Recipe — multi-framework strategy in one product
It is legitimate to use more than one framework in a single product. A common shape:
- Ingestion: Haystack pipelines for batch indexing with hayhooks as a deployment artefact.
- Online retrieval + answering: LangChain LCEL chain because LangSmith tracing on every user request is valuable.
- Background research crew: CrewAI runs a research → write → QA flow for long-form content.
- Embedded tools: an MCP server exposes domain tools to all of the above and to Claude Code / Codex CLI for engineers.
Keep the interfaces small and the framework boundaries crisp; the orchestration code that ties them together is the actual product.
Anti-patterns
Picking by GitHub star count — popularity correlates with integration breadth, not fitness for your problem. CrewAI is smaller than LangChain but better for role-driven workflows; Haystack is smaller still but better for evaluation-heavy RAG.
Wrapping the framework in your own abstraction "in case we swap" — the leak rate is high. Either commit to the framework or write thin direct-SDK code. Premature framework-agnostic abstractions are usually wasted.
Adopting two competing frameworks for the same workload — running LangChain and LlamaIndex side-by-side on the same RAG flow doubles dependency surface and confuses contributors. Multi-framework is fine when each framework owns a distinct workload (above), not when they overlap.
No metric, infinite tuning — for any non-trivial agent or RAG, define an evaluation set and a metric before picking a framework. Most framework choice arguments dissolve once a metric exists.
Selection guide — three questions
- What language is the host application?
  - .NET / Java → Semantic Kernel.
  - Python → continue.
  - JS / TS → LangChain (others trail).
- What is the dominant workload?
  - Documents → LlamaIndex or Haystack.
  - Multi-agent loop → AutoGen or CrewAI.
  - Single chain with tools → LangChain or Semantic Kernel.
  - Multi-client tool exposure → MCP, not a framework.
- How important is evaluation?
  - First-class metric → Haystack or LangChain + LangSmith.
  - Best-effort dev iteration → any of them.
  - Compiled prompts → DSPy, then expose the result through any framework.
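The three questions can be written down as a small lookup, which is handy for team discussions or an onboarding script. This is only a sketch of the guide above; the string keys such as "dotnet" and "multi-agent" are hypothetical labels chosen for the example.

```python
from typing import List

# Labels such as "dotnet" and "multi-agent" are hypothetical keys for this sketch.
WORKLOAD_PICKS = {
    "documents": "LlamaIndex or Haystack",
    "multi-agent": "AutoGen or CrewAI",
    "single-chain-tools": "LangChain or Semantic Kernel",
    "multi-client-tools": "MCP, not a framework",
}
EVALUATION_PICKS = {
    "first-class": "Haystack or LangChain + LangSmith",
    "compiled-prompts": "DSPy, then expose through any framework",
}

def pick_framework(language: str, workload: str,
                   evaluation: str = "best-effort") -> List[str]:
    """Walk the three questions in order and collect the suggestions."""
    if language in {"dotnet", "java"}:
        return ["Semantic Kernel"]
    if language in {"js", "ts"}:
        return ["LangChain"]
    if language != "python":
        raise ValueError(f"unknown language: {language}")
    picks = [WORKLOAD_PICKS[workload]]
    # "best-effort" means any framework works, so it adds no constraint.
    if evaluation != "best-effort":
        picks.append(EVALUATION_PICKS[evaluation])
    return picks

print(pick_framework("python", "documents", "first-class"))
# ['LlamaIndex or Haystack', 'Haystack or LangChain + LangSmith']
```

When the workload and evaluation answers overlap on one name (Haystack in the example), that overlap is usually the strongest candidate.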
Side-by-side code — "answer a question against three docs"
A minimal RAG flow shown in each framework, same inputs, same expected output. Read top to bottom; the verbosity differences are real.
LangChain
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

vs = Chroma(
    collection_name="docs",
    embedding_function=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)
retriever = vs.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer from the context only.\n\n{context}"),
    ("human", "{q}"),
])
model = ChatAnthropic(model="claude-sonnet-4-6")

def join(docs):
    """Concatenate retrieved documents into one context string."""
    return "\n\n".join(d.page_content for d in docs)

chain = {"context": retriever | join, "q": RunnablePassthrough()} | prompt | model | StrOutputParser()
print(chain.invoke("What does Haystack do?"))
Output:
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.
LlamaIndex
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = Anthropic(model="claude-sonnet-4-6")
Settings.embed_model = OpenAIEmbedding()
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
qe = index.as_query_engine()
print(qe.query("What does Haystack do?"))
Output:
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.
Haystack
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Anthropic generators ship in the anthropic-haystack integration package.
from haystack_integrations.components.generators.anthropic import AnthropicGenerator

# Assumes an indexing pipeline has already written embedded documents to this store.
store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("emb", SentenceTransformersTextEmbedder(model="BAAI/bge-small-en-v1.5"))
p.add_component("ret", InMemoryEmbeddingRetriever(document_store=store, top_k=4))
p.add_component("prm", PromptBuilder(template="Answer from context.\n{% for d in documents %}{{ d.content }}\n{% endfor %}\nQ: {{ q }}"))
# The non-chat generator accepts the string prompt that PromptBuilder emits.
p.add_component("llm", AnthropicGenerator(model="claude-sonnet-4-6"))
p.connect("emb.embedding", "ret.query_embedding")
p.connect("ret.documents", "prm.documents")
p.connect("prm.prompt", "llm.prompt")

q = "What does Haystack do?"
print(p.run({"emb": {"text": q}, "prm": {"q": q}})["llm"]["replies"][0])
Output:
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.
Semantic Kernel (Python)
import asyncio
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.contents.chat_history import ChatHistory

async def main():
    k = Kernel()
    k.add_service(OpenAIChatCompletion(service_id="chat", ai_model_id="gpt-4o-mini"))
    docs = open("./docs.txt").read()
    history = ChatHistory()
    history.add_user_message(f"Answer from context.\n{docs}\nQ: What does Haystack do?")
    chat = k.get_service("chat")
    settings = chat.get_prompt_execution_settings_class()(service_id="chat")
    reply = await chat.get_chat_message_content(chat_history=history, settings=settings, kernel=k)
    print(reply.content)

asyncio.run(main())
Output:
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.
AutoGen
import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main():
    client = OpenAIChatCompletionClient(model="gpt-4o-mini")
    agent = AssistantAgent(name="answerer", model_client=client,
                           system_message="Answer from context. Be concise.")
    docs = open("./docs.txt").read()
    # on_messages expects typed message objects and a cancellation token, not raw dicts.
    message = TextMessage(content=f"{docs}\nQ: What does Haystack do?", source="user")
    result = await agent.on_messages([message], CancellationToken())
    print(result.chat_message.content)

asyncio.run(main())
Output:
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.
CrewAI
from crewai import Agent, Task, Crew, Process
from crewai_tools import FileReadTool

reader = Agent(role="Doc Reader", goal="Read context docs.",
               backstory="You read provided docs carefully.", tools=[FileReadTool()])
answerer = Agent(role="Answerer", goal="Answer using only the docs.",
                 backstory="Concise technical responder.")
t1 = Task(description="Read ./docs.txt.", expected_output="Verbatim doc contents.", agent=reader)
t2 = Task(description="Answer 'What does Haystack do?' from the read content.",
          expected_output="One sentence.", agent=answerer)
crew = Crew(agents=[reader, answerer], tasks=[t1, t2], process=Process.sequential)
print(crew.kickoff().raw)
Output:
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.
The six-way comparison shows the trade-off: LangChain and LlamaIndex compress retrieval + answering into a few lines via heavy abstractions; Haystack and AutoGen are explicit and verbose; CrewAI requires you to model the work as roles; Semantic Kernel sits close to the raw chat API.
Cost and latency notes
Hard numbers depend on your data, models, and prompts — these are rough rules of thumb worth confirming with traces.
| Workload | Typical LLM calls | Comment |
|---|---|---|
| LangChain LCEL chain | 1 per request | Add 1 if a tool call is needed. |
| LlamaIndex query_engine.query | 1 per request | Sub-question decomposition can multiply this 3–5×. |
| Haystack RAG pipeline | 1 per request | Reranker adds local inference, not LLM calls. |
| AutoGen single-agent | 1 per turn, ~3–5 turns | Per task; group chats are 5–15+. |
| CrewAI sequential 3-role crew | 3+ per kickoff | Hierarchical adds a manager round per task. |
| Semantic Kernel auto function call | 1–2 per request | One per tool round-trip. |
| DSPy compiled program | 1 per request at runtime | Compilation phase consumes hundreds offline. |
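To turn the call counts in the table into a budget, multiply by tokens per call and your provider's price. The sketch below shows the arithmetic only; the token counts and the per-million-token prices are made-up placeholders, not any provider's real rates.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    calls_per_request: float
    input_tokens_per_call: int
    output_tokens_per_call: int

def cost_per_request(w: Workload, usd_per_m_input: float,
                     usd_per_m_output: float) -> float:
    """Estimated USD per user request: calls x (input cost + output cost)."""
    per_call = (w.input_tokens_per_call * usd_per_m_input
                + w.output_tokens_per_call * usd_per_m_output) / 1_000_000
    return w.calls_per_request * per_call

# Hypothetical numbers for illustration only.
chain = Workload("LCEL chain", 1, 2000, 300)
crew = Workload("3-role crew", 3, 2000, 300)
for w in (chain, crew):
    print(f"{w.name:12s} ${cost_per_request(w, 3.0, 15.0):.4f}")
# LCEL chain   $0.0105
# 3-role crew  $0.0315
```

Replace the placeholder token counts with figures from your own traces; multi-agent runs often grow input tokens per call as history accumulates, which this flat model does not capture.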
Caching at the framework boundary
- LangChain: set_llm_cache(InMemoryCache()) or SQLiteCache(database_path="...") caches all LLM calls globally.
- LlamaIndex: IngestionPipeline caches embeddings; LLM caching is per-callsite.
- Haystack: no built-in LLM cache; wrap generators with a @component that hashes input and returns cached output.
- AutoGen: ChatCompletionCache via autogen_ext.models.cache.
- CrewAI: cache=True on agents for tool-result caching.
- Semantic Kernel: filter-based caching (write an IFunctionInvocationFilter).
Always cache during eval and optimisation runs — the LLM call count dwarfs everything else.
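Where a framework has no built-in LLM cache, the hash-and-store pattern is small enough to write by hand. A minimal in-memory sketch follows; the `(model, prompt) -> str` call signature is an assumption for the example, and a real version would also key on temperature and other settings and persist to disk.

```python
import hashlib
import json
from typing import Callable, Dict

def cached(llm_call: Callable[[str, str], str]) -> Callable[[str, str], str]:
    """Wrap an LLM call so identical (model, prompt) pairs hit the model once."""
    store: Dict[str, str] = {}

    def wrapper(model: str, prompt: str) -> str:
        # Hash a canonical JSON encoding so the key is stable and fixed-length.
        key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
        if key not in store:
            store[key] = llm_call(model, prompt)
        return store[key]

    return wrapper

# Hypothetical model stub that counts how often it is really invoked.
calls = {"n": 0}
def stub_llm(model: str, prompt: str) -> str:
    calls["n"] += 1
    return f"[{model}] echo: {prompt}"

ask = cached(stub_llm)
ask("demo-model", "What does Haystack do?")
ask("demo-model", "What does Haystack do?")   # served from the cache
print(calls["n"])  # 1
```

During eval runs the same questions are replayed many times, so even this in-memory version removes most repeated calls within a process.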
Observability across frameworks
Tracing options that work today:
| Framework | Best-in-class | Notes |
|---|---|---|
| LangChain | LangSmith | Set LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY. Every chain call appears with prompts, tokens, latency. |
| LlamaIndex | LangSmith via instrument | from llama_index.core import set_global_handler; set_global_handler("langfuse") is also common. |
| Haystack | OpenTelemetry via haystack.tracing | Use LangfuseTracer or OpenTelemetryTracer. |
| AutoGen | OpenTelemetry built-in | Auto-spans for agent.run, model.call, tool.call. |
| CrewAI | AgentOps and Langfuse plugins | Set AGENTOPS_API_KEY; CrewAI hooks send full traces. |
| Semantic Kernel | OpenTelemetry built-in | Standard gen_ai.* semantic conventions. |
For any framework, an OTel collector → Tempo / Honeycomb / Azure Monitor → Grafana is the most portable setup. Vendor SaaS (LangSmith, AgentOps, Langfuse) wins on rich LLM-specific UI; OTel wins on owning your data.
Compatibility with model providers
Most frameworks support OpenAI, Anthropic, Google, Azure OpenAI, Cohere, Mistral, and Hugging Face out of the box. Differences are at the edges.
| Provider | LangChain | LlamaIndex | AutoGen | CrewAI | Haystack | Semantic Kernel |
|---|---|---|---|---|---|---|
| OpenAI | yes | yes | yes | yes | yes | yes |
| Anthropic Claude | yes | yes | yes (ext) | yes | yes | community |
| Google Gemini | yes | yes | community | yes | yes | yes |
| Azure OpenAI | yes | yes | yes | yes | yes | yes |
| Cohere | yes | yes | community | yes | yes | community |
| Mistral | yes | yes | yes | yes | yes | community |
| Hugging Face local | yes | yes | community | community | yes | community |
| Ollama | yes | yes | yes | yes | yes | yes |
| AWS Bedrock | yes | yes | community | yes | yes | community |
| OpenAI-compatible (vLLM, Groq, Fireworks, Together) | yes (point base URL) | yes | yes | yes | yes | yes |
For any provider exposing an OpenAI-compatible endpoint, all six frameworks support it by setting base_url= on the OpenAI client config. This makes vLLM, Groq, Fireworks, Together, and Anyscale drop-in choices everywhere.
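Underneath the framework config, "setting base_url" only changes where the same chat-completions request is sent. The standard-library sketch below builds that request without sending it. The local URL, the "EMPTY" key, and the model name are placeholders assumed for illustration; check your server's documentation for the real values.

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for an OpenAI-compatible server."""
    body = {"model": model,
            "messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoint, key, and model name; only the base URL differs per provider.
req = build_chat_request("http://localhost:8000/v1", "EMPTY", "my-local-model", "ping")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# To actually send it: urllib.request.urlopen(req)
```

Because only `base_url` and the model name vary, switching between compatible providers is a configuration change rather than a code change.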
Common pitfalls — framework selection
Choosing for capability you might need later — every framework is large enough that "might need" features add maintenance burden. Choose for the workload in front of you; migrating later is usually easier than predicted.
Tutorial-driven choice — the framework with the best tutorial is not necessarily the best for production. CrewAI is famously approachable in tutorials but has fewer escape hatches than LangChain at scale.
Underestimating evaluation cost — Haystack and LangChain (with LangSmith) make evaluation cheap; the others need glue. If you have not built an eval set, your framework choice is premature.
Build the same minimal pipeline (3 documents, 5 questions, one metric) in your top-two frameworks before committing. The decision usually resolves itself in a half-day spike.
Anatomy of a multi-framework production stack
A representative shape, distilled from several deployed systems.
+-------------------------------+     +-------------------------------+
| Ingestion Worker (cron)       |     | Online API (FastAPI/.NET)     |
|                               |     |                               |
| Haystack indexing pipeline    |     | LangChain LCEL chain          |
|  - converters                 |     |  - retriever (shared Chroma)  |
|  - splitter                   |     |  - tools via MCP client       |
|  - embedder                   |     |  - LangSmith tracing          |
|  - writer (ChromaDB)          |     |  - streaming                  |
+-------------------------------+     +-------------------------------+
                |                                     |
                v                                     v
+----------------------------------+----------------------------------+
| Chroma vector store (shared)     | MCP servers (multiple)           |
| (Postgres / S3 backed)           |  - github.py                     |
|                                  |  - postgres.py                   |
|                                  |  - search.py                     |
+----------------------------------+----------------------------------+
                                   |
                                   v
+-------------------------------+     +-------------------------------+
| Background research crew      |     | Eval / CI                     |
| CrewAI: researcher → writer   |     | Haystack evaluators           |
|                               |     | + LangSmith datasets          |
+-------------------------------+     +-------------------------------+
Each framework owns a single workload. The data plane (vector store, MCP tools) is shared. New workloads pick the right framework rather than forcing one to do everything.
Future-proofing notes
- MCP is winning the tool-server story. Even if your app is single-client today, exposing tools as MCP servers separates concerns and lets you swap LLM clients without rewriting tools.
- Microsoft.Extensions.AI is becoming the .NET-wide abstraction. Semantic Kernel already aligns to it; LangChain.NET and others are following.
- Provider tool-calling APIs are converging. OpenAI, Anthropic, and Google all expose tool calling with similar shapes. Direct-SDK code is increasingly viable, and the unique value of "uniform model interface" frameworks shrinks accordingly.
- DSPy-style prompt compilation is migrating into frameworks. Watch for langchain-dspy and haystack-dspy style integrations that let you compile a specific chain or pipeline against a metric.
Quick reference — pick X when
| Pick | When |
|---|---|
| LangChain | Maximum integration breadth, LangSmith tracing, JS/TS support needed. |
| LlamaIndex | Document-centric RAG, multi-document reasoning, knowledge graphs. |
| AutoGen | Multi-agent code execution loops, structured debates, Docker sandbox. |
| CrewAI | Role-driven multi-step workflows, YAML-driven config, business-friendly abstractions. |
| Haystack | Production RAG with strong evaluation, pipeline DAG visibility, REST deploys. |
| Semantic Kernel | .NET / Java host applications, OpenTelemetry, dependency injection. |
| DSPy | Optimising prompts against a metric across LLMs. |
| MCP (FastMCP / mcp-go) | Tools consumed by multiple LLM clients. |
| Raw provider SDK | Single-purpose CLI / Worker, latency-critical, minimal deps. |