cheat sheet

Agent Frameworks Comparison

Side-by-side comparison of LangChain, LlamaIndex, AutoGen, CrewAI, Haystack, and Semantic Kernel for building LLM-powered applications and agent systems. Covers strengths, weaknesses, and when to pick each.

updated 05-25-2026

Agent Frameworks Comparison — Decision Matrix

What it is

This page compares the six most-deployed open-source frameworks for building LLM applications and multi-agent systems in 2026: LangChain, LlamaIndex, AutoGen, CrewAI, Haystack, and Semantic Kernel. The goal is to help you pick — not to pick for you. The right answer depends on your runtime (Python, .NET, Node), the dominant workload (RAG, agent loops, multi-stage workflows), and the team's tolerance for breaking-change churn.

Where useful, the page also references DSPy (a prompt-compilation framework, not a typical agent framework), MCP-based custom servers, and naked SDK code with the model provider's own tool-calling API.

For deeper material on each framework, see the linked pages.

When / why to read this

You are starting a new LLM application and want a defensible framework choice.
You inherited a LangChain or LlamaIndex codebase and are considering migration.
You are integrating LLM features into an existing .NET or Java app.
You want to compare a current framework against the alternatives before committing to a six-month roadmap.

The frameworks at a glance

Framework	Primary language	First commit	Maintained by	Core abstraction
LangChain	Python, JS	Oct 2022	LangChain Inc.	LCEL pipe-composed runnables
LlamaIndex	Python	Nov 2022	LlamaIndex Inc.	Indexes, query engines, agents
AutoGen	Python, .NET	Mar 2023	Microsoft Research	Conversable agents, group chats
CrewAI	Python	Oct 2023	crewAI Inc.	Role + goal + backstory + Crew
Haystack	Python	Nov 2019	deepset	Pipeline DAG of typed components
Semantic Kernel	Python, .NET, Java	Mar 2023	Microsoft	Kernel + plugins + functions

High-level positioning

LangChain — the kitchen-sink framework. Most integrations, biggest community, fastest-moving API. Best when you want every model, vector store, and tool already wrapped, and you tolerate occasional churn.
LlamaIndex — the data framework. Sharper than LangChain on document-centric RAG (indexes, query engines, knowledge graphs, multi-document reasoning). Lighter on general agents.
AutoGen — the multi-agent conversation framework. Strong for "two agents review each other" patterns, code generation + execution loops, and structured multi-agent debates.
CrewAI — the role-based crew framework. Optimised for long-horizon, multi-step workflows where each step has a distinct expert "agent". YAML-friendly.
Haystack — the pipeline framework. Explicit DAG of typed components, YAML-serialisable, deployable via hayhooks. Production-friendly for RAG with strong evaluation primitives.
Semantic Kernel — Microsoft's polyglot SDK. The right answer when you are in the .NET ecosystem; also a strong Python option when you want clean dependency injection and OpenTelemetry tracing.

Decision matrix — by use case

You want…	First pick	Strong second
RAG over a folder of PDFs	LlamaIndex	Haystack
Production RAG with strict eval	Haystack	LlamaIndex
Multi-document, multi-hop reasoning	LlamaIndex	Haystack
Tool-using agent with provider tool calling	LangChain	Semantic Kernel
Multi-agent code-review loop	AutoGen	CrewAI
Long workflow with role specialists (research → write → QA)	CrewAI	AutoGen
LLM features inside a .NET app	Semantic Kernel	LangChain.NET
Java-based stack	Semantic Kernel (Java)	LangChain4j
Tracing, evals, datasets across vendors	LangChain (+ LangSmith)	Haystack
Optimising prompts against a metric	DSPy	LangChain (manual)
Cross-LLM-client tool servers	MCP (FastMCP, mcp-go)	n/a
Embedded in a single small CLI	Provider SDK directly	LangChain
Streaming chat UI	LangChain or Semantic Kernel	LlamaIndex
Knowledge-graph RAG	LlamaIndex	Haystack
Hybrid retrieval (BM25 + dense + rerank)	Haystack	LlamaIndex

Feature matrix — capabilities

Capability	LangChain	LlamaIndex	AutoGen	CrewAI	Haystack	Semantic Kernel
Python	yes	yes	yes	yes	yes	yes
.NET / C#	partial	no	yes	no	no	yes
Java	community	no	community	no	community	yes
JS / TS	yes	yes (alpha)	partial	no	partial	partial
Provider tool calling	yes	yes	yes	yes	yes	yes
Streaming	yes	yes	yes	yes	yes	yes
Async-first	yes	yes	yes	partial	yes	yes
Built-in retrievers	many	many	few	basic	many	basic
Built-in vector stores	50+	30+	n/a	n/a	20+	10+
Multi-agent group chat	partial	partial	yes	yes	partial	preview
Planner / router	LCEL routing	RouterQueryEngine	GroupChatManager	Process.hierarchical	Branch component	deprecated planners
Code execution sandbox	tool	tool	yes (Docker, local)	tool	tool	tool
Memory abstraction	yes	yes	yes	yes	yes	yes
Evaluation primitives	LangSmith	basic	n/a	basic	yes (built-in)	basic
YAML / file config	partial	partial	partial	yes	yes	yes (.prompty)
OpenTelemetry	via LangSmith / OTel	community	yes	community	community	yes (built-in)
MCP client support	yes	yes	yes	community	community	yes
MCP server SDK alignment	yes	yes	yes	community	community	yes

Strengths and weaknesses — one paragraph each

LangChain

Strengths: by far the largest integration catalogue — every embedding model, vector store, document loader, and observability tool has a LangChain wrapper. LCEL composes pipelines with a uniform streaming/async/batch interface. LangSmith gives near-zero-effort tracing and dataset evaluation across LangChain and non-LangChain code.

Weaknesses: breaking changes are routine; imports migrate between releases. Heavy abstraction surface means tracing the actual prompt sent to the LLM often requires chain.get_prompts() or LangSmith. The "everything is a Runnable" philosophy can make simple tasks look complicated.

LlamaIndex

Strengths: the sharpest abstraction for "load → index → query" over your documents. Indexes go beyond plain vector search — keyword, knowledge graph, summary, document hierarchies. Query engines decompose multi-hop questions automatically. Better default chunking and ingestion pipelines than LangChain.

Weaknesses: agent surface is thinner than LangChain or AutoGen. The Settings global is convenient but bites teams that swap models mid-process. JS port lags behind Python by months.

AutoGen

Strengths: the cleanest mental model for multi-agent conversations. ConversableAgent, UserProxyAgent, CodeExecutorAgent map directly to recognisable roles. Code execution with Docker isolation works out of the box. The v0.4 rewrite introduced a typed, async, distributed runtime suitable for serverless and queues.

Weaknesses: v0.4 broke compatibility with v0.2 (pyautogen), so older tutorials are misleading. Group chats can loop or stall without careful termination conditions. Token costs from chatty multi-agent runs surprise teams.

CrewAI

Strengths: ergonomic role/goal/backstory model maps naturally to business workflows ("a researcher, a writer, an editor"). Sequential and hierarchical processes cover most workflow shapes. YAML-driven config lets non-engineers tune the crew.

Weaknesses: when a "crew" doesn't quite fit your workflow, customising is fiddly. Smaller integration catalogue than LangChain. Strong opinions about agent design that can constrain implementation patterns.

Haystack

Strengths: explicit pipeline DAG with typed sockets is the easiest framework to review in PRs. First-class evaluation primitives (ContextRelevance, Faithfulness, SAS). YAML serialisation + hayhooks deploys pipelines as REST endpoints without extra glue. 2.x rewrite is clean and stable.

Weaknesses: smaller ecosystem of pre-built integrations than LangChain or LlamaIndex. Less focus on the "agent loop" pattern — possible, but not the primary metaphor. .NET and JS support are community efforts.

Semantic Kernel

Strengths: the right answer in .NET shops, where dependency injection, OpenTelemetry, async, and Microsoft.Extensions.AI all align. Plugins map cleanly to existing .NET service architectures. The .prompty file format is portable across SK, Azure AI Foundry, and external tools. Strong agent and workflow stories are arriving in 1.x.

Weaknesses: Python SDK trails .NET in features and stability. Naming conventions ("kernel functions", "skills" historically) take some adjustment. Smaller community than LangChain.

When to use no framework

Sometimes the right answer is no framework at all.

The Anthropic / OpenAI / Google SDKs already do streaming, tool calling, retries, and async. If your application is "call model with tools" and that is it, a 200-line direct-SDK implementation is easier to maintain than a LangChain chain.
For tool servers consumed by multiple clients, use MCP directly (see MCP frameworks) — that is the framework.
For prompt-quality work where the metric matters more than the orchestration, use DSPy to compile prompts, then expose the result through any of the frameworks above.
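
A direct-SDK "call model with tools" loop can stay small. The sketch below shows only the control flow: `fake_model` is a hypothetical stand-in for a provider SDK call, and the message and tool-call shapes are simplified assumptions, not any provider's actual wire format.

```python
import json
from typing import Callable

# Hypothetical tool registry: plain Python functions keyed by name.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}

def fake_model(messages: list[dict]) -> dict:
    """Stand-in for a provider SDK call: requests one tool, then answers."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"type": "text", "text": f"Forecast: {last['content']}"}
    return {"type": "tool_call", "name": "get_weather",
            "arguments": json.dumps({"city": "Oslo"})}

def run_with_tools(question: str, call_model=fake_model, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_rounds):  # hard cap so a confused model cannot loop forever
        reply = call_model(messages)
        if reply["type"] == "text":
            return reply["text"]
        # Run the requested tool and feed its result back to the model.
        args = json.loads(reply["arguments"])
        result = TOOLS[reply["name"]](**args)
        messages.append({"role": "tool", "name": reply["name"], "content": result})
    raise RuntimeError("model did not produce a final answer")

print(run_with_tools("What is the weather in Oslo?"))  # Forecast: Sunny in Oslo
```

Swapping `fake_model` for a real SDK call is the only provider-specific change; the loop, the round cap, and the tool registry stay the same.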

Migration sketches

Patterns observed in real migrations. Most are achievable in a week of focused work for a small RAG project.

LangChain → Haystack

ChatModel → OpenAIChatGenerator / OpenAIGenerator.
PromptTemplate → PromptBuilder(template=...) (Jinja2 instead of f-string).
Retriever → InMemoryEmbeddingRetriever (or per-store retriever).
LCEL pipe prompt | model | parser → Pipeline.add_component + connect().
AgentExecutor → Agent from haystack.components.agents plus tools.

LlamaIndex → LangChain

VectorStoreIndex.as_query_engine() → custom LCEL chain with retriever + prompt + model.
Settings.llm → constructor-level ChatModel instances; no global state.
query_engine.query(q) → chain.invoke({"question": q}).

CrewAI → AutoGen

Agent → AssistantAgent with focused system_message.
Task description → first user message to the relevant agent.
Crew(process=Process.sequential) → chain of agent.run_stream(task=...) calls.
Process.hierarchical → RoundRobinGroupChat or SelectorGroupChat with a manager agent.

Semantic Kernel → LangChain (rarely needed but asked about)

@kernel_function → @tool decorator.
Kernel-registered plugins → tools bound to model via model.bind_tools([...]).
.prompty files → ChatPromptTemplate.from_messages.
FunctionChoiceBehavior.Auto() → create_tool_calling_agent with AgentExecutor.

Performance and operational considerations

Concern	What to ask	Notes
Tokens per task	"How many LLM calls per user request?"	AutoGen/CrewAI multi-agent loops can be 5–20×; budget accordingly.
Cold start	"How long until first token after process start?"	LangChain and LlamaIndex typically add 1–3 s of import time; SK and Haystack are leaner; the raw SDK is fastest.
Memory footprint	"What does the framework load at import time?"	LangChain imports a lot transitively; lazy-import provider subpackages where you can.
Observability	"Can I see the exact prompt sent to the LLM?"	LangSmith (LangChain), built-in tracing (SK). Haystack's `pipeline.draw()` shows the graph only; use `include_outputs_from` on `Pipeline.run` to capture the rendered prompt.
Determinism	"Can I replay a failed run with the same inputs?"	All frameworks let you pin temperature to 0, which reduces but does not eliminate run-to-run variation; only LangSmith and SK ship reproducible run records by default.
Sandbox safety	"Where does generated code execute?"	AutoGen Docker executor is the cleanest; everywhere else use a sandbox you bring yourself.
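
The cold-start question is easy to check empirically. A minimal sketch, assuming a fresh interpreter per measurement is an acceptable proxy for process start; `cold_import_seconds` is a hypothetical helper name, and the stdlib modules in the usage loop are placeholders for `langchain`, `llama_index`, `haystack`, or `semantic_kernel`.

```python
import subprocess
import sys
import time

def cold_import_seconds(module_name: str) -> float:
    """Approximate import cost in a fresh interpreter, net of interpreter startup."""
    def timed(code: str) -> float:
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", code], check=True)
        return time.perf_counter() - start

    baseline = timed("pass")  # bare interpreter startup
    return max(0.0, timed(f"import {module_name}") - baseline)

# Replace these with the framework packages installed in your environment.
for name in ["json", "decimal"]:
    print(f"{name:10s} {cold_import_seconds(name):.3f}s")
```

A single sample is noisy; repeating each measurement a few times and taking the median gives a steadier number.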

Real-world recipes

Recipe — chunk-and-evaluate harness portable across frameworks

A pipeline-agnostic eval set ({"question": ..., "ground_truth": ...}) plus a metric function (semantic_match(predicted, gold) -> bool) lets you A/B any of these frameworks against the same questions. Run each framework's pipeline once per question, store predictions in predictions/<framework>.jsonl, then compute the metric and report a markdown table.

python

import json
from pathlib import Path

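def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]

def evaluate(pred_file: str, gold_file: str, metric) -> float:
    preds = load_jsonl(pred_file)
    gold = load_jsonl(gold_file)
    # zip() silently truncates, so fail loudly if the files are misaligned or empty.
    if not gold or len(preds) != len(gold):
        raise ValueError(f"{pred_file}: expected {len(gold)} predictions, got {len(preds)}")
    # Gold records use the "ground_truth" key described above.
    hits = sum(bool(metric(p["answer"], g["ground_truth"])) for p, g in zip(preds, gold))
    return hits / len(gold)

# semantic_match(predicted, gold) -> bool is the metric function you supply.
for name in ["langchain", "llamaindex", "haystack", "semantic-kernel"]:
    score = evaluate(f"predictions/{name}.jsonl", "gold/qa.jsonl", semantic_match)
    print(f"{name:18s} {score:.2%}")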

Output:

text

langchain          78.50%
llamaindex         82.00%
haystack           81.00%
semantic-kernel    79.50%

Numbers are illustrative — always run on your own dataset before deciding.
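
The recipe leaves the metric and the markdown report to the reader. One possible shape is sketched below: this `semantic_match` is a crude token-overlap stand-in (an assumption, not a real semantic metric), and `markdown_table` is a hypothetical helper name.

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def semantic_match(predicted: str, gold: str, threshold: float = 0.6) -> bool:
    """Stand-in metric: share of gold tokens that also appear in the prediction."""
    gold_tokens = _tokens(gold)
    if not gold_tokens:
        return False
    overlap = len(gold_tokens & _tokens(predicted)) / len(gold_tokens)
    return overlap >= threshold

def markdown_table(scores: dict[str, float]) -> str:
    """Render framework scores as a markdown table, best score first."""
    rows = ["| Framework | Accuracy |", "| --- | --- |"]
    for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        rows.append(f"| {name} | {score:.1%} |")
    return "\n".join(rows)

print(markdown_table({"framework-a": 0.5, "framework-b": 0.75}))
```

For real comparisons, an embedding-similarity or LLM-judge metric with the same `(predicted, gold) -> bool` signature can be dropped in without touching the harness.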

Recipe — shared MCP tool layer across frameworks

Write business tools once as an MCP server; consume from whichever framework is best per workload.

python

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("acme-tools")

@mcp.tool()
def get_customer(customer_id: str) -> dict:
    """Look up a customer by ID."""
    ...

if __name__ == "__main__":
    mcp.run()

Then consume the tools from LangChain (via langchain-mcp-adapters), Semantic Kernel (via its MCP plugin connectors in semantic_kernel.connectors.mcp), or AutoGen (via autogen_ext.tools.mcp) without rewriting them. See the MCP frameworks page for client wiring.

Recipe — gradually replacing LangChain with raw SDK

Identify a chain that does only prompt | model | parser — these are 80% of LangChain code in many apps.
Replace with a 30-line function that calls the provider SDK directly.
Move retries, structured outputs, and tracing to dedicated single-purpose libraries (tenacity, Pydantic, OpenTelemetry).
Keep LangChain only for chains that genuinely use multi-step LCEL behaviour.

The result is fewer dependencies, fewer breaking changes per release, and clearer error messages.
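
As a rough illustration of the replacement step, here is a prompt → model → parser function with a retry loop, using only the standard library. `call_model` is a hypothetical callable standing in for the provider SDK call, and the JSON reply contract is an assumption made for the example.

```python
import json
import time
from typing import Callable

def answer_json(question: str, call_model: Callable[[str], str],
                retries: int = 3, backoff_s: float = 0.0) -> dict:
    """prompt -> model -> parser, retrying when the call or the parse fails."""
    prompt = f"Reply with a JSON object containing the key 'answer'.\nQuestion: {question}"
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            raw = call_model(prompt)   # the only provider-specific line
            parsed = json.loads(raw)   # parser step; JSONDecodeError is a ValueError
            if not isinstance(parsed, dict) or "answer" not in parsed:
                raise ValueError("reply is not an object with an 'answer' key")
            return parsed
        except (ValueError, ConnectionError) as err:
            last_error = err
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"gave up after {retries} attempts") from last_error

# Usage with a fake model that fails once, then returns valid JSON.
replies = iter(["not json", '{"answer": "42"}'])
print(answer_json("What is 6 x 7?", lambda prompt: next(replies)))  # {'answer': '42'}
```

In a real codebase the hand-rolled loop would likely give way to tenacity and the dict check to a Pydantic model, as the list above suggests.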

Recipe — multi-framework strategy in one product

It is legitimate to use more than one framework in a single product. A common shape:

Ingestion: Haystack pipelines for batch indexing with hayhooks as a deployment artefact.
Online retrieval + answering: LangChain LCEL chain because LangSmith tracing on every user request is valuable.
Background research crew: CrewAI runs a research → write → QA flow for long-form content.
Embedded tools: an MCP server exposes domain tools to all of the above and to Claude Code / Codex CLI for engineers.

Keep the interfaces small and the framework boundaries crisp; the orchestration code that ties them together is the actual product.

Anti-patterns

Picking by GitHub star count — popularity correlates with integration breadth, not fitness for your problem. CrewAI is smaller than LangChain but better for role-driven workflows; Haystack is smaller still but better for evaluation-heavy RAG.

Wrapping the framework in your own abstraction "in case we swap" — the leak rate is high. Either commit to the framework or write thin direct-SDK code. Premature framework-agnostic abstractions are usually wasted.

Adopting two competing frameworks for the same workload — running LangChain and LlamaIndex side-by-side on the same RAG flow doubles dependency surface and confuses contributors. Multi-framework is fine when each framework owns a distinct workload (above), not when they overlap.

No metric, infinite tuning — for any non-trivial agent or RAG, define an evaluation set and a metric before picking a framework. Most framework choice arguments dissolve once a metric exists.

Selection guide — three questions

What language is the host application?
- .NET / Java → Semantic Kernel.
- Python → continue.
- JS / TS → LangChain (others trail).
What is the dominant workload?
- Documents → LlamaIndex or Haystack.
- Multi-agent loop → AutoGen or CrewAI.
- Single chain with tools → LangChain or Semantic Kernel.
- Multi-client tool exposure → MCP, not a framework.
How important is evaluation?
- First-class metric → Haystack or LangChain + LangSmith.
- Best-effort dev iteration → any of them.
- Compiled prompts → DSPy, then expose the result through any framework.
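
The three questions can be read as a small decision function. The sketch below encodes them literally; `pick_framework` and its workload labels are hypothetical names chosen for this example, and the output is a shortlist to spike against, not a verdict.

```python
def pick_framework(language: str, workload: str, eval_first_class: bool = False) -> list[str]:
    """Return a shortlist following the three selection questions, in order."""
    language, workload = language.lower(), workload.lower()

    # Question 1: host language.
    if language in {".net", "c#", "java"}:
        return ["Semantic Kernel"]
    if language in {"js", "ts", "javascript", "typescript"}:
        return ["LangChain"]

    # Question 2: dominant workload (Python hosts).
    by_workload = {
        "documents": ["LlamaIndex", "Haystack"],
        "multi-agent": ["AutoGen", "CrewAI"],
        "single-chain-tools": ["LangChain", "Semantic Kernel"],
        "multi-client-tools": ["MCP"],
    }
    if workload not in by_workload:
        raise ValueError(f"unknown workload: {workload!r}")
    shortlist = list(by_workload[workload])

    # Question 3: when evaluation is first-class, move Haystack / LangChain to the front.
    if eval_first_class:
        shortlist.sort(key=lambda name: name not in {"Haystack", "LangChain"})
    return shortlist

print(pick_framework("python", "documents", eval_first_class=True))  # ['Haystack', 'LlamaIndex']
```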

Side-by-side code — "answer a question against three docs"

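A minimal question-answering flow shown in each framework, with the same question and the same expected output. The LangChain, LlamaIndex, and Haystack versions retrieve from a vector store; the Semantic Kernel, AutoGen, and CrewAI versions read ./docs.txt and pass its text to the model directly, so compare those for API shape rather than retrieval behaviour. Read top to bottom; the verbosity differences are real.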

LangChain

python

from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

vs = Chroma(
    collection_name="docs",
    embedding_function=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)
retriever = vs.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer from the context only.\n\n{context}"),
    ("human", "{q}"),
])
model = ChatAnthropic(model="claude-sonnet-4-6")

def join(docs): return "\n\n".join(d.page_content for d in docs)

chain = {"context": retriever | join, "q": RunnablePassthrough()} | prompt | model | StrOutputParser()
print(chain.invoke("What does Haystack do?"))

Output:

text

Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

LlamaIndex

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = Anthropic(model="claude-sonnet-4-6")
Settings.embed_model = OpenAIEmbedding()

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
qe = index.as_query_engine()
print(qe.query("What does Haystack do?"))

Output:

text

Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

Haystack

python

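from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.dataclasses import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore
# AnthropicChatGenerator ships in the separate anthropic-haystack integration package.
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator

# Assumed to be filled beforehand by an indexing pipeline that uses the same embedding model.
store = InMemoryDocumentStore()

# Chat generators take a list of ChatMessage, so the prompt is built with ChatPromptBuilder.
template = [ChatMessage.from_user(
    "Answer from context.\n{% for d in documents %}{{ d.content }}\n{% endfor %}\nQ: {{ q }}"
)]

p = Pipeline()
p.add_component("emb", SentenceTransformersTextEmbedder(model="BAAI/bge-small-en-v1.5"))
p.add_component("ret", InMemoryEmbeddingRetriever(document_store=store, top_k=4))
p.add_component("prm", ChatPromptBuilder(template=template))
p.add_component("llm", AnthropicChatGenerator(model="claude-sonnet-4-6"))
p.connect("emb.embedding", "ret.query_embedding")
p.connect("ret.documents", "prm.documents")
p.connect("prm.prompt", "llm.messages")

q = "What does Haystack do?"
print(p.run({"emb": {"text": q}, "prm": {"q": q}})["llm"]["replies"][0].text)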

Output:

text

Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

Semantic Kernel (Python)

python

import asyncio
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.contents.chat_history import ChatHistory

async def main():
    k = Kernel()
    k.add_service(OpenAIChatCompletion(service_id="chat", ai_model_id="gpt-4o-mini"))
    docs = open("./docs.txt").read()
    history = ChatHistory()
    history.add_user_message(f"Answer from context.\n{docs}\nQ: What does Haystack do?")
    chat = k.get_service("chat")
    settings = chat.get_prompt_execution_settings_class()(service_id="chat")
    reply = await chat.get_chat_message_content(chat_history=history, settings=settings, kernel=k)
    print(reply.content)

asyncio.run(main())

Output:

text

Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

AutoGen

python

import asyncio
from pathlib import Path

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main():
    client = OpenAIChatCompletionClient(model="gpt-4o-mini")
    agent = AssistantAgent(name="answerer", model_client=client,
                           system_message="Answer from context. Be concise.")
    docs = Path("./docs.txt").read_text()
    # on_messages expects typed message objects and a CancellationToken, not raw dicts.
    question = TextMessage(content=f"{docs}\nQ: What does Haystack do?", source="user")
    result = await agent.on_messages([question], CancellationToken())
    print(result.chat_message.content)

asyncio.run(main())

Output:

text

Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

CrewAI

python

from crewai import Agent, Task, Crew, Process
from crewai_tools import FileReadTool

reader = Agent(role="Doc Reader", goal="Read context docs.",
               backstory="You read provided docs carefully.", tools=[FileReadTool()])
answerer = Agent(role="Answerer", goal="Answer using only the docs.",
                 backstory="Concise technical responder.")

t1 = Task(description="Read ./docs.txt.", expected_output="Verbatim doc contents.", agent=reader)
t2 = Task(description="Answer 'What does Haystack do?' from the read content.",
          expected_output="One sentence.", agent=answerer)

crew = Crew(agents=[reader, answerer], tasks=[t1, t2], process=Process.sequential)
print(crew.kickoff().raw)

Output:

text

Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

The six-way comparison shows the trade: LangChain and LlamaIndex compress retrieval + answering into a few lines via heavy abstractions; Haystack and AutoGen are explicit and verbose; CrewAI requires you to model the work as roles; Semantic Kernel sits close to the raw chat API.

Cost and latency notes

Hard numbers depend on your data, models, and prompts — these are rough rules of thumb worth confirming with traces.

Workload	Typical LLM calls	Comment
LangChain LCEL chain	1 per request	Add 1 if a tool call is needed.
LlamaIndex `query_engine.query`	1 per request	Sub-question decomposition can multiply this 3–5×.
Haystack RAG pipeline	1 per request	Reranker adds local inference, not LLM calls.
AutoGen single-agent	1 per turn, ~3–5 turns	Per task; group chats are 5–15+.
CrewAI sequential 3-role crew	3+ per kickoff	Hierarchical adds a manager round per task.
Semantic Kernel auto function call	1–2 per request	One per tool round-trip.
DSPy compiled program	1 per request at runtime	Compilation phase consumes hundreds offline.
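
To turn call counts into a budget, multiply through by token volume and price. A minimal sketch follows; the token counts and per-million-token prices are made-up placeholders, not any provider's actual rates.

```python
def estimate_cost_usd(llm_calls: int,
                      input_tokens_per_call: int,
                      output_tokens_per_call: int,
                      usd_per_million_input: float,
                      usd_per_million_output: float) -> float:
    """Estimated spend for one user request."""
    per_call = (input_tokens_per_call * usd_per_million_input
                + output_tokens_per_call * usd_per_million_output) / 1_000_000
    return llm_calls * per_call

# Hypothetical prices: 3 USD per million input tokens, 15 USD per million output tokens.
single_chain = estimate_cost_usd(1, 2000, 300, 3.0, 15.0)
group_chat = estimate_cost_usd(12, 2000, 300, 3.0, 15.0)
print(f"single chain: ${single_chain:.4f}  group chat: ${group_chat:.4f}")
```

Under these assumptions a 12-call group chat costs twelve times the single chain, which is why the call-count column matters more than the framework name.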

Caching at the framework boundary

LangChain: set_llm_cache(InMemoryCache()) or SQLiteCache(database_path="...") caches all LLM calls globally.
LlamaIndex: IngestionPipeline caches embeddings; LLM caching is per-callsite.
Haystack: no built-in LLM cache; wrap generators with a @component that hashes input and returns cached output.
AutoGen: ChatCompletionCache from autogen_ext.models.cache wraps a model client.
CrewAI: cache=True on agents for tool-result caching.
Semantic Kernel: filter-based caching (write an IFunctionInvocationFilter).

Always cache during eval and optimisation runs — the LLM call count dwarfs everything else.
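
Where a framework has no built-in LLM cache, the hash-the-input approach is only a few lines. The sketch below is framework-agnostic: `CachedModel` is a hypothetical wrapper, and `call_model` stands in for whatever generator or client you are wrapping.

```python
import hashlib
import json
from typing import Callable

class CachedModel:
    """Wrap a (model, prompt) -> text callable with an in-memory cache."""

    def __init__(self, call_model: Callable[[str, str], str]):
        self._call_model = call_model
        self._store: dict[str, str] = {}
        self.misses = 0

    def __call__(self, model: str, prompt: str) -> str:
        # Key on everything that changes the output; add temperature etc. if you vary them.
        key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._call_model(model, prompt)
        return self._store[key]

cached = CachedModel(lambda model, prompt: f"[{model}] echo: {prompt}")
cached("my-model", "hello")
cached("my-model", "hello")   # served from cache
print(cached.misses)          # 1
```

Caching sampled outputs freezes one sample per prompt, which is usually what an eval run wants but not necessarily what production traffic wants.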

Observability across frameworks

Tracing options that work today:

Framework	Best-in-class	Notes
LangChain	LangSmith	Set `LANGCHAIN_TRACING_V2=true` and `LANGCHAIN_API_KEY` (recent releases also read `LANGSMITH_TRACING` and `LANGSMITH_API_KEY`). Every chain call appears with prompts, tokens, latency.
LlamaIndex	LangSmith via `instrument`	`from llama_index.core import set_global_handler; set_global_handler("langfuse")` is also common.
Haystack	OpenTelemetry via `haystack.tracing`	Use `LangfuseTracer` or `OpenTelemetryTracer`.
AutoGen	OpenTelemetry built-in	Auto-spans for `agent.run`, `model.call`, `tool.call`.
CrewAI	AgentOps and Langfuse plugins	Set `AGENTOPS_API_KEY`; CrewAI hooks send full traces.
Semantic Kernel	OpenTelemetry built-in	Standard `gen_ai.*` semantic conventions.

For any framework, an OTel collector → Tempo / Honeycomb / Azure Monitor → Grafana is the most portable setup. Vendor SaaS (LangSmith, AgentOps, Langfuse) wins on rich LLM-specific UI; OTel wins on owning your data.
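
To make the "owning your data" point concrete, the sketch below records one LLM call as a span-like dict using only the standard library. `llm_span` and the list-based sink are hypothetical stand-ins for a real OpenTelemetry tracer and exporter; the attribute names follow the `gen_ai.*` naming style mentioned in the table and should be checked against the current semantic conventions before use.

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def llm_span(name: str, sink: list, **attributes):
    """Record one span as a dict; `sink` stands in for an exporter."""
    span = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        yield span["attributes"]  # caller adds token counts once they are known
    except Exception as err:
        span["attributes"]["error.type"] = type(err).__name__
        raise
    finally:
        span["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        sink.append(span)

spans: list[dict] = []
with llm_span("chat", spans, **{"gen_ai.request.model": "my-model"}) as attrs:
    # A real model call would go here; the token counts below are placeholders.
    attrs["gen_ai.usage.input_tokens"] = 120
    attrs["gen_ai.usage.output_tokens"] = 35

print(json.dumps(spans[0]["attributes"]))
```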

Compatibility with model providers

Most frameworks support OpenAI, Anthropic, Google, Azure OpenAI, Cohere, Mistral, and Hugging Face out of the box. Differences are at the edges.

Provider	LangChain	LlamaIndex	AutoGen	CrewAI	Haystack	Semantic Kernel
OpenAI	yes	yes	yes	yes	yes	yes
Anthropic Claude	yes	yes	yes (ext)	yes	yes	community
Google Gemini	yes	yes	community	yes	yes	yes
Azure OpenAI	yes	yes	yes	yes	yes	yes
Cohere	yes	yes	community	yes	yes	community
Mistral	yes	yes	yes	yes	yes	community
Hugging Face local	yes	yes	community	community	yes	community
Ollama	yes	yes	yes	yes	yes	yes
AWS Bedrock	yes	yes	community	yes	yes	community
OpenAI-compatible (vLLM, Groq, Fireworks, Together)	yes (point base URL)	yes	yes	yes	yes	yes

For any provider exposing an OpenAI-compatible endpoint, all six frameworks support it by setting base_url= on the OpenAI client config. This makes vLLM, Groq, Fireworks, Together, and Anyscale drop-in choices everywhere.
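
What "OpenAI-compatible" means in practice is that only the base URL (and key) changes while the request shape stays the same. The sketch below builds, but does not send, such a request with the standard library; the localhost URL, key, and model name are placeholders, and `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str, question: str) -> urllib.request.Request:
    """Build a chat-completions request for any OpenAI-compatible endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": question}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder values; point base_url at whichever compatible server you run.
req = build_chat_request("http://localhost:8000/v1", "EMPTY", "my-model", "ping")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/v1/chat/completions
```

Passing the request to `urllib.request.urlopen` would send it; the frameworks' `base_url=` setting amounts to changing the first argument here.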

Common pitfalls — framework selection

Choosing for capability you might need later — every framework is large enough that "might need" features add maintenance burden. Choose for the workload in front of you; migrating later is usually easier than predicted.

Tutorial-driven choice — the framework with the best tutorial is not necessarily the best for production. CrewAI is famously approachable in tutorials but has fewer escape hatches than LangChain at scale.

Underestimating evaluation cost — Haystack and LangChain (with LangSmith) make evaluation cheap; the others need glue. If you have not built an eval set, your framework choice is premature.

Build the same minimal pipeline (3 documents, 5 questions, one metric) in your top-two frameworks before committing. The decision usually resolves itself in a half-day spike.

Anatomy of a multi-framework production stack

A representative shape, distilled from several deployed systems.

text

+-----------------------------+        +-----------------------------+
|  Ingestion Worker (cron)    |        |  Online API (FastAPI/.NET)  |
|                             |        |                             |
|  Haystack indexing pipeline |        |  LangChain LCEL chain       |
|  - converters               |        |  - retriever (shared Chroma)|
|  - splitter                 |        |  - tools via MCP client     |
|  - embedder                 |        |  - LangSmith tracing        |
|  - writer (ChromaDB)        |        |  - streaming                |
+--------------+--------------+        +--------------+--------------+
               |                                      |
               v                                      v
+--------------------------------------+-----------------------------+
|        Chroma vector store (shared)  |  MCP servers (multiple)     |
|        (Postgres / S3 backed)        |  - github.py                |
|                                      |  - postgres.py              |
|                                      |  - search.py                |
+--------------------------------------+-----------------------------+
                |
                v
+-----------------------------+        +-----------------------------+
|  Background research crew   |        |  Eval / CI                  |
|  CrewAI: researcher → writer|        |  Haystack evaluators        |
|                             |        |  + LangSmith datasets       |
+-----------------------------+        +-----------------------------+

Each framework owns a single workload. The data plane (vector store, MCP tools) is shared. New workloads pick the right framework rather than forcing one to do everything.

Future-proofing notes

MCP is winning the tool-server story. Even if your app is single-client today, exposing tools as MCP servers separates concerns and lets you swap LLM clients without rewriting tools.
Microsoft.Extensions.AI is becoming the .NET-wide abstraction. Semantic Kernel already aligns to it; LangChain.NET and others are following.
Provider tool-calling APIs are converging. OpenAI, Anthropic, and Google all expose tool calling with similar shapes. Direct-SDK code is increasingly viable, and the unique value of "uniform model interface" frameworks shrinks accordingly.
DSPy-style prompt compilation is migrating into frameworks. Watch for langchain-dspy and haystack-dspy style integrations that let you compile a specific chain or pipeline against a metric.

Quick reference — pick X when

Pick	When
LangChain	Maximum integration breadth, LangSmith tracing, JS/TS support needed.
LlamaIndex	Document-centric RAG, multi-document reasoning, knowledge graphs.
AutoGen	Multi-agent code execution loops, structured debates, Docker sandbox.
CrewAI	Role-driven multi-step workflows, YAML-driven config, business-friendly abstractions.
Haystack	Production RAG with strong evaluation, pipeline DAG visibility, REST deploys.
Semantic Kernel	.NET / Java host applications, OpenTelemetry, dependency injection.
DSPy	Optimising prompts against a metric across LLMs.
MCP (FastMCP / mcp-go)	Tools consumed by multiple LLM clients.
Raw provider SDK	Single-purpose CLI / Worker, latency-critical, minimal deps.