cheat sheet

Agent Frameworks Comparison

Side-by-side comparison of LangChain, LlamaIndex, AutoGen, CrewAI, Haystack, and Semantic Kernel for building LLM-powered applications and agent systems. Covers strengths, weaknesses, and when to pick each.

Agent Frameworks Comparison — Decision Matrix

What it is

This page compares the six most-deployed open-source frameworks for building LLM applications and multi-agent systems in 2026: LangChain, LlamaIndex, AutoGen, CrewAI, Haystack, and Semantic Kernel. The goal is to help you pick — not to pick for you. The right answer depends on your runtime (Python, .NET, Node), the dominant workload (RAG, agent loops, multi-stage workflows), and the team's tolerance for breaking-change churn.

Where useful, the page also references DSPy (a prompt-compilation framework, not a typical agent framework), MCP-based custom servers, and plain SDK code written directly against the model provider's own tool-calling API.

For deeper material on each framework, see the linked pages.

When / why to read this

  • You are starting a new LLM application and want a defensible framework choice.
  • You inherited a LangChain or LlamaIndex codebase and are considering migration.
  • You are integrating LLM features into an existing .NET or Java app.
  • You want to compare a current framework against the alternatives before committing to a six-month roadmap.

The frameworks at a glance

| Framework | Primary language | First commit | Maintained by | Core abstraction |
|---|---|---|---|---|
| LangChain | Python, JS | Oct 2022 | LangChain Inc. | LCEL pipe-composed runnables |
| LlamaIndex | Python | Nov 2022 | LlamaIndex Inc. | Indexes, query engines, agents |
| AutoGen | Python, .NET | Mar 2023 | Microsoft Research | Conversable agents, group chats |
| CrewAI | Python | Oct 2023 | crewAI Inc. | Role + goal + backstory + Crew |
| Haystack | Python | Nov 2019 | deepset | Pipeline DAG of typed components |
| Semantic Kernel | Python, .NET, Java | Mar 2023 | Microsoft | Kernel + plugins + functions |

High-level positioning

  • LangChain — the kitchen-sink framework. Most integrations, biggest community, fastest-moving API. Best when you want every model, vector store, and tool already wrapped, and you tolerate occasional churn.
  • LlamaIndex — the data framework. Sharper than LangChain on document-centric RAG (indexes, query engines, knowledge graphs, multi-document reasoning). Lighter on general agents.
  • AutoGen — the multi-agent conversation framework. Strong for "two agents review each other" patterns, code generation + execution loops, and structured multi-agent debates.
  • CrewAI — the role-based crew framework. Optimised for long-horizon, multi-step workflows where each step has a distinct expert "agent". YAML-friendly.
  • Haystack — the pipeline framework. Explicit DAG of typed components, YAML-serialisable, deployable via hayhooks. Production-friendly for RAG with strong evaluation primitives.
  • Semantic Kernel — Microsoft's polyglot SDK. The right answer when you are in the .NET ecosystem; also a strong Python option when you want clean dependency injection and OpenTelemetry tracing.

Decision matrix — by use case

| You want… | First pick | Strong second |
|---|---|---|
| RAG over a folder of PDFs | LlamaIndex | Haystack |
| Production RAG with strict eval | Haystack | LlamaIndex |
| Multi-document, multi-hop reasoning | LlamaIndex | Haystack |
| Tool-using agent with provider tool calling | LangChain | Semantic Kernel |
| Multi-agent code-review loop | AutoGen | CrewAI |
| Long workflow with role specialists (research → write → QA) | CrewAI | AutoGen |
| LLM features inside a .NET app | Semantic Kernel | LangChain.NET |
| Java-based stack | Semantic Kernel (Java) | LangChain4j |
| Tracing, evals, datasets across vendors | LangChain (+ LangSmith) | Haystack |
| Optimising prompts against a metric | DSPy | LangChain (manual) |
| Cross-LLM-client tool servers | MCP (FastMCP, mcp-go) | n/a |
| Embedded in a single small CLI | Provider SDK directly | LangChain |
| Streaming chat UI | LangChain or Semantic Kernel | LlamaIndex |
| Knowledge-graph RAG | LlamaIndex | Haystack |
| Hybrid retrieval (BM25 + dense + rerank) | Haystack | LlamaIndex |

Feature matrix — capabilities

| Capability | LangChain | LlamaIndex | AutoGen | CrewAI | Haystack | Semantic Kernel |
|---|---|---|---|---|---|---|
| Python | yes | yes | yes | yes | yes | yes |
| .NET / C# | partial | no | yes | no | no | yes |
| Java | community | no | community | no | community | yes |
| JS / TS | yes | yes (alpha) | partial | no | partial | partial |
| Provider tool calling | yes | yes | yes | yes | yes | yes |
| Streaming | yes | yes | yes | yes | yes | yes |
| Async-first | yes | yes | yes | partial | yes | yes |
| Built-in retrievers | many | many | few | basic | many | basic |
| Built-in vector stores | 50+ | 30+ | n/a | n/a | 20+ | 10+ |
| Multi-agent group chat | partial | partial | yes | yes | partial | preview |
| Planner / router | LCEL routing | RouterQueryEngine | GroupChatManager | Process.hierarchical | Branch component | deprecated planners |
| Code execution sandbox | tool | tool | yes (Docker, local) | tool | tool | tool |
| Memory abstraction | yes | yes | yes | yes | yes | yes |
| Evaluation primitives | LangSmith | basic | n/a | basic | yes (built-in) | basic |
| YAML / file config | partial | partial | partial | yes | yes | yes (.prompty) |
| OpenTelemetry | via LangSmith / OTel | community | yes | community | community | yes (built-in) |
| MCP client support | yes | yes | yes | community | community | yes |
| MCP server SDK alignment | yes | yes | yes | community | community | yes |

Strengths and weaknesses — one paragraph each

LangChain

Strengths: by far the largest integration catalogue — every embedding model, vector store, document loader, and observability tool has a LangChain wrapper. LCEL composes pipelines with a uniform streaming/async/batch interface. LangSmith gives near-zero-effort tracing and dataset evaluation across LangChain and non-LangChain code.

Weaknesses: breaking changes are routine; imports migrate between releases. Heavy abstraction surface means tracing the actual prompt sent to the LLM often requires chain.get_prompts() or LangSmith. The "everything is a Runnable" philosophy can make simple tasks look complicated.

LlamaIndex

Strengths: the sharpest abstraction for "load → index → query" over your documents. Indexes go beyond plain vector search — keyword, knowledge graph, summary, document hierarchies. Query engines decompose multi-hop questions automatically. Better default chunking and ingestion pipelines than LangChain.

Weaknesses: agent surface is thinner than LangChain or AutoGen. The Settings global is convenient but bites teams that swap models mid-process. JS port lags behind Python by months.

AutoGen

Strengths: the cleanest mental model for multi-agent conversations. ConversableAgent, UserProxyAgent, CodeExecutorAgent map directly to recognisable roles. Code execution with Docker isolation works out of the box. The v0.4 rewrite introduced a typed, async, distributed runtime suitable for serverless and queues.

Weaknesses: v0.4 broke compatibility with v0.2 (pyautogen), so older tutorials are misleading. Group chats can loop or stall without careful termination conditions. Token costs from chatty multi-agent runs surprise teams.
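The looping risk is easier to see in a framework-independent sketch. The loop below is not AutoGen code: `run_group_chat`, the speaker functions, and the `TERMINATE` stop word are hypothetical names chosen for illustration. It shows the two guards that matter in practice, a hard cap on messages and an explicit stop signal; AutoGen's own termination-condition classes play this role in real code.

```python
from typing import Callable

# A speaker is any callable that reads the transcript and returns the next message.
Speaker = Callable[[list[str]], str]

def run_group_chat(speakers: list[Speaker], task: str,
                   max_messages: int = 12, stop_word: str = "TERMINATE") -> list[str]:
    """Round-robin chat that stops on a stop word or when the message cap is hit."""
    transcript = [task]
    turn = 0
    while len(transcript) < max_messages:        # guard 1: hard cap on messages
        message = speakers[turn % len(speakers)](transcript)
        transcript.append(message)
        if stop_word in message:                 # guard 2: explicit stop signal
            break
        turn += 1
    return transcript

def writer(transcript: list[str]) -> str:
    drafts = sum(m.startswith("draft") for m in transcript)
    return f"draft {drafts + 1}"

def reviewer(transcript: list[str]) -> str:
    drafts = sum(m.startswith("draft") for m in transcript)
    return "looks good TERMINATE" if drafts >= 2 else "needs work"

def stubborn_reviewer(transcript: list[str]) -> str:
    return "needs work"                          # never approves; only the cap stops it

approved = run_group_chat([writer, reviewer], "Write a haiku about DAGs.")
capped = run_group_chat([writer, stubborn_reviewer], "Write a haiku about DAGs.",
                        max_messages=6)
```

With the stubborn reviewer, only the message cap ends the run, which is the situation that produces surprising token bills when the cap is missing.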

CrewAI

Strengths: ergonomic role/goal/backstory model maps naturally to business workflows ("a researcher, a writer, an editor"). Sequential and hierarchical processes cover most workflow shapes. YAML-driven config lets non-engineers tune the crew.

Weaknesses: when a "crew" doesn't quite fit your workflow, customising is fiddly. Smaller integration catalogue than LangChain. Strong opinions about agent design that can constrain implementation patterns.

Haystack

Strengths: explicit pipeline DAG with typed sockets is the easiest framework to review in PRs. First-class evaluation primitives (ContextRelevance, Faithfulness, SAS). YAML serialisation + hayhooks deploys pipelines as REST endpoints without extra glue. 2.x rewrite is clean and stable.

Weaknesses: smaller ecosystem of pre-built integrations than LangChain or LlamaIndex. Less focus on the "agent loop" pattern — possible, but not the primary metaphor. .NET and JS support are community efforts.

Semantic Kernel

Strengths: the right answer in .NET shops, where dependency injection, OpenTelemetry, async, and Microsoft.Extensions.AI all align. Plugins map cleanly to existing .NET service architectures. The .prompty file format is portable across SK, Azure AI Foundry, and external tools. Strong agent and workflow stories are arriving in 1.x.

Weaknesses: Python SDK trails .NET in features and stability. Naming conventions ("kernel functions", "skills" historically) take some adjustment. Smaller community than LangChain.

When to use no framework

Sometimes the right answer is no framework at all.

  • The Anthropic / OpenAI / Google SDKs already do streaming, tool calling, retries, and async. If your application is "call model with tools" and that is it, a 200-line direct-SDK implementation is easier to maintain than a LangChain chain.
  • For tool servers consumed by multiple clients, use MCP directly (see MCP frameworks) — that is the framework.
  • For prompt-quality work where the metric matters more than the orchestration, use DSPy to compile prompts, then expose the result through any of the frameworks above.
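To make the "call model with tools" case concrete, here is a minimal sketch of the loop a direct-SDK implementation revolves around. It deliberately avoids any real provider SDK: `call_model` is a stand-in for whichever client you use, and the reply shape (`tool`, `args`, `content`) and the `TOOLS` registry are assumptions made for this example, not a provider format.

```python
import json
from typing import Callable

# Hypothetical tool registry: plain functions keyed by name.
TOOLS: dict[str, Callable[..., object]] = {
    "add": lambda a, b: a + b,
}

def run_tool_loop(call_model: Callable[[list[dict]], dict],
                  user_text: str, max_rounds: int = 5) -> str:
    """Call the model, run any tool it requests, and repeat until it answers."""
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_rounds):
        reply = call_model(messages)               # one LLM call per round
        if reply.get("tool") is None:
            return reply["content"]                # final answer, no tool requested
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "name": reply["tool"],
                         "content": json.dumps(result)})
    raise RuntimeError("model did not finish within max_rounds")

def fake_model(messages: list[dict]) -> dict:
    """Stand-in for a provider SDK call: requests one tool, then answers."""
    if messages[-1]["role"] == "tool":
        return {"tool": None, "content": f"The sum is {messages[-1]['content']}."}
    return {"tool": "add", "args": {"a": 2, "b": 3}}

answer = run_tool_loop(fake_model, "What is 2 + 3?")
print(answer)
```

Swapping `fake_model` for a real SDK call means translating that provider's tool-call response into the small dict used here; everything else stays the same.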

Migration sketches

Patterns observed in real migrations. Most are achievable in a week of focused work for a small RAG project.

LangChain → Haystack

  • ChatModel → OpenAIChatGenerator / OpenAIGenerator.
  • PromptTemplate → PromptBuilder(template=...) (Jinja2 instead of f-string).
  • Retriever → InMemoryEmbeddingRetriever (or per-store retriever).
  • LCEL pipe prompt | model | parser → Pipeline.add_component + connect().
  • AgentExecutor → Agent from haystack.components.agents plus tools.

LlamaIndex → LangChain

  • VectorStoreIndex.as_query_engine() → custom LCEL chain with retriever + prompt + model.
  • Settings.llm → constructor-level ChatModel instances; no global state.
  • query_engine.query(q) → chain.invoke({"question": q}).

CrewAI → AutoGen

  • Agent → AssistantAgent with focused system_message.
  • Task description → first user message to the relevant agent.
  • Crew(process=sequential) → chain of agent.run_stream(task=...) calls.
  • Process.hierarchical → RoundRobinGroupChat or SelectorGroupChat with a manager agent.

Semantic Kernel → LangChain (rarely needed but asked about)

  • @kernel_function → @tool decorator.
  • Kernel-registered plugins → tools bound to model via model.bind_tools([...]).
  • .prompty files → ChatPromptTemplate.from_messages.
  • FunctionChoiceBehavior.Auto() → create_tool_calling_agent with AgentExecutor.

Performance and operational considerations

| Concern | What to ask | Notes |
|---|---|---|
| Tokens per task | "How many LLM calls per user request?" | AutoGen/CrewAI multi-agent loops can use 5–20× the calls of a single chain; budget accordingly. |
| Cold start | "How long until first token after process start?" | LangChain and LlamaIndex typically add 1–3 s at startup; SK / Haystack are leaner; raw SDK is fastest. |
| Memory footprint | "What does the framework load at import time?" | LangChain imports a lot transitively; lazy-import provider subpackages where you can. |
| Observability | "Can I see the exact prompt sent to the LLM?" | LangSmith (LangChain), built-in tracing (SK), pipeline.draw() (Haystack). |
| Determinism | "Can I replay a failed run with the same inputs?" | All frameworks let you pin temperature to 0; only LangSmith and SK ship reproducible run records by default. |
| Sandbox safety | "Where does generated code execute?" | AutoGen Docker executor is the cleanest; everywhere else use a sandbox you bring yourself. |

Real-world recipes

Recipe — chunk-and-evaluate harness portable across frameworks

A pipeline-agnostic eval set ({"question": ..., "ground_truth": ...}) plus a metric function (semantic_match(predicted, gold) -> bool) lets you A/B any of these frameworks against the same questions. Run each framework's pipeline once per question, store predictions in predictions/<framework>.jsonl, then compute the metric and report a markdown table.

python
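import json
from pathlib import Path

def semantic_match(predicted: str, gold: str) -> bool:
    # Placeholder metric (normalised exact match); swap in your own semantic comparison.
    return predicted.strip().lower() == gold.strip().lower()

def load_jsonl(path: str) -> list[dict]:
    # One JSON object per non-empty line.
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]

def evaluate(pred_file: str, gold_file: str, metric) -> float:
    preds, gold = load_jsonl(pred_file), load_jsonl(gold_file)
    if len(preds) != len(gold):
        raise ValueError(f"{pred_file}: {len(preds)} predictions for {len(gold)} gold rows")
    # Gold rows use the "ground_truth" key described above; prediction rows are assumed to store "answer".
    return sum(metric(p["answer"], g["ground_truth"]) for p, g in zip(preds, gold)) / len(gold)

for name in ["langchain", "llamaindex", "haystack", "semantic-kernel"]:
    score = evaluate(f"predictions/{name}.jsonl", "gold/qa.jsonl", semantic_match)
    print(f"{name:18s} {score:.2%}")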

Output:

text
langchain          78.50%
llamaindex         82.00%
haystack           81.00%
semantic-kernel    79.50%

Numbers are illustrative — always run on your own dataset before deciding.
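The recipe leaves the choice of metric to you. One cheap starting point is token-overlap F1 with a threshold, sketched below. It is lexical rather than truly semantic, and the function names and the 0.6 threshold are assumptions for illustration; an embedding-based or LLM-judged comparison can replace it behind the same `(predicted, gold) -> bool` signature.

```python
import re
from collections import Counter

def _tokens(text: str) -> list[str]:
    # Lowercase alphanumeric tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(predicted: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    p, g = Counter(_tokens(predicted)), Counter(_tokens(gold))
    overlap = sum((p & g).values())          # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

def token_f1_match(predicted: str, gold: str, threshold: float = 0.6) -> bool:
    """Boolean metric usable wherever a (predicted, gold) -> bool function is expected."""
    return token_f1(predicted, gold) >= threshold
```

Tune the threshold on a handful of hand-labelled pairs before trusting the aggregate score.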

Recipe — shared MCP tool layer across frameworks

Write business tools once as an MCP server; consume from whichever framework is best per workload.

python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("acme-tools")

@mcp.tool()
def get_customer(customer_id: str) -> dict:
    """Look up a customer by ID."""
    ...

if __name__ == "__main__":
    mcp.run()

LangChain (via langchain-mcp-adapters), Semantic Kernel (via semantic-kernel-mcp), and AutoGen (via the official MCP client) can then consume the tools without rewriting them. See the MCP frameworks page for client wiring.

Recipe — gradually replacing LangChain with raw SDK

  1. Identify chains that do only prompt | model | parser; in many apps these account for roughly 80% of the LangChain code.
  2. Replace with a 30-line function that calls the provider SDK directly.
  3. Move retries, structured outputs, and tracing to dedicated libraries (tenacity, Pydantic, OpenTelemetry).
  4. Keep LangChain only for chains that genuinely use multi-step LCEL behaviour.

The result is fewer dependencies, fewer breaking changes per release, and clearer error messages.
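As a rough illustration of steps 2 and 3, the sketch below collapses a prompt | model | parser chain into one plain function. To stay dependency-free it uses a hand-rolled retry loop and a dataclass where the recipe suggests tenacity and Pydantic. `call_model` is a stand-in for your provider SDK call, and `Summary`, `PROMPT`, and `summarise` are hypothetical names.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Summary:
    title: str
    bullets: list[str]

PROMPT = "Summarise the text as JSON with keys 'title' and 'bullets'.\n\n{text}"

def summarise(text: str, call_model: Callable[[str], str],
              attempts: int = 3, base_delay: float = 0.0) -> Summary:
    """prompt -> model -> parser as one function, retrying on bad output."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            raw = call_model(PROMPT.format(text=text))   # the "model" step
            data = json.loads(raw)                       # the "parser" step
            return Summary(title=data["title"], bullets=list(data["bullets"]))
        except (json.JSONDecodeError, KeyError, ConnectionError) as exc:
            last_error = exc
            time.sleep(base_delay * 2 ** attempt)        # exponential backoff
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error

# Fake model for demonstration: returns garbage once, then valid JSON.
calls = {"n": 0}

def flaky_model(prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        return "not json"
    return json.dumps({"title": "Haystack", "bullets": ["pipelines", "typed components"]})

result = summarise("Haystack is a pipeline framework.", flaky_model)
print(result)
```

`base_delay` defaults to zero so the demo runs instantly; a real deployment would set a non-zero delay and narrow the retried exceptions to those the provider SDK actually raises.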

Recipe — multi-framework strategy in one product

It is legitimate to use more than one framework in a single product. A common shape:

  • Ingestion: Haystack pipelines for batch indexing with hayhooks as a deployment artefact.
  • Online retrieval + answering: LangChain LCEL chain because LangSmith tracing on every user request is valuable.
  • Background research crew: CrewAI runs a research → write → QA flow for long-form content.
  • Embedded tools: an MCP server exposes domain tools to all of the above and to Claude Code / Codex CLI for engineers.

Keep the interfaces small and the framework boundaries crisp; the orchestration code that ties them together is the actual product.

Anti-patterns

Picking by GitHub star count — popularity correlates with integration breadth, not fitness for your problem. CrewAI is smaller than LangChain but better for role-driven workflows; Haystack is smaller still but better for evaluation-heavy RAG.

Wrapping the framework in your own abstraction "in case we swap" — the leak rate is high. Either commit to the framework or write thin direct-SDK code. Premature framework-agnostic abstractions are usually wasted.

Adopting two competing frameworks for the same workload — running LangChain and LlamaIndex side-by-side on the same RAG flow doubles dependency surface and confuses contributors. Multi-framework is fine when each framework owns a distinct workload (above), not when they overlap.

No metric, infinite tuning — for any non-trivial agent or RAG, define an evaluation set and a metric before picking a framework. Most framework choice arguments dissolve once a metric exists.

Selection guide — three questions

  1. What language is the host application?

    • .NET / Java → Semantic Kernel.
    • Python → continue.
    • JS / TS → LangChain (others trail).
  2. What is the dominant workload?

    • Documents → LlamaIndex or Haystack.
    • Multi-agent loop → AutoGen or CrewAI.
    • Single chain with tools → LangChain or Semantic Kernel.
    • Multi-client tool exposure → MCP, not a framework.
  3. How important is evaluation?

    • First-class metric → Haystack or LangChain + LangSmith.
    • Best-effort dev iteration → any of them.
    • Compiled prompts → DSPy, then expose the result through any framework.
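The three questions can also be written down as a small lookup, which is handy if you want the rationale recorded next to an architecture decision. This is only a restatement of the guide above; the function name, the workload keys, and the evaluation labels are invented for this sketch.

```python
# Candidates per dominant workload, in the order the guide lists them.
WORKLOAD_PICKS = {
    "documents": ["LlamaIndex", "Haystack"],
    "multi-agent": ["AutoGen", "CrewAI"],
    "single-chain-tools": ["LangChain", "Semantic Kernel"],
    "multi-client-tools": ["MCP"],
}
# Frameworks the guide favours when evaluation is a first-class concern.
EVAL_FIRST = {"Haystack", "LangChain"}

def pick_framework(language: str, workload: str,
                   evaluation: str = "best-effort") -> list[str]:
    """Return candidate picks by asking the three questions in order."""
    language = language.lower()
    if language in {".net", "c#", "java"}:                 # question 1
        return ["Semantic Kernel"]
    if language in {"js", "ts", "javascript", "typescript"}:
        return ["LangChain"]
    candidates = list(WORKLOAD_PICKS[workload])            # question 2
    if evaluation == "first-class":                        # question 3
        # Stable sort: eval-friendly frameworks move to the front.
        candidates.sort(key=lambda name: name not in EVAL_FIRST)
    elif evaluation == "compiled-prompts":
        candidates.insert(0, "DSPy")
    return candidates

print(pick_framework("python", "documents", "first-class"))
```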

Side-by-side code — "answer a question against three docs"

A minimal RAG flow shown in each framework, same inputs, same expected output. Read top to bottom; the verbosity differences are real.

LangChain

python
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

vs = Chroma(
    collection_name="docs",
    embedding_function=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
)
retriever = vs.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer from the context only.\n\n{context}"),
    ("human", "{q}"),
])
model = ChatAnthropic(model="claude-sonnet-4-6")

def join(docs): return "\n\n".join(d.page_content for d in docs)

chain = {"context": retriever | join, "q": RunnablePassthrough()} | prompt | model | StrOutputParser()
print(chain.invoke("What does Haystack do?"))

Output:

text
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

LlamaIndex

python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = Anthropic(model="claude-sonnet-4-6")
Settings.embed_model = OpenAIEmbedding()

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)
qe = index.as_query_engine()
print(qe.query("What does Haystack do?"))

Output:

text
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

Haystack

python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
# AnthropicGenerator ships in the separate anthropic-haystack integration package.
from haystack_integrations.components.generators.anthropic import AnthropicGenerator

# Assumes the store has already been filled with documents embedded by the same model.
store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("emb", SentenceTransformersTextEmbedder(model="BAAI/bge-small-en-v1.5"))
p.add_component("ret", InMemoryEmbeddingRetriever(document_store=store, top_k=4))
p.add_component("prm", PromptBuilder(template="Answer from context.\n{% for d in documents %}{{ d.content }}\n{% endfor %}\nQ: {{ q }}"))
# PromptBuilder emits a plain string, so pair it with the non-chat generator, which accepts `prompt`.
p.add_component("llm", AnthropicGenerator(model="claude-sonnet-4-6"))
p.connect("emb.embedding", "ret.query_embedding")
p.connect("ret.documents", "prm.documents")
p.connect("prm.prompt", "llm.prompt")

q = "What does Haystack do?"
print(p.run({"emb": {"text": q}, "prm": {"q": q}})["llm"]["replies"][0])

Output:

text
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

Semantic Kernel (Python)

python
import asyncio
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.contents.chat_history import ChatHistory

async def main():
    k = Kernel()
    k.add_service(OpenAIChatCompletion(service_id="chat", ai_model_id="gpt-4o-mini"))
    docs = open("./docs.txt").read()
    history = ChatHistory()
    history.add_user_message(f"Answer from context.\n{docs}\nQ: What does Haystack do?")
    chat = k.get_service("chat")
    settings = chat.get_prompt_execution_settings_class()(service_id="chat")
    reply = await chat.get_chat_message_content(chat_history=history, settings=settings, kernel=k)
    print(reply.content)

asyncio.run(main())

Output:

text
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

AutoGen

python
import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main():
    client = OpenAIChatCompletionClient(model="gpt-4o-mini")
    agent = AssistantAgent(name="answerer", model_client=client,
                           system_message="Answer from context. Be concise.")
    docs = open("./docs.txt").read()
    # on_messages expects typed message objects and a cancellation token, not raw dicts.
    message = TextMessage(content=f"{docs}\nQ: What does Haystack do?", source="user")
    result = await agent.on_messages([message], CancellationToken())
    print(result.chat_message.content)

asyncio.run(main())

Output:

text
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

CrewAI

python
from crewai import Agent, Task, Crew, Process
from crewai_tools import FileReadTool

reader = Agent(role="Doc Reader", goal="Read context docs.",
               backstory="You read provided docs carefully.", tools=[FileReadTool()])
answerer = Agent(role="Answerer", goal="Answer using only the docs.",
                 backstory="Concise technical responder.")

t1 = Task(description="Read ./docs.txt.", expected_output="Verbatim doc contents.", agent=reader)
t2 = Task(description="Answer 'What does Haystack do?' from the read content.",
          expected_output="One sentence.", agent=answerer)

crew = Crew(agents=[reader, answerer], tasks=[t1, t2], process=Process.sequential)
print(crew.kickoff().raw)

Output:

text
Haystack is a Python framework for building LLM pipelines as directed graphs of typed components.

The six-way comparison shows the trade: LangChain and LlamaIndex compress retrieval + answering into a few lines via heavy abstractions; Haystack is explicit and verbose; CrewAI requires you to model the work as roles; the AutoGen and Semantic Kernel examples sit close to the raw chat API and paste the documents into the prompt instead of retrieving them.

Cost and latency notes

Hard numbers depend on your data, models, and prompts — these are rough rules of thumb worth confirming with traces.

| Workload | Typical LLM calls | Comment |
|---|---|---|
| LangChain LCEL chain | 1 per request | Add 1 if a tool call is needed. |
| LlamaIndex query_engine.query | 1 per request | Sub-question decomposition can multiply this 3–5×. |
| Haystack RAG pipeline | 1 per request | Reranker adds local inference, not LLM calls. |
| AutoGen single-agent | 1 per turn, ~3–5 turns | Per task; group chats are 5–15+. |
| CrewAI sequential 3-role crew | 3+ per kickoff | Hierarchical adds a manager round per task. |
| Semantic Kernel auto function call | 1–2 per request | One per tool round-trip. |
| DSPy compiled program | 1 per request at runtime | Compilation phase consumes hundreds offline. |
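To turn call counts into a budget, multiply calls per request by the per-call token cost. The sketch below does that arithmetic. The token counts and the prices ($3 per million input tokens, $15 per million output tokens) are hypothetical placeholders, so substitute your provider's current rates and your own traced token counts.

```python
def estimate_cost_usd(calls_per_request: float, input_tokens: int, output_tokens: int,
                      usd_per_m_input: float, usd_per_m_output: float,
                      requests: int = 1) -> float:
    """Rough spend estimate: calls x per-call token cost x number of requests."""
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return calls_per_request * per_call * requests

# Hypothetical workload: 2,000 input and 300 output tokens per call, 10,000 requests.
single_chain = estimate_cost_usd(1, 2_000, 300, 3.0, 15.0, requests=10_000)
group_chat = estimate_cost_usd(10, 2_000, 300, 3.0, 15.0, requests=10_000)

print(f"single chain: ${single_chain:,.2f}")
print(f"group chat:   ${group_chat:,.2f}")
```

Under these assumed numbers a ten-call group chat costs ten times the single chain (about $1,050 versus about $105). In practice the gap is often wider, because later turns carry the growing conversation as input tokens.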

Caching at the framework boundary

  • LangChain: set_llm_cache(InMemoryCache()) or SQLiteCache(database_path="...") caches all LLM calls globally.
  • LlamaIndex: IngestionPipeline caches embeddings; LLM caching is per-callsite.
  • Haystack: no built-in LLM cache; wrap generators with a @component that hashes input and returns cached output.
  • AutoGen: ChatCompletionCache from autogen_ext.models.cache, wrapped around a model client.
  • CrewAI: cache=True on agents for tool-result caching.
  • Semantic Kernel: filter-based caching (write an IFunctionInvocationFilter).

Always cache during eval and optimisation runs — the LLM call count dwarfs everything else.
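For frameworks without a built-in LLM cache, the hash-and-return pattern is small enough to write by hand. The sketch below is plain Python rather than a Haystack @component, and `CachedGenerator` and `fake_generate` are hypothetical names. The key is a hash of everything that changes the output, namely the model name, the prompt, and the generation parameters.

```python
import hashlib
import json
from typing import Callable

class CachedGenerator:
    """Wraps any generate(prompt, **params) callable with an in-memory cache."""

    def __init__(self, generate: Callable[..., str], model: str):
        self._generate = generate
        self._model = model
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str, params: dict) -> str:
        # Key on everything that can change the reply, in a stable order.
        payload = json.dumps({"model": self._model, "prompt": prompt, "params": params},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, prompt: str, **params) -> str:
        key = self._key(prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        reply = self._generate(prompt, **params)   # the only place the LLM is called
        self._store[key] = reply
        return reply

def fake_generate(prompt: str, **params) -> str:
    """Stand-in for a real generator call."""
    return prompt.upper()

gen = CachedGenerator(fake_generate, model="demo-model")
first = gen.run("what does haystack do?")
second = gen.run("what does haystack do?")                   # served from cache
third = gen.run("what does haystack do?", temperature=0.2)   # new params, new key
```

For eval runs that span processes, replace the dict with an on-disk store such as SQLite so that repeated runs reuse earlier replies.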

Observability across frameworks

Tracing options that work today:

| Framework | Best-in-class | Notes |
|---|---|---|
| LangChain | LangSmith | Set LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY. Every chain call appears with prompts, tokens, latency. |
| LlamaIndex | LangSmith via instrument | from llama_index.core import set_global_handler; set_global_handler("langfuse") is also common. |
| Haystack | OpenTelemetry via haystack.tracing | Use LangfuseTracer or OpenTelemetryTracer. |
| AutoGen | OpenTelemetry built-in | Auto-spans for agent.run, model.call, tool.call. |
| CrewAI | AgentOps and Langfuse plugins | Set AGENTOPS_API_KEY; CrewAI hooks send full traces. |
| Semantic Kernel | OpenTelemetry built-in | Standard gen_ai.* semantic conventions. |

For any framework, an OTel collector → Tempo / Honeycomb / Azure Monitor → Grafana is the most portable setup. Vendor SaaS (LangSmith, AgentOps, Langfuse) wins on rich LLM-specific UI; OTel wins on owning your data.

Compatibility with model providers

Most frameworks support OpenAI, Anthropic, Google, Azure OpenAI, Cohere, Mistral, and Hugging Face out of the box. Differences are at the edges.

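| Provider | LangChain | LlamaIndex | AutoGen | CrewAI | Haystack | Semantic Kernel |
|---|---|---|---|---|---|---|
| OpenAI | yes | yes | yes | yes | yes | yes |
| Anthropic Claude | yes | yes | yes (ext) | yes | yes | community |
| Google Gemini | yes | yes | community | yes | yes | yes |
| Azure OpenAI | yes | yes | yes | yes | yes | yes |
| Cohere | yes | yes | community | yes | yes | community |
| Mistral | yes | yes | yes | yes | yes | community |
| Hugging Face local | yes | yes | community | community | yes | community |
| Ollama | yes | yes | yes | yes | yes | yes |
| AWS Bedrock | yes | yes | community | yes | yes | community |
| OpenAI-compatible (vLLM, Groq, Fireworks, Together) | yes (point base URL) | yes | yes | yes | yes | yes |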

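For any provider exposing an OpenAI-compatible endpoint, all six frameworks can reach it by overriding the base URL on their OpenAI client configuration. The parameter name varies: for example base_url in LangChain and AutoGen, api_base in LlamaIndex, and api_base_url in Haystack. This makes vLLM, Groq, Fireworks, Together, and Anyscale drop-in choices everywhere.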
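The reason a base-URL override is enough is that the request itself does not change. The standard-library sketch below builds, but does not send, the same chat-completions request for two endpoints. The URLs, the API keys, and the model name are placeholders, not real services.

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str,
                       user_text: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for any compatible endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoints: a local server and a hosted one. Only the base URL and key differ.
local = build_chat_request("http://localhost:8000/v1", "unused", "my-model", "Hello")
hosted = build_chat_request("https://api.example.com/v1/", "sk-placeholder", "my-model", "Hello")

print(local.full_url)
print(hosted.full_url)
```

Passing either request to `urllib.request.urlopen` would send it. The frameworks do the equivalent through their OpenAI client once the base URL is overridden.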

Common pitfalls — framework selection

Choosing for capability you might need later — every framework is large enough that "might need" features add maintenance burden. Choose for the workload in front of you; migrating later is usually easier than predicted.

Tutorial-driven choice — the framework with the best tutorial is not necessarily the best for production. CrewAI is famously approachable in tutorials but has fewer escape hatches than LangChain at scale.

Underestimating evaluation cost — Haystack and LangChain (with LangSmith) make evaluation cheap; the others need glue. If you have not built an eval set, your framework choice is premature.

Build the same minimal pipeline (3 documents, 5 questions, one metric) in your top-two frameworks before committing. The decision usually resolves itself in a half-day spike.

Anatomy of a multi-framework production stack

A representative shape, distilled from several deployed systems.

text
+-----------------------------+        +-----------------------------+
|  Ingestion Worker (cron)    |        |  Online API (FastAPI/.NET)  |
|                             |        |                             |
|  Haystack indexing pipeline |        |  LangChain LCEL chain       |
|  - converters               |        |  - retriever (shared Chroma)|
|  - splitter                 |        |  - tools via MCP client     |
|  - embedder                 |        |  - LangSmith tracing        |
|  - writer (ChromaDB)        |        |  - streaming                |
+--------------+--------------+        +--------------+--------------+
               |                                      |
               v                                      v
+--------------------------------------+-----------------------------+
|        Chroma vector store (shared)  |  MCP servers (multiple)     |
|        (Postgres / S3 backed)        |  - github.py                |
|                                      |  - postgres.py              |
|                                      |  - search.py                |
+--------------------------------------+-----------------------------+
                |
                v
+-----------------------------+        +-----------------------------+
|  Background research crew   |        |  Eval / CI                  |
|  CrewAI: researcher → writer|        |  Haystack evaluators        |
|                             |        |  + LangSmith datasets       |
+-----------------------------+        +-----------------------------+

Each framework owns a single workload. The data plane (vector store, MCP tools) is shared. New workloads pick the right framework rather than forcing one to do everything.

Future-proofing notes

  • MCP is winning the tool-server story. Even if your app is single-client today, exposing tools as MCP servers separates concerns and lets you swap LLM clients without rewriting tools.
  • Microsoft.Extensions.AI is becoming the .NET-wide abstraction. Semantic Kernel already aligns to it; LangChain.NET and others are following.
  • Provider tool-calling APIs are converging. OpenAI, Anthropic, and Google all expose tool calling with similar shapes. Direct-SDK code is increasingly viable, and the unique value of "uniform model interface" frameworks shrinks accordingly.
  • DSPy-style prompt compilation is migrating into frameworks. Watch for langchain-dspy and haystack-dspy style integrations that let you compile a specific chain or pipeline against a metric.
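The convergence of tool-calling shapes is visible in how little code it takes to translate a tool definition between providers. The sketch below converts between the OpenAI Chat Completions layout (a `function` object with `parameters`) and the Anthropic layout (`input_schema`). The helper names and the `get_customer` tool are illustrative, and only the definition is covered, not the differing response formats.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """OpenAI Chat Completions tool definition -> Anthropic tool definition."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],        # same JSON Schema, different key
    }

def anthropic_tool_to_openai(tool: dict) -> dict:
    """Anthropic tool definition -> OpenAI Chat Completions tool definition."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],
        },
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "get_customer",
        "description": "Look up a customer by ID.",
        "parameters": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
round_trip = anthropic_tool_to_openai(anthropic_tool)
```

Both sides carry the same JSON Schema for arguments, which is why a thin adapter like this can stand in for a framework's uniform model interface in simple applications.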

Quick reference — pick X when

| Pick | When |
|---|---|
| LangChain | Maximum integration breadth, LangSmith tracing, JS/TS support needed. |
| LlamaIndex | Document-centric RAG, multi-document reasoning, knowledge graphs. |
| AutoGen | Multi-agent code execution loops, structured debates, Docker sandbox. |
| CrewAI | Role-driven multi-step workflows, YAML-driven config, business-friendly abstractions. |
| Haystack | Production RAG with strong evaluation, pipeline DAG visibility, REST deploys. |
| Semantic Kernel | .NET / Java host applications, OpenTelemetry, dependency injection. |
| DSPy | Optimising prompts against a metric across LLMs. |
| MCP (FastMCP / mcp-go) | Tools consumed by multiple LLM clients. |
| Raw provider SDK | Single-purpose CLI / Worker, latency-critical, minimal deps. |
Raw provider SDKSingle-purpose CLI / Worker, latency-critical, minimal deps.