Seven principles for designing MCP apps

Seven practical principles for building ChatGPT apps and MCP apps that the model actually invokes, grounded in enterprise work and an in-depth analysis of 316 live apps covering 2,178 tools.

TL;DR

Building ChatGPT apps and MCP apps is fundamentally different from building for web or mobile. Drawing on enterprise work plus an analysis of 316 live ChatGPT apps across 2,178 tools, the same seven principles keep showing up: start with goals and intents, treat tool descriptions like a system prompt, let the model reason, design for both surfaces, build for conversation, test in the real client, and optimize for distribution.

Principles for designing MCP Apps - talk at MCP Dev Summit 2026

Watch the talk on YouTube.

From the Fragmented Web to the Intent-Based Web

We've moved from a world where you find the tool to one where the tool finds you.

Previously, you went to Google, searched for Skyscanner, navigated to its website, filled in flight details, and filtered results. Now a user can open ChatGPT and type "Expedia flights to Chicago Oct 12-15," and the right app can surface inside the conversation at the exact moment of intent.

This is not a minor UX evolution. It is a shift in how products get discovered, used, and designed. The ecosystem is moving quickly, and at the time of writing our AgentDiscoverability platform tracks 389 ChatGPT apps and 228 Claude connectors with daily refreshes.

The framework below comes from two sources: direct work building apps for enterprise clients like Statista, ManpowerGroup, and Mitchells & Butlers, and an in-depth analysis of 316 live ChatGPT apps covering 2,178 tools.

How ChatGPT Apps Actually Work

Before the seven principles mean anything, you need to understand the architecture. Every ChatGPT app has three components, and the relationships between them determine where the product logic should live.

The Three Components

Conversation (ChatGPT) is the interface where the user expresses intent in natural language. It is the entry point, and everything starts there.

Tools (the MCP server) are the capabilities your app exposes. The model reads tool descriptions to decide which tool to call and with what parameters. The server handles API logic, authentication, security, and business rules.

Widget (the interactive UI) is the visual experience rendered back into the conversation. It usually lives inside a sandboxed iframe, is built with web technology, and gives you full control over how results are displayed and interacted with.

The flow is simple: the user prompts, the model selects a tool and calls it on your MCP server, and the widget renders the result back into the chat.

Take Booking.com as a concrete example. A user says, "show me hotels in New York for tomorrow." The model reads the tool descriptions, chooses the right hotel search tool, passes the parameters, and the widget renders filtered hotel listings directly inside ChatGPT.
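To make that flow concrete, here is a minimal sketch of the server side using the MCP TypeScript SDK. The tool name, schema, and findHotels helper are hypothetical stand-ins for illustration, not Booking.com's actual implementation.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "hotel-app", version: "1.0.0" });

// The description is what the model reads when deciding whether to call this
// tool for a prompt like "show me hotels in New York for tomorrow".
server.registerTool(
  "search_hotels",
  {
    description:
      "Search bookable hotels in a city for a date range. Use when the user " +
      "expresses hotel or accommodation intent for a specific destination.",
    inputSchema: {
      city: z.string(),
      checkIn: z.string().describe("ISO date, e.g. 2025-10-12"),
      checkOut: z.string().describe("ISO date"),
    },
  },
  async ({ city, checkIn, checkOut }) => {
    const hotels = await findHotels(city, checkIn, checkOut); // hypothetical backend call
    // The result flows back to the model; a widget bound to this tool (via the
    // Apps SDK's tool metadata) renders the listings inside the conversation.
    return { content: [{ type: "text", text: JSON.stringify(hotels) }] };
  }
);

// Hypothetical stand-in for your real hotel API.
async function findHotels(city: string, checkIn: string, checkOut: string) {
  return [{ name: "Hotel Example", city, checkIn, checkOut, price: 180 }];
}
```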

The Triangle of Responsibility

The deeper architecture is a triangle between the model, the MCP server, and the widget, and the direction of communication matters.

The model sits at the top. It orchestrates the experience, has full chat context and user memory, and decides when and how to invoke your app.

The MCP server defines and responds to tool calls. It is responsible for security, authentication, and business logic. The model talks to it through public tools, so that relationship is bidirectional.

The widget is a sandboxed iframe that renders UI. It reads tool results from the server via private tools, and it can insert context back into the model. Critically, the model cannot query the widget directly. That is a one-way boundary, and for enterprise use cases it matters.

For businesses handling private data, that boundary is a feature. Sensitive data can stay inside the widget without being exposed to the model for training.
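A rough sketch of what that boundary looks like from inside the widget. The shape of the window.openai bridge and its member names here are assumptions based on the Apps SDK, so treat them as illustrative rather than a reference.

```typescript
// Inside the sandboxed widget iframe. The bridge shape below is an assumption
// for illustration; consult the Apps SDK for the real surface.
interface OpenAiBridge {
  toolOutput?: unknown; // result of the tool call that rendered this widget
  callTool?(name: string, args: Record<string, unknown>): Promise<unknown>; // widget -> server
}
declare global {
  interface Window { openai?: OpenAiBridge }
}

// The widget reads the tool result the server returned for this render...
const listings =
  (window.openai?.toolOutput as { hotels?: unknown[] } | undefined)?.hotels ?? [];

// ...and can fetch more detail from the server without involving the model.
// Anything loaded this way stays inside the iframe: the model cannot reach
// into the widget and read it, which is exactly the one-way boundary above.
async function loadDetails(hotelId: string) {
  return window.openai?.callTool?.("get_hotel_details", { hotelId }); // hypothetical private tool
}

export { listings, loadDetails };
```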

Tool Descriptions Are the New SEO

One detail in this architecture deserves extra emphasis: tool descriptions are how ChatGPT decides which apps and tools to use.

They are not documentation. They behave much more like a system prompt the model rereads whenever it considers your app. That is why tool metadata now sits somewhere between runtime instruction, product UX, and SEO.

The Seven Principles

1. Define Your Goals and Intents

Everything is downstream of goals and intents. The flow is Goal -> Intents -> Tools -> Design.

Start by being honest about what you are optimizing for. Are you trying to learn, build brand visibility, generate leads, or drive conversion? The right design changes depending on that answer.

Then define the intents you are targeting. What are users actually typing into ChatGPT? This is different from web search and different from mobile behavior. The biggest mistake we see is building the app before defining the intent.

2. Your Tool Set Is a System Prompt

In our analysis of 316 apps and 2,178 tools, a few patterns showed up repeatedly in the best-performing tool descriptions:

  • Negative constraints appeared in 57% of apps.
  • Positive trigger examples appeared in 55%.
  • Invocation guidance appeared in 35%.
  • Output formatting guidance appeared in 34%.
  • Scope guardrails appeared in 24%.
  • Hidden internal tools appeared in 18%.
  • Error recovery guidance appeared in 17%.
  • Prerequisite enforcement appeared in 14%.
  • Human-in-the-loop instructions appeared in 12%.
  • Multi-turn context guidance appeared in just 7%.

Your tool description is not documentation. It is a system prompt the model reads every time it considers your app. Treat it with that level of rigor.
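As a sketch, here is what a description combining the most common of those patterns might look like. The wording is illustrative and not taken from any of the analyzed apps.

```typescript
// Illustrative only: one description exhibiting several of the patterns above.
export const searchHotelsDescription = [
  "Search the hotel catalogue and return bookable listings.",
  // Positive trigger examples:
  'Use when the user asks about hotels or stays in a city, e.g. "hotels in Rome next week".',
  // Negative constraints:
  "Do NOT use for flights, car rentals, restaurants, or general travel advice.",
  // Invocation guidance:
  "Always pass explicit check-in and check-out dates; infer them from the conversation if needed.",
  // Output formatting guidance:
  "Present prices exactly as returned. Do not estimate, convert currencies, or speculate on availability.",
  // Multi-turn context guidance:
  'If the user refines a search ("cheaper", "the week after"), call this tool again with updated parameters.',
].join("\n");
```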

3. Let the Model Do the Work

These models are extremely capable reasoners with access to context and memory you do not have on the server side. Structure your tools so the model can use that advantage.

Statista's MCP server is a good example. One tool searches the catalogue and returns candidate statistics. The model then does the ranking work using user intent, conversation context, and memory. A second tool retrieves the selected chart.

The same principle applies to personalization. Your server should provide the data. The model should do the reasoning about which subset is most relevant for this user in this moment.
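A sketch of that two-tool shape, in the same TypeScript server style as earlier. The tool names and the searchCatalogue and getChart helpers are hypothetical, not Statista's actual implementation.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "stats-app", version: "1.0.0" });

// Tool 1: return candidates and explicitly hand the ranking to the model.
server.registerTool(
  "search_statistics",
  {
    description:
      "Search the catalogue and return up to 20 candidate statistics. Rank the " +
      "candidates yourself using the user's intent, conversation context, and " +
      "memory, then call get_statistic with the id of the best match.",
    inputSchema: { query: z.string() },
  },
  async ({ query }) => ({
    content: [{ type: "text", text: JSON.stringify(await searchCatalogue(query)) }],
  })
);

// Tool 2: retrieve only what the model selected.
server.registerTool(
  "get_statistic",
  {
    description:
      "Retrieve the full chart for one statistic id chosen from search_statistics results.",
    inputSchema: { id: z.string() },
  },
  async ({ id }) => ({
    content: [{ type: "text", text: JSON.stringify(await getChart(id)) }],
  })
);

// Hypothetical catalogue calls standing in for the real backend.
async function searchCatalogue(query: string) { return [{ id: "s1", title: query }]; }
async function getChart(id: string) { return { id, points: [] }; }
```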

4. Design for Both Surfaces

Your app has two output surfaces: the widget and the chat response. The widget is the controlled display you design. The chat response is the model's interpretation. Both need to work together.

You fully control the widget. You influence the chat response through tool instructions. If that instruction layer is weak, the model can over-explain, paraphrase inaccurately, or editorialize on sensitive outputs.

Zillow's get_zestimate tool is a sharp example. Their tool instructions tightly constrain how the model should present the output so it does not speculate on valuations or add brand-risk commentary.
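In the same spirit, here is a hedged sketch of what that constraint layer can look like. This is not Zillow's actual metadata; the tool name and fetchEstimate helper are invented for illustration.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "property-app", version: "1.0.0" });

server.registerTool(
  "get_estimate",
  {
    // The second half of this description is the instruction layer: it tells
    // the model how to present the output, not just when to call the tool.
    description:
      "Return the automated value estimate for a property. Present the estimate " +
      "verbatim with its as-of date and range. Do not adjust the number, predict " +
      "future values, or comment on whether the price is fair; if asked, explain " +
      "that the estimate is a model output, not an appraisal.",
    inputSchema: { propertyId: z.string() },
  },
  async ({ propertyId }) => ({
    content: [{ type: "text", text: JSON.stringify(await fetchEstimate(propertyId)) }],
  })
);

// Hypothetical backend lookup.
async function fetchEstimate(propertyId: string) {
  return { propertyId, estimate: 412000, low: 391000, high: 430000, asOf: "2025-06-01" };
}
```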

5. Design for Conversation, Not a One-Off Prompt

Most teams still design like they are building a REST endpoint. Real users do not interact that way.

A real conversation looks like this: "hotels in Rome next week," then "actually make it cheaper," then "what about the week after?" If your tools are not designed for re-invocation and refinement, the experience breaks.

Only 7% of the 316 apps we analyzed included any multi-turn guidance in their tool descriptions. That is a huge missed opportunity.

Expedia gets this right by explicitly instructing the model to call the search tool again for follow-up hotel intent instead of answering from general knowledge.
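A sketch of what that looks like in the tool contract, revisiting the hypothetical hotel search tool from earlier: the description demands re-invocation, and optional parameters let follow-ups arrive as deltas.

```typescript
import { z } from "zod";

// Input schema designed for refinement: only the city is required, so a
// follow-up like "actually make it cheaper" becomes a re-invocation with a
// price cap. The model re-supplies dates and city from conversation context.
export const searchHotelsInput = {
  city: z.string(),
  checkIn: z.string().optional(),
  checkOut: z.string().optional(),
  maxPrice: z.number().optional(), // "actually make it cheaper"
};

export const searchHotelsDescription =
  "Search hotels in a city. For ANY follow-up hotel intent in this conversation " +
  '("cheaper", "what about the week after?"), call this tool again with the ' +
  "updated parameters instead of answering from general knowledge.";
```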

6. Make Testing Real

There is a material gap between staging behavior and live behavior, and you cannot hand-wave it away.

Across more than 15,000 test runs covering 83 apps, 8 models, and 4 personas, the "tool not called" rate was 3% with structured prompts and 22% with casual natural-language prompts.

That is the difference between a clean eval and a real user. Test at scale, use prompts that sound like your actual persona, and test inside the real client rather than relying only on staging or synthetic environments.

Also remember that many users are on smaller models than the one you are testing with. If your app only works on the ideal model, it is not actually production-ready.
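No particular harness is implied by those numbers, but the measurement itself is simple to sketch. runPrompt below is a hypothetical wrapper around whatever client automation you use; the point is to compare invocation rates across prompt styles.

```typescript
// Measure the "tool not called" rate for a batch of persona-styled prompts.
interface RunResult { toolCalled: boolean; model: string }

async function toolNotCalledRate(
  prompts: string[],
  runPrompt: (prompt: string) => Promise<RunResult> // hypothetical client wrapper
): Promise<number> {
  let missed = 0;
  for (const prompt of prompts) {
    if (!(await runPrompt(prompt)).toolCalled) missed++;
  }
  return missed / prompts.length;
}

// Structured prompts behave very differently from casual ones, so test both.
const structured = ["Search hotels in Rome, check-in 2025-06-12, check-out 2025-06-15."];
const casual = ["any cheap places to stay in rome next wk?", "need a hotel thurs-sun, nothing fancy"];
// Run each set across the models and personas you care about and compare the rates.
```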

7. Build for Distribution

You can be discovered in two ways: through static surfaces like app stores and registries, and through organic in-conversation surfacing.

Static discovery is your listing layer. Organic discovery is intent-based recommendation by the model. You need to optimize for both.

This is why tool descriptions matter so much. If your tools do not get called reliably, your visibility is effectively zero. We found a 14% average invocation failure rate across 83 apps. One in seven times, the app was invisible because no tool call happened at all.

Fewer, better tools consistently outperform larger tool sets. More surface area means more routing complexity and more chances for the model to miss.

Recommendations

Start with the triangle. Understand what belongs with the model, what belongs in the MCP server, and what belongs in the widget. That tells you which lever to pull when something breaks.

Then start with intent. Everything flows from what your user is actually typing into ChatGPT. Define the intent, map it to tools, and design backward from there.

Treat your tool description like a system prompt, because that is what it is in practice. It shapes whether you get called, how the model presents your output, and how often you win against competing apps.

Finally, test live. The gap between staged success and real-world invocation is too large to ignore.