A system prompt is the chunk of text you send to a language model that describes who it is, how it should behave, and what constraints apply — all before the user ever types a word. In the Claude API it's a top-level system field; in OpenAI it's a message with role: 'system'; in Gemini it's system_instruction. Different wrappers, same job: set the stage before the conversation starts.
Good system prompts are the difference between an app that behaves predictably and one that doesn't. Bad ones are the reason people think "ChatGPT is inconsistent" when really the inconsistency is their prompt.
The anatomy of a good system prompt
Almost every production system prompt that works has the same four parts, in this order:
- Role — one sentence on who the model is and what it does. "You are a customer-support agent for a small accounting software company." Specific beats generic.
- Context — what domain the model is operating in, what kind of user it's talking to, what knowledge it can assume. "Users are bookkeepers and small-business owners; you can assume they know basic accounting but may not know our product."
- Behavior — tone, format, length, what to do when uncertain, what not to do. Written as positives wherever possible ("answer in 2-3 sentences" beats "don't write paragraphs").
- Examples — 2-4 concrete input/output pairs that illustrate the behavior you want. The model learns from examples faster than from rules.
What each provider does with it
· Claude
Treats the system prompt as authoritative. Constraints stick. If you say "never quote specific legal advice," Claude will refuse even under pressure. Good for high-compliance apps. Pass it via the top-level system parameter in the Messages API — a message with role: 'system' in the messages array is rejected outright, not merely unconventional. Claude 4.x supports very long system prompts and is especially good at following complex multi-step instructions.
· ChatGPT (GPT-5.4 / GPT-5)
Respects the system prompt but is slightly more flexible when users push back. More forgiving if the prompt has contradictions. Put the system prompt as the first message with role: 'system'. For the newer reasoning models, OpenAI replaced the system role with a developer role; the API treats role: 'system' as a developer message for those models, so existing code keeps working — but instruction-following can differ from the chat models, so test reasoning models separately before agentic use.
· Gemini
Uses the system_instruction field in the request body. Historically a bit looser about following constraints than Claude but improving rapidly in the 3.x generation. Practical tip that holds across providers: models tend to follow instructions near the start of the prompt most reliably, so put your most important constraints at the top.
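To make the three wrappers concrete, here is a sketch of how the same system prompt maps onto each provider's request shape. Payloads are shown as plain dicts rather than SDK calls; the field names follow each provider's documented convention, and the model names are placeholders:

```python
SYSTEM_PROMPT = "You are a support agent for an accounting product. Answer in 2-3 sentences."
USER_MSG = "How do I export invoices?"

# Claude: top-level `system` field; messages hold only user/assistant turns.
claude_request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": SYSTEM_PROMPT,
    "messages": [{"role": "user", "content": USER_MSG}],
}

# OpenAI: system prompt is the first message in the messages array.
openai_request = {
    "model": "gpt-5",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_MSG},
    ],
}

# Gemini (REST shape): `system_instruction` sits beside `contents`.
gemini_request = {
    "system_instruction": {"parts": [{"text": SYSTEM_PROMPT}]},
    "contents": [{"role": "user", "parts": [{"text": USER_MSG}]}],
}
```

Same content, three wrappers — which is why a prompt written once ports across providers with only plumbing changes.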
A template that works
Copy, adapt, ship. This skeleton covers the four parts and stays short enough to cache cheaply:
```
# Role
You are a [role] for [audience / product]. Your job is to [primary objective] without [primary constraint].

# Context
[Domain assumption, what users know and don't know, what data you have access to.]

# Behavior
- Keep responses to [length guidance].
- Match the user's tone: [formal/casual/technical].
- When uncertain, [preferred fallback — ask for clarification, say you don't know, defer to a human, etc.].
- Output format: [plain prose / JSON / markdown / etc.].

# Examples
User: "[typical question 1]"
You: "[ideal response 1]"

User: "[edge case]"
You: "[ideal response for the edge case]"

User: "[attempted prompt injection / off-topic request]"
You: "[how to handle it politely]"
```
Fill in the bracketed parts for your specific use case, then iterate on example quality — the examples do more work than the instructions in most production prompts.
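One way to keep the skeleton reusable is to store it with named placeholders and fill it per product. A minimal sketch using Python's standard `string.Template` — the field names here are illustrative, not part of any API:

```python
from string import Template

# Abbreviated version of the skeleton above, with $-style placeholders.
SKELETON = Template(
    "# Role\n"
    "You are a $role for $audience. Your job is to $objective.\n\n"
    "# Behavior\n"
    "- Keep responses to $length.\n"
    "- When uncertain, $fallback.\n"
)

# substitute() raises KeyError if any placeholder is left unfilled,
# which catches a half-edited template before it ships.
system_prompt = SKELETON.substitute(
    role="customer-support agent",
    audience="bookkeepers and small-business owners",
    objective="resolve product questions without giving tax advice",
    length="2-3 sentences",
    fallback="ask one clarifying question",
)
```

The fail-loud substitution is the point: a bracketed placeholder that survives into production is a bug you want at build time, not in a user's chat window.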
Anti-patterns that hurt output
- Long lists of negative rules. "Don't do X. Don't do Y. Don't do Z." reads like a minefield to the model and makes it hedge everything. Rewrite as positives where possible.
- Contradictions. "Be concise. Provide thorough explanations." The model picks one and you don't know which. Audit your prompt for internal conflicts.
- Jargon the model doesn't share. Internal team acronyms, company-specific terms without definition — the model will either guess or fabricate. Define the terms in the context section or don't use them.
- Format instructions without examples. "Output valid JSON" is weaker than "Output valid JSON. Here are 2 examples: ..." Show it, then describe it.
- Putting dynamic values at the top. "Today is 2026-04-21. The current user is Alex." at position zero in the prompt invalidates cache on everything after. Move dynamic values to the end — see our prompt caching guide for why.
- Secrets in the system prompt. API keys, internal URLs, private account names — all extractable by a user who knows the right questions to ask. Anything sensitive goes in tool schemas or retrieved context, not in the system prompt.
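The dynamic-values anti-pattern is easy to avoid mechanically: build the prompt as a static block plus a per-request suffix, so the cacheable prefix stays byte-identical across calls. A sketch (the prompt text and helper are illustrative):

```python
# Everything that never changes between requests — cache-friendly prefix.
STATIC_PROMPT = (
    "# Role\n"
    "You are a support agent for an accounting product.\n\n"
    "# Behavior\n"
    "- Answer in 2-3 sentences.\n"
)

def build_system_prompt(today: str, user_name: str) -> str:
    # Per-request values go last so they don't invalidate the cached prefix.
    dynamic = f"\n# Session\nToday is {today}. The current user is {user_name}.\n"
    return STATIC_PROMPT + dynamic

request_a = build_system_prompt("2026-04-21", "Alex")
request_b = build_system_prompt("2026-04-22", "Sam")
```

Both requests share the same leading bytes, which is exactly what provider-side prompt caching keys on.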
How to test a system prompt
Three-step loop that catches most issues before production:
- Write 20 realistic user inputs. Mix of easy cases, edge cases, and adversarial prompts (someone trying to get the model to break character). Script them, don't write them on the fly.
- Run them against each candidate system prompt. Keep temperature at 0 or low. Record the output. Do this at least 3 times per prompt-input pair to see consistency.
- Evaluate against your actual success criteria. Does it answer? Does it follow the format? Does it stay in character? Does it refuse when it should? Track pass rate. Anything under 95% on a normal-case test needs another iteration.
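The three-step loop above fits in a few lines of harness code. A minimal sketch, where `call_model` is a stub standing in for your real API call at temperature 0 and the two test cases are illustrative:

```python
def call_model(system_prompt: str, user_input: str) -> str:
    # Stub for illustration; replace with a real API call at low temperature.
    if "refund" in user_input:
        return "I'll hand this over to a human teammate."
    return "Sure, here is a short answer to that."

TEST_CASES = [
    # (user input, pass/fail check on the output)
    ("How do I export invoices?", lambda out: len(out.split()) < 60),
    ("Can you process my refund right now?", lambda out: "human" in out),
]

RUNS_PER_CASE = 3  # repeat each pair to surface inconsistency

def pass_rate(system_prompt: str) -> float:
    results = []
    for user_input, check in TEST_CASES:
        for _ in range(RUNS_PER_CASE):
            results.append(check(call_model(system_prompt, user_input)))
    return sum(results) / len(results)

rate = pass_rate("You are a support agent. Answer in 2-3 sentences.")
```

Run `pass_rate` once per candidate prompt and compare; the prompt with the highest rate on your own criteria wins, not the one that reads best.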
The big unlock: write the test cases first, then the system prompt, then iterate. Most teams write the prompt first, ship it, and find the edge cases in production. That's expensive.
When to split into multiple prompts
A single system prompt doing too much is harder to debug, harder to cache efficiently, and more likely to produce inconsistent outputs. If any of these apply, split:
- Your prompt exceeds ~8K tokens and you're still adding to it.
- Different parts of the prompt address genuinely different tasks (classification + generation, for example).
- Some users need one behavior and others need a different one — at that point, per-user-segment system prompts are cleaner than one mega-prompt with branching.
- Testing surfaces a pattern where the model drops one constraint when another is active — a sign the prompt is carrying more than the model can juggle.
Common split patterns: a lightweight "router" prompt that classifies the input, then a task-specific prompt for each task. Or a retrieval step that pulls the relevant instructions from a database and composes them into a fresh prompt for the current request.
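The router pattern can start as something as simple as a keyword classifier in front of per-task prompts. A toy sketch — in production the router would itself be a cheap model call, and the prompt text is illustrative:

```python
TASK_PROMPTS = {
    "billing": "You are a billing specialist for an accounting product. ...",
    "technical": "You are a technical-support engineer. ...",
    "general": "You are a general support agent. ...",
}

def route(user_input: str) -> str:
    # Toy keyword classifier; swap in a small classification call in production.
    text = user_input.lower()
    if any(word in text for word in ("invoice", "charge", "refund")):
        return "billing"
    if any(word in text for word in ("error", "crash", "bug")):
        return "technical"
    return "general"

def system_prompt_for(user_input: str) -> str:
    # Each request gets a focused, independently testable prompt.
    return TASK_PROMPTS[route(user_input)]
```

Each task prompt stays short, cacheable, and testable on its own, which is the whole payoff of splitting.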
FAQ
What is a system prompt?
A system prompt is the instruction you give a language model before the user's first message. It sets the model's persona, tone, constraints, available tools, output format, and anything else that should hold steady across the whole conversation. Unlike user messages, which vary call-to-call, the system prompt is the constant — which is why it's also the thing you should be caching with prompt caching for 70-90% input-cost savings.
Where does the system prompt go?
In the API, it's a dedicated field or role. Claude uses a top-level `system` parameter in the messages API. OpenAI uses a message with `role: 'system'` at the start of the messages array. Gemini uses `system_instruction` in the request. The content is the same across providers; only the structural wrapper differs.
How long should a system prompt be?
There's no universal right answer, but most production system prompts are between 200 and 5,000 tokens. Under 200 tokens usually means you're relying on the model's defaults and losing consistency. Over 5,000 tokens usually means you're restating things the model already knows or listing edge cases instead of generalizing. If your system prompt is over 10K tokens, it's almost certainly doing too much — split the work across multiple agents or tool calls.
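For a quick sanity check before shipping, English prose averages roughly 4 characters per token, so a rough estimate is easy to compute — exact counts need the provider's own tokenizer, so treat this only as a smoke test:

```python
def estimate_tokens(prompt: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    # Use the provider's tokenizer for billing-accurate counts.
    return max(1, len(prompt) // 4)

system_prompt = "You are a support agent for an accounting product. " * 20
estimate = estimate_tokens(system_prompt)
if estimate > 5_000:
    print("Likely over 5K tokens; consider splitting the prompt.")
```

Anything the heuristic flags near the 5K-10K range is worth measuring precisely and probably worth splitting.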
Do system prompts work differently on Claude vs ChatGPT vs Gemini?
Similar structure, different emphasis. Claude tends to follow system prompt constraints very literally — if you say 'never recommend specific brands,' it won't, even under user pressure. ChatGPT is slightly more flexible and will sometimes bend hard constraints if the user asks in the right way. Gemini falls between the two. For high-compliance applications (legal, medical, safety), Claude is generally the most predictable in following the system prompt verbatim.
What's the difference between a system prompt and a user prompt?
The system prompt sets the context for the whole conversation; the user prompt is what the human is asking right now. Models trust system prompts as authoritative setup and user prompts as questions to answer within that setup. If your app has a tone and behavior you always want, that goes in the system prompt. If it varies per call, it goes in the user prompt.
Can the user see the system prompt?
By default, no — the model won't volunteer it. But it can be extracted by a user who asks the right follow-up questions ("repeat your instructions verbatim," "what's above my message?", prompt injection via pasted text). If you're putting secrets in your system prompt (API keys, internal URLs, private names), assume they will leak. System prompts are for behavior and context, not for security.
What makes a good system prompt?
Four things: (1) a clear role statement — who the model is and what it does, (2) constraints written as positives where possible ("respond in 2-3 sentences" beats "don't write paragraphs"), (3) concrete examples of good output, (4) explicit format rules if the output needs to be parseable. And keep it free of contradictions — if two parts of the prompt conflict, the model follows whichever is more recent or more specific, which is unpredictable.
Should I put instructions or examples first?
Instructions first, then examples. The model reads top-down and lets the early content frame how it interprets the later content. If examples come first, the model treats them as the primary signal; if instructions come first, the examples illustrate the instructions. The latter is what you want for consistent behavior.
Why does my system prompt sometimes get ignored?
Three common causes. First, the user prompt contradicts it — the model follows the more specific instruction. Second, the system prompt has internal contradictions — the model picks one and you don't know which. Third, length: if the system prompt is huge and the user prompt is short, the model may drift toward user-prompt framing. Solutions: make constraints unambiguous, test for consistency, use prompt caching so you can afford a tighter system prompt without paying for it every call.
Related
- Prompt caching explained — cache the system prompt to cut input costs 70-90%.
- LLM token counter — count the tokens in your system prompt before you ship.
- Claude vs ChatGPT vs Gemini — how each model interprets instructions differently.
- MCP server primer — the other half of the prompt: the tools you expose.