If you've ever hit a usage limit mid-session, gotten slower responses as a conversation goes on, or wondered why the AI seems to "forget" things toward the end of a long chat — you're experiencing the same problem. And it's not random. It's a token problem. And once you understand it, you can fix it.
A token is the smallest unit of text an AI model processes — roughly 0.75 words. Every input you send and every output you receive is measured in tokens. Your usage limit is essentially a token budget, and how you spend it determines how far you get in a session.
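As a rough illustration of that ratio, the sketch below estimates a token count from a word count. The function name `estimate_tokens` and the whitespace-based word split are assumptions made for this example; real tokenizers differ by model and language, so treat the result as a ballpark figure only.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate using the ~0.75 words-per-token rule of thumb."""
    word_count = len(text.split())
    return round(word_count / words_per_token)


sample = "word " * 4500  # stand-in for a 4,500-word document
print(estimate_tokens(sample))  # 6000
```

For an exact count you would need the tokenizer of the specific model you are using.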
This is the core issue most people don't know about. Every time you send a message, the AI doesn't just read your new message — it rereads the entire conversation from the beginning. Every single turn.
What that means in practice: message 1 might cost 500 tokens. Message 30 in the same conversation could cost 15,000 tokens, because it's reprocessing all 29 previous messages plus your new one. The cost of each message grows with the length of the conversation, so the total for a session grows roughly quadratically, not linearly.
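The 500 and 15,000 figures can be reproduced with a simple model. This is a minimal sketch that assumes every turn adds about 500 tokens to the conversation (a hypothetical average) and that the full history is reprocessed on each message; `message_cost` and `session_cost` are illustrative names.

```python
TOKENS_PER_TURN = 500  # assumed average size of one turn (hypothetical)


def message_cost(n: int, tokens_per_turn: int = TOKENS_PER_TURN) -> int:
    """Tokens processed when sending message n: all earlier turns plus the new one."""
    return n * tokens_per_turn


def session_cost(n: int, tokens_per_turn: int = TOKENS_PER_TURN) -> int:
    """Total tokens processed across the first n messages of one conversation."""
    return sum(message_cost(i, tokens_per_turn) for i in range(1, n + 1))


print(message_cost(1))   # 500
print(message_cost(30))  # 15000
print(session_cost(30))  # 232500
```

Under these assumptions, message 30 costs 30 times as much as message 1, and the 30-message session as a whole processes 232,500 tokens, far more than 30 independent 500-token messages would.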
On top of conversation history, every message also reloads what's running in the background: system prompts, custom instructions, MCP servers, active skills and connectors. A single connected MCP server can consume approximately 18,000 tokens per message before you've typed a single word.
Run this command to see exactly what's eating your tokens:
This shows a full breakdown: conversation history, MCP overhead, loaded files, system prompts, and your current capacity percentage. Most people have never run this and have no idea what's actually consuming their budget.
AI companies often have an incentive to encourage "context window inflation" — long conversations, many connected tools, large file uploads. More tokens consumed means more revenue once subsidized pricing ends. Understanding this doesn't mean avoiding AI tools — it means using them strategically instead of casually.
The single most effective habit: start a new conversation for every unrelated task. Carrying context from one project into another is the fastest way to burn your limit. Use the /clear command or simply open a new chat.
Instead of sending three separate messages, combine your questions and instructions into one well-structured prompt. This alone can cut your token usage by 60% on complex tasks.
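To see why batching helps, here is a small sketch comparing three separate messages against one combined prompt. It assumes a hypothetical 10,000-token history and three 100-token questions, and it ignores the tokens in the model's replies, so the exact saving will differ in real sessions.

```python
def separate_cost(history: int, parts: list[int]) -> int:
    """Input tokens processed when each part is sent as its own message."""
    total = 0
    context = history
    for part in parts:
        context += part   # the new message joins the history
        total += context  # and the whole history is reread
    return total


def combined_cost(history: int, parts: list[int]) -> int:
    """Input tokens processed when all parts go out in a single message."""
    return history + sum(parts)


history, questions = 10_000, [100, 100, 100]  # hypothetical sizes
sep = separate_cost(history, questions)
comb = combined_cost(history, questions)
print(sep, comb)                        # 30600 10300
print(f"saving: {1 - comb / sep:.0%}")  # saving: 66%
```

The saving comes almost entirely from rereading the existing history once instead of three times, which is why the effect is largest late in a long conversation.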
Never upload raw PDFs or images if you only need the text. Raw PDFs can turn a 4,500-word document into over 100,000 tokens. Converting to Markdown first can reduce token weight by up to 20x.
Every connected MCP server reloads its tool definitions into your context with every message — whether you're using it or not.
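A back-of-the-envelope sketch of what that idle overhead adds up to, using the approximate 18,000-token per-server figure quoted earlier in this article. The function name is illustrative, and actual overhead depends on how many tools each server defines.

```python
MCP_OVERHEAD_TOKENS = 18_000  # approximate per-server, per-message figure from this article


def idle_overhead(servers: int, messages: int,
                  per_server: int = MCP_OVERHEAD_TOKENS) -> int:
    """Tokens spent on MCP tool definitions alone over a session."""
    return servers * messages * per_server


print(idle_overhead(servers=1, messages=30))  # 540000
print(idle_overhead(servers=3, messages=30))  # 1620000
```

By this estimate, disconnecting one unused server before a 30-message session would save roughly half a million tokens.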
Your claude.md file is reprocessed with every single message. Keep it under 500 lines — ideally under 200. Every word adds to your per-message cost permanently.
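One way to keep an eye on this is a small line-count check. The sketch below is a hypothetical helper, not part of any Claude tooling; it applies the 200-line and 500-line thresholds from this section and assumes the file sits at `claude.md` in the current directory.

```python
from pathlib import Path


def check_instruction_file(text: str, ideal: int = 200, limit: int = 500) -> str:
    """Classify an instruction file by line count against the two thresholds."""
    line_count = len(text.splitlines())
    if line_count < ideal:
        return "ok"
    if line_count < limit:
        return "consider trimming"
    return "too long"


path = Path("claude.md")  # assumed location; adjust to your project
if path.exists():
    print(check_instruction_file(path.read_text(encoding="utf-8")))
```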
Claude automatically compacts at 95% capacity — but by that point quality has already degraded. Run /compact manually when you hit 60%.
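The two thresholds can be expressed as a simple rule. This is a sketch only: the 200,000-token context window in the usage lines is a hypothetical value, and `compaction_advice` is an illustrative name rather than anything Claude exposes.

```python
def compaction_advice(used_tokens: int, context_window: int) -> str:
    """Suggest an action based on how full the context window is."""
    usage = used_tokens / context_window
    if usage >= 0.95:
        return "auto-compact imminent"
    if usage >= 0.60:
        return "run /compact now"
    return "keep going"


WINDOW = 200_000  # hypothetical context window size
print(compaction_advice(90_000, WINDOW))   # keep going
print(compaction_advice(130_000, WINDOW))  # run /compact now
print(compaction_advice(195_000, WINDOW))  # auto-compact imminent
```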
Add this to your claude.md: "Do not make any changes or write any code until you reach 95% confidence in what needs to be built."
Extended Thinking burns significantly more tokens. Turn it off for formatting, simple research, and content generation. Reserve it for genuinely difficult work.
| Task type | Recommended model | Why |
|---|---|---|
| Simple formatting, research sub-tasks | Haiku | Fast, cheap, handles simple tasks well |
| Standard coding, content creation | Sonnet | Best balance of quality and cost |
| Deep architectural planning | Opus | Reserve for tasks that genuinely need it — costs 5x more |
| Repetitive background tasks | OpenRouter free models | Qwen 2.5 Coder, DeepSeek, GLM 4.5 air — $0 cost |
openrouter/free
Schedule heavy builds for evenings and weekends. Saturday is already your power day, and this is another reason it's the right choice.
The TradingView MCP server is connected on CPU 2 for morning briefs. When I'm not running a brief, I disconnect it — an active MCP server burns ~18,000 tokens per message even when idle.
Saturday sessions get the most generous usage rates. The session reset trick: a small throwaway prompt Friday evening staggers the 5-hour window to reset during Saturday's heavy work.
NotebookLM is already in the workflow for research. Gather in NotebookLM, bring a clean summary into Claude for execution. Every article on this site starts that way.
Project Static and miniLABEL are complex builds. The 95% confidence rule means Claude won't generate code until requirements are fully clear — preventing wasted builds and token bleed.