Part 4: Practical Setup
4.1 Configuring GitHub Copilot for Token Efficiency
Step 1: Create copilot-instructions.md
Create .github/copilot-instructions.md in your repository root. This file is loaded on every Copilot interaction in the project.
Starter template (token-optimized):
Terse like caveman. Technical substance exact. Only fluff die.
Drop: articles, filler (just/really/basically), pleasantries, hedging.
Fragments OK. Short synonyms. Code unchanged.
Pattern: [thing] [action] [reason]. [next step].
ACTIVE EVERY RESPONSE. No revert after many turns. No filler drift.
Code/commits/PRs: normal. Off: "stop caveman" / "normal mode".
This is ~50 tokens. A natural-English equivalent would be 120+ tokens. You save 70+ tokens on every single interaction.
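You can sanity-check counts like these yourself. A minimal TypeScript sketch, assuming the common rough heuristic of ~4 characters per token (an approximation; use your model's real tokenizer for exact numbers):

import { readFileSync } from "node:fs";
// Rough heuristic: ~4 characters per token for English text and code.
// Approximation only; real tokenizers vary by model.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);
const instructions = readFileSync(".github/copilot-instructions.md", "utf8");
console.log(`~${estimateTokens(instructions)} tokens paid on every interaction`);

Run it whenever the file grows; it keeps the always-on cost visible.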
Step 2: Add Project-Specific Instructions (Compressed)
Add your project context in the same compressed style:
Stack: Node.js 20, TypeScript 5.4, PostgreSQL 16, Redis.
Test: Vitest. Lint: ESLint flat config.
Style: functional core, imperative shell. No classes.
Naming: camelCase vars/fns, PascalCase types, UPPER_SNAKE constants.
Errors: Result<T,E> pattern, no thrown exceptions in business logic.
vs. the natural-English version:
This project uses Node.js version 20 with TypeScript 5.4. We use PostgreSQL 16
as our primary database and Redis for caching. For testing, we use Vitest, and
for linting, we use ESLint with the new flat configuration format.
We follow a functional core, imperative shell architecture. Please don't use
classes. For variable and function naming, use camelCase. Types should be in
PascalCase, and constants should be in UPPER_SNAKE_CASE.
For error handling, we use the Result<T,E> pattern. Don't throw exceptions in
business logic code.
Both convey identical information. The compressed version is ~40 tokens. The verbose version is ~110 tokens. 64% savings, applied to every interaction.
Step 3: Choose Your Default Mode
In VS Code, Copilot Chat offers mode selection. Default strategy:
| Task Type | Mode | Why |
|---|---|---|
| Quick questions | Ask | Single LLM call, no tool overhead |
| Code explanations | Ask | No file modification needed |
| Bug diagnosis | Ask (usually) | You provide the context |
| Single-file changes | Edit | Targeted, minimal overhead |
| Multi-file refactors | Agent | Needs to read/write across files |
| New feature implementation | Agent | Multi-step creation |
| Issue-to-PR automation | Coding Agent | Full autonomous workflow |
Step 4: Select Models Strategically
GitHub Copilot pricing depends on model choice and billing mode. Pick the model whose cost matches the level of effort the task actually needs. For the pricing details and billing timeline, see Model Selection & Pricing.
| Model Tier | Relative Token Cost | Use For |
|---|---|---|
| Lightweight (GPT-4.1 mini, Haiku) | Lowest | Autocomplete, simple syntax, lookup-style questions |
| Standard (GPT-4.1, Sonnet) | Mid | Most coding tasks — implementation, refactors, fixes |
| High-effort (Claude Opus, o-series reasoning) | Highest | Architecture, deep reasoning, novel problem decomposition |
| Auto | Low by default | Default choice; Copilot picks from the supported Auto pool and applies the paid-plan discount where eligible |
Default to Auto. It reduces picker fatigue and, on eligible paid-plan usage, applies the discount documented by GitHub. Treat Auto as the default lane, not as automatic escalation to the most capable model. If you need a premium high-effort model, pin it manually. See Model Selection & Pricing.
Never burn a high-effort model on a "what's the syntax for X" question — you pay the higher token rate for an answer the cheapest model would have given you correctly.
Step 5: Mix Models by Task (Model Routing)
One useful cost lever: use different models for different subtasks within the same workflow. The detailed pricing context, historical multiplier references, plan availability, and official GitHub Docs links now live in Model Selection & Pricing. Keep this section focused on the practical routing habit.
The Model Mixing Strategy
Match the model to the cognitive demand of the task:
| Task Type | Recommended Model | Relative Cost | Why |
|---|---|---|---|
| "What does this function do?" | GPT-4.1 / GPT-5 mini | Included | Knowledge retrieval, no reasoning needed |
| "What's the syntax for X?" | GPT-4.1 / GPT-5 mini | Included | Memorized knowledge |
| Quick explanations, summaries | Claude Haiku 4.5 | 0.33x | Fast, cheap, good enough |
| Code review, linting suggestions | Claude Haiku 4.5 | 0.33x | Pattern matching, not deep reasoning |
| Implement a feature, fix a bug | Claude Sonnet 4.5 | 1x | SWE-bench suggests this is the practical default |
| Multi-file refactors | Claude Sonnet 4.5 | 1x | Matches Opus on real coding tasks |
| Architecture decisions, system design | Claude Opus 4.6 | 3x | Deep reasoning justifies the cost |
| Complex multi-step planning from spec | Claude Opus 4.6 | 3x | Novel problem decomposition |
| Security audits, threat modeling | Claude Opus 4.6 | 3x | Nuance and thoroughness matter |
Real-World Savings Example
Typical daily workflow (30 interactions), expressed in standard-tier-equivalent token cost:
| Without mixing (all Sonnet) | With mixing | Savings |
|---|---|---|
| 30 × 1x = 30 cost units | 10 × included + 8 × 0.33x + 10 × 1x + 2 × 3x = 18.6 cost units | 38% |
If you were using Opus for everything: 30 × 3x = 90 cost units. Mixing drops that to 18.6 — a 79% reduction in relative model cost.
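These numbers are easy to reproduce. A small TypeScript sketch (multipliers from the table above; the daily mix is the assumed example workload, not a measurement):

// Relative cost per call by tier ("included" models count as 0x here).
const costPerCall = { included: 0, haiku: 0.33, sonnet: 1, opus: 3 } as const;
// Assumed daily mix from the example: 30 interactions total.
const day: [keyof typeof costPerCall, number][] = [
  ["included", 10], // lookups, syntax questions
  ["haiku", 8],     // summaries, review passes
  ["sonnet", 10],   // implementation, bug fixes
  ["opus", 2],      // architecture decisions
];
const mixed = day.reduce((sum, [tier, n]) => sum + costPerCall[tier] * n, 0);
console.log(mixed.toFixed(1));                   // 18.6 cost units
console.log(Math.round((1 - mixed / 30) * 100)); // 38 (% saved vs. all-Sonnet)
console.log(Math.round((1 - mixed / 90) * 100)); // 79 (% saved vs. all-Opus)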
Auto Model Selection Should Be Your Default
Copilot's Auto mode chooses from the supported Auto-selection pool based on real-time system health and model performance. On paid plans, GitHub documents a 10% discount for eligible Auto usage in Copilot Chat. Treat Auto as the default low-friction lane; higher-cost premium models still need to be pinned manually.
Default to Auto and override only when needed. This is high-leverage for teams because it keeps day-to-day usage in the lower-cost lane. Pin a model only when you have a specific reason: you know the task is trivial (force the cheapest tier) or you know it needs deep reasoning (pin a premium model manually). For the exact tradeoffs, see Model Selection & Pricing.
The Anti-Pattern: High-Effort Models for Everything
A costly habit: defaulting to Opus or another high-effort model for every interaction. People do this because "better model = better results." For most day-to-day coding tasks, the extra model cost is hard to justify.
Reserve high-effort models for their actual strengths: novel reasoning, architectural judgment, and tasks where a 1-2% quality difference justifies a 3-5x cost increase.
Reasoning Effort: Another Lever
Beyond model selection, a second cost dial exists on reasoning-capable models: thinking effort (or reasoning effort). This controls how many tokens the model spends thinking before responding — affecting text, tool calls, and extended thinking all at once.
| Effort Level | Behavior | Anthropic's Recommended Use |
|---|---|---|
| `max` | No constraints on token spending | Deepest possible reasoning, thorough analysis |
| `high` (default) | Always thinks deeply | Complex reasoning, difficult coding, agentic tasks |
| `medium` | Moderate token savings, may skip thinking | Anthropic's recommended default for Sonnet 4.6: agentic coding, tool-heavy workflows, code generation |
| `low` | Significant token savings, skips thinking for simple tasks | High-volume, latency-sensitive, chat, simple classification |
Sources: Anthropic Effort Parameter Docs, April 2026; VS Code Language Models Docs, April 2026; GitHub Copilot CLI programmatic reference, April 2026.
Key facts from Anthropic's documentation:
- Effort affects everything, not just thinking tokens. Lower effort = shorter text responses, fewer tool calls, less preamble before acting. This is a broader lever than `budget_tokens`.
- Anthropic recommends `medium` as the default for Sonnet 4.6, not `high`. Their docs explicitly call medium the "best balance of speed, cost, and performance for most applications," including agentic coding.
- Exposed in Copilot on many reasoning-capable model families. In VS Code, thinking effort appears for reasoning models such as Claude Sonnet/Opus reasoning variants and GPT reasoning models when supported; non-reasoning models such as GPT-4.1 and GPT-4o do not show the control. In Copilot CLI, some models also support a `reasoning_effort` setting in config.
- Works without extended thinking enabled. You don't need to turn on a separate visible thinking mode to benefit; effort controls total token spend regardless.
- No published benchmarks. Anthropic provides qualitative guidance (the table above) but has not published specific numbers on quality-vs-effort tradeoffs. This is vendor-recommended, not independently benchmarked.
Copilot exposes this on many reasoning-capable models. In VS Code, select a reasoning model in the model picker, open its thinking-effort submenu, and choose the level; the reasoning-oriented families listed above support it, while non-reasoning models such as GPT-4.1 and GPT-4o do not show the submenu. In Copilot CLI, some models accept a `reasoning_effort` setting in config, with GitHub's docs showing gpt-5.3-codex as an example. The same lever is available directly in the Claude API and related tools. Sonnet at medium effort vs. Opus at high effort can still represent a 3-5x+ total cost difference for equivalent coding tasks.
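For CLI users, the config entry might look like the sketch below. This is hypothetical: the two field values follow the example GitHub's docs give, but the file layout is an assumption, so verify against the current Copilot CLI programmatic reference.

// Hypothetical sketch: keys per GitHub's documented example; structure assumed.
{
  "model": "gpt-5.3-codex",
  "reasoning_effort": "medium"
}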
Step 6: Retune Instructions for the Model You Actually Use
When you change models, do not assume the old prompt stack is still optimal. Provider prompting guides are version-specific and frequently explain behavior changes: verbosity, tool eagerness, structure preferences, reasoning effort, and stop conditions.
Fast workflow:
1. Paste the official guide URL into Copilot.
2. Name the target model and files.
3. Ask Copilot to adapt prompts/instructions while preserving behavior.
4. Review the diff; keep only concrete changes that reduce wrong turns.
Example:
Target model: Claude Sonnet 4.6.
Guide: https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices
Files: .github/copilot-instructions.md, .github/instructions/*.instructions.md
Adapt instructions for this model. Preserve repo behavior. Reduce rework. Keep terse.
See Workflow Optimization §2.5.5 for provider guide URLs and side-by-side Sonnet, GPT-5.5, and Gemini examples.
Step 7: Add Org Guardrails Outside the Prompt
Do not try to solve governance with prompt text alone. Prompt files shape behavior. Billing controls live elsewhere.
If you are guiding an organization or enterprise rollout, stop here and read Enterprise Governance. That chapter owns the admin guidance: AI-credit budgets, per-user tightening, model-access policy, org instructions, and separate-organization tradeoffs.
Use this page for practitioner setup. Use the enterprise chapter for customer governance decisions.
4.2 Keep Reusable Guidance Outside Always-On Context
This repo no longer ships installable workflow packs. Keep the same habit, though: put occasional workflow guidance outside the always-on prompt and pull it in only when the task needs it.
Good candidates:
- PR-review checklists
- release or rollback templates
- debugging playbooks
- migration notes for one subsystem
Store them wherever your team already keeps reusable prompts or operating notes. The token rule stays the same: if a rule is not needed on most interactions, do not keep paying for it on every interaction.
Optional for CLI-heavy users: CodeAct
If most of your long-running work happens in Copilot CLI, the optional external copilot-codeact-plugin is worth evaluating. It is not part of this repo and not a general Copilot Chat feature. The value proposition is workflow shape: collapse many grep / view / bash / MCP hops into one sandboxed execution so the full context and tool catalog replay fewer times. Use it for CLI-heavy sessions; skip it if your work is mostly IDE chat/edit or if you do not want an external plugin in the path.
MCPs vs. Skills: Eager vs. Lazy Context Loading
MCPs (Model Context Protocol servers) inject their full tool schema into context on every interaction — regardless of whether those tools are used. A server with 20 tools can add thousands of tokens to every request in your session.
Skills behave differently: only the title and description load upfront. The full skill content pulls on demand, only when the skill is actually relevant to the current task.
| Mechanism | What loads per turn | When full content loads |
|---|---|---|
| MCP | Complete tool schema (always) | N/A — always present |
| Skill | Title + description only | On demand, when invoked |
Rule: Use MCPs for capabilities needed on most interactions. Use skills for occasional capabilities — you pay the full schema cost per turn with MCPs, but only per invocation with skills. If a tool is used in 1 in 10 conversations, a skill is roughly 10× cheaper in context overhead.
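A back-of-the-envelope comparison makes the gap concrete. All token counts below are illustrative assumptions; measure your own server's schema size for real numbers:

// Illustrative session: 10 turns, tool actually needed once.
const mcpSchemaTokens = 2_000;  // full tool schema, resent on every turn
const skillHeaderTokens = 50;   // title + description, resent on every turn
const skillBodyTokens = 1_500;  // full skill content, loaded only when invoked
const turns = 10;
const invocations = 1;
const mcpOverhead = mcpSchemaTokens * turns;                                     // 20,000
const skillOverhead = skillHeaderTokens * turns + skillBodyTokens * invocations; //  2,000
console.log(mcpOverhead / skillOverhead); // 10: roughly 10x cheaper when rarely used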
4.3 GitHub Coding Agent Considerations
The Coding Agent runs autonomous sessions that can last minutes to hours. Token savings compound over those long sessions.
4.3.1 Compress copilot-instructions.md
The agent reads this file. A compressed instructions file saves tokens on every internal planning step — and the agent takes many steps per session.
4.3.2 Use copilot-setup-steps.yml
Pre-install dependencies deterministically:
# .github/workflows/copilot-setup-steps.yml
jobs:
  copilot-setup-steps:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Build
        run: npm run build
Without this, the agent discovers and installs dependencies through trial and error — each attempt costs LLM calls and tokens. Savings: 10-30% of total session tokens.
4.3.3 Write Precise Issue Descriptions
Vague issues cause the agent to explore the codebase extensively (reading many files = many tokens) and potentially misunderstand requirements (rework = more tokens).
Vague issue:
Login doesn't work sometimes. Please investigate and fix.
Precise issue:
Bug: login fails when email contains '+' character.
File: src/auth/login.ts, validateEmail() on L42.
Fix: URL-encode the email before passing to the OAuth provider.
Test: add case for "user+tag@example.com" in login.test.ts.
Savings: 20-50% of total session tokens by reducing exploration and rework.
4.3.4 Terse PR Comments and Commit Messages
The agent reads PR review comments and git history for context. Every verbose commit message or review comment costs tokens when the agent ingests them. Keep commit messages and review comments terse.
4.3.5 Custom Agent Profiles
Create focused instructions for different task types instead of one giant instruction file:
# For test-writing tasks
Stack: Vitest + Testing Library. AAA pattern.
Mock: external services only. No impl mocking.
Coverage: branch coverage ≥80%.
Focused agents carry less instruction overhead than a general-purpose instruction set.
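In VS Code, one way to scope a profile like this is a per-pattern instructions file such as .github/instructions/tests.instructions.md (file name illustrative) with an applyTo glob in its frontmatter:

---
applyTo: "**/*.test.ts"
---
Stack: Vitest + Testing Library. AAA pattern.
Mock: external services only. No impl mocking.
Coverage: branch coverage ≥80%.

The glob keeps the profile out of context for files it does not match, so unrelated tasks never pay for it.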
4.3.6 Compress Shell Command Output with RTK
The Coding Agent runs many shell commands per session — git diff, test runs, grep, ls. Each command's raw output flows back as input tokens on the next agent step. A large failing test suite or verbose git diff can return tens of thousands of tokens.
RTK (Rust Token Killer) is a CLI proxy that filters those outputs before they reach the agent. It runs the original command, removes noise (passing tests, unchanged diff lines, build artifacts), and returns a compressed result. The agent behavior is unchanged; it sees smaller, signal-focused output.
Setup for VS Code Copilot — per-repo:
brew install rtk # or: curl -fsSL https://raw.githubusercontent.com/rtk-ai/rtk/refs/heads/master/install.sh | sh
cd your-repo
rtk init --copilot
# Restart VS Code
RTK installs a PreToolUse hook into the current repository. Repeat per repo — there is no global VS Code Copilot install. Once active, the hook is transparent: your terminal is unchanged; only the agent's Bash tool calls are intercepted.
Commands with verbose output (test failures, large diffs) see the biggest reductions. Short-output commands see smaller gains. Actual savings depend on your project's output volume.
Combine with copilot-setup-steps.yml (§4.3.2) and precise issue descriptions (§4.3.3) for maximum session efficiency. Full setup, command list, and other AI tool support: MCP & Tool Costs §2.7.7.
4.4 Building the Habit
Start Small
- Week 1: Add compressed `copilot-instructions.md` to your main project. Use Ask Mode for simple questions
- Week 2: Practice caveman-lite in prompts. Drop filler words, be precise
- Week 3: Graduate to caveman-full. Drop articles, use fragments
- Week 4: Add "code only" to code generation prompts. Save reusable terse templates outside always-on context
Monthly Maintenance
- Review your `copilot-instructions.md`: has it grown? Compress it back down
- Check if any memory files have gotten verbose; compress them back down
- Audit which files are habitually open in your editor — close ones you're not working on (open tabs auto-feed context)
- (Business/Enterprise) Review repository / org Content Exclusion settings for new sensitive paths
- Check your model usage — are you pinning high-effort models for tasks Auto would route to a cheaper tier?
- Review budgets, user-level caps, and model policies before expanding premium access further
- When default model changes, retune prompts/instructions against that provider's current prompting guide
- Check token usage by user/team — are agents and power users driving outsized consumption? See Enterprise Governance
When to Adjust
| Signal | Action |
|---|---|
| Getting wrong results | Back off one compression level |
| Re-explaining frequently | Instructions may be too terse — add one clarifying line |
| Hitting rate limits | Apply more techniques from the matrix |
| New team member confused | Add full-English comments in code, keep instructions compressed |
| Long agent sessions failing | Check issue description precision, add copilot-setup-steps.yml |
4.5 Configuring Agent Mode for Efficiency
4.5.1 Agent Mode vs. Ask Mode vs. Edit Mode
Each mode has a fundamentally different token cost profile:
| Mode | LLM Calls per Action | Tool Use | Context Loaded | Best For |
|---|---|---|---|---|
| Ask | 1 | None | Conversation + instructions | Questions, explanations |
| Edit | 1-2 | File read/write | Target file + instructions | Single-file changes |
| Agent | 5-25 | Full toolset | Everything + tool schemas | Multi-step, multi-file tasks |
The cost multiplier: Agent mode costs 5-25x more than Ask mode for the same prompt. A simple question in Agent mode triggers file reads, tool evaluations, and multi-step reasoning — all unnecessary for "what does this function do?"
4.5.2 The Agent Mode Internal Loop
Understanding the loop helps you minimize steps:
Step 1: Load context
├── System prompt (~500 tokens)
├── copilot-instructions.md (~50-1500 tokens)
├── Tool definitions (~2,000-20,000 tokens)
├── Conversation history (growing)
└── YOUR prompt
→ Send to LLM → Get response
Step 2: LLM decides to call a tool
├── Tool call (function + params) → output tokens
├── Tool result → input tokens (next step)
└── Reasoning about result → output tokens
Step 3: Another tool call (or generate response)
├── ALL of Step 1's context reloaded
├── + Step 2's tool call and result
└── + growing conversation
→ Send to LLM again
... repeat 5-25 times
Key insight: Context grows with every step. Step 15 carries all the context from steps 1-14 plus the original prompt. This is why long agent sessions get expensive fast.
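The growth is roughly quadratic in step count. A small sketch with assumed, illustrative token figures shows why:

// Illustrative figures, not measurements: tune to your own session traces.
const baseContext = 8_000;   // system prompt + instructions + tool schemas
const perStepGrowth = 1_200; // avg tool call + result + reasoning added per step
// Input tokens resent at step n, and total input across a session:
const inputAtStep = (n: number) => baseContext + perStepGrowth * (n - 1);
const totalInput = (steps: number) =>
  baseContext * steps + (perStepGrowth * steps * (steps - 1)) / 2;
console.log(totalInput(5));  //  52,000 tokens for a 5-step session
console.log(totalInput(20)); // 388,000 for a 20-step session: ~7.5x, not 4x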
4.5.3 Minimizing Agent Steps
Every step avoided saves one full context reload. Techniques:
Precise prompts with acceptance criteria:
# Bad — agent will explore, read files, guess requirements
"Fix the user registration"
# Good — agent knows exactly what to do
"File: src/auth/register.ts L42.
Bug: email validation rejects valid '+' chars.
Fix: use RFC 5322 regex.
Test: add 'user+tag@example.com' case in register.test.ts.
Done when: test passes, no other tests break."
The precise version might complete in 3-5 steps. The vague one: 10-20 steps of exploration.
Plan files for complex tasks:
Create a plan before invoking agent mode:
# plan.md
1. Add `validateEmail()` to src/utils/validation.ts
2. Import and use in src/auth/register.ts L42
3. Add test cases in tests/auth/register.test.ts
4. Run `npm test` — expect all pass
Then prompt: "Execute plan.md." The agent follows the plan instead of discovering the path itself. Fewer exploration steps = fewer tokens.
Use CLI composition instead of agentic tool loops for deterministic operations:
Multi-step browser or data operations dispatched through an agent trigger one LLM call per step — and each step reloads the full accumulated context. A single generated CLI command executes the same work in one shell invocation:
# Browser automation: one LLM call generates this; one shell call runs it
npx playwright screenshot --wait-for-timeout=1000 https://example.com out.png
# Chained with filters — no agent loop needed
gh issue list --json number,title,labels | jq '.[] | select(.title | test("bug"; "i"))'
# Piped data transforms
cat logs/app.log | grep ERROR | awk '{print $1, $5}' | sort | uniq -c | sort -rn | head -20
CLI commands are composable, inspectable, re-runnable, and version-controllable. Changing a selector, a filter, or a URL means editing one line of text — not re-prompting an agent through another multi-step loop. Reserve agentic tool use for tasks that genuinely require dynamic decision-making; offload deterministic sequences to the shell.
4.5.4 VS Code Settings for Token Efficiency
Relevant settings that affect agent token usage:
{
// Limit maximum agent steps (default varies by model)
"github.copilot.chat.agent.maxTurns": 10,
// Use auto model selection — cheaper models for simple sub-tasks
"github.copilot.chat.agent.model": "auto"
}
maxTurns caps how many steps the agent takes. Lower = fewer tokens, but the agent might not finish complex tasks. Start at 10-15, increase only when needed.
4.5.5 Custom Instructions for Agent Efficiency
Add to .github/copilot-instructions.md:
Minimize tool calls. Read files only when necessary.
Batch related changes. Don't read-modify-read-modify when read-modify-modify works.
Prefer grep_search over sequential read_file for discovery.
These directives reduce unnecessary tool calls. Each skipped tool call saves 100-2,000+ tokens of tool input/output.
4.5.6 Decision Framework: When to Use Each Mode
Question about code/syntax/concept?
→ Ask Mode (1 call, ~500-2,000 tokens)
Change to a single file?
→ Edit Mode (1-2 calls, ~1,000-4,000 tokens)
Multi-file change with clear scope?
→ Agent Mode with precise prompt (~5-10 steps, ~15,000-50,000 tokens)
Vague "fix this" or "improve that"?
→ DON'T use Agent Mode yet. Clarify scope first in Ask Mode.
→ Then switch to Agent with precise prompt.
A costly pattern: Using Agent Mode for a vague prompt, watching it explore for 20 steps, then realizing it misunderstood and starting over. That can double token use without improving the result.
4.6 Admin Guardrails Live Elsewhere
This page is for practitioner setup. If you are making customer or enterprise rollout decisions, use Enterprise Governance instead.
That chapter owns:
- user-level AI-credit budgets
- heavy-usage monitoring
- model-access policy
- org-level custom instructions
- June 1 cutover guidance
Next: Enterprise Governance →