Part 4: Practical Setup
4.1 Configuring GitHub Copilot for Token Efficiency
Step 1: Create copilot-instructions.md
Create .github/copilot-instructions.md in your repository root. This file is loaded on every Copilot interaction in the project.
Starter template (token-optimized):
Terse like caveman. Technical substance exact. Only fluff die.
Drop: articles, filler (just/really/basically), pleasantries, hedging.
Fragments OK. Short synonyms. Code unchanged.
Pattern: [thing] [action] [reason]. [next step].
ACTIVE EVERY RESPONSE. No revert after many turns. No filler drift.
Code/commits/PRs: normal. Off: "stop caveman" / "normal mode".
This is ~50 tokens. A natural-English equivalent would be 120+ tokens. You save 70+ tokens on every single interaction.
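You can sanity-check counts like these yourself. A minimal TypeScript sketch, assuming the common rough heuristic of ~4 characters per token (an approximation; use your model's real tokenizer for exact numbers):

import { readFileSync } from "node:fs";
// Rough heuristic: ~4 characters per token for English text and code.
// Approximation only; real tokenizers vary by model.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);
const instructions = readFileSync(".github/copilot-instructions.md", "utf8");
console.log(`~${estimateTokens(instructions)} tokens paid on every interaction`);

Run it whenever the file grows; it keeps the always-on cost visible.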
Step 2: Add Project-Specific Instructions (Compressed)
Add your project context in the same compressed style:
Stack: Node.js 20, TypeScript 5.4, PostgreSQL 16, Redis.
Test: Vitest. Lint: ESLint flat config.
Style: functional core, imperative shell. No classes.
Naming: camelCase vars/fns, PascalCase types, UPPER_SNAKE constants.
Errors: Result<T,E> pattern, no thrown exceptions in business logic.
vs. the natural-English version:
This project uses Node.js version 20 with TypeScript 5.4. We use PostgreSQL 16
as our primary database and Redis for caching. For testing, we use Vitest, and
for linting, we use ESLint with the new flat configuration format.
We follow a functional core, imperative shell architecture. Please don't use
classes. For variable and function naming, use camelCase. Types should be in
PascalCase, and constants should be in UPPER_SNAKE_CASE.
For error handling, we use the Result<T,E> pattern. Don't throw exceptions in
business logic code.
Both convey identical information. The compressed version is ~40 tokens. The verbose version is ~110 tokens. 64% savings, applied to every interaction.
Step 3: Choose Your Default Mode
In VS Code, Copilot Chat offers mode selection. Default strategy:
| Task Type | Mode | Why |
|---|---|---|
| Quick questions | Ask | Single LLM call, no tool overhead |
| Code explanations | Ask | No file modification needed |
| Bug diagnosis | Ask (usually) | You provide the context |
| Single-file changes | Edit | Targeted, minimal overhead |
| Multi-file refactors | Agent | Needs to read/write across files |
| New feature implementation | Agent | Multi-step creation |
| Issue-to-PR automation | Coding Agent | Full autonomous workflow |
Step 4: Select Models Strategically
GitHub Copilot pricing depends on model choice and billing mode. Pick the model whose cost matches the level of effort the task actually needs. For the pricing details and billing timeline, see Model Selection & Pricing.
| Model Tier | Relative Token Cost | Use For |
|---|---|---|
| Lightweight (GPT-4.1 mini, Haiku) | Lowest | Autocomplete, simple syntax, lookup-style questions |
| Standard (GPT-4.1, Sonnet) | Mid | Most coding tasks — implementation, refactors, fixes |
| High-effort (Claude Opus, o-series reasoning) | Highest | Architecture, deep reasoning, novel problem decomposition |
| Auto | Low by default | Default choice; Copilot picks from the supported Auto pool and applies the paid-plan discount where eligible |
Default to Auto. It reduces picker fatigue and, on eligible paid-plan usage, applies the discount documented by GitHub. Treat Auto as the default lane, not as automatic escalation to the most capable model. If you need a premium high-effort model, pin it manually. See Model Selection & Pricing.
Never burn a high-effort model on a "what's the syntax for X" question — you pay the higher token rate for an answer the cheapest model would have given you correctly.
Step 5: Mix Models by Task (Model Routing)
One useful cost lever: use different models for different subtasks within the same workflow. The detailed pricing context, historical multiplier references, plan availability, and official GitHub Docs links now live in Model Selection & Pricing. Keep this section focused on the practical routing habit.
The Model Mixing Strategy
Match the model to the cognitive demand of the task:
| Task Type | Recommended Model | Relative Cost | Why |
|---|---|---|---|
| "What does this function do?" | GPT-4.1 / GPT-5 mini | Included | Knowledge retrieval, no reasoning needed |
| "What's the syntax for X?" | GPT-4.1 / GPT-5 mini | Included | Memorized knowledge |
| Quick explanations, summaries | Claude Haiku 4.5 | 0.33x | Fast, cheap, good enough |
| Code review, linting suggestions | Claude Haiku 4.5 | 0.33x | Pattern matching, not deep reasoning |
| Implement a feature, fix a bug | Claude Sonnet 4.5 | 1x | SWE-bench suggests this is the practical default |
| Multi-file refactors | Claude Sonnet 4.5 | 1x | Matches Opus on real coding tasks |
| Architecture decisions, system design | Claude Opus 4.6 | 3x | Deep reasoning justifies the cost |
| Complex multi-step planning from spec | Claude Opus 4.6 | 3x | Novel problem decomposition |
| Security audits, threat modeling | Claude Opus 4.6 | 3x | Nuance and thoroughness matter |
Real-World Savings Example
Typical daily workflow (30 interactions), expressed in standard-tier-equivalent token cost:
| Without mixing (all Sonnet) | With mixing | Savings |
|---|---|---|
| 30 × 1x = 30 cost units | 10 × included + 8 × 0.33x + 10 × 1x + 2 × 3x = 18.6 cost units | 38% |
If you were using Opus for everything: 30 × 3x = 90 cost units. Mixing drops that to 18.6 — a 79% reduction in relative model cost.
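These numbers are easy to reproduce. A small TypeScript sketch (multipliers from the table above; the daily mix is the assumed example workload, not a measurement):

// Relative cost per call by tier ("included" models count as 0x here).
const costPerCall = { included: 0, haiku: 0.33, sonnet: 1, opus: 3 } as const;
// Assumed daily mix from the example: 30 interactions total.
const day: [keyof typeof costPerCall, number][] = [
  ["included", 10], // lookups, syntax questions
  ["haiku", 8],     // summaries, review passes
  ["sonnet", 10],   // implementation, bug fixes
  ["opus", 2],      // architecture decisions
];
const mixed = day.reduce((sum, [tier, n]) => sum + costPerCall[tier] * n, 0);
console.log(mixed.toFixed(1));                   // 18.6 cost units
console.log(Math.round((1 - mixed / 30) * 100)); // 38 (% saved vs. all-Sonnet)
console.log(Math.round((1 - mixed / 90) * 100)); // 79 (% saved vs. all-Opus)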
Auto Model Selection Should Be Your Default
Copilot's Auto mode chooses from the supported Auto-selection pool based on real-time system health and model performance. On paid plans, GitHub documents a 10% discount for eligible Auto usage in Copilot Chat. Treat Auto as the default low-friction lane; higher-cost premium models still need to be pinned manually.
Default to Auto and override only when needed. This is high-leverage for teams because it keeps day-to-day usage in the lower-cost lane. Pin a model only when you have a specific reason: you know the task is trivial (force the cheapest tier) or you know it needs deep reasoning (pin a premium model manually). For the exact tradeoffs, see Model Selection & Pricing.
The Anti-Pattern: High-Effort Models for Everything
A costly habit: defaulting to Opus or another high-effort model for every interaction. People do this because "better model = better results." For most day-to-day coding tasks, the extra model cost is hard to justify.
Reserve high-effort models for their actual strengths: novel reasoning, architectural judgment, and tasks where a 1-2% quality difference justifies a 3-5x cost increase.
Reasoning Effort: Another Lever
Beyond model selection, a second cost dial exists on reasoning-capable models: thinking effort (or reasoning effort). This controls how many tokens the model spends thinking before responding — affecting text, tool calls, and extended thinking all at once.
| Effort Level | Behavior | Anthropic's Recommended Use |
|---|---|---|
| `max` | No constraints on token spending | Deepest possible reasoning, thorough analysis |
| `high` (default) | Always thinks deeply | Complex reasoning, difficult coding, agentic tasks |
| `medium` | Moderate token savings, may skip thinking | Anthropic's recommended default for Sonnet 4.6: agentic coding, tool-heavy workflows, code generation |
| `low` | Significant token savings, skips thinking for simple tasks | High-volume, latency-sensitive, chat, simple classification |
Sources: Anthropic Effort Parameter Docs, April 2026; VS Code Language Models Docs, April 2026; GitHub Copilot CLI programmatic reference, April 2026.
Key facts from Anthropic's documentation:
- Effort affects everything, not just thinking tokens. Lower effort = shorter text responses, fewer tool calls, less preamble before acting. This is a broader lever than `budget_tokens`.
- Anthropic recommends `medium` as the default for Sonnet 4.6, not `high`. Their docs explicitly call medium the "best balance of speed, cost, and performance for most applications," including agentic coding.
- Exposed in Copilot on many reasoning-capable model families. In VS Code, thinking effort appears for reasoning models such as Claude Sonnet/Opus reasoning variants and GPT reasoning models when supported; non-reasoning models such as GPT-4.1 and GPT-4o do not show the control. In Copilot CLI, some models also support a `reasoning_effort` setting in config.
- Works without extended thinking enabled. You don't need to turn on a separate visible thinking mode to benefit; effort controls total token spend regardless.
- No published benchmarks. Anthropic provides qualitative guidance (the table above) but has not published specific numbers on quality-vs-effort tradeoffs. This is vendor-recommended, not independently benchmarked.
Copilot exposes this on many reasoning-capable models. In VS Code, select a reasoning model in the model picker, open its thinking-effort submenu, and choose the level; the reasoning-oriented families listed above support it, while non-reasoning models such as GPT-4.1 and GPT-4o do not show the submenu. In Copilot CLI, some models accept a `reasoning_effort` setting in config, with GitHub's docs showing gpt-5.3-codex as an example. The same lever is available directly in the Claude API and related tools. Sonnet at medium effort vs. Opus at high effort can still represent a 3-5x+ total cost difference for equivalent coding tasks.
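For CLI users, the config entry might look like the sketch below. This is hypothetical: the two field values follow the example GitHub's docs give, but the file layout is an assumption, so verify against the current Copilot CLI programmatic reference.

// Hypothetical sketch: keys per GitHub's documented example; structure assumed.
{
  "model": "gpt-5.3-codex",
  "reasoning_effort": "medium"
}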
Step 6: Retune Instructions for the Model You Actually Use
When you change models, do not assume the old prompt stack is still optimal. Provider prompting guides are version-specific and frequently explain behavior changes: verbosity, tool eagerness, structure preferences, reasoning effort, and stop conditions.
Fast workflow:
1. Paste the official guide URL into Copilot.
2. Name the target model and files.
3. Ask Copilot to adapt prompts/instructions while preserving behavior.
4. Review the diff; keep only concrete changes that reduce wrong turns.
Example:
Target model: Claude Sonnet 4.6.
Guide: https://platform.claude.com/docs/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices
Files: .github/copilot-instructions.md, .github/instructions/*.instructions.md
Adapt instructions for this model. Preserve repo behavior. Reduce rework. Keep terse.
See Workflow Optimization §2.5.5 for provider guide URLs and side-by-side Sonnet, GPT-5.5, and Gemini examples.
Step 7: Add Org Guardrails Outside the Prompt
Do not try to solve governance with prompt text alone. Prompt files shape behavior. Billing controls live elsewhere.
If you are guiding an organization or enterprise rollout, stop here and read Enterprise Governance. That chapter owns the admin guidance: AI-credit budgets, per-user tightening, model-access policy, org instructions, and separate-organization tradeoffs.
Use this page for practitioner setup. Use the enterprise chapter for customer governance decisions.
4.2 Keep Reusable Guidance Outside Always-On Context
This repo no longer ships installable workflow packs. Keep the same habit, though: put occasional workflow guidance outside the always-on prompt and pull it in only when the task needs it.
Good candidates:
- PR-review checklists
- release or rollback templates
- debugging playbooks
- migration notes for one subsystem
Store them wherever your team already keeps reusable prompts or operating notes. The token rule stays the same: if a rule is not needed on most interactions, do not keep paying for it on every interaction.
Optional for CLI-heavy users: CodeAct
If most of your long-running work happens in Copilot CLI, the optional external copilot-codeact-plugin is worth evaluating. It is not part of this repo and not a general Copilot Chat feature. The value proposition is workflow shape: collapse many grep / view / bash / MCP hops into one sandboxed execution so the full context and tool catalog replay fewer times. Use it for CLI-heavy sessions; skip it if your work is mostly IDE chat/edit or if you do not want an external plugin in the path.
MCPs vs. Skills: Eager vs. Lazy Context Loading
MCPs (Model Context Protocol servers) inject their full tool schema into context on every interaction — regardless of whether those tools are used. A server with 20 tools can add thousands of tokens to every request in your session.
Skills behave differently: only the title and description load upfront. The full skill content pulls on demand, only when the skill is actually relevant to the current task.
| Mechanism | What loads per turn | When full content loads |
|---|---|---|
| MCP | Complete tool schema (always) | N/A — always present |
| Skill | Title + description only | On demand, when invoked |
Rule: Use MCPs for capabilities needed on most interactions. Use skills for occasional capabilities — you pay the full schema cost per turn with MCPs, but only per invocation with skills. If a tool is used in 1 in 10 conversations, a skill is roughly 10× cheaper in context overhead.
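A back-of-the-envelope comparison makes the gap concrete. All token counts below are illustrative assumptions; measure your own server's schema size for real numbers:

// Illustrative session: 10 turns, tool actually needed once.
const mcpSchemaTokens = 2_000;  // full tool schema, resent on every turn
const skillHeaderTokens = 50;   // title + description, resent on every turn
const skillBodyTokens = 1_500;  // full skill content, loaded only when invoked
const turns = 10;
const invocations = 1;
const mcpOverhead = mcpSchemaTokens * turns;                                     // 20,000
const skillOverhead = skillHeaderTokens * turns + skillBodyTokens * invocations; //  2,000
console.log(mcpOverhead / skillOverhead); // 10: roughly 10x cheaper when rarely used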
4.3 GitHub Coding Agent Considerations
The Coding Agent runs autonomous sessions that can last minutes to hours. Token savings compound over those long sessions.
4.3.1 Compress copilot-instructions.md
The agent reads this file. A compressed instructions file saves tokens on every internal planning step — and the agent takes many steps per session.
4.3.2 Use copilot-setup-steps.yml
Pre-install dependencies deterministically:
# .github/workflows/copilot-setup-steps.yml
jobs:
  copilot-setup-steps:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci
      - name: Build
        run: npm run build
Without this, the agent discovers and installs dependencies through trial and error — each attempt costs LLM calls and tokens. Savings: 10-30% of total session tokens.
4.3.3 Write Precise Issue Descriptions
Vague issues cause the agent to explore the codebase extensively (reading many files = many tokens) and potentially misunderstand requirements (rework = more tokens).
Vague issue:
Login doesn't work sometimes. Please investigate and fix.
Precise issue:
Bug: login fails when email contains '+' character.
File: src/auth/login.ts, validateEmail() on L42.
Fix: URL-encode the email before passing to the OAuth provider.
Test: add case for "user+tag@example.com" in login.test.ts.
Savings: 20-50% of total session tokens by reducing exploration and rework.
4.3.4 Terse PR Comments and Commit Messages
The agent reads PR review comments and git history for context. Every verbose commit message or review comment costs tokens when the agent ingests them. Keep commit messages and review comments terse.
4.3.5 Custom Agent Profiles
Create focused instructions for different task types instead of one giant instruction file:
# For test-writing tasks
Stack: Vitest + Testing Library. AAA pattern.
Mock: external services only. No impl mocking.
Coverage: branch coverage ≥80%.
Focused agents carry less instruction overhead than a general-purpose instruction set.
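In VS Code, one way to scope a profile like this is a per-pattern instructions file such as .github/instructions/tests.instructions.md (file name illustrative) with an applyTo glob in its frontmatter:

---
applyTo: "**/*.test.ts"
---
Stack: Vitest + Testing Library. AAA pattern.
Mock: external services only. No impl mocking.
Coverage: branch coverage ≥80%.

The glob keeps the profile out of context for files it does not match, so unrelated tasks never pay for it.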
4.3.6 Compress Shell Command Output with RTK
The Coding Agent runs many shell commands per session — git diff, test runs, grep, ls. Each command's raw output flows back as input tokens on the next agent step. A large failing test suite or verbose git diff can return tens of thousands of tokens.
RTK (Rust Token Killer) is a CLI proxy that filters those outputs before they reach the agent. It runs the original command, removes noise (passing tests, unchanged diff lines, build artifacts), and returns a compressed result. The agent behavior is unchanged; it sees smaller, signal-focused output.
Setup for VS Code Copilot — per-repo:
brew install rtk # or: curl -fsSL https://raw.githubusercontent.com/rtk-ai/rtk/refs/heads/master/install.sh | sh
cd your-repo
rtk init --copilot
# Restart VS Code
RTK installs a PreToolUse hook into the current repository. Repeat per repo — there is no global VS Code Copilot install. Once active, the hook is transparent: your terminal is unchanged; only the agent's Bash tool calls are intercepted.
Commands with verbose output (test failures, large diffs) see the biggest reductions. Short-output commands see smaller gains. Actual savings depend on your project's output volume.
Combine with copilot-setup-steps.yml (§4.3.2) and precise issue descriptions (§4.3.3) for maximum session efficiency. Full setup, command list, and other AI tool support: MCP & Tool Costs §2.7.7.
4.4 Building the Habit
Start Small
- Week 1: Add compressed `copilot-instructions.md` to your main project. Use Ask Mode for simple questions
- Week 2: Practice caveman-lite in prompts. Drop filler words, be precise
- Week 3: Graduate to caveman-full. Drop articles, use fragments
- Week 4: Add "code only" to code generation prompts. Save reusable terse templates outside always-on context
Monthly Maintenance
- Review your `copilot-instructions.md`: has it grown? Compress it back down
- Check if any memory files have gotten verbose; compress them back down
- Audit which files are habitually open in your editor — close ones you're not working on (open tabs auto-feed context)
- (Business/Enterprise) Review repository / org Content Exclusion settings for new sensitive paths
- Check your model usage — are you pinning high-effort models for tasks Auto would route to a cheaper tier?
- Review budgets, user-level caps, and model policies before expanding premium access further
- When default model changes, retune prompts/instructions against that provider's current prompting guide
- Check token usage by user/team — are agents and power users driving outsized consumption? See Enterprise Governance
When to Adjust
| Signal | Action |
|---|---|
| Getting wrong results | Back off one compression level |
| Re-explaining frequently | Instructions may be too terse — add one clarifying line |
| Hitting rate limits | Apply more techniques from the matrix |
| New team member confused | Add full-English comments in code, keep instructions compressed |
| Long agent sessions failing | Check issue description precision, add copilot-setup-steps.yml |
4.5 Configuring Agent Mode for Efficiency
4.5.1 Agent Mode vs. Ask Mode vs. Edit Mode
Each mode has a fundamentally different token cost profile:
| Mode | LLM Calls per Action | Tool Use | Context Loaded | Best For |
|---|---|---|---|---|
| Ask | 1 | None | Conversation + instructions | Questions, explanations |
| Edit | 1-2 | File read/write | Target file + instructions | Single-file changes |
| Agent | 5-25 | Full toolset | Everything + tool schemas | Multi-step, multi-file tasks |
The cost multiplier: Agent mode costs 5-25x more than Ask mode for the same prompt. A simple question in Agent mode triggers file reads, tool evaluations, and multi-step reasoning — all unnecessary for "what does this function do?"
4.5.2 The Agent Mode Internal Loop
Understanding the loop helps you minimize steps:
Step 1: Load context
├── System prompt (~500 tokens)
├── copilot-instructions.md (~50-1500 tokens)
├── Tool definitions (~2,000-20,000 tokens)
├── Conversation history (growing)
└── YOUR prompt
→ Send to LLM → Get response
Step 2: LLM decides to call a tool
├── Tool call (function + params) → output tokens
├── Tool result → input tokens (next step)
└── Reasoning about result → output tokens
Step 3: Another tool call (or generate response)
├── ALL of Step 1's context reloaded
├── + Step 2's tool call and result
└── + growing conversation
→ Send to LLM again
... repeat 5-25 times
Key insight: Context grows with every step. Step 15 carries all the context from steps 1-14 plus the original prompt. This is why long agent sessions get expensive fast.
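The growth is roughly quadratic in step count. A small sketch with assumed, illustrative token figures shows why:

// Illustrative figures, not measurements: tune to your own session traces.
const baseContext = 8_000;   // system prompt + instructions + tool schemas
const perStepGrowth = 1_200; // avg tool call + result + reasoning added per step
// Input tokens resent at step n, and total input across a session:
const inputAtStep = (n: number) => baseContext + perStepGrowth * (n - 1);
const totalInput = (steps: number) =>
  baseContext * steps + (perStepGrowth * steps * (steps - 1)) / 2;
console.log(totalInput(5));  //  52,000 tokens for a 5-step session
console.log(totalInput(20)); // 388,000 for a 20-step session: ~7.5x, not 4x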
4.5.3 Minimizing Agent Steps
Every step avoided saves one full context reload. Techniques:
Precise prompts with acceptance criteria:
# Bad — agent will explore, read files, guess requirements
"Fix the user registration"
# Good — agent knows exactly what to do
"File: src/auth/register.ts L42.
Bug: email validation rejects valid '+' chars.
Fix: use RFC 5322 regex.
Test: add 'user+tag@example.com' case in register.test.ts.
Done when: test passes, no other tests break."
The precise version might complete in 3-5 steps. The vague one: 10-20 steps of exploration.
Plan files for complex tasks:
Create a plan before invoking agent mode:
# plan.md
1. Add `validateEmail()` to src/utils/validation.ts
2. Import and use in src/auth/register.ts L42
3. Add test cases in tests/auth/register.test.ts
4. Run `npm test` — expect all pass
Then prompt: "Execute plan.md." The agent follows the plan instead of discovering the path itself. Fewer exploration steps = fewer tokens.
Use CLI composition instead of agentic tool loops for deterministic operations:
Multi-step browser or data operations dispatched through an agent trigger one LLM call per step — and each step reloads the full accumulated context. A single generated CLI command executes the same work in one shell invocation:
# Browser automation: one LLM call generates this; one shell call runs it
npx playwright screenshot --wait-for-timeout=1000 https://example.com out.png
# Chained with filters — no agent loop needed
gh issue list --json number,title,labels | jq '.[] | select(.title | test("bug"; "i"))'
# Piped data transforms
cat logs/app.log | grep ERROR | awk '{print $1, $5}' | sort | uniq -c | sort -rn | head -20
CLI commands are composable, inspectable, re-runnable, and version-controllable. Changing a selector, a filter, or a URL means editing one line of text — not re-prompting an agent through another multi-step loop. Reserve agentic tool use for tasks that genuinely require dynamic decision-making; offload deterministic sequences to the shell.
4.5.4 VS Code Settings for Token Efficiency
Relevant settings that affect agent token usage:
{
// Limit maximum agent steps (default varies by model)
"github.copilot.chat.agent.maxTurns": 10,
// Use auto model selection — cheaper models for simple sub-tasks
"github.copilot.chat.agent.model": "auto"
}
maxTurns caps how many steps the agent takes. Lower = fewer tokens, but the agent might not finish complex tasks. Start at 10-15, increase only when needed.
4.5.5 Custom Instructions for Agent Efficiency
Add to .github/copilot-instructions.md:
Minimize tool calls. Read files only when necessary.
Batch related changes. Don't read-modify-read-modify when read-modify-modify works.
Prefer grep_search over sequential read_file for discovery.
These directives reduce unnecessary tool calls. Each skipped tool call saves 100-2,000+ tokens of tool input/output.
4.5.6 Decision Framework: When to Use Each Mode
Question about code/syntax/concept?
→ Ask Mode (1 call, ~500-2,000 tokens)
Change to a single file?
→ Edit Mode (1-2 calls, ~1,000-4,000 tokens)
Multi-file change with clear scope?
→ Agent Mode with precise prompt (~5-10 steps, ~15,000-50,000 tokens)
Vague "fix this" or "improve that"?
→ DON'T use Agent Mode yet. Clarify scope first in Ask Mode.
→ Then switch to Agent with precise prompt.
A costly pattern: Using Agent Mode for a vague prompt, watching it explore for 20 steps, then realizing it misunderstood and starting over. That can double token use without improving the result.
4.6 Admin Guardrails Live Elsewhere
This page is for practitioner setup. If you are making customer or enterprise rollout decisions, use Enterprise Governance instead.
That chapter owns:
- user-level AI-credit budgets
- heavy-usage monitoring
- model-access policy
- org-level custom instructions
- June 1 cutover guidance
Next: Enterprise Governance →