Coach Guide — Advanced: Tracing & Observability
Coach-only. Do not share with students. This guide holds the verified answers, the pitfalls teams hit, and the facilitation arc. The student README.md deliberately stops at the gotcha box.
What this challenge proves
By the end a team can take one student question and account for its full execution: model span, retrieval span, optional tool span, with tokens, latency, and an estimated cost — read two ways (portal Tracing tab + KQL). The pedagogy is deliberately FrontierWeekHack’s “same data, two lenses” pattern: the portal teaches the shape of a trace, KQL teaches querying it.
The challenge assumes the Foundations end-state (a deployed, grounded Northfield IQ Assistant) or the bootstrap skip-path. If a team can’t get a grounded answer at all, that’s a Foundations problem — send them to validate-foundations.py before debugging tracing.
The one thing that makes or breaks this challenge
Set env before import. 80% of failures are here. The two flags — AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING=true and OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true — are read at SDK import time. If a student imports azure.ai.projects (directly or transitively via another module) before setting them, instrumentation initializes without message capture and the spans show up with empty prompt/response fields, or GenAI spans don’t appear at all.
Tell-tale signs and the fix:
| Symptom | Cause | Fix |
|---|---|---|
| Spans exist, prompt/answer fields blank | flags set after import | move os.environ[...] to the very top, above all azure.ai.* imports |
| No GenAI spans at all | configure_azure_monitor never called, or wrong conn string | confirm App Insights conn string resolves; check enable_tracing() actually ran |
| traced_run.py blank fields but trace_setup.py fine | another import at top of traced_run.py pulled the SDK in first | ensure from trace_setup import enable_tracing is the first project import |
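A minimal sketch of the ordering rule, using only the standard library. The helper name set_tracing_flags is hypothetical (it is not part of any SDK): it sets both flags and reports whether azure.ai.projects was already loaded, which is the condition that tends to produce blank prompt/answer fields.

```python
import os
import sys

# The two flags named above; both must be set before the SDK is imported.
TRACING_FLAGS = {
    "AZURE_EXPERIMENTAL_ENABLE_GENAI_TRACING": "true",
    "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT": "true",
}
SDK_MODULE = "azure.ai.projects"


def set_tracing_flags(environ=None, modules=None):
    """Set both flags and return True if the SDK was imported too early.

    Hypothetical helper for illustration: `environ` and `modules` default to
    os.environ and sys.modules, and can be swapped for plain dicts in tests.
    """
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules
    imported_too_early = any(
        name == SDK_MODULE or name.startswith(SDK_MODULE + ".")
        for name in list(modules)
    )
    for key, value in TRACING_FLAGS.items():
        environ[key] = value
    return imported_too_early


if __name__ == "__main__":
    if set_tracing_flags():
        print("azure.ai.projects was imported before the flags were set")
    # Only import azure.ai.* below this line.
```

If the helper returns True in a student's script, walk the imports above the call: one of them pulled the SDK in first.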
Step-by-step coaching
Step 1 — Enable instrumentation
- App Insights connection string: Foundations writes APPLICATIONINSIGHTS_CONNECTION_STRING to .env. If a team is on the bootstrap path and it’s missing, the SDK call project.telemetry.get_application_insights_connection_string() resolves it (shown in the README). Portal path: project → Monitoring → Application analytics → copy connection string.
- SDK import note: the verified instrumentor import in the current stack is from azure.ai.projects.telemetry import AIProjectInstrumentor, then .instrument(). Some teams may see tutorials using project.telemetry.enable(); that’s the convenience wrapper around the same instrumentor and is equally acceptable. Either passes the checkpoint as long as message-content capture is on.
- If a team hits an ImportError on azure.ai.projects.telemetry, they’re on an old pinned version; have them reinstall requirements.txt (azure-ai-projects>=2.1.0).
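Before debugging the SDK, it can help to rule out a missing or malformed connection string. The sketch below is a shape check only, with hypothetical helper names. It assumes .env has already been loaded into the process environment, and it cannot tell whether the string points at the right App Insights resource.

```python
import os


def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value' into a dict, ignoring empty segments."""
    parts = {}
    for segment in conn_str.split(";"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key.strip()] = value.strip()
    return parts


def connection_string_problem(environ=None):
    """Return a short description of what is wrong, or None if it looks usable."""
    environ = os.environ if environ is None else environ
    conn_str = environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING", "")
    if not conn_str:
        return "APPLICATIONINSIGHTS_CONNECTION_STRING is not set"
    if not parse_connection_string(conn_str).get("InstrumentationKey"):
        return "connection string has no InstrumentationKey segment"
    return None


if __name__ == "__main__":
    print(connection_string_problem() or "connection string looks well-formed")
```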
Step 2 — Emit spans
- The grounded question is intentional: it forces a retrieval span so the trace has more than a bare model call. If a team’s question doesn’t trigger the knowledge base, the span tree is thin; steer them to a question that’s clearly answerable only from the FAQ corpus (financial aid docs, housing deadlines, registration holds).
- Latency expectation: spans take 1–3 min to land. Teams will re-run thinking it failed and create duplicate traces. Tell them to wait, then refresh. The response.id is the anchor for finding it.
- Auth errors here are almost always DefaultAzureCredential (not logged in / wrong subscription), not tracing. Have them run az account show first.
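For teams tempted to re-run, a wait-and-poll loop can stand in for "wait, then refresh". This is a sketch under assumptions: wait_for_trace is a hypothetical helper, and check is a caller-supplied function (for example, one that queries App Insights for the run's response.id) that this guide does not define. The 180-second default mirrors the upper end of the 1–3 min window.

```python
import time


def wait_for_trace(check, timeout_s=180, interval_s=15, sleep=time.sleep):
    """Call `check()` until it returns True or `timeout_s` has elapsed.

    Returns the number of seconds waited, or None if the trace never appeared.
    `sleep` is injectable so the loop can be exercised without real delays.
    """
    waited = 0
    while True:
        if check():
            return waited
        if waited >= timeout_s:
            return None
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    # Stand-in check that succeeds on the third poll; replace with a real query.
    answers = iter([False, False, True])
    print(wait_for_trace(lambda: next(answers), sleep=lambda _: None))
```

Polling one run until it lands avoids the duplicate traces that come from re-running the script.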
Step 3 — Portal Tracing tab
- The span hierarchy: a parent agent/response span, with child model (gen_ai attributes) and retrieval spans; a tool span appears only if they attached the Action Tools MCP tool.
- Token attributes to point at: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens. Latency = the span duration.
- This step’s checkpoint is portal-state; validate.py --step 3 confirms via App Insights that a multi-span trace exists for a recent run (it can’t read the portal UI directly).
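To help a team see the same hierarchy outside the portal, the flat rows exported from App Insights can be folded back into a tree. The sketch below assumes each row is a dict carrying the id and operation_ParentId columns of the requests and dependencies tables plus a name; the function name and the sample span names are hypothetical.

```python
from collections import defaultdict


def render_span_tree(rows):
    """Return indented lines, one per span, with children under their parent.

    A row whose parent id is not among the rows is treated as a root.
    """
    ids = {row["id"] for row in rows}
    children = defaultdict(list)
    roots = []
    for row in rows:
        parent = row.get("operation_ParentId")
        if parent in ids and parent != row["id"]:
            children[parent].append(row)
        else:
            roots.append(row)

    lines = []

    def walk(row, depth):
        lines.append("  " * depth + row["name"])
        for child in children[row["id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


if __name__ == "__main__":
    # Hypothetical rows for one run: a parent span with two children.
    sample = [
        {"id": "a", "operation_ParentId": "op-1", "name": "agent response"},
        {"id": "b", "operation_ParentId": "a", "name": "model call"},
        {"id": "c", "operation_ParentId": "a", "name": "retrieval"},
    ]
    print("\n".join(render_span_tree(sample)))
```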
Step 4 — KQL correlation (the answers)
The README ships three queries as scaffolding. The graded artifact is correlate.kql — a working end-to-end correlation for one operation_Id. A complete, correct answer looks like:
// correlate.kql — end-to-end trace for one student question, with token/latency/cost rollup
let opId = "abc123def456..."; // the run's operation_Id
let price_per_1k = 0.005; // model $ / 1K tokens
let spans =
  union dependencies, requests, traces
  | where operation_Id == opId;
spans
| project timestamp, itemType, span = name, duration_ms = duration,
    input_tokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    output_tokens = toint(customDimensions["gen_ai.usage.output_tokens"]),
    total_tokens = toint(customDimensions["gen_ai.usage.total_tokens"])
| order by timestamp asc;
// rollup
spans
| summarize total_tokens = sum(toint(customDimensions["gen_ai.usage.total_tokens"])),
    total_latency_ms = sum(duration),
    span_count = count()
| extend est_cost_usd = round(total_tokens / 1000.0 * price_per_1k, 6)
Coaching notes:
- dependencies vs traces vs requests: OTel spans map to dependencies (outbound calls like model/retrieval) and requests (the inbound agent invocation); traces holds log events. The union is what gives the full picture; teams that query only dependencies miss the parent request.
- customDimensions keys vary slightly by SDK version. If gen_ai.usage.total_tokens is null, have them run dependencies | where operation_Id == opId | project customDimensions and read the actual key names; don’t let them assume.
- Cost is a calculation, not a lookup. Any sane per-1K rate passes; the learning objective is deriving cost from token telemetry, not knowing the exact price.
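The last two notes can be made concrete with a small Python sketch. Both function names are hypothetical, and the sample customDimensions values are invented for illustration. The first function finds token counters by pattern instead of assuming exact key names; the second repeats the tokens / 1000 × rate arithmetic of a per-1K cost estimate.

```python
def find_token_keys(custom_dimensions):
    """Return the keys that look like token counters, whatever their exact name."""
    return sorted(
        key for key in custom_dimensions
        if "usage" in key.lower() and "token" in key.lower()
    )


def estimate_cost_usd(total_tokens, price_per_1k):
    """Tokens / 1000 * per-1K rate, rounded to 6 decimal places."""
    return round(total_tokens / 1000.0 * price_per_1k, 6)


if __name__ == "__main__":
    # Invented sample of one span's customDimensions (values arrive as strings).
    dims = {
        "gen_ai.usage.input_tokens": "900",
        "gen_ai.usage.output_tokens": "300",
        "gen_ai.system": "placeholder",
    }
    keys = find_token_keys(dims)
    total = sum(int(dims[key]) for key in keys)
    print(keys)
    print(total, estimate_cost_usd(total, 0.005))  # 1200 tokens -> 0.006
```

With 1,200 tokens at $0.005 per 1K, the estimate is 1.2 × 0.005 = $0.006. Any reasonable rate shows the same derivation.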
Timing (60 min)
- 0–15 min: Step 1 instrumentation (most of the budget goes to the env-before-import gotcha).
- 15–30 min: Step 2 run + the 1–3 min export wait.
- 30–45 min: Step 3 portal span tree.
- 45–60 min: Step 4 KQL correlation + save correlate.kql.
If time is tight, prioritize Steps 1–2 (instrumentation working) and the starter KQL query over the full cost rollup.
Expected questions
- “Why are my prompt/answer fields empty?” → set-env-before-import. Walk the file top-down.
- “Nothing shows in App Insights.” → wait 1–3 min; verify the conn string; verify configure_azure_monitor ran. Check they didn’t point at a different App Insights resource.
- “Which table has the tokens?” → dependencies, in customDimensions["gen_ai.usage.*"].
- “Do I need the tool span?” → no; it only exists if they did Action Tools. The challenge is complete with model + retrieval spans.
Success definition
A team is done when validate.py --step 4 passes, correlate.kql returns all spans for one run, and they can verbally walk the trace from question → model → retrieval → answer with tokens and latency.