Coach Guide · Extra — Give It a Voice (Voice Live API)
Coach-only. Highest wow-per-effort Extra and the best demo-day closer, but it has two hard dependencies that fail silently: regional API access and a working mic/speaker on the client. Verify both before anyone starts.
What this challenge is really teaching
That the agent and the interface are separable. Students have only ever typed; Voice Live shows the same grounded agent behind a real-time speech channel. The conceptual win is full-duplex streaming (incremental audio in and out) versus the naive STT → agent → TTS pipeline — that’s where the sub-second latency and barge-in come from. Don’t let teams build three separate calls; Voice Live is one session.
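For coaches who want a whiteboard-sized illustration of "one session, both directions at once", here is a minimal sketch using only asyncio. It does not call Voice Live: `fake_service`, `send_mic`, `play_audio`, and `run_session` are hypothetical stand-ins, and the queues simulate the two directions of a single session.

```python
import asyncio


async def send_mic(uplink: asyncio.Queue, frames: list[bytes]) -> None:
    """Uplink: push mic frames as they are captured (simulated here)."""
    for frame in frames:
        await uplink.put(frame)
        await asyncio.sleep(0)  # yield so the other tasks run between frames
    await uplink.put(None)  # end-of-input marker


async def fake_service(uplink: asyncio.Queue, downlink: asyncio.Queue, log: list) -> None:
    """Hypothetical stand-in for the remote session: answers per frame."""
    while (frame := await uplink.get()) is not None:
        log.append(("in", frame))
        await downlink.put(b"reply-to-" + frame)
    await downlink.put(None)  # end-of-output marker


async def play_audio(downlink: asyncio.Queue, log: list) -> None:
    """Downlink: 'play' each chunk as soon as it arrives."""
    while (chunk := await downlink.get()) is not None:
        log.append(("out", chunk))


async def run_session(frames: list[bytes]) -> list:
    """Run uplink and downlink concurrently inside one simulated session."""
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    log: list = []
    await asyncio.gather(
        send_mic(uplink, frames),
        fake_service(uplink, downlink, log),
        play_audio(downlink, log),
    )
    return log


if __name__ == "__main__":
    for direction, payload in asyncio.run(run_session([b"a", b"b", b"c"])):
        print(direction, payload)
```

The point to draw out is that output events appear in the log before the input has finished, which a sequential STT → agent → TTS chain cannot do.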
Infra to pre-provision (do this BEFORE the session)
- Voice Live API access on an Azure AI Speech / Foundry resource in a supported region. Confirm availability for your event subscription/region weeks ahead; it’s newer and not available everywhere.
- Keyless RBAC for the project identity (or a key as workshop fallback) on the Speech/Voice resource.
- Client hardware: each team needs a laptop with a working mic + speakers (a headset is better: it avoids feedback/echo in a noisy room). This is the #1 silent failure.
- Confirm azure-ai-voicelive installs cleanly in the devcontainer/Codespace (it’s in the extras pins).
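A quick way to confirm the package check inside the devcontainer is to ask the interpreter for the installed distribution's version. This sketch uses only the standard library; `installed_version` is a hypothetical helper name, and it checks that the distribution is installed, not that Voice Live is reachable from your region.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("azure-ai-voicelive")
    if version is None:
        print("azure-ai-voicelive is NOT installed in this environment")
    else:
        print(f"azure-ai-voicelive {version} is installed")
```

Run it once in a fresh Codespace before the session; a missing package is much cheaper to fix then than during Step 1.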
Flag for the coordinator: if the venue is loud, headsets or a quiet breakout corner make or break the demo. Echo cancellation only goes so far.
Search-Before-Implement
azure-ai-voicelive connect/session signatures are new and changing. Send teams to microsoft-docs for the current connect call and event names before coding. The event names in Step 2 (session.created, response audio deltas, response-done) are illustrative — confirm the live ones.
Per-step facilitation
Step 1 — connect
- Pitfall: binding to a bare model instead of the agent → spoken answers lose persona/grounding. They must pass AZURE_FOUNDRY_AGENT_NAME so turns run through the Northfield agent.
- Auth errors here are usually region/access, not code. Check the resource supports Voice Live.
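One way to keep teams from silently falling back to a bare model is to fail fast when the agent name is missing. The sketch below only reads the environment variable named above; `require_agent_name` is a hypothetical helper, and how the value is then handed to the session depends on the current SDK signature, which teams should look up.

```python
import os
from typing import Mapping, Optional


def require_agent_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured agent name, or raise before any session is opened."""
    env = os.environ if env is None else env
    name = env.get("AZURE_FOUNDRY_AGENT_NAME", "").strip()
    if not name:
        raise RuntimeError(
            "AZURE_FOUNDRY_AGENT_NAME is not set; refusing to connect to a bare "
            "model because spoken answers would lose persona and grounding."
        )
    return name


if __name__ == "__main__":
    try:
        print("Binding voice session to agent:", require_agent_name())
    except RuntimeError as err:
        print("Pre-connect check failed:", err)
```

Calling this at the top of the connect step turns a subtle "answers ignore the corpus" symptom into an immediate, readable error.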
Step 2 — duplex loop
- Pitfall: buffering the entire response before playback → kills the latency demo. Play audio deltas as they stream. If it sounds laggy, this is why.
- Mic permissions on macOS/Linux trip people up. Confirm the OS granted the terminal/IDE mic access.
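The buffering pitfall is easy to show side by side. In this sketch `audio_deltas` is a hypothetical async generator standing in for the streamed audio events (the real event names must come from the current docs), and `play` is any callable that hands bytes to the speaker.

```python
import asyncio
from typing import AsyncIterator, Callable


async def audio_deltas(count: int, delay: float = 0.01) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for streamed audio chunks arriving over time."""
    for i in range(count):
        await asyncio.sleep(delay)
        yield bytes([i])


async def play_buffered(deltas: AsyncIterator[bytes], play: Callable[[bytes], None]) -> None:
    """The pitfall: nothing is played until the whole response has arrived."""
    chunks = [chunk async for chunk in deltas]
    play(b"".join(chunks))


async def play_streaming(deltas: AsyncIterator[bytes], play: Callable[[bytes], None]) -> None:
    """The fix: hand each chunk to the speaker as soon as it arrives."""
    async for chunk in deltas:
        play(chunk)


if __name__ == "__main__":
    buffered_calls, streaming_calls = [], []
    asyncio.run(play_buffered(audio_deltas(5), buffered_calls.append))
    asyncio.run(play_streaming(audio_deltas(5), streaming_calls.append))
    print("buffered play() calls:", len(buffered_calls))
    print("streaming play() calls:", len(streaming_calls))
```

With the buffered version the first sound waits for the last chunk, so perceived latency grows with answer length; with the streaming version it stays close to the arrival time of the first chunk.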
Step 3 — natural conversation
- VAD/turn-detection and barge-in are the “feels real” features. If a team is short on time, VAD is the priority; barge-in is the flourish.
- On the Step 4 grounded agent, confirm the spoken answer is actually grounded (cite-worthy content), proving the voice path didn’t bypass the knowledge base.
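If a team asks what barge-in means on the client side, the core idea is "stop playback the moment the user starts speaking". The sketch below simulates that with asyncio only: the timer stands in for a speech-started signal from turn detection, and `play_response` and `run_turn` are hypothetical names, not SDK calls.

```python
import asyncio


async def play_response(chunks, played):
    """Simulated speaker: 'plays' one chunk every 10 ms and records it."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.01)


async def run_turn(chunks, user_speaks_after):
    """Play a response, cancelling playback if the user starts speaking."""
    played = []
    speech_started = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Simulated turn detection: the "user" starts talking after a delay.
    timer = loop.call_later(user_speaks_after, speech_started.set)

    playback = asyncio.create_task(play_response(chunks, played))
    barge_in = asyncio.create_task(speech_started.wait())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )

    interrupted = playback not in done
    # Cancel whichever task is still pending: playback on barge-in,
    # otherwise the no-longer-needed wait for speech.
    (playback if interrupted else barge_in).cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    timer.cancel()
    return played, interrupted


if __name__ == "__main__":
    played, interrupted = asyncio.run(run_turn(list(range(50)), 0.03))
    print(f"interrupted={interrupted}, chunks played={len(played)}")
```

In a real session the client would also need to tell the service to stop generating the interrupted response; how that is done is SDK-specific and should be confirmed in the current docs.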
Why no validate.py
The deliverable is audio — inherently a live/portal demo. Verify by watching/hearing a spoken multi-turn conversation with one barge-in. A short screen+audio recording is the artifact for the readout.
Common failure modes
| Symptom | Cause | Fix |
|---|---|---|
| No audio out | playback waits for full response | play streamed audio deltas incrementally |
| Auth/region error on connect | Voice Live not available in region | use a supported region / resource |
| Echo / feedback loop | open speakers + mic | use a headset; enable echo cancellation |
| Answers ignore the corpus | bound to model not agent | bind session to AZURE_FOUNDRY_AGENT_NAME |
| Push-to-talk feels clunky | VAD not enabled | enable server-side turn detection |