Extra · Give It a Voice — Voice Live API
Tier 2 · Extra (modular). You can attempt this in any order relative to the other Extras. Prerequisite: the Foundations end-state (a deployed, grounded Northfield IQ Assistant). Complete Foundations, or run the bootstrap skip-path:
azd up && ./scripts/setup-foundations.sh && python scripts/validate-foundations.py
Specific prereq: Foundations Step 3 (a working agent). This challenge works on the Step 3 agent, but it is better on the Step 4 grounded agent, where spoken answers gain citations.
⚙️ Infra prerequisite (coach must pre-provision): Voice Live API access (Azure AI Speech / Foundry voice) in a supported region, and a microphone-capable client machine (laptop mic + speakers, or headset). See solution.md → Infra to pre-provision. Confirm regional availability before the event.
🎤 Demo wow-factor: a literal talking campus assistant — speak a question, hear it answer in a natural voice with sub-second latency. The single strongest crowd demo of the event.
Why this challenge
Every challenge so far has been typed. But a student walking across campus doesn’t want to type — they want to ask. The Voice Live API turns your text assistant into a spoken one: it streams mic audio in, runs your agent, and streams synthesized speech back out, all over a single low-latency WebSocket. No stitching together separate speech-to-text, agent, and text-to-speech calls — Voice Live orchestrates the full duplex loop for you.
mic --> Voice Live (STT) --> Northfield IQ Assistant --> Voice Live (TTS) --> speaker
\---------------------- one low-latency streaming session -------------------/
Step 1 — Connect a Voice Live session to your agent
Goal: Open a Voice Live session bound to your Northfield agent and confirm the handshake.
Tasks:
- Install the client SDK (pip install azure-ai-voicelive) and confirm mic + speaker access on your machine.
- Using the azure-aispeech skill pattern, open a Voice Live session against your Foundry endpoint. Search before you implement: query microsoft-docs for the current azure-ai-voicelive connect signature (this API is new and moves).
- Bind the session to your existing agent (AZURE_FOUNDRY_AGENT_NAME) so spoken turns run through your grounded agent, not a generic model. Configure a voice (e.g. a neural voice) and the audio formats.
Env you’ll use (authoritative names): AZURE_AI_PROJECT_ENDPOINT, AZURE_FOUNDRY_AGENT_NAME, AZURE_AI_MODEL_DEPLOYMENT_NAME.
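Before opening a session, it can help to fail fast when one of these variables is missing. The following is a minimal sketch using only the standard library; the helper name load_voice_settings is hypothetical, and it does not call the Voice Live SDK at all.

```python
import os

# Variable names are the authoritative ones listed in this challenge.
REQUIRED_VARS = (
    "AZURE_AI_PROJECT_ENDPOINT",
    "AZURE_FOUNDRY_AGENT_NAME",
    "AZURE_AI_MODEL_DEPLOYMENT_NAME",
)


def load_voice_settings(environ=None):
    """Return the required settings as a dict.

    Raises RuntimeError naming every missing or blank variable, so a
    misconfigured machine is caught before any network call is made.
    `environ` defaults to os.environ; pass a dict to test without touching it.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name].strip() for name in REQUIRED_VARS}
```

Call load_voice_settings() once at client startup and pass the resulting values to whatever connect call the current azure-ai-voicelive documentation describes.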
Success Criteria:
- The client establishes a Voice Live session without auth errors.
- The session is bound to your Northfield agent (not a bare model).
Checkpoint: Console state — the client prints session.created (or equivalent) and a chosen voice id.
Step 2 — Speak in, hear out (the full duplex loop)
Goal: Ask a question out loud and hear the assistant answer.
Tasks:
- Stream microphone audio into the session and handle the streamed audio response, playing it back on your speakers.
- Handle the core session events: input audio started/stopped, response audio deltas, and response-done. Play audio deltas as they arrive (don’t wait for the full response; that’s the latency win).
- Ask: “When does fall registration open?” and listen to the spoken answer.
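The event handling in these tasks can be sketched as a small dispatch loop. This is an illustration under stated assumptions, not the SDK's API: the event type strings and the base64-encoded "delta" field are modeled on Realtime-style protocols and should be checked against the current azure-ai-voicelive docs, and run_event_loop and play_chunk are hypothetical names. The point it shows is that each audio delta goes to the speaker callback immediately instead of being buffered until the response finishes.

```python
import base64

# Assumed event type strings; verify against the current Voice Live docs.
SPEECH_STARTED = "input_audio_buffer.speech_started"
SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
AUDIO_DELTA = "response.audio.delta"
RESPONSE_DONE = "response.done"


def run_event_loop(events, play_chunk, log=print):
    """Consume events in order, playing each audio delta as soon as it arrives.

    events: iterable of dicts with a "type" key (hypothetical shape).
    play_chunk: callable that receives raw audio bytes for immediate playback.
    Returns the number of audio bytes played before the first response-done.
    """
    played = 0
    for event in events:
        kind = event.get("type")
        if kind == SPEECH_STARTED:
            log("user started speaking")
        elif kind == SPEECH_STOPPED:
            log("user stopped speaking")
        elif kind == AUDIO_DELTA:
            chunk = base64.b64decode(event["delta"])  # assumed base64 payload
            play_chunk(chunk)  # play now; do not wait for the full answer
            played += len(chunk)
        elif kind == RESPONSE_DONE:
            log("response done")
            break  # stop after one complete response
    return played
```

In a real client, events would come from the open session and play_chunk would write to an audio output stream; here both are plain Python objects so the loop can be exercised without a microphone or network.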
Success Criteria:
- Speaking a question produces an audible spoken answer.
- Audio plays back incrementally (you hear it start before the full answer is computed).
Checkpoint: Live demo — speak a Northfield question and the assistant answers out loud. Capture a short screen+audio recording for the readout.
Step 3 — Tune for natural conversation
Goal: Make it feel like a conversation, not a walkie-talkie.
Tasks:
- Enable server-side voice activity detection (VAD) / turn detection so you don’t have to push-to-talk: the assistant detects when you’ve stopped speaking.
- Enable barge-in (interrupt): if you start talking while it’s answering, it stops and listens.
- (If on the Step 4 grounded agent) Ask a corpus question (“What’s the tuition refund policy?”) and confirm the spoken answer reflects grounded content — the voice path still uses your knowledge base.
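On the client side, barge-in mostly comes down to discarding audio that has been received but not yet played once the service reports that the user has started speaking. The sketch below illustrates that idea with the standard library only. PlaybackQueue and on_event are hypothetical names, the event type strings are assumptions modeled on Realtime-style protocols, and the "audio" field is assumed to hold already-decoded bytes.

```python
from collections import deque


class PlaybackQueue:
    """Holds audio chunks that have been received but not yet played."""

    def __init__(self):
        self._chunks = deque()

    def enqueue(self, chunk):
        self._chunks.append(chunk)

    def next_chunk(self):
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def flush(self):
        """Drop all pending audio and return how many chunks were discarded."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped


def on_event(event, queue):
    """Apply one event to the queue; return True if it interrupted playback.

    Event names and the pre-decoded "audio" field are hypothetical.
    """
    kind = event.get("type")
    if kind == "response.audio.delta":
        queue.enqueue(event["audio"])
    elif kind == "input_audio_buffer.speech_started":
        # The user is talking over the assistant: discard pending audio.
        return queue.flush() > 0
    return False
```

A playback thread would pull from next_chunk() while the event handler calls on_event(). A real client may also need to tell the service to cancel the in-flight response; how to do that is something to confirm in the current Voice Live documentation.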
Success Criteria:
- Turn-taking works without manual push-to-talk.
- You can interrupt (barge-in) mid-answer and it yields.
- (Grounded agent) a spoken answer reflects FAQ-corpus content.
Checkpoint: Live demo — a multi-turn spoken conversation with at least one barge-in, confirmed with your coach.
What you built
A hands-free, spoken Northfield IQ Assistant. Same grounded brain, new interface — the agent now listens and talks back in real time, which is the demo people remember.