Our original voice pipeline routed audio to a cloud STT provider. Round-trip latency hovered between 800ms and 1.4s, too slow for a child who expects immediate feedback. Moving recognition on-device cut recognition latency to under 200ms, and full time-to-first-audio came in at or below 850ms at the 95th percentile in testing.
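To make the 95th-percentile figure concrete, here is a minimal sketch of how a time-to-first-audio budget check could be computed. It is not our actual measurement harness: the stage names (`stt`, `response`, `tts_first_chunk`), the helper functions, and the sample timings are all made up for illustration. Only the 850ms budget comes from the numbers above.

```python
import math

BUDGET_MS = 850  # p95 time-to-first-audio target quoted in the text


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def time_to_first_audio(stages_ms):
    """Sum per-stage latencies (in ms) for one conversational turn."""
    return sum(stages_ms.values())


# Hypothetical per-turn stage timings in milliseconds (illustrative only,
# not measured data). The stage breakdown itself is an assumption.
turns = [
    {"stt": 180, "response": 400, "tts_first_chunk": 200},
    {"stt": 190, "response": 420, "tts_first_chunk": 200},
    {"stt": 170, "response": 390, "tts_first_chunk": 200},
    {"stt": 195, "response": 440, "tts_first_chunk": 210},
    {"stt": 185, "response": 405, "tts_first_chunk": 200},
]

ttfa_ms = [time_to_first_audio(t) for t in turns]
p95 = percentile(ttfa_ms, 95)
within_budget = p95 <= BUDGET_MS

print(f"p95 time-to-first-audio: {p95} ms, within budget: {within_budget}")
```

The nearest-rank method is used here because it always returns an observed sample, which keeps the result easy to trace back to a specific turn. An interpolating percentile would work as well and may give slightly different values on small sample sets.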
Building a Sub-Second Voice Pipeline with On-Device STT
How we achieved sub-second time-to-first-audio by moving speech recognition on-device, eliminating cloud round-trip latency entirely.