Ask HN: What's the current best local/open speech-to-speech setup?
Posted by dsrtslnd23 6 days ago
I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).
Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc). But I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.
What are people actually using in 2026 if they want open + local voice?
Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together?
If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?
What’s the most “works today” combo on a single GPU?
Bonus: rough numbers people see for mic → first audio back
Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
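For concreteness, here's roughly what the "glued together" option looks like as a single loop, with timestamps for the mic → first audio back number. This is a minimal sketch with stubbed components (listen_utterance, transcribe, generate_reply, synthesize_and_play are placeholders, not any particular library's API); swap in a streaming ASR, a local LLM, and a streaming TTS behind the same interfaces:

    # Minimal sketch of the "streaming ASR + LLM + streaming TTS" loop.
    # Every component is a stub; the point is the shape of the loop and
    # where the latency clock starts/stops.
    import time
    from typing import Iterator

    def listen_utterance() -> bytes:
        """Stub: block until VAD detects end of speech, return raw PCM."""
        time.sleep(0.5)
        return b"\x00" * 16000  # 0.5 s of silent 16 kHz 16-bit mono audio

    def transcribe(audio: bytes) -> str:
        """Stub: streaming or chunked ASR goes here."""
        return "hello there"

    def generate_reply(text: str) -> Iterator[str]:
        """Stub: token stream from a local LLM, yielded as it arrives."""
        yield from "Hi! How can I help?".split()

    def synthesize_and_play(text_chunk: str) -> None:
        """Stub: streaming TTS; start playback as soon as audio exists."""
        pass

    while True:
        audio = listen_utterance()
        t0 = time.monotonic()                  # end of user speech
        user_text = transcribe(audio)
        first_audio_at = None
        for chunk in generate_reply(user_text):
            synthesize_and_play(chunk)         # barge-in: abort this loop if VAD fires again
            if first_audio_at is None:
                first_audio_at = time.monotonic()
                print(f"mic -> first audio: {(first_audio_at - t0) * 1000:.0f} ms")
        break  # demo: single turn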
Comments
Comment by d4rkp4ttern 5 days ago
STT: Handy [1] (open source), with Parakeet V3 - stunningly fast, near-instant transcription. The slight accuracy drop relative to bigger models is immaterial when you're talking to an AI. I always ask it to restate what it understood, and it gives back a nicely structured version -- this confirms understanding and likely helps the CLI agent stay on track.
TTS: Pocket-TTS [2], just 100M params, with amazing speech quality (English only). I made a voice plugin [3] based on it for Claude Code, so it can speak short updates whenever CC stops. It uses a non-blocking stop hook that calls a headless agent to create a one- or two-sentence summary. Turns out to be surprisingly useful. It's also fun, since you can customize the speaking style, mirror your vibe, etc.
The voice plugin provides commands to control it:
/voice:speak stop
/voice:speak azelma (change the voice)
/voice:speak <your arbitrary prompt to control the style or other aspects>
[1] Handy: https://github.com/cjpais/Handy
[2] Pocket-TTS: https://github.com/kyutai-labs/pocket-tts
[3] Voice plugin for Claude Code: https://github.com/pchalasani/claude-code-tools?tab=readme-o...
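In case it's useful, here's roughly what a non-blocking stop hook can look like. This is a sketch, not the plugin's actual code: it assumes the hook gets its event as JSON on stdin, and summarize_and_speak.py is a hypothetical helper that does the slow summarize + TTS work:

    # Sketch of a non-blocking stop hook: hand the slow work (summary + TTS)
    # to a detached process and exit immediately so the agent is never blocked.
    # Try it with: echo '{}' | python stop_hook.py
    import json
    import subprocess
    import sys

    def main() -> None:
        try:
            event = json.load(sys.stdin)        # hook event payload, if any
        except ValueError:
            event = {}

        # Fire-and-forget: a separate script summarizes the last turn and speaks it.
        subprocess.Popen(
            ["python", "summarize_and_speak.py", json.dumps(event)],  # hypothetical helper
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True,             # detach from the hook process
        )
        sys.exit(0)                             # return right away = non-blocking

    if __name__ == "__main__":
        main()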
Comment by indigodaddy 5 days ago
Have any thoughts?
Comment by 3dsnano 5 days ago
thanks for sharing your knowledge; can’t wait to try out your voice plugin
Comment by d4rkp4ttern 4 days ago
Feel free to file a gh issue if you have problems with the voice plugin
Comment by mpaepper 5 days ago
It has dual-channel input/output and a very permissive license.
Comment by cbrews 5 days ago
The most challenging part of my build was tuning the inference batch sizing. I was able to get speech-to-text working well down to batch sizes of 200 ms. I even implemented a basic local agreement algorithm and it was still very fast (inference time, I think, was around 10-20 ms). You're basically limited by the minimum batch size, NOT inference time. Maybe that's the missing "secret sauce" the original post is asking about?
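In case "local agreement" is new to anyone: the idea is to re-run ASR on a growing audio buffer every chunk and only commit the transcript prefix that the last two hypotheses agree on, so committed text never gets revised. A minimal sketch of the idea (illustrative, not the implementation described above):

    # LocalAgreement-2 sketch: feed in the latest full hypothesis every chunk
    # (e.g. every 200 ms); only words agreed on by two consecutive hypotheses
    # are committed and handed downstream, so they never change afterwards.
    def common_prefix(a: list[str], b: list[str]) -> list[str]:
        out = []
        for x, y in zip(a, b):
            if x != y:
                break
            out.append(x)
        return out

    class LocalAgreement:
        def __init__(self) -> None:
            self.prev: list[str] = []
            self.committed: list[str] = []

        def update(self, hypothesis: list[str]) -> list[str]:
            """Take the latest transcription as a word list; return newly stable words."""
            agreed = common_prefix(self.prev, hypothesis)
            new_words = agreed[len(self.committed):]
            if new_words:
                self.committed = agreed
            self.prev = hypothesis
            return new_words

    la = LocalAgreement()
    print(la.update("the cat sat".split()))         # [] - nothing agreed yet
    print(la.update("the cat sat on".split()))      # ['the', 'cat', 'sat']
    print(la.update("the cat sat on a".split()))    # ['on']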
In the use case listed above, the TTS probably isn't a bottleneck as long as OP can generate tokens quickly.
All this being said, a wrapped model like this that can handle hand-offs between these parts of the pipeline sounds really useful, and I'll definitely be interested to see how it performs.
Let me know if you guys play with this and find success.
Comment by albert_e 5 days ago
And the "Customer Service - Banking" scenario claims to demo "accent control", and the prompt gives the agent a distinctly non-Indian name, yet the agent sounds 100% Indian. I found that hilarious, but isn't it also a bad example, given they claim accent control as a feature?
Comment by vulkoingim 5 days ago
On iOS I'm also using the same app, with the Apple Speech model, which I've found performs better for me than Parakeet/Whisper. One drawback of the Apple model is that you need iOS/macOS 26+, and I haven't bothered to update my Mac to Tahoe.
Both models work instantly for me (Mac M1, iPhone 17 Pro).
Edit: Aaaand I just saw that you're looking for speech-to-speech. Oops, still sleeping.
Comment by jauntywundrkind 5 days ago
Kyutai always does very interesting work. Their delayed-streams modeling is bleeding edge and sounds very promising, especially for low latency. Not sure why I haven't tried it yet, tbh. https://github.com/kyutai-labs/delayed-streams-modeling
There's also Handy, a really nice, elegant, simple app. It only supports Whisper and Parakeet V3, but those are amazing models. https://github.com/cjpais/Handy
Comment by supermatt 5 days ago
Discussion: https://news.ycombinator.com/item?id=46528045
Article: https://www.daily.co/blog/building-voice-agents-with-nvidia-...
Comment by dfajgljsldkjag 5 days ago
Not sure if there are any turnkey setups preconfigured for local install where you can just press play and go, though.
Last I heard, E2E speech-to-speech models are still pretty weak. I've had pretty bad results from gpt-realtime, and that's a proprietary model; I'm assuming open source is a bit behind.
Comment by timwis 5 days ago
You can run it on a Raspberry Pi (or ideally an N100+), and for the microphone/speaker part you can build your own or buy their off-the-shelf voice hardware, which works really well.
Comment by stavros 5 days ago
I looked at their Wyoming docs online but couldn't really see how to even get it to find the server, and the ESPHome firmware it runs offered similarly few hints.
Comment by nsbk 5 days ago
The work is based on a Pipecat repo that I forked and modified to be easier to run (Docker Compose for the server and client), added Spanish support via Canary models, and added NVIDIA Ampere support so it can run on my 3090.
The use case is a conversation partner for my gf, who is learning Spanish, and it works incredibly well. For the LLM I settled on Mistral-Small-3.2-24B-Instruct-2506-Q4_K_S.gguf
Comment by soulofmischief 5 days ago
If you want something simple that runs in the browser, look at vosk-browser [0] and vits-web [1].
I'd also recommend checking out KittenTTS [2]; I use it and it's great for the size/performance. However, you'd need to implement a custom JavaScript harness for the model, since it's a Python project. If you need help with that, shoot me an email and I can share some code.
There are other great approaches too if you don't mind Python; personally, I chose the web as a platform to make my agent fully portable and remote once I release it.
And of course, NVIDIA's new model came out just last week [3], though I haven't had a chance to test it yet. There was also the recent Sparrow-1 [4] announcement, which shows people are finally putting money into the problems plaguing voice agents rigged up from several models and glue infrastructure, versus a single end-to-end model, or at least a conversational turn-taking model to keep things on rails.
[0] https://www.npmjs.com/package/vosk-browser
[1] https://github.com/diffusionstudio/vits-web
[2] https://github.com/KittenML/KittenTTS
[3] https://research.nvidia.com/labs/adlr/personaplex/
[4] https://www.tavus.io/post/sparrow-1-human-level-conversation...
Comment by PhilippGille 5 days ago
> Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together?
Your setup is the latter, not the former.
Comment by schobi 5 days ago
Do you have the GPU running all day at 200W to scan for wake words? Or is that running on the machine you are working on anyway?
Is this running from a headset microphone (while sitting at the desk?) or more like a USB speakerphone? Is there an Alexa jailbreak / alternative firmware that could act as a frontend, with this running on a GPU hidden away?
Comment by butvacuum 5 days ago
There are even microphone ADCs and DSPs (if you use a mic that outputs PCM/I2S instead of analog) that do the processing internally.
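And even without special hardware, the always-on gate is cheap enough to run on the CPU, so the GPU only wakes up once something is actually said. A crude energy-gate sketch (a real setup would use a proper wake-word model like openWakeWord or Porcupine; the threshold and frame size here are arbitrary):

    # Crude always-on gate: RMS energy per 30 ms frame, pure numpy on the CPU.
    # Only wake the heavy GPU pipeline when the level stays above a threshold.
    import numpy as np

    SAMPLE_RATE = 16000
    FRAME = int(0.03 * SAMPLE_RATE)     # 30 ms frames
    THRESHOLD = 0.02                    # arbitrary; tune for your mic and room
    MIN_ACTIVE_FRAMES = 5               # ~150 ms of sustained energy

    def is_loud(frame: np.ndarray) -> bool:
        return float(np.sqrt(np.mean(frame ** 2))) > THRESHOLD

    def should_wake(audio: np.ndarray) -> bool:
        """audio: float32 mono at 16 kHz, values in [-1, 1]."""
        active = 0
        for start in range(0, len(audio) - FRAME, FRAME):
            if is_loud(audio[start:start + FRAME]):
                active += 1
                if active >= MIN_ACTIVE_FRAMES:
                    return True
            else:
                active = 0
        return False

    # Example: half a second of noise well above the threshold trips the gate.
    rng = np.random.default_rng(0)
    print(should_wake(0.05 * rng.standard_normal(SAMPLE_RATE // 2)))  # True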
Comment by doonielk 5 days ago
I was able to get conversational latency, with the ability to interrupt the pipeline, on a Mac using a variety of tricks. It's MLX, so it's only relevant if you have a Mac.
https://github.com/andrewgph/local_voice
For MLX speech-to-speech, I've seen:
The mlx-audio package has some MLX implementations of speech-to-speech models: https://github.com/Blaizzy/mlx-audio/tree/main
Kyutai's Moshi, maybe old now, but it has an MLX implementation of their speech-to-speech model: https://github.com/kyutai-labs/moshi
Comment by sails 5 days ago
It can't be too far off, considering Siri and TTS have been on devices for ages.
Comment by sgt 5 days ago
Comment by nemima 5 days ago
(Full disclosure I'm an engineer there)
Comment by sgt 5 days ago
PS if you can share your email I'll pop you an email about Speechmatics. I tried the English version and it's impressive.
Comment by nemima 5 days ago
https://docs.speechmatics.com/speech-to-text/languages#trans...
Drop me an email at mattn@speechmatics.com and we can chat about further details :)
Comment by dvfjsdhgfv 5 days ago
An API call to GPT-4o works quite well (it basically handles both transcription and diarization), but I wanted a local model.
Whisper is really good for one person speaking. With more people you get repetitions. Qwen and other open multimodal models give subpar results.
I tried a multipass approach, with the first pass identifying the language and chunking, and the next doing the actual transcription, but this tended to miss a lot of content.
I'm going to give canary-1b-v2 a try next weekend. But it looks like, in spite of enormous development in other areas, speech recognition has stalled since Whisper's release (more than 3 years ago already?).
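For what it's worth, the two-pass idea (minus the chunking step) can be sketched with something like faster-whisper; the model names, devices, and meeting.wav are just example choices, and this isn't a claim that it avoids the missed-content problem:

    # Two-pass sketch: a small model only detects the language, then a larger
    # model transcribes with the language pinned. Per-speaker/segment chunking
    # is omitted for brevity.
    from faster_whisper import WhisperModel

    detector = WhisperModel("small", device="cpu", compute_type="int8")
    transcriber = WhisperModel("large-v3", device="cuda", compute_type="float16")

    # Pass 1: language identification only (the text itself is discarded).
    _, info = detector.transcribe("meeting.wav")
    print(f"detected {info.language} (p={info.language_probability:.2f})")

    # Pass 2: full transcription with the detected language pinned.
    segments, _ = transcriber.transcribe("meeting.wav", language=info.language)
    for seg in segments:
        print(f"[{seg.start:6.1f}s -> {seg.end:6.1f}s] {seg.text}")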
Comment by DANmode 5 days ago
Local, FOSS