Toaster Voice Agent Online

You know that feeling when you build something and it just clicks? When all the pieces fall into place and suddenly you have a system that feels like magic instead of just another technical project? That's exactly what happened with the toaster voice agent system.

The Architecture That Made Sense

From day one, Redis was the obvious choice. When you're building a voice agent that needs to process speech in real time, you need something faster than HTTP calls, more reliable than file-based communication, and flexible enough to handle the async nature of voice processing. Redis pub/sub wasn't just a good choice - it was the only choice that made sense.

The pipeline flows like this:

Microphone → Voice Detector → Redis → Speech-to-Text → Redis → Voice Agent → Redis → TTS → Your Voice

Each service is completely independent. They can crash, restart, or get upgraded without affecting the others. That's not just good architecture - that's resilient architecture.
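
To make that concrete, here's a minimal sketch of one hop in the pipeline using the redis-py client. The channel names voice:transcripts and voice:responses are placeholders of my own; only voice:tts_status (discussed below) comes from the actual system.

# One pipeline hop: consume transcripts, publish responses.
# Channel names are illustrative placeholders, not the real ones.
import redis

def generate_response(text: str) -> str:
    # Stand-in for the actual LLM call (GLM 4.5 Air in the real system)
    return f"You said: {text}"

r = redis.Redis()
pubsub = r.pubsub()
pubsub.subscribe("voice:transcripts")

for message in pubsub.listen():
    if message["type"] != "message":
        continue  # pub/sub also yields subscribe confirmations
    reply = generate_response(message["data"].decode())
    r.publish("voice:responses", reply)  # the next service picks this up

Because every service is essentially a loop like this over its own input channel, restarting one of them is invisible to the rest.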

The Echo Cancellation Challenge

But here's where things got interesting. When you're building a voice agent, you have a fundamental problem: how do you prevent the agent from hearing its own voice and getting into an infinite loop?

The initial approach was simple - just don't listen while speaking. But then we hit a wall with Kokoro TTS. Long responses had to be chunked for proper processing, which meant the TTS service was still mid-response while the voice detector - with no way to know another chunk was coming - was already trying to listen for the next user input.

The solution? Echo suppression through Redis pub/sub. The TTS service publishes its status to voice:tts_status, and the voice detector subscribes to it. When TTS is speaking, the voice detector knows to ignore audio input. When TTS finishes, it publishes a "silent" status, and the voice detector resumes listening.

It's elegant, it's effective, and it works flawlessly.
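
Here's a rough sketch of the voice detector's side of that handshake. I'm assuming the status payloads are plain "speaking"/"silent" strings; the channel name voice:tts_status is the real one from above, the rest is illustrative.

# Gate the microphone on the TTS status channel.
import redis

def mic_has_speech() -> bool:
    # Placeholder for the real VAD check on the microphone stream
    return False

r = redis.Redis()
status = r.pubsub()
status.subscribe("voice:tts_status")

tts_speaking = False
while True:
    msg = status.get_message(ignore_subscribe_messages=True, timeout=0.01)
    if msg:
        tts_speaking = msg["data"] == b"speaking"  # assumed payload format
    if tts_speaking:
        continue  # ignore audio input while our own voice is playing
    if mic_has_speech():
        pass  # forward the audio downstream for transcription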

The VRAM Management Dance

Here's where the real engineering challenge lives. The toaster has four Blackwell GPUs with 96GB VRAM each, but managing that memory across multiple AI models is an art form.

Currently, the LLM (GLM 4.5 Air) runs in VRAM because it needs those fast response times for natural conversation flow. Everything else - voice detection, speech-to-text, TTS - runs on CPU. The latency is manageable, but it's a constant balancing act.

# The memory management challenge in practice (PyTorch sketch)
import torch

REQUIRED_FOR_LLM = 60 * 1024**3  # illustrative weight budget, in bytes

free_vram, _total = torch.cuda.mem_get_info()  # free bytes on the current GPU
if free_vram > REQUIRED_FOR_LLM:
    device = "cuda"  # fast responses
else:
    device = "cpu"   # slower but stable

We're still learning this dance. The goal is to maximize performance while ensuring reliability. Sometimes that means moving models between GPU and CPU based on current load.
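
In code, that rebalancing might look something like the sketch below. The threshold and the idea of re-checking free VRAM before each request are my assumptions about one way to do it, not the toaster's actual policy.

# Hypothetical rebalancing: fall back to CPU when free VRAM gets tight.
import torch

def pick_device(min_free_bytes: int = 8 * 1024**3) -> torch.device:
    if torch.cuda.is_available():
        free, _total = torch.cuda.mem_get_info()
        if free > min_free_bytes:
            return torch.device("cuda")
    return torch.device("cpu")

# Before serving the next request, migrate the model if the device changed:
# model = model.to(pick_device())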

The Magic Moment

The first time I had a complete conversation with the system, I knew we'd built something special. Not just because the technology worked, but because it felt natural. I was asking questions about my upcoming trip to Japan, and the agent was responding in my voice with practical travel advice.

That's when it hit me - this isn't just a technical achievement. This is the beginning of truly natural human-AI interaction.

See It In Action

Don't just take my word for it. Here's a quick demo of the voice agent in action:

Voice Agent Demo - YouTube

Watch how naturally the conversation flows. Notice how the agent responds in the actual user's voice. This isn't some robotic text-to-speech - this is conversational AI that sounds like a real person.

The Bigger Picture

This voice agent isn't just a cool tech demo - it's part of something much bigger. We're building the future of home AI assistance at Orenda Technologies. Our goal is simple: create the best home-use agent available.

That means voice interaction that feels natural, computer vision that actually understands what it sees, and an AI assistant that can help with real tasks instead of just answering trivia questions.

The Technical Elegance

What I love about this architecture is how clean it is. Each service has exactly one responsibility:

  • Voice Detector: Find speech in audio stream
  • Speech-to-Text: Convert speech to text
  • Voice Agent: Generate intelligent responses
  • TTS Service: Convert responses back to speech

No service knows about the others. They communicate only through Redis channels with well-defined message formats. This makes the system incredibly maintainable and scalable.
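
The post leaves those message formats implicit, so here's a hedged illustration of what a transcript message might look like as a small JSON payload; the field names are my invention.

# Hypothetical transcript message; only the channel-per-stage idea is real,
# the fields are illustrative.
import json
import time

import redis

r = redis.Redis()
transcript = {
    "text": "What should I pack for Japan?",
    "timestamp": time.time(),
    "confidence": 0.93,
}
r.publish("voice:transcripts", json.dumps(transcript))

A contract this small is easy to version, and any service can be tested in isolation by publishing hand-written messages to its input channel.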

The Future Is Conversational

We're not just building a voice interface - we're building the foundation for truly conversational AI. The same architecture that handles voice today will handle multimodal interactions tomorrow. Show the agent a photo, ask questions about it, get responses in natural speech.

The infrastructure is ready. The models are getting faster and more capable. And most importantly, the user experience is finally catching up to what we've always imagined AI could be.

What's Next

The voice agent is just the beginning. We're working on integrating computer vision, adding more sophisticated reasoning capabilities, and expanding the types of tasks the agent can help with.

But right now, there's something magical about having a conversation with an AI that responds in your own voice, understands your questions, and provides genuinely helpful answers.

The future of AI isn't just about being smarter - it's about being more human. And this voice agent is a huge step in that direction.

Try asking it about your next vacation. You might be surprised by how useful it can be.