Qwen3 Coder 30b Rules!

I spend a lot of time on the LocalLLaMA subreddit hunting for the next breakthrough in agentic coding models. It's become a bit of an obsession—testing every new model that claims to write production-quality code. Most disappoint. Some surprise. But every once in a while, you find one that just rules.

Enter Qwen3 Coder 30b.

The Challenge: Replace AWS Polly

For Forever Fantasy, I needed to ditch AWS Polly and move to local text-to-speech. The goal was simple: build a production-ready TTS service around the Kokoro model, something that could slot right into my game's audio pipeline. I handed Qwen3 Coder the hello-world example from the Kokoro README (a notebook snippet plus its Python requirements) and asked it to build me a complete service.
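For context, the starting point was roughly this (paraphrased from memory of the Kokoro README, so treat the exact names and arguments as approximate):

import soundfile as sf
from kokoro import KPipeline

# Build the Kokoro pipeline ('a' selects American English in the README example)
pipeline = KPipeline(lang_code='a')

text = "Hello from Forever Fantasy."

# The pipeline yields (graphemes, phonemes, audio) chunks as a generator
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio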

What happened next blew me away.

One Shot, No Bugs

Qwen3 Coder delivered a complete, production-ready FastAPI service in one shot. No iterations. No debugging. No "oops, I forgot the Dockerfile." It understood the full scope of what I needed and delivered everything:

@app.post("/v1/audio/speech")
async def generate_speech(request: TextRequest):
    """
    Generate speech from text using kokoro TTS model and return as WAV audio.
    """
    text = request.text
    voice = request.voice

    if not text or len(text.strip()) == 0:
        raise HTTPException(status_code=400, detail="Text cannot be empty")

    try:
        # Generate audio with kokoro model - returns a generator
        generator = tts_pipeline(text, voice=voice)
        
        # Process the first (and typically only) audio output from generator
        for gs, ps, audio in generator:
            # Convert to numpy array if needed and ensure proper format
            if not isinstance(audio, np.ndarray):
                audio = np.array(audio)
            
            # kokoro outputs at 24000Hz as shown in example
            sample_rate = 24000
            
            # Convert to bytes for streaming response - using soundfile
            audio_buffer = io.BytesIO()
            
            # Write to buffer with proper WAV format at 24000Hz
            sf.write(audio_buffer, audio, sample_rate, format='WAV')
            
            # Get bytes from buffer
            audio_buffer.seek(0)
            audio_bytes = audio_buffer.read()
            
            return StreamingResponse(
                iter([audio_bytes]),
                media_type="audio/wav",
                headers={"Content-Disposition": "inline; filename=\"tts_output.wav\""}
            )

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"TTS generation failed: {str(e)}")

Look at that error handling. Look at the proper audio streaming. Look at the OpenAI-style /v1/audio/speech route. This isn't just working code; it's thoughtful code.
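The excerpt above leans on scaffolding defined earlier in the generated file. For readers following along, it amounts to something like this (my reconstruction; the default voice and exact structure are assumptions, not necessarily what the model wrote):

import io

import numpy as np
import soundfile as sf
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from kokoro import KPipeline
from pydantic import BaseModel

app = FastAPI()

# Load the Kokoro pipeline once at startup instead of per request
tts_pipeline = KPipeline(lang_code='a')

class TextRequest(BaseModel):
    text: str
    voice: str = "af_heart"  # default voice is an assumption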

Docker Done Right

But it didn't stop at Python. The Dockerfile it generated shows real understanding of containerization best practices:

FROM ubuntu:24.04

# Set environment variables to avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=UTC

# Update package list and install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    python3 \
    python3-pip \
    python3-dev \
    python3-venv \
    libsndfile1 \
    espeak-ng \
    && rm -rf /var/lib/apt/lists/*

# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy requirements first (for better caching)
COPY requirements.txt .

# Install Python dependencies in virtual environment
RUN pip install --no-cache-dir -r requirements.txt

Virtual environments. Proper layer caching. System dependencies for audio processing (libsndfile1 and espeak-ng). Even a PyTorch MPS fallback environment variable further down in the file, past what's excerpted here. This is production-grade containerization.
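A quick smoke test against the running container looks something like this (host, port, and voice name are assumptions; adjust for your setup):

import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"text": "Welcome back, adventurer.", "voice": "af_heart"},
    timeout=60,
)
resp.raise_for_status()

# The endpoint returns a complete 24 kHz WAV file
with open("tts_output.wav", "wb") as f:
    f.write(resp.content)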

The MoE Advantage

Here's what makes Qwen3 Coder special: it's a Mixture of Experts model, not a dense model like Devstral Small, so only a small slice of its 30B parameters (roughly 3B) is active for any given token. On my 128GB MacBook Pro, I'm getting 80 tokens per second, significantly faster than dense models of similar capability. Initial prompt processing takes a few seconds (thanks to Cline's extensive context), but once it's running, it's usably fast.

The performance puts it on par with Claude 4 Sonnet for code generation, but it runs entirely locally. No API costs. No rate limits. No sending my code to the cloud.

The Bigger Picture

This isn't just about one impressive code generation session. It's about what happens when local AI reaches production quality. When you can run Claude-level coding assistance on your laptop, the entire development workflow changes. No more waiting for API responses. No more worrying about sensitive code leaving your machine.

We're not quite there yet—I still rely on Anthropic's deep understanding of the Forever Fantasy codebase for complex architectural decisions. But for focused tasks like "build me a TTS service," local models are becoming genuinely competitive.

The LocalLLaMA Hunt Continues

Every week brings new models, new architectures, new possibilities. Qwen3 Coder 30b is just the latest in a string of impressive releases that are pushing the boundaries of what's possible with local AI.

The hunt continues, but for now, I'm enjoying having a model that just rules.