AnsibleAutomation: The Crash That Taught Us Everything

Sometimes the best lessons come from disaster. This is one of those stories.

The Before Times: Manual Everything

When we first built toaster, we were playing it fast and loose. Services? Just run the docker command. Configuration? Edit the file and restart. Need to deploy something new? SSH in, pull the repo, install dependencies, configure the service, hope nothing breaks.

It worked great. Right up until it didn't.

The Crash

I won't name names, but let's just say a full system crash turned into the kind of learning experience you only appreciate in hindsight.

Everything was gone. All the carefully configured services. All the docker containers. All the little tweaks and environment variables that made toaster... well, toaster.

And I'm sitting there looking at a blank terminal, thinking: "I have to rebuild all of this from scratch?"

The Realization

That's when it hit me: if you can't automate it, you can't scale it.

We had been building infrastructure like it was permanent. But infrastructure isn't permanent - it's temporary. Servers crash. Disks fail. Things break. What matters isn't the server - it's the knowledge of how to recreate it.

That's when we discovered Ansible.

Why Ansible?

We looked at the options:

  • Terraform - Great for cloud infrastructure, overkill for a single server
  • Docker Compose - Good for containers, doesn't handle systemd services
  • Bash scripts - We already had too many of these

Ansible clicked because:

  1. Agentless - Just SSH and Python
  2. Idempotent - Run the same playbook twice, nothing breaks
  3. Declarative - Say WHAT you want, not HOW to do it
  4. Human readable - YAML, not magic
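Idempotency is the property that does the heavy lifting. Most Ansible modules check the current state before acting, so a task like this (the module is real; the path is just an example, not from our actual setup) only changes the system when something is actually wrong:

```yaml
- name: Ensure the deployment directory exists
  ansible.builtin.file:
    path: /opt/toaster        # hypothetical path for illustration
    state: directory
    mode: "0755"
```

Run it twice: the first run reports "changed", the second reports "ok", because the directory already matches the desired state. Contrast that with a bash script, where running `mkdir` twice means handling the "already exists" error yourself.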

The Structure We Built

Here's the anatomy of our Ansible setup:

ansible/
├── deployments/
│   └── toaster/
│       ├── openwebui/          # Bash startup script
│       ├── openwebui.service   # Systemd service file
│       ├── qwen35/             # Model startup script
│       └── qwen35.service      # Systemd service file
├── roles/
│   └── deploy/
│       └── tasks/
│           └── main.yml        # The deployment logic
└── deploy-openwebui.yml        # Dedicated playbook

The Service File Pattern

Every service on toaster now follows the same pattern:

[Unit]
Description=Service Name
After=docker.service network.target
Requires=docker.service

[Service]
Type=simple
User=<username>
Group=<username>
WorkingDirectory=<working_dir>
ExecStart=<startup_script>
Restart=on-failure
RestartSec=10s
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
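For concreteness, here's what a filled-in unit might look like. The service name, user, and paths below are illustrative placeholders, not our actual values:

```ini
[Unit]
Description=Open WebUI
After=docker.service network.target
Requires=docker.service

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/opt/toaster/openwebui
ExecStart=/opt/toaster/openwebui/start.sh
Restart=on-failure
RestartSec=10s
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

The point of the pattern is that every unit file is boring: once you've read one, you've read them all, and the interesting part lives in the startup script.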

The Bash Script Pattern

And every service has a bash script that holds the actual docker command:

#!/usr/bin/env bash
set -euo pipefail

# exec replaces the shell, so systemd tracks and signals docker directly
exec docker run --rm \
  -p <port>:<port> \
  -v <volume_name>:/app/data \
  --name <service_name> \
  <docker_image>:<tag>

Why bash scripts? Because the command is right there. No guessing. No digging through systemd files. Just read the script and you know exactly what's running.

The Deployment Playbook

The real magic is in the playbook:

- name: Deploy service to toaster
  hosts: toaster
  gather_facts: true
  become: true

  tasks:
    - name: Sync files to remote
      ansible.posix.synchronize:
        src: deployments/toaster/
        dest: <destination_path>/

    - name: Install systemd service
      ansible.builtin.copy:
        src: deployments/toaster/<service>.service
        dest: /etc/systemd/system/<service>.service

    - name: Enable and start service
      ansible.builtin.systemd:
        name: <service>
        daemon_reload: true
        state: restarted
        enabled: true

One command. That's all it takes:

ansible-playbook -i inventory.ini deploy-<service>.yml
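Because every playbook follows the deploy-<service>.yml naming convention, even that one command can be wrapped. This is a hypothetical helper, not part of the actual setup, but it shows how a convention turns into tooling:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper (deploy.sh): maps a service name to its playbook,
# relying on the deploy-<service>.yml naming convention.
playbook_for() {
  printf 'deploy-%s.yml' "$1"
}

# Run the deploy when a service name is given on the command line.
if [[ $# -gt 0 ]]; then
  ansible-playbook -i inventory.ini "$(playbook_for "$1")"
fi
```

Usage would be `./deploy.sh openwebui`, and adding a new service to the fleet never changes how you deploy it.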

What This Means

Before Ansible, deploying a new service meant:

  1. SSH to toaster
  2. Create the docker command
  3. Create the systemd service file
  4. Copy both files
  5. Reload systemd
  6. Enable the service
  7. Start the service
  8. Hope nothing broke

Now? One command.

The Bigger Picture

This isn't just about convenience. It's about resilience.

If toaster crashes tomorrow - really crashes, disk failure, nothing boots - we're not starting from scratch. We have the knowledge. We have the playbooks. We can rebuild everything in minutes.

That's the lesson the crash taught us: automation isn't about laziness, it's about recovery.

What's Next

We're not stopping here. The next step is expanding Ansible to handle:

  • GPU driver management
  • Docker configuration
  • Network setup
  • Backup automation

Because the next crash? We'll be ready.

Sometimes the best way to learn is to break something completely. Then build it back better.


This blog post was written with the help of Qwen 3.5.