Taming The Toaster
You know what's more fun than building a 4-GPU beast of a machine? Figuring out why it crashes every single day like clockwork.
If you've been following along, you know the toaster — our 96-core Threadripper Pro 7995WX with 4x Blackwell Max Q GPUs packing 384GB of VRAM. It's an absolute unit. And for the last few weeks, it's been locking up hard about once a day. Not a graceful crash. Not a kernel panic. A full "black screen, network dead, power button does nothing" lockup. The kind where you have to walk over and flip the PSU switches off and on.
The POST code on the motherboard? 00. That's the hardware equivalent of ¯\_(ツ)_/¯.
The Investigation
The first thing we suspected was the PCIe link retraining issue. NVIDIA's DPM (Dynamic Power Management) drops GPUs down to PCIe Gen1 when they're idle to save power. On the WRX90E motherboard with Blackwell GPUs, the Gen1→Gen5 retrain when inference kicked back in was triggering "Surprise Link Down" errors. We fixed that with kernel params:
pcie_aspm=off pcie_port_pm=off
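To survive reboots, those params go on the kernel command line. A hedged sketch for a GRUB-based distro (the file and variable name differ across distros, and the `...` stands for whatever flags you already have):

```
# /etc/default/grub — append to the existing command line, then run
# update-grub (or grub2-mkconfig) and reboot
GRUB_CMDLINE_LINUX_DEFAULT="... pcie_aspm=off pcie_port_pm=off"
```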
This keeps the PCIe ports in the D0 active state so the link never has to retrain. Problem solved, right?
Nope. Still crashing. Every. Single. Day.
The IPMI Breadcrumbs
Time to dig deeper. We went into the IPMI System Event Log — the motherboard's black box flight recorder. And there it was, buried among normal sensor readings:
Temperature sensor #0x10 | Upper Non-critical going high
Temperature sensor #0x0e | Upper Non-critical going high
These aren't GPU temps. They're not CPU temps. They're PCIe slot retimer temps — sensors labeled PCIE01 through PCIE07 on the motherboard itself. The retimer chips that handle PCIe Gen5 signaling at 32 GT/s.
Now we had a lead. But we needed to see it happen in real time.
Building the Monitor
We already had a monitoring service logging GPU power, temps, and PCIe link status every 60 seconds. We upgraded it to also pull IPMI sensor data for the motherboard's PCIe slot temperatures:
ipmitool sensor | grep -i "PCIE.*Temp"
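Wiring that into the monitoring service is mostly a matter of parsing ipmitool's pipe-delimited output. A minimal sketch, assuming sensor labels that match `PCIE<nn>...Temp` (check `ipmitool sensor` on your own board, since BMC label conventions vary):

```python
import re
import subprocess

def parse_pcie_temps(sensor_output: str) -> dict:
    """Extract PCIe slot temperatures from `ipmitool sensor` output.

    ipmitool prints one pipe-delimited row per sensor:
        name | value | unit | status | ...
    Rows whose value is "na" (sensor not readable) are skipped.
    """
    temps = {}
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue
        name, value = fields[0], fields[1]
        # "PCIE..Temp" is an assumed label pattern, not a standard.
        if re.match(r"PCIE\d+.*Temp", name, re.IGNORECASE) and value != "na":
            temps[name] = float(value)
    return temps

def read_pcie_temps() -> dict:
    """Poll the local BMC (requires ipmitool and BMC access)."""
    out = subprocess.run(["ipmitool", "sensor"],
                        capture_output=True, text=True, check=True).stdout
    return parse_pcie_temps(out)
```

Feed `parse_pcie_temps` a captured row like `PCIE03_Temp | 88.000 | degrees C | ok` and you get `{"PCIE03_Temp": 88.0}`, ready to log next to the nvidia-smi stats.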
That gave us the full picture — GPU stats AND motherboard slot temps, side by side, in real time. And when we fired up our GPU stress test (all 4 cards at max power), the data told a very clear story:
| Time | GPU Power | PCIE01 | PCIE03 | PCIE05 | PCIE07 |
|---|---|---|---|---|---|
| Idle | ~15W | 44°C | 47°C | 41°C | 32°C |
| Model loaded | ~70W | 48°C | 52°C | 45°C | 37°C |
| +1min load | ~300W | 82°C | 87°C | 80°C | 73°C |
| +2min load | ~190W | 85°C | 88°C | 87°C | 81°C |
PCIE03 hit 88°C in two minutes and was still climbing. The "Upper Non-critical" alarm threshold? 90°C.
The GPUs themselves were running at 85-90°C with their fans at... 30%. That's the stock NVIDIA VBIOS fan curve for you. Thirty percent fan speed. At ninety degrees. On a blower card.
Root Cause
Here's what was happening: 4 GPUs pulling 300W each (1200W total) were radiating massive amounts of heat. The stock NVIDIA fan curve was barely spinning the fans, so all that thermal energy was just cooking the motherboard's PCIe slot retimer circuitry. When those retimer chips hit 90°C, PCIe Gen5 signal integrity degrades — 32 GT/s doesn't leave much margin for error. The PCIe fabric hangs, and the entire system locks up hard enough that even the power button can't recover it.
The kicker? nvidia-smi -pl 250 (250W) is the minimum power limit these GPUs allow. And at 250W without better cooling, the slots were still hitting 86°C. We needed to fix the actual cooling.
The Fix: Aggressive Fan Control
A friend who runs a similar setup pointed us to a custom fan control daemon. The script overrides the VBIOS fan curve with something much more aggressive using pynvml:
    FAN_CURVE = [
        (40, 30),   # 40°C → 30% (same as stock at idle)
        (50, 40),   # 50°C → 40% (stock: still 30%)
        (60, 60),   # 60°C → 60% (stock: STILL 30%)
        (75, 85),   # 75°C → 85% (stock: maybe 35%?)
        (85, 100),  # 85°C → 100% (stock: ¯\_(ツ)_/¯)
    ]
The key feature is hysteresis — 3°C of buffer so the fans don't oscillate at temperature boundaries. It also has per-GPU error isolation (if NVML fails on one GPU, the others keep going) and automatic restoration of VBIOS fan control on shutdown so your GPUs don't get stuck at a fixed speed.
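The curve lookup plus hysteresis boils down to a few lines. Here's a minimal sketch of the idea — the names are mine, not necessarily the script's, and a real daemon would push the result to the GPU via pynvml instead of just returning it:

```python
FAN_CURVE = [(40, 30), (50, 40), (60, 60), (75, 85), (85, 100)]
HYSTERESIS_C = 3  # don't step fans back down until temp drops this far

def target_speed(temp_c: float) -> int:
    """Fan percentage for the highest curve point at or below temp_c."""
    speed = FAN_CURVE[0][1]
    for threshold, pct in FAN_CURVE:
        if temp_c >= threshold:
            speed = pct
    return speed

class FanController:
    def __init__(self):
        self.current = 0  # last commanded fan percentage

    def update(self, temp_c: float) -> int:
        """Raise speed immediately; lower it only once the temperature
        has fallen HYSTERESIS_C below the threshold that set it."""
        wanted = target_speed(temp_c)
        if wanted > self.current:
            self.current = wanted  # heating up: react at once
        elif wanted < self.current and target_speed(temp_c + HYSTERESIS_C) < self.current:
            self.current = wanted  # cooled past the buffer: step down
        # In the real daemon, self.current would be applied via NVML here.
        return self.current
```

With a 3°C buffer, a GPU bouncing between 59°C and 61°C stays pinned at 60% instead of flapping between 40% and 60% every polling interval.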
The full script is available as a GitHub Gist.
We deployed it as a systemd service that starts before the model server, so the fans are already aggressive before any GPU load hits.
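In unit-file terms, that ordering is just a `Before=` dependency. A sketch — the script path and the model server's unit name are placeholders for whatever you actually run:

```
[Unit]
Description=Aggressive GPU fan curve daemon
After=nvidia-persistenced.service
# "model-server.service" is a placeholder for your inference service's unit
Before=model-server.service

[Service]
# Placeholder path; point this at the fan daemon script from the Gist
ExecStart=/usr/local/bin/gpu-fans.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```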
The Power Limit Daemon
Since 250W combined with aggressive fans keeps everything safe, we also made the power limit persistent with a simple one-shot systemd service:
[Unit]
Description=Set NVIDIA GPU Power Limit to 250W
After=nvidia-persistenced.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi -pl 250

[Install]
WantedBy=multi-user.target
This runs on every boot, automatically capping the GPUs before any workload starts. No more manually running nvidia-smi -pl after every reboot or crash.
The Results
With both fixes in place — aggressive fan curve + 250W power cap — here's what sustained load looks like now:
GPU0 pcie_gen=5/5 width=x16 power=250W temp=81°C
GPU1 pcie_gen=5/5 width=x16 power=250W temp=69°C
GPU2 pcie_gen=5/5 width=x16 power=250W temp=78°C
GPU3 pcie_gen=5/5 width=x16 power=250W temp=73°C
PCIe_slots: PCIE01=72°C PCIE03=81°C PCIE05=78°C PCIE07=69°C
Max slot temp: 81°C — well under the 90°C alarm threshold. The system has been rock solid since.
The Full Stability Stack
Everything is codified in Ansible for automated deployment and rebuild resilience:
- Kernel params (pcie_aspm=off pcie_port_pm=off) — prevents idle PCIe link retrain crashes
- nvidia-powerlimit.service — caps all GPUs at 250W on boot
- gpu-fans.service — aggressive fan curve daemon with hysteresis and safety fallbacks
- pcie-monitor.service — continuous logging of GPU stats + IPMI PCIe slot temps
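Each piece reduces to the same Ansible pattern: copy the unit file, enable the service. A sketch of one of them (module names are standard Ansible builtins; the file paths are assumptions about your repo layout):

```yaml
- name: Install nvidia-powerlimit unit file
  ansible.builtin.copy:
    src: nvidia-powerlimit.service
    dest: /etc/systemd/system/nvidia-powerlimit.service
    mode: "0644"

- name: Enable nvidia-powerlimit on boot
  ansible.builtin.systemd:
    name: nvidia-powerlimit.service
    enabled: true
    daemon_reload: true
```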
Lessons Learned
The wildest part of this whole saga? The GPUs were fine. The GPUs were always fine. It was the motherboard's PCIe slot retimer chips overheating from radiated GPU heat that was killing the system. That's a failure mode you won't find in most troubleshooting guides.
When you're pushing PCIe Gen5 at 32 GT/s with 4 GPUs each pulling 250-300W in adjacent slots, thermal management isn't just about keeping the GPUs cool — it's about keeping everything around the GPUs cool too. The stock NVIDIA blower fan curve is simply not aggressive enough for a quad-GPU configuration in a workstation case. If you're building something similar, get that fan curve sorted out from day one.
The toaster is finally tamed. For now.