Speedrunning NanoGPT training runs

Explore speedrunning LLM training with NanoGPT, integrating differential attention and the Muon optimizer. Learn about achieving training parity with LLM.c and breaking speed records on H100 instances.

Overview

NanoGPT is a small-scale (124M-parameter) transformer architecture originally created by Andrej Karpathy for learning how to build and train LLMs.

There is an active, ongoing effort to integrate various architectural changes and optimizer improvements into NanoGPT in order to break training speed records. These speedruns are tracked by Keller Jordan (who started the speedrunning effort) here: https://github.com/KellerJordan/modded-nanogpt/tree/master.

As of last week, with PyTorch 2.5 and the new Muon optimizer, NanoGPT speedruns have reached parity with LLM.c in training speed for 124M-scale models.

My current speedrunning attempt takes Keller Jordan’s speedrun and integrates differential attention layers (https://arxiv.org/abs/2410.05258) into it. Differential attention splits the attention heads into excitatory and inhibitory heads and subtracts the inhibitory attention map from the excitatory one, which lets the attention layer focus more clearly on the context it deems relevant. The paper showed a 25-30% improvement in loss for models of the same parameter count, so I am hoping that combining it with the Muon optimizer will break the latest speedrunning record.
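To make the idea concrete, here is a minimal, dependency-free sketch of the core operation: two softmax attention maps are computed from separate query/key projections, and the second is subtracted from the first before weighting the values. This is an illustration only, not the code in my fork. The function names are hypothetical, `lam` is a fixed scalar here (the paper learns it), and causal masking, multi-head handling, and per-head normalization are all left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)) for lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam):
    """Sketch of differential attention (no masking, single head).

    weights = softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))
    output  = weights @ V
    `lam` is a plain scalar here; treating it as fixed is an assumption
    made for illustration.
    """
    excitatory = attention_map(q1, k1)
    inhibitory = attention_map(q2, k2)
    weights = [
        [a - lam * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(excitatory, inhibitory)
    ]
    dim_v = len(values[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim_v)]
        for row in weights
    ]
    return weights, output


# Toy inputs: 2 tokens, head dimension 2 (values chosen arbitrarily).
q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

weights, output = differential_attention(q1, k1, q2, k2, values, lam=0.5)
```

Because each softmax row sums to 1, every row of `weights` sums to `1 - lam`, and individual weights can go negative. That is the mechanism by which attention that both maps assign to irrelevant context gets cancelled out.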

My fork is here: https://github.com/RyanPersson/modded-nanogpt/tree/differential-flash-attention (the differential-flash-attention branch; master is in sync with Keller Jordan’s repo).

I’ve been training and testing on Lambda Cloud H100 instances. I got my variant running on a single H100 node last night, but it still segfaults on an 8xH100 DGX cluster. I am hoping to get that resolved by the meetup next Tuesday, but figured I would go ahead and put in a request to present because it’d be fun to talk about either way.

Links

https://github.com/RyanPersson/modded-nanogpt/tree/differential-fla...
Muon optimization and Rotary embeddings accelerate NanoGPT training to 3.277 validation loss.
https://github.com/KellerJordan/modded-nanogpt/tree/master
NanoGPT speedrun uses Muon optimizer, FlexAttention, FP8 to train 124M LLM.
https://arxiv.org/abs/2410.05258

Tech stack