ARTFEED — Contemporary Art Intelligence

Optimizing LLM Training in Swift: From 2.8 Gflop/s to 1.1 Tflop/s

ai-technology · 2026-05-11

A developer documents the process of optimizing handwritten matrix multiplication code in Swift for training a GPT-2 style LLM on Apple Silicon. Starting from a naive Swift implementation that achieved only 2.8 Gflop/s, the author applied a series of optimizations: replacing Array with MutableSpan to avoid copy-on-write overhead, using Relaxed.multiplyAdd from Swift Numerics for fused-multiply-add instructions, manual loop unrolling with InlineArray, multithreading via DispatchQueue.concurrentPerform, reverse-engineered AMX instructions, and finally custom Metal GPU compute shaders. The final implementation reached 1.1 Tflop/s, a 382x improvement. The reference model is Andrej Karpathy's llm.c, a plain C implementation of GPT-2 with 124 million parameters. The author notes that Apple's Accelerate framework and other libraries would be more efficient for production use, and that the AMX unit is undocumented and subject to breaking changes. The article is part one of a series exploring neural network training in Swift on Apple Silicon.
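To make the starting point concrete, here is a minimal sketch (not the author's actual code) of a naive row-major matrix multiply like the 2.8 Gflop/s baseline, with the inner accumulation done through the standard library's fused multiply-add, `addingProduct(_:_:)`, which is analogous in spirit to the `Relaxed.multiplyAdd` step described above. The function name and flat-array layout are illustrative assumptions.

```swift
import Foundation

// Illustrative sketch only: a naive single-threaded matrix multiply in the
// style of the article's starting point. A is m×k, B is k×n, both stored
// row-major in flat arrays; returns C = A·B as an m×n flat array.
func matmul(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var acc: Float = 0
            for p in 0..<k {
                // Fused multiply-add: one rounding step instead of two,
                // and a single FMA instruction on hardware that has it.
                acc = acc.addingProduct(a[i * k + p], b[p * n + j])
            }
            c[i * n + j] = acc
        }
    }
    return c
}

// 2×2 sanity check: [[1,2],[3,4]] · [[5,6],[7,8]] = [[19,22],[43,50]]
let c = matmul([1, 2, 3, 4], [5, 6, 7, 8], m: 2, n: 2, k: 2)
```

The triple loop performs 2·m·n·k floating-point operations, which is the count used when quoting Gflop/s figures for matrix multiplication.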

Key facts

  • Initial Swift implementation achieved 2.8 Gflop/s.
  • Final optimized implementation reached 1.1 Tflop/s, a 382x improvement.
  • Optimizations included MutableSpan, Relaxed.multiplyAdd, InlineArray, DispatchQueue.concurrentPerform, AMX instructions, and Metal GPU kernels.
  • Reference model is Andrej Karpathy's llm.c, a plain C GPT-2 implementation with 124,439,808 weights.
  • Apple's AMX unit is undocumented and accessed via reverse-engineered instructions.
  • Metal GPU implementation achieved 1.1 Tflop/s, but theoretical GPU peak is 15 Tflop/s.
  • Swift ended up slightly faster than C in some benchmarks due to better SIMD instruction selection.
  • The article is part one of a series; future articles will cover BLAS, BNNS, Core ML, and MPSGraph.
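The multithreading step listed above can be sketched with `DispatchQueue.concurrentPerform`, parallelizing the multiply over rows of the output. This is a hedged illustration under assumed names and layout, not the author's implementation; each iteration writes a disjoint row of C, so no synchronization is needed.

```swift
import Dispatch

// Illustrative sketch: parallelize a naive row-major matmul over rows of C
// using DispatchQueue.concurrentPerform. A is m×k, B is k×n, C is m×n.
func matmulParallel(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    c.withUnsafeMutableBufferPointer { cBuf in
        // Each iteration owns row i of C exclusively, so concurrent
        // iterations never touch the same memory.
        DispatchQueue.concurrentPerform(iterations: m) { i in
            for j in 0..<n {
                var acc: Float = 0
                for p in 0..<k {
                    acc = acc.addingProduct(a[i * k + p], b[p * n + j])
                }
                cBuf[i * n + j] = acc
            }
        }
    }
    return c
}
```

`concurrentPerform` blocks until all iterations finish, which keeps the function's interface identical to the serial version while using all performance cores.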

Entities

People

  • Andrej Karpathy
  • James Thompson

Institutions

  • Apple
  • Swift Numerics
  • Accelerate framework
  • Metal
  • OpenMP
  • Cocoa with Love

Sources