locomp
A Python GPU Kernel Compiler for Apple Silicon, NVIDIA CUDA, and RISC-V.
Write kernels once in Python. Compile to Metal, CUDA, or RISC-V RVV. Think Triton — but hardware-agnostic.
pip install locomp

import locomp
import numpy as np

@locomp.kernel
def vector_add(X: locomp.Tensor, Y: locomp.Tensor, O: locomp.Tensor,
               N: locomp.constexpr):
    i = locomp.program_id(0)
    locomp.store(O + i, locomp.load(X + i) + locomp.load(Y + i))

x = locomp.tensor(np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32))
y = locomp.tensor(np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32))
o = locomp.empty(4)
vector_add[(4,)](x, y, o, N=4)
print(o.numpy()) # [6. 8. 10. 12.]

Compiler Pipeline
@locomp.kernel (Python) → IR (SSA, 60+ opcodes) → Opt passes (CSE, DCE, fold) → Codegen (MSL / CUDA C / RVV) → Dispatch (Metal · CUDA · RISC-V)
Your function is compiled to a native shader — Metal, CUDA, or RISC-V RVV. Cached per constexpr config.
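To make the "Opt passes" stage concrete, here is a minimal sketch of constant folding, CSE, and DCE over a toy SSA instruction list. It is not locomp's actual IR or pass implementation: the `Instr` class, the opcode names, and the three pass functions are hypothetical and exist only to illustrate the techniques.

```python
from dataclasses import dataclass

PURE = {"const", "add", "mul"}  # ops with no side effects (hypothetical opcode set)


@dataclass(frozen=True)
class Instr:
    dest: str    # SSA name defined by this instruction
    op: str      # "const", "add", "mul", "load", or "store"
    args: tuple  # operand names, or a single literal for "const"


def constant_fold(prog):
    """Replace add/mul of two known constants with a single const."""
    consts, out = {}, []
    for ins in prog:
        if ins.op == "const":
            consts[ins.dest] = ins.args[0]
        elif ins.op in ("add", "mul") and all(a in consts for a in ins.args):
            a, b = (consts[x] for x in ins.args)
            value = a + b if ins.op == "add" else a * b
            consts[ins.dest] = value
            ins = Instr(ins.dest, "const", (value,))
        out.append(ins)
    return out


def cse(prog):
    """Reuse the first SSA value computed for each pure (op, args) pair."""
    seen, rename, out = {}, {}, []
    for ins in prog:
        # Rewrite operands that were replaced by an earlier duplicate.
        args = ins.args if ins.op == "const" else tuple(rename.get(a, a) for a in ins.args)
        key = (ins.op, args)
        if ins.op in PURE and key in seen:
            rename[ins.dest] = seen[key]  # duplicate: point users at the first copy
            continue
        if ins.op in PURE:
            seen[key] = ins.dest
        out.append(Instr(ins.dest, ins.op, args))
    return out


def dce(prog):
    """Drop instructions whose result is never used; stores are always kept."""
    live, out = set(), []
    for ins in reversed(prog):
        if ins.op == "store" or ins.dest in live:
            out.append(ins)
            if ins.op != "const":
                live.update(ins.args)
    out.reverse()
    return out


prog = [
    Instr("c2", "const", (2,)),
    Instr("c3", "const", (3,)),
    Instr("k", "add", ("c2", "c3")),   # folds to const 5
    Instr("x", "load", ("px",)),
    Instr("a", "mul", ("x", "k")),
    Instr("b", "mul", ("x", "k")),     # same expression as "a"
    Instr("dead", "add", ("x", "x")),  # result never used
    Instr("s", "add", ("a", "b")),
    Instr("_", "store", ("po", "s")),
]

optimized = dce(cse(constant_fold(prog)))
for ins in optimized:
    print(ins)
```

In this toy program the nine instructions shrink to five: the folded constant, the load, one multiply, the add, and the store. Loads are deliberately excluded from CSE here because merging them is only safe when no store intervenes.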
One kernel. Three hardware targets.
@locomp.kernel compiles to Metal (Apple), CUDA (NVIDIA), or RISC-V RVV from the same Python source.
SSA IR + Optimization passes
CSE, DCE, constant folding, constexpr inlining, type inference. Compiled, not interpreted.
Full kernel language
SIMD reductions, shared memory, atomics, simdgroup matrix ops (AMX), wmma Tensor Cores (CUDA).
63 production kernels
Flash Attention v1/v2/v3, INT4/INT8 matmul, paged attention, RoPE, SwiGLU, KV cache.
Auto-tuning built-in
locomp.autotune benchmarks configs per GPU and caches the winner to disk permanently.
SmolLM2-135M end-to-end
A real 135M-param LLM running entirely on locomp kernels. No PyTorch, no MLX, no Metal C++.
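The auto-tuning behavior described above (benchmark each config, keep the fastest, cache it on disk) can be sketched in plain Python. This is not the `locomp.autotune` API, whose signature is not shown here. The `autotune` helper, the `fake_kernel` stand-in, the `block_size` parameter, the cache key format, and the JSON cache file are all assumptions made for illustration.

```python
import json
import tempfile
import time
from pathlib import Path


def autotune(fn, configs, cache_path, key, repeats=5):
    """Return the fastest config for `key`, reusing a cached winner if present."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if key in cache:
        return cache[key]  # cache hit: skip benchmarking entirely

    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        fn(**cfg)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            fn(**cfg)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed

    cache[key] = best_cfg
    cache_path.write_text(json.dumps(cache, indent=2))
    return best_cfg


calls = []


def fake_kernel(block_size):
    """Stand-in workload; a real kernel launch would go here."""
    calls.append(block_size)
    sum(range(0, 100_000, block_size))


configs = [{"block_size": 1}, {"block_size": 64}]
cache_file = Path(tempfile.mkdtemp()) / "autotune.json"

# The key would typically combine the kernel name and the device it was tuned on.
first = autotune(fake_kernel, configs, cache_file, key="fake_kernel:gpu0")
calls_after_first = len(calls)
second = autotune(fake_kernel, configs, cache_file, key="fake_kernel:gpu0")
print(first, second, len(calls) == calls_after_first)
```

The second call returns the cached winner without running the workload again, which is the point of persisting the result per device.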
SmolLM2-135M on locomp
Full LLM on pure @locomp.kernel Python. No PyTorch. No MLX. No Metal C++.
10 GPU kernels — all pure Python @locomp.kernel · Validated on M1 bare metal & M4 GitHub CI
Apple M1
227 tests · bare metal
Apple M4
227 tests · GitHub CI
NVIDIA A100
64/64 execution checks
RISC-V rv64gcv
9/9 QEMU tests
Start building GPU kernels today
One pip install. Runs on Apple Silicon, NVIDIA GPU, and RISC-V.
pip install locomp