Apple Silicon · NVIDIA CUDA · RISC-V

locomp

A Python GPU Kernel Compiler for Apple Silicon, NVIDIA CUDA, and RISC-V.

Write kernels once in Python. Compile to Metal, CUDA, or RISC-V RVV. Think Triton — but hardware-agnostic.

$ pip install locomp
Apache 2.0 · v1.0.0 · Python 3.10+
vector_add.py
import locomp
import numpy as np

@locomp.kernel
def vector_add(X: locomp.Tensor, Y: locomp.Tensor, O: locomp.Tensor,
               N: locomp.constexpr):
    i = locomp.program_id(0)
    locomp.store(O + i, locomp.load(X + i) + locomp.load(Y + i))

x = locomp.tensor(np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32))
y = locomp.tensor(np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32))
o = locomp.empty(4)

vector_add[(4,)](x, y, o, N=4)
print(o.numpy())  # [ 6.  8. 10. 12.]
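To make the launch semantics concrete, here is a plain-Python emulation of what `vector_add[(4,)](...)` computes. It assumes a one-dimensional grid in which each program instance receives a distinct `program_id(0)` and handles one element. `vector_add_instance` and `launch_1d` are hypothetical helpers written for this illustration and are not part of locomp; a real backend would run the instances in parallel on the device rather than in a loop.

```python
def vector_add_instance(pid, x, y, out):
    # One program instance: load one element from each input, add, store.
    out[pid] = x[pid] + y[pid]


def launch_1d(kernel, grid_size, *args):
    # Sequential stand-in for a 1-D grid launch (hypothetical helper).
    # Each iteration plays the role of one program instance.
    for pid in range(grid_size):
        kernel(pid, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 6.0, 7.0, 8.0]
out = [0.0] * 4

launch_1d(vector_add_instance, 4, x, y, out)
print(out)  # [6.0, 8.0, 10.0, 12.0]
```

Because the grid size equals the number of elements here, no instance reads or writes out of range. A grid larger than the data would need a bounds check against `N`.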

Compiler Pipeline

1. @locomp.kernel: Python
2. IR (SSA): 60+ opcodes
3. Opt passes: CSE, DCE, fold
4. Codegen: MSL / CUDA C / RVV
5. Dispatch: Metal · CUDA · RISC-V

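Your function is compiled to native code for the selected backend (Metal Shading Language, CUDA C, or RISC-V RVV). Each compiled kernel is cached per constexpr configuration, so a repeated launch with the same constexpr values skips recompilation.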
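One way to picture a per-constexpr cache is a dictionary keyed by the kernel name and its constexpr values. The sketch below is an assumption about the general technique, not locomp's actual implementation; `compile_kernel` and `get_compiled` are hypothetical names, and the "compiled" artifact is just a string.

```python
_cache = {}
compile_count = 0


def compile_kernel(name, constexprs):
    # Stand-in for IR lowering, optimization, and codegen (hypothetical).
    global compile_count
    compile_count += 1
    consts = ", ".join(f"{k}={v}" for k, v in sorted(constexprs.items()))
    return f"// {name} specialized for {consts}"


def get_compiled(name, **constexprs):
    # Key on the kernel name plus the sorted constexpr items, so keyword
    # order does not create duplicate cache entries.
    key = (name, tuple(sorted(constexprs.items())))
    if key not in _cache:
        _cache[key] = compile_kernel(name, constexprs)
    return _cache[key]


a = get_compiled("vector_add", N=4)
b = get_compiled("vector_add", N=4)     # cache hit, no recompilation
c = get_compiled("vector_add", N=1024)  # new constexpr value, compiled again
print(compile_count)  # 2
```

The trade-off in this design is that every distinct constexpr value produces a separate specialization, which buys constant inlining at the cost of more compiled variants.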

One kernel. Three hardware targets.

@locomp.kernel compiles to Metal (Apple), CUDA (NVIDIA), or RISC-V RVV from the same Python source.

SSA IR + Optimization passes

CSE, DCE, constant folding, constexpr inlining, type inference. Compiled, not interpreted.
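For readers unfamiliar with these passes, the following toy example runs constant folding, common subexpression elimination (CSE), and dead code elimination (DCE) over a tiny SSA-style instruction list. The tuple-based IR `(dest, op, args)` and all function names are invented for this sketch and say nothing about locomp's real IR or opcodes; names that are never defined, such as `x`, stand for kernel inputs.

```python
import operator

FOLDABLE = {"add": operator.add, "mul": operator.mul}


def constant_fold(instrs):
    # Replace an op whose operands are all known constants by a constant.
    consts, out = {}, []
    for dest, op, args in instrs:
        if op in FOLDABLE and all(a in consts for a in args):
            op, args = "const", (FOLDABLE[op](*(consts[a] for a in args)),)
        if op == "const":
            consts[dest] = args[0]
        out.append((dest, op, args))
    return out


def cse(instrs):
    # Reuse the first value computed by an identical (op, args) pair.
    seen, alias, out = {}, {}, []
    for dest, op, args in instrs:
        if op != "const":
            args = tuple(alias.get(a, a) for a in args)
        key = (op, args)
        if key in seen:
            alias[dest] = seen[key]
        else:
            seen[key] = dest
            out.append((dest, op, args))
    return out, alias


def dce(instrs, roots):
    # Walk backwards and keep only values reachable from the roots.
    live, out = set(roots), []
    for dest, op, args in reversed(instrs):
        if dest in live:
            out.append((dest, op, args))
            if op != "const":
                live.update(args)
    return out[::-1]


def optimize(instrs, root):
    instrs = constant_fold(instrs)
    instrs, alias = cse(instrs)
    return dce(instrs, [alias.get(root, root)])


program = [
    ("c2", "const", (2,)),
    ("c3", "const", (3,)),
    ("t0", "mul", ("c2", "c3")),   # folds to the constant 6
    ("t1", "add", ("x", "t0")),
    ("t2", "add", ("x", "t0")),    # duplicate of t1, removed by CSE
    ("t3", "mul", ("t1", "t2")),
    ("dead", "add", ("x", "c2")),  # never used, removed by DCE
]

for instr in optimize(program, "t3"):
    print(instr)
# ('t0', 'const', (6,))
# ('t1', 'add', ('x', 't0'))
# ('t3', 'mul', ('t1', 't1'))
```

Seven instructions shrink to three: the multiply of two constants is folded, the repeated add is shared, and the unused constants and the dead add are dropped.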

Full kernel language

SIMD reductions, shared memory, atomics, simdgroup matrix ops (AMX), wmma Tensor Cores (CUDA).

63 production kernels

Flash Attention v1/v2/v3, INT4/INT8 matmul, paged attention, RoPE, SwiGLU, KV cache.
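The idea that Flash Attention kernels build on is the online softmax: scores are processed block by block while a running maximum, a running normalizer, and a running weighted sum are kept, so the full score row never has to be materialized. The pure-Python sketch below shows that recurrence for one query row with scalar values. It is a reference for the technique only, not one of locomp's kernels, and the function names are hypothetical.

```python
import math


def attention_row_reference(scores, values):
    # Two-pass softmax followed by a weighted sum.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def attention_row_online(scores, values, block=2):
    # m: running max, l: running normalizer, acc: running weighted sum.
    m, l, acc = -math.inf, 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescales the earlier partial results
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
diff = abs(attention_row_online(scores, values) - attention_row_reference(scores, values))
print(diff < 1e-12)  # True
```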

Auto-tuning built-in

locomp.autotune benchmarks configs per GPU and caches the winner to disk permanently.
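The general shape of such an auto-tuner can be sketched with the standard library alone: time each candidate config, keep the fastest, and store the result in a JSON file keyed by device and kernel name. Everything below (`autotune`, `time_once`, the cache layout, the `device` string) is a hypothetical illustration of the technique and does not reflect the signature or cache format of `locomp.autotune`.

```python
import json
import os
import tempfile
import time


def time_once(fn, config):
    # Wall-clock time for a single run of fn with the given config.
    start = time.perf_counter()
    fn(**config)
    return time.perf_counter() - start


def autotune(name, fn, configs, cache_path, device="cpu",
             measure=time_once, repeats=3):
    # Load previously stored winners, if any.
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    key = f"{device}:{name}"
    if key in cache:
        return cache[key]  # winner already known for this device
    # Score each config by its best time over a few repeats.
    best = min(configs,
               key=lambda c: min(measure(fn, c) for _ in range(repeats)))
    cache[key] = best
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return best


def blocked_sum(block):
    # Toy workload whose speed depends on the block size.
    data = list(range(4096))
    return sum(sum(data[i:i + block]) for i in range(0, len(data), block))


configs = [{"block": 16}, {"block": 64}, {"block": 256}]
path = os.path.join(tempfile.mkdtemp(), "autotune_cache.json")

first = autotune("blocked_sum", blocked_sum, configs, path)
second = autotune("blocked_sum", blocked_sum, configs, path)  # read from disk
print(first == second)  # True
```

Keying the cache by device matters because the fastest config on one GPU is often not the fastest on another.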

SmolLM2-135M end-to-end

A real 135M-param LLM running entirely on locomp kernels. No PyTorch, no MLX, no Metal C++.

SmolLM2-135M on locomp

Full LLM on pure @locomp.kernel Python. No PyTorch. No MLX. No Metal C++.

$ python examples/54_smollm2_inference.py
Loading weights... 272 tensors, 538 MB
Uploading to GPU done
► "The meaning of life is"
The meaning of life is to be found in the meaning of the universe.
► "Once upon a time"
Once upon a time, there was a little girl named Lily...
► "Python is a programming language that"
Python is a programming language that allows you to write
programs in a structured way...
6.5 – 7.4 tok/s · Apple M1 · no PyTorch

10 GPU kernels — all pure Python @locomp.kernel · Validated on M1 bare metal & M4 GitHub CI

Apple M1 · 227 tests · bare metal
Apple M4 · 227 tests · GitHub CI
NVIDIA A100 · 64/64 execution checks
RISC-V rv64gcv · 9/9 QEMU tests

Start building GPU kernels today

One pip install. Runs on Apple Silicon, NVIDIA GPU, and RISC-V.

$ pip install locomp