# API Reference
Complete reference for all locomp APIs and functions.
## Kernel Decorator
```python
@locomp.kernel
def my_kernel(A: locomp.Tensor, B: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0)
    locomp.store(B + i, locomp.load(A + i) * 2.0)

# Dispatch: grid=(N,), default threadgroup size of 1
my_kernel[(N,)](a, b, N=N)

# With an explicit threadgroup size
my_kernel[(N,), (128,)](a, b, N=N)
```

- All pointer arguments must be typed `locomp.Tensor`
- Compile-time constants must be typed `locomp.constexpr`
- The compiled pipeline is cached on first call, per constexpr configuration (see the sketch below)
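Constexpr values are baked into the generated shader, so each distinct configuration gets its own compiled pipeline. A minimal sketch of the caching behavior (the kernel and sizes here are illustrative):

```python
import numpy as np
import locomp

@locomp.kernel
def scale(A: locomp.Tensor, B: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0)
    locomp.store(B + i, locomp.load(A + i) * 2.0)

a = locomp.tensor(np.arange(1024, dtype=np.float32))
b = locomp.empty((1024,))

scale[(1024,)](a, b, N=1024)            # first call: compiles a pipeline for N=1024
scale[(1024,)](a, b, N=1024)            # cache hit: reuses that pipeline
scale[(512,)](a[:512], b[:512], N=512)  # new constexpr value: compiles a second pipeline
```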
## Types
| Type | Description |
|---|---|
| `locomp.Tensor` | GPU buffer pointer; kernel parameter type for buffer arguments |
| `locomp.constexpr` | Compile-time constant; inlined as a literal in the generated shader |
| `locomp.Float16` | 16-bit float (`half` in Metal, `__half` in CUDA, `_Float16` in RISC-V) |
| `locomp.BFloat16` | Brain float16 (`bfloat16` in Metal, `__nv_bfloat16` in CUDA) |
| `locomp.Int8` | Signed 8-bit integer |
| `locomp.UInt8` | Unsigned 8-bit integer |
| `locomp.Int32` | Signed 32-bit integer |
| `locomp.Bool` | Boolean type |
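As an illustration of how these types appear in kernels, the sketch below downcasts a Float32 buffer to Float16 with `locomp.cast`. It assumes `locomp.tensor` preserves the NumPy dtype, so a `float16` array yields a half-precision destination buffer:

```python
import numpy as np
import locomp

@locomp.kernel
def to_half(src: locomp.Tensor, dst: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0)
    # Load as fp32, narrow to fp16 (half / __half / _Float16 depending on backend)
    locomp.store(dst + i, locomp.cast(locomp.load(src + i), locomp.Float16))

x32 = locomp.tensor(np.random.randn(1024).astype(np.float32))
x16 = locomp.tensor(np.zeros(1024, dtype=np.float16))  # assumed fp16 allocation
to_half[(1024,)](x32, x16, N=1024)
```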
## Tensor API
| Call | Description |
|---|---|
| `locomp.tensor(numpy_array)` | Create a GPU tensor from a NumPy array (`backend="auto"`) |
| `locomp.tensor(array, backend="cuda")` | Create a tensor on a specific backend: `"auto"`, `"metal"`, `"cuda"`, or `"riscv"` |
| `locomp.empty(shape)` | Allocate an uninitialized GPU tensor |
| `locomp.zeros(shape)` | Zero-filled GPU tensor |
| `locomp.ones(shape)` | Ones-filled GPU tensor |
| `t.numpy()` | Read back to the CPU; syncs with the GPU automatically |
| `t.reshape(new_shape)` | Zero-copy reshape |
| `t.transpose(dim0, dim1)` | Transpose two dimensions |
| `t.permute(*dims)` | Arbitrary dimension reorder |
| `t.contiguous()` | Materialize a contiguous copy |
| `t[slices]` | NumPy-style slicing |
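A quick tour of the helpers above (shapes are illustrative):

```python
import numpy as np
import locomp

a = locomp.tensor(np.arange(12, dtype=np.float32))  # upload with backend="auto"
m = a.reshape((3, 4))   # zero-copy view
t = m.transpose(0, 1)   # now (4, 3), strided
c = t.contiguous()      # materialize a contiguous copy
print(c[1].numpy())     # NumPy-style slice, then sync back to the CPU
```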
## Built-in Functions
| Function | Description |
|---|---|
| `locomp.program_id(axis)` | Threadgroup index along `axis` (0, 1, or 2) |
| `locomp.local_id(axis)` | Thread index within the threadgroup |
| `locomp.load(ptr)` | Load a value from a GPU pointer |
| `locomp.store(ptr, val)` | Store a value to a GPU pointer |
| `locomp.atomic_add(ptr, val)` | Atomic add; returns the old value |
| `locomp.atomic_max(ptr, val)` | Atomic max |
| `locomp.cast(val, dtype)` | Type cast, e.g. Float32 → Float16 |
| `locomp.sqrt(x)` | Square root |
| `locomp.rsqrt(x)` | Reciprocal square root (fast) |
| `locomp.exp(x)` | Exponential |
| `locomp.log(x)` | Natural log |
| `locomp.abs(x)` | Absolute value |
| `locomp.max(a, b)` | Element-wise max |
| `locomp.min(a, b)` | Element-wise min |
| `locomp.simd_sum(val)` | Warp/SIMD-group sum reduction |
| `locomp.simd_max(val)` | Warp/SIMD-group max reduction |
| `locomp.simd_lane_id()` | Lane index within the SIMD group (0–31) |
| `locomp.simd_group_id()` | SIMD-group index within the threadgroup |
| `locomp.shared_memory(size)` | Allocate threadgroup shared memory |
| `locomp.shared_load(smem, idx)` | Load from a shared-memory slot |
| `locomp.shared_store(smem, idx, val)` | Store to a shared-memory slot |
| `locomp.barrier()` | Threadgroup barrier; synchronizes all threads |
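The SIMD and threadgroup primitives compose into the standard two-level reduction: each SIMD group computes a partial sum, the partials are staged through shared memory, and one thread combines them. The sketch below assumes a 128-thread threadgroup, a SIMD width of 32 (so four partials), that `shared_memory(size)` is sized in slots, and that plain Python `if` lowers to shader control flow:

```python
@locomp.kernel
def block_sum(A: locomp.Tensor, out: locomp.Tensor, N: locomp.constexpr):
    tid = locomp.local_id(0)
    i = locomp.program_id(0) * 128 + tid  # assumes N is a multiple of 128
    x = locomp.load(A + i)

    # Level 1: sum across each 32-wide SIMD group
    partial = locomp.simd_sum(x)

    # Stage one partial per SIMD group (128 / 32 = 4 slots)
    smem = locomp.shared_memory(4)
    if locomp.simd_lane_id() == 0:
        locomp.shared_store(smem, locomp.simd_group_id(), partial)
    locomp.barrier()

    # Level 2: thread 0 folds the four partials and accumulates globally
    if tid == 0:
        total = (locomp.shared_load(smem, 0) + locomp.shared_load(smem, 1)
                 + locomp.shared_load(smem, 2) + locomp.shared_load(smem, 3))
        locomp.atomic_add(out, total)
```

Dispatched as `block_sum[(N // 128,), (128,)](a, out, N=N)` with `out` a zero-initialized one-element tensor.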
## Auto-Tune API
```python
from locomp import autotune, Config

@autotune(
    configs=[Config(BLOCK_M=16), Config(BLOCK_M=32), Config(BLOCK_M=64)],
    key=["M", "N"],
)
@locomp.kernel
def my_kernel(..., BLOCK_M: locomp.constexpr):
    ...
```

- `configs`: a list of `Config` objects, each specifying constexpr values to try
- `key`: a list of argument names whose values determine the problem shape
- `Config(**kwargs)`: specify any constexpr values as keyword arguments
- Results are cached to `~/.cache/locomp/autotune.json` (see the sketch below)
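At call time the tuner benchmarks every config the first time it sees a new `key` tuple and records the winner; later calls with the same shape reuse it. The call pattern below is a sketch and assumes Triton-style dispatch, where the tuned `BLOCK_M` is injected by the tuner rather than passed by the caller:

```python
import numpy as np
import locomp

M, N = 4096, 4096
x = locomp.tensor(np.random.randn(M, N).astype(np.float32))
out = locomp.zeros((M, N))

# First call at key (4096, 4096): benchmarks BLOCK_M in {16, 32, 64},
# then caches the winner in ~/.cache/locomp/autotune.json
my_kernel[(M,)](x, out, M=M, N=N)

# Same key: cached config, no re-tuning
my_kernel[(M,)](x, out, M=M, N=N)
```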
## Kernel Graph
Chain multiple kernels into one GPU command buffer. Eliminates per-kernel CPU↔GPU sync overhead.
```python
# Context manager: runs on exit
with locomp.graph() as g:
    g.add(rms_norm, (rows,), (128,), x, w, h, N=dim, eps=1e-5)
    g.add(matmul, (M * N,), h, w2, out, M=M, N=K, K=dim)

# Manual build + run (re-runnable)
g = locomp.graph()
g.add(kernel_a, (N,), x, tmp, N=N)
g.add(kernel_b, (N,), tmp, out, N=N)
g.run()  # first run
g.run()  # re-run the same sequence on the same tensors
```

## Profiler
```python
with locomp.profile() as p:
    rms_norm[(rows,), (128,)](x, w, out, N=dim, eps=1e-5)
    matmul[(M * N,)](a, b, c, M=M, N=N, K=K)

print(p.report())
# Kernel     Grid     ms
# rms_norm   (128,)   0.245
# matmul     (4096,)  1.103
```

## Autograd
### CPU Autograd: `locomp.ag`
Tape-based reverse-mode autodiff on NumPy, implementing 15 ops.
```python
a = locomp.ag.tensor(np.random.randn(N), requires_grad=True)
b = locomp.ag.tensor(np.random.randn(N), requires_grad=True)
loss = locomp.ag.sum(locomp.ag.mul(a, b))
locomp.ag.backward(loss)
print(a.grad)  # dL/da = b
```

### GPU Autograd: `locomp.gpu_ag`
Forward and backward passes run as real locomp kernels on Metal or CUDA, implementing 14 ops.
```python
ga = locomp.gpu_ag
x = ga.tensor(np.random.randn(N), requires_grad=True)  # lives on the GPU
y = ga.relu(ga.exp(x))
ga.backward(ga.sum(y))
print(x.grad.numpy())

# Supported ops: add sub mul div exp log relu pow
# sigmoid tanh sum mean matvec matmul softmax cross_entropy
```

## CUDA API
| Call | Description |
|---|---|
| `locomp.cuda_available()` | Returns `True` if CUDA is available |
| `locomp.cuda_set_device(idx)` | Set the active CUDA device (0-indexed) |
| `locomp.cuda_device_count()` | Number of available CUDA GPUs |
| `locomp.CUDATensor` | Tensor class backed by a CUDA device pointer |
| `locomp.tensor(arr, backend="cuda")` | Upload a NumPy array to the CUDA device |
```python
if locomp.cuda_available():
    print(f"{locomp.cuda_device_count()} CUDA GPUs available")
    t = locomp.tensor(np.ones(1024, np.float32), backend="cuda")
    my_kernel[(1024,)](t, out, N=1024)
    print(out.numpy())
```

## Example Index
| # | Name | Description |
|---|---|---|
| 01–05 | Vector ops | add, mul, relu, gelu, sigmoid |
| 06–10 | Reductions | sum, max, softmax, layer norm, rms norm |
| 11–15 | Matrix ops | matmul, batched matmul, flash attention v1/v2/v3 |
| 16–20 | Attention | multi-head attention, causal mask, ALiBi, RoPE |
| 21–25 | Quantization | INT4, INT8, block-wise quantized matmul, dequant |
| 26–30 | Autograd | CPU ag ops, GPU ag ops, backward pass |
| 31–35 | Attention v3 | paged attention, KV cache, group-query attention |
| 36–40 | LLM kernels | SwiGLU, SiLU, Mistral RoPE, parallel embeddings |
| 41–45 | CUDA kernels | CUDA vector add, fp16 matmul, softmax, wmma |
| 46–50 | Benchmarks | Metal vs MLX, latency sweep, memory bandwidth |
| 51–54 | Full LLM | SmolLM2-135M tokenizer, weights, generate, inference |
| 55–57 | BFloat16 + HW | BFloat16 kernels, hardware validation |
| 58 | Benchmark Suite (Metal) | Full benchmark vs MLX on Apple M1 |
| 59 | Benchmark Suite (CUDA) | Full benchmark on NVIDIA A100 |
| 60 | MPS vs locomp | PyTorch MPS vs locomp comparison |
| 61 | Kernel Graph | `locomp.graph()` multi-kernel fusion |
| 62 | SmolLM2 fp16 | SmolLM2-135M in float16 |
| 63 | RISC-V Codegen | RISC-V RVV codegen demo |