# API Reference
Complete reference for all locomp APIs and functions.
## Kernel Decorator
```python
@locomp.kernel
def my_kernel(A: locomp.Tensor, B: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0)
    locomp.store(B + i, locomp.load(A + i) * 2.0)

# Dispatch: grid=(N,), default threadgroup size of 1
my_kernel[(N,)](a, b, N=N)

# With an explicit threadgroup size
my_kernel[(N,), (128,)](a, b, N=N)
```

- All pointer arguments must be typed `locomp.Tensor`
- Compile-time constants must be typed `locomp.constexpr`
- The compiled pipeline is cached on first call, per constexpr configuration (see the sketch below)
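Constexpr values are baked into the generated shader, so each distinct configuration gets its own compiled pipeline. A minimal sketch of the caching behavior (the kernel and sizes here are illustrative):

```python
import numpy as np
import locomp

@locomp.kernel
def scale(A: locomp.Tensor, B: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0)
    locomp.store(B + i, locomp.load(A + i) * 2.0)

a = locomp.tensor(np.arange(1024, dtype=np.float32))
b = locomp.empty((1024,))

scale[(1024,)](a, b, N=1024)            # first call: compiles a pipeline for N=1024
scale[(1024,)](a, b, N=1024)            # cache hit: reuses that pipeline
scale[(512,)](a[:512], b[:512], N=512)  # new constexpr value: compiles a second pipeline
```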
## Types
| Type | Description |
|---|---|
| `locomp.Tensor` | GPU buffer pointer; kernel parameter type for buffer arguments |
| `locomp.constexpr` | Compile-time constant; inlined as a literal in the generated shader |
| `locomp.Float16` | 16-bit float (`half` in Metal, `__half` in CUDA, `_Float16` in RISC-V) |
| `locomp.BFloat16` | Brain float16 (`bfloat16` in Metal, `__nv_bfloat16` in CUDA) |
| `locomp.Int8` | Signed 8-bit integer |
| `locomp.UInt8` | Unsigned 8-bit integer |
| `locomp.Int32` | Signed 32-bit integer |
| `locomp.Bool` | Boolean type |
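As an illustration of how these types appear in kernels, the sketch below downcasts a Float32 buffer to Float16 with `locomp.cast`. It assumes `locomp.tensor` preserves the NumPy dtype, so a `float16` array yields a half-precision destination buffer:

```python
import numpy as np
import locomp

@locomp.kernel
def to_half(src: locomp.Tensor, dst: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0)
    # Load as fp32, narrow to fp16 (half / __half / _Float16 depending on backend)
    locomp.store(dst + i, locomp.cast(locomp.load(src + i), locomp.Float16))

x32 = locomp.tensor(np.random.randn(1024).astype(np.float32))
x16 = locomp.tensor(np.zeros(1024, dtype=np.float16))  # assumed fp16 allocation
to_half[(1024,)](x32, x16, N=1024)
```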
## Tensor API
| Call | Description |
|---|---|
| `locomp.tensor(numpy_array)` | Create a GPU tensor from a NumPy array (`backend="auto"`) |
| `locomp.tensor(array, backend="cuda")` | Create a tensor on a specific backend: `"auto"`, `"metal"`, `"cuda"`, or `"riscv"` |
| `locomp.empty(shape)` | Allocate an uninitialized GPU tensor |
| `locomp.zeros(shape)` | Zero-filled GPU tensor |
| `locomp.ones(shape)` | Ones-filled GPU tensor |
| `t.numpy()` | Read back to the CPU; syncs with the GPU automatically |
| `t.reshape(new_shape)` | Zero-copy reshape |
| `t.transpose(dim0, dim1)` | Transpose two dimensions |
| `t.permute(*dims)` | Arbitrary dimension reorder |
| `t.contiguous()` | Materialize a contiguous copy |
| `t[slices]` | NumPy-style slicing |
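A quick tour of the helpers above (shapes are illustrative):

```python
import numpy as np
import locomp

a = locomp.tensor(np.arange(12, dtype=np.float32))  # upload with backend="auto"
m = a.reshape((3, 4))   # zero-copy view
t = m.transpose(0, 1)   # now (4, 3), strided
c = t.contiguous()      # materialize a contiguous copy
print(c[1].numpy())     # NumPy-style slice, then sync back to the CPU
```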
## Built-in Functions
| Function | Description |
|---|---|
| `locomp.program_id(axis)` | Threadgroup index along `axis` (0, 1, or 2) |
| `locomp.local_id(axis)` | Thread index within the threadgroup |
| `locomp.load(ptr)` | Load a value from a GPU pointer |
| `locomp.store(ptr, val)` | Store a value to a GPU pointer |
| `locomp.atomic_add(ptr, val)` | Atomic add; returns the old value |
| `locomp.atomic_max(ptr, val)` | Atomic max |
| `locomp.cast(val, dtype)` | Type cast, e.g. Float32 → Float16 |
| `locomp.sqrt(x)` | Square root |
| `locomp.rsqrt(x)` | Reciprocal square root (fast) |
| `locomp.exp(x)` | Exponential |
| `locomp.log(x)` | Natural log |
| `locomp.abs(x)` | Absolute value |
| `locomp.max(a, b)` | Element-wise max |
| `locomp.min(a, b)` | Element-wise min |
| `locomp.simd_sum(val)` | Warp/SIMD-group sum reduction |
| `locomp.simd_max(val)` | Warp/SIMD-group max reduction |
| `locomp.simd_lane_id()` | Lane index within the SIMD group (0–31) |
| `locomp.simd_group_id()` | SIMD-group index within the threadgroup |
| `locomp.shared_memory(size)` | Allocate threadgroup shared memory |
| `locomp.shared_load(smem, idx)` | Load from a shared-memory slot |
| `locomp.shared_store(smem, idx, val)` | Store to a shared-memory slot |
| `locomp.barrier()` | Threadgroup barrier; synchronizes all threads |
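The SIMD and threadgroup primitives compose into the standard two-level reduction: each SIMD group computes a partial sum, the partials are staged through shared memory, and one thread combines them. The sketch below assumes a 128-thread threadgroup, a SIMD width of 32 (so four partials), that `shared_memory(size)` is sized in slots, and that plain Python `if` lowers to shader control flow:

```python
@locomp.kernel
def block_sum(A: locomp.Tensor, out: locomp.Tensor, N: locomp.constexpr):
    tid = locomp.local_id(0)
    i = locomp.program_id(0) * 128 + tid  # assumes N is a multiple of 128
    x = locomp.load(A + i)

    # Level 1: sum across each 32-wide SIMD group
    partial = locomp.simd_sum(x)

    # Stage one partial per SIMD group (128 / 32 = 4 slots)
    smem = locomp.shared_memory(4)
    if locomp.simd_lane_id() == 0:
        locomp.shared_store(smem, locomp.simd_group_id(), partial)
    locomp.barrier()

    # Level 2: thread 0 folds the four partials and accumulates globally
    if tid == 0:
        total = (locomp.shared_load(smem, 0) + locomp.shared_load(smem, 1)
                 + locomp.shared_load(smem, 2) + locomp.shared_load(smem, 3))
        locomp.atomic_add(out, total)
```

Dispatched as `block_sum[(N // 128,), (128,)](a, out, N=N)` with `out` a zero-initialized one-element tensor.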
## Auto-Tune API
```python
from locomp import autotune, Config

@autotune(
    configs=[Config(BLOCK_M=16), Config(BLOCK_M=32), Config(BLOCK_M=64)],
    key=["M", "N"],
)
@locomp.kernel
def my_kernel(..., BLOCK_M: locomp.constexpr):
    ...
```

- `configs`: a list of `Config` objects, each specifying constexpr values to try
- `key`: a list of argument names whose values determine the problem shape
- `Config(**kwargs)`: specify any constexpr values as keyword arguments
- Results are cached to `~/.cache/locomp/autotune.json` (see the sketch below)
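At call time the tuner benchmarks every config the first time it sees a new `key` tuple and records the winner; later calls with the same shape reuse it. The call pattern below is a sketch and assumes Triton-style dispatch, where the tuned `BLOCK_M` is injected by the tuner rather than passed by the caller:

```python
import numpy as np
import locomp

M, N = 4096, 4096
x = locomp.tensor(np.random.randn(M, N).astype(np.float32))
out = locomp.zeros((M, N))

# First call at key (4096, 4096): benchmarks BLOCK_M in {16, 32, 64},
# then caches the winner in ~/.cache/locomp/autotune.json
my_kernel[(M,)](x, out, M=M, N=N)

# Same key: cached config, no re-tuning
my_kernel[(M,)](x, out, M=M, N=N)
```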
## Kernel Graph
Chain multiple kernels into one GPU command buffer. Eliminates per-kernel CPU↔GPU sync overhead.
```python
# Context manager: runs on exit
with locomp.graph() as g:
    g.add(rms_norm, (rows,), (128,), x, w, h, N=dim, eps=1e-5)
    g.add(matmul, (M * N,), h, w2, out, M=M, N=K, K=dim)

# Manual build + run (re-runnable)
g = locomp.graph()
g.add(kernel_a, (N,), x, tmp, N=N)
g.add(kernel_b, (N,), tmp, out, N=N)
g.run()  # first run
g.run()  # re-run the same sequence on the same tensors
```

## Profiler
```python
with locomp.profile() as p:
    rms_norm[(rows,), (128,)](x, w, out, N=dim, eps=1e-5)
    matmul[(M * N,)](a, b, c, M=M, N=N, K=K)

print(p.report())
# Kernel     Grid     ms
# rms_norm   (128,)   0.245
# matmul     (4096,)  1.103
```

## Autograd
### CPU Autograd: `locomp.ag`
Tape-based reverse-mode autodiff on NumPy, implementing 15 ops.
```python
a = locomp.ag.tensor(np.random.randn(N), requires_grad=True)
b = locomp.ag.tensor(np.random.randn(N), requires_grad=True)
loss = locomp.ag.sum(locomp.ag.mul(a, b))
locomp.ag.backward(loss)
print(a.grad)  # dL/da = b
```

### GPU Autograd: `locomp.gpu_ag`
Forward and backward passes run as real locomp kernels on Metal or CUDA, implementing 14 ops.
```python
ga = locomp.gpu_ag
x = ga.tensor(np.random.randn(N), requires_grad=True)  # lives on the GPU
y = ga.relu(ga.exp(x))
ga.backward(ga.sum(y))
print(x.grad.numpy())

# Supported ops: add sub mul div exp log relu pow
# sigmoid tanh sum mean matvec matmul softmax cross_entropy
```

## CUDA API
| Call | Description |
|---|---|
| `locomp.cuda_available()` | Returns `True` if CUDA is available |
| `locomp.cuda_set_device(idx)` | Set the active CUDA device (0-indexed) |
| `locomp.cuda_device_count()` | Number of available CUDA GPUs |
| `locomp.CUDATensor` | Tensor class backed by a CUDA device pointer |
| `locomp.tensor(arr, backend="cuda")` | Upload a NumPy array to the CUDA device |
```python
if locomp.cuda_available():
    print(f"{locomp.cuda_device_count()} CUDA GPUs available")
    t = locomp.tensor(np.ones(1024, np.float32), backend="cuda")
    my_kernel[(1024,)](t, out, N=1024)
    print(out.numpy())
```

## Example Index
| # | Name | Description |
|---|---|---|
| 01–05 | Vector ops | add, mul, relu, gelu, sigmoid |
| 06–10 | Reductions | sum, max, softmax, layer norm, rms norm |
| 11–15 | Matrix ops | matmul, batched matmul, flash attention v1/v2/v3 |
| 16–20 | Attention | multi-head attention, causal mask, ALiBi, RoPE |
| 21–25 | Quantization | INT4, INT8, block-wise quantized matmul, dequant |
| 26–30 | Autograd | CPU ag ops, GPU ag ops, backward pass |
| 31–35 | Attention v3 | paged attention, KV cache, group-query attention |
| 36–40 | LLM kernels | SwiGLU, SiLU, Mistral RoPE, parallel embeddings |
| 41–45 | CUDA kernels | CUDA vector add, fp16 matmul, softmax, wmma |
| 46–50 | Benchmarks | Metal vs MLX, latency sweep, memory bandwidth |
| 51–54 | Full LLM | SmolLM2-135M tokenizer, weights, generate, inference |
| 55–57 | BFloat16 + HW | BFloat16 kernels, hardware validation |
| 58 | Benchmark Suite (Metal) | Full benchmark vs MLX on Apple M1 |
| 59 | Benchmark Suite (CUDA) | Full benchmark on NVIDIA A100 |
| 60 | MPS vs locomp | PyTorch MPS vs locomp comparison |
| 61 | Kernel Graph | `locomp.graph()` multi-kernel fusion |
| 62 | SmolLM2 fp16 | SmolLM2-135M in float16 |
| 63 | RISC-V Codegen | RISC-V RVV codegen demo |