API Reference

Complete reference for all locomp APIs and functions.

Kernel Decorator

@locomp.kernel
def my_kernel(A: locomp.Tensor, B: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0)
    locomp.store(B + i, locomp.load(A + i) * 2.0)

# Dispatch: grid=(N,), threadgroup size default 1
my_kernel[(N,)](a, b, N=N)

# With explicit threadgroup size
my_kernel[(N,), (128,)](a, b, N=N)

  • All pointer arguments must be typed locomp.Tensor.
  • Compile-time constants must be typed locomp.constexpr.
  • The compiled pipeline is cached on the first call, once per constexpr configuration.
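
Grids may have up to three dimensions, matching the three axes of locomp.program_id. A minimal sketch of a 2D dispatch (the kernel, its row-major indexing, and the M/N variables are illustrative, not part of the API):

@locomp.kernel
def scale2d(A: locomp.Tensor, B: locomp.Tensor, N: locomp.constexpr):
    row = locomp.program_id(0)          # grid axis 0
    col = locomp.program_id(1)          # grid axis 1
    idx = row * N + col                 # row-major offset into the flat buffer
    locomp.store(B + idx, locomp.load(A + idx) * 2.0)

# 2D grid: one program per element of an M x N matrix
scale2d[(M, N)](a, b, N=N)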

Types

Type              Description
locomp.Tensor     GPU buffer pointer — kernel parameter type for buffer arguments
locomp.constexpr  Compile-time constant — inlined as a literal in the generated shader
locomp.Float16    16-bit float (half in Metal, __half in CUDA, _Float16 in RISC-V)
locomp.BFloat16   Brain float16 (bfloat16 in Metal, __nv_bfloat16 in CUDA)
locomp.Int8       Signed 8-bit integer
locomp.UInt8      Unsigned 8-bit integer
locomp.Int32      32-bit signed integer
locomp.Bool       Boolean type
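
A minimal sketch of how these types combine with locomp.cast — an fp32 → fp16 conversion kernel (the kernel name and the assumption that dst was allocated as a Float16 buffer are illustrative):

@locomp.kernel
def to_half(src: locomp.Tensor, dst: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0)
    v = locomp.load(src + i)                                # loaded at the source dtype
    locomp.store(dst + i, locomp.cast(v, locomp.Float16))   # narrow to half on store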

Tensor API

Call                                   Description
locomp.tensor(numpy_array)             Create GPU tensor from a NumPy array (backend="auto")
locomp.tensor(array, backend="cuda")   Create tensor on specific backend: "auto" | "metal" | "cuda" | "riscv"
locomp.empty(shape)                    Allocate uninitialized GPU tensor
locomp.zeros(shape)                    Zero-filled GPU tensor
locomp.ones(shape)                     Ones-filled GPU tensor
t.numpy()                              Read back to CPU — auto-syncs with GPU
t.reshape(new_shape)                   Zero-copy reshape
t.transpose(dim0, dim1)                Transpose two dimensions
t.permute(*dims)                       Arbitrary dimension reorder
t.contiguous()                         Materialize a contiguous copy
t[slices]                              NumPy-style slicing
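
A typical round trip through the tensor API (assuming numpy is imported as np; shapes are arbitrary):

a = locomp.tensor(np.arange(12, dtype=np.float32))  # upload, backend="auto"
m = a.reshape((3, 4))                               # zero-copy view
t = m.transpose(0, 1).contiguous()                  # materialize the 4x3 layout
print(t[1:3].numpy())                               # slice, then read back to CPU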

Built-in Functions

Function                             Description
locomp.program_id(axis)              Threadgroup index along axis (0/1/2)
locomp.local_id(axis)                Thread index within threadgroup
locomp.load(ptr)                     Load a value from a GPU pointer
locomp.store(ptr, val)               Store a value to a GPU pointer
locomp.atomic_add(ptr, val)          Atomic add — returns old value
locomp.atomic_max(ptr, val)          Atomic max
locomp.cast(val, dtype)              Type cast — e.g. Float32 → Float16
locomp.sqrt(x)                       Square root
locomp.rsqrt(x)                      Reciprocal square root (fast)
locomp.exp(x)                        Exponential
locomp.log(x)                        Natural log
locomp.abs(x)                        Absolute value
locomp.max(a, b)                     Element-wise max
locomp.min(a, b)                     Element-wise min
locomp.simd_sum(val)                 Warp/SIMD-group reduction sum
locomp.simd_max(val)                 Warp/SIMD-group reduction max
locomp.simd_lane_id()                Lane index within SIMD group [0..31]
locomp.simd_group_id()               SIMD group index within threadgroup
locomp.shared_memory(size)           Allocate threadgroup shared memory
locomp.shared_load(smem, idx)        Load from shared memory slot
locomp.shared_store(smem, idx, val)  Store to shared memory slot
locomp.barrier()                     Threadgroup barrier — sync all threads
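
These primitives compose into the standard two-level reduction: reduce within each SIMD group, stage the partial sums in shared memory, synchronize, then combine. A sketch of a block-sum kernel, assuming a threadgroup size of 128, 32-wide SIMD groups, and support for scalar if statements in kernel code (none of which this reference guarantees):

@locomp.kernel
def block_sum(A: locomp.Tensor, out: locomp.Tensor, N: locomp.constexpr):
    i = locomp.program_id(0) * 128 + locomp.local_id(0)
    part = locomp.simd_sum(locomp.load(A + i))     # reduce within the SIMD group
    smem = locomp.shared_memory(4)                 # one slot per SIMD group (128 / 32)
    if locomp.simd_lane_id() == 0:
        locomp.shared_store(smem, locomp.simd_group_id(), part)
    locomp.barrier()                               # all partial sums now visible
    if locomp.local_id(0) == 0:
        total = (locomp.shared_load(smem, 0) + locomp.shared_load(smem, 1) +
                 locomp.shared_load(smem, 2) + locomp.shared_load(smem, 3))
        locomp.atomic_add(out, total)              # accumulate across threadgroups

block_sum[(N // 128,), (128,)](a, result, N=N)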

Auto-Tune API

from locomp import autotune, Config

@autotune(
    configs=[Config(BLOCK_M=16), Config(BLOCK_M=32), Config(BLOCK_M=64)],
    key=["M", "N"],
)
@locomp.kernel
def my_kernel(..., BLOCK_M: locomp.constexpr):
    ...

  • configs — a list of Config objects, each specifying constexpr values to try
  • key — a list of argument names whose values determine the problem shape
  • Config(**kwargs) — specify any constexpr values as keyword arguments
  • Results are cached to ~/.cache/locomp/autotune.json

Kernel Graph

Chain multiple kernels into one GPU command buffer. Eliminates per-kernel CPU↔GPU sync overhead.

# Context manager — runs on exit
with locomp.graph() as g:
    g.add(rms_norm, (rows,), (128,), x, w, h, N=dim, eps=1e-5)
    g.add(matmul, (M * N,), h, w2, out, M=M, N=N, K=dim)

# Manual build + run (re-runnable)
g = locomp.graph()
g.add(kernel_a, (N,), x, tmp, N=N)
g.add(kernel_b, (N,), tmp, out, N=N)
g.run()   # first run
g.run()   # re-run same sequence on same tensors

Profiler

with locomp.profile() as p:
    rms_norm[(rows,), (128,)](x, w, out, N=dim, eps=1e-5)
    matmul[(M * N,)](a, b, c, M=M, N=N, K=K)

print(p.report())
# Kernel          Grid     ms
# rms_norm        (128,)   0.245
# matmul          (4096,)  1.103

Autograd

CPU Autograd — locomp.ag

Tape-based reverse-mode autodiff on NumPy. 15 ops.

a = locomp.ag.tensor(np.random.randn(N), requires_grad=True)
b = locomp.ag.tensor(np.random.randn(N), requires_grad=True)
loss = locomp.ag.sum(locomp.ag.mul(a, b))
locomp.ag.backward(loss)
print(a.grad)   # dL/da = b

GPU Autograd — locomp.gpu_ag

Forward and backward passes run as real locomp kernels on Metal or CUDA. 16 ops (listed below).

ga = locomp.gpu_ag
x  = ga.tensor(np.random.randn(N), requires_grad=True)  # on GPU
y  = ga.relu(ga.exp(x))
ga.backward(ga.sum(y))
print(x.grad.numpy())

# Supported ops: add sub mul div exp log relu pow
#                sigmoid tanh sum mean matvec matmul softmax cross_entropy

CUDA API

Call                                Description
locomp.cuda_available()             Returns True if CUDA is available
locomp.cuda_set_device(idx)         Set active CUDA device (0-indexed)
locomp.cuda_device_count()          Number of available CUDA GPUs
locomp.CUDATensor                   Tensor class backed by a CUDA device pointer
locomp.tensor(arr, backend="cuda")  Upload NumPy array to CUDA device

if locomp.cuda_available():
    print(f"{locomp.cuda_device_count()} CUDA GPUs available")
    t   = locomp.tensor(np.ones(1024, np.float32), backend="cuda")
    out = locomp.tensor(np.zeros(1024, np.float32), backend="cuda")
    my_kernel[(1024,)](t, out, N=1024)
    print(out.numpy())

Example Index

#      Name                     Description
01–05  Vector ops               add, mul, relu, gelu, sigmoid
06–10  Reductions               sum, max, softmax, layer norm, rms norm
11–15  Matrix ops               matmul, batched matmul, flash attention v1/v2/v3
16–20  Attention                multi-head attention, causal mask, ALiBi, RoPE
21–25  Quantization             INT4, INT8, block-wise quantized matmul, dequant
26–30  Autograd                 CPU ag ops, GPU ag ops, backward pass
31–35  Attention v3             paged attention, KV cache, group-query attention
36–40  LLM kernels              SwiGLU, SiLU, Mistral RoPE, parallel embeddings
41–45  CUDA kernels             CUDA vector add, fp16 matmul, softmax, wmma
46–50  Benchmarks               Metal vs MLX, latency sweep, memory bandwidth
51–54  Full LLM                 SmolLM2-135M tokenizer, weights, generate, inference
55–57  BFloat16 + HW            BFloat16 kernels, hardware validation
58     Benchmark Suite (Metal)  Full benchmark vs MLX on Apple M1
59     Benchmark Suite (CUDA)   Full benchmark on NVIDIA A100
60     MPS vs locomp            PyTorch MPS vs locomp comparison
61     Kernel Graph             locomp.graph() multi-kernel fusion
62     SmolLM2 fp16             SmolLM2-135M in float16
63     RISC-V Codegen           RISC-V RVV codegen demo