A high-performance memory allocator inspired by mimalloc, written in Zig.
zmemalloc is a thread-local memory allocator using segment-based memory management, designed for high throughput and low fragmentation in multi-threaded applications.
Key optimizations:
- 64-byte Page struct (fits in single cache line)
- Three-level free list sharding (free, local_free, xthread_free)
- Bump pointer allocation for fresh pages
- Lock-free cross-thread free
- Direct pointer array for O(1) small allocations
- Automatic page recycling via full->non-full transitions
+-----------------------------------------------------------------------------+
| MEMORY HIERARCHY |
+-----------------------------------------------------------------------------+
| |
| Application |
| | |
| v |
| +----------+ +----------+ +----------+ |
| | Thread 1 | | Thread 2 | | Thread N | |
| +----+-----+ +----+-----+ +----+-----+ |
| | | | |
| v v v |
| +----------+ +----------+ +----------+ |
| | Heap | | Heap | | Heap | Thread-Local Heaps |
| | (bins) | | (bins) | | (bins) | |
| +----+-----+ +----+-----+ +----+-----+ |
| | | | |
| v v v |
| +---------------------------------------------+ |
| | SegmentsTLD | Per-thread segment |
| | small_free | medium_free | large_free | cache |
| +---------------------------------------------+ |
| | |
| +---------------+---------------+ |
| v v v |
| +----------+ +----------+ +----------+ |
| | Segment | | Segment | | Segment | 32MB aligned |
| | (small) | | (medium) | | (large) | |
| | 64KB pg | | 512KB pg | | full pg | |
| +----+-----+ +----+-----+ +----+-----+ |
| | | | |
| v v v |
| +----------+ +----------+ +----------+ |
| | Pages | | Pages | | Page | Fixed-size blocks |
| | blocks | | blocks | | block | |
| +----------+ +----------+ +----------+ |
| |
+-----------------------------------------------------------------------------+
Page = struct {
// Hot path fields (32 bytes):
free_head: ?*IntrusiveLifoLink, // 8: head of free list (typed pointer, no cast needed)
page_start: ?[*]u8, // 8: start of data area
block_size: usize, // 8: size of each block
capacity: u16, // 2: max blocks
reserved: u16, // 2: bump pointer position
used: u16, // 2: blocks in use
flags_and_idx: u16, // 2: packed flags + segment_idx
// Queue fields (32 bytes):
local_free_head: ?*IntrusiveLifoLink, // 8: same-thread freed blocks (typed)
xthread_free: Atomic(?*Block), // 8: cross-thread freed (atomic)
next: ?*Page, // 8: bin queue link
prev: ?*Page, // 8: bin queue link
}
// Packed flags (6 bits) + segment_idx (10 bits) in u16:
PackedInfo = packed struct(u16) {
in_full: bool, // page is full, not in bin queue
in_bin: bool, // page is in heap's bin queue
is_huge: bool, // huge allocation (> 16MB)
segment_in_use: bool, // page claimed by segment
is_commited: bool, // memory is committed
is_zero_init: bool, // memory is zeroed
segment_idx: u10, // index in segment.pages[]
}

SEGMENT LAYOUT (32MB)
+-----------------------------------------------------------------+
| Segment Header (~40KB) |
| +------------------------------------------------------------+ |
| | memid, segment_size, page_kind, capacity, used | |
| | thread_id (cache-line aligned for cross-thread check) | |
| | heap (pointer to owning heap) | |
| | all_next, all_prev (for xthread_free collection) | |
| | pages[512] - array of 64-byte Page structs | |
| +------------------------------------------------------------+ |
+----------------------------------------------------------------+
| Page Data Area |
| +--------+ +--------+ +--------+ +--------+ |
| | Page 0 | | Page 1 | | Page 2 | | ... | (64KB or 512KB) |
| | data | | data | | data | | | |
| +--------+ +--------+ +--------+ +--------+ |
+----------------------------------------------------------------+
Pointer -> Segment: segment = ptr & ~(32MB - 1)
Pointer -> Page: idx = (ptr - segment) >> page_shift
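A minimal sketch of this mapping, assuming 64-bit pointers and 32MB-aligned segments (the real Segment.fromPtr / pageFromPtrWithSegment may differ in details):

const SEGMENT_SIZE: usize = 32 * 1024 * 1024; // 32 MiB, SEGMENT_SHIFT = 25

fn segmentFromPtr(ptr: *anyopaque) *Segment {
    // Segments are 32MB-aligned, so masking off the low 25 bits yields the header.
    const addr = @intFromPtr(ptr) & ~(SEGMENT_SIZE - 1);
    return @ptrFromInt(addr);
}

fn pageIndexFromPtr(segment: *Segment, ptr: *anyopaque, page_shift: u6) usize {
    // Pages within a segment have a uniform size, so a shift gives the index.
    const offset = @intFromPtr(ptr) - @intFromPtr(segment);
    return offset >> page_shift;
}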
Allocated: Free:
+----------------------+ +----------------------+
| | | next ---------------+---> next free block
| User Data | | |
| | | (unused space) |
+----------------------+ +----------------------+
When free: first 8 bytes = pointer to next free block (IntrusiveLifoLink)
When allocated: user owns all memory
IntrusiveLifoLink: The free list uses typed pointers (?*IntrusiveLifoLink) instead of ?*anyopaque. This eliminates pointer casts in the hot allocation path:
// In queue.zig
pub const IntrusiveLifoLink = struct {
next: ?*IntrusiveLifoLink = null,
};
// Hot path - no @ptrCast needed
inline fn popFromFreeList(self: *Page) ?[*]u8 {
if (self.free_head) |link| {
self.free_head = link.next;
if (link.next) |next| {
@prefetch(next, .{ .locality = 3 });
}
return @ptrCast(link); // Only cast on return
}
return null;
}

Heap = struct {
pages_free_direct: [129]?*Page, // Direct pointers for wsize 0-128
pages: [74]Page.Queue, // Bin queues for size classes
tld: ?*TLD, // Thread-local data
thread_id: usize,
}

malloc(size)
|
v
+-------------------+
| size <= 1KB? |--NO--> mallocSlowPath()
+-------------------+
|YES
v
+-------------------+
| mallocSmallFast() |
| pages_free_direct |
| [wsize] lookup |
+-------------------+
|
+-- page found --> popFreeBlock() --> SUCCESS
|
+-- page null --> mallocSlowPath()
|
+-- page exhausted --> mark full, remove from bin --> mallocSlowPath()
pub fn malloc(size: usize) ?[*]u8 {
if (size == 0) return null;
// Single threadlocal read — also initializes TLD on first call
const heap = getHeapOrInit();
// Fast path: direct page lookup for small sizes (< 1KB)
if (size <= MAX_FAST_SIZE) {
@branchHint(.likely);
if (mallocSmallFast(heap, size)) |ptr| {
return ptr;
}
}
// Slow path: bin queue lookup, page allocation, etc.
return mallocSlowPath(heap, size);
}

This is the hottest path, optimized for minimal instructions:
inline fn mallocSmallFast(heap: *Heap, size: usize) ?[*]u8 {
// 1. Calculate word size (8-byte granularity)
const wsize = (size + INTPTR_SIZE - 1) / INTPTR_SIZE;
// 2. Direct O(1) lookup - no bin calculation needed
const page = heap.pages_free_direct[wsize] orelse {
@branchHint(.unlikely);
return null;
};
// 3. Try to pop from free list or bump allocate
if (page.popFreeBlock()) |block| {
@branchHint(.likely);
return @ptrCast(block); // SUCCESS - fast path complete
}
// 4. Page exhausted - mark as full, remove from bin queue
const bin = binFromSize(size);
page.set_in_full(true);
page.pageRemoveFromBin(heap, bin);
heap.pages_free_direct[wsize] = null;
return null; // Fall through to slow path
}

Key insight: The pages_free_direct array provides O(1) lookup without bin calculation. Each entry points directly to a page that can satisfy allocations of that word size.
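For example (hypothetical request sizes, using the wsize rule shown above):

// malloc(40): wsize = (40 + 7) / 8 = 5 -> heap.pages_free_direct[5]
//   one array load, then a pop from that page's free list - no bin math at all.
// malloc(41): wsize = 6 -> a different entry, possibly a different page.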
Called when fast path fails (no page or page exhausted). Marked noinline to reduce register pressure in malloc hot path:
noinline fn mallocSlowPath(heap: *Heap, size: usize) ?[*]u8 {
// 1. Try medium path (bin queue lookup)
if (size <= MEDIUM_OBJ_SIZE_MAX) {
if (mallocMedium(heap, size)) |ptr| {
return ptr;
}
}
// 2. Fall back to generic allocation (may allocate new page)
return mallocGeneric(heap, size, false);
}

inline fn mallocMedium(heap: *Heap, size: usize) ?[*]u8 {
const bin = binFromSize(size);
const pq = &heap.pages[bin];
// Try current bin queue page
if (pq.tail) |pg| {
@branchHint(.likely);
if (tryAllocFromPage(heap, pg, bin)) |ptr| {
return ptr;
}
}
// Bin empty - try to reclaim full pages with pending xthread_free
if (pending_xthread_free.load(.monotonic) > 0) {
@branchHint(.unlikely);
if (heap.tld) |t| {
_ = reclaimFullPages(t, heap);
// Retry after reclaim
if (pq.tail) |pg| {
return tryAllocFromPage(heap, pg, bin);
}
}
}
return null;
}

Core allocation logic with cross-thread free collection:
inline fn tryAllocFromPage(heap: *Heap, pg: *Page, bin: usize) ?[*]u8 {
// 1. Try direct pop first
if (pg.popFreeBlock()) |block| {
@branchHint(.likely);
setDirectPointerForBlockSize(heap, pg);
if (!pg.hasFreeQuick()) {
@branchHint(.unlikely);
handlePageExhausted(heap, pg, bin);
}
return @ptrCast(block);
}
// 2. Page appears empty - collect xthread_free and retry
if (pg.xthread_free.load(.monotonic) != null) {
_ = pg.collectXthreadFree();
if (pg.popFreeBlock()) |block| {
setDirectPointerForBlockSize(heap, pg);
if (!pg.hasFree()) {
handlePageExhausted(heap, pg, bin);
}
return @ptrCast(block);
}
}
// 3. Truly exhausted
handlePageExhausted(heap, pg, bin);
return null;
}
inline fn handlePageExhausted(heap: *Heap, pg: *Page, bin: usize) void {
// Last chance: collect xthread_free before marking full
if (pg.xthread_free.load(.monotonic) != null) {
_ = pg.collectXthreadFree();
}
if (!pg.hasFree()) {
pg.pageRemoveFromBin(heap, bin);
pg.set_in_full(true);
clearDirectPointersForPage(heap, pg);
}
}

Allocates new pages when needed:
inline fn mallocGeneric(heap: *Heap, size: usize, zero: bool) ?[*]u8 {
const t = heap.tld orelse return null;
// 1. Collect pending cross-thread frees if threshold exceeded
const pending = pending_xthread_free.load(.monotonic);
if (pending > XTHREAD_COLLECT_THRESHOLD) {
@branchHint(.unlikely);
_ = t.segments.collectAllXthreadFree();
_ = reclaimFullPages(t, heap);
}
// 2. Execute pending arena purges (deferred decommit)
globalArenas().tryPurge(false);
// 3. Calculate bin and block_size
const bin = binFromSize(size);
const block_size = if (size > LARGE_OBJ_SIZE_MAX)
size // Huge: use exact size
else
blockSizeForBin(bin);
// 4. Try bin queue first
if (bin < heap.pages.len) {
if (heap.pages[bin].tail) |pg| {
if (pg.block_size >= block_size) {
if (pg.popFreeBlock()) |block| {
@branchHint(.likely);
if (!pg.hasFree()) {
pg.pageRemoveFromBin(heap, bin);
pg.set_in_full(true);
clearDirectPointersForPage(heap, pg);
}
if (zero) @memset(block[0..block_size], 0);
return block;
}
}
}
}
// 5. Allocate new page from segment
var page = t.segments.allocPage(block_size);
if (page == null) {
// Try to reclaim abandoned segments from other threads
if (global_subproc.abandoned_count.load(.monotonic) > 0) {
_ = reclaimAbandoned(ABANDON_RECLAIM_LIMIT);
page = t.segments.allocPage(block_size);
}
}
const pg = page orelse return null;
// 6. Initialize page if needed (page_start is null for fresh pages)
if (pg.page_start == null) {
@branchHint(.cold);
const s = Segment.fromPtr(pg);
var page_size: usize = undefined;
const page_start = s.pageStart(pg, &page_size);
pg.init(block_size, page_start, page_size);
}
// 7. Allocate block
const block = pg.popFreeBlock() orelse return null;
if (zero) @memset(block[0..block_size], 0);
// 8. Add to bin queue if not already there
if (bin < heap.pages.len and !pg.is_in_bin()) {
@branchHint(.likely);
pg.set_in_full(false);
pg.pagePushBin(heap, bin);
}
// 9. Update direct pointers
setDirectPointerForBlockSize(heap, pg);
return block;
}

Critical for performance: When a page is added to a bin, we set direct pointers for ALL word sizes that map to that bin:
fn setDirectPointerForBlockSize(heap: *Heap, page: *Page) void {
const block_wsize = (page.block_size + 7) / 8;
if (block_wsize > SMALL_WSIZE_MAX) return;
// For small bins (1-8), wsize maps 1:1 to bin
heap.pages_free_direct[block_wsize] = page;
const bin = binFromWsize(block_wsize);
if (bin <= 8) return; // Already handled above
// For larger bins, multiple wsizes map to same bin
// Use direct calculation instead of loop with binFromWsize
// This avoids expensive lzcnt/shift operations per iteration
const min_wsize = minWsizeForBin(bin);
const max_wsize = @min(block_wsize, SMALL_WSIZE_MAX);
// Set pointers for all wsizes in this bin's range
var wsize = min_wsize;
while (wsize <= max_wsize) : (wsize += 1) {
heap.pages_free_direct[wsize] = page;
}
}

Optimization: The minWsizeForBin(bin) function computes the minimum wsize for a bin directly, using an inverse of the logarithmic bin formula, which avoids expensive binFromWsize calls in a loop. This reduced slow-path overhead by ~40%.
Why this matters: Without setting all direct pointers, allocations of different sizes in the same bin would miss and fall through to slow path. This was a major performance bug - 83% of allocations were hitting slow path unnecessarily!
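A concrete illustration, using the bin formulas from the size-class section below (hypothetical request sizes):

// Under those formulas, bin 9 spans wsizes 9-10 (sizes 65..80 bytes) and its
// pages hold 80-byte blocks (blockSizeForBin(9) = 10 * 8). If only
// pages_free_direct[10] were set for such a page:
//   malloc(76) -> wsize 10 -> direct hit, fast path
//   malloc(66) -> wsize 9  -> null entry -> falls through to mallocSlowPath
// even though the same page could serve it. Setting the bin's whole
// [min_wsize, max_wsize] range (entries 9 and 10) keeps both on the fast path.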
When a page becomes full, its direct pointers must be cleared:
fn clearDirectPointersForPage(heap: *Heap, page: *Page) void {
const block_wsize = (page.block_size + 7) / 8;
if (block_wsize > SMALL_WSIZE_MAX) return;
// Clear for block_wsize
if (heap.pages_free_direct[block_wsize] == page) {
heap.pages_free_direct[block_wsize] = null;
}
const bin = binFromWsize(block_wsize);
if (bin <= 8) return;
// Use direct calculation (same optimization as setDirectPointerForBlockSize)
const min_wsize = minWsizeForBin(bin);
const max_wsize = @min(block_wsize, SMALL_WSIZE_MAX);
var wsize = min_wsize;
while (wsize <= max_wsize) : (wsize += 1) {
if (heap.pages_free_direct[wsize] == page) {
heap.pages_free_direct[wsize] = null;
}
}
}

fn popFreeBlock(self: *Page) ?[*]u8 {
// Hot path 1: pop from free list (recycled blocks)
if (self.popFromFreeList()) |block| {
self.used += 1;
return block;
}
// Hot path 2: bump allocation (contiguous, cache-friendly)
if (self.reserved < self.capacity) {
const start = self.page_start orelse return self.popFreeBlockSlow();
const offset = self.reserved * self.block_size;
@prefetch(start + offset, .{ .rw = .write }); // Prefetch for write
self.reserved += 1;
self.used += 1;
return start + offset;
}
// Cold path: collect from local_free/xthread_free
return self.popFreeBlockSlow();
}
/// Pop from intrusive linked list - O(1)
inline fn popFromFreeList(self: *Page) ?[*]u8 {
if (self.free_head) |link| {
self.free_head = link.next;
// Prefetch next block for subsequent allocation
if (link.next) |next| {
@prefetch(next, .{ .cache = .data, .locality = 3, .rw = .read });
}
return @ptrCast(link); // IntrusiveLifoLink* -> [*]u8
}
return null;
}
fn popFreeBlockSlow(self: *Page) ?[*]u8 {
// Try local_free (same-thread freed blocks) - merge lists
if (!self.localFreeEmpty()) {
self.free_head = self.local_free_head;
self.local_free_head = null;
return self.popFromFreeList();
}
// Try xthread_free (cross-thread freed blocks) - atomic swap + splice
if (self.collectXthreadFree() > 0) {
return self.popFromFreeList();
}
// Final fallback: bump allocate
if (self.reserved < self.capacity) {
const start = self.page_start orelse return null;
const offset = self.reserved * self.block_size;
self.reserved += 1;
self.used += 1;
return start + offset;
}
return null;
}

free(ptr)
|
v
+-------------------+
| Find segment/page | segment = ptr & ~(32MB-1)
| from pointer | page = segment.pageFromPtr(ptr)
+-------------------+
|
v
+-------------------+
| page_thread == |--YES--> pushFreeBlock() --> check if was full
| fastThreadId()? |
+-------------------+
|NO
v
+-------------------+
| page_thread == 0? |--YES--> pushFreeBlock() (abandoned page)
| (abandoned) |
+-------------------+
|NO
v
+-------------------+
| xthread_free | Cross-thread: atomic CAS
| (atomic) |
+-------------------+
After pushFreeBlock():
+-------------------+
| Page was full? |--NO--> DONE
+-------------------+
|YES
v
+-------------------+
| freeColdPath() | Re-add page to bin queue
+-------------------+
inline fn freeImpl(ptr: ?*anyopaque) void {
const p = ptr orelse return;
const segment = Segment.fromPtr(p);
const page = Segment.pageFromPtrWithSegment(segment, p);
const block: *Block = @ptrCast(@alignCast(p));
const page_thread = segment.thread_id.load(.monotonic);
if (page_thread == fastThreadId()) {
@branchHint(.likely);
// Hot path: same thread - push to local_free
page.pushFreeBlock(block);
// Handle full -> non-full transition
if (page.is_in_full()) {
@branchHint(.unlikely);
freeColdPath(page); // Re-add to bin queue
}
} else if (page_thread == 0) {
// Abandoned page - push directly to free list
page.pushFreeBlock(block);
} else {
@branchHint(.unlikely);
// Cross-thread free (see below)
page.xthreadFree(block);
}
}
/// Push block to local_free list and decrement used - O(1)
pub inline fn pushFreeBlock(self: *Page, block: *Block) void {
self.pushToLocalFreeList(@ptrCast(block));
self.used -= 1;
}
/// Push to local_free list - O(1), no atomics
inline fn pushToLocalFreeList(self: *Page, block: [*]u8) void {
const link: *IntrusiveLifoLink = @ptrCast(@alignCast(block));
link.next = self.local_free_head;
self.local_free_head = link;
}

Three cases:
- Same thread (page_thread == fastThreadId()): Hot path, push to local_free
- Abandoned (page_thread == 0): Segment has no owner, push directly
- Cross-thread (else): Use atomic xthread_free list
fn xthreadFree(self: *Page, block: *Block) void {
while (true) {
const old_head = self.xthread_free.load(.acquire);
// Store old_head's link pointer (convert Block* to Link*)
block.link.next = if (old_head) |h| &h.link else null;
if (self.xthread_free.cmpxchgWeak(old_head, block, .release, .acquire) == null) {
// Success - increment global pending counter
_ = pending_xthread_free.fetchAdd(1, .monotonic);
break;
}
// CAS failed, retry
}
}

Note: The xthread_free list stores Block* (not IntrusiveLifoLink*) because we need the full Block type for the atomic CAS operation. The link field within Block points to the next block's link.
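A minimal sketch of what Block might look like under that scheme (the real definition may carry more than the link):

const Block = struct {
    // The link is the first field, so a freed block's first 8 bytes double as
    // the next-pointer, matching the allocated/free block diagram above.
    link: IntrusiveLifoLink = .{},
};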
Critical for page reuse: When a full page receives a free, it must be re-added to the bin queue:
fn freeColdPath(page: *Page) void {
const bin = binFromSize(page.block_size);
// Page is completely empty - return to segment
if (page.used == 0) {
page.set_in_full(false);
segments.freePage(page);
return;
}
// Page has space now - re-add to bin queue for allocation
page.set_in_full(false);
if (!page.is_in_bin()) {
page.pagePushBin(heap, bin);
setDirectPointerForBlockSize(heap, page); // Set ALL direct pointers!
}
}

xthread_free is collected during allocation when the pending-free threshold is exceeded:
fn collectXthreadFree(self: *Page) usize {
// Atomic swap - get entire list at once
const head = self.xthread_free.swap(null, .acquire) orelse return 0;
// Find tail and count in single pass
var tail = head;
var count: usize = 1;
while (tail.link.next) |next| {
tail = next;
count += 1;
}
// Splice entire list to free in O(1)
tail.link.next = self.free_head;
self.free_head = head;
self.used -= count;
pending_xthread_free.fetchSub(count, .monotonic);
return count;
}

                 +---------------------------+
| |
v |
+-------+ +---------+ +----------+ |
| FRESH | ---> | IN_BIN | ---> | FULL | ---+
+-------+ +---------+ +----------+
| | |
| | | free() called
| | v
| | +----------+
| +--------> | REUSED | (back to IN_BIN)
| +----------+
| |
| | used == 0
v v
+-------+ +---------+
| FREED | <------------------ | EMPTY |
+-------+ +---------+
State transitions:
- FRESH -> IN_BIN: Page allocated, added to bin queue, direct pointers set
- IN_BIN -> FULL: All blocks allocated, removed from bin queue, marked in_full
- FULL -> REUSED: Block freed via freeColdPath, re-added to bin queue
- IN_BIN/REUSED -> EMPTY: All blocks freed (used == 0)
- EMPTY -> FREED: Page returned to segment
Memory flows back through the hierarchy in stages:
Block → Page → Segment → Arena/OS
free(ptr)
|
v
+------------------+
| Push to | Block added to page's free list
| page.local_free | (or xthread_free if cross-thread)
+------------------+
|
| page.used == 0
v
+------------------+
| freePage() | Page cleared, returned to segment
+------------------+
|
| segment.used == 0
v
+------------------+
| freeSegment() | Segment returned to arena or OS
+------------------+
|
v
+------------------+ +------------------+
| Arena free() | OR | OS munmap() |
| (bitmap update) | | (physical free) |
+------------------+ +------------------+
|
| tryPurge()
v
+------------------+
| MADV_DONTNEED | Physical pages released to OS
+------------------+
Called when a page becomes completely empty (used == 0):
fn freePage(self: *SegmentsTLD, pg: *Page, force: bool) void {
const segment = Segment.fromPtr(pg);
const was_full = segment.used == segment.capacity; // segment had no free page before this one
segment.pageClear(pg); // Reset all page fields
segment.used -= 1;
if (segment.used == 0) {
if (force) {
// Forced free - return segment immediately
self.freeSegment(segment, true);
} else if (self.empty_segment_count >= EMPTY_SEGMENT_CACHE_MAX * 3) {
// Too many empty segments cached
self.freeSegment(segment, true);
} else {
// Cache empty segment for reuse
self.empty_segment_count += 1;
self.insertInFreeQueue(segment);
}
} else if (was_full) {
// Segment was full, now has free space
self.insertInFreeQueue(segment);
}
}

pageClear resets all page fields:
fn pageClear(self: *Segment, pg: *Page) void {
pg.flags_and_idx = @bitCast(PackedInfo{ .segment_idx = pg.segment_idx() });
pg.xthread_free.store(null, .release);
pg.free_head = null;
pg.local_free_head = null;
pg.block_size = 0;
pg.page_start = null;
pg.capacity = 0;
pg.reserved = 0;
pg.used = 0;
pg.next = null;
pg.prev = null;
}

Called when a segment becomes completely empty:
fn freeSegment(self: *SegmentsTLD, segment: *Segment, force: bool) void {
// Remove from tracking lists
self.removeFromFreeQueue(segment);
self.all_segments.remove(segment);
// Purge uncommitted pages (MADV_DONTNEED)
if (force and !segment.memid.flags.is_pinned) {
for (segment.pages[0..segment.capacity]) |*pg| {
if (!pg.is_segment_in_use() and pg.is_commited()) {
segment.pagePurge(pg);
}
}
}
segment.thread_id.store(0, .release);
// Return memory based on source
if (segment.memid.memkind == .MEM_ARENA) {
// Return to arena (bitmap update, no syscall)
const arena = getArena(segment.memid.arena_id);
arena.free(block_index, block_count, initially_committed);
} else {
// Direct OS free (munmap syscall)
os_alloc.free(segment[0..segment_size]);
}
}

Releases physical memory while keeping virtual address mapping:
fn pagePurge(self: *Segment, pg: *Page) void {
if (!self.allow_purge) return;
if (!pg.is_commited()) return;
var psize: usize = undefined;
const start = self.rawPageStart(pg, &psize);
// Release physical pages now; the virtual mapping stays intact
std.posix.madvise(start, psize, MADV.DONTNEED) catch {};
pg.set_commited(false);
}

| Operation | Syscall | Effect |
|---|---|---|
| commit | mprotect(RW) | Make memory accessible |
| decommit | madvise(MADV_DONTNEED) | Release physical pages immediately |
| purge/reset | madvise(MADV_DONTNEED) | Mark pages as reusable (lazy free) |
| protect | mprotect(NONE) | Make memory inaccessible (guard pages) |
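A rough sketch of how commit/decommit map onto Zig's std.posix wrappers (error handling simplified; the allocator's actual OS layer is more involved):

const std = @import("std");

fn commit(mem: []align(std.mem.page_size) u8) !void {
    // Make the range readable/writable; physical pages fault in on first touch.
    try std.posix.mprotect(mem, std.posix.PROT.READ | std.posix.PROT.WRITE);
}

fn decommit(mem: []align(std.mem.page_size) u8) !void {
    // Drop the physical pages immediately; the virtual mapping stays reserved.
    try std.posix.madvise(mem.ptr, mem.len, std.posix.MADV.DONTNEED);
}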
var reset_advice: Atomic(u32) = MADV.FREE;
fn prim_reset(mem: []u8) !void {
const advice = reset_advice.load(.monotonic);
const result = madvise(mem, advice);
if (result == EINVAL and advice == MADV.FREE) {
// MADV_FREE not supported, fall back to DONTNEED
reset_advice.store(MADV.DONTNEED, .release);
return madvise(mem, MADV.DONTNEED);
}
}

Arenas use deferred purge for efficiency:
Arena = struct {
blocks_inuse: DynamicBitmap, // Which blocks are allocated
blocks_dirty: DynamicBitmap, // Which blocks have been used
blocks_purge: DynamicBitmap, // Which blocks need purging
purge_expire: Atomic(i64), // When to execute purge
}

Segment freed to arena
|
v
+------------------+
| arena.free() | Mark blocks in blocks_purge bitmap
| Set purge_expire | Schedule purge for later (delay = 10ms)
+------------------+
|
| Time passes...
v
+------------------+
| tryPurge() | Called during allocation
| if now > expire |
+------------------+
|
v
+------------------+
| For each bit in | Scan blocks_purge bitmap
| blocks_purge: |
| MADV_DONTNEED | Decommit physical memory
| Clear bit |
+------------------+
Called directly during mallocGeneric. Each Arena tracks its own purge_expire timestamp:
const PURGE_DELAY_MS: i64 = 10;
// In mallocGeneric:
globalArenas().tryPurge(false);
// Implementation:
pub fn tryPurge(self: *Arenas, force: bool) void {
const now = std.time.milliTimestamp();
const expire = self.purge_expire.load(.acquire);
// Time-based throttling: skip if not expired
if (!force and (expire == 0 or now < expire)) return;
// Iterate all arenas and purge pending blocks
for (0..self.getCount()) |i| {
if (self.arenas[i].load(.acquire)) |arena| {
arena.tryPurge(force);
}
}
}

Purge scheduling: When segments are freed back to arena, schedulePurge() sets purge_expire = now + PURGE_DELAY_MS. This delays physical memory release to allow quick reuse.
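A minimal sketch of schedulePurge under that description (everything except purge_expire and PURGE_DELAY_MS is an assumption):

fn schedulePurge(self: *Arena) void {
    // Arm (or re-arm) the deferred purge; tryPurge() acts once this expires.
    const now = std.time.milliTimestamp();
    self.purge_expire.store(now + PURGE_DELAY_MS, .release);
}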
Empty segments are cached for reuse instead of immediately freed:
const SEGMENT_CACHE_THRESHOLD: usize = 4; // Start freeing after 4 segments
const EMPTY_SEGMENT_CACHE_MAX: usize = 1; // Keep up to 1 empty per queue
// Three queues by page kind:
SegmentsTLD = struct {
small_free: SegmentQueue, // Segments with 64KB pages
medium_free: SegmentQueue, // Segments with 512KB pages
large_free: SegmentQueue, // Segments with full pages
empty_segment_count: usize, // Total empty segments cached
}

Caching benefits:
- Avoids expensive mmap/munmap syscalls
- Segments already have page tables set up
- Quick reuse for same-sized allocations

Eviction policy:
- When empty_segment_count >= EMPTY_SEGMENT_CACHE_MAX * 3 (3)
- AND total_segments > SEGMENT_CACHE_THRESHOLD (4)
- → Free segment immediately instead of caching
| Type | Page Size | Block Size | Pages/Segment | Use Case |
|---|---|---|---|---|
| Small | 64 KB | 8 - 1024 B | ~500 | Most allocations |
| Medium | 512 KB | 1KB - 128KB | ~60 | Medium objects |
| Large | Full segment | 128KB - 16MB | 1 | Large objects |
| Huge | Variable | > 16MB | 1 (custom) | Huge allocations |
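A sketch of how a block size could select the page kind, following the thresholds in this table (the actual selection lives in the segment allocator):

const PageKind = enum { small, medium, large, huge };

fn pageKindForBlockSize(block_size: usize) PageKind {
    if (block_size <= SMALL_OBJ_SIZE_MAX) return .small;    // <= 1 KiB   -> 64 KiB pages
    if (block_size <= MEDIUM_OBJ_SIZE_MAX) return .medium;  // <= 128 KiB -> 512 KiB pages
    if (block_size <= LARGE_OBJ_SIZE_MAX) return .large;    // <= 16 MiB  -> one page per segment
    return .huge;                                           // > 16 MiB   -> dedicated segment
}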
wsize = (size + 7) / 8 // size in 8-byte words
Bin assignment:
wsize 1-8: bin = wsize (linear: 8, 16, 24, 32, 40, 48, 56, 64 bytes)
wsize > 8: bin = logarithmic (4 bins per octave)
Total: 74 bins (0-73, where 73 = BIN_HUGE)
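The forward mapping binFromWsize is not shown here; below is a minimal sketch consistent with the 4-bins-per-octave rule and the inverse formulas later in this section (the real implementation may differ in edge cases):

const std = @import("std");

fn binFromWsizeSketch(wsize: usize) usize {
    if (wsize <= 8) return wsize; // linear region: bin == wsize
    const w = wsize - 1;
    const b = std.math.log2_int(usize, w); // highest set bit (the lzcnt mentioned below)
    // 4 bins per power-of-two octave: the top bit picks the octave,
    // the next two bits pick the quarter within it.
    const bin = (@as(usize, b) << 2) | ((w >> (b - 2)) & 3);
    return @min(bin - 3, 73); // clamp to BIN_HUGE (73)
}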
Direct pointer array: 129 entries (wsize 0-128)
- Provides O(1) lookup for sizes up to 1024 bytes
- Multiple wsizes may point to same page (same bin)
For bins > 8, multiple wsizes map to the same bin. These functions compute ranges directly without iteration:
/// Get minimum wsize that maps to a bin (O(1), no iteration)
pub inline fn minWsizeForBin(bin: usize) usize {
if (bin <= 8) return bin;
// Reverse binFromWsize formula:
// bin + 3 = (b << 2) | ((w >> (b-2)) & 3)
const b = (bin + 3) >> 2;
const rem = (bin + 3) & 3;
const min_w = (1 << b) | (rem << (b - 2));
return min_w + 1;
}
/// Get maximum wsize that maps to a bin (O(1))
pub inline fn maxWsizeForBin(bin: usize) usize {
if (bin <= 8) return bin;
const b = (bin + 3) >> 2;
const rem = (bin + 3) & 3;
const max_w = (1 << b) | (rem << (b - 2)) | ((1 << (b - 2)) - 1);
return max_w + 1;
}
/// Block size = max_wsize * 8 bytes
pub fn blockSizeForBin(bin: usize) usize {
return maxWsizeForBin(bin) * INTPTR_SIZE;
}

Why these matter: Setting direct pointers for all wsizes in a bin requires knowing the range [min_wsize, max_wsize]. Computing this with a binFromWsize loop costs ~40% of slow-path time due to lzcnt/shift per iteration.
| Operation | Mechanism |
|---|---|
| Same-thread alloc/free | No synchronization needed |
| Cross-thread free | Lock-free atomic CAS on xthread_free |
| Segment reclaim | Atomic CAS on thread_id |
| Abandoned segments | Global lock for queue operations |
When a thread exits without calling threadExit(), its segments become orphaned. The allocator handles this through lazy reclamation.
// In Subproc (global process state)
Subproc = struct {
abandoned_os_list: SegmentAbandonedQueue, // Queue of abandoned segments
abandoned_os_list_count: Atomic(usize), // Count for quick check
abandoned_os_lock: Mutex, // Protects list operations
}

Flow:
Thread Exit (without cleanup)
|
v
+-------------------+
| Segment still has |
| thread_id set |
+-------------------+
|
| (Another thread tries cross-thread free)
v
+-------------------+
| Detects orphaned |---> markAbandoned() sets thread_id = 0
| segment |---> Adds to abandoned_os_list
+-------------------+
|
v
+-------------------+
| reclaimAbandoned()| Called during allocation when
| on other threads | abandoned_count > 0
+-------------------+
|
v
+-------------------+
| tryReclaim() via | Atomic CAS: 0 -> new_thread_id
| CAS on thread_id |
+-------------------+
fn reclaimAbandoned(max_count: usize) usize {
// Quick check without lock
if (global_subproc.abandoned_os_list_count.load(.monotonic) == 0) {
return 0;
}
// Lock and iterate abandoned list
global_subproc.abandoned_os_lock.lock();
defer global_subproc.abandoned_os_lock.unlock();
var reclaimed: usize = 0;
var segment = abandoned_os_list.head;
while (segment) |seg| {
const next = seg.abandoned_next; // capture the link before a possible removal (field name assumed)
if (seg.canReclaim()) { // has free pages + is abandoned
abandoned_os_list.remove(seg);
if (tryReclaimSegment(seg)) {
reclaimed += 1;
// Segment now owned by current thread
}
}
if (reclaimed >= max_count) break;
segment = next; // advance; otherwise unreclaimable segments would loop forever
}
return reclaimed;
}

Reclamation triggers:
- During mallocGeneric when abandoned_count > 0
- When allocation fails and segments need to be reclaimed
Each thread maintains a list of ALL its segments for cross-thread free collection:
// In SegmentsTLD (per-thread segment cache)
SegmentsTLD = struct {
all_segments: AllSegmentsList, // Doubly-linked list of all owned segments
// ...
}
// In Segment
Segment = struct {
all_next: ?*Segment, // Link for all_segments list
all_prev: ?*Segment,
// ...
}

Purpose: When collecting xthread_free, we need to iterate ALL segments (including those with full pages). The all_segments list provides O(1) insertion/removal and O(n) iteration:
fn collectAllXthreadFree(self: *SegmentsTLD) usize {
var total_collected: usize = 0;
// Iterate ALL segments owned by this thread
var segment = self.all_segments.tail;
while (segment) |seg| {
const next = seg.all_prev;
// Collect xthread_free from ALL pages (including full ones)
for (seg.pages[0..seg.capacity]) |*page| {
if (page.xthread_free.load(.acquire) != null) {
total_collected += page.collectXthreadFree();
}
}
segment = next;
}
return total_collected;
}

Why not use bin queues? Bin queues only contain non-full pages. But cross-thread frees can happen to ANY page, including full ones. Without all_segments, we'd have no way to find and collect blocks from full pages.
+-------------+
| ACTIVE | thread_id = owner_thread
+------+------+
|
| Thread exits without cleanup
v
+-------------+
| ORPHANED | thread_id still set, but thread is gone
+------+------+
|
| Detected during xthread_free or allocation
v
+-------------+
| ABANDONED | thread_id = 0, in abandoned_os_list
+------+------+
|
| Another thread calls reclaimAbandoned()
v
+-------------+
| RECLAIMED | thread_id = new_owner, removed from abandoned list
+-------------+
SEGMENT_SIZE = 32 MiB
SEGMENT_SHIFT = 25
SMALL_PAGE_SIZE = 64 KiB
SMALL_PAGE_SHIFT = 16
MEDIUM_PAGE_SIZE = 512 KiB
MEDIUM_PAGE_SHIFT = 19
SMALL_OBJ_SIZE_MAX = 1 KiB // Fast path threshold
MEDIUM_OBJ_SIZE_MAX = 128 KiB
LARGE_OBJ_SIZE_MAX = 16 MiB
BIN_HUGE = 73
PAGES_DIRECT = 129 // Direct pointer array size
SMALL_WSIZE_MAX = 128 // Max wsize for direct lookup

| Operation | Complexity | Cache Misses | Notes |
|---|---|---|---|
| malloc (fast) | O(1) | 1-2 | direct pointer + free list pop |
| malloc (medium) | O(1) | 2-3 | bin queue + pop |
| malloc (slow) | O(n) | many | segment/page allocation |
| free (same-thread) | O(1) | 1-2 | push to local_free |
| free (cross-thread) | O(1) | 1 | atomic CAS |
| ptr -> segment | O(1) | 0 | bitmask only |
| ptr -> page | O(1) | 1 | bitmask + array lookup |
Cache optimizations:
- Page struct = 64 bytes = 1 cache line
- Hot fields in first 32 bytes
- Bump allocation = sequential memory access
- Prefetch hints for next block in free list
- Direct pointer array eliminates bin calculation in hot path
- Typed free list pointers (IntrusiveLifoLink*) avoid casts in hot path

Computational optimizations:
- minWsizeForBin/maxWsizeForBin use an O(1) inverse formula instead of an O(n) loop
- Avoids expensive lzcnt/shrx per wsize when setting direct pointers
- Time-based throttling for arena purge (10ms delay)
- Direct pointer mismatch: When adding a page to a bin, ALL wsizes that map to that bin must have their direct pointers updated. Use minWsizeForBin(bin) and maxWsizeForBin(bin) to get the range. Otherwise allocations of different sizes in the same bin will miss the fast path.
- Page full flag: When a page is exhausted in the fast path, it must be marked in_full and removed from the bin queue. When it receives a free, freeColdPath must re-add it.
- Cross-thread free accounting: The used counter must be decremented when collecting xthread_free, and the global pending_xthread_free counter must be updated.
- Page size constraint: The Page struct must remain exactly 64 bytes (one cache line). Adding fields requires removing or compressing existing ones.
- Bin vs wsize: For small allocations, bin == wsize. For larger allocations, multiple wsizes map to the same bin. blockSizeForBin(bin) returns the maximum size for that bin.
- Typed pointers: Use IntrusiveLifoLink* for free lists instead of anyopaque. This eliminates pointer casts in the hot allocation/free paths.
- binFromWsize in loops: Never call binFromWsize in a loop over wsizes - use minWsizeForBin/maxWsizeForBin instead. The lzcnt instruction per iteration was causing 40%+ slow path overhead.