Feature request: borrowed/read-only references to eliminate refcount overhead in method calls #1163

@bikallem

Summary

When a method only reads from self and does not store it into any heap-allocated structure, the compiler should be able to elide all refcount operations on self and its fields. This "borrowed reference" optimization would eliminate the single largest performance bottleneck in struct-heavy, performance-critical MoonBit code.

Real-world use case

I'm porting Go's image standard library to MoonBit (bikallem/image). The library defines image types like:

pub(all) struct ImageAlpha {
  pix : FixedArray[Byte]
  pix_offset : Int
  stride : Int
  rect : Rectangle   // #valtype, 4 Ints
}

With a simple pixel accessor:

pub fn ImageAlpha::alpha_at(self : ImageAlpha, x : Int, y : Int) -> @color.Alpha {
  let r = self.rect
  if x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y {
    return { a: b'\x00' }
  }
  let i = self.pix_offset + (y - r.min_y) * self.stride + (x - r.min_x)
  { a: self.pix.unsafe_get(i) }
}

This function is read-only — it reads self.rect, self.pix_offset, self.stride, and self.pix[i], then returns a #valtype. It never stores self anywhere.

The problem: 47 instructions for a 1-byte read

I profiled with valgrind --tool=callgrind (1 million iterations). alpha_at costs ~47 instructions per call, while Go's equivalent is ~3 instructions (bounds check + array load). The ~12x runtime gap to Go is almost entirely refcount overhead.

Generated C code (from moon build --target native --release)

Caller (bench loop):

while (iter < 1000000) {
    moonbit_incref(m);                    // +1 rc on ImageAlpha before call
    result = alpha_at(m, 4, 5);
    sum += result.a;
    iter++;
}
moonbit_decref(m);                        // -1 rc at loop exit

Inside alpha_at:

struct Alpha alpha_at(struct ImageAlpha* self, int32_t x, int32_t y) {
    // Copy rect (valtype) — fine, ~4 insn
    Rectangle r = { self->min_x, self->min_y, self->max_x, self->max_y };
    
    // Bounds check — ~8 insn
    if (x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y) {
        moonbit_decref(self);             // -1 rc (early return path)
        return zero_alpha;
    }
    
    // Compute index — ~5 insn  
    int i = self->pix_offset + (y - r.min_y) * self->stride + (x - r.min_x);
    
    // Extract self.pix — THIS IS THE EXPENSIVE PART
    bytes_t pix = self->pix;              // read the pointer
    int rc = self->header->rc;            // read self's refcount
    if (rc > 1) {
        self->header->rc = rc - 1;        // -1 rc on self
        moonbit_incref(pix);              // +1 rc on pix array
    } else if (rc == 1) {
        moonbit_free(self);               // FREE self if last ref!
    }
    
    // Read one byte — 1 insn
    int byte = pix[i];
    
    moonbit_decref(pix);                  // -1 rc on pix array
    
    return (Alpha){ byte };
}
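The refcount traffic in the listing above can be reproduced in a small standalone C model. This is only a sketch, not the MoonBit runtime: Header, Bytes, incref, decref, drop_bytes, drop_image and run_owned_loop are hypothetical stand-ins, and the model counts refcount operations rather than machine instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical object header; the real runtime layout is not shown in the issue. */
typedef struct { int32_t rc; } Header;
typedef struct { Header hdr; uint8_t data[16]; } Bytes;
typedef struct {
    Header  hdr;
    Bytes  *pix;
    int32_t pix_offset, stride;
    int32_t min_x, min_y, max_x, max_y;
} ImageAlpha;

static long rc_ops = 0;  /* every refcount increment or decrement is counted here */

static void incref(Header *h) { h->rc++; rc_ops++; }
static int  decref(Header *h) { rc_ops++; return --h->rc == 0; }

static void drop_bytes(Bytes *b) { if (decref(&b->hdr)) free(b); }
static void drop_image(ImageAlpha *m) {
    if (decref(&m->hdr)) { drop_bytes(m->pix); free(m); }
}

/* Owned convention: the callee receives its own reference and must release it. */
static uint8_t alpha_at_owned(ImageAlpha *self, int32_t x, int32_t y) {
    if (x < self->min_x || x >= self->max_x || y < self->min_y || y >= self->max_y) {
        drop_image(self);
        return 0;
    }
    int32_t i = self->pix_offset + (y - self->min_y) * self->stride + (x - self->min_x);
    Bytes *pix = self->pix;
    if (self->hdr.rc > 1) {
        (void)decref(&self->hdr);  /* release the callee's reference to self */
        incref(&pix->hdr);         /* keep pix alive independently of self */
    } else {
        free(self);                /* last reference: self's hold on pix moves to the local */
    }
    uint8_t b = pix->data[i];
    drop_bytes(pix);
    return b;
}

/* Mirrors the bench loop: the caller increfs before every call. Returns rc ops spent in the loop. */
long run_owned_loop(int iters, long *sum_out) {
    Bytes *pix = calloc(1, sizeof *pix);
    ImageAlpha *m = calloc(1, sizeof *m);
    assert(pix != NULL && m != NULL);
    pix->hdr.rc = 1;
    m->hdr.rc = 1;
    for (int i = 0; i < 16; i++) pix->data[i] = (uint8_t)i;
    m->pix = pix; m->stride = 4; m->max_x = 4; m->max_y = 4;

    rc_ops = 0;
    long sum = 0;
    for (int iter = 0; iter < iters; iter++) {
        incref(&m->hdr);
        sum += alpha_at_owned(m, 1, 2);
    }
    assert(m->hdr.rc == 1 && pix->hdr.rc == 1);  /* counts are back where they started */
    long ops = rc_ops;
    drop_image(m);
    *sum_out = sum;
    return ops;
}
```

Calling run_owned_loop(1000, &sum) in this model spends 4 refcount operations per iteration (one in the caller, three in the callee) to read a single byte, and both counts end each iteration exactly where they began.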

Callgrind instruction breakdown (per call)

| Operation | Instructions | Notes |
|---|---|---|
| Caller: moonbit_incref(m) | ~4 | Bump rc before call |
| Copy rect valtype | ~4 | Fine |
| Bounds check | ~8 | Fine |
| Index computation | ~5 | Fine |
| Extract self.pix + rc bookkeeping | ~15 | Check rc, branch, incref/decref/free |
| Read pix[i] | ~1 | The actual work |
| moonbit_decref(pix) | ~8 | Drop pix rc |
| Return + overhead | ~2 | |
| Total | ~47 | |
| Without refcounting | ~7 | Just bounds check + array load |

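Measured against the ~7-instruction baseline, ~85% of the cost (40 of 47 instructions) is overhead. The three explicit refcount rows alone (caller incref, self.pix extraction, decref of pix) account for ~27 of those instructions.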

Aggregate callgrind data

Profiling the full image library benchmark suite (558M total instructions):

| Category | Instructions | % of total |
|---|---|---|
| moonbit_drop_object | 47.5M | 8.5% |
| free (from rc drops) | 30.9M | 5.5% |
| malloc | 13.1M | 2.3% |
| _mi_page_malloc_zero | 14.2M | 2.5% |
| Total GC/refcount overhead | 105.7M | 18.9% |

Nearly 1 in 5 instructions is refcount or allocator bookkeeping.

Benchmark impact

Comparing MoonBit vs Go on equivalent image operations:

| Benchmark | MoonBit | Go | Ratio (MoonBit/Go) | Root cause |
|---|---|---|---|---|
| at/alpha | 10 ns | 0.83 ns | 12x | 4 rc ops per call |
| at/gray | 10 ns | 0.85 ns | 12x | 4 rc ops per call |
| at/rgba | 10 ns | 2 ns | 5.7x | 4 rc ops per call |
| glyph_over (before workaround) | 4.1 ms | 344 µs | 12x | Per-pixel rc via AnyImage::at() |

The glyph_over case was fixed by manually bypassing the struct method and reading mask.pix.unsafe_get(i) directly, which took it from 12x slower than Go to parity (1.0x). But this required writing a specialized function that duplicates logic, which shouldn't be necessary.

What borrowed references would look like

The compiler would recognize that alpha_at is a read-only borrower of self:

  • self is never stored into a heap-allocated structure
  • self is never returned or captured in a closure
  • The caller holds self alive for the duration of the call
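The first two conditions amount to a simple escape check on how the parameter is used. A minimal sketch, assuming a toy representation in which each use of self is already classified; ParamUse and param_is_borrowable are hypothetical names and do not describe the MoonBit compiler's actual IR or analysis:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of each use of a parameter inside a function body. */
typedef enum {
    USE_READ_FIELD,     /* self.f is read */
    USE_PASS_BORROWED,  /* self is passed to a callee that also only borrows it */
    USE_STORE_TO_HEAP,  /* self is written into a heap-allocated structure */
    USE_RETURN,         /* self is returned */
    USE_CAPTURE         /* self is captured by a closure */
} ParamUse;

/* A parameter can be borrowed only if no use lets it outlive the call.
 * The third condition (the caller keeps self alive during the call) is an
 * obligation on the call site and is not checked here. */
bool param_is_borrowable(const ParamUse *uses, size_t n) {
    assert(uses != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        switch (uses[i]) {
        case USE_READ_FIELD:
        case USE_PASS_BORROWED:
            break;             /* non-escaping use */
        default:
            return false;      /* stored, returned, or captured: needs an owned reference */
        }
    }
    return true;
}
```

Under this model alpha_at qualifies, since every use of self is a field read, while a method that captures self in a closure does not.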

The generated code would become:

// Caller: NO incref needed — self is borrowed
result = alpha_at(m, 4, 5);

// Inside alpha_at: NO rc ops — self is guaranteed alive
struct Alpha alpha_at(struct ImageAlpha* self, int32_t x, int32_t y) {
    Rectangle r = { self->min_x, self->min_y, self->max_x, self->max_y };
    if (x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y)
        return zero_alpha;
    int i = self->pix_offset + (y - r.min_y) * self->stride + (x - r.min_x);
    return (Alpha){ self->pix[i] };      // direct read, no rc
}

~7 instructions instead of ~47. This would close the 12x gap to Go.
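A compilable variant of the borrowed listing makes the contract explicit: the callee takes a const pointer and never touches a refcount. This is a sketch under assumed types; Header, Bytes and borrowed_demo are hypothetical and do not reflect the real runtime layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical object header and types, for illustration only. */
typedef struct { int32_t rc; } Header;
typedef struct { Header hdr; uint8_t data[16]; } Bytes;
typedef struct {
    Header       hdr;
    const Bytes *pix;
    int32_t      pix_offset, stride;
    int32_t      min_x, min_y, max_x, max_y;
} ImageAlpha;

/* Borrowed convention: self is only read; the caller guarantees it stays alive. */
static uint8_t alpha_at_borrowed(const ImageAlpha *self, int32_t x, int32_t y) {
    if (x < self->min_x || x >= self->max_x || y < self->min_y || y >= self->max_y)
        return 0;
    int32_t i = self->pix_offset + (y - self->min_y) * self->stride + (x - self->min_x);
    return self->pix->data[i];  /* direct read, no rc traffic */
}

/* Sums row y = 2 of a 4x4 image whose bytes are 0..15, plus one out-of-bounds read. */
long borrowed_demo(void) {
    Bytes pix = { .hdr = { .rc = 1 } };
    for (int i = 0; i < 16; i++) pix.data[i] = (uint8_t)i;
    ImageAlpha m = { .hdr = { .rc = 1 }, .pix = &pix, .stride = 4, .max_x = 4, .max_y = 4 };

    long sum = 0;
    for (int32_t x = 0; x < 4; x++) sum += alpha_at_borrowed(&m, x, 2);
    sum += alpha_at_borrowed(&m, 4, 2);           /* out of bounds: contributes 0 */
    assert(m.hdr.rc == 1 && pix.hdr.rc == 1);     /* refcounts were never modified */
    return sum;
}
```

Because the callee cannot free or retain self, the early-return path also loses its decref, so both branches of the bounds check become free of refcount work.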

Scope of impact

This optimization would benefit any method that reads from a heap-allocated struct without storing it. In the image library alone, this pattern appears in:

  • All pixel accessors: rgba_at, gray_at, alpha_at, nrgba_at, rgba64_at, etc. (10+ types)
  • All bounds(), is_opaque(), pix_offset() methods
  • Color conversion: YCbCr::rgba(), NRGBA::rgba(), etc.
  • Palette lookup: Palette::index(), Palette::convert()
  • Any struct method that just reads fields and returns a value type

This is a fundamental pattern in performance-critical MoonBit code — not just image processing but any domain with struct-heavy data (game engines, parsers, scientific computing, etc.).

Prior art

  • Rust: Explicit &self borrow syntax, lifetime system guarantees no rc needed
  • Swift: "Guaranteed" calling convention — compiler elides rc for parameters proven alive
  • Lobster: Reference counting with "borrowed" parameter optimization
  • Nim: sink vs lent parameter modes

Environment

  • MoonBit compiler: latest (March 2026)
  • Target: native (C backend)
  • Profiler: valgrind/callgrind 3.26.0
  • Platform: Linux x86_64
