Summary
When a method only reads from self and does not store it into any heap-allocated structure, the compiler should be able to elide all refcount operations on self and its fields. This "borrowed reference" optimization would eliminate the single largest performance bottleneck in struct-heavy, performance-critical MoonBit code.
Real-world use case
I'm porting Go's image standard library to MoonBit (bikallem/image). The library defines image types like:
    pub(all) struct ImageAlpha {
      pix : FixedArray[Byte]
      pix_offset : Int
      stride : Int
      rect : Rectangle // #valtype, 4 Ints
    }
With a simple pixel accessor:
    pub fn ImageAlpha::alpha_at(self : ImageAlpha, x : Int, y : Int) -> @color.Alpha {
      let r = self.rect
      if x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y {
        return { a: b'\x00' }
      }
      let i = self.pix_offset + (y - r.min_y) * self.stride + (x - r.min_x)
      { a: self.pix.unsafe_get(i) }
    }
This function is read-only — it reads self.rect, self.pix_offset, self.stride, and self.pix[i], then returns a #valtype. It never stores self anywhere.
The problem: 47 instructions for a 1-byte read
I profiled with valgrind --tool=callgrind (1 million iterations). alpha_at costs 47 instructions per call. Go's equivalent is ~3 instructions (bounds check + array load). The 12x gap is almost entirely refcount overhead.
Generated C code (from moon build --target native --release)
Caller (bench loop):
    while (iter < 1000000) {
      moonbit_incref(m); // +1 rc on ImageAlpha before call
      result = alpha_at(m, 4, 5);
      sum += result.a;
      iter++;
    }
    moonbit_decref(m); // -1 rc at loop exit
Inside alpha_at:
    struct Alpha alpha_at(struct ImageAlpha* self, int32_t x, int32_t y) {
      // Copy rect (valtype): fine, ~4 insn
      Rectangle r = { self->min_x, self->min_y, self->max_x, self->max_y };
      // Bounds check: ~8 insn
      if (x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y) {
        moonbit_decref(self); // -1 rc (early return path)
        return zero_alpha;
      }
      // Compute index: ~5 insn
      int i = self->pix_offset + (y - r.min_y) * self->stride + (x - r.min_x);
      // Extract self.pix: THIS IS THE EXPENSIVE PART
      bytes_t pix = self->pix; // read the pointer
      int rc = self->header->rc; // read self's refcount
      if (rc > 1) {
        self->header->rc = rc - 1; // -1 rc on self
        moonbit_incref(pix); // +1 rc on pix array
      } else if (rc == 1) {
        moonbit_free(self); // FREE self if last ref!
      }
      // Read one byte: 1 insn
      int byte = pix[i];
      moonbit_decref(pix); // -1 rc on pix array
      return (Alpha){ byte };
    }
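The refcount traffic above can be modelled in isolation. The sketch below is a simplified C model, not the MoonBit runtime: `Header`, `incref`, `decref`, `rc_ops`, and `count_rc_ops` are hypothetical names introduced for illustration, and the model assumes the caller always holds its own reference, so nothing is ever freed. It counts refcount operations for an in-bounds read under an owned-parameter convention and under a borrowed one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature refcount model; these are NOT MoonBit runtime APIs. */
typedef struct { int rc; } Header;
typedef struct { Header h; uint8_t data[64]; } Bytes;
typedef struct {
    Header h;
    Bytes *pix;
    int pix_offset, stride;
    int min_x, min_y, max_x, max_y;
} ImageAlpha;

static long rc_ops = 0; /* counts every incref/decref performed */

static void incref(Header *h) { h->rc++; rc_ops++; }
/* A real runtime would free the object at rc == 0; this model only counts. */
static void decref(Header *h) { h->rc--; rc_ops++; }

/* Owned convention: the callee receives one reference to self and releases it. */
static uint8_t alpha_at_owned(ImageAlpha *self, int x, int y) {
    if (x < self->min_x || x >= self->max_x || y < self->min_y || y >= self->max_y) {
        decref(&self->h);
        return 0;
    }
    int i = self->pix_offset + (y - self->min_y) * self->stride + (x - self->min_x);
    Bytes *pix = self->pix;
    incref(&pix->h);   /* keep pix alive once self is released */
    decref(&self->h);  /* release the callee's reference to self */
    uint8_t a = pix->data[i];
    decref(&pix->h);
    return a;
}

/* Borrowed convention: the caller keeps self alive, so no counts change. */
static uint8_t alpha_at_borrowed(const ImageAlpha *self, int x, int y) {
    if (x < self->min_x || x >= self->max_x || y < self->min_y || y >= self->max_y) {
        return 0;
    }
    int i = self->pix_offset + (y - self->min_y) * self->stride + (x - self->min_x);
    return self->pix->data[i];
}

/* Runs `iters` in-bounds reads and returns the number of rc operations. */
static long count_rc_ops(int iters, int borrowed) {
    Bytes pix = { .h = { 1 } };          /* 8x8 image, remaining bytes zero */
    pix.data[5 * 8 + 4] = 200;           /* the pixel at (4, 5) */
    ImageAlpha m = { { 1 }, &pix, 0, 8, 0, 0, 8, 8 };
    long sum = 0;
    rc_ops = 0;
    for (int k = 0; k < iters; k++) {
        if (borrowed) {
            sum += alpha_at_borrowed(&m, 4, 5);
        } else {
            incref(&m.h);                /* caller hands one reference to the callee */
            sum += alpha_at_owned(&m, 4, 5);
        }
    }
    assert(sum == 200L * iters);          /* both conventions read the same byte */
    assert(m.h.rc == 1 && pix.h.rc == 1); /* counts are balanced afterwards */
    return rc_ops;
}
```

In this model, `count_rc_ops(1000, 0)` performs four refcount operations per call (caller incref, pix incref, self decref, pix decref), while `count_rc_ops(1000, 1)` performs none. The real generated code also branches on the current count and may free `self`, which the model leaves out.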
Callgrind instruction breakdown (per call)
| Operation | Instructions | Notes |
| --- | --- | --- |
| Caller: moonbit_incref(m) | ~4 | Bump rc before call |
| Copy rect valtype | ~4 | Fine |
| Bounds check | ~8 | Fine |
| Index computation | ~5 | Fine |
| Extract self.pix + rc bookkeeping | ~15 | Check rc, branch, incref/decref/free |
| Read pix[i] | ~1 | The actual work |
| moonbit_decref(pix) | ~8 | Drop pix rc |
| Return + overhead | ~2 | |
| Total | ~47 | |
| Without refcounting | ~7 | Just bounds check + array load |
Refcounting accounts for ~85% of the cost: 40 of the 47 instructions disappear in the ~7-instruction version without refcounting.
Aggregate callgrind data
Profiling the full image library benchmark suite (558M total instructions):
| Category | Instructions | % of total |
| --- | --- | --- |
| moonbit_drop_object | 47.5M | 8.5% |
| free (from rc drops) | 30.9M | 5.5% |
| malloc | 13.1M | 2.3% |
| _mi_page_malloc_zero | 14.2M | 2.5% |
| Total GC/refcount overhead | 105.7M | 18.9% |
Nearly 1 in 5 instructions is refcount bookkeeping.
Benchmark impact
Comparing MoonBit vs Go on equivalent image operations:
| Benchmark | MoonBit | Go | Ratio | Root cause |
| --- | --- | --- | --- | --- |
| at/alpha | 10 ns | 0.83 ns | 12x | 4 rc ops per call |
| at/gray | 10 ns | 0.85 ns | 12x | 4 rc ops per call |
| at/rgba | 10 ns | 2 ns | 5.7x | 4 rc ops per call |
| glyph_over (before workaround) | 4.1 ms | 344 µs | 12x | Per-pixel rc via AnyImage::at() |
The glyph_over case was fixed by manually bypassing the struct method and reading mask.pix.unsafe_get(i) directly — reducing it from 12x to 1.0x. But this required writing a specialized function that duplicates logic, which shouldn't be necessary.
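The shape of that workaround can be illustrated with a C analogue. This is a sketch under assumptions rather than the library's code: `Mask`, `mask_alpha_at`, `coverage_via_accessor`, `coverage_direct`, and `coverage_demo` are hypothetical names, and the loops simply sum alpha values instead of compositing a glyph. In C neither loop pays refcount costs; the point is only to show how the specialized version has to repeat the accessor's index arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask layout mirroring the fields used in the text. */
typedef struct {
    const uint8_t *pix;
    int pix_offset, stride;
    int min_x, min_y, max_x, max_y;
} Mask;

/* General accessor: bounds check plus index computation on every call. */
static uint8_t mask_alpha_at(const Mask *m, int x, int y) {
    if (x < m->min_x || x >= m->max_x || y < m->min_y || y >= m->max_y) {
        return 0;
    }
    return m->pix[m->pix_offset + (y - m->min_y) * m->stride + (x - m->min_x)];
}

/* Straightforward version: one accessor call per pixel. */
static long coverage_via_accessor(const Mask *m) {
    long sum = 0;
    for (int y = m->min_y; y < m->max_y; y++) {
        for (int x = m->min_x; x < m->max_x; x++) {
            sum += mask_alpha_at(m, x, y);
        }
    }
    return sum;
}

/* Specialized version: duplicates the index arithmetic to read pix directly. */
static long coverage_direct(const Mask *m) {
    long sum = 0;
    int width = m->max_x - m->min_x;
    for (int y = m->min_y; y < m->max_y; y++) {
        const uint8_t *row = m->pix + m->pix_offset + (y - m->min_y) * m->stride;
        for (int x = 0; x < width; x++) {
            sum += row[x];
        }
    }
    return sum;
}

/* Checks that both loops agree on a 3x3 mask stored with stride 4
   (one padding byte per row) and returns the shared result. */
static long coverage_demo(void) {
    static const uint8_t buf[12] = { 1, 2, 3, 0, 4, 5, 6, 0, 7, 8, 9, 0 };
    Mask m = { buf, 0, 4, 10, 20, 13, 23 };
    long a = coverage_via_accessor(&m);
    assert(a == coverage_direct(&m));
    return a;
}
```

Any change to the pixel layout now has to be made in two places, the accessor and the specialized loop, which is the maintenance cost the workaround introduces.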
What borrowed references would look like
The compiler would recognize that alpha_at is a read-only borrower of self:
- self is never stored into a heap-allocated structure
- self is never returned or captured in a closure
- The caller holds self alive for the duration of the call
The generated code would become:
    // Caller: NO incref needed, self is borrowed
    result = alpha_at(m, 4, 5);
    // Inside alpha_at: NO rc ops, self is guaranteed alive
    struct Alpha alpha_at(struct ImageAlpha* self, int32_t x, int32_t y) {
      Rectangle r = { self->min_x, self->min_y, self->max_x, self->max_y };
      if (x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y)
        return zero_alpha;
      int i = self->pix_offset + (y - r.min_y) * self->stride + (x - r.min_x);
      return (Alpha){ self->pix[i] }; // direct read, no rc
    }
~7 instructions instead of ~47. This would close the 12x gap to Go.
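The callee-side conditions listed above amount to a check over how a function body uses self. The following is a minimal sketch of such a check, assuming a toy representation in which each use of self has already been classified; `UseKind` and `self_is_borrowable` are hypothetical names, not part of the MoonBit compiler, and a real analysis would also have to follow aliases and calls into other functions. The third condition (the caller keeps self alive) is a property of call sites and is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of each use of `self` in a function body. */
typedef enum {
    USE_READ_FIELD,        /* e.g. self.stride, self.pix[i] */
    USE_STORE_TO_HEAP,     /* self written into another heap object */
    USE_RETURN,            /* self returned to the caller */
    USE_CAPTURE_IN_CLOSURE /* self captured by a closure */
} UseKind;

/* Returns true when every use of self is a plain read, so the callee
   never needs its own reference. */
static bool self_is_borrowable(const UseKind *uses, size_t n) {
    for (size_t i = 0; i < n; i++) {
        switch (uses[i]) {
        case USE_READ_FIELD:
            break;                 /* reading never extends self's lifetime */
        case USE_STORE_TO_HEAP:
        case USE_RETURN:
        case USE_CAPTURE_IN_CLOSURE:
            return false;          /* self escapes: an owned reference is required */
        }
    }
    return true;
}

/* alpha_at reads rect, pix_offset, stride, and pix[i], and nothing else. */
static void borrow_check_demo(void) {
    const UseKind alpha_at_uses[] = {
        USE_READ_FIELD, USE_READ_FIELD, USE_READ_FIELD, USE_READ_FIELD
    };
    const UseKind escaping_uses[] = { USE_READ_FIELD, USE_STORE_TO_HEAP };
    assert(self_is_borrowable(alpha_at_uses, 4));
    assert(!self_is_borrowable(escaping_uses, 2));
}
```

The check is deliberately conservative: any use it cannot prove harmless falls back to the existing owned convention, so applying it would never change program behavior, only remove refcount work.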
Scope of impact
This optimization would benefit any method that reads from a heap-allocated struct without storing it. In the image library alone, this pattern appears in:
- All pixel accessors: rgba_at, gray_at, alpha_at, nrgba_at, rgba64_at, etc. (10+ types)
- All bounds(), is_opaque(), pix_offset() methods
- Color conversion: YCbCr::rgba(), NRGBA::rgba(), etc.
- Palette lookup: Palette::index(), Palette::convert()
- Any struct method that just reads fields and returns a value type
This is a fundamental pattern in performance-critical MoonBit code — not just image processing but any domain with struct-heavy data (game engines, parsers, scientific computing, etc.).
Prior art
- Rust: Explicit &self borrow syntax, lifetime system guarantees no rc needed
- Swift: "Guaranteed" calling convention, where the compiler elides rc for parameters proven alive
- Lobster: Reference counting with "borrowed" parameter optimization
- Nim: sink parameters (ownership transfer) and lent return types (borrowed views)
Environment
- MoonBit compiler: latest (March 2026)
- Target: native (C backend)
- Profiler: valgrind/callgrind 3.26.0
- Platform: Linux x86_64