Feature request: borrowed/read-only references to eliminate refcount overhead in method calls #1163

## Summary

When a method only *reads* from `self` and does not store it into any heap-allocated structure, the compiler should be able to elide all refcount operations on `self` and its fields. In my profiling, this "borrowed reference" optimization would remove the single largest bottleneck in struct-heavy, performance-critical MoonBit code.

## Real-world use case

I'm porting Go's `image` standard library to MoonBit ([bikallem/image](https://github.com/bikallem/image)). The library defines image types like:

```moonbit
pub(all) struct ImageAlpha {
  pix : FixedArray[Byte]
  pix_offset : Int
  stride : Int
  rect : Rectangle   // #valtype, 4 Ints
}
```

With a simple pixel accessor:

```moonbit
pub fn ImageAlpha::alpha_at(self : ImageAlpha, x : Int, y : Int) -> @color.Alpha {
  let r = self.rect
  if x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y {
    return { a: b'\x00' }
  }
  let i = self.pix_offset + (y - r.min_y) * self.stride + (x - r.min_x)
  { a: self.pix.unsafe_get(i) }
}
```

This function is **read-only** — it reads `self.rect`, `self.pix_offset`, `self.stride`, and `self.pix[i]`, then returns a `#valtype`. It never stores `self` anywhere.

## The problem: 47 instructions for a 1-byte read

I profiled with `valgrind --tool=callgrind` (1 million iterations). `alpha_at` costs **47 instructions per call**. Go's equivalent is **~3 instructions** (bounds check + array load). The resulting ~12x wall-clock gap (see "Benchmark impact") is almost entirely refcount overhead.

### Generated C code (from `moon build --target native --release`)

**Caller (bench loop):**
```c
while (iter < 1000000) {
    moonbit_incref(m);                    // +1 rc on ImageAlpha before call
    result = alpha_at(m, 4, 5);
    sum += result.a;
    iter++;
}
moonbit_decref(m);                        // -1 rc at loop exit
```

**Inside `alpha_at`:**
```c
struct Alpha alpha_at(struct ImageAlpha* self, int32_t x, int32_t y) {
    // Copy rect (valtype) — fine, ~4 insn
    Rectangle r = { self->min_x, self->min_y, self->max_x, self->max_y };
    
    // Bounds check — ~8 insn
    if (x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y) {
        moonbit_decref(self);             // -1 rc (early return path)
        return zero_alpha;
    }
    
    // Compute index — ~5 insn  
    int i = self->pix_offset + (y - r.min_y) * self->stride + (x - r.min_x);
    
    // Extract self.pix — THIS IS THE EXPENSIVE PART
    bytes_t pix = self->pix;              // read the pointer
    int rc = self->header->rc;            // read self's refcount
    if (rc > 1) {
        self->header->rc = rc - 1;        // -1 rc on self
        moonbit_incref(pix);              // +1 rc on pix array
    } else if (rc == 1) {
        moonbit_free(self);               // FREE self if last ref!
    }
    
    // Read one byte — 1 insn
    int byte = pix[i];
    
    moonbit_decref(pix);                  // -1 rc on pix array
    
    return (Alpha){ byte };
}
```
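To make the refcount traffic per call concrete, here is a minimal C model of the owned calling convention shown above. It is a sketch, not the code MoonBit emits: the `Bytes` layout, the `rc_ops` counter, and the helper names (`incref`, `bytes_decref`, `image_decref`, `owned_rc_ops_per_call`) are assumptions made for illustration, and the `rc == 1` fast path is folded into a plain decref.

```c
#include <stdint.h>
#include <stdlib.h>

static long rc_ops = 0;   /* counts refcount updates in this model */

/* Hypothetical layouts: each heap object carries its own refcount. */
typedef struct { int32_t rc; uint8_t data[64]; } Bytes;
typedef struct {
    int32_t rc;
    Bytes *pix;
    int32_t pix_offset, stride;
    int32_t min_x, min_y, max_x, max_y;
} ImageAlpha;

static void incref(int32_t *rc) { ++*rc; ++rc_ops; }

static void bytes_decref(Bytes *b) {
    ++rc_ops;
    if (--b->rc == 0) free(b);
}

static void image_decref(ImageAlpha *m) {
    ++rc_ops;
    if (--m->rc == 0) { bytes_decref(m->pix); free(m); }
}

/* Owned convention: the callee consumes one reference to self. */
static uint8_t alpha_at_owned(ImageAlpha *self, int32_t x, int32_t y) {
    if (x < self->min_x || x >= self->max_x ||
        y < self->min_y || y >= self->max_y) {
        image_decref(self);            /* early return still drops self */
        return 0;
    }
    int32_t i = self->pix_offset + (y - self->min_y) * self->stride
              + (x - self->min_x);
    Bytes *pix = self->pix;
    incref(&pix->rc);                  /* keep pix alive on its own */
    image_decref(self);                /* release the consumed reference */
    uint8_t a = pix->data[i];
    bytes_decref(pix);
    return a;
}

/* Runs the caller loop and returns refcount updates per call. */
static long owned_rc_ops_per_call(int iters, long *sum_out) {
    Bytes *pix = calloc(1, sizeof *pix);
    ImageAlpha *m = calloc(1, sizeof *m);
    if (!pix || !m) { free(pix); free(m); return -1; }
    pix->rc = 1;
    pix->data[5 * 8 + 4] = 0x7f;       /* pixel (4, 5) in an 8x8 image */
    *m = (ImageAlpha){ .rc = 1, .pix = pix, .pix_offset = 0, .stride = 8,
                       .min_x = 0, .min_y = 0, .max_x = 8, .max_y = 8 };
    rc_ops = 0;
    long sum = 0;
    for (int iter = 0; iter < iters; iter++) {
        incref(&m->rc);                /* caller: +1 before every call */
        sum += alpha_at_owned(m, 4, 5);
    }
    long per_call = rc_ops / iters;
    image_decref(m);                   /* caller: final release */
    if (sum_out) *sum_out = sum;
    return per_call;
}
```

In this model, `owned_rc_ops_per_call(1000, &sum)` returns 4: the caller's incref, the decref of `self`, and an incref/decref pair on `pix`. That matches the "4 rc ops per call" figure in the benchmark table.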

### Callgrind instruction breakdown (per call)

| Operation | Instructions | Notes |
|---|---|---|
| Caller: `moonbit_incref(m)` | ~4 | Bump rc before call |
| Copy rect valtype | ~4 | Fine |
| Bounds check | ~8 | Fine |
| Index computation | ~5 | Fine |
| Extract `self.pix` + rc bookkeeping | ~15 | Check rc, branch, incref/decref/free |
| Read `pix[i]` | ~1 | The actual work |
| `moonbit_decref(pix)` | ~8 | Drop pix rc |
| Return + overhead | ~2 | |
| **Total** | **~47** | |
| **Without refcounting** | **~7** | Just bounds check + array load |

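**Refcounting accounts for ~85% of the cost**: by the estimate in the last row, 40 of the 47 instructions go away in a ~7-instruction refcount-free version. The three rows that are purely refcount operations (caller incref, `self.pix` extraction, `moonbit_decref(pix)`) sum to ~27 of those.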

### Aggregate callgrind data

Profiling the full image library benchmark suite (558M total instructions):

| Category | Instructions | % of total |
|---|---|---|
| `moonbit_drop_object` | 47.5M | 8.5% |
| `free` (from rc drops) | 30.9M | 5.5% |
| `malloc` | 13.1M | 2.3% |
| `_mi_page_malloc_zero` | 14.2M | 2.5% |
| **Total GC/refcount overhead** | **105.7M** | **18.9%** |

Nearly 1 in 5 instructions is refcount bookkeeping or allocator work.

## Benchmark impact

Comparing MoonBit vs Go on equivalent image operations:

| Benchmark | MoonBit | Go | Ratio | Root cause |
|---|---|---|---|---|
| `at/alpha` | 10 ns | 0.83 ns | **12x** | 4 rc ops per call |
| `at/gray` | 10 ns | 0.85 ns | **12x** | 4 rc ops per call |
| `at/rgba` | 10 ns | 2 ns | **5.7x** | 4 rc ops per call |
| `glyph_over` (before workaround) | 4.1 ms | 344 µs | **12x** | Per-pixel rc via `AnyImage::at()` |

I fixed the `glyph_over` case by bypassing the struct method and reading `mask.pix.unsafe_get(i)` directly, which brought it from 12x to 1.0x of Go's time. That required a specialized function that duplicates the accessor's logic, which shouldn't be necessary.
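The shape of that workaround can be sketched in C terms. This is only a model, assuming the compiler needs at most one reference to `pix` for the whole loop; the struct layout, the `rc_ops` counter, and the names `sum_mask_direct` and `direct_rc_ops` are hypothetical and not taken from the library.

```c
#include <stdint.h>

static long rc_ops = 0;   /* counts refcount updates in this model */

typedef struct { int32_t rc; uint8_t data[64]; } Bytes;
typedef struct {
    int32_t rc;
    Bytes *pix;
    int32_t pix_offset, stride;
    int32_t min_x, min_y, max_x, max_y;
} ImageAlpha;

/* Workaround shape: take one reference to pix for the whole loop and
   repeat the accessor's index arithmetic inline. */
static long sum_mask_direct(const ImageAlpha *mask) {
    Bytes *pix = mask->pix;
    ++pix->rc; ++rc_ops;               /* one incref for the loop */
    long sum = 0;
    for (int32_t y = mask->min_y; y < mask->max_y; y++) {
        int32_t row = mask->pix_offset + (y - mask->min_y) * mask->stride;
        for (int32_t x = mask->min_x; x < mask->max_x; x++)
            sum += pix->data[row + (x - mask->min_x)];
    }
    --pix->rc; ++rc_ops;               /* one decref after the loop */
    return sum;
}

/* Builds an 8x8 mask filled with 1s; returns rc updates used by the sum. */
static long direct_rc_ops(long *sum_out) {
    Bytes pix = { .rc = 1 };
    for (int i = 0; i < 64; i++) pix.data[i] = 1;
    ImageAlpha mask = { .rc = 1, .pix = &pix, .pix_offset = 0, .stride = 8,
                        .min_x = 0, .min_y = 0, .max_x = 8, .max_y = 8 };
    rc_ops = 0;
    long sum = sum_mask_direct(&mask);
    if (sum_out) *sum_out = sum;
    return rc_ops;
}
```

Here the 64 pixel reads cost 2 refcount updates in total, where going through a per-pixel accessor at 4 updates per call would cost 256. The price is the duplicated bounds and index logic inside the loop.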

## What borrowed references would look like

The compiler would recognize that `alpha_at` is a **read-only borrower** of `self`:
- `self` is never stored into a heap-allocated structure
- `self` is never returned or captured in a closure
- The caller holds `self` alive for the duration of the call

The generated code would become:

```c
// Caller: NO incref needed — self is borrowed
result = alpha_at(m, 4, 5);

// Inside alpha_at: NO rc ops — self is guaranteed alive
struct Alpha alpha_at(struct ImageAlpha* self, int32_t x, int32_t y) {
    Rectangle r = { self->min_x, self->min_y, self->max_x, self->max_y };
    if (x < r.min_x || x >= r.max_x || y < r.min_y || y >= r.max_y)
        return zero_alpha;
    int i = self->pix_offset + (y - r.min_y) * self->stride + (x - r.min_x);
    return (Alpha){ self->pix[i] };      // direct read, no rc
}
```

**~7 instructions instead of ~47.** This should close most of the 12x gap to Go.
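For a way to check the claim mechanically, here is a small C model of the borrowed convention, under the assumption that the caller keeps `m` alive across the call. The struct layout, the `rc_ops` counter, and the function names are hypothetical, chosen for illustration rather than taken from the MoonBit runtime.

```c
#include <stdint.h>
#include <stdlib.h>

static long rc_ops = 0;   /* counts refcount updates in this model */

typedef struct { int32_t rc; uint8_t data[64]; } Bytes;
typedef struct {
    int32_t rc;
    Bytes *pix;
    int32_t pix_offset, stride;
    int32_t min_x, min_y, max_x, max_y;
} ImageAlpha;

/* Borrowed convention: the caller keeps self alive, so the callee only reads. */
static uint8_t alpha_at_borrowed(const ImageAlpha *self, int32_t x, int32_t y) {
    if (x < self->min_x || x >= self->max_x ||
        y < self->min_y || y >= self->max_y)
        return 0;
    int32_t i = self->pix_offset + (y - self->min_y) * self->stride
              + (x - self->min_x);
    return self->pix->data[i];
}

/* The owner drops its single reference once, outside the hot loop. */
static void image_release(ImageAlpha *m) {
    ++rc_ops;
    if (--m->rc == 0) {
        ++rc_ops;
        if (--m->pix->rc == 0) free(m->pix);
        free(m);
    }
}

/* Runs the caller loop and returns refcount updates per call. */
static long borrowed_rc_ops_per_call(int iters, long *sum_out) {
    Bytes *pix = calloc(1, sizeof *pix);
    ImageAlpha *m = calloc(1, sizeof *m);
    if (!pix || !m) { free(pix); free(m); return -1; }
    pix->rc = 1;
    pix->data[5 * 8 + 4] = 0x7f;       /* pixel (4, 5) in an 8x8 image */
    *m = (ImageAlpha){ .rc = 1, .pix = pix, .pix_offset = 0, .stride = 8,
                       .min_x = 0, .min_y = 0, .max_x = 8, .max_y = 8 };
    rc_ops = 0;
    long sum = 0;
    for (int iter = 0; iter < iters; iter++)
        sum += alpha_at_borrowed(m, 4, 5);   /* no incref, no decref */
    long per_call = rc_ops / iters;
    image_release(m);
    if (sum_out) *sum_out = sum;
    return per_call;
}
```

In this model the loop performs zero refcount updates per call and reads the same pixel values. The only rc traffic left is the owner's final release after the loop.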

## Scope of impact

This optimization would benefit any method that reads from a heap-allocated struct without storing it. In the image library alone, this pattern appears in:

- All pixel accessors: `rgba_at`, `gray_at`, `alpha_at`, `nrgba_at`, `rgba64_at`, etc. (10+ types)
- All `bounds()`, `is_opaque()`, `pix_offset()` methods
- Color conversion: `YCbCr::rgba()`, `NRGBA::rgba()`, etc.
- Palette lookup: `Palette::index()`, `Palette::convert()`
- Any struct method that just reads fields and returns a value type

This is a fundamental pattern in performance-critical MoonBit code — not just image processing but any domain with struct-heavy data (game engines, parsers, scientific computing, etc.).

## Prior art

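- **Rust**: explicit `&self` borrows; the borrow checker proves the referent outlives the call, so no refcounting is needed
- **Swift**: "guaranteed" parameter convention, where the caller keeps the argument alive and the callee need not retain/release it just to read it
- **Lobster**: reference counting with compile-time ownership/borrow analysis that elides rc operations on borrowed parameters
- **Nim**: `sink` parameters (ownership transfer) versus ordinary borrowed parameters, plus `lent` for borrowed return values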

## Environment

- MoonBit compiler: latest (March 2026)
- Target: native (C backend)
- Profiler: valgrind/callgrind 3.26.0
- Platform: Linux x86_64

