100 changes: 96 additions & 4 deletions AGENTS.md
@@ -6,7 +6,9 @@ This file provides guidance when working with code in this repository.

## What This Is

`harper-pro` is the proprietary commercial layer on top of [Harper core](https://github.com/HarperFast/harper) (Apache-2.0). Most engineering conventions, build mechanics, and runtime behaviors are inherited from core — see Harper core's [AGENTS.md](https://github.com/HarperFast/harper/blob/main/AGENTS.md) for the full picture.
`harper-pro` is the proprietary commercial layer on top of [Harper core](https://github.com/HarperFast/harper) (Apache-2.0). Most engineering conventions, build mechanics, and runtime behaviors are inherited from core — **read [core/AGENTS.md](core/AGENTS.md) for the substrate's full picture**. This document covers only what's different or additive in Pro.

`core/` is a git submodule pointing at `HarperFast/harper`. Pro adds: cluster replication, license enforcement, clone-node bootstrap, CPU profiling, and the docker-compose dev workflow.

---

@@ -32,9 +34,99 @@ inside `.git/modules/core/`, remove them immediately — they are corrupting the

---

## Pro-specific notes
## Commands

```bash
# Build (Pro-only — core has its own build)
npm run build # tsc → dist/
npm run build:watch # incremental

# Lint / Format
npm run lint # oxlint --deny-warnings
npm run lint:fix
npm run format:write # prettier
npm run lint:required # quiet — for CI

# Tests — only integration here
npm run test:integration
npm run test:integration:all # all *.test.ts in integrationTests/

# Submodule
npm run core:sync # sync core submodule to its pinned commit
npm run core:set-branch # pin core to a different branch
```

The `cluster:*` scripts in `package.json` reference `utility/dev/docker-compose.*.yml` files that are not present in the repository — they're likely produced by a private dev tooling step. Don't expect them to work out of the box.

**No `test:unit` exists.** Pro relies on core's unit suite for the substrate it inherits. `test:integration` is slow — run only when the change plausibly affects integration behavior.

---

## Where should this change go? (Pro vs. core)

If you're not sure which repo to edit, use this rule of thumb:

| Change concerns… | Edit in |
| ----------------------------------------------------------- | -------------------------------------------------- |
| Tables, Resources, transactions, audit, storage format | `core/` |
| HTTP/WS/MQTT/GraphQL protocol handling, middleware | `core/` |
| Schema, validation, permissions | `core/` |
| Multi-node replication, cluster status, node membership | `harper-pro/replication/` |
| Initial node clone from a leader | `harper-pro/cloneNode/` |
| License validation/enforcement | `harper-pro/licensing/` |
| CPU profiling / pprof integration | `harper-pro/analytics/` |
| TLS cert signing for cluster auth | `harper-pro/security/` |
| `bin/harper.js` CLI behavior (component registration order) | `harper-pro/bin/` |
| Build / packaging / release scripts | `harper-pro/build-tools/` or `harper-pro/scripts/` |

When a feature spans both, prefer landing as much as possible in `core/` and gluing it together via a Pro-registered component.

---

## Repository map

### Pro source folders

- **`bin/`** — CLI entry points. `harper.js` is the main executable; loads `cloneNode` if `HDB_LEADER_URL` is set; registers `analytics`, `licensing`, `replication` components.
- **`replication/`** — multi-node replication subsystem. **See [replication/DESIGN.md](replication/DESIGN.md)** for the section index. The big file is `replication/replicationConnection.ts` (2288 lines).
- **`cloneNode/`** — `cloneNode.ts` (~30KB). One-shot replication from a leader during init when `HDB_LEADER_URL` is set. Auth via cert or credentials. Tests: `integrationTests/cloneNode/`.
- **`licensing/`** — usage license validation and enforcement. `usageLicensing.ts` (lifecycle, usage aggregation) and `validation.ts` (EdDSA signature verification).
- **`analytics/`** — CPU profiling via Datadog pprof. `profile.ts` is the entry. **Not the same as core's `resources/analytics/`** (which records request-level telemetry).
- **`security/`** — Pro-specific cryptography: `certificate.ts` (TLS signing/validation), `sshKeyOperations.ts`, `keyService.ts` (JWT + private-key resolution). **Core PKI lives in `core/security/`** — don't confuse them.

### Pro tests

- **`integrationTests/`** — end-to-end, runs full Harper instances. `run.mjs` is the custom test harness with shard support. Subdirs mirror source (`analytics/`, `cloneNode/`, `cluster/`, `licensing/`, `security/`).
- **`unitTests/`** — small. `testUtils.js` (mock helpers, db reset) and `setupTestApp.mjs` (in-memory app scaffold).

### Pro non-source

- **`build-tools/`** — `build-pro.sh` orchestrates the build; `sync-core.sh` syncs the core submodule; `download-prebuilds.js` fetches native prebuilds; `set-core-branch.sh` pins core's branch.
- **`scripts/`** — `patch-release.js` (~12KB). Cherry-picks PRs labeled `patch` from `main` onto a release branch in both core and Pro, bumps the version, syncs the submodule. See `CONTRIBUTING.md` for usage.
- **`dev/`** — `sync-commits.js`. One-time repo-migration utility, not part of normal runtime.
- **`static/`** — `defaultConfig.yaml` template, `ascii_logo.txt`.

### Submodule

- **`core/`** — the Harper OSS core (`HarperFast/harper`). Has its own AGENTS.md, DESIGN.md, and now per-folder DESIGN.md docs. **When touching substrate behavior, edit there, not here.**

Review comment (Member):

Implying the developer has another copy of harper on their system?

Should these instructions potentially include how to link the submodule to the local git copy of harper when doing cross-repo development?

Reply (Member Author):

> Should these instructions potentially include how to link the submodule to the local git copy of harper when doing cross-repo development?

Sure, I would also like instructions on how to do that! How do you do that?

Reply (Member):

`git submodule add /path/to/local/repo [folder_name]`


---

## Pro-specific conventions

- **Linter**: oxlint with `--deny-warnings` (`npm run lint`), same as core.
- **Tests**: only `npm run test:integration` exists here. There is no `test:unit` split — Pro relies on core for unit-test coverage of the substrate it inherits. `test:integration` is slow; run only when the change plausibly affects integration behavior.
- **Linter**: `oxlint --deny-warnings`, same as core.
- **Storage substrate**: same as core — RocksDB primary, LMDB available via `HARPER_STORAGE_ENGINE=lmdb`.
- **Documentation scope**: https://docs.harperdb.io is authoritative for Harper mechanics. Pro docs describe Pro-only surface, not core behavior.
- **Submodule pointer**: when changing core, commit there first, then bump the submodule pointer in Pro in a separate commit. Don't combine core changes with submodule bumps — they need to be reviewable separately.
- **Patch releases**: PRs that should land in a stable release branch must carry the **`patch`** label. See `CONTRIBUTING.md` for the patch-release workflow.

---

## Cross-references

- **[core/AGENTS.md](core/AGENTS.md)** — substrate architecture (Resources, Server, Components, Data Layer). Read first for substrate questions.
- **[core/DESIGN.md](core/DESIGN.md)** — non-obvious internals (RecordObject prototype, getFromSource timing, blob orphan cleanup).
- **[core/resources/DESIGN.md](core/resources/DESIGN.md)** — `Table.ts` and `Resource.ts` section indexes.
- **[core/server/DESIGN.md](core/server/DESIGN.md)** — HTTP/WS/MQTT layer + middleware ordering.
- **[replication/DESIGN.md](replication/DESIGN.md)** — Pro replication subsystem.
- **[CONTRIBUTING.md](CONTRIBUTING.md)** — patch release procedure; package-lock merge driver setup.
131 changes: 131 additions & 0 deletions replication/DESIGN.md
@@ -0,0 +1,131 @@
# replication/ — Navigation Guide

Real-time, peer-to-peer replication of table data across cluster nodes via persistent WebSocket connections. Implements eventual consistency: when a local transaction commits, the audit records are forwarded asynchronously to peers.

**Read this when:** you're touching cluster sync, debugging missed writes, JWT/cluster auth, latency-based node selection, or blob transfer between nodes.

**Integration boundary with core:** replication hooks into core's table resource layer — a `Replicator` class is installed as a `source` of the table (`table.sourcedFrom(class Replicator extends Resource {...})`). When a local cache miss occurs, the Replicator picks the lowest-latency peer and fetches. Core's audit store (`core/resources/auditStore.ts`) and node-id mapping (`core/resources/nodeIdMapping.ts`) are the two data structures replication reads.

> **Navigation convention.** Code is referenced by **symbol name** (class, function, exported const). Use your editor's go-to-symbol or `grep -n '<name>' replication/<file>` to jump. Line numbers drift; symbols don't.

---

## Files (6 total, ~4200 lines)

| File | Purpose |
| -------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `replicationConnection.ts` | The protocol engine. Defines `NodeReplicationConnection`, encodes/decodes the binary frame format, drives audit-record forwarding, manages blobs, and writes shared latency/back-pressure counters. **The big file.** |
| `replicator.ts` | Setup module: `start()`, per-database/per-table `Replicator` resource class, retrieval-connection pool, operation forwarding, mTLS config. |
| `subscriptionManager.ts` | Main-thread orchestration. Delegates subscription work to worker threads; routes around disconnects. |
| `setNode.ts` | Cluster member operations — add/remove nodes, CSR signing, TLS certificate negotiation. |
| `knownNodes.ts` | Node registry (`hdb_nodes` system table) + shared-memory `Float64Array` status buffers (latency, confirmation, back-pressure). |
| `clusterStatus.ts` | Read-only status reporting for `cluster_status` operation. |

---

## Key abstractions

### `NodeReplicationConnection` (`replicationConnection.ts`)

A persistent connection to one remote node. Owns the WebSocket lifecycle, reconnection (initial delay `INITIAL_RETRY_TIME`), latency tracking, and per-subscription state. **Inspect this when debugging connection drops or auth failures** (see issue #135 for the history of JWT/cluster auth bugs).

### `replicateOverWS(ws, options, authorization)` (`replicationConnection.ts`)

The protocol decoder. Reads incoming binary commands — each is a top-level named const in the same file:

| Command constant | Value | Meaning |
| ------------------------------------------ | --------- | ----------------------------------------- |
| `SUBSCRIPTION_REQUEST` | 129 | Client wants to subscribe to a table |
| `RESIDENCY_LIST` | 130 | Negotiate which records each node holds |
| `TABLE_FIXED_STRUCTURE` | 132 | Schema sync |
| `GET_RECORD` / `GET_RECORD_RESPONSE` | 133 / 134 | Cache-miss fetch |
| `OPERATION_REQUEST` / `OPERATION_RESPONSE` | 136 / 137 | Forwarded operations |
| `NODE_NAME` / `NODE_NAME_TO_ID_MAP` | 140 / 141 | Identity exchange |
| `DISCONNECT` | 142 | Graceful close (not used on auth failure) |
| `SEQUENCE_ID_UPDATE` | 143 | Audit sequence cursor |
| `COMMITTED_UPDATE` | 144 | Confirm-on-commit |
| `DB_SCHEMA` | 145 | Database schema replication |
| `BLOB_CHUNK` | 146 | Blob bytes |
| `SUBSCRIPTION_UPDATE` | 147 | Audit record forwarded to subscribers |
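
When reading replication logs or stepping through the decoder, a reverse lookup from numeric code to constant name can help. The sketch below copies the values from the table above; the `COMMANDS` object and the `describeCommand` helper are hypothetical and are not part of `replicationConnection.ts`.

```typescript
// Values copied from the command table; the grouping object is illustrative only.
const COMMANDS = {
  SUBSCRIPTION_REQUEST: 129,
  RESIDENCY_LIST: 130,
  TABLE_FIXED_STRUCTURE: 132,
  GET_RECORD: 133,
  GET_RECORD_RESPONSE: 134,
  OPERATION_REQUEST: 136,
  OPERATION_RESPONSE: 137,
  NODE_NAME: 140,
  NODE_NAME_TO_ID_MAP: 141,
  DISCONNECT: 142,
  SEQUENCE_ID_UPDATE: 143,
  COMMITTED_UPDATE: 144,
  DB_SCHEMA: 145,
  BLOB_CHUNK: 146,
  SUBSCRIPTION_UPDATE: 147,
} as const;

// Reverse map: numeric code -> constant name.
const COMMAND_NAMES = new Map<number, string>();
for (const [name, code] of Object.entries(COMMANDS)) {
  COMMAND_NAMES.set(code, name);
}

// Hypothetical debugging helper: codes not listed in the table are
// reported as UNKNOWN rather than guessed at.
function describeCommand(code: number): string {
  return COMMAND_NAMES.get(code) ?? `UNKNOWN(${code})`;
}
```

For example, `describeCommand(142)` yields `"DISCONNECT"`, while a code that the table does not list falls through to the `UNKNOWN(...)` form.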

The `authorization` parameter is a **promise that may resolve asynchronously**; on rejection the socket closes without a DISCONNECT frame (relevant to JWT failure flows).
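
The difference between the auth-failure path and a graceful close can be sketched as follows. This is a minimal illustration, not the code in `replicateOverWS`: the `SocketLike` interface, the `guardConnection` and `gracefulClose` helpers, the single-byte frame, and the close codes `1008` and `1000` are all assumptions made for the example.

```typescript
// Hypothetical minimal socket surface; the real code works with a WebSocket.
interface SocketLike {
  send(data: Uint8Array): void;
  close(code?: number, reason?: string): void;
}

const DISCONNECT = 142; // value from the command table

// Waits for authorization before handing the socket to the protocol handler.
async function guardConnection(
  ws: SocketLike,
  authorization: Promise<unknown>,
  onAuthorized: () => void,
): Promise<boolean> {
  try {
    await authorization;
  } catch {
    // Auth-failure path: close directly. No DISCONNECT frame is sent,
    // so the peer only observes the socket closing.
    ws.close(1008, "Unauthorized"); // close code is an assumption
    return false;
  }
  onAuthorized();
  return true;
}

// Graceful shutdown, by contrast, announces itself before closing.
// The one-byte frame layout here is an assumption for illustration.
function gracefulClose(ws: SocketLike): void {
  ws.send(Uint8Array.of(DISCONNECT));
  ws.close(1000);
}
```

A client written against this shape cannot rely on receiving DISCONNECT to detect rejection; it has to treat an unexpected close as a signal to retry.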

### `Replicator extends Resource` (`replicator.ts`)

A `Resource` class installed as a `source` of a table. Declared inside `setReplicator()` and passed to `table.sourcedFrom(...)`. Its `static async load(entry)` method picks the lowest-latency available node for cache-miss fetches.
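
The selection step can be pictured with a small sketch. The `PeerStatus` shape and the `pickLowestLatency` function are hypothetical stand-ins; the real `load` method works on Harper's own connection objects, which are not shown in this document.

```typescript
// Hypothetical view of a peer; field names are assumptions.
interface PeerStatus {
  name: string;
  connected: boolean;
  latencyMs: number; // last measured round-trip time
}

// Returns the connected peer with the smallest finite latency,
// or undefined when no peer is usable.
function pickLowestLatency(peers: PeerStatus[]): PeerStatus | undefined {
  let best: PeerStatus | undefined;
  for (const peer of peers) {
    if (!peer.connected || !Number.isFinite(peer.latencyMs)) continue;
    if (best === undefined || peer.latencyMs < best.latencyMs) best = peer;
  }
  return best;
}
```

Skipping disconnected peers and peers without a latency measurement keeps a stale or missing reading from winning the comparison.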

### Shared status buffers (`getReplicationSharedStatus` in `knownNodes.ts`)

Per (database, remote_node) pair: an mmap-backed `Float64Array` shared across threads, used to avoid IPC for hot-path status updates. Position constants live in `replicationConnection.ts`:

| Position | Constant |
| -------- | ------------------------------ |
| 0 | `CONFIRMATION_STATUS_POSITION` |
| 1 | `RECEIVED_VERSION_POSITION` |
| 2 | `RECEIVED_TIME_POSITION` |
| 3 | `SENDING_TIME_POSITION` |
| 4 | `LATENCY_POSITION` |
| 5 | `RECEIVING_STATUS_POSITION` |
| 6 | `BACK_PRESSURE_RATIO_POSITION` |

These are written concurrently by `replicationConnection.ts` without explicit synchronization. Don't introduce read-modify-write patterns on this buffer.
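
A minimal sketch of the safe and unsafe access patterns follows. It assumes a `SharedArrayBuffer` as a stand-in for the mmap-backed buffer, and the `POSITIONS` object, `createStatusBuffer`, and `recordLatency` are hypothetical names; only the slot indexes come from the table above.

```typescript
// Slot indexes as listed in the table above.
const POSITIONS = {
  CONFIRMATION_STATUS_POSITION: 0,
  RECEIVED_VERSION_POSITION: 1,
  RECEIVED_TIME_POSITION: 2,
  SENDING_TIME_POSITION: 3,
  LATENCY_POSITION: 4,
  RECEIVING_STATUS_POSITION: 5,
  BACK_PRESSURE_RATIO_POSITION: 6,
} as const;

const SLOT_COUNT = Object.keys(POSITIONS).length;

// Assumption: a SharedArrayBuffer stands in for the mmap-backed buffer.
function createStatusBuffer(): Float64Array {
  return new Float64Array(
    new SharedArrayBuffer(SLOT_COUNT * Float64Array.BYTES_PER_ELEMENT),
  );
}

// Safe pattern: the value is computed from local inputs and stored
// into one slot with a single write.
function recordLatency(status: Float64Array, sentAt: number, pongAt: number): void {
  status[POSITIONS.LATENCY_POSITION] = pongAt - sentAt;
}

// Pattern to avoid: reading a slot and writing back a derived value.
// Another thread may write between the read and the write, and its
// update would be lost:
//   status[pos] = status[pos] + 1;
```

Readers should likewise treat each slot as an independent, possibly slightly stale value rather than assuming that several slots were updated together.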

### `hdb_nodes` system table (`getHDBNodeTable` in `knownNodes.ts`)

Schema (defined in that function): `name` (PK), `subscriptions[]`, `system_info`, `url`, `routes`, `ca`, `ca_info`, `replicates`, `revoked_certificates`, plus `__createdtime__` / `__updatedtime__`. Subscription updates flow through `subscribeToNodeUpdates`, which fans out to `monitorNodeCAs`, which in turn refreshes `replicationCertificateAuthorities` (exported from `replicator.ts`).

---

## Subsystems

**Connection management** — `NodeReplicationConnection.connect()` (`replicationConnection.ts`), `subscriptionManager.startOnMainThread()`. Dial/retry, thread-pool delegation, recovery on disconnect.

**Binary protocol** — `replicateOverWS` (`replicationConnection.ts`); command constants are the `*_REQUEST` / `*_UPDATE` / `*_RESPONSE` consts at module top; msgpack body; back-pressure ratio recomputed on `BACK_PRESSURE_INTERVAL` (30 s).

**Data propagation** — Audit-record iteration → forwarding; blob streaming with concurrency cap `MAX_OUTSTANDING_BLOBS_BEING_SENT` (declared inside `replicateOverWS`); commit confirmation batched on `COMMITTED_UPDATE_DELAY` (2 ms).

**Latency awareness** — Ping every `PING_INTERVAL` (30 s); latency captured on pong; `Replicator.load()` routes cache-miss fetches to the lowest-latency node.

**Node discovery & TLS** — `hdb_nodes` subscriptions, `setNode.ts` for member ops, `buildReplicationMtlsConfig()` (`replicator.ts`), `monitorNodeCAs()` (`replicator.ts`).
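
To illustrate what batching commit confirmations on a short delay can look like, here is a sketch under stated assumptions. The `CommitConfirmationBatcher` class is hypothetical, and the choice to coalesce to the highest sequence id seen in the window is an assumption; only the 2 ms value of `COMMITTED_UPDATE_DELAY` comes from the text above.

```typescript
const COMMITTED_UPDATE_DELAY = 2; // ms, as stated above

// Hypothetical batcher: remembers the highest committed sequence id and
// emits one confirmation per delay window instead of one per commit.
class CommitConfirmationBatcher {
  private pending: number | undefined;
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly flush: (sequenceId: number) => void;
  private readonly delayMs: number;

  constructor(flush: (sequenceId: number) => void, delayMs: number = COMMITTED_UPDATE_DELAY) {
    this.flush = flush;
    this.delayMs = delayMs;
  }

  committed(sequenceId: number): void {
    this.pending =
      this.pending === undefined ? sequenceId : Math.max(this.pending, sequenceId);
    if (this.timer !== undefined) return; // a flush is already scheduled
    this.timer = setTimeout(() => {
      const latest = this.pending;
      this.pending = undefined;
      this.timer = undefined;
      if (latest !== undefined) this.flush(latest);
    }, this.delayMs);
  }
}
```

With this shape, a burst of commits inside one window produces a single outgoing confirmation, which trades a small amount of latency for far fewer frames on the wire.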

---

## Non-obvious behaviors

1. **Auth failures don't send DISCONNECT.** When the `authorization` promise rejects in `replicateOverWS`, the connection closes with "Unauthorized" but no DISCONNECT frame is sent; the client is expected to retry. JWT/cluster auth bugs trace back to this behavior (issue #135).

2. **Origin loop prevention via delayed sequence updates.** A node receiving its own message (checked against `remoteToLocalNodeId`) skips local processing but still forwards. To avoid feedback loops, the sequence-update emit is delayed by `SKIPPED_MESSAGE_SEQUENCE_UPDATE_DELAY` (300 ms; in `replicationConnection.ts`).

3. **Blob back-pressure & timeout.** Blobs time out after `blobTimeout` (default 120 s); concurrent sends are capped at `MAX_OUTSTANDING_BLOBS_BEING_SENT = 5`; back-pressure ratio (computed every `BACK_PRESSURE_INTERVAL`) tells senders to pause. If you're seeing large-data replication hangs, look here first.

4. **Shared-buffer concurrency.** The Float64Array status buffers are touched from multiple threads with no lock. Treat them as eventually consistent; use the callback param of `subscribeToNodeUpdates` if you need notification.
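
The concurrency cap from item 3 can be pictured as a small limiter. This is a sketch, not the logic inside `replicateOverWS`: the `SendLimiter` class and its queueing strategy are assumptions, and only the cap value of 5 comes from the text above.

```typescript
const MAX_OUTSTANDING_BLOBS_BEING_SENT = 5; // value from item 3 above

// Hypothetical limiter: at most `limit` tasks run at once; the rest wait in FIFO order.
class SendLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number = MAX_OUTSTANDING_BLOBS_BEING_SENT) {
    this.limit = limit;
  }

  get outstanding(): number {
    return this.active;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next(); // slot is transferred, so `active` stays unchanged
      else this.active--;
    }
  }
}
```

If a send never settles, its slot is never released, which is one way a cap like this can surface as a replication hang when combined with a long timeout.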

---

## Tests

**Integration tests** live in `../integrationTests/cluster/`:

| File | Purpose |
| ------------------------------------ | ------------------------------------------------ |
| `clusterShared.mjs` | Shared fixture/helper (cluster boot, node setup) |
| `fullyConnectedReplication.test.mjs` | Full-mesh topology |
| `replicationTopology.test.mjs` | Dynamic membership changes |
| `replicationLoad.test.mjs` | Concurrent-write load |

There is no dedicated `unitTests/replication/` directory — replication is exercised entirely via integration tests that spin up multi-node clusters.

---

## "Where is X" cheat sheet

| Question | Where |
| ----------------------------------------- | ---------------------------------------------------------------------------------------------- |
| Where does a remote message get decoded? | `replicationConnection.ts → replicateOverWS` |
| Where do cache-miss fetches pick a peer? | `replicator.ts → Replicator.load` (declared inside `setReplicator`) |
| Where is the connection retry loop? | `replicationConnection.ts → NodeReplicationConnection` (uses `INITIAL_RETRY_TIME`) |
| Where is mTLS configured? | `replicator.ts → buildReplicationMtlsConfig` |
| Where is a new cluster member added? | `setNode.ts` (the whole file is one operation) |
| Where are protocol message types defined? | `replicationConnection.ts` — top-level consts (`SUBSCRIPTION_REQUEST` … `SUBSCRIPTION_UPDATE`) |
| Where is `hdb_nodes` schema? | `knownNodes.ts → getHDBNodeTable` |
| What does `cluster_status` return? | `clusterStatus.ts` (82 lines, whole file) |