MNEMOS Embedkit: NPU-accelerated semantic memory for agentic AI on Cix Sky1 #22

@perlowja

The MNEMOS Embedkit is the embedding layer that ships with NCZ 26.5 Magnetar. It provides a production-grade semantic memory backend that runs entirely on-device, with no cloud dependency.

What it does

The Embedkit provides a unified embedding interface with automatic backend selection:

from embedkit import Engine

engine = Engine.auto()  # picks NPU on Cix Sky1, CPU ONNX elsewhere
vector = engine.embed("What does the agent remember about the user?")

On Cix Sky1, Engine.auto() detects /dev/aipu (the Zhouyi V3 NPU device) and routes to the npu-cix adapter using libnoe and a compiled .cix model. On other hardware it falls back to cpu-llamacpp or cpu-fastembed (ONNX).
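The selection rule described above can be sketched in a few lines. This is only an illustration of the idea, not the Embedkit implementation: `pick_backend` and its `has_fastembed` parameter are hypothetical names, and the preference for fastembed over llama.cpp on non-NPU hardware is an assumption drawn from the ordering used elsewhere in this issue.

```python
import importlib.util
import os

NPU_DEVICE = "/dev/aipu"  # device node named in the description above


def pick_backend(device_path=NPU_DEVICE, has_fastembed=None):
    """Return the adapter name a selector like Engine.auto() might choose.

    Hypothetical helper: the real selection logic is not published yet.
    """
    # The NPU adapter is chosen whenever the device node is present.
    if os.path.exists(device_path):
        return "npu-cix"
    # Assumed ordering: prefer fastembed (ONNX) if installed, else llama.cpp.
    if has_fastembed is None:
        has_fastembed = importlib.util.find_spec("fastembed") is not None
    return "cpu-fastembed" if has_fastembed else "cpu-llamacpp"


if __name__ == "__main__":
    print(pick_backend())
```

Passing `device_path` and `has_fastembed` explicitly makes the rule easy to exercise on a machine without the NPU.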

Validated performance (Cix Sky1)

| Backend | Throughput | Notes |
| --- | --- | --- |
| NPU (libnoe + bge-small-zh.cix) | 30+ emb/sec sustained | Per-call job overhead; persistent-job path coming |
| CPU ONNX (fastembed) | 700+ emb/sec batched | Leaves NPU + GPU free for other workloads |

Model: bge-small-zh-v1.5 (512-dimensional embeddings, 256-token context). Throughput was validated over sustained runs of 2000 calls on an MS-R1 64GB.
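For anyone who wants to reproduce a sustained-throughput figure, a minimal timing loop might look like the following. It is a sketch, not the harness used for the numbers above: `measure_throughput` and `fake_embed` are hypothetical names, and the stand-in embedder only returns a 512-element list so the loop can run without Embedkit installed.

```python
import time


def measure_throughput(embed, texts, calls=2000):
    """Call embed() `calls` times, cycling through texts; return embeddings/sec."""
    start = time.perf_counter()
    for i in range(calls):
        embed(texts[i % len(texts)])
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return calls / elapsed if elapsed > 0 else float("inf")


def fake_embed(text):
    """Stand-in embedder (assumption): a 512-dim zero vector, no real model."""
    return [0.0] * 512


if __name__ == "__main__":
    rate = measure_throughput(fake_embed, ["first memory", "second memory"])
    print(f"{rate:.0f} emb/sec")
```

With a real engine, the same loop could presumably be pointed at `engine.embed` from the snippet earlier in this issue.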

Architecture

Agent query
    ↓
MNEMOS semantic memory API
    ↓
Embedkit Engine.auto()
    ├── NPU adapter (libnoe + .cix model) — Cix Sky1
    ├── CPU ONNX adapter (fastembed) — any arm64/x86
    └── llama.cpp adapter (CPU/Vulkan) — fallback
    ↓
Vector embedding → MNEMOS vector store → retrieval results

The NPU runs concurrently with the Mali-G720 GPU, which handles LLM decode via llama.cpp Vulkan. Embedding and decode run on separate accelerators, so there is no resource contention and both paths stay warm.
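The final stage of the diagram (vector embedding → vector store → retrieval results) can be illustrated with a tiny in-memory store ranked by cosine similarity. This is a sketch under stated assumptions: `TinyVectorStore` and `toy_embed` are hypothetical and do not reflect the MNEMOS vector store or its API, and the toy embedder is a letter-frequency vector rather than a real model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embed(text):
    """Stand-in embedder (assumption): 26-dim letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


class TinyVectorStore:
    """Hypothetical in-memory store: embed on insert, rank by cosine on search."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, self.embed(text)))

    def search(self, query, k=3):
        q = self.embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    store = TinyVectorStore(toy_embed)
    store.add("coffee with oat milk")
    store.add("zzz")
    print(store.search("coffee", k=1))
```

Swapping `toy_embed` for a real embedding callable is the only change the sketch would need; the ranking step stays the same.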

Status

  • npu-cix adapter: validated on Cix Sky1 / MS-R1
  • cpu-fastembed adapter: validated on arm64 and x86
  • 🔜 Bundled in ISO: currently installs post-boot via ncz install mnemos; will be baked into the next Magnetar release
  • 🔜 70–80 emb/sec on NPU: pending upstream libnoe persistent-job fix (tracked with cixtech)
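The gap between the current 30+ emb/sec and the 70–80 emb/sec target is easier to reason about as per-call latency. The snippet below is only a back-of-envelope conversion of the figures quoted in this issue; attributing the whole difference to per-call job setup is an assumption, not a measurement, and `ms_per_embedding` is a hypothetical helper.

```python
def ms_per_embedding(emb_per_sec):
    """Convert a throughput in embeddings/sec to milliseconds per embedding."""
    return 1000.0 / emb_per_sec


if __name__ == "__main__":
    current = ms_per_embedding(30)      # today's sustained NPU figure
    target_low = ms_per_embedding(70)   # lower end of the stated target
    target_high = ms_per_embedding(80)  # upper end of the stated target

    print(f"current: {current:.1f} ms per embedding")
    print(f"target:  {target_high:.1f} to {target_low:.1f} ms per embedding")
    # Time that would need to disappear per call to reach the target range.
    print(f"implied saving: {current - target_low:.1f} to "
          f"{current - target_high:.1f} ms per call")
```

Read this way, reaching the target means removing roughly 20 ms from each call.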

Install (current)

On a running Magnetar system:

ncz install mnemos   # pulls MNEMOS server + Embedkit

The Embedkit source will be published to the MNEMOS organization shortly.


NCZ 26.5 Magnetar: https://github.com/nclawzero/distro/releases/tag/v26.5-magnetar



