feat: 接入STT语音识别 by ZhuangHT-1201 · Pull Request #1 · MINDS-THU/ThisIsMyDepartment.AI

ZhuangHT-1201 · 2026-03-27T10:17:08Z

功能描述：

本项目实现了基于离线模型的实时语音转文字（STT）功能。用户可以通过界面上的开关控制录音，识别结果将实时反馈至人物气泡或对话框中。

具体技术方案与改进：

模型精度升级：将原始的tiny模型升级为Faster-Whisper base模型。实测在中文语境下的断句与语义识别准确率提升，能更好地支撑复杂指令的识别。
输出规范化（简体中文）：优化了后端的 initial_prompt 逻辑，约束模型输出简体中文。同时修复了在静音或低信噪比环境下，模型会产生“幻觉”复述 Prompt 指令的 Bug。
仓库瘦身：删除了本地冗余的 scripts/models/tiny 静态资源文件夹（约 75MB）。改为按需自动下载并利用缓存，减小了代码库的体积。
VAD 与媒体流稳定性优化：在 SpeechToTextService.ts 中重构了语音采集逻辑，引入VAD（语音活动检测）。系统仅在检测到有效语音且停顿 1s 后发送带完整 Header 的 WebM 音频包，解决了后端解析报错（Invalid data found）的问题。
交互控制集成：在前端 HUD 界面新增了 SPEECH（转录）和 MIC（麦克风）的逻辑联动按钮；实现了漫游气泡与对话框面板的文字分流逻辑，确保交互体验的顺滑。
开发工具集：新增 start_dev.bat 一键启动脚本，支持前端自动编译及多端微服务同步拉起

Copilot

Pull request overview

本 PR 旨在新增本地离线 STT（语音转文字）能力，并把识别文本联动到现有聊天气泡与右侧对话面板，同时尝试修复本地开发下“分身/克隆人”相关问题。

Changes:

新增前端 SpeechToTextService：采集麦克风音频并分片上传到本地 STT 服务端
新增本地 STT FastAPI 微服务脚本 scripts/stt_server.py（faster-whisper tiny 模型）
UI 与消息链路接入：状态面板新增 STT 开关；在 App 内将识别文本注入对话与角色头顶气泡

Reviewed changes

Copilot reviewed 6 out of 9 changed files in this pull request and generated 14 comments.

Show a summary per file

File	Description
src/main/ui/CharacterStatusOverlay.ts	增加 STT 开关按钮并刷新按钮状态
src/main/services/SpeechToTextService.ts	新增浏览器端录音分片与本地转写请求逻辑
src/main/nodes/CharacterNode.ts	调整 `say()` 行为以支持头顶气泡文本展示
src/main/ThisIsMyDepartmentApp.ts	STT 初始化/回调接入；媒体开关联动；以及若干线上/目录/Spawned avatar 逻辑调整
scripts/stt_server.py	新增本地 STT 转写 API 服务
scripts/models/tiny/config.json	新增 tiny 模型配置文件

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-29T11:03:09Z

scripts/stt_server.py

+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,


CORS is configured with allow_origins=["*"] while also setting allow_credentials=True, which yields invalid CORS behavior in browsers and is overly permissive for a localhost API. Restrict allow_origins to the app origin(s) and/or set allow_credentials=False.

Suggested change

allow_credentials=True,

allow_credentials=False,

Copilot · 2026-03-29T11:03:10Z

src/main/ThisIsMyDepartmentApp.ts

+        const offsetOptions = [
+            { x: 28, y: 0 },
+            { x: -28, y: 0 },
+            { x: 0, y: 28 },
+            { x: 0, y: -28 },


This method now chooses a single deterministic offset and returns it without checking collisions/navigability. This can place spawned avatars inside walls/props. Consider reintroducing the previous multi-candidate + collision-probe selection before falling back.

Copilot · 2026-03-29T11:03:10Z

src/main/ThisIsMyDepartmentApp.ts

                if (existingNode) {
-                    this.pendingSpawnedAvatarReplacements.set(presence.ownerUserId, presence);
                    if (this.activeLLMConversation?.agent === existingNode) {
                        this.closeLLMConversation(existingNode);
                    }
-                    this.despawnSpawnedAvatarNode(presence.ownerUserId, existingNode);
-                    return;
+                    existingNode.remove();
                }


This change removes the spawned-avatar despawn helper(s) (including the particle/animation path). However, the class still calls this.emitSpawnedAvatarDespawnParticles(...) in environment avatar spawn/despawn code, and that method no longer exists in the class, causing a compile error. Either restore the helper method(s) or update the environment-avatar animation code to not depend on them.

Copilot · 2026-03-29T11:03:11Z

src/main/ui/CharacterStatusOverlay.ts

+        this.sttToggleButton = this.createMediaButton(app.isTranscriptionEnabled ? "STT ON" : "STT OFF", () => {
+            const currentState = this.app?.isTranscriptionEnabled ?? true;
+            this.app?.toggleTranscriptionSetting(!currentState);
+            this.refresh();
+        });


The STT toggle button uses hard-coded labels ("STT ON/OFF") instead of the existing i18n pattern (app.t(...)) used for the audio/video buttons. Add translation keys and use app.t(...) so the UI stays localized and consistent.

Suggested change

this.sttToggleButton = this.createMediaButton(app.isTranscriptionEnabled ? "STT ON" : "STT OFF", () => {

const currentState = this.app?.isTranscriptionEnabled ?? true;

this.app?.toggleTranscriptionSetting(!currentState);

this.refresh();

});

this.sttToggleButton = this.createMediaButton(

app.isTranscriptionEnabled ? app.t("status.stt.on") : app.t("status.stt.off"),

() => {

const currentState = this.app?.isTranscriptionEnabled ?? true;

this.app?.toggleTranscriptionSetting(!currentState);

this.refresh();

}

);

Copilot · 2026-03-29T11:03:11Z

src/main/ThisIsMyDepartmentApp.ts

@@ -9,9 +9,6 @@ import { Game } from "../engine/Game";
 import { OnlineService } from "../engine/online/OnlineService";
 import { Camera } from "../engine/scene/Camera";
 import { FadeToBlack } from "../engine/scene/camera/FadeToBlack";
-import { Animator } from "../engine/scene/animations/Animator";
-import { ScenePointerDownEvent } from "../engine/scene/events/ScenePointerDownEvent";
-import { SimpleDirection } from "../engine/geom/Direction";
 import { Direction } from "../engine/geom/Direction";


ThisIsMyDepartmentApp still references Animator, ScenePointerDownEvent, and SimpleDirection (e.g., in environment avatar placement handlers and despawn animations), but those imports were removed from this file. This will fail TypeScript compilation; re-add the missing imports or remove the dependent code paths consistently.

Copilot · 2026-03-29T11:03:12Z

src/main/ThisIsMyDepartmentApp.ts


 type SpawnedAvatarPresence = NonNullable<RoomInfoEvent["spawnedAvatars"]>[number];
-type EnvironmentAvatarPlacementMode = "position" | "walk-area";
-
-interface EnvironmentAvatarPlacementResult {
-    position?: { x: number; y: number };
-    walkArea?: { x: number; y: number; width: number; height: number };
-}
-
-interface EnvironmentAvatarPlacementSession {
-    avatar: EditableEnvironmentAvatar;
-    mode: EnvironmentAvatarPlacementMode;
-    previewNode: EnvironmentPlacementPreviewNode;
-    dragAnchor?: { x: number; y: number };
-    resolve: (result: EnvironmentAvatarPlacementResult | null) => void;
-}

 const RIGHT_ELEVATOR_SPAWN_POINT = { x: 1488, y: 704 };
 const SPAWNED_AVATAR_WANDER_SIZE = 120;


The environment-avatar placement helper types (EnvironmentAvatarPlacementMode / EnvironmentAvatarPlacementResult / EnvironmentAvatarPlacementSession) were removed from this file, but the class still uses them (e.g., environmentAvatarPlacementSession field and beginEnvironmentAvatarPlacement). Restore these type declarations (or import them) so the file compiles.

Copilot · 2026-03-29T11:03:12Z

src/main/ThisIsMyDepartmentApp.ts

+        if (enabled) {
+            void this.sttService?.start();
+        } else {
+            this.sttService?.stop();
+        }


This block duplicates the STT start/stop logic that already runs immediately above (gated by enabled && isTranscriptionEnabled). Remove this duplicate start/stop to avoid redundant start() calls and inconsistent recorder state.

Suggested change

if (enabled) {

void this.sttService?.start();

} else {

this.sttService?.stop();

}

Copilot · 2026-03-29T11:03:12Z

src/main/services/SpeechToTextService.ts

+            this.setupRecorder(stream);
+            console.log("STT Service: 本地语音采集已启动");
+        } catch (error) {
+            console.error("STT Service: 麦克风获取失败", error);


If getUserMedia fails, isEnabled remains true, leaving the service stuck in an enabled-but-not-recording state and preventing retries. Set isEnabled = false in this catch path (and consider surfacing the error to the UI).

Suggested change

console.error("STT Service: 麦克风获取失败", error);

console.error("STT Service: 麦克风获取失败", error);

this.isEnabled = false;

Copilot · 2026-03-29T11:03:12Z

src/main/services/SpeechToTextService.ts

+        this.mediaRecorder.start();
+        this.recordingInterval = setInterval(() => {
+            if (this.mediaRecorder && this.mediaRecorder.state === "recording") {
+                this.mediaRecorder.stop();
+                this.mediaRecorder.start();
+            }
+        }, 4000);


Calling mediaRecorder.stop() and immediately mediaRecorder.start() can throw InvalidStateError because stop() completes asynchronously in many browsers. Prefer mediaRecorder.start(timeslice) for periodic chunks, or restart recording from an onstop handler.

Suggested change

this.mediaRecorder.start();

this.recordingInterval = setInterval(() => {

if (this.mediaRecorder && this.mediaRecorder.state === "recording") {

this.mediaRecorder.stop();

this.mediaRecorder.start();

}

}, 4000);

// 使用 timeslice 参数每 4 秒生成一次音频分片，避免 stop 后立即 start 带来的 InvalidStateError

this.mediaRecorder.start(4000);

Copilot · 2026-03-29T11:03:13Z

src/main/ThisIsMyDepartmentApp.ts

-}

 const RIGHT_ELEVATOR_SPAWN_POINT = { x: 1488, y: 704 };
 const SPAWNED_AVATAR_WANDER_SIZE = 120;


SPAWNED_AVATAR_DESPAWN_DURATION_MS and related spawned-avatar constants were deleted near the top, but the file still uses SPAWNED_AVATAR_DESPAWN_DURATION_MS in environment avatar spawn/despawn animations. This will break the build; either restore the constant(s) or replace usages with a new, locally-scoped duration constant.

Suggested change

const SPAWNED_AVATAR_WANDER_SIZE = 120;

const SPAWNED_AVATAR_WANDER_SIZE = 120;

const SPAWNED_AVATAR_DESPAWN_DURATION_MS = 1000;

cyrilli · 2026-03-29T13:10:42Z

@codex[agent] apply changes based on the comments in this thread

chatgpt-codex-connector · 2026-03-29T13:14:36Z

Summary

Updated handleRoomInfoUpdate to rebuild spawned-avatar presences from event.spawnedAvatars using connected owner user IDs, instead of the previous inverted filter logic. This prevents stale/disconnected avatar presences from being retained while keeping valid connected owners synchronized. src/main/ThisIsMyDepartmentApp.tsL2883-L2904
Kept the environment-avatar upsert flow intact before spawned-avatar reconciliation, so room-sync behavior remains consistent. src/main/ThisIsMyDepartmentApp.tsL2883-L2886

Testing

✅ npm run compile
✅ git status --short && git branch --show-current && git log --oneline -n 3
✅ git add src/main/ThisIsMyDepartmentApp.ts && git commit -m "fix: filter spawned avatar presences by connected owners" && git show --stat --oneline -n 1

Also completed:

Created commit: b7bdd03
Opened follow-up PR via make_pr with title: “fix: preserve room sync by filtering spawned avatars via connected owners”.

View task →

Implemented changes: - proxy browser STT requests through the existing Node backend and auto-start the local Python worker - add a live STT indicator to the dialogue window header - restrict STT capture to active dialogue sessions only and discard audio captured before the dialogue opened - prevent STT transcripts from appearing as scene speech bubbles over characters - localize and polish the Chinese STT toggle label in the character status panel TODO before this PR is production-usable: 1. Improve speech recognition accuracy. Current transcription quality is weak for fast or blurry speech and sentence segmentation is not reliable. 2. Validate scalability with many concurrent users online. The current local worker architecture may become a bottleneck and needs load/performance review.

cyrilli · 2026-03-29T14:15:42Z

I think this PR is heading in a useful direction, but there are still two areas that need to be addressed before it is ready for practical use.

First, the speech recognition quality is not yet reliable enough. In its current state, sentence segmentation is inconsistent, and transcription accuracy drops noticeably when the speaker talks quickly or unclearly. Since this directly affects the dialogue experience, I think the PR should include further work on chunking strategy, model configuration, or post-processing so that sentence boundaries and common spoken input are handled more robustly.

Second, the scalability of the current approach is still unclear. The implementation now depends on a local worker process for transcription, but it is not obvious how this will behave when many users are online and using STT at the same time. We should have at least some discussion or validation around CPU and memory usage, request concurrency, queueing latency, and the likely bottlenecks of this architecture before treating it as a production-ready solution.

ZhuangHT-1201 added 3 commits March 27, 2026 18:06

feat: 接入轻量级离线 STT 语音识别与 AI 对话联动

e94b593

feat: 新增 STT 语音转录 UI 开关与状态联动

dd94cf2

Merge branch 'main' into main

7397328

cyrilli assigned Copilot Mar 29, 2026

cyrilli requested a review from Copilot March 29, 2026 10:56

Copilot started reviewing on behalf of cyrilli March 29, 2026 10:56 View session

Copilot AI reviewed Mar 29, 2026

View reviewed changes

cyrilli assigned Codex Mar 29, 2026

feat: 实现语音转文字功能，集成Faster-Whisper base模型并优化VAD检测

ad6e28d

ZhuangHT-1201 changed the title ~~feat: 接入本地离线STT语音识别与对话框联动~~ feat: 接入STT语音识别 Mar 31, 2026

	console.error("STT Service: 麦克风获取失败", error);
	console.error("STT Service: 麦克风获取失败", error);
	this.isEnabled = false;

	const SPAWNED_AVATAR_WANDER_SIZE = 120;
	const SPAWNED_AVATAR_WANDER_SIZE = 120;
	const SPAWNED_AVATAR_DESPAWN_DURATION_MS = 1000;

Conversation

ZhuangHT-1201 commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

功能描述：

具体技术方案与改进：

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

cyrilli commented Mar 29, 2026

Uh oh!

chatgpt-codex-connector bot commented Mar 29, 2026

Summary

Uh oh!

cyrilli commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ZhuangHT-1201 commented Mar 27, 2026 •

edited

Loading