docs: add F19 cluster inference specs to PRD + TDD

- PRD v2.7: F19 multi-device cluster inference feature spec
- TDD v1.6: 8.5.15 cluster architecture, dispatcher, pipeline, API design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 10:33:55 +08:00 · 2026-03-02 10:33:55 +08:00 · 0103a483b8
commit 0103a483b8
parent f322557af3
2 changed files with 203 additions and 7 deletions
--- a/docs/PRD-Integrated.md
+++ b/docs/PRD-Integrated.md
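@@ -8,8 +8,8 @@
 |------|------|
 | Document name | Edge AI Development Platform PRD |
 | Product name | (Not yet named; referred to below as "the platform") |
-| Version | v2.6 |
-| Date | 2026-03-01 |
+| Version | v2.7 |
+| Date | 2026-03-02 |
 | Status | Being updated |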

 ---
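@@ -1255,6 +1255,21 @@ Kneron Dongle           Arduino dev board           Non-Kneron chip
 | **Error 40 handling** | KL520: `restartBridge()` → recreate the entire Python subprocess → reconnect + reload the model. KL720: retry directly first → fall back to a bridge restart only if the retry fails |
 | **Verification results** | KL520: 8 objects detected correctly in the street scene; KL720: KDP→KDP2 firmware update succeeded (0x0200→0x0720), and the connect/flash/inference flow was verified |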

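+#### F19 - Multi-Device Cluster Inference
+
+| Item | Specification |
+|------|------|
+| **Overview** | Lets users combine multiple Kneron edge devices (KL520 and KL720 can be mixed) into an inference cluster. A weighted dispatch mechanism distributes image frames across the devices for parallel inference, and the merged results are emitted as a single stream, achieving linear throughput scaling |
+| **Cluster management** | Create/delete clusters and add/remove member devices; cluster configuration is held in memory only (MVP stage) and must be recreated after a restart |
+| **Heterogeneous support** | A single cluster can mix KL520 + KL720; the system automatically selects the correct NEF file for each device's chip type (via the existing `resolveModelPath` mechanism) |
+| **Weighted dispatch** | Frames are allocated dynamically according to device compute capability: KL720 defaults to weight 3 and KL520 to weight 1 (user-adjustable); uses the Weighted Round-Robin algorithm |
+| **Unified results** | Inference results from all devices are merged into a single channel and tagged with the source device ID and frame sequence number (frameIndex); the frontend can sort by frame order or display the stream directly |
+| **Cluster flashing** | One click flashes the same logical model to every device in the cluster (the system automatically picks the matching NEF for each chip); flashing progress is tracked independently per device |
+| **Cluster workspace** | A single workspace UI displays the cluster inference stream; a side panel shows per-device FPS/latency and overall cluster FPS |
+| **Fault tolerance** | A device whose inference fails is automatically marked degraded, and the cluster keeps running on the remaining devices; a recovered device can rejoin |
+| **WebSocket** | Cluster inference results are broadcast to the `inference:cluster:{clusterId}` room |
+| **Limits** | MVP stage: at most 8 devices per cluster; cluster configuration is not persisted; cross-machine clusters are not supported |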
+
 ---

 ## B5. Feature Roadmap (Post-MVP)
@@ -1441,4 +1456,4 @@ Phase 3 - Advanced features (long-term differentiation)

 ---

-*Document version: v2.4 | Date: 2026-02-28 | Status: Being updated*
+*Document version: v2.7 | Date: 2026-03-02 | Status: Being updated*
--- a/docs/TDD.md
+++ b/docs/TDD.md
@@ -7,9 +7,9 @@
 | Item | Content |
 |------|------|
 | Document name | Edge AI Development Platform TDD |
-| Corresponding PRD | PRD-Integrated.md v2.6 |
-| Version | v1.5 |
-| Date | 2026-03-01 |
+| Corresponding PRD | PRD-Integrated.md v2.7 |
+| Version | v1.6 |
+| Date | 2026-03-02 |
 | Status | Being updated |

 ---
@@ -2110,6 +2110,187 @@ builds:
 # Windows: packaged as Setup.exe with NSIS
 ```

+#### 8.5.15 Multi-Device Cluster Inference (F19)
+
+| Frontend component | Backend module | Description |
+|---------|---------|------|
+| `components/cluster/cluster-list.tsx` | `internal/cluster/manager.go` | Cluster list and CRUD |
+| `components/cluster/cluster-card.tsx` | `internal/cluster/dispatcher.go` | Weighted dispatch engine |
+| `components/cluster/cluster-performance.tsx` | `internal/cluster/pipeline.go` | Cluster inference pipeline |
+| `app/workspace/cluster/[clusterId]/` | `internal/cluster/types.go` | Cluster data structures |
+| `stores/cluster-store.ts` | `internal/api/handlers/cluster_handler.go` | Cluster REST API |
+| `hooks/use-cluster-inference-stream.ts` | `internal/api/ws/cluster_inference_ws.go` | Cluster inference WS |
+| `types/cluster.ts` | `internal/api/ws/cluster_flash_ws.go` | Cluster flashing WS |
+
+**Cluster inference architecture:**
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│                      Cluster Manager                                 │
+│  clusters map[string]*Cluster                                        │
+│  ┌────────────────────────────────────────────────────────────────┐  │
+│  │ Cluster {ID, Name, Devices[], ModelID, Status}                │  │
+│  └────────────────────────────────────────────────────────────────┘  │
+└──────────┬───────────────────────────────────────────────────────────┘
+           │
+           ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│                   Cluster Inference Pipeline                         │
+│                                                                      │
+│  FrameSource ──► Dispatcher ──┬──► Device A (KL720, w=3) ──┐        │
+│  (camera/     (weighted       ├──► Device B (KL520, w=1) ──┤        │
+│   image/       round-robin)   └──► Device C (KL720, w=3) ──┤        │
+│   video)                                                    │        │
+│                                                             ▼        │
+│                                                    Result Collector  │
+│                                                    (merge + tag      │
+│                                                     frameIndex +     │
+│                                                     deviceId)        │
+│                                                             │        │
+│                                                             ▼        │
+│                                                    resultCh ──► WS   │
+│                                                    "inference:       │
+│                                                     cluster:{id}"    │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+**Core data structures (`internal/cluster/types.go`):**
+
+```go
+type ClusterStatus string  // "idle" / "inferencing" / "degraded"
+
+type DeviceMember struct {
+    DeviceID string  `json:"deviceId"`
+    Weight   int     `json:"weight"`    // dispatch weight: KL720=3, KL520=1
+    Status   string  `json:"status"`    // "active", "degraded", "removed"
+}
+
+type Cluster struct {
+    ID       string          `json:"id"`
+    Name     string          `json:"name"`
+    Devices  []DeviceMember  `json:"devices"`
+    ModelID  string          `json:"modelId,omitempty"`
+    Status   ClusterStatus   `json:"status"`
+}
+
+type ClusterResult struct {
+    *driver.InferenceResult
+    ClusterID  string `json:"clusterId"`
+    FrameIndex int64  `json:"frameIndex"`
+}
+```
+
+**Weighted dispatch algorithm (`internal/cluster/dispatcher.go`):**
+
+```go
+// Weighted Round-Robin dispatcher
+// Example: devices A(w=3), B(w=1), C(w=3)
+// Dispatch sequence: A, A, A, B, C, C, C, A, A, A, B, ...
+type Dispatcher struct {
+    members    []DeviceMember
+    drivers    []driver.DeviceDriver
+    current    int     // index of the current device
+    remaining  int     // frames remaining for the current device
+    frameIndex int64   // global frame counter
+    mu         sync.Mutex
+}
+
+func (d *Dispatcher) Next() (driver.DeviceDriver, int64, error)  // skips degraded devices
+func (d *Dispatcher) MarkDegraded(deviceID string)                // mark a device as failed
+func (d *Dispatcher) MarkActive(deviceID string)                  // restore a device to normal
+func (d *Dispatcher) ActiveCount() int                            // number of available devices
+```
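
The dispatch order described above can be reproduced with a small standalone sketch. This is only an illustration under assumptions: the `member` and `wrr` types and their helpers are hypothetical simplifications (device IDs instead of `driver.DeviceDriver` values), not the contents of `dispatcher.go`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// member is a simplified, hypothetical stand-in for DeviceMember and its driver.
type member struct {
	id       string
	weight   int
	degraded bool
}

// wrr is a sketch of a weighted round-robin dispatcher.
type wrr struct {
	mu         sync.Mutex
	members    []member
	current    int   // index of the device whose turn it is
	remaining  int   // frames left in the current device's turn
	frameIndex int64 // global frame counter, starting at 0
}

var errNoActive = errors.New("cluster has no active device")

func newWRR(ms []member) *wrr {
	d := &wrr{members: ms}
	if len(ms) > 0 {
		d.remaining = ms[0].weight
	}
	return d
}

// next returns the device ID that should receive the next frame and that
// frame's index. Degraded devices and exhausted turns are skipped.
func (d *wrr) next() (string, int64, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.members) == 0 {
		return "", 0, errNoActive
	}
	// One full lap plus the starting slot is enough to find any active device.
	for tries := 0; tries <= len(d.members); tries++ {
		m := d.members[d.current]
		if !m.degraded && d.remaining > 0 {
			d.remaining--
			idx := d.frameIndex
			d.frameIndex++
			return m.id, idx, nil
		}
		d.current = (d.current + 1) % len(d.members)
		d.remaining = d.members[d.current].weight
	}
	return "", 0, errNoActive
}

// markDegraded excludes a device from dispatch until it is reactivated.
func (d *wrr) markDegraded(id string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for i := range d.members {
		if d.members[i].id == id {
			d.members[i].degraded = true
		}
	}
}

// sequence dispatches n frames and concatenates the chosen device IDs.
func sequence(d *wrr, n int) string {
	var sb strings.Builder
	for i := 0; i < n; i++ {
		id, _, err := d.next()
		if err != nil {
			break
		}
		sb.WriteString(id)
	}
	return sb.String()
}

func main() {
	d := newWRR([]member{{"A", 3, false}, {"B", 1, false}, {"C", 3, false}})
	fmt.Println(sequence(d, 11)) // AAABCCCAAAB
}
```

With B marked degraded before the first frame, the same sketch yields `AAACCCA` for seven frames, because B's turn is skipped entirely.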
+
+**Parallel inference worker pattern (`internal/cluster/pipeline.go`):**
+
+```
+Main goroutine:
+  for {
+      frame = source.ReadFrame()
+      frameCh <- frame  (MJPEG preview)
+      device, frameIdx = dispatcher.Next()
+      workerCh[device] <- {frame, frameIdx}
+  }
+
+Per-device worker goroutine (one per device):
+  for job := range workerCh {
+      result = device.RunInference(job.frame)
+      result.FrameIndex = job.frameIdx
+      result.DeviceID = device.Info().ID
+      resultCh <- result   // merge into the unified output
+  }
+```
+
+Each device gets its own goroutine and a buffered channel (size=2), so a slow device does not block a fast one as long as its queue has room. The Dispatcher assigns more frames to devices with more compute, in proportion to their weights.
+
+**Fault tolerance:**
+
+```go
+// Fault handling inside the worker
+result, err := device.RunInference(frame)
+if err != nil {
+    consecutiveErrors++
+    if consecutiveErrors >= 3 {
+        dispatcher.MarkDegraded(deviceID)
+        cluster.Status = ClusterDegraded
+        // notify the frontend; the worker pauses and retries periodically
+    }
+    continue
+}
+consecutiveErrors = 0
+resultCh <- result
+```
+
+**REST API:**
+
+| Method | Path | Purpose |
+|--------|------|------|
+| `GET` | `/api/clusters` | List all clusters |
+| `POST` | `/api/clusters` | Create a cluster `{name, deviceIds}` |
+| `GET` | `/api/clusters/:id` | Get cluster details |
+| `DELETE` | `/api/clusters/:id` | Delete a cluster |
+| `POST` | `/api/clusters/:id/devices` | Add a device `{deviceId, weight?}` |
+| `DELETE` | `/api/clusters/:id/devices/:deviceId` | Remove a device |
+| `PUT` | `/api/clusters/:id/devices/:deviceId/weight` | Update the weight `{weight}` |
+| `POST` | `/api/clusters/:id/flash` | Flash the cluster `{modelId}` |
+| `POST` | `/api/clusters/:id/inference/start` | Start cluster inference |
+| `POST` | `/api/clusters/:id/inference/stop` | Stop cluster inference |
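
The request bodies listed in the table could be modelled as follows. This is a sketch only: the struct names are hypothetical, the field types are assumed (strings for IDs, an integer weight), and the optional `weight?` is represented with a pointer so that it is omitted when unset.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// createClusterRequest mirrors the `{name, deviceIds}` body of POST /api/clusters.
type createClusterRequest struct {
	Name      string   `json:"name"`
	DeviceIDs []string `json:"deviceIds"`
}

// addDeviceRequest mirrors the `{deviceId, weight?}` body of
// POST /api/clusters/:id/devices. A nil Weight is left out of the JSON.
type addDeviceRequest struct {
	DeviceID string `json:"deviceId"`
	Weight   *int   `json:"weight,omitempty"`
}

// mustJSON marshals v and panics on failure, which keeps the example short.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(createClusterRequest{Name: "lab", DeviceIDs: []string{"kl720-0", "kl520-0"}}))
	// {"name":"lab","deviceIds":["kl720-0","kl520-0"]}

	fmt.Println(mustJSON(addDeviceRequest{DeviceID: "kl520-1"}))
	// {"deviceId":"kl520-1"}

	w := 2
	fmt.Println(mustJSON(addDeviceRequest{DeviceID: "kl520-1", Weight: &w}))
	// {"deviceId":"kl520-1","weight":2}
}
```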
+
+**WebSocket endpoints:**
+
+| Path | Room | Description |
+|------|------|------|
+| `/ws/clusters/:id/inference` | `inference:cluster:{id}` | Cluster inference result stream |
+| `/ws/clusters/:id/flash-progress` | `flash:cluster:{id}` | Cluster flashing progress |
+
+**Cluster flashing flow (automatic heterogeneous NEF resolution):**
+
+```
+POST /api/clusters/:id/flash {modelId: "yolov5s"}
+  │
+  ├── iterate over cluster.Devices
+  │   ├── device A (KL720) → resolveModelPath → data/nef/kl720/kl720_20005_yolov5s.nef
+  │   ├── device B (KL520) → resolveModelPath → data/nef/kl520/kl520_20005_yolov5s.nef
+  │   └── device C (KL720) → resolveModelPath → data/nef/kl720/kl720_20005_yolov5s.nef
+  │
+  ├── flash all devices in parallel (one goroutine each)
+  │
+  └── WS pushes per-device progress:
+      {deviceId: "A", percent: 45, stage: "transferring"}
+      {deviceId: "B", percent: 20, stage: "preparing"}
+```
+
+**Performance estimates:**
+
+| Configuration | Single-device FPS | Cluster FPS (theoretical) |
+|------|-----------|----------------|
+| 1× KL520 | ~40 | ~40 |
+| 1× KL720 | ~80 | ~80 |
+| 2× KL520 + 2× KL720 | — | ~240 (40×2 + 80×2) |
+| 4× KL720 | — | ~320 |
+
+Actual throughput is limited by USB bandwidth and host CPU; using multiple USB controllers or a powered hub is recommended.
+
 ---

 ## 9. Development Environment and Toolchain
@@ -2658,4 +2839,4 @@ go.uber.org/zap          // structured logging

 ---

-*Document version: v1.3 | Date: 2026-02-28 | Status: Being updated*
+*Document version: v1.6 | Date: 2026-03-02 | Status: Being updated*