docs: add F19 cluster inference specs to PRD + TDD
- PRD v2.7: F19 multi-device cluster inference feature spec
- TDD v1.6: 8.5.15 cluster architecture, dispatcher, pipeline, API design

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent f322557af3
commit 0103a483b8
@@ -8,8 +8,8 @@
|------|------|
| Document name | Edge AI Development Platform PRD |
| Product name | (not yet named; referred to below as "the Platform") |
-| Version | v2.6 |
-| Date | 2026-03-01 |
+| Version | v2.7 |
+| Date | 2026-03-02 |
| Status | In revision |

---
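@@ -1255,6 +1255,21 @@ Kneron Dongle Arduino dev board Non-Kneron chip
| **Error 40 handling** | KL520: `restartBridge()` → rebuild the entire Python subprocess → reconnect + reload the model. KL720: retry directly first → fall back to a bridge restart only if the retry fails |
| **Verification results** | KL520: all 8 objects in the street scene detected correctly; KL720: KDP→KDP2 firmware update succeeded (0x0200→0x0720), and the connect/flash/inference flow was verified |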
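
#### F19 — Multi-Device Cluster Inference

| Item | Specification |
|------|------|
| **Overview** | Lets users group multiple Kneron edge devices (KL520 and KL720 may be mixed) into an inference cluster. A weighted dispatcher distributes image frames across the devices for parallel inference, and the merged results are emitted as a single stream, giving linear throughput scaling |
| **Cluster management** | Create/delete clusters and add/remove member devices; cluster configuration is held in memory only (MVP stage) and must be recreated after a restart |
| **Heterogeneous support** | A single cluster may mix KL520 + KL720; the system automatically selects the correct NEF file for each device's chip type (via the existing `resolveModelPath` mechanism) |
| **Weighted dispatch** | Frames are allocated dynamically according to device compute capability: default weight 3 for KL720 and 1 for KL520 (user-adjustable); uses the Weighted Round-Robin algorithm |
| **Unified results** | Inference results from all devices are merged into a single channel and tagged with the source device ID and frame sequence number (`frameIndex`); the frontend can sort by frame order or display the stream directly |
| **Cluster flashing** | One-click flashing of the same logical model to every device in the cluster (the system picks the matching NEF per chip); flashing progress is tracked independently per device |
| **Cluster workspace** | A single workspace UI shows the cluster inference stream; a side panel shows each device's FPS/latency and the overall cluster FPS |
| **Fault tolerance** | A device whose inference fails is automatically marked degraded and the cluster keeps running on the remaining devices; a recovered device can rejoin |
| **WebSocket** | Cluster inference results are broadcast on the `inference:cluster:{clusterId}` room |
| **Limits** | MVP stage: at most 8 devices per cluster; cluster configuration is not persisted; cross-machine clusters are not supported |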
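
The default weights and the 8-device limit in the table can be expressed as a small validation step. The following is a minimal sketch only: the chip-name strings, `defaultWeight`, `validateMembers` and `MaxClusterDevices` are illustrative names that do not come from the spec, and rejecting an empty member list is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// MaxClusterDevices mirrors the MVP limit of 8 devices per cluster.
const MaxClusterDevices = 8

// defaultWeight returns the default dispatch weight for a chip type:
// 3 for KL720 and 1 for KL520. The chip-name strings are assumed.
func defaultWeight(chip string) (int, error) {
	switch chip {
	case "KL720":
		return 3, nil
	case "KL520":
		return 1, nil
	default:
		return 0, fmt.Errorf("unsupported chip type %q", chip)
	}
}

// validateMembers checks a proposed member list against the MVP limits.
// Rejecting an empty list is an assumption, not a stated requirement.
func validateMembers(chips []string) error {
	if len(chips) == 0 {
		return errors.New("cluster needs at least one device")
	}
	if len(chips) > MaxClusterDevices {
		return fmt.Errorf("cluster has %d devices, limit is %d", len(chips), MaxClusterDevices)
	}
	for _, chip := range chips {
		if _, err := defaultWeight(chip); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	chips := []string{"KL720", "KL520", "KL720"}
	if err := validateMembers(chips); err != nil {
		fmt.Println("invalid cluster:", err)
		return
	}
	for _, chip := range chips {
		w, _ := defaultWeight(chip)
		fmt.Printf("%s -> weight %d\n", chip, w)
	}
}
```

Because weights are user-adjustable, a real implementation would treat `defaultWeight` as a starting value only.
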

---

## B5. Feature Roadmap (Post-MVP)
@@ -1441,4 +1456,4 @@ Phase 3 — Advanced features (long-term differentiation)

---

-*Document version: v2.4 | Date: 2026-02-28 | Status: In revision*
+*Document version: v2.7 | Date: 2026-03-02 | Status: In revision*

189
docs/TDD.md
@@ -7,9 +7,9 @@
| Item | Content |
|------|------|
| Document name | Edge AI Development Platform TDD |
-| Corresponding PRD | PRD-Integrated.md v2.6 |
-| Version | v1.5 |
-| Date | 2026-03-01 |
+| Corresponding PRD | PRD-Integrated.md v2.7 |
+| Version | v1.6 |
+| Date | 2026-03-02 |
| Status | In revision |

---
@@ -2110,6 +2110,187 @@ builds:
# Windows: packaged into Setup.exe with NSIS
```

#### 8.5.15 Multi-Device Cluster Inference (F19 — Cluster Inference)

| Frontend component | Backend module | Description |
|---------|---------|------|
| `components/cluster/cluster-list.tsx` | `internal/cluster/manager.go` | Cluster list and CRUD |
| `components/cluster/cluster-card.tsx` | `internal/cluster/dispatcher.go` | Weighted dispatch engine |
| `components/cluster/cluster-performance.tsx` | `internal/cluster/pipeline.go` | Cluster inference pipeline |
| `app/workspace/cluster/[clusterId]/` | `internal/cluster/types.go` | Cluster data structures |
| `stores/cluster-store.ts` | `internal/api/handlers/cluster_handler.go` | Cluster REST API |
| `hooks/use-cluster-inference-stream.ts` | `internal/api/ws/cluster_inference_ws.go` | Cluster inference WS |
| `types/cluster.ts` | `internal/api/ws/cluster_flash_ws.go` | Cluster flashing WS |

**Cluster inference architecture:**

```
┌──────────────────────────────────────────────────────────────────────┐
│  Cluster Manager                                                     │
│  clusters map[string]*Cluster                                        │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ Cluster {ID, Name, Devices[], ModelID, Status}                 │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────┬───────────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│  Cluster Inference Pipeline                                          │
│                                                                      │
│  FrameSource ──► Dispatcher ──┬──► Device A (KL720, w=3) ──┐         │
│  (camera/        (weighted    ├──► Device B (KL520, w=1) ──┤         │
│   image/         round-robin) └──► Device C (KL720, w=3) ──┤         │
│   video)                                                   │         │
│                                                            ▼         │
│                                                    Result Collector  │
│                                                    (merge + tag      │
│                                                     frameIndex +     │
│                                                     deviceId)        │
│                                                            │         │
│                                                            ▼         │
│                                                    resultCh ──► WS   │
│                                                    "inference:       │
│                                                     cluster:{id}"    │
└──────────────────────────────────────────────────────────────────────┘
```

**Core data structures (`internal/cluster/types.go`):**

```go
type ClusterStatus string // "idle" / "inferencing" / "degraded"

type DeviceMember struct {
	DeviceID string `json:"deviceId"`
	Weight   int    `json:"weight"` // dispatch weight: KL720=3, KL520=1
	Status   string `json:"status"` // "active", "degraded", "removed"
}

type Cluster struct {
	ID      string         `json:"id"`
	Name    string         `json:"name"`
	Devices []DeviceMember `json:"devices"`
	ModelID string         `json:"modelId,omitempty"`
	Status  ClusterStatus  `json:"status"`
}

type ClusterResult struct {
	*driver.InferenceResult
	ClusterID  string `json:"clusterId"`
	FrameIndex int64  `json:"frameIndex"`
}
```
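
Because `ClusterResult` embeds `*driver.InferenceResult`, `encoding/json` promotes the embedded fields to the top level of the serialized object. The sketch below illustrates the resulting wire shape; `InferenceResult` here is a hypothetical stand-in (the real `driver.InferenceResult` fields are not shown in this document, and `LatencyMs` in particular is invented for illustration), and `encode` is an illustrative helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InferenceResult is a hypothetical stand-in for driver.InferenceResult.
// Only DeviceID is implied by the worker pseudocode; LatencyMs is assumed.
type InferenceResult struct {
	DeviceID  string  `json:"deviceId"`
	LatencyMs float64 `json:"latencyMs"`
}

// ClusterResult has the same shape as the struct in types.go.
type ClusterResult struct {
	*InferenceResult
	ClusterID  string `json:"clusterId"`
	FrameIndex int64  `json:"frameIndex"`
}

// encode serializes one result the way a WebSocket handler might.
func encode(r ClusterResult) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := encode(ClusterResult{
		InferenceResult: &InferenceResult{DeviceID: "A", LatencyMs: 12.5},
		ClusterID:       "c1",
		FrameIndex:      42,
	})
	if err != nil {
		panic(err)
	}
	// Embedded fields appear flat, next to clusterId and frameIndex.
	fmt.Println(out)
}
```

One consequence of the pointer embedding: if the embedded pointer is nil, the promoted fields are simply omitted from the JSON, so consumers should not assume they are always present.
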

**Weighted dispatch algorithm (`internal/cluster/dispatcher.go`):**

```go
// Weighted round-robin dispatcher.
// Example: devices A(w=3), B(w=1), C(w=3)
// Dispatch sequence: A, A, A, B, C, C, C, A, A, A, B, ...
type Dispatcher struct {
	members    []DeviceMember
	drivers    []driver.DeviceDriver
	current    int   // index of the current device
	remaining  int   // frames remaining in the current device's turn
	frameIndex int64 // global frame counter
	mu         sync.Mutex
}

func (d *Dispatcher) Next() (driver.DeviceDriver, int64, error) // skips degraded devices
func (d *Dispatcher) MarkDegraded(deviceID string)              // mark a device as failed
func (d *Dispatcher) MarkActive(deviceID string)                // restore a device to active
func (d *Dispatcher) ActiveCount() int                          // number of available devices
```
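
One way `Next` could be implemented so that it reproduces the A, A, A, B, C, C, C sequence in the comment above is sketched here. This is an illustration under assumptions, not the actual `dispatcher.go`: it returns a device ID instead of a `driver.DeviceDriver`, and the `member` type, `NewDispatcher` and `ErrNoActiveDevice` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// member is a simplified stand-in for DeviceMember.
type member struct {
	id       string
	weight   int
	degraded bool
}

// Dispatcher hands out device IDs in weighted round-robin order.
type Dispatcher struct {
	mu         sync.Mutex
	members    []member
	current    int   // index of the device currently being served
	remaining  int   // frames left in the current device's turn
	frameIndex int64 // global frame counter
}

// ErrNoActiveDevice is returned when every member is degraded.
var ErrNoActiveDevice = errors.New("no active device in cluster")

func NewDispatcher(members []member) *Dispatcher {
	d := &Dispatcher{members: members}
	if len(members) > 0 {
		d.remaining = members[0].weight
	}
	return d
}

// Next returns the device for the next frame plus that frame's index.
func (d *Dispatcher) Next() (string, int64, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	n := len(d.members)
	// At most n advances bring us back to the current member with a
	// refilled turn, so n+1 iterations are enough.
	for tries := 0; n > 0 && tries <= n; tries++ {
		m := d.members[d.current]
		if !m.degraded && d.remaining > 0 {
			d.remaining--
			idx := d.frameIndex
			d.frameIndex++
			return m.id, idx, nil
		}
		// Turn used up or device degraded: move to the next member.
		d.current = (d.current + 1) % n
		d.remaining = d.members[d.current].weight
	}
	return "", 0, ErrNoActiveDevice
}

func (d *Dispatcher) setDegraded(id string, degraded bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for i := range d.members {
		if d.members[i].id == id {
			d.members[i].degraded = degraded
		}
	}
}

func (d *Dispatcher) MarkDegraded(id string) { d.setDegraded(id, true) }
func (d *Dispatcher) MarkActive(id string)   { d.setDegraded(id, false) }

func main() {
	d := NewDispatcher([]member{{id: "A", weight: 3}, {id: "B", weight: 1}, {id: "C", weight: 3}})
	for i := 0; i < 11; i++ {
		id, idx, err := d.Next()
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("frame %d -> %s\n", idx, id)
	}
}
```

The frame index is taken from a single counter under the same mutex as the device choice, so indices stay gap-free and strictly increasing even when degraded devices are skipped.
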

**Parallel inference worker pattern (`internal/cluster/pipeline.go`):**

```
Main goroutine:
  for {
    frame = source.ReadFrame()
    frameCh <- frame                        (MJPEG preview)
    device, frameIdx = dispatcher.Next()
    workerCh[device] <- {frame, frameIdx}
  }

Per-device worker goroutine (one per device):
  for job := range workerCh {
    result = device.RunInference(job.frame)
    result.FrameIndex = job.frameIdx
    result.DeviceID = device.Info().ID
    resultCh <- result                      // merged into the unified output
  }
```

Each device gets its own goroutine and a buffered channel (size=2), so a slow device does not block a fast one. The Dispatcher assigns more frames to devices with more compute, in proportion to their weights.
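
The worker pattern above can be sketched as runnable Go. This is a simplified illustration, not the actual `pipeline.go`: `runPipeline`, `job`, `result`, `inferFunc` and the `assign` callback (standing in for `dispatcher.Next`) are hypothetical names, inference is simulated, and the pipeline runs over a fixed slice of frames instead of a live source.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type job struct {
	frame    []byte
	frameIdx int64
}

type result struct {
	DeviceID   string
	FrameIndex int64
}

// inferFunc is a stand-in for device.RunInference.
type inferFunc func(frame []byte) error

// runPipeline sends each frame to the worker chosen by assign and merges
// all results into one slice. assign(i) returns a device index for frame i.
func runPipeline(frames [][]byte, devices []string, infer inferFunc, assign func(i int) int) []result {
	workerCh := make([]chan job, len(devices))
	resultCh := make(chan result, len(frames)) // large enough that workers never block
	var wg sync.WaitGroup
	for i, id := range devices {
		workerCh[i] = make(chan job, 2) // buffered, size=2 as in the text
		wg.Add(1)
		go func(id string, jobs <-chan job) {
			defer wg.Done()
			for j := range jobs {
				if err := infer(j.frame); err != nil {
					continue // fault handling is out of scope for this sketch
				}
				resultCh <- result{DeviceID: id, FrameIndex: j.frameIdx}
			}
		}(id, workerCh[i])
	}
	for i, f := range frames {
		workerCh[assign(i)] <- job{frame: f, frameIdx: int64(i)}
	}
	for _, ch := range workerCh {
		close(ch)
	}
	wg.Wait()
	close(resultCh)
	var out []result
	for r := range resultCh {
		out = append(out, r)
	}
	return out
}

func main() {
	frames := make([][]byte, 8)
	// Every fourth frame goes to B, the rest to A (a 3:1 split).
	assign := func(i int) int {
		if i%4 == 3 {
			return 1
		}
		return 0
	}
	out := runPipeline(frames, []string{"A", "B"}, func([]byte) error { return nil }, assign)
	// Results arrive in completion order; sort by frame index for display.
	sort.Slice(out, func(i, j int) bool { return out[i].FrameIndex < out[j].FrameIndex })
	for _, r := range out {
		fmt.Printf("frame %d handled by %s\n", r.FrameIndex, r.DeviceID)
	}
}
```

Results reach the merged channel in completion order, not frame order, which is why each one carries its `FrameIndex`.
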

**Fault tolerance:**

```go
// Fault handling inside the worker
result, err := device.RunInference(frame)
if err != nil {
	consecutiveErrors++
	if consecutiveErrors >= 3 {
		dispatcher.MarkDegraded(deviceID)
		cluster.Status = ClusterDegraded
		// Notify the frontend; the worker pauses and retries periodically
	}
	continue
}
consecutiveErrors = 0
resultCh <- result
```
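
The consecutive-error rule can be isolated into a small helper so it is testable without a device. The sketch below is an assumption-laden illustration: `healthTracker`, `Observe`, `Recover` and `errSim` are hypothetical names, and reporting the degraded transition only once (rather than on every error after the third) is a design choice of this sketch, not something the snippet above specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// maxConsecutiveErrors is the threshold used in the worker snippet.
const maxConsecutiveErrors = 3

// errSim is a simulated inference failure used by the demo.
var errSim = errors.New("simulated inference failure")

// healthTracker counts consecutive failures for one device.
type healthTracker struct {
	consecutive int
	degraded    bool
}

// Observe records one inference outcome. It returns true exactly once,
// at the moment the device crosses the threshold and should be marked
// degraded; a success resets the counter.
func (h *healthTracker) Observe(err error) bool {
	if err == nil {
		h.consecutive = 0
		return false
	}
	h.consecutive++
	if h.consecutive >= maxConsecutiveErrors && !h.degraded {
		h.degraded = true
		return true
	}
	return false
}

// Recover clears the degraded state, for example after a retry succeeds.
func (h *healthTracker) Recover() {
	h.degraded = false
	h.consecutive = 0
}

func main() {
	var h healthTracker
	outcomes := []error{errSim, errSim, nil, errSim, errSim, errSim}
	for i, err := range outcomes {
		if h.Observe(err) {
			fmt.Printf("outcome %d: mark device degraded\n", i)
		}
	}
}
```

In a worker, a `true` return from `Observe` would be the point to call `dispatcher.MarkDegraded` and notify the frontend, and `Recover` would pair with `dispatcher.MarkActive`.
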

**REST API:**

| Method | Path | Function |
|--------|------|------|
| `GET` | `/api/clusters` | List all clusters |
| `POST` | `/api/clusters` | Create a cluster `{name, deviceIds}` |
| `GET` | `/api/clusters/:id` | Get cluster details |
| `DELETE` | `/api/clusters/:id` | Delete a cluster |
| `POST` | `/api/clusters/:id/devices` | Add a device `{deviceId, weight?}` |
| `DELETE` | `/api/clusters/:id/devices/:deviceId` | Remove a device |
| `PUT` | `/api/clusters/:id/devices/:deviceId/weight` | Update a weight `{weight}` |
| `POST` | `/api/clusters/:id/flash` | Flash the cluster `{modelId}` |
| `POST` | `/api/clusters/:id/inference/start` | Start cluster inference |
| `POST` | `/api/clusters/:id/inference/stop` | Stop cluster inference |
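
Behind these endpoints sits the in-memory `clusters map[string]*Cluster`. A rough sketch of how a few of the routes might map onto manager methods follows; the HTTP layer is left out, and `Manager`, its method names, the `cluster-N` ID scheme, the error values and the initial weight of 1 are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type DeviceMember struct {
	DeviceID string
	Weight   int
	Status   string
}

type Cluster struct {
	ID      string
	Name    string
	Devices []DeviceMember
}

var (
	ErrNotFound  = errors.New("cluster or device not found")
	ErrBadWeight = errors.New("weight must be at least 1")
)

// Manager keeps clusters in memory only, so they vanish on restart.
type Manager struct {
	mu       sync.RWMutex
	clusters map[string]*Cluster
	nextID   int
}

func NewManager() *Manager { return &Manager{clusters: map[string]*Cluster{}} }

// Create would back POST /api/clusters. Every device starts at weight 1
// here; chip-based defaults are omitted to keep the sketch short.
func (m *Manager) Create(name string, deviceIDs []string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.nextID++
	c := &Cluster{ID: fmt.Sprintf("cluster-%d", m.nextID), Name: name}
	for _, id := range deviceIDs {
		c.Devices = append(c.Devices, DeviceMember{DeviceID: id, Weight: 1, Status: "active"})
	}
	m.clusters[c.ID] = c
	return c.ID
}

// Get would back GET /api/clusters/:id. It returns a copy so callers
// cannot mutate manager state without holding the lock.
func (m *Manager) Get(id string) (Cluster, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	c, ok := m.clusters[id]
	if !ok {
		return Cluster{}, ErrNotFound
	}
	cp := *c
	cp.Devices = append([]DeviceMember(nil), c.Devices...)
	return cp, nil
}

// SetWeight would back PUT /api/clusters/:id/devices/:deviceId/weight.
func (m *Manager) SetWeight(clusterID, deviceID string, weight int) error {
	if weight < 1 {
		return ErrBadWeight
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	c, ok := m.clusters[clusterID]
	if !ok {
		return ErrNotFound
	}
	for i := range c.Devices {
		if c.Devices[i].DeviceID == deviceID {
			c.Devices[i].Weight = weight
			return nil
		}
	}
	return ErrNotFound
}

// Delete would back DELETE /api/clusters/:id.
func (m *Manager) Delete(id string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.clusters[id]; !ok {
		return ErrNotFound
	}
	delete(m.clusters, id)
	return nil
}

func main() {
	m := NewManager()
	id := m.Create("bench", []string{"A", "B"})
	if err := m.SetWeight(id, "A", 3); err != nil {
		fmt.Println("set weight:", err)
	}
	c, _ := m.Get(id)
	fmt.Printf("%s %q devices=%+v\n", c.ID, c.Name, c.Devices)
}
```

A handler would typically translate `ErrNotFound` into a 404 and `ErrBadWeight` into a 400, though the status-code mapping is not specified in this document.
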

**WebSocket endpoints:**

| Path | Room | Description |
|------|------|------|
| `/ws/clusters/:id/inference` | `inference:cluster:{id}` | Cluster inference result stream |
| `/ws/clusters/:id/flash-progress` | `flash:cluster:{id}` | Cluster flashing progress |

**Cluster flashing flow (automatic heterogeneous NEF resolution):**

```
POST /api/clusters/:id/flash {modelId: "yolov5s"}
│
├── Iterate over cluster.Devices
│   ├── device A (KL720) → resolveModelPath → data/nef/kl720/kl720_20005_yolov5s.nef
│   ├── device B (KL520) → resolveModelPath → data/nef/kl520/kl520_20005_yolov5s.nef
│   └── device C (KL720) → resolveModelPath → data/nef/kl720/kl720_20005_yolov5s.nef
│
├── Flash all devices in parallel (one goroutine each)
│
└── Push per-device progress over WS:
    {deviceId: "A", percent: 45, stage: "transferring"}
    {deviceId: "B", percent: 20, stage: "preparing"}
```
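
The flow above, resolving a NEF per chip and then flashing every device in its own goroutine, could look roughly like this. Everything here is a sketch under assumptions: `resolveNEF` is a hypothetical lookup-table stand-in for `resolveModelPath` (whose real signature is not shown), `flashFunc` stands in for the driver's flash call, and the progress channel stands in for the WebSocket push.

```go
package main

import (
	"fmt"
	"sync"
)

type device struct {
	ID   string
	Chip string // "KL720" or "KL520"
}

type progress struct {
	DeviceID string
	Percent  int
	Stage    string
}

// nefTable is an illustrative lookup; the paths are copied from the flow above.
var nefTable = map[string]string{
	"KL720/yolov5s": "data/nef/kl720/kl720_20005_yolov5s.nef",
	"KL520/yolov5s": "data/nef/kl520/kl520_20005_yolov5s.nef",
}

// resolveNEF is a hypothetical stand-in for resolveModelPath.
func resolveNEF(chip, modelID string) (string, error) {
	if p, ok := nefTable[chip+"/"+modelID]; ok {
		return p, nil
	}
	return "", fmt.Errorf("no NEF for model %q on chip %q", modelID, chip)
}

// flashFunc stands in for the per-device flash call; it reports progress
// through the supplied callback.
type flashFunc func(dev device, nef string, report func(percent int, stage string)) error

// flashCluster flashes all devices in parallel and returns a per-device
// error map (a nil entry means success). progressCh must be drained by
// the caller or be buffered deeply enough.
func flashCluster(devs []device, modelID string,
	resolve func(chip, modelID string) (string, error),
	flash flashFunc, progressCh chan<- progress) map[string]error {

	errs := make(map[string]error, len(devs))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, dev := range devs {
		wg.Add(1)
		go func(dev device) {
			defer wg.Done()
			nef, err := resolve(dev.Chip, modelID)
			if err == nil {
				err = flash(dev, nef, func(percent int, stage string) {
					progressCh <- progress{DeviceID: dev.ID, Percent: percent, Stage: stage}
				})
			}
			mu.Lock()
			errs[dev.ID] = err
			mu.Unlock()
		}(dev)
	}
	wg.Wait()
	return errs
}

func main() {
	devs := []device{{ID: "A", Chip: "KL720"}, {ID: "B", Chip: "KL520"}, {ID: "C", Chip: "KL720"}}
	progressCh := make(chan progress, 16) // buffered so the demo needs no reader goroutine
	fakeFlash := func(dev device, nef string, report func(int, string)) error {
		report(100, "done")
		return nil
	}
	errs := flashCluster(devs, "yolov5s", resolveNEF, fakeFlash, progressCh)
	close(progressCh)
	for p := range progressCh {
		fmt.Printf("%s: %d%% (%s)\n", p.DeviceID, p.Percent, p.Stage)
	}
	for id, err := range errs {
		if err != nil {
			fmt.Println(id, "failed:", err)
		}
	}
}
```

Collecting errors per device, rather than aborting on the first failure, matches the requirement that each device's flashing progress be tracked independently.
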

**Performance estimates:**

| Configuration | Single-device FPS | Cluster FPS (theoretical) |
|------|-----------|----------------|
| 1× KL520 | ~40 | ~40 |
| 1× KL720 | ~80 | ~80 |
| 2× KL520 + 2× KL720 | — | ~240 (40×2 + 80×2) |
| 4× KL720 | — | ~320 |

Actual throughput is limited by USB bandwidth and host CPU; using multiple USB controllers or a powered hub is recommended.
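
The theoretical column is the sum of the per-device FPS figures. As a rough cross-check, the sketch below computes that sum alongside a second bound that holds if frames are handed out strictly in proportion to weight, with the dispatcher waiting whenever the chosen device's queue is full. That strictness is an assumption about the pipeline, and the `dev` type and both function names are illustrative. The FPS inputs are the approximate figures from the table.

```go
package main

import (
	"fmt"
	"math"
)

type dev struct {
	FPS    float64 // approximate single-device throughput
	Weight int     // dispatch weight
}

// sumFPS is the table's theoretical figure: every device runs flat out.
func sumFPS(devs []dev) float64 {
	total := 0.0
	for _, d := range devs {
		total += d.FPS
	}
	return total
}

// strictWRRBound assumes device i receives Weight_i/W of the stream
// (W = total weight), so the stream can run no faster than
// FPS_i * W / Weight_i for the most constrained device.
func strictWRRBound(devs []dev) float64 {
	totalWeight := 0
	for _, d := range devs {
		totalWeight += d.Weight
	}
	bound := math.Inf(1)
	for _, d := range devs {
		if d.Weight <= 0 {
			continue // a device with no weight receives no frames
		}
		bound = math.Min(bound, d.FPS*float64(totalWeight)/float64(d.Weight))
	}
	return bound
}

func main() {
	mixed := []dev{{40, 1}, {40, 1}, {80, 3}, {80, 3}} // 2x KL520 + 2x KL720, default weights
	fmt.Printf("sum of device FPS: %.0f\n", sumFPS(mixed))
	fmt.Printf("strict WRR bound:  %.1f\n", strictWRRBound(mixed))
}
```

Under that assumption, 2× KL520 + 2× KL720 at the default weights 1 and 3 tops out near 213 FPS rather than 240: each KL720 is handed 3/8 of the stream but supplies only about 80 FPS, so 80 × 8 / 3 ≈ 213. Weights of 1 and 2 would match the 40:80 ratio and recover the full sum. This is arithmetic on the table's estimates, not a measurement.
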

---

## 9. Development Environment and Toolchain
@@ -2658,4 +2839,4 @@ go.uber.org/zap // structured logging

---

-*Document version: v1.3 | Date: 2026-02-28 | Status: In revision*
+*Document version: v1.6 | Date: 2026-03-02 | Status: In revision*