docs: align Design.md with Phase 1 architecture
Restructure section 2 from component overview to tech selection table covering Scheduler, Worker, Queue, Job State, Artifact Store, Web UI. Document why Redis Stream for the queue (language-neutral, consumer groups, single Redis instance, Crash Reset alignment) and detail worker horizontal scaling via consumer groups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
4d381c0b50
commit
7404ca9bc8
316
docs/Design.md
316
docs/Design.md
@ -37,78 +37,131 @@
|
||||
|
||||
---
|
||||
|
||||
## 2. 系統元件與責任
|
||||
## 2. 技術選型
|
||||
|
||||
### 2.1 Web UI
|
||||
| 元件 | 技術 | 說明 |
|
||||
|------|------|------|
|
||||
| Task Scheduler | **Node.js / Express** | I/O 密集、event loop 適合調度角色 |
|
||||
| Worker (ONNX/BIE/NEF) | **Python / FastAPI** | 需要 Kneron Toolchain Python 環境 |
|
||||
| Task Queue | **Redis Stream** | 語言中立,Node.js 與 Python 都是 Redis client;支援 Consumer Group 做水平擴展 |
|
||||
| Job State | **Redis Hash** | 與 Queue 共用同一個 Redis instance,重啟即清空 |
|
||||
| Artifact Store | **Shared Volume**(單機)/ MinIO(跨機,未來) | Worker 之間的檔案交換區 |
|
||||
| Web UI | **Vue 3 + Vite** | 前端 SPA |
|
||||
|
||||
### 2.1 為什麼 Queue 用 Redis Stream
|
||||
|
||||
- **語言中立**:Scheduler (Node.js) 用 `ioredis`,Worker (Python) 用 `redis-py`,雙方只需約定 queue name 與 message 格式
|
||||
- **Consumer Group**:同一個 Consumer Group 內,每筆訊息只會被一個 Worker 消費,天然支援水平擴展
|
||||
- **不額外引入元件**:Queue 與 Job State 共用同一個 Redis,部署簡單
|
||||
- **符合 Crash Reset 哲學**:Redis 不開 persistence,重啟即清空
|
||||
|
||||
### 2.2 Worker 水平擴展(Consumer Group)
|
||||
|
||||
```
|
||||
queue:onnx
|
||||
└─ Consumer Group: "onnx-workers"
|
||||
├─ onnx-worker-1 ← XREADGROUP 拿到 Job A
|
||||
├─ onnx-worker-2 ← XREADGROUP 拿到 Job B
|
||||
└─ onnx-worker-3 ← XREADGROUP 拿到 Job C
|
||||
|
||||
queue:bie
|
||||
└─ Consumer Group: "bie-workers"
|
||||
├─ bie-worker-1
|
||||
└─ bie-worker-2
|
||||
|
||||
queue:nef
|
||||
└─ Consumer Group: "nef-workers"
|
||||
└─ nef-worker-1
|
||||
```
|
||||
|
||||
- 同一個 Consumer Group 內,每筆訊息**只會被一個 Worker 拿到**,Redis 自動分配
|
||||
- 擴展只需多啟 Worker process(例如 `docker-compose up --scale bie-worker=3`),不需改程式碼
|
||||
- 每個 Worker 一次只處理一個 job,透過 `XACK` 確認完成
|
||||
|
||||
---
|
||||
|
||||
## 3. 系統元件與責任
|
||||
|
||||
### 3.1 Web UI
|
||||
- 提供使用者:
|
||||
- 上傳 model.onnx、quant.zip
|
||||
- 建立轉換任務
|
||||
- 查詢任務狀態
|
||||
- 下載 BIE / NEF
|
||||
- Web UI 不直接讀寫 Redis 或 Queue
|
||||
- Web UI 可直接與 MinIO 上傳 / 下載(或透過 Scheduler 取得路徑)
|
||||
- Web UI 透過 Scheduler API 上傳檔案 / 下載結果
|
||||
- 狀態更新:以 SSE 為主,Polling 為備援(3-5 秒)
|
||||
|
||||
### 2.2 Task Scheduler(控制面)
|
||||
### 3.2 Task Scheduler(控制面,Node.js)
|
||||
- 唯一的控制面元件
|
||||
- 職責:
|
||||
- 建立 job_id
|
||||
- 建立 Redis job record
|
||||
- 將 job 推入對應 task queue
|
||||
- 接收 worker 完成事件(done queue)
|
||||
- 提供 REST API(建立 job、查詢狀態)
|
||||
- 建立 job_id(UUID v4)
|
||||
- 建立 Redis job record(Hash)
|
||||
- 將 job 推入對應 task queue(`XADD queue:onnx`)
|
||||
- 監聽 worker 完成事件(`XREADGROUP queue:done`)
|
||||
- 推進 job 狀態(ONNX → BIE → NEF → COMPLETED)
|
||||
- 提供 SSE 事件串流
|
||||
- 流程固定:ONNX → BIE → NEF,不支援跳過或只跑部分
|
||||
- Scheduler 不等待 worker、不執行 heavy CPU 任務
|
||||
|
||||
### 2.3 Task Queue
|
||||
- 至少包含以下 queue:
|
||||
- queue:onnx
|
||||
- queue:bie
|
||||
- queue:nef
|
||||
- queue:done
|
||||
### 3.3 Task Queue(Redis Stream)
|
||||
- Queue 列表:
|
||||
- `queue:onnx` — ONNX 轉換任務
|
||||
- `queue:bie` — BIE 量化任務
|
||||
- `queue:nef` — NEF 編譯任務
|
||||
- `queue:done` — Worker 完成回報
|
||||
- 每個 queue 對應一個 Consumer Group
|
||||
- Queue 僅用於「暫存」任務(重啟可清空)
|
||||
|
||||
### 2.4 Worker Pools
|
||||
### 3.4 Worker Pools(Python)
|
||||
- ONNX / BIE / NEF Worker 各自獨立 pool
|
||||
- Worker 特性:
|
||||
- 無狀態
|
||||
- pull-based(自行從 queue 取任務)
|
||||
- pull-based(`XREADGROUP` 從 queue 取任務)
|
||||
- 一次只處理一個 job
|
||||
- 加入 Consumer Group,自動分配任務
|
||||
- Worker 失敗:
|
||||
- 回報 fail
|
||||
- 回報 fail 到 `queue:done`
|
||||
- Scheduler 將 job 標記為 FAILED
|
||||
|
||||
### 2.5 MinIO(Artifact Store)
|
||||
- 所有 job 的 input / output / log 都放在 MinIO
|
||||
- 作為跨 worker / 跨主機的唯一資料交換區
|
||||
### 3.5 Artifact Store(Shared Volume)
|
||||
- 所有 Worker 掛載同一個 Docker volume(`/data/jobs`)
|
||||
- 每個 job 的檔案存放在 `/data/jobs/{job_id}/` 下
|
||||
- Worker 之間透過檔案系統直接讀寫,無需上傳/下載
|
||||
- 單機部署:Docker named volume 或 bind mount
|
||||
- 未來跨機部署:可替換為 MinIO,只需改 Worker I/O 層
|
||||
|
||||
---
|
||||
|
||||
## 3. 命名與資料結構
|
||||
## 4. 命名與資料結構
|
||||
|
||||
### 3.1 Job ID
|
||||
### 4.1 Job ID
|
||||
- 使用 UUID v4
|
||||
- 全系統唯一
|
||||
|
||||
### 3.2 MinIO 路徑規則(建議)
|
||||
### 4.2 Job 工作目錄結構
|
||||
|
||||
Bucket:toolchain-jobs
|
||||
掛載路徑:`/data/jobs`
|
||||
|
||||
```
|
||||
jobs/{job_id}/
|
||||
/data/jobs/{job_id}/
|
||||
input/
|
||||
model.onnx
|
||||
quant.zip
|
||||
output/
|
||||
result.bie
|
||||
result.nef
|
||||
model.onnx # 使用者上傳的模型檔
|
||||
ref_images/ # 量化用參考圖片
|
||||
*.jpg / *.png
|
||||
out.onnx # ONNX Worker 產出
|
||||
out.bie # BIE Worker 產出
|
||||
out.nef # NEF Worker 產出
|
||||
logs/
|
||||
onnx.log
|
||||
bie.log
|
||||
nef.log
|
||||
```
|
||||
|
||||
### 3.3 Redis Job Record(Hash / JSON)
|
||||
所有 Worker 共用同一個 volume,直接讀寫同一個 job 目錄。
|
||||
|
||||
### 4.3 Redis Job Record(Hash / JSON)
|
||||
|
||||
Key:job:{job_id}
|
||||
|
||||
@ -129,26 +182,68 @@ error:
|
||||
|
||||
---
|
||||
|
||||
## 4. 成功流程(Happy Path)
|
||||
### 4.4 Redis Stream Queue 協定
|
||||
|
||||
Scheduler 與 Worker 之間透過 Redis Stream 溝通,需約定以下格式:
|
||||
|
||||
#### Task Message(Scheduler → Worker Queue)
|
||||
```json
|
||||
{
|
||||
"job_id": "uuid-v4",
|
||||
"created_at": "ISO-8601",
|
||||
"input_dir": "jobs/{job_id}/",
|
||||
"parameters": {
|
||||
"model_id": 1,
|
||||
"version": "0001",
|
||||
"platform": "720",
|
||||
"enable_evaluate": false,
|
||||
"enable_sim_fp": false
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### Done Message(Worker → queue:done)
|
||||
```json
|
||||
{
|
||||
"job_id": "uuid-v4",
|
||||
"step": "onnx|bie|nef",
|
||||
"result": "ok|fail",
|
||||
"reason": "optional error message",
|
||||
"output_files": ["out.onnx"],
|
||||
"completed_at": "ISO-8601"
|
||||
}
|
||||
```
|
||||
|
||||
#### Consumer Group 命名規則
|
||||
| Queue | Consumer Group | Consumer 命名 |
|
||||
|-------|---------------|--------------|
|
||||
| `queue:onnx` | `onnx-workers` | `onnx-worker-{hostname}` |
|
||||
| `queue:bie` | `bie-workers` | `bie-worker-{hostname}` |
|
||||
| `queue:nef` | `nef-workers` | `nef-worker-{hostname}` |
|
||||
| `queue:done` | `scheduler` | `scheduler-{hostname}` |
|
||||
|
||||
---
|
||||
|
||||
## 5. 成功流程(Happy Path)
|
||||
|
||||
1. 使用者透過 Web UI 建立轉換任務
|
||||
2. Web UI 上傳 input 到 MinIO
|
||||
2. Web UI 上傳 input 到 Scheduler,Scheduler 存入 `/data/jobs/{job_id}/input/`
|
||||
3. Scheduler 建立 job record,push 到 queue:onnx
|
||||
4. ONNX worker pull 任務、執行、push done
|
||||
4. ONNX worker pull 任務、從 job 目錄讀取 input、執行、寫出 out.onnx、push done
|
||||
5. Scheduler 推進狀態,push 到 queue:bie
|
||||
6. BIE worker 執行、輸出 result.bie、push done
|
||||
6. BIE worker 從同一個 job 目錄讀取 out.onnx + ref_images、執行、寫出 out.bie、push done
|
||||
7. Scheduler 推進狀態,push 到 queue:nef
|
||||
8. NEF worker 執行、輸出 result.nef、push done
|
||||
8. NEF worker 從 job 目錄讀取 out.bie、執行、寫出 out.nef、push done
|
||||
9. Scheduler 標記 COMPLETED
|
||||
10. Web UI 取得下載路徑
|
||||
|
||||
---
|
||||
|
||||
## 4.1 工作目錄與 Worker I/O 規格(落地版)
|
||||
## 5.1 工作目錄與 Worker I/O 規格(落地版)
|
||||
|
||||
以下為實作落地時的檔案布局與 worker 互動規格(與 MinIO 路徑可一對一對應):
|
||||
|
||||
### 4.1.1 工作目錄
|
||||
### 5.1.1 工作目錄
|
||||
- API 取號後建立 `task_id`
|
||||
- API 將使用者上傳檔案放入工作目錄:
|
||||
- `{base_path}/{task_id}/`
|
||||
@ -161,7 +256,7 @@ error:
|
||||
<image files...>
|
||||
```
|
||||
|
||||
### 4.1.2 ONNX Worker
|
||||
### 5.1.2 ONNX Worker
|
||||
- 輸入:工作目錄下的唯一檔案(不假設檔名 / 副檔名)
|
||||
- 輸出:`out.onnx`
|
||||
- 輸出位置:同一工作目錄
|
||||
@ -169,21 +264,21 @@ error:
|
||||
- `enable_evaluate` (default: `false`):是否執行 IP evaluator(原 Web GUI 流程為 OFF)
|
||||
- `enable_sim_fp` (default: `false`):是否執行浮點 E2E 模擬(尚未接線)
|
||||
|
||||
### 4.1.3 BIE Worker
|
||||
### 5.1.3 BIE Worker
|
||||
- 輸入:`out.onnx` + `ref_images/*`
|
||||
- 輸出:`out.bie`
|
||||
- 輸出位置:同一工作目錄
|
||||
- 可選旗標:
|
||||
- `enable_sim_fixed` (default: `false`):是否執行定點 E2E 模擬(尚未接線)
|
||||
|
||||
### 4.1.4 NEF Worker
|
||||
### 5.1.4 NEF Worker
|
||||
- 輸入:`out.bie`
|
||||
- 輸出:`out.nef`
|
||||
- 輸出位置:同一工作目錄
|
||||
- 可選旗標:
|
||||
- `enable_sim_hw` (default: `false`):是否執行硬體 E2E 模擬(尚未接線)
|
||||
|
||||
### 4.1.6 流程預設開關對照(原 Web GUI vs 現在 Workers)
|
||||
### 5.1.5 流程預設開關對照(原 Web GUI vs 現在 Workers)
|
||||
| 步驟 | 原 Web GUI 預設 | 現在 Workers 預設 | 開關 |
|
||||
|---|---|---|---|
|
||||
| ONNX 轉換/最佳化 | ON | ON | 無 |
|
||||
@ -194,13 +289,13 @@ error:
|
||||
| NEF Compile | ON | ON | 無 |
|
||||
| HW E2E 模擬 | OFF | OFF | `enable_sim_hw` |
|
||||
|
||||
### 4.1.5 Core / Toolchain 路徑一致性
|
||||
### 5.1.6 Core / Toolchain 路徑一致性
|
||||
- Worker 需將工作目錄 path 傳給 core
|
||||
- Core 需設定 toolchain 相關 path(輸出與中間檔)都指向該工作目錄
|
||||
|
||||
---
|
||||
|
||||
## 5. 失敗行為(簡化)
|
||||
## 6. 失敗行為(簡化)
|
||||
|
||||
- Worker 例外 → 回報 fail
|
||||
- Scheduler 標記 job 為 FAILED
|
||||
@ -209,7 +304,7 @@ error:
|
||||
|
||||
---
|
||||
|
||||
## 6. API(Scheduler)
|
||||
## 7. API(Scheduler)
|
||||
|
||||
### POST /jobs
|
||||
建立新 job
|
||||
@ -258,13 +353,14 @@ SSE 事件串流(主動推送狀態更新)
|
||||
|
||||
---
|
||||
|
||||
## 7. Worker 行為規範
|
||||
## 8. Worker 行為規範
|
||||
|
||||
- 從對應 queue pull job_id
|
||||
- 從 MinIO 下載 input
|
||||
- 從對應 queue pull job_id(`XREADGROUP`)
|
||||
- 從 shared volume 讀取 input(`/data/jobs/{job_id}/`)
|
||||
- 執行 Toolchain flow
|
||||
- 上傳 output
|
||||
- push done event:
|
||||
- 將 output 寫入同一個 job 目錄
|
||||
- `XACK` 確認訊息已處理
|
||||
- push done event 到 `queue:done`:
|
||||
```
|
||||
{
|
||||
"job_id": "...",
|
||||
@ -276,7 +372,7 @@ SSE 事件串流(主動推送狀態更新)
|
||||
|
||||
---
|
||||
|
||||
## 8. 併發與限流
|
||||
## 9. 併發與限流
|
||||
|
||||
- BIE Worker 數量 = 最大同時 BIE 任務數
|
||||
- 每個 BIE Worker 一次只跑一個 job
|
||||
@ -284,7 +380,7 @@ SSE 事件串流(主動推送狀態更新)
|
||||
|
||||
---
|
||||
|
||||
## 9. 最小部署建議
|
||||
## 10. 最小部署建議
|
||||
|
||||
- web-ui x1
|
||||
- scheduler x1
|
||||
@ -296,9 +392,121 @@ SSE 事件串流(主動推送狀態更新)
|
||||
|
||||
---
|
||||
|
||||
## 10. MVP 驗證項目
|
||||
## 11. 本地開發與測試
|
||||
|
||||
- 單一 job 成功完成
|
||||
### 11.1 docker-compose 一鍵啟動
|
||||
|
||||
提供 `docker-compose.yml`,一個指令啟動整套系統:
|
||||
|
||||
```bash
|
||||
docker-compose up
|
||||
```
|
||||
|
||||
服務清單:
|
||||
|
||||
```yaml
|
||||
volumes:
|
||||
job-data: # 所有 Worker + Scheduler 共用的 job 工作目錄
|
||||
|
||||
services:
|
||||
redis:
|
||||
image: redis:7-alpine
|
||||
ports: ["6379:6379"]
|
||||
# 不開 persistence,符合 Crash Reset 哲學
|
||||
|
||||
scheduler:
|
||||
build: ./apps/task-scheduler
|
||||
ports: ["4000:4000"]
|
||||
depends_on: [redis]
|
||||
volumes:
|
||||
- job-data:/data/jobs
|
||||
environment:
|
||||
- REDIS_URL=redis://redis:6379
|
||||
- JOB_DATA_DIR=/data/jobs
|
||||
|
||||
onnx-worker:
|
||||
build: ./services/workers/onnx
|
||||
depends_on: [redis]
|
||||
volumes:
|
||||
- job-data:/data/jobs
|
||||
environment:
|
||||
- REDIS_URL=redis://redis:6379
|
||||
- JOB_DATA_DIR=/data/jobs
|
||||
- WORKER_MODE=real # real | stub
|
||||
|
||||
bie-worker:
|
||||
build: ./services/workers/bie
|
||||
depends_on: [redis]
|
||||
volumes:
|
||||
- job-data:/data/jobs
|
||||
environment:
|
||||
- REDIS_URL=redis://redis:6379
|
||||
- JOB_DATA_DIR=/data/jobs
|
||||
- WORKER_MODE=real
|
||||
|
||||
nef-worker:
|
||||
build: ./services/workers/nef
|
||||
depends_on: [redis]
|
||||
volumes:
|
||||
- job-data:/data/jobs
|
||||
environment:
|
||||
- REDIS_URL=redis://redis:6379
|
||||
- JOB_DATA_DIR=/data/jobs
|
||||
- WORKER_MODE=real
|
||||
|
||||
web-ui:
|
||||
build: ./apps/web
|
||||
ports: ["3000:3000"]
|
||||
depends_on: [scheduler]
|
||||
```
|
||||
|
||||
水平擴展 Worker:
|
||||
```bash
|
||||
docker-compose up --scale bie-worker=3
|
||||
```
|
||||
|
||||
### 11.2 Stub Worker 模式
|
||||
|
||||
開發 Scheduler 或 Web UI 時,不需要啟動真正的 Kneron Toolchain 環境。
|
||||
透過環境變數 `WORKER_MODE=stub` 切換為 Stub 模式:
|
||||
|
||||
```bash
|
||||
# 啟動整套系統,Worker 使用 stub 模式(不需要 toolchain 環境)
|
||||
WORKER_MODE=stub docker-compose up
|
||||
```
|
||||
|
||||
Stub Worker 行為:
|
||||
- 收到任務後 sleep 數秒(模擬處理時間)
|
||||
- 產生假的輸出檔案(空檔或固定內容)
|
||||
- 回報成功到 `queue:done`
|
||||
- 不依賴任何 Kneron Toolchain 二進位檔
|
||||
|
||||
```python
|
||||
# Stub 實作範例
|
||||
async def process_onnx_core_stub(input_paths, output_path, parameters):
|
||||
await asyncio.sleep(3) # 模擬處理時間
|
||||
stub_file = output_path / "out.onnx"
|
||||
stub_file.write_bytes(b"STUB_ONNX_OUTPUT")
|
||||
return {"file_path": str(stub_file), "file_size": stub_file.stat().st_size}
|
||||
```
|
||||
|
||||
### 11.3 開發情境對照
|
||||
|
||||
| 開發情境 | 需要什麼 | 指令 |
|
||||
|---------|---------|------|
|
||||
| 改 Scheduler 邏輯 | Redis + stub workers | `WORKER_MODE=stub docker-compose up redis scheduler onnx-worker bie-worker nef-worker` |
|
||||
| 改 Web UI | Redis + Scheduler + stub workers | `WORKER_MODE=stub docker-compose up` |
|
||||
| 改 Worker 核心邏輯 | DevContainer + Redis(跑單元測試) | `docker run -d redis && pytest tests/` |
|
||||
| 完整 E2E 測試 | docker-compose(真實 Worker) | `docker-compose up` |
|
||||
| 測試水平擴展 | docker-compose + scale | `docker-compose up --scale bie-worker=3` |
|
||||
|
||||
---
|
||||
|
||||
## 12. MVP 驗證項目
|
||||
|
||||
- 單一 job 成功完成(ONNX → BIE → NEF → COMPLETED)
|
||||
- 多 job 排隊,BIE 不超量
|
||||
- Worker 失敗 → job FAILED
|
||||
- Redis 重啟 → job 消失,UI 提示重送
|
||||
- Stub Worker 模式可正常走完完整流程
|
||||
- Worker 水平擴展(scale=3)可正常分配任務
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user