A 階段第二個 milestone、銜接 M9-1 bridge.py、暴露 service layer 給 M9-3 API/WebSocket。 New module `server/internal/firmware/`: - types.go: 123 行(FirmwareVersion / FirmwareProgress / ActiveTaskInfo / UpgradeDriver interface / 8 reason const) - progress.go: 147 行(仿 flash pattern 的 Tracker、Task.cancel 預留 SIGTERM force-cancel godoc) - service.go: 373 行(核心 service:UpgradeFirmware / HasActiveTask / GetActiveTaskInfo / RequestShutdown / WaitForActiveTasks / ListBundledVersions / GetCurrentVersion) - service_test.go: 676 行、13 個 test 含 MultiDeviceParallel Driver layer: - kl720_driver.go: 697 → 1054 行(+357、新 UpgradeFirmware method + tryRouteFirmwareEvent + sendCommandForUpgrade snapshot pattern) - kl720_driver_test.go: 360 行、11 個 test(含 InfoNotBlockedDuringUpgrade / CtxCancelReleasesBridge / StderrEventAfterCtxCancel 100 round stress) 關鍵設計: - flash 與 firmware 模組分離(不 import flash) - UpgradeDriver interface 隔離 driver 細節、DeviceLookup interface 隔離 device manager - 中介 channel pattern(service ↔ driver)方便 service 補欄位(DeviceID / Direction / BeforeVersion) - timeout 雙保險:chip timeout + 30s margin - 8 reason enum 對齊 bridge.py、stage 採 Design 命名 Concurrency race 修復(M9-2 Reviewer round 1 → round 2): - Major 1(mutex deadlock):新 fwUpgradeMu 獨立鎖 + sendCommandForUpgrade snapshot stdin/stdout pattern、避開 d.mu field-level race + 升級期間 Info/Disconnect 不被卡 + timeout 路徑無死鎖 - Major 2(close-channel race):tryRouteFirmwareEvent 持 fwMu 整段、配合 defer setFirmwareProgressCh(nil) 提供 happen-before、絕無 send on closed channel panic Reviewer 兩輪審查: - Round 1: 0 Critical / 2 Major / 5 Minor / 5 Suggestion - Round 2: 0 Critical / 0 Major / 2 Minor / 2 Suggestion(11/12 issue 修到位、Suggestion 4 留 follow-up) M9-1 follow-up 順手清: - m5(test 死碼 _firmware_upgrade_start_ts 殘留兩行)已清 - s5(test 註解 idempotent shape 說明)已加 測試: - go test ./... -race -count=1: 全綠(28s、無 regression) - Python: 36 tests + 22 subtests 全綠(0.31s) - go vet / build: 0 output 下一步:M9-3 API handler + WebSocket progress(CI 建議 `go test -race -count=3` 提升 race 偵測強度) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
153 lines
8.8 KiB
Markdown
153 lines
8.8 KiB
Markdown
# Reviewer Report — M9-2 Go driver + firmware service
|
||
|
||
> 審查日期:2026-05-25
|
||
> 範圍:M9-2 Go driver UpgradeFirmware + firmware service + types/progress + M9-1 follow-up cleanup
|
||
> 統計:0 Critical / 2 Major / 5 Minor / 5 Suggestion
|
||
|
||
## TL;DR
|
||
|
||
M9-2 整體實作完整、規格對齊度高、TDD §8.6 graceful shutdown 介面齊備、bridge 端 4 個情境 stage 序列在 Go driver 端正確 forward;25 個 Go test + 36 個 Python test 覆蓋核心路徑。但**有 2 個 Major 與 driver mutex 範圍 + close-channel race 相關**,建議 backend 第 2 輪修;其餘 Minor / Suggestion 不阻擋 M9-3 啟動。**不升級 security agent**。
|
||
|
||
## 審查範圍
|
||
|
||
| # | 檔案 | 行數 | 性質 |
|
||
|---|------|------|------|
|
||
| 1 | `server/internal/firmware/types.go` | 123(新) | 列舉 / Schema |
|
||
| 2 | `server/internal/firmware/progress.go` | 141(新) | ProgressTracker |
|
||
| 3 | `server/internal/firmware/service.go` | 346(新) | 核心 service |
|
||
| 4 | `server/internal/firmware/service_test.go` | 517(新、11 tests) | 單元測試 |
|
||
| 5 | `server/internal/driver/kneron/kl720_driver.go` | 697→948 | 修改、+251 行 |
|
||
| 6 | `server/internal/driver/kneron/kl720_driver_test.go` | 177(新、8 tests) | 單元測試 |
|
||
| 7 | `server/scripts/test_kneron_bridge_firmware.py` | 清 m5/s5、+9/-2 | M9-1 follow-up |
|
||
|
||
---
|
||
|
||
## 🔴 Critical(必修、阻擋 merge)
|
||
|
||
**無**。
|
||
|
||
---
|
||
|
||
## 🟠 Major(強烈建議修、建議第 2 輪修)
|
||
|
||
### Major 1 — driver mutex hold 整段升級 + timeout deadlock
|
||
**檔案**:`kl720_driver.go:860-866 + 875`
|
||
|
||
**問題**:sendCommand goroutine 在 `d.mu.Lock() → sendCommand → d.mu.Unlock()` 整段持鎖、sendCommand 內部會 blocking 等 bridge stdout 回應 60-200s。期間:
|
||
- 其他 method(Info / IsConnected / Disconnect)被卡 60-200s
|
||
- M9-3 GetActiveTaskInfo / DeviceInfo 會打 `driver.Info()` → 被卡
|
||
- **timeout 路徑死鎖**:service ctx.Done → driver line 875 `d.mu.Lock()` 等 → 但 sendCommand goroutine 還持有 d.mu → sendCommand goroutine 永遠不回(bridge 卡住、沒人殺它)→ 死鎖
|
||
|
||
**建議**:sendCommand goroutine 改為只在 stdin write / stdout read 細粒度 lock、不整段持鎖;或為 firmware 升級開獨立 sub-mutex 避免阻塞 Info。
|
||
|
||
### Major 2 — fwProgressCh close-channel race
|
||
**檔案**:`kl720_driver.go:758-797 + service.go:197-203`
|
||
|
||
**問題**:tryRouteFirmwareEvent 取 ch 後 release fwMu 才 send、與 service goroutine close(intermediate) 之間有 race window。
|
||
- 視窗很窄(一個 LOCK/UNLOCK 之間)
|
||
- stderr 在 device disconnect / process die 的 inflight event 處理時最容易撞到
|
||
- **panic: send on closed channel** 可能發生
|
||
|
||
**建議修法(任選)**:
|
||
- A:tryRouteFirmwareEvent 用 `defer recover()` 包 send
|
||
- B:driver 自己擁有 stderr push ch、UpgradeFirmware return 前先 close own ch + drain 後再 close service intermediate
|
||
- C:用 atomic.Value 存 ch + close 前先 setFirmwareProgressCh(nil)(已做、但與 close 沒嚴格 happen-before)
|
||
|
||
production 跑 1 萬次升級會撞上、建議第 2 輪修。
|
||
|
||
---
|
||
|
||
## 🟡 Minor(建議修、不阻擋)
|
||
|
||
| # | 檔案:行 | 問題 | 建議 |
|
||
|---|--------|------|------|
|
||
| Minor 1 | service.go:36-39 | brickRiskReasons 集合少 `ReasonTimeout`(TDD §3.4 第 6 列「>180s KL720」標 BRICK_RISK)| 加註解或加入集合 |
|
||
| Minor 2 | service_test.go / kl720_driver_test.go | 缺 `ctx.Done` 後 sendCommand goroutine 釋放測試、缺多 device 平行 upgrade 測試 | 補 `TestUpgradeFirmware_CtxCancelReleasesBridge` + `TestUpgradeFirmware_MultiDeviceParallel` |
|
||
| Minor 3 | kl720_driver.go:830-840 | 早退路徑(pythonReady=false / unsupported chip)defer 仍把 needsReset 設 true、語意不準 | 加 `if !attemptedUpgrade { return }` guard |
|
||
| Minor 4 | service.go:189-196 | 「結論先寫、推論在前」順序、初讀者要多看一次 | 把結論移到段首 |
|
||
| Minor 5 | kl720_driver.go:758-797 | fwMu 取 ch 後 release 才 send 的 atomic 缺口(同 Major 2 同源、但跨 session lifecycle 風險)| 與 Major 2 一起處理 |
|
||
|
||
---
|
||
|
||
## 💡 Suggestion(純改善建議)
|
||
|
||
1. **kl720_driver.go:925-937**:sendCommand 成功後補 done event 容錯、會與 stderr 推的 done event 重複。M9-3 wire 到 WS 後前端可能跑兩次 cleanup。建議 service.runUpgrade forward loop 對 `StageDone` 去重、或前端 idempotent guard。
|
||
2. 補 `TestUpgradeFirmware_MultiDeviceParallel` 3 device 同時 upgrade、驗 ProgressTracker key 不誤匹配。
|
||
3. `service.go:303-324` ListBundledVersions 加 `os.Stat(chipDir)` lightly check、區分 chip 不支援 vs firmware 漏 build。
|
||
4. `bridgeFirmwareEvent` struct 與 `firmware.FirmwareProgress` 共用 struct + 雙 JSON tag、避免欄位增刪維護兩處。
|
||
5. `progress.go:25` `Task.cancel func()` field 目前 service 寫入但沒 reader、加 TODO 或 godoc 註解說明預留 SIGTERM force-cancel。
|
||
|
||
---
|
||
|
||
## 4 個情境 stage 序列在 Go driver 端的對應
|
||
|
||
| 情境 | bridge 行為(M9-1)| Go driver 處理 | 結果 |
|
||
|------|---------|--------------|------|
|
||
| KL520 KDP1 + loader | 5 stage:preparing → loading → flashing → verifying → done | stderr scanner 偵 5 line JSON → tryRouteFirmwareEvent → forward intermediate → runUpgrade 補欄位 → task.ProgressCh | ✅ TestUpgradeFirmware_SuccessFiveStages 驗證 |
|
||
| KL520 KDP2 short-circuit | 4 stage:preparing → flashing → verifying → done(無 loading) | 同上、4 event forward;driver 端對 stage 順序無 hardcode 校驗 | ✅ 不誤判 |
|
||
| KL720 legacy(無 loader) | 4 stage:preparing → flashing → verifying → done | 同上、4 event forward | ✅ TestKL720KDPLegacy 等效驗證 |
|
||
| 已 KDP2 / no-op | bridge 端仍走 verify rescan 確認 | done event 推到 progressCh + sendCommand 回 success 後 line 928 補 done event 容錯(雙保險、可能 done 出現 2 次、Suggestion 1)| ✅ 雙保險(但需去重) |
|
||
|
||
---
|
||
|
||
## TDD 對齊度評估
|
||
|
||
| TDD 章節 | 對應實作 | 對齊狀態 |
|
||
|---------|---------|---------|
|
||
| §4.2 FirmwareProgress 9 欄位 | types.go:34-53 | ✅ 完全對齊 |
|
||
| §4.3 Stage 列舉 6 個 | types.go:56-63 | ✅ |
|
||
| §5.1 A 階段流程 | service.go + driver.UpgradeFirmware | ✅ |
|
||
| §5.2 B2 降版流程 | 留 B2、本期 only Upgrade | ✅ Direction enum 兩值都保留 |
|
||
| §6.1 8 reason enum | types.go:73-82 | ✅ |
|
||
| §6.1 handler 回傳格式 | bridgeFirmwareEvent struct kl720_driver.go:731-744 | ✅ JSON tag snake_case 對齊 bridge.py |
|
||
| §8.6.1 HasActiveTask / RequestShutdown / WaitForActiveTasks | service.go:258-293 | ✅ |
|
||
| §8.6.2 GetActiveTaskInfo(含 EtaSeconds) | service.go:266-268 + progress.go:108-141 | ✅ |
|
||
| §8.6.3 SIGTERM handler 整合 main.go | 延後 M9-3 | ✅ service 端介面齊備 |
|
||
|
||
---
|
||
|
||
## Concurrency 評估(重點關注)
|
||
|
||
整體 goroutine / mutex / channel 拓樸:
|
||
|
||
```
|
||
service.UpgradeFirmware(caller 同步)
|
||
└─ taskWg.Add(1)
|
||
└─ go runUpgrade
|
||
├─ go drv.UpgradeFirmware(ctx, chip, intermediate)
|
||
│ └─ go func() sendCommand → resCh
|
||
│ └─ select ctx.Done / resCh → 推 event 到 intermediate
|
||
│ └─ return → defer setFirmwareProgressCh(nil)
|
||
├─ for ev := range intermediate { forward + 補欄位 → task.ProgressCh }
|
||
├─ <-driverDone(err)
|
||
├─ if err && lastStage != Done/Error → 補 error event
|
||
└─ defer close(task.ProgressCh) → markDone → taskWg.Done
|
||
|
||
外部 stderr goroutine(在 driver.startPython 內、process-level)
|
||
└─ scanner.Scan() loop
|
||
└─ if firmware_progress prefix → tryRouteFirmwareEvent → 取 fwProgressCh → write
|
||
```
|
||
|
||
**Race / deadlock 詳細分析見上方 Major 1 / Major 2。**
|
||
|
||
---
|
||
|
||
## M9-1 follow-up 清理驗證
|
||
|
||
| 項目 | 預期 | 實際 | 狀態 |
|
||
|------|------|------|------|
|
||
| m5 行 616 死碼刪除 | 砍 `bridge._firmware_upgrade_start_ts` | line 615-617 改為只設 `_firmware_upgrade_in_progress = False` + Reviewer 註解 | ✅ |
|
||
| m5 行 632 死碼刪除 | 同上 | line 631-636 改為 `start_ts = time.monotonic()` 本地變數 + 註解明示 closure capture | ✅ |
|
||
| s5 註解 | 第二次 register 的「刻意設計」說明 | line 684-688 註解明確說明 idempotent shape + retest install→unregister | ✅ |
|
||
|
||
**正面評價**:M9-1 reviewer 留下的問題清單在 M9-2 確實被執行。
|
||
|
||
---
|
||
|
||
## 結論
|
||
|
||
- ✅ **通過 with 2 Major + 5 Minor + 5 Suggestion**
|
||
- ✅ **不阻擋 M9-3 啟動**——Major 1 / 2 是 driver 內部 concurrency 不影響 M9-3(HTTP handler + WebSocket)介面對齊;M9-3 可平行起手、Major 1 / 2 第 2 輪修
|
||
- ✅ **不需要升級給 security agent**——security 軸無 OWASP / authentication / authorization / injection 等深審需求
|
||
- ⚠️ **建議 backend 第 2 輪修 Major 1 + Major 2**:mutex 範圍 + close-channel race 是 production 跑萬次後會踩雷的問題、修起來成本不高、值得在 M9-2 收尾前修完
|