visionA/local-tool/.autoflow/05-implementation/review/m9-2-go-driver-firmware-service-review.md
jim800121chen c03eb6fd0e feat(local-tool): M9-2 — Go driver UpgradeFirmware + firmware service module
A 階段第二個 milestone、銜接 M9-1 bridge.py、暴露 service layer 給 M9-3 API/WebSocket。

New module `server/internal/firmware/`:
- types.go: 123 行(FirmwareVersion / FirmwareProgress / ActiveTaskInfo / UpgradeDriver interface / 8 reason const)
- progress.go: 147 行(仿 flash pattern 的 Tracker、Task.cancel 預留 SIGTERM force-cancel godoc)
- service.go: 373 行(核心 service:UpgradeFirmware / HasActiveTask / GetActiveTaskInfo / RequestShutdown / WaitForActiveTasks / ListBundledVersions / GetCurrentVersion)
- service_test.go: 676 行、13 個 test 含 MultiDeviceParallel

Driver layer:
- kl720_driver.go: 697 → 1054 行(+357、新 UpgradeFirmware method + tryRouteFirmwareEvent + sendCommandForUpgrade snapshot pattern)
- kl720_driver_test.go: 360 行、11 個 test(含 InfoNotBlockedDuringUpgrade / CtxCancelReleasesBridge / StderrEventAfterCtxCancel 100 round stress)

關鍵設計:
- flash 與 firmware 模組分離(不 import flash)
- UpgradeDriver interface 隔離 driver 細節、DeviceLookup interface 隔離 device manager
- 中介 channel pattern(service ↔ driver)方便 service 補欄位(DeviceID / Direction / BeforeVersion)
- timeout 雙保險:chip timeout + 30s margin
- 8 reason enum 對齊 bridge.py、stage 採 Design 命名

Concurrency race 修復(M9-2 Reviewer round 1 → round 2):
- Major 1(mutex deadlock):新 fwUpgradeMu 獨立鎖 + sendCommandForUpgrade snapshot stdin/stdout pattern、避開 d.mu field-level race + 升級期間 Info/Disconnect 不被卡 + timeout 路徑無死鎖
- Major 2(close-channel race):tryRouteFirmwareEvent 持 fwMu 整段、配合 defer setFirmwareProgressCh(nil) 提供 happen-before、絕無 send on closed channel panic

Reviewer 兩輪審查:
- Round 1: 0 Critical / 2 Major / 5 Minor / 5 Suggestion
- Round 2: 0 Critical / 0 Major / 2 Minor / 2 Suggestion(11/12 issue 修到位、Suggestion 4 留 follow-up)

M9-1 follow-up 順手清:
- m5(test 死碼 _firmware_upgrade_start_ts 殘留兩行)已清
- s5(test 註解 idempotent shape 說明)已加

測試:
- go test ./... -race -count=1: 全綠(28s、無 regression)
- Python: 36 tests + 22 subtests 全綠(0.31s)
- go vet / build: 0 output

下一步:M9-3 API handler + WebSocket progress(CI 建議 `go test -race -count=3` 提升 race 偵測強度)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 11:27:36 +08:00

153 lines
8.8 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Reviewer Report — M9-2 Go driver + firmware service
> 審查日期2026-05-25
> 範圍M9-2 Go driver UpgradeFirmware + firmware service + types/progress + M9-1 follow-up cleanup
> 統計0 Critical / 2 Major / 5 Minor / 5 Suggestion
## TL;DR
M9-2 整體實作完整、規格對齊度高、TDD §8.6 graceful shutdown 介面齊備、bridge 端 4 個情境 stage 序列在 Go driver 端正確 forward25 個 Go test + 36 個 Python test 覆蓋核心路徑。但**有 2 個 Major 與 driver mutex 範圍 + close-channel race 相關**,建議 backend 第 2 輪修;其餘 Minor / Suggestion 不阻擋 M9-3 啟動。**不升級 security agent**。
## 審查範圍
| # | 檔案 | 行數 | 性質 |
|---|------|------|------|
| 1 | `server/internal/firmware/types.go` | 123 | 列舉 / Schema |
| 2 | `server/internal/firmware/progress.go` | 141 | ProgressTracker |
| 3 | `server/internal/firmware/service.go` | 346 | 核心 service |
| 4 | `server/internal/firmware/service_test.go` | 517新、11 tests | 單元測試 |
| 5 | `server/internal/driver/kneron/kl720_driver.go` | 697→948 | 修改、+251 行 |
| 6 | `server/internal/driver/kneron/kl720_driver_test.go` | 177新、8 tests | 單元測試 |
| 7 | `server/scripts/test_kneron_bridge_firmware.py` | 清 m5/s5、+9/-2 | M9-1 follow-up |
---
## 🔴 Critical必修、阻擋 merge
**無**
---
## 🟠 Major強烈建議修、建議第 2 輪修)
### Major 1 — driver mutex hold 整段升級 + timeout deadlock
**檔案**`kl720_driver.go:860-866 + 875`
**問題**sendCommand goroutine 在 `d.mu.Lock() → sendCommand → d.mu.Unlock()` 整段持鎖、sendCommand 內部會 blocking 等 bridge stdout 回應 60-200s。期間
- 其他 methodInfo / IsConnected / Disconnect被卡 60-200s
- M9-3 GetActiveTaskInfo / DeviceInfo 會打 `driver.Info()` → 被卡
- **timeout 路徑死鎖**service ctx.Done → driver line 875 `d.mu.Lock()` 等 → 但 sendCommand goroutine 還持有 d.mu → sendCommand goroutine 永遠不回bridge 卡住、沒人殺它)→ 死鎖
**建議**sendCommand goroutine 改為只在 stdin write / stdout read 細粒度 lock、不整段持鎖或為 firmware 升級開獨立 sub-mutex 避免阻塞 Info。
### Major 2 — fwProgressCh close-channel race
**檔案**`kl720_driver.go:758-797 + service.go:197-203`
**問題**tryRouteFirmwareEvent 取 ch 後 release fwMu 才 send、與 service goroutine close(intermediate) 之間有 race window。
- 視窗很窄(一個 LOCK/UNLOCK 之間)
- stderr 在 device disconnect / process die 的 inflight event 處理時最容易撞到
- **panic: send on closed channel** 可能發生
**建議修法(任選)**
- AtryRouteFirmwareEvent 用 `defer recover()` 包 send
- Bdriver 自己擁有 stderr push ch、UpgradeFirmware return 前先 close own ch + drain 後再 close service intermediate
- C用 atomic.Value 存 ch + close 前先 setFirmwareProgressCh(nil)(已做、但與 close 沒嚴格 happen-before
production 跑 1 萬次升級會撞上、建議第 2 輪修。
---
## 🟡 Minor建議修、不阻擋
| # | 檔案:行 | 問題 | 建議 |
|---|--------|------|------|
| Minor 1 | service.go:36-39 | brickRiskReasons 集合少 `ReasonTimeout`TDD §3.4 第 6 列「>180s KL720」標 BRICK_RISK| 加註解或加入集合 |
| Minor 2 | service_test.go / kl720_driver_test.go | 缺 `ctx.Done` 後 sendCommand goroutine 釋放測試、缺多 device 平行 upgrade 測試 | 補 `TestUpgradeFirmware_CtxCancelReleasesBridge` + `TestUpgradeFirmware_MultiDeviceParallel` |
| Minor 3 | kl720_driver.go:830-840 | 早退路徑pythonReady=false / unsupported chipdefer 仍把 needsReset 設 true、語意不準 | 加 `if !attemptedUpgrade { return }` guard |
| Minor 4 | service.go:189-196 | 「結論先寫、推論在前」順序、初讀者要多看一次 | 把結論移到段首 |
| Minor 5 | kl720_driver.go:758-797 | fwMu 取 ch 後 release 才 send 的 atomic 缺口(同 Major 2 同源、但跨 session lifecycle 風險)| 與 Major 2 一起處理 |
---
## 💡 Suggestion純改善建議
1. **kl720_driver.go:925-937**sendCommand 成功後補 done event 容錯、會與 stderr 推的 done event 重複。M9-3 wire 到 WS 後前端可能跑兩次 cleanup。建議 service.runUpgrade forward loop 對 `StageDone` 去重、或前端 idempotent guard。
2.`TestUpgradeFirmware_MultiDeviceParallel` 3 device 同時 upgrade、驗 ProgressTracker key 不誤匹配。
3. `service.go:303-324` ListBundledVersions 加 `os.Stat(chipDir)` lightly check、區分 chip 不支援 vs firmware 漏 build。
4. `bridgeFirmwareEvent` struct 與 `firmware.FirmwareProgress` 共用 struct + 雙 JSON tag、避免欄位增刪維護兩處。
5. `progress.go:25` `Task.cancel func()` field 目前 service 寫入但沒 reader、加 TODO 或 godoc 註解說明預留 SIGTERM force-cancel。
---
## 4 個情境 stage 序列在 Go driver 端的對應
| 情境 | bridge 行為M9-1| Go driver 處理 | 結果 |
|------|---------|--------------|------|
| KL520 KDP1 + loader | 5 stagepreparing → loading → flashing → verifying → done | stderr scanner 偵 5 line JSON → tryRouteFirmwareEvent → forward intermediate → runUpgrade 補欄位 → task.ProgressCh | ✅ TestUpgradeFirmware_SuccessFiveStages 驗證 |
| KL520 KDP2 short-circuit | 4 stagepreparing → flashing → verifying → done無 loading | 同上、4 event forwarddriver 端對 stage 順序無 hardcode 校驗 | ✅ 不誤判 |
| KL720 legacy無 loader | 4 stagepreparing → flashing → verifying → done | 同上、4 event forward | ✅ TestKL720KDPLegacy 等效驗證 |
| 已 KDP2 / no-op | bridge 端仍走 verify rescan 確認 | done event 推到 progressCh + sendCommand 回 success 後 line 928 補 done event 容錯(雙保險、可能 done 出現 2 次、Suggestion 1| ✅ 雙保險(但需去重) |
---
## TDD 對齊度評估
| TDD 章節 | 對應實作 | 對齊狀態 |
|---------|---------|---------|
| §4.2 FirmwareProgress 9 欄位 | types.go:34-53 | ✅ 完全對齊 |
| §4.3 Stage 列舉 6 個 | types.go:56-63 | ✅ |
| §5.1 A 階段流程 | service.go + driver.UpgradeFirmware | ✅ |
| §5.2 B2 降版流程 | 留 B2、本期 only Upgrade | ✅ Direction enum 兩值都保留 |
| §6.1 8 reason enum | types.go:73-82 | ✅ |
| §6.1 handler 回傳格式 | bridgeFirmwareEvent struct kl720_driver.go:731-744 | ✅ JSON tag snake_case 對齊 bridge.py |
| §8.6.1 HasActiveTask / RequestShutdown / WaitForActiveTasks | service.go:258-293 | ✅ |
| §8.6.2 GetActiveTaskInfo含 EtaSeconds | service.go:266-268 + progress.go:108-141 | ✅ |
| §8.6.3 SIGTERM handler 整合 main.go | 延後 M9-3 | ✅ service 端介面齊備 |
---
## Concurrency 評估(重點關注)
整體 goroutine / mutex / channel 拓樸:
```
service.UpgradeFirmwarecaller 同步)
└─ taskWg.Add(1)
└─ go runUpgrade
├─ go drv.UpgradeFirmware(ctx, chip, intermediate)
│ └─ go func() sendCommand → resCh
│ └─ select ctx.Done / resCh → 推 event 到 intermediate
│ └─ return → defer setFirmwareProgressCh(nil)
├─ for ev := range intermediate { forward + 補欄位 → task.ProgressCh }
├─ <-driverDoneerr
├─ if err && lastStage != Done/Error → 補 error event
└─ defer close(task.ProgressCh) → markDone → taskWg.Done
外部 stderr goroutine在 driver.startPython 內、process-level
└─ scanner.Scan() loop
└─ if firmware_progress prefix → tryRouteFirmwareEvent → 取 fwProgressCh → write
```
**Race / deadlock 詳細分析見上方 Major 1 / Major 2。**
---
## M9-1 follow-up 清理驗證
| 項目 | 預期 | 實際 | 狀態 |
|------|------|------|------|
| m5 行 616 死碼刪除 | 砍 `bridge._firmware_upgrade_start_ts` | line 615-617 改為只設 `_firmware_upgrade_in_progress = False` + Reviewer 註解 | ✅ |
| m5 行 632 死碼刪除 | 同上 | line 631-636 改為 `start_ts = time.monotonic()` 本地變數 + 註解明示 closure capture | ✅ |
| s5 註解 | 第二次 register 的「刻意設計」說明 | line 684-688 註解明確說明 idempotent shape + retest install→unregister | ✅ |
**正面評價**M9-1 reviewer 留下的問題清單在 M9-2 確實被執行。
---
## 結論
-**通過 with 2 Major + 5 Minor + 5 Suggestion**
-**不阻擋 M9-3 啟動**——Major 1 / 2 是 driver 內部 concurrency 不影響 M9-3HTTP handler + WebSocket介面對齊M9-3 可平行起手、Major 1 / 2 第 2 輪修
-**不需要升級給 security agent**——security 軸無 OWASP / authentication / authorization / injection 等深審需求
- ⚠️ **建議 backend 第 2 輪修 Major 1 + Major 2**mutex 範圍 + close-channel race 是 production 跑萬次後會踩雷的問題、修起來成本不高、值得在 M9-2 收尾前修完