perf(local-tool): Windows KL520 cold-boot connect 106s → ~40s (skip redundant reset)

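Background:
On Windows, the first KL520 connect was measured at 106s because the reset
flow loads the firmware twice:
  1. Device comes up in Loader → load firmware (35s) → Comp/U
  2. Reset drops it back to Loader → bridge restarts
  3. Reconnect finds Loader again → load firmware (30s) → Comp/U
  4. The first Loader reconnect often fails (15s timeout)
In total, ~65s goes to the wasted work of discarding freshly loaded firmware
and loading it again.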
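The step timings above can be checked with a small back-of-the-envelope model. This sketch uses only the figures quoted in this message. The cost of step 2 (reset plus bridge restart) is not measured anywhere in this commit, so the model infers it as whatever remains of the ~65s after the 30s reload and the 15s timeout. The few seconds left over between step 1 and the resulting ~41s are overhead the message does not attribute.

```python
# Back-of-the-envelope model of the old Windows KL520 cold-boot connect,
# using only the figures quoted in the commit message.
TOTAL_OLD_S = 106         # measured first connect on Windows
WASTED_S = 65             # approximate cost of steps 2-4 combined
FIRST_LOAD_S = 35         # step 1: Loader -> load firmware -> Comp/U
RELOAD_S = 30             # step 3: reload after reset
RECONNECT_TIMEOUT_S = 15  # step 4: failed first Loader reconnect

# Assumption: step 2 (reset + bridge restart) is not measured in the source;
# it is inferred as the remainder of the ~65 s of wasted work.
RESET_RESTART_S = WASTED_S - RELOAD_S - RECONNECT_TIMEOUT_S

# (description, seconds, redundant when firmware was just loaded?)
steps = [
    ("1. Loader -> load firmware -> Comp/U", FIRST_LOAD_S, False),
    ("2. reset -> Loader, bridge restart (inferred)", RESET_RESTART_S, True),
    ("3. reconnect in Loader -> reload firmware", RELOAD_S, True),
    ("4. first Loader reconnect times out", RECONNECT_TIMEOUT_S, True),
]

saved_s = sum(secs for _, secs, redundant in steps if redundant)
expected_new_s = TOTAL_OLD_S - saved_s
unattributed_s = expected_new_s - FIRST_LOAD_S  # time outside steps 1-4

for name, secs, redundant in steps:
    tag = "skipped" if redundant else "kept"
    print(f"{name:<48} {secs:>3}s  {tag}")
print(f"saved by conditional reset: {saved_s}s")
print(f"expected cold-boot connect: ~{expected_new_s}s")
```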

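Root cause: the earlier needsReset fix resets unconditionally, regardless of
whether the firmware is fresh or stale. But Error 15 only occurs when Comp/U is
left over from a previous session. A Comp/U loaded during this same connect has
a clean session and needs no reset.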
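The root-cause argument reduces to a tiny state model. The sketch below is hypothetical: `SimKL520`, `Error15` and `session_clean` are illustrative names, not the Kneron PLUS `kp` API. It encodes only the observed behaviour described above:

- A Comp/U loaded during the current connect runs inference.
- A Comp/U left over from a previous session raises Error 15 until the device is reset to Loader and reloaded.

```python
from dataclasses import dataclass


class Error15(Exception):
    """Stand-in for ApiKPException Error 15 (SEND_DATA_TOO_LARGE)."""


@dataclass
class SimKL520:
    """Hypothetical KL520 model; not the Kneron PLUS API."""
    firmware: str = "Loader"     # "Loader" or "Comp/U"
    session_clean: bool = False  # True only if Comp/U was loaded this session

    def load_firmware(self) -> None:
        assert self.firmware == "Loader", "firmware can only be loaded from Loader"
        self.firmware = "Comp/U"
        self.session_clean = True

    def reset(self) -> None:
        # A RAM-based KL520 drops back to the Loader on reset.
        self.firmware = "Loader"
        self.session_clean = False

    def infer(self) -> str:
        if self.firmware != "Comp/U":
            raise RuntimeError("no inference firmware loaded")
        if not self.session_clean:
            raise Error15("SEND_DATA_TOO_LARGE")
        return "ok"


def try_infer(dev: SimKL520) -> str:
    try:
        return dev.infer()
    except Error15 as e:
        return f"Error 15 ({e})"


# Cold boot: the device arrives in Loader and firmware is loaded during connect.
cold = SimKL520(firmware="Loader")
cold.load_firmware()
print("cold boot, no reset:      ", try_infer(cold))

# Comp/U left over from a previous session, inference without reset.
stale = SimKL520(firmware="Comp/U", session_clean=False)
print("leftover Comp/U, no reset:", try_infer(stale))

# The same leftover device after reset -> Loader -> reload.
stale.reset()
stale.load_firmware()
print("leftover Comp/U, reset:   ", try_infer(stale))
```

In this model, skipping the reset is unsafe only for the leftover Comp/U. That is the case the bridge reports as `fresh_firmware_loaded=False`.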

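Fix (conditional reset):
- server/scripts/kneron_bridge.py: the connect handler now tracks whether this
  call went through the firmware load flow and returns an extra
  `fresh_firmware_loaded` bool
- server/internal/driver/kneron/kl720_driver.go: Connect reads the flag and, if
  it is true, sets skipReset (the firmware was just loaded, so the session is
  already clean)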
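On the bridge side, the bookkeeping amounts to one flag that is set only on the path that actually loads firmware. This is a minimal sketch under stated assumptions:

- `handle_connect_sketch` and `load_and_reconnect` are hypothetical stand-ins for the real `handle_connect`, which also opens the device, sets timeouts, and reports chip and kn_number.
- The firmware strings are illustrative. Only the `"Loader" in fw_str` check comes from the diff.

```python
from typing import Callable


def handle_connect_sketch(
    fw_str: str,
    firmware_files_found: bool,
    load_and_reconnect: Callable[[], str],
) -> dict:
    """Hypothetical reduction of the bridge's connect handler.

    Only the fresh_firmware_loaded bookkeeping is modelled.
    """
    fresh_firmware_loaded = False
    if "Loader" in fw_str:
        if firmware_files_found:
            # Load firmware, reconnect, and re-read the firmware string.
            fw_str = load_and_reconnect()
            fresh_firmware_loaded = True
        # Files missing: nothing was loaded, so the flag stays False.
    # Not in Loader: firmware left over from a previous session, flag stays False.
    return {"firmware": fw_str, "fresh_firmware_loaded": fresh_firmware_loaded}


# Illustrative firmware strings (assumed format).
cold = handle_connect_sketch("KDP2 Loader", True, lambda: "KDP2 Comp/U")
stale = handle_connect_sketch("KDP2 Comp/U", True, lambda: "unused")
no_files = handle_connect_sketch("KDP2 Loader", False, lambda: "unused")

print(cold)      # {'firmware': 'KDP2 Comp/U', 'fresh_firmware_loaded': True}
print(stale)     # {'firmware': 'KDP2 Comp/U', 'fresh_firmware_loaded': False}
print(no_files)  # {'firmware': 'KDP2 Loader', 'fresh_firmware_loaded': False}
```

Leaving the flag False whenever no load happened is the conservative choice. The driver falls back to the existing reset path in every case that is not positively known to be clean.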

Verification (2026-04-21):
- `/tmp/test_bridge.py`, run after unplugging and replugging USB:
  `connect (fw=Loader) → fresh_firmware_loaded=True → skip reset → load_model → inference`
  → 11 detections (person×8, tie×3, latency 332ms)
- Mac UI, leftover Comp/U path: reset → 11 bboxes ✓
- Mac UI, Loader cold-boot path (after replugging): skip reset → 11 bboxes ✓

Expected impact:
- Windows cold boot (common): 106s → ~40s (saves ~65s)
- Mac across sessions (common): ~15-20s, unchanged
- Rare case (Windows, device not power-cycled but across server sessions):
  takes the full reset path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jim800121chen 2026-04-21 11:09:25 +08:00
parent 30d0ff5695
commit b71ff4cd3c
3 changed files with 61 additions and 22 deletions


@ -64,10 +64,36 @@ Mac 版 app 上傳單張圖推論,畫面上完全沒有 bbox 標註。
三種尺寸516×640 直式 / 1920×1080 横式 / 512×512 正方)全通過。 三種尺寸516×640 直式 / 1920×1080 横式 / 512×512 正方)全通過。
### 待使用者驗證 ### 已驗證2026-04-21
- [ ] Mac UI 端實測:上傳 `~/Downloads/000000000459.jpg` 應見 11 個 bbox 精準框住 person + tie - [x] Mac UI Comp/U 殘留路徑reset 後推論 11 個 bbox 正確
- [ ] Windows 實測首次 connect 耗時 + 是否還踩 HTTP timeout現已放寬到 120s - [x] Mac UI Loader cold-boot 路徑(拔插 USBskip reset 後推論 11 個 bbox 正確
- [x] Windows 實測首次 connect106s 成功(< 120s timeout推論正確
### 後續優化Windows connect 106s → 預期 ~40s方案 C
Windows 實測發現即使 timeout 120s 夠用,使用者要等 106s 體感太久。拆解
瓶頸發現走了兩次 firmware load第一次 connect 進來 Loader → load fw
→ Comp/U ~35s / reset → 回 Loader / reconnect → load fw → Comp/U ~30s
reset 流程中第二次 firmware load 是白做工。
**條件性 reset方案 C**
- `kneron_bridge.py connect` 回報 `fresh_firmware_loaded` flag
- `True`:本次 connect 內部剛做過 firmware load原本是 Loader
- `False`:進來就是 Comp/U上次 session 殘留,需要 reset 清乾淨)
- `kl720_driver.go` 判 flag 決定要不要做 restartBridge reset
**驗證兩條路徑都 OK2026-04-21**
- Loader cold-boot → skip reset → 推論 11 bbox ✓
- Comp/U 殘留 → 做 reset → 推論 11 bbox ✓
**預期效益**
- Windows cold-boot最常見106s → **~40s**(省 65s
- Mac 跨 session最常見~15-20s 不變
- 極少數情境Windows 但 device 未斷電):維持走完整 reset 流程
### 待驗證
- [ ] Linux 實測 - [ ] Linux 實測
- [ ] Windows 實測方案 C 效益(預期 cold-boot 降到 ~40s
### 前端 debug log 去留 ### 前端 debug log 去留
`camera-overlay.tsx``console.log('[bbox-debug] ...')` 驗證完成後**可清可留**。保留成本低,對未來 debug 有幫助。 `camera-overlay.tsx``console.log('[bbox-debug] ...')` 驗證完成後**可清可留**。保留成本低,對未來 debug 有幫助。


@@ -286,32 +286,37 @@ func (d *KneronDriver) Connect() error {
 	if fw, ok := resp["firmware"].(string); ok {
 		d.info.FirmwareVer = fw
 	}
+	// Bridge reports whether firmware was freshly loaded during this connect.
+	// Freshly loaded firmware = clean state → no reset needed.
+	// Firmware already present (left over from previous session) → must reset to
+	// avoid Error 15 SEND_DATA_TOO_LARGE on first inference.
+	freshFirmware, _ := resp["fresh_firmware_loaded"].(bool)
 	d.mu.Unlock()
-	// First connect after server start: reset device to clear stale models.
+	// First connect after server start: reset device to clear stale session.
 	//
-	// BOTH KL520 and KL720 need a reset:
+	// Why reset is needed:
+	// - KL720: flash-based; firmware and models persist in flash, so a reset
+	//   to clear stale models is meaningful.
+	// - KL520: USB Boot / RAM-based. If firmware survives between sessions
+	//   (a Comp/U that was not just loaded), load_model + inference fails
+	//   with Error 15 every time. It must go reset → Loader → reload
+	//   firmware → Comp/U to get a clean session.
 	//
-	// - KL720 is a flash-based device; firmware and models persist in flash,
-	//   so resetting to clear stale models is meaningful.
-	//
-	// - KL520 is a USB Boot device (RAM-based firmware, cleared on power loss),
-	//   so in theory every connect starts clean. But testing shows that if
-	//   firmware survives between sessions (fw=KDP2 Comp/U rather than Loader),
-	//   going straight to load_model + inference fails 100% of the time with
-	//   ApiKPException Error 15 (SEND_DATA_TOO_LARGE). Only the full flow of
-	//   reset → reboot to Loader → reload firmware into Comp/U yields a
-	//   session that can run inference.
-	//
-	// Cost: KL520 reset + firmware load + reconnect takes ~15-20s (measured on macOS).
-	// It may take longer on Windows; if the 60s HTTP connect timeout is not
-	// enough, raise it or switch to an asynchronous connect pattern.
-	if needsReset {
-		d.driverLog("INFO", "[kneron] first connect after server start — resetting %s to clear stale session...", d.chipType)
+	// Why we skip reset when freshFirmware=true:
+	// - This connect just did a full firmware load → Comp/U is fresh and clean.
+	//   Resetting would tear it down and reload it again, wasting 30-60s.
+	// - Windows cold boot is the most common case (first connect after the
+	//   device was powered off); skipping saves the ~65s cost of restartBridge.
+	skipReset := freshFirmware
+	if needsReset && !skipReset {
+		d.driverLog("INFO", "[kneron] first connect — resetting %s to clear stale session (firmware was already present)...", d.chipType)
 		if err := d.restartBridge(); err != nil {
 			d.driverLog("WARN", "[kneron] reset on connect failed (non-fatal): %v", err)
 		} else {
 			d.driverLog("INFO", "[kneron] device reset complete — clean state ready")
 		}
+	} else if needsReset && skipReset {
+		d.driverLog("INFO", "[kneron] %s: skipping reset — firmware just loaded, session already clean", d.chipType)
 	}
 	return nil
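The driver-side branch can be mirrored in Python to check each outcome. This is a sketch, not the driver itself:

- `decide_reset` and its return labels are hypothetical.
- `needs_reset` stands for the Go `needsReset`. The diff does not show where that value is set; its comment suggests it is true on the first connect after server start.
- The `isinstance` check reproduces how Go's comma-ok type assertion `resp["fresh_firmware_loaded"].(bool)` yields false when the key is missing or not a bool. As a result, a bridge that never sends the flag still gets the full reset.

```python
def decide_reset(resp: dict, needs_reset: bool) -> str:
    """Mirror of the reset branch in KneronDriver.Connect (sketch).

    Returns "reset", "skip", or "none" (hypothetical labels).
    """
    # Go: freshFirmware, _ := resp["fresh_firmware_loaded"].(bool)
    # A missing key or a non-bool value yields false in Go; do the same here.
    value = resp.get("fresh_firmware_loaded")
    fresh_firmware = value if isinstance(value, bool) else False

    skip_reset = fresh_firmware
    if needs_reset and not skip_reset:
        return "reset"  # restartBridge(): reset -> Loader -> reload firmware
    elif needs_reset and skip_reset:
        return "skip"   # firmware just loaded, session already clean
    return "none"       # needs_reset is false: no reset logic runs


for resp in ({"fresh_firmware_loaded": True},
             {"fresh_firmware_loaded": False},
             {}):
    print(resp, "->", decide_reset(resp, needs_reset=True))
```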


@@ -864,7 +864,11 @@ def handle_connect(params):
         kp.core.set_timeout(device_group=_device_group, milliseconds=_timeout_ms)
         _log(f"set_timeout succeeded")
-        # Firmware handling — chip-dependent
+        # Firmware handling — chip-dependent.
+        # fresh_firmware_loaded is used by Go driver to decide whether to
+        # skip the post-connect reset (freshly loaded firmware is already
+        # in a clean state — reset would just waste 30-60s reloading it).
+        fresh_firmware_loaded = False
         if "Loader" in fw_str:
             # Device is in USB Boot (Loader) mode and needs firmware
             if _device_chip == "KL720":
@@ -904,11 +908,14 @@ def handle_connect(params):
                     device_group=_device_group, milliseconds=_timeout_ms
                 )
                 fw_str = str(target_dev.firmware)
+                fresh_firmware_loaded = True
                 _log(f"Reconnected after firmware load, firmware: {fw_str}")
             else:
                 _log(f"WARNING: {_device_chip} firmware files not found, skipping firmware load")
         else:
-            # Not in Loader mode — firmware already present
+            # Not in Loader mode — firmware already present from a previous
+            # session. This is the state that triggers Error 15 on inference
+            # without reset, per observed bug.
             _log(f"{_device_chip}: firmware already present (normal). fw={fw_str}")
         return {
@@ -916,6 +923,7 @@ def handle_connect(params):
             "firmware": fw_str,
             "kn_number": f"0x{target_dev.kn_number:08X}",
             "chip": _device_chip,
+            "fresh_firmware_loaded": fresh_firmware_loaded,
         }
     except Exception as e: