jim800121chen d8a9517c9d feat(task-scheduler): Phase 0.8b — API key auth + /result endpoint

Auth pillar switched from an OAuth 2.0 resource server to a pre-shared API key
(visionA ↔ converter 1:1 internal trust). Added a GET /api/v1/jobs/:id/result
streaming endpoint so the visionA backend can relay NEF downloads.

Phase A (auth switch):
- Add apiKeyMiddleware (constant-time compare, tokenFingerprint, 4 audit events)
- Remove OAuth middleware + JWKS (keep oauthClient for promote → FAA)
- Switch 4 endpoints over to requireApiKey
- Add TRUST_PROXY env + Express trust proxy setting (forensic source_ip)

Phase B (/result endpoint):
- streaming NEF download with 5min timeout + concurrent cap 10
- Two-tier rate limit (burst 5/10s + sustained 20/min)
- Bandwidth quota (1 GB/hr + 6 GB/24hr) by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- filename quote-escape + RFC 5987 fallback + sanitize
- 8 /result audit events (complete forensic trail)

Design evolution record: docs/TODO-visionA-integration-v2.md (5/2 OAuth → 5/16 API key
→ 5/16 download via converter; corresponds to visionA repo ADR-015/016)

Tests: 597 → 666 (+69), 29 suites all pass
Security: APPROVE WITH CONDITIONS (single-instance deployment, 6 new env vars, 24hr monitoring)
npm audit: 3 vuln → 0 (transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-17 22:47:28 +08:00

5.5 KiB


Database Design

Status: Phase 1 complete; Phase 0.8b leaves it entirely untouched.

Companion docs: design-doc.md §3.7, api/api-jobs.md.

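1. Why Redis instead of PostgreSQL

Phase 1 data patterns are simple: a job is a state machine and the user index is key-value
The existing "Crash means Reset" philosophy suits Redis (bringing in PG for persistence would only add complexity)
A Redis Set is enough for the user index (a single user has < 10 jobs within 7 days)
If cross-crash recovery or multi-instance HA is needed later, re-evaluate PG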

2. Key layout

Key	Type	Purpose	TTL
`job:{job_id}`	String (JSON)	Full job record	7 days
`user:{user_id}:jobs`	Set	All job_ids for the user (regardless of status)	`EXPIRE 7d` on every write
`user:{user_id}:active_job`	String	Current in-progress job_id (= `created` or `running`)	Deleted when the job ends
`ratelimit:client:{client_id}`	Managed by `express-rate-limit`	Per-client_id rate limit	5 min
`queue:onnx` / `queue:bie` / `queue:nef`	Redis Stream	Worker task queues	none
`queue:done`	Redis Stream	Worker completion events	none
`queue:progress`	Redis Stream	Worker in-stage progress (optional, Phase 2)	none
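
The key layout above can be centralized in a few helper functions so that key names are not scattered around as string literals. This is a minimal sketch; the `keys` object, its function names, and the `SEVEN_DAYS_SECONDS` constant are hypothetical and not taken from the codebase. The `ratelimit:*` keys are left out because the table says `express-rate-limit` manages them.

```javascript
// Hypothetical helpers mirroring the key table above.
const SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60; // 604800, the TTL used for job data

const keys = {
  // Full job record, stored as a JSON string.
  job: (jobId) => `job:${jobId}`,
  // Set of every job_id that belongs to the user.
  userJobs: (userId) => `user:${userId}:jobs`,
  // The single in-progress job_id for the user.
  userActiveJob: (userId) => `user:${userId}:active_job`,
  // Stream names: stage is one of 'onnx', 'bie', 'nef', 'done', 'progress'.
  queue: (stage) => `queue:${stage}`,
};
```

For example, `keys.userActiveJob('visionA-user-12345')` yields `user:visionA-user-12345:active_job`.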

3. Job record schema

{
  // existing fields
  "job_id": "uuid",
  "created_at": "...",
  "updated_at": "...",
  "status": "ONNX | BIE | NEF | COMPLETED | FAILED",   // uppercase is still used internally
  "stage": "onnx | bie | nef | null",
  "progress": 0,
  "parameters": {
    "model_id": 1001,
    "version": "0001",
    "platform": "520",
    "enable_evaluate": false,
    "enable_sim_fp": false,
    "enable_sim_fixed": false,
    "enable_sim_hw": false
  },
  "output": {                                          // legacy format (backward compatible)
    "bie_path": null,
    "nef_path": null,
    "onnx_path": null
  },
  "result_object_keys": {                              // new format
    "onnx": "jobs/{job_id}/output/out.onnx",
    "bie": "jobs/{job_id}/output/out.bie",
    "nef": "jobs/{job_id}/output/out.nef"
  },
  "error": null,
  "origin": "api | web",
  "user_id": "visionA-user-12345",
  "tenant_id": "uuid-or-null",
  "created_by_client_id": "visionA-service",          // fixed value in API key mode
  "source_filename": "model.onnx",                    // added in Phase 0.8b (used for the /result endpoint filename)
  "input": {
    "filename": "model.onnx",
    "object_key": "jobs/{job_id}/input/model.onnx",
    "size_bytes": 204800000,
    "ref_images_count": 0
  },
  "stage_timings": {
    "onnx": { "started_at": "...", "completed_at": "..." },
    "bie": { "started_at": "...", "completed_at": null },
    "nef": null
  },
  "stage_progress": 0,
  "expires_at": "2026-05-23T12:00:00Z",
  "metadata": {},
  "promoted": false,                                  // idempotency flag
  "promoted_object_keys": []                          // targets that have already been promoted
}

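3.1 The `source_filename` field

New requirement in Phase 0.8b: the /result endpoint needs this field to build the download filename.

Write point: after multer receives the model file, the POST /api/v1/jobs handler writes multipart.filename (already sanitized) into job.source_filename.

Backend task: confirm that jobService.createJob writes this field (check the existing code; it may already be there, and if not, add it).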
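
The document says the filename is stored "already sanitized" but does not spell out the rules. The sketch below shows one plausible sanitizer, purely as an illustration: the function name `sanitizeFilename`, the `'model'` fallback, and the 255-character cap are assumptions, not the handler's actual behavior.

```javascript
// Hypothetical sanitizer for the value stored in job.source_filename.
// The real rules used by the POST /api/v1/jobs handler are not specified here.
function sanitizeFilename(name, fallback = 'model') {
  const base = String(name ?? '')
    .split(/[\\/]/)                  // split on both POSIX and Windows separators
    .pop()                           // keep only the last path component
    .replace(/[\x00-\x1f\x7f]/g, '') // strip control characters (incl. CR/LF)
    .trim()
    .slice(0, 255);                  // cap the length
  // Reject empty names and the special directory entries.
  return base === '' || base === '.' || base === '..' ? fallback : base;
}
```

With this sketch, `sanitizeFilename('../../etc/model.onnx')` returns `model.onnx`, and an empty or missing name falls back to `model`.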

4. External status mapping (unchanged)

See api/api-jobs.md §5.3 for details.

5. User index design

5.1 When keys are written

Creating a job:
  MULTI
  SET job:{id} {...}
  SADD user:{user_id}:jobs {id}
  EXPIRE user:{user_id}:jobs 604800
  SETNX user:{user_id}:active_job {id}
  EXEC

  If SETNX returns 0 → conflict: undo the SET / SADD above (MULTI/EXEC has no automatic rollback) and return 409
  If SETNX returns 1 → success

On completion / failure:
  MULTI
  SET job:{id} {...}
  DEL user:{user_id}:active_job   # DEL only when value == the current job_id (MULTI cannot branch; needs WATCH or a Lua script)
  EXEC
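
The "DEL only when value == the current job_id" condition cannot be expressed inside a plain MULTI block, because queued commands cannot read a value and branch on it. One common way to get this is a small compare-and-delete Lua script. The sketch below assumes an ioredis-style client (`redis.eval(script, numKeys, ...keysAndArgs)`); the names `RELEASE_ACTIVE_JOB_LUA` and `releaseActiveJob` are hypothetical.

```javascript
// Hypothetical compare-and-delete: removes active_job only if it still
// points at this job. Assumes an ioredis-style client.
const RELEASE_ACTIVE_JOB_LUA = `
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
end
return 0
`;

async function releaseActiveJob(redis, userId, jobId) {
  const deleted = await redis.eval(
    RELEASE_ACTIVE_JOB_LUA,
    1,                              // number of KEYS
    `user:${userId}:active_job`,    // KEYS[1]
    jobId                           // ARGV[1]
  );
  return deleted === 1; // true when this job was holding the slot
}
```

Because the script runs atomically on the Redis server, a newer job that has already claimed the slot is never deleted by a late completion event from an older job.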

5.2 Lua script (claim_active_job)

-- KEYS[1] = user:{user_id}:active_job
-- KEYS[2] = job:{job_id}
-- KEYS[3] = user:{user_id}:jobs
-- ARGV[1] = job_id
-- ARGV[2] = job_json
-- ARGV[3] = ttl_seconds

if redis.call('EXISTS', KEYS[1]) == 1 then
  return {'conflict', redis.call('GET', KEYS[1])}
end
redis.call('SET', KEYS[1], ARGV[1])
redis.call('SET', KEYS[2], ARGV[2], 'EX', tonumber(ARGV[3]))  -- job record gets the 7-day TTL from the key table
redis.call('SADD', KEYS[3], ARGV[1])
redis.call('EXPIRE', KEYS[3], tonumber(ARGV[3]))
return {'ok'}
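
On the Node side, the script above could be invoked roughly as follows. This is a sketch under assumptions: an ioredis-style `eval` call, a hypothetical wrapper name `claimActiveJob`, and a hypothetical result shape. The 409 mapping follows the conflict handling described in 5.1.

```javascript
// Hypothetical wrapper around the claim_active_job script.
// `script` is the Lua source; `redis` is assumed to be an ioredis-style client.
async function claimActiveJob(redis, script, { userId, jobId, jobJson, ttlSeconds }) {
  const reply = await redis.eval(
    script,
    3,                              // number of KEYS
    `user:${userId}:active_job`,    // KEYS[1]
    `job:${jobId}`,                 // KEYS[2]
    `user:${userId}:jobs`,          // KEYS[3]
    jobId,                          // ARGV[1]
    jobJson,                        // ARGV[2]
    String(ttlSeconds)              // ARGV[3]
  );
  // The Lua table comes back as an array: ['ok'] or ['conflict', existingJobId].
  const [status, existingJobId] = reply;
  if (status === 'conflict') {
    return { ok: false, httpStatus: 409, existingJobId };
  }
  return { ok: true };
}
```

Returning the existing job id on conflict lets the handler tell the caller which job is still in progress.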

5.3 Avoid `KEYS *`

Wrong approach: redis.keys('job:*') is O(N) and blocks Redis while it scans.

Correct approach:

const ids = await redis.smembers(`user:${userId}:jobs`);
const pipeline = redis.pipeline();
for (const id of ids) pipeline.get(`job:${id}`);
const results = await pipeline.exec();
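
With ioredis, `pipeline.exec()` resolves to an array of `[error, value]` pairs, one per queued command. Because the user Set has its TTL refreshed on every write, it can still list job ids whose `job:{id}` record has already expired, and those come back as `null`. A possible way to unpack the results is sketched below; `collectJobs` is a hypothetical helper name.

```javascript
// Turns the [error, value] pairs from an ioredis pipeline.exec() into job
// objects. Ids whose record has expired are returned separately so the
// caller can remove them from the user's Set (for example with SREM).
function collectJobs(ids, results) {
  const jobs = [];
  const staleIds = [];
  results.forEach(([err, raw], i) => {
    if (err) throw err;                  // surface per-command failures
    if (raw === null) staleIds.push(ids[i]);
    else jobs.push(JSON.parse(raw));
  });
  return { jobs, staleIds };
}
```

Usage would look like `const { jobs, staleIds } = collectJobs(ids, results);` right after the pipeline call.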

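6. Memory estimate

Each job record is about 2-4 KB (including stage_timings and similar fields)
Each element of a user index Set is < 40 bytes
1000 concurrent users × 10 jobs = 10k job records ≈ 40 MB

Redis handles this comfortably. The Converter Bucket lifecycle is 7 days and the Redis TTLs follow the same 7 days, so the memory ceiling stays under control.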
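
The arithmetic behind the estimate, using the upper bounds from the list above, can be written out as follows. It counts payload bytes only and ignores Redis's own per-key overhead, so it is a rough figure rather than a measurement.

```javascript
// Worked version of the estimate, using the upper bounds stated above.
const users = 1000;
const jobsPerUser = 10;
const bytesPerJobRecord = 4 * 1000; // 4 KB upper bound per job record
const bytesPerSetMember = 40;       // upper bound per Set element

const jobCount = users * jobsPerUser;            // 10,000 job records
const jobBytes = jobCount * bytesPerJobRecord;   // 40,000,000 bytes
const indexBytes = jobCount * bytesPerSetMember; // 400,000 bytes
const totalMB = (jobBytes + indexBytes) / 1e6;   // about 40.4 MB
```

The user index adds well under 1 MB, so the job records dominate the total.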

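7. M5 Option A: write to MinIO first, then Lua claim

This avoids the complexity of having to roll back Redis when the Lua claim succeeds but the MinIO write fails:

MinIO fails → return 502 immediately; Redis stays completely clean
Lua conflict / throw → clean up MinIO (fire-and-forget, with the 7d lifecycle as the backstop)
enqueue fails → compensate by releasing the Redis claim and cleaning up MinIO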
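
The three failure paths above can be sketched as a single orchestration function. Everything here is illustrative: `createJobFlow` and the injected `upload`, `claim`, `enqueue`, `release`, and `cleanup` functions are hypothetical placeholders for the MinIO and Redis calls, and the success return shape is an assumption.

```javascript
// Hypothetical orchestration of "MinIO first, then Lua claim".
// All dependencies are injected placeholders, not names from the codebase.
async function createJobFlow(deps, job) {
  const { upload, claim, enqueue, release, cleanup } = deps;

  try {
    await upload(job);               // 1. MinIO write comes first
  } catch {
    return { httpStatus: 502 };      // Redis was never touched
  }

  let claimed;
  try {
    claimed = await claim(job);      // 2. Lua claim
  } catch (err) {
    cleanup(job).catch(() => {});    // fire-and-forget; 7d lifecycle is the backstop
    throw err;
  }
  if (!claimed.ok) {
    cleanup(job).catch(() => {});    // conflict: drop the object we just uploaded
    return { httpStatus: 409 };
  }

  try {
    await enqueue(job);              // 3. hand off to the worker queue
  } catch (err) {
    await release(job);              // compensate: undo the Redis claim
    cleanup(job).catch(() => {});
    throw err;
  }
  return { ok: true };
}
```

Ordering the MinIO write first means the only step that ever needs a Redis compensation is a failed enqueue.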

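8. Phase 0.8b changes

None. The database is entirely untouched.

The only related changes:

created_by_client_id is fixed to visionA-service in API key mode (the middleware sets req.auth.clientId). This is handler behavior, not a schema change
Confirm that the source_filename field exists (the existing implementation may already have it; if not, Backend adds it as part of the Phase B tasks)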
