jim800121chen d8a9517c9d feat(task-scheduler): Phase 0.8b — API key auth + /result endpoint
Auth pillar 從 OAuth 2.0 resource server 改成 pre-shared API key
(visionA ↔ converter 1:1 internal trust)。新增 GET /api/v1/jobs/:id/result
streaming endpoint 給 visionA backend 中轉 NEF 下載。

Phase A(auth 切換):
- 新增 apiKeyMiddleware(constant-time compare、tokenFingerprint、4 audit events)
- 砍 OAuth middleware + JWKS(保留 oauthClient 供 promote → FAA 使用)
- 4 個 endpoint 換掛 requireApiKey
- 加 TRUST_PROXY env + Express trust proxy 設定(forensic source_ip)

Phase B(/result endpoint):
- streaming NEF download with 5min timeout + concurrent cap 10
- Two-tier rate limit(burst 5/10s + sustained 20/min)
- Bandwidth quota(1 GB/hr + 6 GB/24hr)by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- filename quote-escape + RFC 5987 fallback + sanitize
- 8 個 /result audit events(forensic 完整)

設計演進記錄:docs/TODO-visionA-integration-v2.md(5/2 OAuth → 5/16 API key
→ 5/16 download via converter;對應 visionA repo ADR-015/016)

Tests: 597 → 666 (+69)、29 suites all pass
Security: APPROVE WITH CONDITIONS(單 instance 部署、6 新 env、24hr 監控)
npm audit: 3 vuln → 0(transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:47:28 +08:00

5.5 KiB
Raw Blame History

Database 設計

狀態Phase 1 完工 — Phase 0.8b 完全不動

配套design-doc.md §3.7、api/api-jobs.md


1. 為什麼用 Redis、不用 PostgreSQL

  • Phase 1 資料模式簡單job 是 state machine、user index 是 key-value
  • 既有哲學「Crash 即 Reset」對 Redis 友善PG 引入持久化反而變複雜)
  • Redis Set 做 user 索引足夠(單 user 7 天內 < 10 個 job
  • 未來若要跨 Crash recovery / 多 instance HA再評估 PG

2. Key 規劃

Key 類型 用途 TTL
job:{job_id} String (JSON) Job 完整 record 7 天
user:{user_id}:jobs Set 該 user 所有 job_id不分狀態 每次寫入時 EXPIRE 7d
user:{user_id}:active_job String 當前 in-progress job_id= createdrunning 隨 job 結束刪除
ratelimit:client:{client_id} express-rate-limit 管理 per-client_id rate limit 5 min
queue:onnx / queue:bie / queue:nef Redis Stream Worker 任務佇列
queue:done Redis Stream Worker 完成事件
queue:progress Redis Stream Worker stage 內進度選配Phase 2

3. Job record schema

{
  // 既有欄位
  "job_id": "uuid",
  "created_at": "...",
  "updated_at": "...",
  "status": "ONNX | BIE | NEF | COMPLETED | FAILED",   // 內部仍用大寫
  "stage": "onnx | bie | nef | null",
  "progress": 0,
  "parameters": {
    "model_id": 1001,
    "version": "0001",
    "platform": "520",
    "enable_evaluate": false,
    "enable_sim_fp": false,
    "enable_sim_fixed": false,
    "enable_sim_hw": false
  },
  "output": {                                          // 舊格式(向後相容)
    "bie_path": null,
    "nef_path": null,
    "onnx_path": null
  },
  "result_object_keys": {                              // 新格式
    "onnx": "jobs/{job_id}/output/out.onnx",
    "bie": "jobs/{job_id}/output/out.bie",
    "nef": "jobs/{job_id}/output/out.nef"
  },
  "error": null,
  "origin": "api | web",
  "user_id": "visionA-user-12345",
  "tenant_id": "uuid-or-null",
  "created_by_client_id": "visionA-service",          // API key 模式下固定值
  "source_filename": "model.onnx",                    // Phase 0.8b 新增(/result endpoint filename 用)
  "input": {
    "filename": "model.onnx",
    "object_key": "jobs/{job_id}/input/model.onnx",
    "size_bytes": 204800000,
    "ref_images_count": 0
  },
  "stage_timings": {
    "onnx": { "started_at": "...", "completed_at": "..." },
    "bie": { "started_at": "...", "completed_at": null },
    "nef": null
  },
  "stage_progress": 0,
  "expires_at": "2026-05-23T12:00:00Z",
  "metadata": {},
  "promoted": false,                                  // 冪等性 flag
  "promoted_object_keys": []                          // 已 promote 的目標
}

3.1 source_filename 欄位

Phase 0.8b 新增需求:/result endpoint 需要這個欄位構造 download filename。

寫入點POST /api/v1/jobs handler 在 multer 接收 model 檔後,把 multipart.filename 寫入 job.source_filename(已 sanitized

Backend 端 task:確認 jobService.createJob 寫入這個欄位(檢查既有 code、可能已存在若沒有則補上


4. 對外 status 映射(不變)

詳見 api/api-jobs.md §5.3。


5. User 索引設計

5.1 Key 寫入時機

建立 job:
  MULTI
  SET job:{id} {...}
  SADD user:{user_id}:jobs {id}
  EXPIRE user:{user_id}:jobs 604800
  SETNX user:{user_id}:active_job {id}
  EXEC

  若 SETNX 回 0 → 衝突,回滾,回 409
  若 SETNX 回 1 → 成功

完成 / 失敗時:
  MULTI
  SET job:{id} {...}
  DEL user:{user_id}:active_job   # 僅在 value == 當前 job_id 時才 DEL
  EXEC

5.2 Lua scriptclaim_active_job

-- KEYS[1] = user:{user_id}:active_job
-- KEYS[2] = job:{job_id}
-- KEYS[3] = user:{user_id}:jobs
-- ARGV[1] = job_id
-- ARGV[2] = job_json
-- ARGV[3] = ttl_seconds

if redis.call('EXISTS', KEYS[1]) == 1 then
  return {'conflict', redis.call('GET', KEYS[1])}
end
redis.call('SET', KEYS[1], ARGV[1])
redis.call('SET', KEYS[2], ARGV[2])
redis.call('SADD', KEYS[3], ARGV[1])
redis.call('EXPIRE', KEYS[3], tonumber(ARGV[3]))
return {'ok'}

5.3 避免 KEYS *

錯誤做法redis.keys('job:*') O(N) 阻塞。

正確做法

const ids = await redis.smembers(`user:${userId}:jobs`);
const pipeline = redis.pipeline();
for (const id of ids) pipeline.get(`job:${id}`);
const results = await pipeline.exec();

6. 記憶體預估

  • 每個 job record 約 2-4 KB含 stage_timings 等)
  • 每個 user index Set 每個元素 < 40 bytes
  • 1000 並發 user × 10 jobs = 10k job record ≈ 40 MB

Redis 輕鬆。Converter Bucket lifecycle 7 天Redis 也跟著 TTL 7 天,記憶體上限可控。


7. M5 方案 A先寫 MinIO 後 Lua claim

避免「拿到 Lua claim 但 MinIO 失敗」需要 rollback Redis 的複雜度:

  • MinIO 失敗 → 直接回 502Redis 完全乾淨
  • Lua conflict / throw → cleanup MinIOfire-and-forget靠 7d lifecycle 兜底)
  • enqueue 失敗 → 補償 release Redis + cleanup MinIO

8. Phase 0.8b 變動

。Database 完全不動。

唯一相關變動:

  • created_by_client_id 在 API key 模式下會固定為 visionA-servicemiddleware 設定 req.auth.clientId)— 此為 handler 行為,不是 schema 改變
  • source_filename 欄位確認存在既有實作可能已有若無Backend 補上 — 屬於 Phase B 任務)