7 Commits

Author SHA1 Message Date
cbd1b9db28 fix(task-scheduler): Bug #10 - convention path fallback (visionA promote/result cannot get the NEF)
Hit during visionA e2e: the promote / result endpoints still cannot get
the NEF when status=COMPLETED (409 source_not_available / 404 result_not_found).

Root cause: the worker (services/workers/consumer.py:118) uploads NEF/BIE/ONNX
to the fixed convention path `jobs/{job_id}/out.{output_name}`, but on the
scheduler side advanceJob (jobService.js:246) never receives the output path
from the worker's done event, so job.output.{source}_path is always null and
the read side cannot find the object.

Fix A (read-side fallback, lowest risk):
- promote.js getJobOutputKey() + result.js extractNefObjectKey() derive the
  convention path when status=COMPLETED, the jobId is valid, and
  source ∈ {onnx,bie,nef}
- No changes to the worker, advanceJob, or the redis schema
- The fallback comes last, so the two explicit settings, result_object_keys
  and output.{source}_path, keep priority
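
The lookup order described above could look roughly like the sketch below. It is
not the code in promote.js or result.js: the function name `resolveOutputKey`, the
job-id pattern, the shape of `result_object_keys` (an object keyed by source), and
the assumption that `output_name` equals the source name are all hypothetical.

```javascript
// Sketch only: names, shapes and the id pattern are assumptions, not the repo's code.
const FALLBACK_SOURCES = new Set(['onnx', 'bie', 'nef']);

// Hypothetical job-id guard: no '/', no '..', nothing usable for path traversal.
const JOB_ID_PATTERN = /^[A-Za-z0-9_-]{1,64}$/;

function resolveOutputKey(job, source) {
  // This sketch validates the source up front, before any lookup.
  if (!job || !FALLBACK_SOURCES.has(source)) return null;

  // 1. Explicit result_object_keys entry has the highest priority.
  const fromResultKeys = job.result_object_keys && job.result_object_keys[source];
  if (fromResultKeys) return fromResultKeys;

  // 2. Explicit output.{source}_path comes next.
  const fromOutput = job.output && job.output[`${source}_path`];
  if (fromOutput) return fromOutput;

  // 3. Last resort: derive the worker's convention path.
  const idOk = typeof job.id === 'string' && JOB_ID_PATTERN.test(job.id);
  if (job.status === 'COMPLETED' && idOk) {
    // Assumption: output_name is the same string as the source (e.g. "nef").
    return `jobs/${job.id}/out.${source}`;
  }
  return null;
}
```

Keeping the derived path as the last step means a job that does carry an explicit
key is never overridden by the convention.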

Phase 2 backlog (to be completed):
- Have the worker → scheduler done event carry the output path
- advanceJob receives the output path and writes it to redis
- Remove this batch's fallback dead branch + the promote 409
  source_not_available dead branch (once the fallback is in place, a valid
  source always resolves to a key)

Tests: 666/666 pass (no regressions)
Reviewer: approved; strict guards, aligned with the worker convention, no path traversal risk

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:55:22 +08:00
b8457ddb95 fix(task-scheduler): Bug #8 - terminal release keeps the job record (visionA poll 404)
Hit during visionA e2e: polling GET /api/v1/jobs/:id less than 1s after the
job reaches COMPLETED returns 404.
Root cause: the jobService terminal release path (COMPLETED / FAILED) used
release_active_job.lua. That Lua script is meant for enqueue rollback: it DELs
job:{id} and SREMs from the user:jobs Set, so the post-completion APIs cannot
find the record.

Fix A (leave the old Lua untouched, add a dedicated one):
- Add src/redis/luaScripts/release_lock_only.lua: DELs only the active_job
  lock and keeps the job:{id} record and the user:{}:jobs Set; used for normal
  terminal states
- Add a releaseActiveLockOnly() JS wrapper (same API surface and atomic guard
  as releaseActiveJob)
- The jobService.js terminal release path now uses releaseActiveLockOnly
- The enqueue-failure rollback path still uses the old releaseActiveJob
  (semantically correct: in that case the job has not finished being
  scheduled, so a full cleanup is right)
- The t9 unit / integration tests use a destructuring alias rename to avoid
  changing 27 assertions
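
To make the difference between the two release paths concrete, here is a small
in-memory model. It is not the Lua script or the real wrapper: the store is a
plain `Map` stand-in for Redis, the lock key `user:{userId}:active_job` is a
hypothetical name, and the "atomic guard" is assumed to mean "release only if
the lock still holds this job id".

```javascript
// In-memory stand-in for Redis. Key names other than job:{id} are assumptions.
function createStore() {
  return { strings: new Map(), sets: new Map() };
}

// Models the enqueue-rollback release: lock, job record and Set membership all go.
function releaseActiveJob(store, userId, jobId) {
  const lockKey = `user:${userId}:active_job`; // hypothetical lock key
  if (store.strings.get(lockKey) !== jobId) return false; // guard: lock must match
  store.strings.delete(lockKey);
  store.strings.delete(`job:${jobId}`);
  const jobs = store.sets.get(`user:${userId}:jobs`);
  if (jobs) jobs.delete(jobId);
  return true;
}

// Models the terminal release: only the lock goes, the record stays readable.
function releaseActiveLockOnly(store, userId, jobId) {
  const lockKey = `user:${userId}:active_job`;
  if (store.strings.get(lockKey) !== jobId) return false; // same guard as above
  store.strings.delete(lockKey);
  return true;
}

// Helper for trying the two paths out: one user with one active job.
function seed(userId, jobId) {
  const store = createStore();
  store.strings.set(`user:${userId}:active_job`, jobId);
  store.strings.set(`job:${jobId}`, JSON.stringify({ status: 'COMPLETED' }));
  store.sets.set(`user:${userId}:jobs`, new Set([jobId]));
  return store;
}
```

After `releaseActiveLockOnly(seed('u1', 'j1'), 'u1', 'j1')` the `job:j1` entry is
still present, which is what a poll right after completion relies on.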

The post-completion API paths (visionA poll / Phase B /result / promote) all
need the job record to still exist; this fix satisfies that contract.

Tests: 666/666 pass (no regressions)
Reviewer: approved; rigorous design (atomic guard, rollback vs terminal semantics separated)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:02:54 +08:00
3b7aa4c79a fix(task-scheduler): server.js did not pass the minio dep to jobService (hit during visionA e2e)
While visionA was running the Phase 0.8b e2e, POST /api/v1/jobs returned
502 storage_unavailable.
Root cause: server.js did not pass the minio facade when creating jobService,
so jobService.js took the `deps.minio || null` fallback, writeInputToMinIO()
threw "minio dep is required" because minio=null, and the API returned 502.

Fix: pass the minio facade into the createJobService deps.
The legacy CRUD interface (which does not depend on minio) behaves as
before; minio is an optional dep.
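
The failure mode and the fix can be illustrated with a reduced factory. Only
`createJobService`, `deps.minio || null`, `writeInputToMinIO` and the error text
come from the message above; the facade method `putObject`, the `getJob` helper
and the in-memory `jobs` map are assumptions made for this sketch.

```javascript
// Reduced sketch of an optional-dependency factory; not the real jobService.js.
function createJobService(deps = {}) {
  const minio = deps.minio || null; // optional: legacy paths never touch it
  const jobs = new Map(); // hypothetical in-memory job table

  // Needs storage, so it fails loudly when the facade was not injected.
  function writeInputToMinIO(objectKey, body) {
    if (!minio) throw new Error('minio dep is required');
    return minio.putObject(objectKey, body); // hypothetical facade method
  }

  // Stand-in for the legacy CRUD surface, which works without minio.
  function getJob(id) {
    return jobs.get(id) || null;
  }

  return { writeInputToMinIO, getJob };
}

// Before the fix: the facade is missing, so the storage path throws.
const withoutMinio = createJobService({});

// After the fix: the caller hands the facade in through deps.
const fakeMinio = { putObject: (key) => key }; // test double, not a MinIO client
const withMinio = createJobService({ minio: fakeMinio });
```

Because the dependency stays optional, the missing wiring only surfaces on the
first request that reaches the storage path rather than at startup.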

Tests: 666/666 pass (no regressions)
Reviewer: approved; no Critical/Major findings on the correctness axis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:12:55 +08:00
d8a9517c9d feat(task-scheduler): Phase 0.8b - API key auth + /result endpoint
The auth pillar changes from an OAuth 2.0 resource server to a pre-shared
API key (visionA ↔ converter 1:1 internal trust). Adds a
GET /api/v1/jobs/:id/result streaming endpoint so the visionA backend can
relay NEF downloads.

Phase A (auth switch):
- Add apiKeyMiddleware (constant-time compare, tokenFingerprint, 4 audit events)
- Remove the OAuth middleware + JWKS (keep oauthClient for promote → FAA)
- Switch 4 endpoints to requireApiKey
- Add the TRUST_PROXY env + Express trust proxy setting (forensic source_ip)

Phase B (/result endpoint):
- streaming NEF download with 5min timeout + concurrent cap 10
- Two-tier rate limit (burst 5/10s + sustained 20/min)
- Bandwidth quota (1 GB/hr + 6 GB/24hr) by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- filename quote-escape + RFC 5987 fallback + sanitize
- 8 /result audit events (complete forensic coverage)
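
As an illustration of the two-tier limit (burst 5/10s + sustained 20/min), the
sketch below uses a sliding window over accepted request timestamps. The
algorithm, the function name `createTwoTierLimiter`, and keying by an arbitrary
string are assumptions; the commit does not say how the real limiter is built.

```javascript
// Sliding-window sketch of a two-tier limiter; the real implementation may differ.
function createTwoTierLimiter(tiers) {
  const limits = tiers || [
    { limit: 5, windowMs: 10000 },  // burst tier: 5 requests per 10 s
    { limit: 20, windowMs: 60000 }, // sustained tier: 20 requests per minute
  ];
  const longest = Math.max(...limits.map((tier) => tier.windowMs));
  const accepted = new Map(); // key -> timestamps (ms) of accepted requests

  return function allow(key, now = Date.now()) {
    // Drop timestamps that no tier can still see.
    const recent = (accepted.get(key) || []).filter((t) => now - t < longest);
    // A request passes only if every tier has room left in its own window.
    const ok = limits.every(
      (tier) => recent.filter((t) => now - t < tier.windowMs).length < tier.limit
    );
    if (ok) recent.push(now); // rejected requests do not consume quota here
    accepted.set(key, recent);
    return ok;
  };
}
```

Passing `now` explicitly keeps the limiter easy to exercise in tests; in this
sketch a client that sends 5 requests every 10 s is stopped by the sustained
tier once 20 requests fall inside the same minute.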

Design evolution record: docs/TODO-visionA-integration-v2.md (5/2 OAuth → 5/16 API key
→ 5/16 download via converter; corresponds to visionA repo ADR-015/016)

Tests: 597 → 666 (+69), 29 suites all pass
Security: APPROVE WITH CONDITIONS (single-instance deployment, 6 new env vars, 24hr monitoring)
npm audit: 3 vuln → 0 (transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:47:28 +08:00
4d381c0b50 feat(task-scheduler): Phase 1 — modularize server + add OAuth/JWKS + /api/v1/* routes
Refactor server.js (647 → 99 lines) into 30+ modules under src/:
- auth/: JWKS validation, JWT middleware, OAuth client_credentials
- routes/v1/: jobs (POST/GET/:id) + promote with input validation
- routes/legacy.js: existing /jobs multipart path (backward compatible)
- services/: jobService, healthService, sseService, statusMapper,
  doneListener
- middleware/: requestId, errorHandler, perClientRateLimit,
  uploadConcurrency, upload (multer + storage)
- redis/: Lua scripts for atomic claim/release_active_job
- storage/: local + minio adapters; fileAccessAgent/: PUT promote client
- config.js: env var validation with fail-fast

Phase 1 features (T1–T11):
- T1 Auth middleware + JWKS (Member Center OAuth2 resource server)
- T2 OAuth client (Member Center client_credentials, Basic auth)
- T3 /api/v1/* router skeleton
- T4 server.js refactor (legacy endpoints fully preserved, real-Redis
  regression verified — existing worker consumer group untouched)
- T5 POST /api/v1/jobs (multipart, OWASP-audited, 2 Critical / 6 Major
  fixed; Risk-A/B documented as accepted)
- T6 GET /api/v1/jobs + GET /:id (cursor pagination, ETag, IDOR-safe)
- T7 POST /jobs/:id/promote (FAA PUT with own service token, 300s
  timeout, fail-fast on missing FAA URL)
- T8 /health upgrade (healthy/degraded/unhealthy + 30s background cache)
- T9 stage_timings (release_active_job in terminal states)
- T10 env + Docker integration (MULTIPART_* + concurrency limiter)
- T11 README (498 lines) + OpenAPI 3.0 spec (1588 lines)

Tests: 630 pass across 29 suites. Updated Dockerfile + .dockerignore +
docker-compose.yml env passthrough (no hardcoded secrets, fail-fast on
missing required vars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:55:05 +08:00
efa67d59a4 Add web frontend, MinIO storage, monitoring, and docker-compose deployment
- Frontend: rewrite Home.vue to match backend POST /jobs API (remove single-stage options)
- Frontend: add Monitor page (/monitor) for queue and job monitoring
- Frontend: add job history with localStorage tracking (per-browser)
- Frontend: fix Nginx proxy rewrite (/api -> /) and add 500MB upload limit
- Backend: add MinIO storage support (STORAGE_BACKEND=minio) alongside local mode
- Backend: add GET /queues/stats API for queue monitoring
- Backend: fix download handler for MinIO (buffer mode for Node 18 compat)
- Workers: add S3/MinIO download/upload in consumer.py with isolated temp dirs
- Workers: add s3_storage.py helper with lifecycle rule (7-day TTL)
- Docker: add docker-compose.yml with all services (web, scheduler, redis, workers)
- Docker: ports mapped to 9500 (web) and 9501 (scheduler)
- Config: add .env to .gitignore to protect secrets

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:04:09 +08:00
warrenchen
31f61b5122 Initial project 2026-01-28 06:16:04 +00:00