16 Commits

Author SHA1 Message Date
cbd1b9db28 fix(task-scheduler): Bug #10: convention path fallback (visionA promote/result cannot get the NEF)
Hit during visionA e2e: the promote / result endpoints still cannot get
the NEF when status=COMPLETED (409 source_not_available / 404 result_not_found).

Root cause: the worker (services/workers/consumer.py:118) uploads
NEF/BIE/ONNX to the fixed convention path `jobs/{job_id}/out.{output_name}`,
but advanceJob on the scheduler side (jobService.js:246) never receives the
output path from the worker's done event, so job.output.{source}_path is
always null and the read side cannot find the object.

Fix A (read-side fallback, lowest risk):
- promote.js getJobOutputKey() and result.js extractNefObjectKey() derive
  the convention path when status=COMPLETED, the jobId is valid, and
  source ∈ {onnx,bie,nef}
- No changes to the worker, advanceJob, or the redis schema
- The fallback comes last, so the two explicit settings
  (result_object_keys / output.{source}_path) keep their priority
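
The lookup order described above can be sketched roughly as follows. This is
a minimal illustration, not the code in promote.js or result.js:
`resolveOutputKey`, `JOB_ID_PATTERN`, and the `job.id` field are hypothetical
names, the relative order of the two explicit settings is an assumption, and
the sketch assumes `output_name` equals the source name (for example `nef`).

```javascript
// Hypothetical sketch: resolve the object key for one job output.
// Field names follow the commit message; everything else is assumed.
const FALLBACK_SOURCES = new Set(["onnx", "bie", "nef"]);
// Assumed jobId shape; a strict allow-list keeps "/" and ".." out of the key.
const JOB_ID_PATTERN = /^[A-Za-z0-9_-]{1,64}$/;

function resolveOutputKey(job, source) {
  // 1. An explicit per-source key wins.
  const explicit = job.result_object_keys && job.result_object_keys[source];
  if (explicit) return explicit;

  // 2. Then the path recorded on the job output, if any.
  const recorded = job.output && job.output[`${source}_path`];
  if (recorded) return recorded;

  // 3. Last resort: derive the worker's convention path.
  const eligible =
    job.status === "COMPLETED" &&
    typeof job.id === "string" &&
    JOB_ID_PATTERN.test(job.id) &&
    FALLBACK_SOURCES.has(source);
  return eligible ? `jobs/${job.id}/out.${source}` : null;
}
```

Keeping the derived path as the final step means any job that already
carries an explicit key behaves exactly as before.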

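Phase 2 backlog (to be completed):
- Have the worker → scheduler done event carry the output path
- Have advanceJob receive the output path and write it to redis
- Remove the fallback added in this batch once it becomes a dead branch,
  plus the promote 409 source_not_available dead branch (with the
  fallback, a valid source always yields a key)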

Tests: 666/666 pass (no regressions)
Reviewer: approved; strict guards, aligned with the worker convention, no path traversal risk

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:55:22 +08:00
b8457ddb95 fix(task-scheduler): Bug #8: terminal release keeps the job record (visionA poll 404)
Hit during visionA e2e: polling GET /api/v1/jobs/:id less than 1s after the
job reaches COMPLETED returns 404.
Root cause: the jobService terminal release path (COMPLETED / FAILED) used
release_active_job.lua. That Lua script is meant for enqueue rollback: it
DELs job:{id} and SREMs the user:jobs Set, so post-completion APIs cannot
find the record.

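Fix A (leave the old Lua untouched, add a dedicated Lua script):
- Add src/redis/luaScripts/release_lock_only.lua: it only DELs the
  active_job lock and keeps the job:{id} record and the user:{}:jobs Set;
  used for normal terminal states
- Add a releaseActiveLockOnly() JS wrapper (same API surface and atomic
  guard as releaseActiveJob)
- The jobService.js terminal release path now uses releaseActiveLockOnly
- The enqueue-failure rollback path still uses the old releaseActiveJob
  (semantically correct: in that case the job has not finished being
  scheduled, so a full cleanup is the right behavior)
- The t9 unit / integration tests use a destructuring alias rename to avoid
  changing 27 assertions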
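
The difference between the two release paths can be modeled in memory. The
sketch below is only an illustration of the semantics: it does not talk to
Redis, the key names (`active_job:{userId}`, `user:{userId}:jobs`) and the
guard condition (the lock value must equal the job id) are assumptions, and
in the real service the atomicity would come from running the logic inside
a Lua script.

```javascript
// In-memory stand-in for Redis, used only to illustrate the two release paths.
function createStore() {
  return { strings: new Map(), hashes: new Map(), sets: new Map() };
}

// Guard shared by both paths: act only if the lock is held by this job.
function holdsLock(store, userId, jobId) {
  return store.strings.get(`active_job:${userId}`) === jobId;
}

// Rollback path: the job never finished scheduling, so remove every trace.
function releaseActiveJob(store, userId, jobId) {
  if (!holdsLock(store, userId, jobId)) return false;
  store.strings.delete(`active_job:${userId}`);
  store.hashes.delete(`job:${jobId}`);
  const jobs = store.sets.get(`user:${userId}:jobs`);
  if (jobs) jobs.delete(jobId);
  return true;
}

// Terminal path: free the lock, keep the record for post-completion reads.
function releaseActiveLockOnly(store, userId, jobId) {
  if (!holdsLock(store, userId, jobId)) return false;
  store.strings.delete(`active_job:${userId}`);
  return true;
}
```

After `releaseActiveLockOnly` the user can enqueue a new job, while a poll
for the finished job still finds its record.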

The post-completion API paths (visionA poll / Phase B /result / promote) all
need the job record to still exist; this fix upholds that contract.

Tests: 666/666 pass (no regressions)
Reviewer: approved; rigorous design (atomic guard, rollback vs terminal semantics separated)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:02:54 +08:00
3b7aa4c79a fix(task-scheduler): server.js did not pass the minio dep to jobService (hit during visionA e2e)
While visionA was running the Phase 0.8b e2e, POST /api/v1/jobs returned
502 storage_unavailable.
Root cause: server.js did not pass the minio facade when creating
jobService, so jobService.js took the `deps.minio || null` fallback,
writeInputToMinIO() threw "minio dep is required" because minio=null, and
the API returned 502.

Fix: pass the minio facade into the createJobService deps.
The legacy CRUD interface (which does not depend on minio) is unchanged;
minio is an optional dep.
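
A minimal sketch of the optional-dependency pattern involved, assuming a
factory shaped like the one the message describes. Only `deps.minio || null`
and the error text come from the commit; `putObject` and `describeJob` are
hypothetical names used for illustration.

```javascript
// Hypothetical shape of the factory; not the actual jobService.js.
function createJobService(deps = {}) {
  const minio = deps.minio || null;

  function writeInputToMinIO(key, body) {
    // Without the facade the storage path cannot work, so fail loudly.
    if (!minio) throw new Error("minio dep is required");
    // putObject is an assumed facade method, not a confirmed API.
    return minio.putObject(key, body);
  }

  // Legacy CRUD stand-in: works without any storage dependency.
  function describeJob(id) {
    return { id, storage: minio ? "minio" : "none" };
  }

  return { writeInputToMinIO, describeJob };
}
```

With this shape, omitting `minio` at the call site only surfaces once a
storage-backed path runs, which is consistent with the bug appearing first
in the e2e run.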

Tests: 666/666 pass (no regressions)
Reviewer: approved; no Critical/Major findings on the correctness axis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:12:55 +08:00
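aeaecb8c06 fix(compose): Phase 0.8b deploy blocker: env passthrough + naming spec
Commit d8a9517 missed docker-compose.yml: the scheduler service environment
block did not pass through the new Phase 0.8b env vars, so the container
could not read them even when the stage .env set them. After deploy, an
undefined CONVERTER_API_KEY made the service start up rejecting all
requests with 503.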

docker-compose.yml:
- Pass through 10 new Phase 0.8b env vars (CONVERTER_API_KEY has no default,
  fail-secure; the others use ${VAR:-default}, fail-soft)
- Remove 9 obsolete OAuth resource-server env vars (MEMBER_CENTER_ISSUER /
  JWKS_URL / AUDIENCE / CONVERTER_TENANT_ID / SCOPE_* / JWKS_* / JWT_*)
- Keep 8 env vars used by promote → FAA (MEMBER_CENTER_TOKEN_URL /
  KNERON_CONVERTER_CLIENT_ID/SECRET / FILE_ACCESS_AGENT_* /
  OAUTH_TOKEN_* / PROMOTE_TIMEOUT_MS)

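docs/autoflow/04-architecture/api/api-result.md §16:
- Add an Env Naming Reference Table (30 canonical env names)
- Settle on source code as the single source of truth, with env.example
  aligned to it
- Confirm the naming spec for the 8 /result env vars plus the other 22
- Keep a historical note: the Orchestrator previously used imagined
  abbreviated names (_MAX / _HOURLY_QUOTA / RESULT_CONCURRENT_STREAM_MAX),
  which caused naming confusion; §16 is the reference standard for future
  prompts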

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 01:01:59 +08:00
d8a9517c9d feat(task-scheduler): Phase 0.8b — API key auth + /result endpoint
Switch the auth pillar from an OAuth 2.0 resource server to a pre-shared
API key (visionA ↔ converter 1:1 internal trust). Add a
GET /api/v1/jobs/:id/result streaming endpoint so the visionA backend can
relay NEF downloads.

Phase A (auth switch):
- Add apiKeyMiddleware (constant-time compare, tokenFingerprint, 4 audit events)
- Remove the OAuth middleware + JWKS (keep oauthClient for promote → FAA)
- Switch 4 endpoints over to requireApiKey
- Add the TRUST_PROXY env var + Express trust proxy setting (forensic source_ip)
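
A constant-time key check with a log-safe fingerprint might look like the
sketch below. It is an assumption-laden illustration rather than the actual
apiKeyMiddleware: the `x-api-key` header name, the 401 status, the error
strings, and the 12-character fingerprint length are all hypothetical, and
audit events are left out. The 503 branch mirrors the fail-secure behavior
expected when no key is configured.

```javascript
const crypto = require("node:crypto");

// Hash both sides to fixed-length digests so timingSafeEqual never receives
// buffers of different lengths (it throws in that case).
function safeEqual(a, b) {
  const da = crypto.createHash("sha256").update(String(a)).digest();
  const db = crypto.createHash("sha256").update(String(b)).digest();
  return crypto.timingSafeEqual(da, db);
}

// Short, non-reversible identifier for logs; the length is an assumption.
function tokenFingerprint(token) {
  return crypto.createHash("sha256").update(String(token)).digest("hex").slice(0, 12);
}

// Express-style middleware factory. Header name and status codes are assumed.
function requireApiKey(expectedKey) {
  return function apiKeyMiddleware(req, res, next) {
    if (!expectedKey) {
      // Fail secure: with no configured key, nothing is accepted.
      return res.status(503).json({ error: "auth_not_configured" });
    }
    const presented = req.headers["x-api-key"];
    if (typeof presented !== "string" || !safeEqual(presented, expectedKey)) {
      return res.status(401).json({ error: "invalid_api_key" });
    }
    // Expose only the fingerprint downstream, never the raw key.
    req.tokenFingerprint = tokenFingerprint(presented);
    return next();
  };
}
```

Comparing digests instead of raw strings keeps the comparison time
independent of where the first mismatching character sits.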

Phase B (/result endpoint):
- streaming NEF download with 5min timeout + concurrent cap 10
- Two-tier rate limit (burst 5/10s + sustained 20/min)
- Bandwidth quota (1 GB/hr + 6 GB/24hr) by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- filename quote-escape + RFC 5987 fallback + sanitize
- 8 /result audit events (complete forensic coverage)
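
The two-tier rate limit can be sketched as a sliding-window counter that is
checked against both tiers on each request. The limits below mirror the
numbers in this message; the function names, the in-memory data layout, and
the choice not to count rejected requests are assumptions, and the real
middleware may be implemented differently.

```javascript
// Sliding-window limiter with two tiers checked together.
const TIERS = [
  { name: "burst", limit: 5, windowMs: 10_000 },
  { name: "sustained", limit: 20, windowMs: 60_000 },
];

function createTwoTierLimiter(tiers = TIERS) {
  const hits = new Map(); // key -> accepted request timestamps (ms)
  const longest = Math.max(...tiers.map((t) => t.windowMs));

  return function check(key, now = Date.now()) {
    // Drop timestamps that fall outside the longest window.
    const recent = (hits.get(key) || []).filter((t) => now - t < longest);
    for (const tier of tiers) {
      const inWindow = recent.filter((t) => now - t < tier.windowMs).length;
      if (inWindow >= tier.limit) {
        hits.set(key, recent);
        return { allowed: false, tier: tier.name };
      }
    }
    // Only accepted requests are recorded.
    recent.push(now);
    hits.set(key, recent);
    return { allowed: true, tier: null };
  };
}
```

Passing `now` explicitly keeps the limiter deterministic in tests; keying by
something like the token fingerprint would scope the limits per caller.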

Design evolution record: docs/TODO-visionA-integration-v2.md (5/2 OAuth →
5/16 API key → 5/16 download via converter; corresponds to ADR-015/016 in
the visionA repo)

Tests: 597 → 666 (+69), 29 suites all pass
Security: APPROVE WITH CONDITIONS (single-instance deployment, 6 new env vars, 24hr monitoring)
npm audit: 3 vuln → 0 (transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:47:28 +08:00
cff9236699 docs: migrate Autoflow shared documents to docs/autoflow/
Move PRD, design specs, architecture docs, and TDD from .autoflow/
(personal/per-branch layer) to docs/autoflow/ (shared layer that
goes into git) per the new Autoflow workspace layout.

Files moved:
- 02-prd/PRD.md
- 03-design/design-review.md
- 03-design/user-flow-cross-system.md
- 04-architecture/TDD.md
- 04-architecture/design-doc.md
- 04-architecture/security.md

The originals were never tracked, so git mv reduced to a filesystem
rename with no history to preserve. .autoflow/ remains for personal
notes (progress.md, review reports, testing logs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:59:21 +08:00
7404ca9bc8 docs: align Design.md with Phase 1 architecture
Restructure section 2 from component overview to tech selection table
covering Scheduler, Worker, Queue, Job State, Artifact Store, Web UI.
Document why Redis Stream for the queue (language-neutral, consumer
groups, single Redis instance, Crash Reset alignment) and detail
worker horizontal scaling via consumer groups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:55:14 +08:00
4d381c0b50 feat(task-scheduler): Phase 1 — modularize server + add OAuth/JWKS + /api/v1/* routes
Refactor server.js (647 → 99 lines) into 30+ modules under src/:
- auth/: JWKS validation, JWT middleware, OAuth client_credentials
- routes/v1/: jobs (POST/GET/:id) + promote with input validation
- routes/legacy.js: existing /jobs multipart path (backward compatible)
- services/: jobService, healthService, sseService, statusMapper,
  doneListener
- middleware/: requestId, errorHandler, perClientRateLimit,
  uploadConcurrency, upload (multer + storage)
- redis/: Lua scripts for atomic claim/release_active_job
- storage/: local + minio adapters; fileAccessAgent/: PUT promote client
- config.js: env var validation with fail-fast

Phase 1 features (T1–T11):
- T1 Auth middleware + JWKS (Member Center OAuth2 resource server)
- T2 OAuth client (Member Center client_credentials, Basic auth)
- T3 /api/v1/* router skeleton
- T4 server.js refactor (legacy endpoints fully preserved, real-Redis
  regression verified — existing worker consumer group untouched)
- T5 POST /api/v1/jobs (multipart, OWASP-audited, 2 Critical / 6 Major
  fixed; Risk-A/B documented as accepted)
- T6 GET /api/v1/jobs + GET /:id (cursor pagination, ETag, IDOR-safe)
- T7 POST /jobs/:id/promote (FAA PUT with own service token, 300s
  timeout, fail-fast on missing FAA URL)
- T8 /health upgrade (healthy/degraded/unhealthy + 30s background cache)
- T9 stage_timings (release_active_job in terminal states)
- T10 env + Docker integration (MULTIPART_* + concurrency limiter)
- T11 README (498 lines) + OpenAPI 3.0 spec (1588 lines)

Tests: 630 pass across 29 suites. Updated Dockerfile + .dockerignore +
docker-compose.yml env passthrough (no hardcoded secrets, fail-fast on
missing required vars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:55:05 +08:00
548f53ccbf chore: untrack .env and ignore Autoflow personal layer
- Stop tracking .env (was committed in 31f61b5 with plaintext MinIO
  secrets; existing history cleanup is tracked as Risk-A in Phase 1
  backlog and will be addressed before launch)
- Ignore .autoflow/ entirely (per-branch personal notes layer)
- Ignore two local-only reference assets: drawio architecture export
  and toolchain manual PDF

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:54:45 +08:00
75e3a9b2cc Add worker dependencies documentation with binary mapping per stage
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 00:09:14 +08:00
efa67d59a4 Add web frontend, MinIO storage, monitoring, and docker-compose deployment
- Frontend: rewrite Home.vue to match backend POST /jobs API (remove single-stage options)
- Frontend: add Monitor page (/monitor) for queue and job monitoring
- Frontend: add job history with localStorage tracking (per-browser)
- Frontend: fix Nginx proxy rewrite (/api -> /) and add 500MB upload limit
- Backend: add MinIO storage support (STORAGE_BACKEND=minio) alongside local mode
- Backend: add GET /queues/stats API for queue monitoring
- Backend: fix download handler for MinIO (buffer mode for Node 18 compat)
- Workers: add S3/MinIO download/upload in consumer.py with isolated temp dirs
- Workers: add s3_storage.py helper with lifecycle rule (7-day TTL)
- Docker: add docker-compose.yml with all services (web, scheduler, redis, workers)
- Docker: ports mapped to 9500 (web) and 9501 (scheduler)
- Config: add .env to .gitignore to protect secrets

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:04:09 +08:00
warrenchen
fdebf4db5d Refactor workers to use backend interfaces for quantization, compilation, and evaluation; add optional flags for simulation in request schemas and update documentation accordingly. 2026-02-06 08:24:08 +00:00
warrenchen
bc98456d74 Add initial draft of Toolchain Flow Modularization Notes 2026-02-05 17:08:17 +00:00
warrenchen
e93a1a5996 Complete tests for workers, with source file format onnx and tflite 2026-01-29 03:04:41 +00:00
warrenchen
0000f19d5e Refactor requirements.txt to update and organize dependencies 2026-01-28 06:17:12 +00:00
warrenchen
31f61b5122 Initial project 2026-01-28 06:16:04 +00:00