Auth pillar switched from OAuth 2.0 resource server to a pre-shared API key (visionA ↔ converter 1:1 internal trust). Adds the GET /api/v1/jobs/:id/result streaming endpoint so the visionA backend can relay NEF downloads.

Phase A (auth switch):
- Add apiKeyMiddleware (constant-time compare, tokenFingerprint, 4 audit events)
- Remove the OAuth middleware + JWKS (keep oauthClient for promote → FAA)
- Mount requireApiKey on 4 endpoints
- Add TRUST_PROXY env + Express trust proxy setting (forensic source_ip)

Phase B (/result endpoint):
- Streaming NEF download with 5min timeout + concurrent cap 10
- Two-tier rate limit (burst 5/10s + sustained 20/min)
- Bandwidth quota (1 GB/hr + 6 GB/24hr) by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- Filename quote-escape + RFC 5987 fallback + sanitize
- 8 /result audit events (forensically complete)

Design evolution record: docs/TODO-visionA-integration-v2.md (5/2 OAuth → 5/16 API key → 5/16 download via converter; corresponds to visionA repo ADR-015/016)

Tests: 597 → 666 (+69), 29 suites all pass
Security: APPROVE WITH CONDITIONS (single-instance deployment, 6 new env vars, 24hr monitoring)
npm audit: 3 vuln → 0 (transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Task Scheduler: Kneron Model Converter Phase 1

Job management and queue orchestration service for the Kneron Model Converter. It accepts conversion requests from upstream callers (visionA-backend / Web UI), coordinates the ONNX → BIE → NEF pipeline, and promotes successful result files to the File Access Agent / NAS model library.

> **Full Phase 1 external API spec** → see `docs/openapi.yaml`

---

## 1. Project overview

### 1.1 Service role

```
                   public Internet                            internal
                          ↓
visionA-backend ─→ Nginx (443, public vhost)  ─→ /api/v1/* ─→ task-scheduler ─→ Worker
                                                                     │            │
Web UI          ─→ Nginx (80, internal vhost) ─→ /jobs ──────────────┘            │
                                                                                  ↓
                                                                          ONNX → BIE → NEF
                                                                                  ↓
                                                                            MinIO Bucket
                                                                                  ↓
                                                                    POST /api/v1/jobs/:id/promote
                                                                                  ↓
                                                                          File Access Agent
                                                                                  ↓
                                                                          NAS model library
```

task-scheduler is the only application-layer component exposed to upstream callers in Phase 1. It provides:

- External API (**new in Phase 1**): `/api/v1/*`, 5 endpoints (including the Phase 0.8b `/result` endpoint) + 2 reserved for Phase 2
- Internal API (**kept as-is**): `/jobs/*`, 6 legacy endpoints (used by the Web UI)
- Health check: `/health` (public)

### 1.2 Tech stack

| Layer | Technology | Version |
|-------|------------|---------|
| Runtime | Node.js | 18+ (alpine image, used for deployment) |
| Web framework | Express | 4.x |
| Queue | Redis Stream + ioredis | 5.x |
| Object storage | MinIO (S3 compatible, AWS SDK v3) | latest |
| External auth | Pre-shared API key (Phase 0.8b) | n/a |
| Auth to FAA | OAuth 2.0 client_credentials | jose 5.x |
| Upload | multer (memoryStorage) | 1.4.x |
| Rate limiting | express-rate-limit | 6.x |
| Security headers | helmet | 7.x |
| Testing | Jest | 29.x |

---

## 2. Prerequisites

| Item | Version / notes |
|------|-----------------|
| Node.js | 18+ (native `fetch`, `duplex: 'half'`) |
| npm | 9+ |
| Docker / docker-compose (optional) | 24.x+ |
| Redis | 7.x (required in both dev and prod) |
| MinIO | latest (must be enabled for POST /api/v1/jobs) |
| Member Center | OAuth 2.0 Authorization Server, **used only during promote** (converter → FAA token). Since visionA → converter moved to an API key, JWKS is no longer a dependency |
| File Access Agent | Called during promote; must support `PUT /files/{key}` |
| `CONVERTER_API_KEY` | 64 hex chars, generated with `openssl rand -hex 32`, shared with visionA-backend |

If the dev environment has no real Member Center / FAA, placeholder values can be used (see `env.example`).

---

## 3. Getting started

### 3.1 Local development (plain Node)

```bash
cd apps/task-scheduler
cp env.example .env
# Edit .env and replace at least the following placeholders with real values:
#   - CONVERTER_API_KEY (visionA → converter external auth, required; generate with `openssl rand -hex 32`)
#   - MEMBER_CENTER_TOKEN_URL (used to obtain the FAA token during promote)
#   - KNERON_CONVERTER_CLIENT_ID / CLIENT_SECRET (identity for the promote stage)
#   - FILE_ACCESS_AGENT_* (promote target)
#   - MINIO_* (if STORAGE_BACKEND=minio)

npm install
npm start
# → listens on PORT (default 4000)
```

### 3.2 Standalone Docker

```bash
docker build -t task-scheduler:dev apps/task-scheduler
docker run --rm --env-file apps/task-scheduler/.env -p 4000:4000 task-scheduler:dev
```

### 3.3 docker-compose (recommended)

The repository root already contains a `docker-compose.yml` that also starts Redis, MinIO, the Workers, and the frontend:

```bash
cd /path/to/kneron_model_converter
cp apps/task-scheduler/env.example .env  # or maintain a single root .env
docker compose up -d --build
```

Exposed ports:

- Scheduler API: `http://localhost:4000`
- Web UI: `http://localhost:3000`
- MinIO Console: `http://localhost:9001`

### 3.4 Health check

```bash
curl http://localhost:4000/health | jq .
```

The response carries a three-level status (healthy / degraded / unhealthy) plus the state of each dependency; see § 10 (Monitoring).

### 3.5 Graceful shutdown

The service listens for `SIGTERM` / `SIGINT`. On receipt it first stops the health background polling, then lets Express close naturally. For container / K8s deployments, a `terminationGracePeriodSeconds` of at least 30 seconds is recommended.

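The shutdown order described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `server.js`: `createShutdownHandler`, `stopHealthPolling`, the injectable `exit`, and the fallback timer are hypothetical names introduced for the example.

```javascript
// Hypothetical helper: builds a signal handler that stops background polling
// first, then lets the HTTP server drain in-flight requests before exiting.
function createShutdownHandler({ server, stopHealthPolling, exit = process.exit, graceMs = 30_000 }) {
  let shuttingDown = false;
  return function onSignal() {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // 1. Stop the /health background poller.
    stopHealthPolling();

    // 2. Safety net (assumption): force exit if draining exceeds the grace period.
    const timer = setTimeout(() => exit(1), graceMs);
    timer.unref();

    // 3. Stop accepting new connections; the callback runs once existing ones end.
    server.close((err) => {
      clearTimeout(timer);
      exit(err ? 1 : 0);
    });
  };
}

// Usage (assuming `server` is the value returned by app.listen()):
//   const onSignal = createShutdownHandler({ server, stopHealthPolling });
//   process.on('SIGTERM', onSignal);
//   process.on('SIGINT', onSignal);
```

Keeping the fallback timer shorter than `terminationGracePeriodSeconds` would let the process exit on its own terms before the orchestrator sends `SIGKILL`.
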
---

## 4. Project structure

```
apps/task-scheduler/
├── server.js                      ← entry (< 140 lines; assembles deps, starts the listener, listens)
├── src/
│   ├── app.js                     ← Express app factory
│   ├── config.js                  ← reads env centrally, fail-fast at startup
│   ├── redis.js                   ← Redis client + helpers
│   ├── auth/
│   │   ├── apiKeyMiddleware.js    ← requireApiKey() Express middleware (since Phase 0.8b A3;
│   │   │                            visionA → converter auth, replaces the former OAuth resource server)
│   │   └── oauthClient.js         ← Converter as OAuth client (client_credentials,
│   │                                obtains the FAA token during promote)
│   ├── fileAccessAgent/
│   │   ├── client.js              ← FAA HTTP client (PUT only, retry + invalidate on 401)
│   │   └── errors.js
│   ├── middleware/
│   │   ├── errorHandler.js        ← unified error format (v1 only)
│   │   ├── requestId.js           ← X-Request-Id pass-through / generation
│   │   ├── perClientRateLimit.js  ← per-client_id rate limiter
│   │   ├── upload.js              ← multer configuration
│   │   └── uploadConcurrency.js   ← per-process upload semaphore (OOM guard)
│   ├── routes/
│   │   ├── legacy.js              ← the 6 /jobs* endpoints (used by the Web UI)
│   │   └── v1/
│   │       ├── index.js           ← /api/v1 mount + internal errorHandler
│   │       ├── jobs.js            ← POST/GET /jobs, GET /jobs/:id, reserved 501s
│   │       ├── promote.js         ← POST /jobs/:id/promote
│   │       └── validators/
│   │           └── createJob.js   ← multipart fields validator
│   ├── services/
│   │   ├── jobService.js          ← Job CRUD + claim_active / advance / fail
│   │   ├── doneListener.js        ← Redis Stream background listener
│   │   ├── healthService.js       ← /health background polling cache
│   │   ├── statusMapper.js        ← internal uppercase status → external status + stage
│   │   └── sseService.js          ← SSE push (legacy)
│   ├── storage/
│   │   ├── minio.js               ← AWS SDK v3 S3 facade
│   │   └── local.js               ← STORAGE_BACKEND=local mode
│   ├── redis/
│   │   └── luaScripts.js          ← claim_active_job / release_active_job
│   └── utils/
│       └── sanitize.js            ← safe handling of filename / user_id / path
├── docs/
│   └── openapi.yaml               ← Phase 1 external API spec (for consumers such as visionA)
├── tests/                         ← unit + integration tests (see src/**/__tests__/)
├── package.json
├── Dockerfile                     ← layered caching + non-root user + HEALTHCHECK
├── env.example                    ← full environment variable template (no real secrets)
└── README.md                      ← this file
```

---

## 5. Environment variables

The full list (defaults, required or not, descriptions) is in [`env.example`](./env.example).

Summary by category:

### 5.1 Required (missing values fail fast; the process exits with code 1)

| Variable | Purpose |
|----------|---------|
| `REDIS_URL` | Redis connection (including password) |
| `STORAGE_BACKEND` | `local` / `minio`; POST /api/v1/jobs requires `minio` |
| `MEMBER_CENTER_TOKEN_URL` | Used to obtain the FAA token during promote (converter-side OAuth client) |
| `KNERON_CONVERTER_CLIENT_ID` | The converter's own OAuth client identity (used by promote) |
| `KNERON_CONVERTER_CLIENT_SECRET` | **Do not commit to git; use a secret manager** |
| `FILE_ACCESS_AGENT_BASE_URL` | Promote target; https is enforced in production |
| `FILE_ACCESS_AGENT_AUDIENCE` | The `aud` of the promote token |

When `STORAGE_BACKEND=minio`, these are also required: `MINIO_ENDPOINT_URL` / `MINIO_BUCKET` / `MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY`.

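A fail-fast check over the variables listed above might look like the sketch below. It is only an illustration of the idea: `findMissingEnv` is a hypothetical helper, and the real `src/config.js` may validate more (formats, https enforcement) than simple presence.

```javascript
// Variable names taken from the required table above.
const REQUIRED = [
  'REDIS_URL',
  'STORAGE_BACKEND',
  'MEMBER_CENTER_TOKEN_URL',
  'KNERON_CONVERTER_CLIENT_ID',
  'KNERON_CONVERTER_CLIENT_SECRET',
  'FILE_ACCESS_AGENT_BASE_URL',
  'FILE_ACCESS_AGENT_AUDIENCE',
];
const REQUIRED_FOR_MINIO = ['MINIO_ENDPOINT_URL', 'MINIO_BUCKET', 'MINIO_ACCESS_KEY', 'MINIO_SECRET_KEY'];

// Hypothetical helper: returns the names of required variables that are unset or blank.
function findMissingEnv(env) {
  const names = env.STORAGE_BACKEND === 'minio' ? [...REQUIRED, ...REQUIRED_FOR_MINIO] : REQUIRED;
  return names.filter((name) => typeof env[name] !== 'string' || env[name].trim() === '');
}

// Usage at startup (sketch):
//   const missing = findMissingEnv(process.env);
//   if (missing.length > 0) {
//     console.error('[Scheduler] Config validation failed:', missing.join(', '));
//     process.exit(1);
//   }
```

Reporting every missing name at once, rather than stopping at the first, tends to shorten the fix-and-restart loop during deployment.
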
### 5.1b API key (visionA → converter auth; required in stage / prod since Phase 0.8b)

| Variable | Purpose |
|----------|---------|
| `CONVERTER_API_KEY` | 64 hex chars pre-shared key; the credential for the external `/api/v1/*` API. When unset, every `/api/v1/*` request returns 503 `service_unavailable` (fail-secure) |

- Generate: `openssl rand -hex 32`
- Configure: use the **same string** in the converter `.env` and the visionA `.env.stage`
- Details: see §7 Auth flow and `docs/autoflow/04-architecture/auth.md`

### 5.2 Optional (sensible defaults)

Covers:

- Upload limits (`MULTIPART_MODEL_MAX_BYTES` default 500MB, `MULTIPART_REF_IMAGE_MAX_BYTES` default 10MB, `MULTIPART_REF_IMAGES_MAX_COUNT` default 100)
- Upload concurrency (`MAX_CONCURRENT_UPLOADS` default 5, `UPLOAD_RETRY_AFTER_SECONDS` default 30)
- Rate limit (`API_V1_RATE_LIMIT_WINDOW_MS` default 5min, `API_V1_RATE_LIMIT_MAX` default 300)
- OAuth client (converter → FAA, promote only): `OAUTH_TOKEN_REFRESH_SKEW_MS`, `OAUTH_TOKEN_TIMEOUT_MS`
- Promote timeout (`PROMOTE_TIMEOUT_MS` default 300s)

> **Removed in Phase 0.8b A4**: `MEMBER_CENTER_ISSUER` / `MEMBER_CENTER_JWKS_URL` /
> `KNERON_CONVERTER_AUDIENCE` / `CONVERTER_TENANT_ID` / `CONVERTER_SCOPE_*` /
> `JWKS_*` / `JWT_CLOCK_TOLERANCE_SEC`. These were only needed in OAuth resource-server mode
> and are unused after the switch to an API key. If a deployment still sets them, the server ignores them at startup (no error).

### 5.3 Security reminders

- `.env` is already in `.gitignore`; do not commit it
- In production use a secret manager (Vault / AWS Secrets Manager / K8s Secret) instead of putting secrets directly into the docker-compose env
- Any placeholder containing `REPLACE-ME` or using the `.invalid` TLD **must be replaced before deployment**

---

## 6. API overview

### 6.1 Phase 1 external API (`/api/v1/*`)

All endpoints authenticate with `Authorization: Bearer <CONVERTER_API_KEY>` (since Phase 0.8b A3). The API key is the complete proof that the caller is visionA; there is no read/write scope distinction.

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/v1/jobs` | Create a conversion job (multipart) |
| GET | `/api/v1/jobs` | Recovery list (`user_id` required) |
| GET | `/api/v1/jobs/:id` | Status of a single job (with ETag) |
| POST | `/api/v1/jobs/:id/promote` | Move the result file to FAA |
| GET | `/api/v1/jobs/:id/result` | **Added in Phase 0.8b Phase B**: NEF binary stream proxy for visionA-backend |
| POST | `/api/v1/jobs/:id/download-tokens` | **Reserved for Phase 2**, returns 501 |
| DELETE | `/api/v1/jobs/:id` | **Reserved for Phase 2**, returns 501 |

Full specification, all schemas, and examples for every error case: see [`docs/openapi.yaml`](./docs/openapi.yaml).

#### 6.1.a `/result` endpoint details (Phase 0.8b Phase B)

`GET /api/v1/jobs/:id/result` is a streaming proxy (200 + `application/octet-stream`) that lets visionA-backend pull the NEF result file directly from the Converter Bucket. It replaces the "visionA → obtain delegated download token → FAA" path, which never worked end to end because MC did not implement the endpoint.

**Security limits** (aligned with [api-result.md §9 / §15](../../docs/autoflow/04-architecture/api/api-result.md)):

| Limit | Default | Env override | Failure response |
|-------|---------|--------------|------------------|
| Burst rate limit | 5 req / 10s per token_fingerprint | `RESULT_RATE_LIMIT_BURST_PER_10S` | 429 `rate_limit_exceeded` + `limit_type: burst` |
| Sustained rate limit | 20 req / 1min per token_fingerprint | `RESULT_RATE_LIMIT_SUSTAINED_PER_MIN` | 429 `rate_limit_exceeded` + `limit_type: sustained` |
| Hourly bandwidth quota | 1 GB / hr per token_fingerprint | `RESULT_BANDWIDTH_QUOTA_PER_HOUR_BYTES` | 429 `bandwidth_quota_exceeded` + `limit_type: bandwidth_hourly` |
| Daily bandwidth quota | 6 GB / 24hr per token_fingerprint | `RESULT_BANDWIDTH_QUOTA_PER_DAY_BYTES` | 429 `bandwidth_quota_exceeded` + `limit_type: bandwidth_daily` |
| Concurrent stream cap | 10 simultaneous streams (per instance) | `MAX_CONCURRENT_RESULT_STREAMS` | 503 `service_busy` + `Retry-After: 30` |
| Stream response timeout | 5 minutes | `RESULT_STREAM_TIMEOUT_MS` | connection destroyed + audit log `result.stream_timeout` |

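The burst and sustained tiers in the table can be pictured with a small in-memory sliding-window limiter. The sketch below is an assumption-laden illustration, not the service's implementation: `TwoTierRateLimiter` is a hypothetical class, and choices such as not counting rejected requests are made here only for simplicity.

```javascript
// Hypothetical in-memory limiter keyed by token_fingerprint.
class TwoTierRateLimiter {
  constructor({ burst = 5, burstWindowMs = 10_000, sustained = 20, sustainedWindowMs = 60_000 } = {}) {
    this.tiers = [
      { type: 'burst', max: burst, windowMs: burstWindowMs },
      { type: 'sustained', max: sustained, windowMs: sustainedWindowMs },
    ];
    this.hits = new Map(); // fingerprint -> timestamps (ms) of accepted requests
  }

  // Returns { allowed: true } or { allowed: false, limitType }.
  check(fingerprint, now = Date.now()) {
    const longest = Math.max(...this.tiers.map((t) => t.windowMs));
    // Drop timestamps older than the longest window so memory stays bounded.
    const recent = (this.hits.get(fingerprint) || []).filter((ts) => now - ts < longest);
    this.hits.set(fingerprint, recent);

    for (const tier of this.tiers) {
      const inWindow = recent.filter((ts) => now - ts < tier.windowMs).length;
      if (inWindow >= tier.max) {
        return { allowed: false, limitType: tier.type }; // maps to 429 + limit_type
      }
    }
    recent.push(now); // only accepted requests are counted (assumption)
    return { allowed: true };
  }
}
```

Because the counters live in a `Map` inside one process, running N instances would multiply the effective limit by N, which is why a shared store is needed before scaling out.
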
**Range header handling**: silently ignored. The response is always a full-body 200 with `Accept-Ranges: none` (no 416, no slicing). When a Range header is received, the audit log `result.range_attempted` (INFO) is written. See [api-result.md §10](../../docs/autoflow/04-architecture/api/api-result.md).

**12 audit log events** (aligned with [api-result.md §11.3](../../docs/autoflow/04-architecture/api/api-result.md)):
`result.streamed` / `result.stream_error` / `result.client_closed` / `result.stream_timeout` /
`result.not_found` / `result.not_completed` / `result.expired` / `result.storage_unavailable` /
`result.rate_limited` / `result.bandwidth_quota_exceeded` / `result.range_attempted` /
`result.filename_assertion_failed`. Each event carries the five A.7 fields (source_ip / token_fingerprint / request_id / http_method / http_path) plus the four /result fields (job_id / size_bytes / duration_ms / stream_completed, included as applicable to the event type).

**Multi-instance limitation**: the in-memory counters above are all per-process. Before a Phase 2 multi-instance deployment they must be moved to a Redis backend; otherwise the limits are loosened by a factor of the instance count. See [security.md candidate #8](../../docs/autoflow/04-architecture/security.md) (HIGH).

### 6.2 Legacy / internal API (`/jobs/*`, exposed only on the internal vhost)

Behavior toward the Web UI is 100% unchanged (the T4 refactor only moved and abstracted code):

| Method | Path | Description |
|--------|------|-------------|
| POST | `/jobs` | Web UI upload that creates a job (multipart, no user_id concept) |
| GET | `/jobs` | List all jobs (legacy KEYS scan) |
| GET | `/jobs/:jobId` | Get a single job |
| GET | `/jobs/:jobId/events` | SSE push |
| GET | `/jobs/:jobId/download/:filename` | Download a result file |
| GET | `/queues/stats` | Redis Stream / Group statistics |

### 6.3 Health check

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Public, no authentication required |

---

## 7. Auth flow (Phase 0.8b)

> **Design evolution**: since Phase 0.8b, visionA → converter external auth has moved from OAuth `client_credentials`
> to a pre-shared API key (1:1 internal trust). converter → FAA still uses OAuth client_credentials.
> For the historical OAuth resource-server design, see `ADR-014` / `ADR-015` v2.1 in the visionA repo.

### 7.1 visionA → Converter (API key)

#### 7.1.1 Setup

1. In the converter `.env` (or secret manager) set:
   ```bash
   CONVERTER_API_KEY=$(openssl rand -hex 32)
   ```
   This produces 64 hex chars (32 random bytes, i.e. 256 bits of entropy).

2. On the visionA side, set the **same string** in `.env.stage`:
   ```bash
   VISIONA_CONVERTER_API_KEY=<same string>
   ```

#### 7.1.2 Example calls

```bash
# Health check (no API key needed)
curl http://localhost:4000/health

# Create a job (API key required)
curl -X POST http://localhost:4000/api/v1/jobs \
  -H "Authorization: Bearer $CONVERTER_API_KEY" \
  -F "model=@./model.onnx" \
  -F "user_id=alice" \
  -F "model_id=1001" \
  -F "version=v1.0.0" \
  -F "platform=520"

# Query job status
curl -H "Authorization: Bearer $CONVERTER_API_KEY" \
  http://localhost:4000/api/v1/jobs/<job-id>
```

#### 7.1.3 Middleware behavior

For every incoming `/api/v1/*` request:

1. Parse `Authorization: Bearer <token>`
2. Compare in constant time with `crypto.timingSafeEqual` (guards against timing attacks)
3. On success, set `req.auth`:
   ```js
   req.auth = {
     sub: 'visionA-service',
     clientId: 'visionA-service',
     tenantId: null,
     scopes: ['converter:job.write', 'converter:job.read'],
     raw: { authType: 'api_key' },
   };
   ```

On validation failure:

| Scenario | HTTP | error.code |
|----------|------|------------|
| Missing Authorization header / not Bearer format / empty token | 401 | `invalid_token` |
| Token does not match `CONVERTER_API_KEY` | 401 | `invalid_token` |
| `CONVERTER_API_KEY` env not set (fail-secure) | 503 | `service_unavailable` |

Every failure:

- Returns the standard v1 error format (`{error: {code, message, request_id}}`)
- Sets `Connection: close` and calls `req.socket.destroy()` to stop an unauthorized client from continuing to push a large body (best-effort; the real body size limit is enforced by Nginx `client_max_body_size`)

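The decision logic above could be sketched as follows. This is an illustrative reconstruction under assumptions, not the contents of `apiKeyMiddleware.js`: `safeEqual` and `checkApiKey` are hypothetical helpers, the error messages are invented, `req.id` as the request-id source is assumed, and hashing both sides before `crypto.timingSafeEqual` is one common way to satisfy its equal-length requirement (the real middleware may handle length differently).

```javascript
const crypto = require('node:crypto');

// Constant-time comparison. SHA-256 digests are always 32 bytes, which
// satisfies timingSafeEqual's requirement that both buffers have equal length.
function safeEqual(a, b) {
  const ha = crypto.createHash('sha256').update(a, 'utf8').digest();
  const hb = crypto.createHash('sha256').update(b, 'utf8').digest();
  return crypto.timingSafeEqual(ha, hb);
}

// Hypothetical pure function: maps an Authorization header to an outcome.
function checkApiKey(authorizationHeader, configuredKey) {
  if (!configuredKey) return { ok: false, status: 503, code: 'service_unavailable' }; // fail-secure
  const match = /^Bearer (\S+)$/.exec(authorizationHeader || '');
  if (!match) return { ok: false, status: 401, code: 'invalid_token' };
  if (!safeEqual(match[1], configuredKey)) return { ok: false, status: 401, code: 'invalid_token' };
  return { ok: true };
}

// Express-style middleware built on the check above.
function requireApiKey(getKey = () => process.env.CONVERTER_API_KEY) {
  return (req, res, next) => {
    const result = checkApiKey(req.headers.authorization, getKey());
    if (result.ok) {
      req.auth = {
        sub: 'visionA-service',
        clientId: 'visionA-service',
        tenantId: null,
        scopes: ['converter:job.write', 'converter:job.read'],
        raw: { authType: 'api_key' },
      };
      return next();
    }
    res.set('Connection', 'close');
    // Assumption: destroy the socket only after the error response is flushed.
    res.on('finish', () => req.socket.destroy());
    res.status(result.status).json({
      error: { code: result.code, message: 'Authentication failed', request_id: req.id },
    });
  };
}
```

Checking for a missing server-side key before parsing the header keeps the 503 path independent of whatever the caller sent.
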
#### 7.1.5 Audit log (Phase 0.8b A7)

Every `/api/v1/*` request writes one audit log entry (JSON, stdout):

| `action` | When | Fields |
|----------|------|--------|
| `auth.api_key.authenticated` | Validation succeeded | level=INFO, `source_ip`, `token_fingerprint`, `request_id`, `http_method`, `http_path`, `client_id` |
| `auth.api_key.missing` | Missing Authorization / bad format / empty token | level=WARN, `source_ip`, `request_id`, `http_method`, `http_path` (no fingerprint) |
| `auth.api_key.invalid` | Token mismatch | level=WARN, `source_ip`, `request_id`, `http_method`, `http_path`, `token_fingerprint` (fingerprint of the wrong token) |
| `auth.api_key.not_configured` | `CONVERTER_API_KEY` env not set | level=ERROR, `source_ip`, `request_id`, `http_method`, `http_path` (no fingerprint, does not leak the caller's token) |

Key design points:

- **`source_ip` is taken from `req.ip`**: this depends on `app.set('trust proxy', ...)` being configured correctly (see the `TRUST_PROXY` env). A wrong setting makes source_ip forensically useless or lets an attacker forge it.
- **`token_fingerprint` = first 12 hex chars of `sha256(token)` (48 bits of identification space)**: enough to cluster multiple callers sharing one key, or repeated attempts by the same attacker, without allowing the token itself to be recovered.
- **Token contents are never logged**: failure paths log only the fingerprint.

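The fingerprint definition above is small enough to show directly. The function name `tokenFingerprint` follows the wording used in this project, but the body below is a sketch written from the description, not copied from the source.

```javascript
const crypto = require('node:crypto');

// First 12 hex characters (48 bits) of SHA-256 over the presented token.
// The token itself is never returned or logged.
function tokenFingerprint(token) {
  return crypto.createHash('sha256').update(token, 'utf8').digest('hex').slice(0, 12);
}
```

The same input always yields the same fingerprint, so repeated attempts with one wrong token can be grouped in the audit log without storing the token.
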
Example (success path):

```json
{
  "service": "task-scheduler",
  "timestamp": "2026-05-16T10:30:00.123Z",
  "level": "INFO",
  "action": "auth.api_key.authenticated",
  "auth_type": "api_key",
  "client_id": "visionA-service",
  "source_ip": "203.0.113.42",
  "request_id": "7c6e4f3b-...",
  "http_method": "POST",
  "http_path": "/api/v1/jobs",
  "token_fingerprint": "8a1b3c2d4e5f"
}
```

Example (failure path, wrong token):

```json
{
  "service": "task-scheduler",
  "timestamp": "2026-05-16T10:30:01.456Z",
  "level": "WARN",
  "action": "auth.api_key.invalid",
  "auth_type": "api_key",
  "source_ip": "203.0.113.99",
  "request_id": "abc1-...",
  "http_method": "POST",
  "http_path": "/api/v1/jobs",
  "token_fingerprint": "f9e8d7c6b5a4"
}
```

⚠️ **`TRUST_PROXY` env configuration (critical)**:

| Deployment topology | `TRUST_PROXY` setting | Risk |
|---------------------|-----------------------|------|
| Local dev / test environment | Leave empty (defaults to `loopback`) | none |
| Stage / prod (1 Nginx layer in front) | `TRUST_PROXY=1` | none |
| Stage / prod (cloud LB + Nginx) | `TRUST_PROXY=2` | none |
| Anywhere | `TRUST_PROXY=true` (trusts every hop) | ⚠️ An attacker can forge `X-Forwarded-For` and mislead the audit log |

Too strict (leaving `loopback` in stage / prod when the proxy is not on loopback) → `source_ip` is always the internal Nginx IP and forensics are lost. Too loose (`true`) → an attacker can forge the IP. **The value must match the actual number of proxy hops in the deployment.** See `env.example` §16 or the [Express trust proxy docs](https://expressjs.com/en/guide/behind-proxies.html).

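One plausible way to turn the `TRUST_PROXY` string into a value for Express is sketched below. `parseTrustProxy` is a hypothetical helper; the service's actual parsing rules live in its config code and may differ. The value types it returns (hop count, boolean, or a subnet string such as `loopback`) are ones Express's `trust proxy` setting accepts.

```javascript
// Hypothetical parser for the TRUST_PROXY env string.
function parseTrustProxy(raw) {
  if (raw === undefined || raw.trim() === '') return 'loopback'; // documented default
  const value = raw.trim();
  if (/^\d+$/.test(value)) return Number(value); // hop count, e.g. 1 or 2
  if (value === 'true') return true; // trusts every hop: X-Forwarded-For becomes spoofable
  if (value === 'false') return false;
  return value; // named subnet or comma-separated addresses, passed through unchanged
}

// Usage (sketch):
//   app.set('trust proxy', parseTrustProxy(process.env.TRUST_PROXY));
```

Converting numeric strings to numbers matters here: Express treats the number `1` as "trust one hop", whereas the string `'1'` would be interpreted as an address.
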
#### 7.1.4 Rotation procedure

1. Stop both sides (or accept a brief window of 401 responses)
2. Generate a new key with `openssl rand -hex 32`
3. Update the `.env` on both sides to the new key
4. Redeploy the converter first, then visionA
5. Verify: any `/api/v1/*` endpoint called with the new key should return 200

See `docs/autoflow/04-architecture/auth.md` §4.

### 7.2 Converter → File Access Agent (OAuth client_credentials, retained)

During the promote flow (`POST /api/v1/jobs/:id/promote`) the converter acts as an OAuth client: it uses `client_credentials` to obtain a token with the `files:upload.write` scope, then PUTs the result file to FAA. **Phase 0.8b does not touch this at all.**

Tokens are cached per scope and proactively refreshed 60s before expiry. When FAA returns 401, the cache is invalidated automatically and the request is retried once. See `src/auth/oauthClient.js`.

Required env: `MEMBER_CENTER_TOKEN_URL` / `KNERON_CONVERTER_CLIENT_ID` / `KNERON_CONVERTER_CLIENT_SECRET` / `FILE_ACCESS_AGENT_*`.

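The caching behavior described above (per-scope cache, refresh 60s before expiry, one retry after a 401) can be sketched like this. Everything here is hypothetical scaffolding written from the description: `TokenCache`, `needsRefresh`, `withTokenRetry`, and the shape returned by `fetchToken` are assumptions, not the API of `src/auth/oauthClient.js`.

```javascript
// True when the token is within `skewMs` of expiry (or already expired).
function needsRefresh(expiresAt, now, skewMs) {
  return now >= expiresAt - skewMs;
}

// Hypothetical per-scope token cache.
class TokenCache {
  constructor({ fetchToken, skewMs = 60_000, now = Date.now }) {
    this.fetchToken = fetchToken; // async (scope) => ({ accessToken, expiresInSec })
    this.skewMs = skewMs;
    this.now = now;
    this.entries = new Map(); // scope -> { accessToken, expiresAt }
  }

  async get(scope) {
    const hit = this.entries.get(scope);
    if (hit && !needsRefresh(hit.expiresAt, this.now(), this.skewMs)) return hit.accessToken;
    const { accessToken, expiresInSec } = await this.fetchToken(scope);
    this.entries.set(scope, { accessToken, expiresAt: this.now() + expiresInSec * 1000 });
    return accessToken;
  }

  invalidate(scope) {
    this.entries.delete(scope);
  }
}

// Calls `send(token)`; on a 401 it drops the cached token and retries exactly once.
async function withTokenRetry(cache, scope, send) {
  let response = await send(await cache.get(scope));
  if (response.status === 401) {
    cache.invalidate(scope);
    response = await send(await cache.get(scope));
  }
  return response;
}
```

Limiting the retry to a single attempt avoids hammering Member Center when the credentials themselves are wrong.
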
---

## 8. Error code reference

| HTTP | code | Description |
|------|------|-------------|
| 400 | `validation_error` | Invalid field format (`details.fields[]` lists the specific fields) |
| 400 | `invalid_multipart` | Multipart parse failure, missing required file, or wrong file extension |
| 401 | `invalid_token` | API key mismatch / missing Authorization header / bad format |
| 404 | `job_not_found` | Job does not exist or does not belong to this client (existence is not leaked) |
| 404 | `not_found` | Path does not exist |
| 409 | `user_has_active_job` | The same user already has an unfinished job (`details.active_job_*`) |
| 409 | `job_not_ready_for_promote` | Job is not completed at promote time |
| 409 | `source_not_available` | The source stage for promote produced no output |
| 413 | `file_too_large` | Upload exceeds the size limit (model 500MB / ref_image 10MB) |
| 422 | `invalid_object_key` | Promote `target_object_key` has an invalid format |
| 429 | `rate_limit_exceeded` | Per-client rate limit; on `/result`, the burst / sustained limits (see §6.1.a) |
| 429 | `bandwidth_quota_exceeded` | `/result` hourly / daily bandwidth quota exceeded (see §6.1.a) |
| 500 | `misconfiguration` | Server misconfiguration (e.g. STORAGE_BACKEND is not minio) |
| 500 | `internal_error` | Other uncategorized errors |
| 501 | `not_implemented` | Endpoint reserved for Phase 2 |
| 502 | `storage_unavailable` | MinIO write failure |
| 502 | `file_gateway_unavailable` | FAA unavailable / rejected the request |
| 503 | `auth_service_unavailable` | Failed to obtain a token from Member Center (**promote stage only**, the converter → FAA chain) |
| 503 | `service_busy` | Upload concurrency or `/result` concurrent stream cap is full (`Retry-After` header) |
| 503 | `service_unavailable` | `CONVERTER_API_KEY` env not set (visionA → converter external API fail-secure) |

For the full response schema see [`docs/openapi.yaml`](./docs/openapi.yaml#components/schemas/ApiError).

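---

## 9. Relationship to other services

| Service | Connection | Purpose | Impact of failure |
|---------|------------|---------|-------------------|
| Member Center | HTTPS | Obtain the promote token (converter → FAA); no longer used to validate visionA tokens since Phase 0.8b | The promote stage fails; `/api/v1/*` authentication is unaffected |
| File Access Agent | HTTPS | Promote result files to the NAS | Promote fails, but the job itself is already completed and promote can be retried |
| MinIO | HTTP / HTTPS | Temporary storage for original models / result files (7-day lifecycle) | POST /jobs returns 502 immediately; promote also fails |
| Redis | TCP | Job state, active_job lock, Stream queue | The whole service is unhealthy |
| Worker (onnx / bie / nef) | Redis Stream | Runs the pipeline | Job is stuck at some stage; the 7-day TTL cleans it up automatically |
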
---

## 10. Monitoring

### 10.1 The three-level status of `/health`

| status | HTTP | Meaning |
|--------|------|---------|
| `healthy` | 200 | Redis / MC / FAA are all reachable |
| `degraded` | 200 | Redis is reachable, but MC or FAA is not |
| `unhealthy` | 503 | Redis is disconnected |

The response body also includes `dependencies.{redis, member_center, file_access_agent}` details, which K8s readiness / liveness probes can use to distinguish severity.

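The mapping in the table reduces to a few lines of logic. The sketch below assumes each dependency has already been reduced to a boolean "reachable" flag; `summarizeHealth` is a hypothetical name, and the real `healthService.js` works from a background polling cache that is not modeled here.

```javascript
// Hypothetical mapper from dependency reachability to status + HTTP code.
function summarizeHealth({ redis, member_center, file_access_agent }) {
  if (!redis) return { status: 'unhealthy', http: 503 }; // Redis down: service cannot work
  if (!member_center || !file_access_agent) return { status: 'degraded', http: 200 }; // promote impaired
  return { status: 'healthy', http: 200 };
}
```

Returning 200 for `degraded` keeps the instance in rotation, since job creation and status queries can still be served while only promote is impaired.
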
### 10.2 Structured logging

All v1 route handlers emit JSON logs (stdout):

```json
{
  "service": "task-scheduler",
  "timestamp": "2026-04-25T12:00:00.123Z",
  "level": "INFO",
  "action": "jobs.create.success",
  "request_id": "7c6e4f3b-...",
  "job_id": "550e8400-...",
  "user_id": "alice",
  "client_id": "kneron_converter_dev",
  "size_bytes": 204800000,
  "ref_images_count": 0,
  "duration_ms": 234
}
```

The `action` field uses a `domain.event` format, which makes filtering with jq / loki straightforward.

### 10.3 Rate limit headers

Responses automatically include:

- `X-RateLimit-Limit` / `RateLimit-Limit`
- `X-RateLimit-Remaining` / `RateLimit-Remaining`
- When the limit is exceeded: `Retry-After` (seconds)

---

## 11. Known accepted risks in Phase 1

> This section is a summary; the full content is in [`docs/autoflow/04-architecture/security.md`](../../docs/autoflow/04-architecture/security.md).

### 11.1 user_id trust boundary (most important)

- `user_id` comes from a multipart form field (POST) or the query string (GET); it is **not** derived from the caller's credential (the pre-shared API key carries no user identity)
- The converter fully trusts that the user_id supplied by visionA-backend is correct and **performs no user-level ACL**
- If visionA-backend is compromised, an attacker can impersonate any user_id

**Why Phase 1 accepts this risk**:

1. visionA-backend is an internally controlled system, not Internet-facing
2. The focus of Phase 1 is getting the pipeline working; security hardening is scheduled for Phase 2
3. HMAC / OBO flows need cooperation from visionA / Member Center; they have been agreed on but not yet implemented

**Phase 1 mitigations**:

- Per-client_id rate limit (300 req / 5 min)
- Structured audit log containing `client_id` + `user_id`
- 7-day active_job TTL (prevents a lock from never being released)
- Strict `user_id` whitelist (`^[A-Za-z0-9._-]{1,128}$`) to block XSS / Redis key injection

**Phase 2 candidates**: HMAC-signed user_id (short term) / OAuth Token Exchange (medium term).

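The `user_id` whitelist listed among the mitigations can be applied with a one-line check. The pattern is the one quoted above; `isValidUserId` is a hypothetical helper name, and where the real validation lives (for example `src/utils/sanitize.js` or the createJob validator) is not confirmed here.

```javascript
// Whitelist quoted in the mitigation list: letters, digits, dot, underscore, hyphen; 1 to 128 chars.
const USER_ID_PATTERN = /^[A-Za-z0-9._-]{1,128}$/;

// Hypothetical helper: rejects non-strings and anything outside the whitelist.
function isValidUserId(value) {
  return typeof value === 'string' && USER_ID_PATTERN.test(value);
}
```

Because `:` and whitespace are excluded, a validated `user_id` cannot introduce extra segments into a colon-delimited Redis key, and the absence of `<` / `>` rules out markup when the value is echoed back.
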
### 11.2 OOM risk from large uploads

- multer uses `memoryStorage`, so each concurrent upload consumes heap equal to the model size
- 5 concurrent × 500MB = 2.5GB; `MAX_CONCURRENT_UPLOADS` defaults to 5 (safe for a 4GB container)
- Beyond that the server returns 503 + `Retry-After` and the client is expected to back off

### 11.3 Trust boundary and the Nginx layer

- After a 401/403 the server does call `socket.destroy()`, but this is best-effort
- The real body size limit is enforced by the Nginx vhost setting `client_max_body_size 600M`
- For the dual-vhost Nginx configuration see TDD §7.1 (DevOps scope, not backend)

### 11.4 Per-process state (only needs handling in Phase 2)

- The rate limiter and upload concurrency are both in-process counters
- Phase 1 is deployed as a single instance, so this is not a problem; a multi-instance Phase 2 must switch to a Redis store

---

## 12. Testing

```bash
npm test                # run all unit + integration tests (~640 tests after Phase 0.8b A6, < 10 seconds)
npm test -- --watch     # watch mode
npm test -- src/auth    # run only the auth module tests
```

Test pyramid:

- Unit tests (70%): service / validator / utils / middleware
- Integration tests (20%): route + middleware + simulated Redis / FAA mock
- E2E (10%): run by the Testing Agent (not part of this suite)

For CI use `npm test`.

---

## 13. Troubleshooting (common scenarios)

| Symptom | Likely cause | How to investigate |
|---------|--------------|--------------------|
| Exits with code 1 immediately at startup | Missing env | Look for the `[Scheduler] Config validation failed` log; compare against `env.example` |
| Startup warning `config.api_key_not_set` | `CONVERTER_API_KEY` env not set | Set `CONVERTER_API_KEY` to 64 hex chars (`openssl rand -hex 32`); while unset, `/api/v1/*` always returns 503 |
| 401 invalid_token | API key mismatch / missing Authorization header / bad format | Confirm that the key string is exactly the same on both the visionA and converter sides |
| Client connection drops right after a 401 | By design (`Connection: close` + `socket.destroy()`) | Normal behavior; prevents the client from continuing to push the body |
| 503 service_unavailable on `/api/v1/*` | `CONVERTER_API_KEY` not set on the converter | Set the env and restart |
| 409 user_has_active_job although the previous job already failed | The active_job lock was not released | Check whether the worker done listener is running; worst case, the 7-day TTL clears it automatically |
| 502 storage_unavailable | MinIO unreachable / bad credentials | Check the `MINIO_*` env and whether the bucket exists |
| 502 file_gateway_unavailable | FAA returned 5xx or a 4xx rejection (other than 401) | Check the server log `promote.faa_put_failed`; investigate on the FAA side |
| 503 auth_service_unavailable | Member Center token endpoint is down / 401 twice | Confirm `MEMBER_CENTER_TOKEN_URL` is reachable and `KNERON_CONVERTER_CLIENT_*` are correct |
| 503 service_busy + Retry-After | Upload concurrency is full | Wait for Retry-After, or raise `MAX_CONCURRENT_UPLOADS` (mind the OOM risk) |
| 503 unhealthy (/health) | Redis disconnected | Check `REDIS_URL` and the Redis service status |
| GET /jobs returns 400 missing user_id | Phase 1 makes user_id mandatory | Have the client send user_id in the query string |
| Large upload fails midway with 5xx | Nginx `client_max_body_size` too small | Set `client_max_body_size 600M` at the deployment layer (outside backend scope) |

More details:

- `docs/autoflow/04-architecture/TDD.md` (full specification index)
- `docs/autoflow/04-architecture/auth.md` (Phase 0.8b API key auth design)
- `docs/autoflow/04-architecture/security.md` (security model / accepted risks)
- `.autoflow/05-implementation/` (per-branch implementation notes and the Phase 0.8b A1–A6 reports)

---

## 14. Document references

| Document | Content |
|----------|---------|
| [`docs/openapi.yaml`](./docs/openapi.yaml) | Phase 1 external API spec (for consumers such as visionA-backend to import) |
| [`env.example`](./env.example) | Full environment variable list (descriptions, defaults, required or not) |
| `../../docs/autoflow/04-architecture/TDD.md` | Full technical design document |
| `../../docs/autoflow/04-architecture/auth.md` | Phase 0.8b API key auth design (visionA → converter) + FAA OAuth client (retained) |
| `../../docs/autoflow/04-architecture/security.md` | Security model / accepted risks / Phase 2 candidates |
| `../../docs/autoflow/04-architecture/design-doc.md` | Architecture decisions (why these approaches were chosen) |
| `../../docs/autoflow/02-prd/PRD.md` | Product requirements / user stories |
| `../../docs/TODO-visionA-integration-v2.md` | Handover record for the Phase 0.8b visionA integration |

---

## 15. License

MIT