jim800121chen d8a9517c9d feat(task-scheduler): Phase 0.8b — API key auth + /result endpoint
Auth pillar switched from an OAuth 2.0 resource server to a pre-shared API key
(visionA ↔ converter 1:1 internal trust). Added a GET /api/v1/jobs/:id/result
streaming endpoint so the visionA backend can relay NEF downloads.

Phase A (auth switch):
- Added apiKeyMiddleware (constant-time compare, tokenFingerprint, 4 audit events)
- Removed OAuth middleware + JWKS (oauthClient kept for promote → FAA)
- Switched 4 endpoints to requireApiKey
- Added TRUST_PROXY env + Express trust proxy setting (forensic source_ip)

Phase B (/result endpoint):
- Streaming NEF download with 5min timeout + concurrent cap 10
- Two-tier rate limit (burst 5/10s + sustained 20/min)
- Bandwidth quota (1 GB/hr + 6 GB/24hr) by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- Filename quote-escape + RFC 5987 fallback + sanitize
- 8 /result audit events (complete forensic trail)

Design evolution record: docs/TODO-visionA-integration-v2.md (5/2 OAuth → 5/16 API key
→ 5/16 download via converter; corresponds to visionA repo ADR-015/016)

Tests: 597 → 666 (+69), 29 suites all pass
Security: APPROVE WITH CONDITIONS (single-instance deployment, 6 new env vars, 24hr monitoring)
npm audit: 3 vuln → 0 (transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:47:28 +08:00


Observability Design

Status: Phase 1 complete. Phase 0.8b adds logs + metrics for the /result endpoint.

Companion docs: security.md (rules for keeping secrets out of logs), performance.md (SLO measurement).


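1. Three pillars

Phase 1 + 0.8b: Logs only (Metrics / Traces are left for Phase 2).

1.1 Logs (structured JSON)

All logs go to stdout and are picked up by the docker / k8s collector (nothing is shipped to an external service).

Every log entry must include: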

| Field | Example |
| --- | --- |
| timestamp | ISO 8601, 2026-05-16T12:00:00.123Z |
| level | INFO / WARN / ERROR |
| service | task-scheduler |
| action | domain.event (e.g. result.success, auth.api_key.not_configured) |
| request_id | UUIDv4 (attached automatically by middleware) |

Additional per-endpoint fields are listed in the sections below.
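The required fields can be sketched as a small entry builder. This is a minimal illustration, not the service's actual logger: `buildLogEntry` and `emitLog` are hypothetical names, and the request ID is assumed to be supplied by the middleware mentioned in the table. Only the field names and the stdout destination come from this document.

```typescript
// Hypothetical helper: only the field names are taken from the table above.
type LogLevel = "INFO" | "WARN" | "ERROR";

interface BaseLogEntry {
  timestamp: string;  // ISO 8601
  level: LogLevel;
  service: string;
  action: string;     // domain.event
  request_id: string; // UUIDv4, assumed to come from middleware
  [extra: string]: unknown; // per-endpoint fields
}

function buildLogEntry(
  level: LogLevel,
  action: string,
  requestId: string,
  extra: Record<string, unknown> = {},
  now: Date = new Date(),
): BaseLogEntry {
  return {
    // Extras are spread first so they can never overwrite a required field.
    ...extra,
    timestamp: now.toISOString(),
    level,
    service: "task-scheduler",
    action,
    request_id: requestId,
  };
}

// One JSON object per line on stdout, for the docker / k8s collector to pick up.
function emitLog(entry: BaseLogEntry): void {
  console.log(JSON.stringify(entry));
}
```

Spreading the per-endpoint fields before the required ones is a deliberate choice in this sketch: a caller cannot accidentally replace `service` or `request_id`.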

1.2 Metrics (Phase 2)

Reserved for Prometheus exposition. Not implemented in Phase 0.8b.

1.3 Traces (Phase 2)

Reserved for OpenTelemetry. Not implemented in Phase 0.8b.


2. Log fields per endpoint

2.1 POST /api/v1/jobs

{
  "level": "INFO",
  "service": "task-scheduler",
  "timestamp": "...",
  "action": "jobs.created",  // or jobs.create_failed
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "model_filename": "model.onnx",  // sanitized
  "model_size_bytes": 204800000,
  "ref_images_count": 0,
  "platform": "520",
  "duration_ms": 4231,
  "error_code": null  // or 'user_has_active_job' / 'file_too_large' etc
}

2.2 GET /api/v1/jobs/:id

{
  "level": "INFO",
  "action": "jobs.get_one",
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "internal_status": "ONNX",  // internal status, uppercase
  "external_status": "running",
  "etag_match": false,
  "duration_ms": 18
}

2.3 GET /api/v1/jobs

{
  "level": "INFO",
  "action": "jobs.list",
  "request_id": "...",
  "user_id": "...",
  "filter_status": "in_progress",
  "result_count": 3,
  "duration_ms": 25
}

2.4 POST /api/v1/jobs/:id/promote

{
  "level": "INFO",
  "action": "promote.success",  // or promote.idempotent_hit / promote.not_ready / promote.faa_put_failed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "target_count": 1,
  "duration_ms": 580,
  "error_name": null  // or 'FAAUnauthorizedError' / 'FAATimeoutError' etc
}

2.5 GET /api/v1/jobs/:id/result (new in Phase 0.8b)

{
  "level": "INFO",
  "action": "result.success",  // or result.not_available / result.minio_failed / result.stream_error / result.client_closed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "nef_key": "jobs/.../output/out.nef",  // server-controlled, not considered sensitive
  "size_bytes": 52428800,
  "filename_sent": "yolov5s_kl720.nef",
  "duration_ms": 1234,
  "error_code": null,  // or 'result_expired' / 'job_not_completed' / 'storage_unavailable'
  "stream_completed": true  // false if client closed mid-stream
}

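Result endpoint notes:

  • Never log NEF binary content (log only the object key + size)
  • stream_completed: false means the client disconnected mid-stream (may be normal, a bad network, or a client bug)
  • error_code = stream_error: the stream failed after headers were already sent, so a 4xx can no longer be returned to the client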
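As a rough sketch of how the notes above translate into log fields, the function below maps a stream outcome to the `action` and `stream_completed` values. The `StreamOutcome` type and the `resultStreamLog` name are assumptions made for illustration; the action names and field names are the ones listed in section 2.5. Note that it takes only the object key and size, never the NEF bytes.

```typescript
// Hypothetical outcome type: how the service detects each case is not shown here.
type StreamOutcome = "finished" | "client_closed" | "stream_error";

interface ResultStreamLog {
  action: string;
  job_id: string;
  nef_key: string;
  size_bytes: number;
  duration_ms: number;
  stream_completed: boolean;
}

function resultStreamLog(
  outcome: StreamOutcome,
  jobId: string,
  nefKey: string,
  sizeBytes: number,
  startedAtMs: number,
  endedAtMs: number,
): ResultStreamLog {
  const actionByOutcome: Record<StreamOutcome, string> = {
    finished: "result.success",
    client_closed: "result.client_closed",
    stream_error: "result.stream_error",
  };
  return {
    action: actionByOutcome[outcome],
    job_id: jobId,
    nef_key: nefKey,       // object key only; the binary content is never passed in
    size_bytes: sizeBytes,
    duration_ms: endedAtMs - startedAtMs,
    // Only a fully delivered stream counts as completed.
    stream_completed: outcome === "finished",
  };
}
```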

3. Auth-related logs

3.1 API key middleware

{
  "level": "ERROR",
  "action": "auth.api_key.not_configured",  // env not set
  "message": "CONVERTER_API_KEY env not set; rejecting all requests"
}
{
  "level": "INFO",
  "action": "config.api_key_enabled",  // logged at startup
  "message": "API key middleware enabled",
  "api_key_length": 64,  // the key itself is never logged
  "timestamp": "..."
}

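Note: a failed API key check (401) is not logged per request. Logging every failure would (1) flood the logs whenever the attack surface is probed, and (2) open a log injection risk. Failures are counted in metrics instead.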
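One way to count failures without writing a log line per request is an in-process counter. The sketch below is an assumption, not the service's implementation: `AuthFailureCounter` and the two reason labels are hypothetical, and since Prometheus exposition is deferred to Phase 2 it simply keeps the totals in memory.

```typescript
// Hypothetical reason labels; a closed set keeps attacker-controlled input out of the metric.
type AuthFailureReason = "missing_header" | "invalid_key";

class AuthFailureCounter {
  private counts: Record<AuthFailureReason, number> = {
    missing_header: 0,
    invalid_key: 0,
  };

  // Called on each 401 instead of emitting a log entry.
  record(reason: AuthFailureReason): void {
    this.counts[reason] += 1;
  }

  // Returns a copy so callers cannot mutate the internal totals.
  snapshot(): Record<AuthFailureReason, number> {
    return {
      missing_header: this.counts.missing_header,
      invalid_key: this.counts.invalid_key,
    };
  }
}
```

Restricting the label to a fixed union means nothing from the request (header values, paths) ever reaches the counter, which sidesteps the injection concern raised in the note.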

3.2 OAuth client (promote obtains an FAA token)

{
  "level": "INFO",
  "service": "oauth-client",
  "action": "oauth.token_obtained",
  "scope": "files:upload.write",
  "token_type": "Bearer",
  "expires_in_sec": 3600,
  "access_token_length": 1024  // the token itself is never logged
}
{
  "level": "WARN",
  "service": "oauth-client",
  "action": "oauth.token_endpoint_error",
  "scope": "files:upload.write",
  "status": 401,
  "error_code": "invalid_client"
}

4. Sensitive data protection

4.1 Never log

  • Full Authorization header contents (including API key, JWT)
  • CONVERTER_API_KEY, KNERON_CONVERTER_CLIENT_SECRET, MinIO secret
  • File body / model contents
  • Full dump of a JWT payload
  • FAA error body (may contain internal endpoint / region, etc.)
  • MinIO error message (may contain endpoint / region / bucket name)
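A denylist filter applied just before serialization is one possible guard for the list above. The key names in `FORBIDDEN_LOG_KEYS` are hypothetical; the document names the categories of data but not the field names they would appear under, so treat this as a sketch of the idea only.

```typescript
// Hypothetical field names standing in for the categories listed above.
const FORBIDDEN_LOG_KEYS: readonly string[] = [
  "authorization",
  "api_key",
  "client_secret",
  "jwt_payload",
  "file_body",
  "faa_error_body",
  "minio_error_message",
];

// Returns a shallow copy of `fields` without any denylisted key (case-insensitive).
function stripForbiddenFields(
  fields: Record<string, unknown>,
): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    if (FORBIDDEN_LOG_KEYS.indexOf(key.toLowerCase()) === -1) {
      safe[key] = fields[key];
    }
  }
  return safe;
}
```

The filter is shallow: a secret nested inside another object would pass through, so it complements rather than replaces the rule of not putting such values into log fields in the first place.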

4.2 Safe to log

  • client_id, user_id (in API key mode client_id is fixed to visionA-service)
  • tenant_id
  • request_id
  • File metadata: filename (sanitized), size_bytes, mimetype
  • Object key (server-controlled, e.g. jobs/{job_id}/output/out.nef)
  • Error classification: error_code, error_name, status (HTTP)
  • Duration, timestamp

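4.3 Log conditionally

  • IP: still logged; may need masking in GDPR scenarios
  • model_filename: already sanitized, usually not considered sensitive
  • error_message on failure: logged only when truncated to 100 chars and free of secrets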
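The error_message rule can be sketched as a single gate function. `loggableErrorMessage` is a hypothetical name, and detecting secrets by exact substring match against known secret values is an assumption of this sketch; the document only states the 100-character limit and the no-secret condition.

```typescript
// Returns the message to log, or undefined when it must be dropped.
function loggableErrorMessage(
  message: string,
  knownSecrets: readonly string[],
  maxLength: number = 100,
): string | undefined {
  // Check the full message first, so truncation cannot hide a partially cut secret.
  for (const secret of knownSecrets) {
    if (secret.length > 0 && message.indexOf(secret) !== -1) {
      return undefined;
    }
  }
  return message.length > maxLength ? message.slice(0, maxLength) : message;
}
```

A caller would pass the configured secret values (for example the API key read from the environment) and omit the `error_message` field entirely when the result is `undefined`.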

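5. Log levels

| Level | Use |
| --- | --- |
| DEBUG | Not used (disabled in production) |
| INFO | Normal events (job created, result.success, token_obtained, etc.) |
| WARN | Recoverable anomalies (FAA 5xx retry, token cooldown, rate limit hit) |
| ERROR | Unrecoverable / needs human attention (MinIO down, API key not configured, stream interrupted) |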

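6. Alerting strategy (planned in Phase 0.8b, implemented in Phase 2)

| Severity | Condition | Response time |
| --- | --- | --- |
| P0 | Scheduler down / Redis down | 15 min |
| P1 | API 5xx ratio > 5%, sustained for 5 min | 1 hr |
| P1 | auth.api_key.not_configured appears (the env var was not set) | 1 hr |
| P2 | result.stream_error ratio > 1% | Same day |
| P2 | promote.faa_put_failed still failing after retries | Same day |
| P3 | Sudden spike in token cache misses | Next business day |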
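To make the "5xx ratio > 5%, sustained for 5 min" condition concrete, here is a minimal sketch that evaluates it over per-minute request buckets. The bucket shape and the `shouldFireApi5xxAlert` name are assumptions; since alerting is only planned for Phase 2, the real rule would more likely live in the monitoring system than in application code.

```typescript
// Hypothetical per-minute aggregate, oldest first.
interface MinuteBucket {
  total: number;     // all responses in that minute
  errors5xx: number; // responses with a 5xx status
}

// True when each of the most recent `sustainedMinutes` buckets exceeds the threshold.
function shouldFireApi5xxAlert(
  buckets: readonly MinuteBucket[],
  threshold: number = 0.05,
  sustainedMinutes: number = 5,
): boolean {
  if (buckets.length < sustainedMinutes) {
    return false; // not enough history yet
  }
  const recent = buckets.slice(-sustainedMinutes);
  // A minute with no traffic breaks the streak rather than dividing by zero.
  return recent.every((b) => b.total > 0 && b.errors5xx / b.total > threshold);
}
```

Requiring every minute in the window to exceed the threshold is one reading of "sustained"; averaging over the window would be another and would fire on a single large spike.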

7. Dashboard (Phase 2 design)

Global dashboard:

  • QPS per endpoint / 5min
  • p50 / p95 / p99 latency
  • 4xx / 5xx ratio
  • API key 401 ratio (should be close to 0%; alert above 0.1%)

Result endpoint dashboard (new in Phase 0.8b):

  • /result QPS
  • result.success / result.not_available (10/404/409/410 distribution)
  • stream_completed: true vs false ratio
  • Average NEF size

8. Summary of Phase 0.8b changes

8.1 Added

  • result.* action log family (success / not_available / minio_failed / stream_error / client_closed)
  • auth.api_key.* action log family
  • config.api_key_* startup logs

8.2 Removed

  • auth.verify_failed (OAuth JWT verification failure)
  • auth.middleware_unexpected_error (OAuth middleware catch-all)
  • JWKS-related logs (JWKS is gone)

8.3 Retained

  • jobs.created / jobs.get_one / jobs.list
  • The entire promote.* family
  • oauth.token_* (OAuth client logs used by promote)