Switch the auth pillar from an OAuth 2.0 resource server to a pre-shared API key (visionA ↔ converter 1:1 internal trust). Add a GET /api/v1/jobs/:id/result streaming endpoint so the visionA backend can relay NEF downloads.

Phase A (auth switch):
- Add apiKeyMiddleware (constant-time compare, tokenFingerprint, 4 audit events)
- Remove the OAuth middleware and JWKS (keep oauthClient for promote → FAA)
- Switch 4 endpoints to requireApiKey
- Add the TRUST_PROXY env and the Express trust proxy setting (forensic source_ip)

Phase B (/result endpoint):
- Streaming NEF download with a 5 min timeout and a concurrency cap of 10
- Two-tier rate limit (burst 5/10s + sustained 20/min)
- Bandwidth quota (1 GB/hr + 6 GB/24hr) by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- filename quote-escape + RFC 5987 fallback + sanitize
- 8 /result audit events (complete forensics)

Design evolution record: docs/TODO-visionA-integration-v2.md (5/2 OAuth → 5/16 API key → 5/16 download via converter; corresponds to visionA repo ADR-015/016)

Tests: 597 → 666 (+69), 29 suites all pass
Security: APPROVE WITH CONDITIONS (single-instance deployment, 6 new env vars, 24hr monitoring)
npm audit: 3 vuln → 0 (transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observability Design
Status: Phase 1 complete. Phase 0.8b adds logs + metrics for the /result endpoint. Companion documents: security.md (the rule that logs contain no secrets) and performance.md (SLO measurement).
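1. Three pillars
Phase 1 + 0.8b: logs only (metrics and traces are deferred to Phase 2).
1.1 Logs (structured JSON)
All logs go to stdout and are picked up by the docker / k8s collector (they are not shipped to an external system).
Every log entry must include: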

| Field | Example |
|---|---|
| timestamp | ISO 8601, e.g. 2026-05-16T12:00:00.123Z |
| level | INFO / WARN / ERROR |
| service | task-scheduler |
| action | domain.event (e.g. result.success, auth.api_key.not_configured) |
| request_id | UUIDv4 (attached automatically by middleware) |

Additional per-endpoint fields are listed in the sections below.
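
The required fields above can be enforced in one place. The following is a minimal sketch, assuming a hypothetical `buildLogEntry` helper (the function names and the `extra` parameter are illustrative and not taken from the codebase; only the field names and the `task-scheduler` service value come from the table). Required fields are written after the extras so that an endpoint cannot overwrite them by accident.

```typescript
// Hypothetical helper; only the field names come from the table above.
type LogLevel = "INFO" | "WARN" | "ERROR";

interface BaseLogEntry {
  timestamp: string;  // ISO 8601 in UTC
  level: LogLevel;
  service: string;
  action: string;     // domain.event, e.g. "result.success"
  request_id: string; // UUIDv4 assigned by middleware
}

type LogEntry = BaseLogEntry & Record<string, unknown>;

function buildLogEntry(
  level: LogLevel,
  action: string,
  requestId: string,
  extra: Record<string, unknown> = {},
  now: Date = new Date(),
): LogEntry {
  return {
    ...extra, // endpoint-specific fields first
    // Required fields last, so extras cannot override them.
    timestamp: now.toISOString(),
    level,
    service: "task-scheduler",
    action,
    request_id: requestId,
  };
}

// One JSON object per line on stdout, for the docker / k8s collector.
function emit(entry: LogEntry): void {
  console.log(JSON.stringify(entry));
}
```

A call such as `emit(buildLogEntry("INFO", "jobs.list", requestId, { result_count: 3 }))` would then produce a single-line JSON record carrying all five required fields.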
1.2 Metrics (Phase 2)
Prometheus exposition is reserved for later. Not implemented in Phase 0.8b.
1.3 Traces (Phase 2)
OpenTelemetry is reserved for later. Not implemented in Phase 0.8b.
2. Per-endpoint log fields
2.1 POST /api/v1/jobs
{
  "level": "INFO",
  "service": "task-scheduler",
  "timestamp": "...",
  "action": "jobs.created", // or jobs.create_failed
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "model_filename": "model.onnx", // sanitized
  "model_size_bytes": 204800000,
  "ref_images_count": 0,
  "platform": "520",
  "duration_ms": 4231,
  "error_code": null // or 'user_has_active_job' / 'file_too_large' etc.
}
2.2 GET /api/v1/jobs/:id
{
  "level": "INFO",
  "action": "jobs.get_one",
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "internal_status": "ONNX", // internal status, uppercase
  "external_status": "running",
  "etag_match": false,
  "duration_ms": 18
}
2.3 GET /api/v1/jobs
{
  "level": "INFO",
  "action": "jobs.list",
  "request_id": "...",
  "user_id": "...",
  "filter_status": "in_progress",
  "result_count": 3,
  "duration_ms": 25
}
2.4 POST /api/v1/jobs/:id/promote
{
  "level": "INFO",
  "action": "promote.success", // or promote.idempotent_hit / promote.not_ready / promote.faa_put_failed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "target_count": 1,
  "duration_ms": 580,
  "error_name": null // or 'FAAUnauthorizedError' / 'FAATimeoutError' etc.
}
2.5 GET /api/v1/jobs/:id/result (added in Phase 0.8b)
{
  "level": "INFO",
  "action": "result.success", // or result.not_available / result.minio_failed / result.stream_error / result.client_closed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "nef_key": "jobs/.../output/out.nef", // server-controlled, not considered sensitive
  "size_bytes": 52428800,
  "filename_sent": "yolov5s_kl720.nef",
  "duration_ms": 1234,
  "error_code": null, // or 'result_expired' / 'job_not_completed' / 'storage_unavailable'
  "stream_completed": true // false if client closed mid-stream
}
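Notes specific to the result endpoint:
- The NEF binary content is never logged (only the object key and size).
- stream_completed: false means the client disconnected mid-stream (this may be normal, a poor network, or a client bug).
- error_code = stream_error: the stream failed after the headers had already been sent, so a 4xx can no longer be returned to the client.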
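
The notes above imply a small mapping from how a download ended to the fields that get logged. The sketch below illustrates one way to express it; the `StreamOutcome` type and `classifyResult` function are hypothetical names. The action strings and error codes come from the sample above, while the log levels chosen for `result.not_available` and `result.client_closed`, and the `null` error code for a client disconnect, are assumptions rather than documented behavior.

```typescript
// Hypothetical types; action strings and error codes follow the sample log above.
type StreamOutcome =
  | { kind: "completed" }
  | { kind: "client_closed" }                     // client went away mid-stream
  | { kind: "stream_error" }                      // failure after headers were sent
  | { kind: "not_available"; errorCode: string }  // e.g. "result_expired"
  | { kind: "minio_failed" };

interface ResultLogFields {
  level: "INFO" | "WARN" | "ERROR";
  action: string;
  error_code: string | null;
  stream_completed: boolean;
}

function classifyResult(outcome: StreamOutcome): ResultLogFields {
  switch (outcome.kind) {
    case "completed":
      return { level: "INFO", action: "result.success", error_code: null, stream_completed: true };
    case "client_closed":
      // Assumption: a client disconnect is logged at INFO because it may be normal.
      return { level: "INFO", action: "result.client_closed", error_code: null, stream_completed: false };
    case "stream_error":
      // Headers are already on the wire, so the status code cannot change; only the log records it.
      return { level: "ERROR", action: "result.stream_error", error_code: "stream_error", stream_completed: false };
    case "not_available":
      // Assumption: a result that is not ready or has expired is a normal client-facing condition.
      return { level: "INFO", action: "result.not_available", error_code: outcome.errorCode, stream_completed: false };
    case "minio_failed":
      return { level: "ERROR", action: "result.minio_failed", error_code: "storage_unavailable", stream_completed: false };
  }
}
```

Keeping the mapping in one pure function makes it straightforward to check that `stream_completed` is `true` only for `result.success`.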
3. Auth-related logs
3.1 API key middleware
{
  "level": "ERROR",
  "action": "auth.api_key.not_configured", // env var not set
  "message": "CONVERTER_API_KEY env not set; rejecting all requests"
}

{
  "level": "INFO",
  "action": "config.api_key_enabled", // printed at startup
  "message": "API key middleware enabled",
  "api_key_length": 64, // the key itself is never printed
  "timestamp": "..."
}
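Note: failed API key validation (401) is not logged per request, because logging every failure would (1) flood the logs whenever the endpoint is under attack and (2) open a log injection risk. Failures are counted in metrics instead.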
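
The two rules in this section (log the key length but never the key, and count 401s rather than logging them) can be sketched as follows. `describeApiKeyConfig` and `AuthFailureCounter` are hypothetical names used for illustration; the action strings, messages, and the `api_key_length` field come from the samples above. The in-memory counter is only a stand-in, since the actual metrics backend is left to Phase 2.

```typescript
// Hypothetical names; action strings and messages follow the samples above.
interface ConfigLog {
  level: "INFO" | "ERROR";
  action: string;
  message: string;
  api_key_length?: number;
}

// Describe the API key configuration without ever including the key itself.
function describeApiKeyConfig(apiKey: string | undefined): ConfigLog {
  if (apiKey === undefined || apiKey.length === 0) {
    return {
      level: "ERROR",
      action: "auth.api_key.not_configured",
      message: "CONVERTER_API_KEY env not set; rejecting all requests",
    };
  }
  return {
    level: "INFO",
    action: "config.api_key_enabled",
    message: "API key middleware enabled",
    api_key_length: apiKey.length, // length only, never the value
  };
}

// Count 401 responses instead of writing one log line per failed request.
class AuthFailureCounter {
  private failures = 0;

  recordFailure(): void {
    this.failures += 1;
  }

  // Return the count so far and reset it, e.g. once per reporting interval.
  drain(): number {
    const count = this.failures;
    this.failures = 0;
    return count;
  }
}
```

Because the counter stores only a number, nothing supplied by the caller (headers, key guesses) can reach the log stream through this path.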
3.2 OAuth client (obtains the FAA token for promote)
{
  "level": "INFO",
  "service": "oauth-client",
  "action": "oauth.token_obtained",
  "scope": "files:upload.write",
  "token_type": "Bearer",
  "expires_in_sec": 3600,
  "access_token_length": 1024 // the token itself is never printed
}

{
  "level": "WARN",
  "service": "oauth-client",
  "action": "oauth.token_endpoint_error",
  "scope": "files:upload.write",
  "status": 401,
  "error_code": "invalid_client"
}
4. Sensitive data protection
4.1 Never log
- Full contents of the Authorization header (including the API key or JWT)
- CONVERTER_API_KEY, KNERON_CONVERTER_CLIENT_SECRET, MinIO secret
- File body / model contents
- Full dump of a JWT payload
- FAA error body (may contain internal endpoint / region details)
- MinIO error message (may contain endpoint / region / bucket name)
4.2 Safe to log
- client_id, user_id (in API key mode client_id is always visionA-service)
- tenant_id
- request_id
- File metadata: filename (sanitized), size_bytes, mimetype
- Object key (server controlled, e.g. jobs/{job_id}/output/out.nef)
- Error classification: error_code, error_name, status (HTTP)
- Duration, timestamp
4.3 Conditionally log
- IP: still logged; may need masking in GDPR contexts
- model_filename: already sanitized, usually not considered sensitive
- error_message on failure: logged only when truncated to 100 chars and free of secrets
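
The error_message rule in 4.3 can be illustrated with a small guard. This is a sketch under assumptions: `loggableErrorMessage` is a hypothetical helper, and "free of secrets" is approximated by checking the message against a list of known secret values (for example those loaded from env), which is one possible interpretation rather than the documented mechanism.

```typescript
// Hypothetical helper: returns a message that is safe to log, or null to omit the field.
function loggableErrorMessage(
  message: string,
  knownSecrets: readonly string[],
  maxLength = 100,
): string | null {
  // Check the full message before truncating, so a secret that straddles the
  // cut-off point cannot leak in part.
  for (const secret of knownSecrets) {
    if (secret.length > 0 && message.includes(secret)) {
      return null;
    }
  }
  return message.slice(0, maxLength);
}
```

A caller would set `error_message` only when the return value is not `null`, and fall back to `error_code` / `error_name` otherwise.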
5. Log levels

| Level | Usage |
|---|---|
| DEBUG | Not used (disabled in production) |
| INFO | Normal events (job created, result.success, token_obtained, etc.) |
| WARN | Recoverable anomalies (FAA 5xx retry, token cooldown, rate limit hit) |
| ERROR | Unrecoverable / needs human attention (MinIO down, API key not configured, stream interrupted) |
6. Alerting strategy (planned in Phase 0.8b, implemented in Phase 2)

| Severity | Condition | Response time |
|---|---|---|
| P0 | Scheduler down / Redis down | 15 min |
| P1 | API 5xx ratio > 5%, sustained for 5 min | 1 hr |
| P1 | auth.api_key.not_configured appears (the env var was not set) | 1 hr |
| P2 | result.stream_error ratio > 1% | Same day |
| P2 | promote.faa_put_failed still failing after retries | Same day |
| P3 | Sudden increase in token cache misses | Next business day |
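
The ratio-based conditions in the table (for example "5xx ratio > 5%, sustained for 5 min") can be evaluated from per-window counts. The sketch below assumes one-minute windows and hypothetical names (`WindowCounts`, `ratioExceeds`, `sustainedBreach`); the real alerting is planned for Phase 2 and may well live in the monitoring system instead of application code.

```typescript
// Hypothetical types; one entry per aggregation window (assumed to be 1 minute).
interface WindowCounts {
  total: number;    // all requests or events in the window
  matching: number; // those matching the condition, e.g. 5xx responses
}

// True when the matching share of a single window is strictly above the threshold.
function ratioExceeds(w: WindowCounts, threshold: number): boolean {
  if (w.total === 0) return false; // no traffic, no breach
  return w.matching / w.total > threshold;
}

// True when each of the most recent `requiredWindows` windows breaches the threshold.
function sustainedBreach(
  windows: readonly WindowCounts[],
  threshold: number,
  requiredWindows: number,
): boolean {
  if (windows.length < requiredWindows) return false;
  return windows.slice(-requiredWindows).every((w) => ratioExceeds(w, threshold));
}
```

Under these assumptions the P1 rule would be `sustainedBreach(lastWindows, 0.05, 5)`, and the P2 stream error rule would be `ratioExceeds(window, 0.01)` over whatever window the team settles on.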
7. Dashboard (Phase 2 design)
Global dashboard:
- QPS per endpoint / 5 min
- p50 / p95 / p99 latency
- 4xx / 5xx ratio
- API key 401 ratio (should be close to 0%; alert above 0.1%)
Result endpoint dashboard (added in Phase 0.8b):
- /result QPS
- result.success / result.not_available (10/404/409/410 distribution)
- Ratio of stream_completed: true vs false
- Average NEF size
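
The latency percentiles on the dashboard can be derived from the duration_ms values already present in the logs. The following is a minimal sketch using the nearest-rank method; the `percentile` function is a hypothetical helper, and a Phase 2 implementation would more likely rely on histogram support in the metrics system.

```typescript
// Nearest-rank percentile over a batch of duration_ms samples.
// p is a percentage in (0, 100], e.g. 50, 95, 99.
function percentile(durationsMs: readonly number[], p: number): number {
  if (durationsMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...durationsMs].sort((a, b) => a - b);
  // The smallest rank such that at least p% of the samples are at or below it.
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("rank out of range");
  return value;
}
```

Nearest-rank always returns an observed sample, which keeps the result easy to cross-check against individual log lines.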
8. Summary of Phase 0.8b changes
8.1 Added
- result.* action log series (success / not_available / minio_failed / stream_error / client_closed)
- auth.api_key.* action log series
- config.api_key_* startup logs
8.2 Removed
- auth.verify_failed (OAuth JWT verification failure)
- auth.middleware_unexpected_error (OAuth middleware catch-all)
- JWKS-related logs (JWKS is no longer used)
8.3 Retained
- jobs.created / jobs.get_one / jobs.list
- promote.* (entire series)
- oauth.token_* (OAuth client logs used by promote)