jim800121chen d8a9517c9d feat(task-scheduler): Phase 0.8b — API key auth + /result endpoint
The auth pillar changes from an OAuth 2.0 resource server to a pre-shared API key
(visionA ↔ converter 1:1 internal trust). Adds a GET /api/v1/jobs/:id/result
streaming endpoint so the visionA backend can relay NEF downloads.

Phase A (auth switch):
- Added apiKeyMiddleware (constant-time compare, tokenFingerprint, 4 audit events)
- Removed the OAuth middleware + JWKS (oauthClient is kept for promote → FAA)
- Switched 4 endpoints over to requireApiKey
- Added the TRUST_PROXY env + Express trust proxy setting (forensic source_ip)

Phase B (/result endpoint):
- streaming NEF download with 5min timeout + concurrent cap 10
- Two-tier rate limit (burst 5/10s + sustained 20/min)
- Bandwidth quota (1 GB/hr + 6 GB/24hr) by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- filename quote-escape + RFC 5987 fallback + sanitize
- 8 /result audit events (full forensic coverage)

Design evolution record: docs/TODO-visionA-integration-v2.md (5/2 OAuth → 5/16 API key
→ 5/16 download via converter; corresponds to visionA repo ADR-015/016)

Tests: 597 → 666 (+69), 29 suites all pass
Security: APPROVE WITH CONDITIONS (single-instance deployment, 6 new env vars, 24hr monitoring)
npm audit: 3 vuln → 0 (transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 22:47:28 +08:00


# Observability Design
> **Status**: Phase 1 complete. Phase 0.8b adds logs + metrics for the `/result` endpoint.
>
> **Companion docs**: `security.md` (rules for keeping secrets out of logs), `performance.md` (SLO measurement).
---
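## 1. Three Pillars
Phase 1 + 0.8b: **Logs only**. Metrics / Traces are deferred to Phase 2.
### 1.1 Logs (structured JSON)
All logs go to stdout and are collected by the docker / k8s collector (they are not shipped to an external service).
Every log entry must contain: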
| Field | Example |
|------|------|
| `timestamp` | ISO 8601 `2026-05-16T12:00:00.123Z` |
| `level` | INFO / WARN / ERROR |
| `service` | `task-scheduler` |
| `action` | `domain.event` (e.g. `result.success`, `auth.api_key.not_configured`) |
| `request_id` | UUIDv4 (attached automatically by the middleware) |

Additional per-endpoint fields are listed in the sections below.
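
The mandatory fields above could be enforced by a small builder. This is a minimal sketch, not the project's actual logger: `LogLevel`, `BaseLogEntry`, `buildLogEntry`, and `emit` are hypothetical names, and the request ID is assumed to be supplied by the caller (for example by the middleware that generates the UUIDv4).

```typescript
// Hypothetical helper names; only the field names come from the table above.
type LogLevel = "INFO" | "WARN" | "ERROR";

interface BaseLogEntry {
  timestamp: string; // ISO 8601
  level: LogLevel;
  service: string;
  action: string; // domain.event, e.g. "result.success"
  request_id: string; // UUIDv4, assumed to come from the request middleware
}

function buildLogEntry(
  level: LogLevel,
  action: string,
  requestId: string,
  extra: Record<string, unknown> = {},
  now: Date = new Date(),
): BaseLogEntry & Record<string, unknown> {
  return {
    // Endpoint-specific fields go first so they can never
    // overwrite the mandatory fields set below.
    ...extra,
    timestamp: now.toISOString(),
    level,
    service: "task-scheduler",
    action,
    request_id: requestId,
  };
}

// One JSON object per line on stdout, for the docker / k8s collector.
function emit(entry: BaseLogEntry): void {
  console.log(JSON.stringify(entry));
}
```

Placing the mandatory fields after the spread is a deliberate choice in this sketch: a caller passing `level` or `service` in `extra` cannot change them.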
### 1.2 Metrics (Phase 2)
Reserved for Prometheus exposition. Not implemented in Phase 0.8b.
### 1.3 Traces (Phase 2)
Reserved for OpenTelemetry. Not implemented in Phase 0.8b.

---
## 2. Per-endpoint log fields
### 2.1 `POST /api/v1/jobs`
```jsonc
{
  "level": "INFO",
  "service": "task-scheduler",
  "timestamp": "...",
  "action": "jobs.created", // or jobs.create_failed
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "model_filename": "model.onnx", // sanitized
  "model_size_bytes": 204800000,
  "ref_images_count": 0,
  "platform": "520",
  "duration_ms": 4231,
  "error_code": null // or 'user_has_active_job' / 'file_too_large' etc.
}
```
### 2.2 `GET /api/v1/jobs/:id`
```jsonc
{
  "level": "INFO",
  "action": "jobs.get_one",
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "internal_status": "ONNX", // internal status, uppercase
  "external_status": "running",
  "etag_match": false,
  "duration_ms": 18
}
```
### 2.3 `GET /api/v1/jobs`
```jsonc
{
  "level": "INFO",
  "action": "jobs.list",
  "request_id": "...",
  "user_id": "...",
  "filter_status": "in_progress",
  "result_count": 3,
  "duration_ms": 25
}
```
### 2.4 `POST /api/v1/jobs/:id/promote`
```jsonc
{
  "level": "INFO",
  "action": "promote.success", // or promote.idempotent_hit / promote.not_ready / promote.faa_put_failed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "target_count": 1,
  "duration_ms": 580,
  "error_name": null // or 'FAAUnauthorizedError' / 'FAATimeoutError' etc.
}
```
### 2.5 `GET /api/v1/jobs/:id/result` (new in Phase 0.8b)
```jsonc
{
  "level": "INFO",
  "action": "result.success", // or result.not_available / result.minio_failed / result.stream_error / result.client_closed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "nef_key": "jobs/.../output/out.nef", // server-controlled, not considered sensitive
  "size_bytes": 52428800,
  "filename_sent": "yolov5s_kl720.nef",
  "duration_ms": 1234,
  "error_code": null, // or 'result_expired' / 'job_not_completed' / 'storage_unavailable'
  "stream_completed": true // false if client closed mid-stream
}
```
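**Notes specific to the result endpoint**:
- **NEF binary content is never logged** (only the object key + size)
- **stream_completed: false** means the client disconnected mid-stream (this may be normal, a bad network, or a client bug)
- **error_code = stream_error**: the stream failed after the headers were already sent, so a 4xx can no longer be returned to the client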
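
The way a stream's final state maps onto the log fields above can be sketched as a pure function. This is only an illustration under assumptions: `StreamOutcome`, `ResultStreamLog`, and `describeStreamOutcome` are hypothetical names, and setting `stream_completed` to `false` for `result.stream_error` is an assumption (the source only states it for a client disconnect).

```typescript
// Hypothetical names; the action strings and field names come from section 2.5.
type StreamOutcome = "finished" | "client_closed" | "stream_error";

interface ResultStreamLog {
  action: "result.success" | "result.client_closed" | "result.stream_error";
  size_bytes: number;
  duration_ms: number;
  stream_completed: boolean;
}

const ACTION_BY_OUTCOME: Record<StreamOutcome, ResultStreamLog["action"]> = {
  finished: "result.success",
  client_closed: "result.client_closed",
  stream_error: "result.stream_error",
};

function describeStreamOutcome(
  outcome: StreamOutcome,
  sizeBytes: number,
  durationMs: number,
): ResultStreamLog {
  return {
    action: ACTION_BY_OUTCOME[outcome],
    size_bytes: sizeBytes, // object size only, never the NEF bytes themselves
    duration_ms: durationMs,
    // Assumption: anything other than a clean finish counts as not completed.
    stream_completed: outcome === "finished",
  };
}
```

Because the headers are already sent by the time a stream fails, this log entry is the only place the failure is recorded; the HTTP status the client saw cannot be changed.
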
---
## 3. Auth-related logs
### 3.1 API key middleware
```jsonc
{
  "level": "ERROR",
  "action": "auth.api_key.not_configured", // env var not set
  "message": "CONVERTER_API_KEY env not set; rejecting all requests"
}
```
```jsonc
{
  "level": "INFO",
  "action": "config.api_key_enabled", // printed at startup
  "message": "API key middleware enabled",
  "api_key_length": 64, // the key itself is never printed
  "timestamp": "..."
}
```
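**Note**: API key validation failures (401) are **not logged per request**. Logging every failure would (1) flood the logs whenever the attack surface is probed and (2) carry a log-injection risk. Failures are counted in metrics instead.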
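
Counting failures instead of logging them could look roughly like the sketch below. It is an illustration only: `AuthFailureCounter` and `recordAuthOutcome` are hypothetical names, the counter is in-memory, and a real deployment would presumably use a metrics client once Phase 2 lands.

```typescript
// Hypothetical in-memory counter standing in for a real metrics client.
class AuthFailureCounter {
  private counts = new Map<string, number>();

  increment(reason: string): void {
    this.counts.set(reason, (this.counts.get(reason) ?? 0) + 1);
  }

  get(reason: string): number {
    return this.counts.get(reason) ?? 0;
  }
}

type AuthResult = { ok: true } | { ok: false; status: 401 };

// `valid` is assumed to be the result of the middleware's key comparison.
function recordAuthOutcome(valid: boolean, counter: AuthFailureCounter): AuthResult {
  if (valid) {
    return { ok: true };
  }
  // No per-request log line: only a counter is bumped. The label is a fixed
  // string, never request input, so nothing attacker-controlled is recorded.
  counter.increment("invalid_api_key");
  return { ok: false, status: 401 };
}
```

Using a fixed label keeps both the log volume and the metric cardinality bounded, no matter how many bad requests arrive.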
### 3.2 OAuth client (promote obtains the FAA token)
```jsonc
{
  "level": "INFO",
  "service": "oauth-client",
  "action": "oauth.token_obtained",
  "scope": "files:upload.write",
  "token_type": "Bearer",
  "expires_in_sec": 3600,
  "access_token_length": 1024 // the token itself is never printed
}
```
```jsonc
{
  "level": "WARN",
  "service": "oauth-client",
  "action": "oauth.token_endpoint_error",
  "scope": "files:upload.write",
  "status": 401,
  "error_code": "invalid_client"
}
```
---
## 4. Sensitive data protection
### 4.1 Never log
- The full contents of the `Authorization` header (including the API key or a JWT)
- `CONVERTER_API_KEY`, `KNERON_CONVERTER_CLIENT_SECRET`, the MinIO secret
- File body / model contents
- Full dumps of a JWT payload
- FAA error bodies (may contain internal endpoint / region details)
- MinIO error messages (may contain the endpoint / region / bucket name)
### 4.2 Safe to log
- `client_id`, `user_id` (in API key mode, client_id is fixed to `visionA-service`)
- `tenant_id`
- `request_id`
- File metadata: `filename` (sanitized), `size_bytes`, `mimetype`
- Object key (server-controlled), e.g. `jobs/{job_id}/output/out.nef`
- Error classification: `error_code`, `error_name`, `status` (HTTP)
- Duration, timestamp
### 4.3 Conditionally log
- IP: still logged; may need masking in GDPR scenarios
- `model_filename`: already sanitized, usually not considered sensitive
- `error_message` on failure: logged only when truncated to 100 chars and free of secrets
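
The rules in this section could be applied as a last-step filter before a log entry is serialized. The following is a minimal sketch under assumptions: `sanitizeLogFields` is a hypothetical name, and the deny-list keys other than the header and env names (`file_body`, `jwt_payload`, `faa_error_body`, `minio_error_message`) are invented here for illustration. Truncation does not detect secrets inside `error_message`; that check is assumed to happen at the call site.

```typescript
// Keys that must never reach a log line, compared case-insensitively.
// Names other than the header and env names are hypothetical field names.
const NEVER_LOG = new Set<string>([
  "authorization",
  "converter_api_key",
  "kneron_converter_client_secret",
  "file_body",
  "jwt_payload",
  "faa_error_body",
  "minio_error_message",
]);

const MAX_ERROR_MESSAGE_CHARS = 100;

function sanitizeLogFields(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (NEVER_LOG.has(key.toLowerCase())) {
      continue; // drop the field entirely rather than masking it
    }
    if (key === "error_message" && typeof value === "string") {
      out[key] = value.slice(0, MAX_ERROR_MESSAGE_CHARS);
      continue;
    }
    out[key] = value;
  }
  return out;
}
```

Dropping a denied field instead of masking it means even the length of a secret is not leaked unless it is logged deliberately, as with `api_key_length` in section 3.1.
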
---
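## 5. Log levels
| Level | Purpose |
|-------|------|
| DEBUG | Not used (disabled in production) |
| INFO | Normal events (job created, result.success, token_obtained, etc.) |
| WARN | Recoverable anomalies (FAA 5xx retry, token cooldown, rate limit hit) |
| ERROR | Unrecoverable / needs human attention (MinIO down, API key not configured, stream interrupted) |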
---
## 6. Alerting strategy (planned in Phase 0.8b, implemented in Phase 2)
| Severity | Condition | Response time |
|------|------|---------|
| P0 | Scheduler down / Redis down | 15 min |
| P1 | API 5xx ratio > 5%, sustained for 5 min | 1 hr |
| P1 | `auth.api_key.not_configured` appears (the env var was not set) | 1 hr |
| P2 | `result.stream_error` ratio > 1% | Same day |
| P2 | `promote.faa_put_failed` still failing after retries | Same day |
| P3 | Sudden spike in token cache misses | Next business day |
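
A ratio rule such as the P1 "5xx ratio > 5%, sustained for 5 min" condition could be evaluated as in the sketch below. The names `WindowCounts`, `exceedsRatio`, and `sustainedBreach` are hypothetical, and reading "sustained" as "every one of the last N one-minute buckets breaches the threshold" is an assumption, not something the table specifies.

```typescript
// Hypothetical shape: request counts for one fixed-size time bucket.
interface WindowCounts {
  total: number;
  errors: number;
}

function exceedsRatio(counts: WindowCounts, threshold: number): boolean {
  if (counts.total === 0) {
    return false; // no traffic means no breach
  }
  return counts.errors / counts.total > threshold;
}

// Assumption: "sustained" means each of the most recent
// `requiredBuckets` buckets is over the threshold on its own.
function sustainedBreach(
  buckets: WindowCounts[],
  threshold: number,
  requiredBuckets: number,
): boolean {
  if (buckets.length < requiredBuckets) {
    return false;
  }
  return buckets.slice(-requiredBuckets).every((b) => exceedsRatio(b, threshold));
}
```

With one-minute buckets, the P1 rule would be `sustainedBreach(buckets, 0.05, 5)`; the P2 `result.stream_error` rule could reuse `exceedsRatio` with a threshold of `0.01`.
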
---
## 7. Dashboard (Phase 2 design)
**Global dashboard**:
- Per-endpoint QPS / 5min
- p50 / p95 / p99 latency
- 4xx / 5xx ratio
- API key 401 ratio (should be close to 0%; alert when > 0.1%)
**Result endpoint dashboard** (new in Phase 0.8b):
- `/result` QPS
- `result.success` / `result.not_available` (10/404/409/410 distribution)
- Ratio of stream_completed: true vs false
- Average NEF size
---
## 8. Phase 0.8b change summary
### 8.1 Added
- `result.*` action family logs (success / not_available / minio_failed / stream_error / client_closed)
- `auth.api_key.*` action family logs
- `config.api_key_*` startup logs
### 8.2 Removed
- `auth.verify_failed` (OAuth JWT verification failure)
- `auth.middleware_unexpected_error` (OAuth middleware catch-all)
- JWKS-related logs (JWKS is no longer used)
### 8.3 Retained
- `jobs.created` / `jobs.get_one` / `jobs.list`
- The entire `promote.*` family
- `oauth.token_*` (OAuth client logs used by promote)