Auth pillar switched from an OAuth 2.0 resource server to a pre-shared API key (visionA ↔ converter 1:1 internal trust). Added a streaming endpoint, GET /api/v1/jobs/:id/result, so the visionA backend can relay NEF downloads.

Phase A (auth switch):

- Added apiKeyMiddleware (constant-time compare, tokenFingerprint, 4 audit events)
- Removed the OAuth middleware + JWKS (oauthClient is kept for promote → FAA)
- Switched 4 endpoints to requireApiKey
- Added the TRUST_PROXY env + Express trust proxy setting (forensic source_ip)

Phase B (/result endpoint):

- Streaming NEF download with 5min timeout + concurrent cap 10
- Two-tier rate limit (burst 5/10s + sustained 20/min)
- Bandwidth quota (1 GB/hr + 6 GB/24hr) by token_fingerprint
- Range header silently ignored + Accept-Ranges: none
- filename quote-escape + RFC 5987 fallback + sanitize
- 8 /result audit events (forensically complete)

Design evolution record: docs/TODO-visionA-integration-v2.md (5/2 OAuth → 5/16 API key → 5/16 download via converter; corresponds to visionA repo ADR-015/016)

Tests: 597 → 666 (+69), 29 suites all pass
Security: APPROVE WITH CONDITIONS (single-instance deployment, 6 new env vars, 24hr monitoring)
npm audit: 3 vuln → 0 (transitive AWS SDK xml chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Observability Design

> **Status**: Phase 1 complete; Phase 0.8b adds logs + metrics for the `/result` endpoint.
>
> **Companion docs**: `security.md` (rules for keeping secrets out of logs), `performance.md` (SLO measurement).

---

## 1. Three Pillars

Phase 1 + 0.8b: **Logs only** (Metrics / Traces are deferred to Phase 2).

### 1.1 Logs (structured JSON)

All logs go to stdout and are picked up by the docker / k8s collector (they are not shipped to any external service).

Every log entry must contain:

| Field | Example |
|------|------|
| `timestamp` | ISO 8601 `2026-05-16T12:00:00.123Z` |
| `level` | INFO / WARN / ERROR |
| `service` | `task-scheduler` |
| `action` | `domain.event` (e.g. `result.success`, `auth.api_key.not_configured`) |
| `request_id` | UUIDv4 (attached automatically by middleware) |

Additional per-endpoint fields are listed in the sections below.

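As an illustration of the required fields, the following is a minimal TypeScript sketch of how a log entry could be assembled. The names `buildLogEntry` and `toLogLine` are hypothetical and not taken from the service code; the sketch also assumes the caller passes in the `request_id` that the middleware generated.

```typescript
// Hypothetical helper names; only the field names come from the table above.
type LogLevel = "INFO" | "WARN" | "ERROR";

interface BaseLogEntry {
  timestamp: string;
  level: LogLevel;
  service: string;
  action: string;
  request_id: string;
}

type LogEntry = BaseLogEntry & Record<string, unknown>;

function buildLogEntry(
  level: LogLevel,
  action: string,
  requestId: string,
  extra: Record<string, unknown> = {},
  now: Date = new Date()
): LogEntry {
  // Spread `extra` first so endpoint-specific fields can never
  // overwrite the five required fields.
  return {
    ...extra,
    timestamp: now.toISOString(), // ISO 8601 with milliseconds, UTC
    level,
    service: "task-scheduler",
    action,
    request_id: requestId,
  };
}

// One JSON object per line, ready to be written to stdout.
function toLogLine(entry: LogEntry): string {
  return JSON.stringify(entry);
}

const sampleLine = toLogLine(
  buildLogEntry(
    "INFO",
    "result.success",
    "123e4567-e89b-42d3-a456-426614174000", // sample UUID for illustration
    { job_id: "job-123" }
  )
);
```

Putting `extra` first in the spread is a deliberate choice in this sketch: a caller that accidentally passes a `level` or `service` key in the endpoint-specific fields cannot clobber the mandatory ones.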
### 1.2 Metrics (Phase 2)

Prometheus exposition is reserved for later. Not implemented in Phase 0.8b.

### 1.3 Traces (Phase 2)

OpenTelemetry is reserved for later. Not implemented in Phase 0.8b.

---

## 2. Per-endpoint log fields

### 2.1 `POST /api/v1/jobs`

```jsonc
{
  "level": "INFO",
  "service": "task-scheduler",
  "timestamp": "...",
  "action": "jobs.created",                    // or jobs.create_failed
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "model_filename": "model.onnx",              // sanitized
  "model_size_bytes": 204800000,
  "ref_images_count": 0,
  "platform": "520",
  "duration_ms": 4231,
  "error_code": null                           // or 'user_has_active_job' / 'file_too_large' etc.
}
```

### 2.2 `GET /api/v1/jobs/:id`

```jsonc
{
  "level": "INFO",
  "action": "jobs.get_one",
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "internal_status": "ONNX",                   // internal value, uppercase
  "external_status": "running",
  "etag_match": false,
  "duration_ms": 18
}
```

### 2.3 `GET /api/v1/jobs`

```jsonc
{
  "level": "INFO",
  "action": "jobs.list",
  "request_id": "...",
  "user_id": "...",
  "filter_status": "in_progress",
  "result_count": 3,
  "duration_ms": 25
}
```

### 2.4 `POST /api/v1/jobs/:id/promote`

```jsonc
{
  "level": "INFO",
  "action": "promote.success",                 // or promote.idempotent_hit / promote.not_ready / promote.faa_put_failed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "target_count": 1,
  "duration_ms": 580,
  "error_name": null                           // or 'FAAUnauthorizedError' / 'FAATimeoutError' etc.
}
```

### 2.5 `GET /api/v1/jobs/:id/result` (added in Phase 0.8b)

```jsonc
{
  "level": "INFO",
  "action": "result.success",                  // or result.not_available / result.minio_failed / result.stream_error / result.client_closed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "nef_key": "jobs/.../output/out.nef",        // server-controlled, not considered sensitive
  "size_bytes": 52428800,
  "filename_sent": "yolov5s_kl720.nef",
  "duration_ms": 1234,
  "error_code": null,                          // or 'result_expired' / 'job_not_completed' / 'storage_unavailable'
  "stream_completed": true                     // false if client closed mid-stream
}
```

**Special notes for the result endpoint**:

- **Never log the NEF binary content** (log only the object key + size).
- **`stream_completed: false`** means the client disconnected mid-stream (this may be normal, a flaky network, or a client bug).
- **`result.stream_error`**: the stream failed after the headers had already been sent, so an error status (4xx/5xx) can no longer be returned to the client.

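The distinction between these streaming outcomes can be sketched as a small classifier. This is only an illustration under stated assumptions: `classifyStreamOutcome` and the three input flags are hypothetical names, and the precedence (a stream error wins over a client close) is an assumption rather than documented behavior.

```typescript
// Hypothetical classifier; only the action names come from section 2.5.
type ResultAction =
  | "result.success"
  | "result.client_closed"
  | "result.stream_error";

interface StreamOutcome {
  headersSent: boolean;  // response headers were already flushed
  clientClosed: boolean; // client closed the connection before the end
  streamError: boolean;  // reading from storage or piping failed
}

interface StreamClassification {
  action: ResultAction;
  streamCompleted: boolean;    // value for the `stream_completed` log field
  canSendErrorStatus: boolean; // false once headers are on the wire
}

function classifyStreamOutcome(o: StreamOutcome): StreamClassification {
  // Once headers are sent the status line is fixed, so the failure
  // can only be recorded in the log, not reported to the client.
  const canSendErrorStatus = !o.headersSent;

  if (o.streamError) {
    return { action: "result.stream_error", streamCompleted: false, canSendErrorStatus };
  }
  if (o.clientClosed) {
    return { action: "result.client_closed", streamCompleted: false, canSendErrorStatus };
  }
  return { action: "result.success", streamCompleted: true, canSendErrorStatus };
}

// A failure after headers were sent: logged, but no status can be returned.
const midStreamFailure = classifyStreamOutcome({
  headersSent: true,
  clientClosed: false,
  streamError: true,
});
```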
---

## 3. Auth-related logs

### 3.1 API key middleware

```jsonc
{
  "level": "ERROR",
  "action": "auth.api_key.not_configured",     // env var not set
  "message": "CONVERTER_API_KEY env not set; rejecting all requests"
}
```

```jsonc
{
  "level": "INFO",
  "action": "config.api_key_enabled",          // printed at startup
  "message": "API key middleware enabled",
  "api_key_length": 64,                        // the key itself is never printed
  "timestamp": "..."
}
```

**Note**: failed API key checks (401) are **not logged per request**. Logging every failure would (1) make log volume explode whenever the attack surface is probed and (2) carry a log injection risk. Failures are counted in metrics instead.

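The two ideas in this section (log the key length rather than the key, and count 401s rather than logging each one) can be sketched as follows. `buildApiKeyStartupLog` and `AuthFailureCounter` are hypothetical names for illustration, and the sketch assumes a single startup check decides which of the two entries above is emitted.

```typescript
// Hypothetical helpers for illustration; not the service's actual code.
type LogFields = Record<string, unknown>;

function buildApiKeyStartupLog(
  apiKey: string | undefined,
  now: Date = new Date()
): LogFields {
  if (!apiKey) {
    return {
      level: "ERROR",
      action: "auth.api_key.not_configured",
      message: "CONVERTER_API_KEY env not set; rejecting all requests",
    };
  }
  return {
    level: "INFO",
    action: "config.api_key_enabled",
    message: "API key middleware enabled",
    api_key_length: apiKey.length, // length only, never the key itself
    timestamp: now.toISOString(),
  };
}

// In-memory counter standing in for a real metrics backend (assumption):
// each 401 bumps a number instead of producing a log line.
class AuthFailureCounter {
  private failures = 0;

  recordFailure(): void {
    this.failures += 1;
  }

  total(): number {
    return this.failures;
  }
}

const authFailures = new AuthFailureCounter();
authFailures.recordFailure(); // would be called on each rejected request
```

Because the counter never touches request data, nothing an attacker sends in a header can end up in the log stream.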
### 3.2 OAuth client (fetching the FAA token for promote)

```jsonc
{
  "level": "INFO",
  "service": "oauth-client",
  "action": "oauth.token_obtained",
  "scope": "files:upload.write",
  "token_type": "Bearer",
  "expires_in_sec": 3600,
  "access_token_length": 1024                  // the token itself is never printed
}
```

```jsonc
{
  "level": "WARN",
  "service": "oauth-client",
  "action": "oauth.token_endpoint_error",
  "scope": "files:upload.write",
  "status": 401,
  "error_code": "invalid_client"
}
```

---

## 4. Sensitive data protection

### 4.1 Never log

- Full `Authorization` header contents (including the API key or JWT)
- `CONVERTER_API_KEY`, `KNERON_CONVERTER_CLIENT_SECRET`, the MinIO secret
- File body / model contents
- Full dumps of the JWT payload
- FAA error body (may contain internal endpoints, regions, etc.)
- MinIO error message (may contain endpoint / region / bucket name)

### 4.2 Safe to log

- `client_id`, `user_id` (in API key mode, `client_id` is always `visionA-service`)
- `tenant_id`
- `request_id`
- File metadata: `filename` (sanitized), `size_bytes`, `mimetype`
- Object key (server-controlled, e.g. `jobs/{job_id}/output/out.nef`)
- Error classification: `error_code`, `error_name`, `status` (HTTP)
- Duration, timestamp

### 4.3 Conditionally logged

- IP: still logged; may need masking in GDPR scenarios
- `model_filename`: already sanitized; usually not considered sensitive
- `error_message` on failure: logged only when truncated to 100 chars and free of secrets

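A sanitizing step that enforces these rules before an entry is written could look like the sketch below. The deny-list `FORBIDDEN_KEYS` and the function `sanitizeLogFields` are hypothetical; the key names in the list are illustrative guesses, and the authoritative rules are the ones in `security.md`.

```typescript
// Hypothetical deny-list for illustration; key names are assumptions.
const FORBIDDEN_KEYS = [
  "authorization",
  "api_key",
  "client_secret",
  "access_token",
  "jwt_payload",
];
const MAX_ERROR_MESSAGE_CHARS = 100;

function containsSecret(text: string, secrets: string[]): boolean {
  for (const secret of secrets) {
    // Ignore empty strings: indexOf("") would match everything.
    if (secret.length > 0 && text.indexOf(secret) !== -1) return true;
  }
  return false;
}

function sanitizeLogFields(
  fields: Record<string, unknown>,
  secrets: string[] = []
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    // Section 4.1: drop fields that must never be logged.
    if (FORBIDDEN_KEYS.indexOf(key.toLowerCase()) !== -1) continue;

    const value = fields[key];
    if (key === "error_message" && typeof value === "string") {
      // Section 4.3: check the full message for secrets first,
      // then truncate to 100 chars.
      if (containsSecret(value, secrets)) continue;
      out[key] = value.slice(0, MAX_ERROR_MESSAGE_CHARS);
      continue;
    }
    out[key] = value;
  }
  return out;
}

const safeFields = sanitizeLogFields(
  { Authorization: "Bearer abc", error_code: "storage_unavailable" },
  ["abc"]
);
// safeFields keeps error_code and drops Authorization.
```

The secret check runs on the untruncated message so that a secret straddling the 100-character boundary still causes the field to be dropped.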
---

## 5. Log levels

| Level | Purpose |
|-------|------|
| DEBUG | Not used (disabled in production) |
| INFO | Normal events (job created, result.success, token_obtained, etc.) |
| WARN | Recoverable anomalies (FAA 5xx retry, token cooldown, rate limit hit) |
| ERROR | Unrecoverable / needs human attention (MinIO down, API key not configured, stream interrupted) |

---

## 6. Alerting strategy (planned in Phase 0.8b, implemented in Phase 2)

| Severity | Condition | Response time |
|------|------|---------|
| P0 | Scheduler down / Redis down | 15 min |
| P1 | API 5xx ratio > 5%, sustained for 5 min | 1 hr |
| P1 | `auth.api_key.not_configured` appears (the env var was missed) | 1 hr |
| P2 | `result.stream_error` ratio > 1% | Same day |
| P2 | `promote.faa_put_failed` still failing after retries | Same day |
| P3 | Sudden spike in token cache misses | Next business day |

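Since alerting is only planned at this stage, the following is just a sketch of how the "5xx ratio > 5%, sustained for 5 min" rule might be evaluated. It assumes request counts are aggregated into one-minute buckets and interprets "sustained" as every one of the last five buckets exceeding the threshold; `MinuteBucket` and `shouldFireFiveXxAlert` are hypothetical names.

```typescript
// Hypothetical evaluation of the P1 5xx rule over one-minute buckets.
interface MinuteBucket {
  total: number;        // all responses in that minute
  serverErrors: number; // responses with a 5xx status
}

function ratioExceeds(bucket: MinuteBucket, threshold: number): boolean {
  // A minute with no traffic cannot exceed the threshold.
  return bucket.total > 0 && bucket.serverErrors / bucket.total > threshold;
}

function shouldFireFiveXxAlert(
  buckets: MinuteBucket[], // oldest first
  threshold: number = 0.05,
  sustainedMinutes: number = 5
): boolean {
  if (buckets.length < sustainedMinutes) return false;
  const recent = buckets.slice(buckets.length - sustainedMinutes);
  for (const bucket of recent) {
    if (!ratioExceeds(bucket, threshold)) return false;
  }
  return true;
}

// Five consecutive minutes at 6% errors fire the alert;
// a single healthy minute inside the window resets it.
const sustained: MinuteBucket[] = [];
for (let i = 0; i < 5; i++) {
  sustained.push({ total: 100, serverErrors: 6 });
}
const fires = shouldFireFiveXxAlert(sustained);
```

The same shape would work for the `result.stream_error` ratio > 1% rule by swapping the counter and the threshold.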
---

## 7. Dashboard (Phase 2 design)

**Global dashboard**:

- Per-endpoint QPS / 5min
- p50 / p95 / p99 latency
- 4xx / 5xx ratio
- API key 401 ratio (should be close to 0%; alert above 0.1%)

**Result endpoint dashboard** (added in Phase 0.8b):

- `/result` QPS
- `result.success` / `result.not_available` (200/404/409/410 distribution)
- Ratio of `stream_completed: true` vs `false`
- Average NEF size

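Until a metrics backend exists, the latency percentiles on the dashboard could be derived offline from the `duration_ms` field of the logs. The sketch below uses the nearest-rank method, which is one common definition among several; `percentile` is a hypothetical helper, and a real Prometheus setup would more likely use histograms.

```typescript
// Nearest-rank percentile over duration_ms samples (hypothetical helper).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");

  // Copy before sorting so the caller's array is left untouched.
  const sorted = values.slice().sort((a, b) => a - b);

  // Nearest-rank: smallest value with at least p% of samples at or below it.
  // Multiply before dividing to keep the arithmetic exact for integer p.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// duration_ms values taken from the example log entries in section 2.
const durations = [4231, 18, 25, 580, 1234];
const p50 = percentile(durations, 50); // 580
const p95 = percentile(durations, 95); // 4231
```

With only a handful of samples the high percentiles collapse onto the maximum, which is expected for nearest-rank and a reminder that p99 is only meaningful over a large window.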
---

## 8. Phase 0.8b change summary

### 8.1 Added

- `result.*` action log family (success / not_available / minio_failed / stream_error / client_closed)
- `auth.api_key.*` action log family
- `config.api_key_*` startup logs

### 8.2 Removed

- `auth.verify_failed` (OAuth JWT verification failure)
- `auth.middleware_unexpected_error` (OAuth middleware catch-all)
- JWKS-related logs (JWKS is gone)

### 8.3 Kept

- `jobs.created` / `jobs.get_one` / `jobs.list`
- The whole `promote.*` family
- `oauth.token_*` (OAuth client logs used by promote)