# Observability Design

> **Status**: Phase 1 complete; Phase 0.8b adds log and metrics definitions for the `/result` endpoint.
>
> **Companion docs**: `security.md` (rules for keeping secrets out of logs), `performance.md` (SLO measurement).

---

## 1. Three pillars

Phase 1 + 0.8b: **Logs only** (Metrics and Traces are deferred to Phase 2).

### 1.1 Logs (structured JSON)

All logs go to stdout and are picked up by the docker / k8s collector (they are not shipped to an external service).

Every log entry must contain:

| Field | Example |
|-------|---------|
| `timestamp` | ISO 8601, e.g. `2026-05-16T12:00:00.123Z` |
| `level` | INFO / WARN / ERROR |
| `service` | `task-scheduler` |
| `action` | `domain.event` (e.g. `result.success`, `auth.api_key.not_configured`) |
| `request_id` | UUIDv4 (attached automatically by the middleware) |

Additional per-endpoint fields are listed in the sections below.

### 1.2 Metrics (Phase 2)

Reserved for Prometheus exposition. Not implemented in Phase 0.8b.

### 1.3 Traces (Phase 2)

Reserved for OpenTelemetry. Not implemented in Phase 0.8b.

---

## 2. Log fields per endpoint

### 2.1 `POST /api/v1/jobs`

```jsonc
{
  "level": "INFO",
  "service": "task-scheduler",
  "timestamp": "...",
  "action": "jobs.created",        // or jobs.create_failed
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "model_filename": "model.onnx",  // sanitized
  "model_size_bytes": 204800000,
  "ref_images_count": 0,
  "platform": "520",
  "duration_ms": 4231,
  "error_code": null               // or 'user_has_active_job' / 'file_too_large' etc
}
```

### 2.2 `GET /api/v1/jobs/:id`

```jsonc
{
  "level": "INFO",
  "action": "jobs.get_one",
  "request_id": "...",
  "job_id": "...",
  "user_id": "...",
  "client_id": "visionA-service",
  "internal_status": "ONNX",       // internal value, uppercase
  "external_status": "running",
  "etag_match": false,
  "duration_ms": 18
}
```

### 2.3 `GET /api/v1/jobs`

```jsonc
{
  "level": "INFO",
  "action": "jobs.list",
  "request_id": "...",
  "user_id": "...",
  "filter_status": "in_progress",
  "result_count": 3,
  "duration_ms": 25
}
```

### 2.4 `POST /api/v1/jobs/:id/promote`

```jsonc
{
  "level": "INFO",
  "action": "promote.success",     // or promote.idempotent_hit / promote.not_ready / promote.faa_put_failed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "target_count": 1,
  "duration_ms": 580,
  "error_name": null               // or 'FAAUnauthorizedError' / 'FAATimeoutError' etc
}
```

### 2.5 `GET /api/v1/jobs/:id/result` (new in Phase 0.8b)

```jsonc
{
  "level": "INFO",
  "action": "result.success",      // or result.not_available / result.minio_failed / result.stream_error / result.client_closed
  "request_id": "...",
  "job_id": "...",
  "client_id": "visionA-service",
  "nef_key": "jobs/.../output/out.nef",  // server-controlled, not considered sensitive
  "size_bytes": 52428800,
  "filename_sent": "yolov5s_kl720.nef",
  "duration_ms": 1234,
  "error_code": null,              // or 'result_expired' / 'job_not_completed' / 'storage_unavailable'
  "stream_completed": true         // false if client closed mid-stream
}
```

**Notes specific to the result endpoint**:

- **Never log the NEF binary content** (log only the object key and size).
- **`stream_completed: false`** means the client disconnected mid-stream. This may be normal, or it may indicate a poor network or a client bug.
- **`action = result.stream_error`** means the stream failed after the response headers were already sent, so the server can no longer return an error status to the client.

---

## 3. Auth-related logs

### 3.1 API key middleware

```jsonc
{
  "level": "ERROR",
  "action": "auth.api_key.not_configured",  // env var not set
  "message": "CONVERTER_API_KEY env not set; rejecting all requests"
}
```

```jsonc
{
  "level": "INFO",
  "action": "config.api_key_enabled",       // printed at startup
  "message": "API key middleware enabled",
  "api_key_length": 64,                     // the key itself is never printed
  "timestamp": "..."
}
```

**Note**: failed API key validations (401) are **not logged per request**. Logging every failure would (1) flood the logs whenever the endpoint is under attack and (2) open a log-injection risk. These failures are counted in metrics instead (metrics land in Phase 2, see §1.2).

### 3.2 OAuth client (obtains the FAA token for promote)

```jsonc
{
  "level": "INFO",
  "service": "oauth-client",
  "action": "oauth.token_obtained",
  "scope": "files:upload.write",
  "token_type": "Bearer",
  "expires_in_sec": 3600,
  "access_token_length": 1024               // the token itself is never printed
}
```

```jsonc
{
  "level": "WARN",
  "service": "oauth-client",
  "action": "oauth.token_endpoint_error",
  "scope": "files:upload.write",
  "status": 401,
  "error_code": "invalid_client"
}
```

---

## 4. Sensitive data protection

### 4.1 Never log

- The full contents of the `Authorization` header (including API key, JWT)
- `CONVERTER_API_KEY`, `KNERON_CONVERTER_CLIENT_SECRET`, the MinIO secret
- File body / model contents
- A full dump of the JWT payload
- FAA error body (may contain internal endpoint, region, etc.)
- MinIO error message (may contain endpoint, region, bucket name)

### 4.2 Safe to log

- `client_id`, `user_id` (in API key mode, `client_id` is always `visionA-service`)
- `tenant_id`
- `request_id`
- File metadata: `filename` (sanitized), `size_bytes`, `mimetype`
- Object key (server-controlled, e.g. `jobs/{job_id}/output/out.nef`)
- Error classification: `error_code`, `error_name`, `status` (HTTP)
- Duration, timestamp

### 4.3 Conditionally logged

- IP: still logged; may need masking where GDPR applies
- `model_filename`: already sanitized, normally not considered sensitive
- `error_message` on failure: logged only if truncated to 100 chars and free of secrets

---

## 5. Log levels

| Level | Usage |
|-------|-------|
| DEBUG | Not used (disabled in production) |
| INFO | Normal events (job created, `result.success`, `token_obtained`, etc.) |
| WARN | Recoverable anomalies (FAA 5xx retry, token cooldown, rate limit hit) |
| ERROR | Unrecoverable, or needs human attention (MinIO down, API key not configured, stream interrupted) |

---

## 6. Alerting strategy (planned in Phase 0.8b, implemented in Phase 2)

| Severity | Condition | Response time |
|----------|-----------|---------------|
| P0 | Scheduler down / Redis down | 15 min |
| P1 | API 5xx ratio > 5%, sustained for 5 min | 1 hr |
| P1 | `auth.api_key.not_configured` appears (the env var was not set) | 1 hr |
| P2 | `result.stream_error` ratio > 1% | Same day |
| P2 | `promote.faa_put_failed` still failing after retries | Same day |
| P3 | Sudden spike in token cache misses | Next business day |

---

## 7. Dashboard (Phase 2 design)

**Global dashboard**:

- QPS per endpoint (5-min window)
- p50 / p95 / p99 latency
- 4xx / 5xx ratio
- API key 401 ratio (should be close to 0%; alert above 0.1%)

**Result endpoint dashboard** (new in Phase 0.8b):

- `/result` QPS
- `result.success` / `result.not_available` (distribution across 200/404/409/410)
- `stream_completed`: true vs false ratio
- Average NEF size

---

## 8. Summary of Phase 0.8b changes

### 8.1 Added

- `result.*` action logs (success / not_available / minio_failed / stream_error / client_closed)
- `auth.api_key.*` action logs
- `config.api_key_*` startup logs

### 8.2 Removed

- `auth.verify_failed` (OAuth JWT verification failure)
- `auth.middleware_unexpected_error` (OAuth middleware catch-all)
- JWKS-related logs (JWKS is no longer used)

### 8.3 Retained

- `jobs.created` / `jobs.get_one` / `jobs.list`
- `promote.*` (the whole series)
- `oauth.token_*` (logs from the OAuth client used by promote)
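---

As a non-normative illustration of the mandatory fields in §1.1 and the rules in §4, the following is a minimal sketch of how a log entry could be assembled. The implementation language of the service is not stated in this document, so Python is an assumption, and `build_log_entry`, `emit`, and the `FORBIDDEN_KEYS` deny-list are hypothetical names. The sketch only drops deny-listed keys and truncates `error_message`; detecting secrets embedded inside otherwise allowed values is out of scope here.

```python
import json
import sys
from datetime import datetime, timezone

# Hypothetical deny-list loosely derived from the "never log" rules (§4.1).
# The exact key names are assumptions made for this sketch.
FORBIDDEN_KEYS = {
    "authorization", "api_key", "client_secret",
    "access_token", "jwt_payload", "file_body",
}
MAX_ERROR_MESSAGE_CHARS = 100  # truncation limit from §4.3


def build_log_entry(level, action, request_id, service="task-scheduler", **fields):
    """Return a dict that starts with the five mandatory fields from §1.1."""
    now = datetime.now(timezone.utc)
    entry = {
        # ISO 8601 with millisecond precision and a trailing "Z"
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "service": service,
        "action": action,
        "request_id": request_id,
    }
    for key, value in fields.items():
        if key.lower() in FORBIDDEN_KEYS:
            continue  # drop deny-listed keys instead of logging them
        if key == "error_message" and isinstance(value, str):
            value = value[:MAX_ERROR_MESSAGE_CHARS]
        entry[key] = value
    return entry


def emit(entry, stream=sys.stdout):
    """Write the entry to the stream as a single JSON line (stdout by default)."""
    stream.write(json.dumps(entry, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    emit(build_log_entry(
        "INFO", "result.success", "hypothetical-request-id",
        job_id="hypothetical-job-id", size_bytes=52428800, stream_completed=True,
    ))
```

Dropping forbidden keys at the point where the entry is built, rather than at each call site, keeps the rule in one place; a real implementation would likely also need the sanitization steps described in `security.md`.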
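---

§3.1 states that 401 responses are counted rather than logged per request, and §7 suggests alerting when the 401 ratio exceeds 0.1%. The sketch below shows one way such a counter might look before a real metrics backend exists. It is an in-process stand-in only: `AuthFailureCounter` and its methods are hypothetical names, Python is an assumed language, and a Phase 2 implementation would presumably use the Prometheus exposition mentioned in §1.2 instead.

```python
import threading


class AuthFailureCounter:
    """In-memory tally of all responses and of 401 responses (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._total = 0
        self._unauthorized = 0

    def record(self, status_code):
        """Count one response; no per-request log line is written."""
        with self._lock:
            self._total += 1
            if status_code == 401:
                self._unauthorized += 1

    def ratio(self):
        """Fraction of recorded responses that were 401 (0.0 if nothing recorded)."""
        with self._lock:
            if self._total == 0:
                return 0.0
            return self._unauthorized / self._total

    def should_alert(self, threshold=0.001):
        """True when the 401 ratio is above the threshold (0.1% by default, per §7)."""
        return self.ratio() > threshold


if __name__ == "__main__":
    counter = AuthFailureCounter()
    for status in [200] * 998 + [401] * 2:
        counter.record(status)
    print(counter.ratio(), counter.should_alert())
```

Because only two integers are kept, an attacker sending many invalid keys cannot inflate log volume or inject content into the logs, which is the motivation given in §3.1.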