Castello · Internal
Logging & Monitoring Audit
Filed 2026.05.25 feat/logging-monitoring
Observability · Deep Dive

Two systems, 358 calls, one silent level.

A full survey of logging infrastructure across _next — what levels exist, where they fire, what stays silent, and where the architecture diverges from its own schema.

Scope · _next/ (49 files)
Branch · feat/logging-monitoring
Systems · JobLogger + log(LogLevel)
Output · GCP Cloud Logging (JSON)
358 · Total log calls
49 · Files instrumented
6 · Debug-level calls
0 · Critical-level calls
5 · Distinct log levels

Summary of findings

i. Dual logging infrastructure · Informational
   monitoring/logger.ts · packages/core/logging.ts
   2 separate systems
ii. Current implementation, 7 subsections · Gap
   Levels · Dead Code · Debug Gap · Output · Product Map · GCP Logging · Dark Areas
   358 call sites · 3 gaps
iii. Entire subsystems produce zero logs · Attention
   webhook.ts · portfolio_intraday.ts · watchlist/ · news/sentiment
   4+ silent modules
iv. Eleven products, three observability pillars · Attention
   logs · traces · metrics · current status per product
   11 products mapped
v. GCP-native implementation plan · Roadmap
   OTel Collector · 6 steps · 3 infra-only · 3 code changes
   6 steps · 0 new vendors

i.

Two systems, one codebase.

monitoring/logger.ts · monitoring/schema.ts · packages/core/logging.ts
Informational

Logging is split across two independent systems that serve different execution contexts and produce structurally different output. They share the same GCP destination and the same physical output method (console.log / console.error), but the shape of what they emit is quite different.

System A — JobLogger

Factory-based structured logger for batch and cron jobs. Created via createLogger(meta: JobMeta) in monitoring/logger.ts. Every entry carries full job identity threaded through: job name, job type, run UUID, environment, a typed LogEvent enum value, a LogPhase, optional entity context IDs, quantitative metrics, and structured error detail. The schema is defined exhaustively in monitoring/schema.ts (12 KB).

Interface
JobLogger — 5 methods: debug(), info(), warn(), error(), critical()
Identity fields
jobName · jobType · runId (UUID per execution) · environment
Classification
LogEvent (22 event types) · LogPhase (5 phases)
Sub-objects
context (entity IDs) · metrics (counts, durations, tokens) · error (code + message + stack)
Routing
DEBUG/INFO → console.log (stdout) · WARNING/ERROR/CRITICAL → console.error (stderr)
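
To make the shape of System A concrete, here is a minimal sketch of a factory with the five methods and the stream routing described above. It is not the code in monitoring/logger.ts: the field types, the injectable `stdout`/`stderr` sinks, and the choice to pass `runId` in through `JobMeta` are all assumptions made for illustration.

```typescript
// Hypothetical sketch: the real types live in monitoring/schema.ts.
type Severity = "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";

interface JobMeta {
  jobName: string;
  jobType: "batch" | "cron";
  runId: string; // UUID per execution (assumed here to be supplied by the caller)
  environment: string;
}

interface LogFields {
  event: string; // a LogEvent value such as "JOB_START"
  phase: string; // a LogPhase value such as "INIT"
  context?: Record<string, string>;
  metrics?: Record<string, number>;
  error?: { code: string; message: string; stack?: string };
}

type Sink = (line: string) => void;
type LogMethod = (message: string, fields: LogFields) => void;

interface JobLogger {
  debug: LogMethod;
  info: LogMethod;
  warn: LogMethod;
  error: LogMethod;
  critical: LogMethod;
}

function createLogger(
  meta: JobMeta,
  stdout: Sink = (line) => console.log(line),
  stderr: Sink = (line) => console.error(line),
): JobLogger {
  const emit = (severity: Severity): LogMethod => (message, fields) => {
    // Job identity is threaded through every entry.
    const entry = { severity, message, timestamp: new Date().toISOString(), ...meta, ...fields };
    // DEBUG/INFO go to stdout; WARNING and above go to stderr.
    const sink = severity === "DEBUG" || severity === "INFO" ? stdout : stderr;
    sink(JSON.stringify(entry));
  };
  return {
    debug: emit("DEBUG"),
    info: emit("INFO"),
    warn: emit("WARNING"),
    error: emit("ERROR"),
    critical: emit("CRITICAL"),
  };
}
```

Injecting the sinks keeps the sketch testable without capturing process streams; a real job would rely on the defaults.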

System B — log(LogLevel)

A lightweight generic logger in packages/core/logging.ts for application-layer code — API middleware, service packages, library utilities. Output is a flat JSON object: severity, message, timestamp, and any additional fields spread from the data argument. No job identity, no event taxonomy, no structured sub-objects.

Signature
log(level, message, data?, context?)
Levels
debug · info · warn · error (4, no critical)
Output shape
Flat JSON — severity + message + timestamp + spread data
Routing
error/warn → stderr · info/debug → stdout
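
For comparison, a flat logger of this kind can be sketched in a few lines. This is an illustration, not the contents of packages/core/logging.ts: the `LogLevel` values, the `formatEntry` helper, and the way `context` is spread alongside `data` are assumptions (the audit does not describe how `context` is used). The severity names follow the level-to-GCP mapping given later in this report.

```typescript
// Hypothetical sketch of a flat generic logger.
const LogLevel = { DEBUG: "debug", INFO: "info", WARN: "warn", ERROR: "error" } as const;
type LogLevel = (typeof LogLevel)[keyof typeof LogLevel];

// Logger level -> GCP severity string.
const GCP_SEVERITY: Record<LogLevel, string> = {
  debug: "DEBUG",
  info: "INFO",
  warn: "WARNING",
  error: "ERROR",
};

type Fields = Record<string, unknown>;

function formatEntry(
  level: LogLevel,
  message: string,
  data: Fields = {},
  context: Fields = {},
): { stream: "stdout" | "stderr"; line: string } {
  // Fixed fields are written last so caller data cannot overwrite them.
  const entry = {
    ...context,
    ...data,
    severity: GCP_SEVERITY[level],
    message,
    timestamp: new Date().toISOString(),
  };
  const stream = level === "error" || level === "warn" ? "stderr" : "stdout";
  return { stream, line: JSON.stringify(entry) };
}

function log(level: LogLevel, message: string, data?: Fields, context?: Fields): void {
  const { stream, line } = formatEntry(level, message, data, context);
  if (stream === "stderr") console.error(line);
  else console.log(line);
}

log(LogLevel.INFO, "Fetched brokerages from SnapTrade", { count: 42 });
```

Nothing in this shape identifies the job or the run, which is the structural gap the rest of this section describes.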

Critical: JobLogger is not yet in production

The monitoring/ module — JobLogger, schema.ts, all 22 LogEvent types — exists on feat/logging-monitoring but has not been imported or called by any running job. Every batch and cron job in production today uses packages/core/logging.ts, the flat generic logger. The rich schema described above is the design target, not the current state.

The generic log() function is a zero-ceremony escape hatch for code that isn't job-scoped. The problem is that the batch jobs themselves still use it. SnapTrade packages also use the generic logger despite being primary consumers of job context.

JobName enum — 9 predefined jobs

Constant | Job type | Description
BATCH_PORTFOLIO_INSIGHTS | batch | LLM-powered portfolio analysis via Anthropic Batch API
BATCH_RISK_ANALYSIS | batch | LLM-powered risk analysis via Anthropic Batch API
CACHE_MARKET_MOVERS | cron | Refresh market movers cache
DAILY_SNAPTRADE_SNAPSHOT | cron | Daily SnapTrade portfolio snapshot
FETCH_OXR_FX_RATES | cron | Sync OpenExchangeRates FX data
PORTFOLIO_METRICS | cron | Compute portfolio performance metrics
SNAPTRADE_SYNC_RECONCILE | cron | Reconcile SnapTrade sync state
SYNC_BROKERAGE_INFO | cron | Sync brokerage metadata from SnapTrade
UPDATE_RECAP_PAGE_PATHS | cron | Refresh recap page path index
ii.

Current implementation.

Levels · Dead Code · Debug Gap · Output · Product Map · GCP Logging · Dark Areas

Seven subsections covering the current state of logging in _next: level distribution, dead code, diagnostic gaps, output format and GCP routing, per-product log destinations, Cloud Logging queryability, and unactivated observability infrastructure.

a.

Level taxonomy and distribution.

358 total call sites across 49 files · INFO + WARN dominate
Informational

Five levels are defined across the two systems. Four are in active use. The distribution is heavily weighted toward INFO and WARN — diagnostic depth at the DEBUG tier is nearly absent, and the highest severity tier (CRITICAL) is dead code.

INFO · 132 · 36.9%
WARN · 116 · 32.4%
ERROR · 104 · 29.1%
DEBUG · 6 · 1.7%
CRITICAL · 0 · 0%

INFO — phase markers and operation summaries

INFO is used for high-level lifecycle events: job start/end, phase transitions, batch submission and completion, account sync milestones. It always includes entity IDs and summary metrics. It never fires per-item in tight loops — that's left to DEBUG (or not logged at all).

WARN — recoverable degradation

WARN is the second-busiest level and covers three distinct patterns: external system degradation (SnapTrade 429s, network retries), concurrent state machine race conditions in the sync pipeline, and data integrity gaps where processing continues with partial data (missing FX rates, unknown tickers, null brokerage IDs). Every WARN means the operation continued.

ERROR — terminal failures

ERROR marks operations that stopped. It is the signal for incidents and alerting. Each entry includes a machine-readable error code, entity IDs for correlation, and consequence language where data loss is possible. No ERROR should be informational: if the operation recovered, it was a WARN.

DEBUG — success signals on high-frequency paths

The six DEBUG calls cover narrow cases: market holiday cache refreshes, batch polling status, per-portfolio save confirmations, bulk-save outcomes, and skipped-item notices. These are correct targets for DEBUG. The gap is that almost no high-frequency path produces any diagnostic signal at all.

Level-to-GCP severity mapping

Logger level | GCP severity | Stream | Alert eligible
DEBUG | DEBUG | stdout | No
INFO / info | INFO | stdout | No
WARN / warn | WARNING | stderr | Optional
ERROR / error | ERROR | stderr | Yes
CRITICAL | CRITICAL | stderr | Yes, immediate

WARN breakdown by pattern

Pattern | Count | Example
External system retries (rate limits, network errors) | ~28 | SnapTrade 429, upstream 5xx, network timeout
Concurrent state race conditions | ~18 | Sync state machine: complete UPDATE no-op, row disappeared
Data integrity gaps (partial data, missing FX) | ~32 | Missing FX rate, null brokerage_id, skipped SnapTrade activity
Fallback mechanisms engaged | ~14 | Bulk save → per-item fallback, brokerage_id left NULL
Graceful shutdown signals | ~8 | SIGTERM received, forced shutdown
Unknown / unrecognised data | ~16 | Unknown custom_id in batch result, unsupported ticker
b.

CRITICAL is dead code.

monitoring/schema.ts:Severity · monitoring/logger.ts:JobLogger · 0 call sites
Gap

The Severity.CRITICAL level and the logger.critical() method are fully implemented. The schema documents their intended contract precisely: reserved for log entries emitted immediately before exit(1), mapping to GCP's highest alert tier. But the level has zero call sites across the entire codebase.

The actual pattern for fatal exits in every batch job is to log at ERROR and then call Deno.exit(1) separately. The two signals are disconnected: GCP sees an ERROR entry and then a container exit, but never a CRITICAL entry that would let an alerting policy fire specifically on "job crashed, not just errored."

Current exit pattern (batch-portfolio-insights)

// What actually happens today
catch (err) {
  log(LogLevel.ERROR, "Batch job crashed", { message: err.message });
  Deno.exit(1);
}

// What the schema intended
catch (err) {
  logger.critical("Batch job crashed", {
    event: LogEvent.JOB_FATAL,
    phase: LogPhase.TEARDOWN,
    error: toErrorDetail("JOB_CRASHED", err),
  });
  Deno.exit(1);
}

Why this matters

GCP Cloud Logging alerting policies distinguish severity tiers. An ERROR policy fires on all errors including non-fatal ones. A CRITICAL policy fires only on the most severe failures. Without CRITICAL, there is no GCP-native way to build a "job crashed" alert that doesn't also fire on recoverable per-item errors — or you have to filter by jsonPayload.event="JOB_FATAL" manually rather than by severity.

Scope

The fix is mechanical: replace the terminal log(LogLevel.ERROR) + Deno.exit(1) pair in each job's top-level catch block with logger.critical() using LogEvent.JOB_FATAL. Eight jobs are affected. No schema changes required — the infrastructure is already there.
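
The intended pattern above calls `toErrorDetail("JOB_CRASHED", err)`, which this audit does not reproduce. A minimal sketch follows, assuming the helper only needs to normalise an unknown thrown value into the code + message + stack shape of the `error` sub-object; the real implementation in monitoring/ may differ.

```typescript
// Hypothetical sketch: shape taken from the "error (code + message + stack)" sub-object.
interface ErrorDetail {
  code: string;
  message: string;
  stack?: string;
}

function toErrorDetail(code: string, err: unknown): ErrorDetail {
  if (err instanceof Error) {
    // Only attach the stack when the runtime provides one.
    return err.stack === undefined
      ? { code, message: err.message }
      : { code, message: err.message, stack: err.stack };
  }
  // Anything can be thrown in JavaScript, so fall back to a string form.
  return { code, message: String(err) };
}

const detail = toErrorDetail("JOB_CRASHED", new Error("batch submission failed"));
console.log(detail.code, detail.message);
```

Handling non-Error values matters in a top-level catch block, where `err` is typed `unknown` and reading `err.message` directly does not type-check.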

c.

DEBUG covers 1.7% of calls.

6 total debug sites · no tracing in API, SnapTrade client, or DB operations
Attention

The six existing DEBUG calls are correctly placed — they cover success signals on high-cardinality operations that don't warrant INFO noise. The problem is scope: whole subsystems that would benefit from per-operation tracing produce no diagnostic output between the WARN/ERROR conditions they do log.

Existing DEBUG call sites

File | Message | Data
trading_day_service.ts:163 | Refreshed market holiday cache | exchange · holidayCount
llm-batch/anthropic.ts | Batch still processing | batchId · status
portfolio/service.ts | Saved portfolio insights | userId · accountId · targetDate
portfolio/service.ts | Bulk-saved portfolio insights | portfolios · rows
llm-batch/retry.ts | No retryable failed requests | batchJobId
update-recap-page-paths/main.ts | Page not found, skipping | isin · status

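Where DEBUG is absent but needed

As the subsection summary notes, three areas have no per-operation tracing at all: the API layer, the SnapTrade client, and DB operations.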

Note on filtering

DEBUG entries are routed to stdout and excluded from GCP alerting by severity filter. Adding DEBUG logging to high-frequency paths (per-request, per-query) carries no alerting cost — the risk is log volume in production. A LOG_LEVEL env var or DEBUG=true flag should gate these calls before they are added.
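
A level gate of this kind could be sketched as below. The helper names `parseMinLevel` and `shouldLog`, and the default of "info" when `LOG_LEVEL` is unset or unrecognised, are assumptions for illustration; nothing like this exists in the codebase yet.

```typescript
// Hypothetical sketch of a LOG_LEVEL gate for the flat logger's four levels.
const LEVEL_ORDER = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVEL_ORDER)[number];

// Parse the raw env value; unknown or missing values fall back to "info" (assumed default).
function parseMinLevel(raw: string | undefined): Level {
  const normalised = (raw ?? "").trim().toLowerCase();
  const match = LEVEL_ORDER.find((level) => level === normalised);
  return match ?? "info";
}

// True when an entry at `level` should be emitted under the configured minimum.
function shouldLog(level: Level, minLevel: Level): boolean {
  return LEVEL_ORDER.indexOf(level) >= LEVEL_ORDER.indexOf(minLevel);
}

// In a Deno job the raw value would come from Deno.env.get("LOG_LEVEL").
const minLevel = parseMinLevel("debug");
if (shouldLog("debug", minLevel)) {
  console.log(JSON.stringify({ severity: "DEBUG", message: "per-request detail" }));
}
```

Checking the gate before building the payload keeps high-frequency paths cheap when DEBUG is off.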

d.

Output format and GCP routing.

JSON structured · stdout/stderr · Cloud Logging jsonPayload
Sound

Both logging systems emit newline-delimited JSON to standard streams. GCP Cloud Run captures these streams and forwards them to Cloud Logging, where the severity, message, and timestamp fields are promoted to first-class indexed columns. Everything else lands in jsonPayload.

JobLogger entry shape

{
  "severity":    "INFO",
  "message":     "Starting portfolio insights batch job",
  "timestamp":   "2026-05-25T14:32:18.123Z",
  "jobName":     "batch-portfolio-insights",
  "jobType":     "batch",
  "environment": "production",
  "runId":       "550e8400-e29b-41d4-a716-446655440000",
  "event":       "JOB_START",
  "phase":       "INIT",
  "context": {
    "targetDate": "2026-05-23"
  },
  "metrics": {
    "total": 450,
    "durationMs": 5432
  }
}

log(LogLevel) entry shape

{
  "severity":     "INFO",
  "message":      "-> user.deleteAccount",
  "functionName": "oRPC_procedure",
  "timestamp":    "2026-05-25T14:32:18.123Z"
}

GCP filter examples

Top-level fields in the JobLogger output are indexed and filterable directly in GCP Log Explorer. The event and errorCode fields were specifically designed for this — their values were chosen to be useful as alert conditions, not just human-readable labels.

-- Only job-fatal events
jsonPayload.event="JOB_FATAL"

-- All errors from one job
severity>=ERROR AND jsonPayload.jobName="batch-portfolio-insights"

-- Trace all logs for one run
jsonPayload.runId="550e8400-e29b-41d4-a716-446655440000"

-- Follow a specific user across all jobs
jsonPayload.context.userId="<uuid>"

What's working well

  • All entries are valid JSON — no mixed text/JSON lines, no unstructured console.log calls.
  • The runId UUID provides a complete per-execution trace — all logs for one job run can be isolated with a single filter.
  • The event enum vocabulary is well-designed and GCP-filter-ready.
  • The errorCode field is a flat top-level field specifically for fast alert condition matching.
  • Stderr/stdout routing respects GCP's severity inference — entries at WARNING and above are flagged automatically.

LogEvent taxonomy — 22 defined event types

Category | Events
Job lifecycle | JOB_START · JOB_SUCCESS · JOB_PARTIAL_FAILURE · JOB_FATAL
Step-level | STEP_START · STEP_SUCCESS · STEP_FAILURE
Item-level | ITEM_PROCESSED · ITEM_FAILED · ITEM_SKIPPED
External systems | EXTERNAL_API_REQUEST · EXTERNAL_API_ERROR · DB_OPERATION · DB_ERROR
LLM batch | LLM_BATCH_SUBMITTED · LLM_BATCH_COMPLETE · LLM_BATCH_ERROR
Process signals | SHUTDOWN_SIGNAL
e.

Where logs actually land.

GCP Cloud Logging · Supabase Dashboard · Sentry · system_logs · nowhere
Informational

All ten Cloud Run jobs and their orchestration layers log to GCP Cloud Logging. They arrive under different resource.type labels — so they appear separated in Log Explorer — but they are all in the same logging system. Cloud Scheduler and Cloud Workflows are not separate log destinations; they are separate resource types within Cloud Logging.

Full product × destination matrix

Product | Description | Hosted on | Log destination

Batch & cron jobs (Cloud Run)
batch-portfolio-insights | LLM batch · weekdays 2am | Cloud Run Job | GCP Cloud Logging
batch-risk-analysis | LLM batch · weekdays | Cloud Run Job | GCP Cloud Logging
news-ingest | cron · weekdays 6am | Cloud Run Job | GCP Cloud Logging
news-sentiment-submit | workflow-triggered | Cloud Run Job | GCP Cloud Logging
batch-poll | workflow loop · every 5min | Cloud Run Job | GCP Cloud Logging
news-sentiment-write | workflow-triggered | Cloud Run Job | GCP Cloud Logging
snaptrade-sync-reconcile | cron · every 15min | Cloud Run Job | GCP Cloud Logging
cache-market-movers | cron · weekdays 9pm | Cloud Run Job | GCP Cloud Logging
fetch-openexchangerates-fx-rates | cron · daily 2am | Cloud Run Job | GCP Cloud Logging
update-recap-page-paths | cron · daily 2am | Cloud Run Job | GCP Cloud Logging

Workflow orchestration
news-sentiment workflow | Cloud Workflows YAML | Cloud Workflows | GCP Cloud Logging
Cloud Scheduler jobs | all job triggers | Cloud Scheduler | GCP Cloud Logging
f.

GCP Cloud Logging: what's queryable.

resource.type="cloud_run_job" · project: castello-backend · region: us-central1
Informational

All ten Cloud Run jobs write structured JSON to stdout and stderr. Cloud Run captures these streams automatically and forwards them to GCP Cloud Logging — no SDK or explicit client required. The GCP project is castello-backend, region us-central1, and all images are stored in Artifact Registry under castello-batch.

What the logs look like today

Because the monitoring module's JobLogger is not yet connected, every job currently emits flat JSON via packages/core/logging.ts — not the rich schema defined in monitoring/schema.ts. There are no jobName, runId, event, or phase fields in production logs today. The fields available for filtering are limited to those spread from each log call's data argument, which varies by callsite.

// What GCP receives today (flat, core/logging.ts)
{
  "severity": "INFO",
  "message":  "Fetched X brokerages from SnapTrade",
  "timestamp":"2026-05-25T14:32:18.123Z",
  "count":    42
}

// What GCP will receive after monitoring module is connected
{
  "severity":  "INFO",
  "message":   "Fetched X brokerages from SnapTrade",
  "timestamp": "2026-05-25T14:32:18.123Z",
  "jobName":   "sync-brokerage-info",
  "runId":     "550e8400-e29b-41d4-a716-446655440000",
  "event":     "STEP_SUCCESS",
  "phase":     "PREFETCH",
  "metrics":   { "total": 42, "durationMs": 812 }
}

How to navigate to job logs in GCP Console

Log Explorer URL
GCP Console → Logging → Log Explorer (project: castello-backend)
All job logs
resource.type="cloud_run_job"
Specific job
resource.type="cloud_run_job" AND resource.labels.job_name="portfolio-insights"
Errors only
resource.type="cloud_run_job" AND severity>=ERROR
One execution
resource.labels.execution_id="portfolio-insights-abc12"

Cloud Workflows logs

Cloud Workflows writes to Cloud Logging under resource.type="workflows.googleapis.com/Workflow". By default only errors are logged. Setting call_log_level = "LOG_ALL_CALLS" on the Terraform resource enables step-level entries — which steps executed, what arguments they passed to each Cloud Run job, and what the response was. Custom messages can also be written from within the workflow YAML using sys.log(), which land in the same stream.

Log filter
resource.type="workflows.googleapis.com/Workflow" AND resource.labels.workflow_id="news-sentiment"
Enable detail
Set call_log_level = "LOG_ALL_CALLS" in infra/jobs/news-sentiment-workflow.tf
Executions tab
The Workflows console Executions tab is a separate step-state UI — not a log store. It shows the execution graph, inputs, outputs, and current step. Useful alongside Cloud Logging, not instead of it.
Polling loop
batch-poll runs up to 288 times (24h max, every 5min) — each invocation is a separate Cloud Run Job execution with its own log stream under resource.type="cloud_run_job"

Cloud Scheduler logs

Cloud Scheduler writes to Cloud Logging under resource.type="cloud_scheduler_job" on every invocation — the time it fired, the HTTP target it called, and the response status code it received. The "View Logs" button in the Cloud Scheduler console is a shortcut link that opens Cloud Logging with this filter pre-applied. It is not a separate log store.

Log filter
resource.type="cloud_scheduler_job" AND resource.labels.job_name="portfolio-insights-daily"
What you see
Did the scheduler fire? Did Cloud Run accept the request? — not what the job did internally. Follow the Cloud Run execution ID to find the job's own logs.

What GCP Cloud Logging does well here

  • Zero-config ingestion — stdout is all that's needed; no SDK, no sidecar.
  • Each Cloud Run execution gets a distinct execution_id label, so you can isolate one run's logs with a single filter even before runId is added.
  • Severity routing is correct — WARN/ERROR already land on stderr and are flagged at the right GCP tier.
  • 30-day retention by default; configurable log sinks to BigQuery for longer-term analysis.
g.

Three systems producing no signal.

JobLogger unconnected · OpenTelemetry no-op · Cloud Monitoring absent
Gap

Beyond the silent code paths identified in §iii, three pieces of observability infrastructure exist in the codebase but produce no output in production. Each is fully designed, partially implemented, and then stopped short of being activated.

1. JobLogger — fully designed, never deployed

The monitoring/ module defines a complete structured logging schema: 22 LogEvent types, 5 LogPhase values, typed sub-objects for context, metrics, and error detail, and a createLogger() factory that produces a JobLogger bound to a specific job's identity. This is the infrastructure this branch (feat/logging-monitoring) was created to build.

No batch job or cron job currently imports or calls createLogger(). GCP Cloud Logging receives flat JSON today. The monitoring module exists only on disk.

Impact

Without runId, isolating one job execution in GCP Log Explorer requires filtering by Cloud Run's execution_id label — which works, but isn't surfaced in the log entries themselves. Without event and phase fields, GCP alerting policies cannot distinguish a per-item failure from a job-fatal crash. Without errorCode, programmatic alert routing is impossible.

2. OpenTelemetry — wired but dormant

packages/llm-batch/tracing.ts wraps every Anthropic Batch API call in an OpenTelemetry span via a tracedFetch() helper. It adds gen_ai.* semantic attributes, sanitises URLs, and sets span status on HTTP errors. The implementation is complete and correct.

Activation requires two things that are not currently set in production: the OTEL_DENO=true environment variable (gates all tracing calls behind a no-op check), and an OTEL exporter endpoint (OTEL_EXPORTER_OTLP_ENDPOINT or equivalent). Without both, every tracedFetch() call is a plain fetch() — no spans are emitted anywhere.

What would activate it
Set OTEL_DENO=true + configure an OTEL collector endpoint in Cloud Run job env vars
What it would produce
Per-request spans for every Anthropic Batch API call, with token usage, latency, and HTTP status as span attributes
Where spans would go
Wherever the OTEL exporter is pointed — GCP Cloud Trace, Jaeger, Honeycomb, etc.
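
The two activation conditions can be expressed as a small preflight check, for example to log once at job start whether tracing is live. This is a sketch under assumptions: `tracingEnabled` is a hypothetical helper, not part of packages/llm-batch/tracing.ts, and it only checks the one endpoint variable named above rather than any equivalent.

```typescript
// Hypothetical preflight check: both conditions must hold for spans to be exported.
type Env = Record<string, string | undefined>;

function tracingEnabled(env: Env): boolean {
  const sdkOn = env["OTEL_DENO"] === "true";
  const endpoint = (env["OTEL_EXPORTER_OTLP_ENDPOINT"] ?? "").trim();
  return sdkOn && endpoint !== "";
}

// In a Deno job the map would come from Deno.env.toObject().
const productionToday: Env = {};
console.log(tracingEnabled(productionToday)); // false: neither variable is set
```

Taking the environment as a plain map keeps the check independent of the runtime and easy to unit test.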

3. Cloud Monitoring — no metrics layer configured

There is no Cloud Monitoring (formerly Stackdriver Monitoring) setup in the Terraform configuration. No custom dashboards, no uptime checks, no alerting policies, no log-based metrics. Job success and failure are recorded as log events but never elevated to time-series metrics.

This means there is no way to answer operational questions like "how many jobs failed this week?" or "what is the P95 duration of the portfolio insights batch?" without writing ad-hoc Log Explorer queries. There are no alerting policies that fire when a job fails — failure is visible in logs, but only if someone is watching.

iii.

Four subsystems produce no logs.

webhook.ts · portfolio_intraday.ts · watchlist/ · news/sentiment
Attention

The following modules handle real user data and real-time operations but emit nothing to stdout or stderr under any condition. When they fail silently, the only signal is a downstream symptom — stale portfolio data, missing webhook updates, absent sentiment scores — with no log trail to explain the gap.

Package-level breakdown

Package | Debug | Info | Warn | Error | Total | Status
apps/ (batch jobs) | 1 | 68 | 28 | 46 | 143 | Well logged
packages/snaptrade/ | 0 | 15 | 36 | 24 | 75 | Well logged
packages/llm-batch/ | 1 | 6 | 0 | 0 | 7 | Partial
packages/portfolio/ | 2 | 4 | 0 | 0 | 6 | Partial
packages/risk-analysis/ | 1 | 2 | 0 | 0 | 3 | Partial
packages/trading-days/ | 1 | 0 | 1 | 1 | 3 | Partial
packages/api/ | 0 | 2 | 0 | 0 | 2 | Minimal
packages/snaptrade/webhook.ts | - | - | - | - | 0 | Silent
packages/snaptrade/portfolio_intraday.ts | - | - | - | - | 0 | Silent
packages/watchlist/service.ts | - | - | - | - | 0 | Silent
jobs/news/sentiment/ | - | - | - | - | 0 | Silent

Risk surface

Webhook processing (webhook.ts) is the highest-risk silent area. Webhooks are the primary mechanism by which SnapTrade notifies Castello of account changes — a silent failure here means user portfolio state diverges from reality without any observable log trail. Intraday portfolio compute (portfolio_intraday.ts) is similarly high-frequency and user-visible.

Minimum viable instrumentation

For each silent module: one INFO at entry (event type, entity ID), one WARN on any non-fatal skip or data gap, one ERROR on any unrecoverable failure. That's three log sites minimum to make a module observable. Comprehensive tracing can follow incrementally.
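
The three-site minimum might look like this for a webhook handler. Everything here is hypothetical: the `WebhookEvent` shape, the `handleWebhook` function, the `apply` callback, and the "ACCOUNT_UPDATED" event name are illustrative stand-ins, since webhook.ts itself is not reproduced in this audit.

```typescript
// Hypothetical sketch: one INFO at entry, one WARN on skip, one ERROR on failure.
type Level = "info" | "warn" | "error";
type Log = (level: Level, message: string, data?: Record<string, unknown>) => void;

interface WebhookEvent {
  type: string;
  accountId?: string;
}

function handleWebhook(
  event: WebhookEvent,
  apply: (accountId: string) => void,
  log: Log,
): "applied" | "skipped" | "failed" {
  // Site 1: entry, with event type and entity ID.
  log("info", "Webhook received", { eventType: event.type, accountId: event.accountId });

  if (event.accountId === undefined) {
    // Site 2: non-fatal skip.
    log("warn", "Webhook skipped: no account id", { eventType: event.type });
    return "skipped";
  }

  try {
    apply(event.accountId);
    return "applied";
  } catch (err) {
    // Site 3: unrecoverable failure for this event.
    log("error", "Webhook processing failed", {
      eventType: event.type,
      accountId: event.accountId,
      message: err instanceof Error ? err.message : String(err),
    });
    return "failed";
  }
}

const printLog: Log = (level, message, data) =>
  console.log(JSON.stringify({ level, message, ...data }));
handleWebhook({ type: "ACCOUNT_UPDATED", accountId: "acct-1" }, () => {}, printLog);
```

Passing the logger in as a parameter is a design choice for the sketch; it lets the same handler move from the flat logger to a JobLogger-style logger without changing its body.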

iv.

Eleven products, three pillars.

logs · traces · metrics · current implementation status per product
Attention

Every product in _next has some logging. Almost none have traces or metrics. The two pillars that make logging actionable — knowing where time was spent and whether the system is healthy over time — are either no-ops in production or absent entirely.

Each product is listed with its status and detail for the three pillars: Logs, Traces, Metrics.

LLM batch jobs

batch-portfolio-insights · apps/batch-portfolio-insights
  • Logs · Partial. Flat JSON via core/logging. Good lifecycle coverage. No runId, no event types. JobLogger not connected.
  • Traces · No-op in prod. Anthropic calls traced indirectly via llm-batch/anthropic.ts → tracedFetch(). No job-level root span. OTel SDK not configured.
  • Metrics · No-op in prod. BatchMetricsTracker fires OTel counter + histogram on finalize. Token counts also persisted to DB. OTel instruments no-op without SDK.

batch-risk-analysis · apps/batch-risk-analysis
  • Logs · Partial. Same as portfolio-insights. Good coverage, flat schema, no JobLogger.
  • Traces · No-op in prod. Same indirect tracing via llm-batch/anthropic.ts. No root span.
  • Metrics · No-op in prod. Same BatchMetricsTracker pattern. Token counts to DB.

News pipeline

news-ingest · jobs/news/ingest
  • Logs · Partial. Progress every 500 tickers, per-ticker failures. Silent on individual successes. No JobLogger.
  • Traces · None. Finnhub API calls use raw fetch() with no tracing wrapper. Zero span coverage.
  • Metrics · None. Counts tracked as local variables and logged at job end. No OTel instruments.

news-sentiment-submit · jobs/news/sentiment/submit
  • Logs · Partial. Thin main.ts delegates to shared runner. Logging behaviour is inside the runner, so it is not directly auditable from main.
  • Traces · Likely no-op. Runner likely uses llm-batch for Anthropic submission. If so, indirect tracing applies. SDK still not configured.
  • Metrics · Likely no-op. Runner likely uses BatchMetricsTracker. OTel no-op without SDK.

news-sentiment-write · jobs/news/sentiment/write
  • Logs · Partial. Same delegation pattern as submit. Logging inside shared runner.
  • Traces · None. Write phase reads Anthropic results from DB, no outbound API calls. No spans.
  • Metrics · Likely no-op. Runner likely uses BatchMetricsTracker for write-phase token counts.

batch-poll · jobs/batch/poll
  • Logs · Partial. Thin main.ts delegates to runBatchPoll(). Per-poll cycle logging inside runner.
  • Traces · Likely no-op. Polls Anthropic batch status via llm-batch. Likely indirect tracing on status calls. SDK not configured.
  • Metrics · None. Poll phase does not accumulate result metrics; those belong to the write phase.

Cron jobs

snaptrade-sync-reconcile · apps/snaptrade-sync-reconcile
  • Logs · Partial. Best-logged cron job. Per-pass structured counts (recovered, retried, stuck, timed-out). Silent on successful state transitions.
  • Traces · None. SnapTrade API calls go through @castello/snaptrade package with raw fetch and retry logic. No tracing wrapper.
  • Metrics · None. Pass-level counts logged as structured fields. No OTel instruments anywhere in this job.

cache-market-movers · apps/cache-market-movers
  • Logs · Partial. Checkpoint-only. Start, fetch failures, no-rows abort, RPC call, final success. No per-mover detail.
  • Traces · None. Yahoo Finance fetched via raw fetch() with header signing. No tracing.
  • Metrics · None. No metrics of any kind.

fetch-openexchangerates-fx-rates · apps/fetch-openexchangerates-fx-rates
  • Logs · Partial. Checkpoint-only. Trading day check, fetch failure, invalid/overflow rates, success. Silent on per-rate processing.
  • Traces · None. OXR API called via raw fetch(). No tracing.
  • Metrics · None. No metrics of any kind.

update-recap-page-paths · apps/update-recap-page-paths
  • Logs · Partial. Per-ISIN failures and skips logged. Successful path upserts are silent. Summary count at end.
  • Traces · None. HEAD requests to recap server via safeFetch(). No span instrumentation.
  • Metrics · None. No metrics of any kind.

API layer

packages/api (oRPC) · handler · middleware · procedures
  • Logs · Partial. Procedure entry/exit + duration via logging middleware. No domain-level logging at the handler boundary. Error codes logged at WARN for unmapped errors.
  • Traces · None. No span instrumentation at the handler or middleware level. Domain packages called from procedures have their own logging but no traces.
  • Metrics · None. No request rate, error rate, or latency metrics. No OTel instruments anywhere in the API layer.

Cross-cutting shortcomings by pillar

Logs

Traces

Metrics

v.

The implementation plan.

6 steps · 3 infra-only · 3 code changes · GCP-native + OTel Collector
Roadmap

The plan activates all three observability pillars using only GCP services and the instrumentation already built on this branch. Steps 1–3 are pure infrastructure — no application code changes, no risk to running jobs. Steps 4–6 are code changes that deepen what the infrastructure can surface.

6 · Total steps
3 · Infra-only
3 · Code changes
0 · New vendors
1

Deploy the OTel Collector.

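Deploy a Cloud Run Service running the OpenTelemetry Collector Contrib image. Configure it with a single YAML: accept OTLP/HTTP on port 4318, export to GCP via the googlecloud exporter. Set ingress to internal-only so it is reachable from inside the project but not from the public internet. Internal ingress only admits traffic that arrives through the VPC, so the jobs also need VPC egress configured (Direct VPC egress or a Serverless VPC Access connector) to reach the collector.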

Create a dedicated otel-collector service account and grant it roles/cloudtrace.agent and roles/monitoring.metricWriter. The collector handles all GCP authentication — application jobs never touch credentials for telemetry again. Token refresh is automatic via ADC.

New: infra/collector.tf
New: infra/collector-config.yaml
Edit: infra/iam.tf — otel-collector SA + role bindings
Infra only · No app risk
2

Activate OTel on all jobs.

Add three environment variables to every Cloud Run job spec in Terraform. OTEL_DENO=true activates Deno's built-in OTel SDK. OTEL_EXPORTER_OTLP_ENDPOINT points to the collector's internal Cloud Run URL. OTEL_SERVICE_NAME labels spans by job for grouping in Cloud Trace.

This single Terraform apply gives you: every fetch() call in every job as a trace span — Finnhub, Yahoo Finance, OXR, SnapTrade, Anthropic — all automatically, with no code changes. The existing tracedFetch() and withSpan() calls in llm-batch activate on top. BatchMetricsTracker token and duration metrics start flowing to Cloud Monitoring.

Edit: infra/jobs/cloud-run.tf — batch-portfolio-insights
Edit: infra/jobs/batch-risk-analysis.tf
Edit: infra/jobs/news-ingest.tf
Edit: infra/jobs/news-sentiment-*.tf (3 files)
Edit: infra/jobs/snaptrade-sync-reconcile.tf
Edit: infra/jobs/cache-market-movers.tf
Edit: infra/jobs/fetch-openexchangerates-fx-rates.tf
Edit: infra/jobs/update-recap-page-paths.tf
Infra only · No app risk
3

Metrics, alerting, and a dashboard.

Cloud Run Jobs already emit free execution metrics to Cloud Monitoring (completed execution count, labelled by result, and execution latency), but nothing is watching them. Create a google_monitoring_dashboard that surfaces these alongside the custom OTel metrics arriving from step 2.

Create two alerting policies: one on run.googleapis.com/job/completed_execution_count filtered to failed executions (the outcome of the Deno.exit(1) in each job's fatal path), and one on log-severity ERROR rate per job using a google_logging_metric. This is the first time a job failure will produce a notification rather than silent log entries.

New: infra/monitoring.tf
  — google_logging_metric (error rate per job)
  — google_monitoring_alert_policy (exit code 1)
  — google_monitoring_alert_policy (error rate spike)
  — google_monitoring_dashboard
Infra only · No app risk
4

Connect the JobLogger.

Replace log() calls with createLogger() from @castello/monitoring in each job's main.ts. Every log entry gains runId (a UUID stable across one execution), event, phase, and errorCode — the fields the schema was designed around. Switch fatal catch blocks from log(LogLevel.ERROR) to logger.critical() before Deno.exit(1), activating the CRITICAL severity tier in GCP for the first time.

Once deployed, update the log-based metric filter in Terraform from severity>=ERROR to jsonPayload.event="JOB_FATAL" — a more precise signal that eliminates false positives from per-item errors.

Edit: apps/batch-portfolio-insights/main.ts
Edit: apps/batch-risk-analysis/main.ts
Edit: apps/snaptrade-sync-reconcile/main.ts
Edit: apps/cache-market-movers/main.ts
Edit: apps/fetch-openexchangerates-fx-rates/main.ts
Edit: apps/update-recap-page-paths/main.ts
Edit: jobs/news/ingest/src/main.ts
Edit: infra/monitoring.tf — refine alert filter
Code change
5

Link logs to traces.

Modify packages/core/logging.ts and monitoring/logger.ts to read the active OTel span context at the moment each log entry is written. If a span is active, inject two fields that GCP Cloud Logging recognises natively: logging.googleapis.com/trace (the full trace resource path) and logging.googleapis.com/spanId.

After this change, every log entry written inside a traced operation displays a "View in Trace" link in Cloud Logging's Log Explorer. You can jump from a specific WARN or ERROR log line directly to the Cloud Trace span that contains it — closing the gap between the two previously unlinked signals.
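
The injected fields can be sketched as a pure function over the active span's IDs. The two field names and the `projects/<project>/traces/<traceId>` path are the ones Cloud Logging recognises; the `traceFields` helper and the `SpanContextLike` type are hypothetical, and in the real loggers the IDs would come from the OTel API (for example the active span's `spanContext()`), which is not imported here so the sketch stays self-contained.

```typescript
// Hypothetical helper: builds the Cloud Logging correlation fields for one log entry.
interface SpanContextLike {
  traceId: string; // 32 hex characters
  spanId: string;  // 16 hex characters
}

function traceFields(
  projectId: string,
  span: SpanContextLike | undefined,
): Record<string, string> {
  // No active span: add nothing, so untraced code paths log exactly as before.
  if (span === undefined) return {};
  return {
    "logging.googleapis.com/trace": `projects/${projectId}/traces/${span.traceId}`,
    "logging.googleapis.com/spanId": span.spanId,
  };
}

const entry = {
  severity: "WARNING",
  message: "SnapTrade 429, retrying",
  ...traceFields("castello-backend", {
    traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
    spanId: "00f067aa0ba902b7",
  }),
};
console.log(JSON.stringify(entry));
```

Returning an empty object when no span is active means the same spread works in both loggers without a conditional at each call site.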

Edit: packages/core/logging.ts — inject span context
Edit: monitoring/logger.ts — inject span context
Edit: monitoring/schema.ts — add trace fields to LogEntry type
Code change
6

Add a root span per job.

Wrap each job's top-level execution in a withSpan() call. This creates a single parent span representing the full job run, under which all auto-instrumented fetch() spans and explicit withSpan() blocks become children. Without this, Cloud Trace shows individual spans but no trace that represents "this was one execution of batch-portfolio-insights."

Attach the runId from the JobLogger as a span attribute so the trace and the log entries for one execution share a common identifier — making cross-signal correlation possible without relying on timestamps or Cloud Run's execution_id label.
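A rough sketch of the root-span wrapper, under stated assumptions. The signature of the codebase's withSpan() helper is not shown in this audit, so the sketch is written against small structural types (SpanLike, TracerLike) intended to mirror the parts of the OpenTelemetry Span and Tracer API it uses. runWithRootSpan and the attribute name job.run_id are hypothetical.

```typescript
// Minimal structural types intended to mirror the parts of the OpenTelemetry
// API used here. In the real jobs these would come from @opentelemetry/api
// or from the codebase's own withSpan() helper.
interface SpanLike {
  setAttribute(key: string, value: string): unknown;
  recordException(err: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const SPAN_STATUS_ERROR = 2;

// Wraps one job execution in a single root span and tags it with the runId,
// so the trace and the log entries share an identifier.
// "job.run_id" is a hypothetical attribute name.
async function runWithRootSpan<T>(
  tracer: TracerLike,
  jobName: string,
  runId: string,
  body: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(jobName, async (span) => {
    span.setAttribute("job.run_id", runId);
    try {
      return await body();
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: SPAN_STATUS_ERROR, message: error.message });
      throw err; // let the job's fatal handler decide what to do next
    } finally {
      span.end(); // always close the root span, on success or failure
    }
  });
}
```

The span is ended in a finally block because an unended root span would leave the job's child spans without a completed parent.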

Edit: apps/batch-portfolio-insights/main.ts
Edit: apps/batch-risk-analysis/main.ts
Edit: apps/snaptrade-sync-reconcile/main.ts
Edit: apps/cache-market-movers/main.ts
Edit: apps/fetch-openexchangerates-fx-rates/main.ts
Edit: apps/update-recap-page-paths/main.ts
Edit: jobs/news/ingest/src/main.ts
Code change

What becomes available after each step

After steps 1–3 · infra only
Operational visibility
  • Every fetch() call in every job appears as a span in Cloud Trace
  • Token usage and batch duration flow into Cloud Monitoring
  • Job success/failure dashboard live
  • Alerts fire on fatal job crashes and error rate spikes
  • Zero application code changed
After step 4 · JobLogger
Structured signal
  • Log entries carry runId, event, phase, errorCode
  • Alert filter narrows from all ERRORs to JOB_FATAL — no more false positives
  • CRITICAL severity tier active — GCP can distinguish a job crash from a per-item failure
  • Log-based metrics become more precise
After steps 5–6 · correlation
Unified observability
  • Log entries link directly to their trace span in Cloud Logging UI
  • Each job execution is one traceable unit from start to finish
  • runId ties logs and traces together by a shared identifier
  • All three pillars unified — logs, traces, metrics cross-reference each other

Key principle

Steps 1–3 are independently shippable and carry no application risk. They give you most of the operational value — alerting, dashboards, trace coverage — without touching a single line of job code. Steps 4–6 deepen the signal but depend on steps 1–3 being in place first: there is no point injecting trace IDs into log entries before traces are flowing.

Implementation details

Step 2 — OTEL_SERVICE_NAME is required, not optional

OTEL_SERVICE_NAME is the label Cloud Trace uses to group and filter spans. Without it, every span from every job appears under unknown_service — all ten jobs' traces merged into one unnavigable list. It must be set per job in Terraform, matching the job name exactly.

Job · OTEL_SERVICE_NAME value
batch-portfolio-insights · batch-portfolio-insights
batch-risk-analysis · batch-risk-analysis
news-ingest · news-ingest
news-sentiment-submit · news-sentiment-submit
news-sentiment-write · news-sentiment-write
batch-poll · batch-poll
snaptrade-sync-reconcile · snaptrade-sync-reconcile
cache-market-movers · cache-market-movers
fetch-openexchangerates-fx-rates · fetch-openexchangerates-fx-rates
update-recap-page-paths · update-recap-page-paths
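Because a missing or mismatched value produces no error, a small check over the per-job environment can catch it before deploy. The sketch below is one way to express that check. The JobEnv shape and findServiceNameProblems are hypothetical; how the per-job env vars are collected (Terraform plan output, a staging describe call) is left open.

```typescript
// Hypothetical shape: one record per Cloud Run job with the env vars set on it.
interface JobEnv {
  jobName: string;
  env: Record<string, string | undefined>;
}

// Returns one human-readable problem per job whose OTEL_SERVICE_NAME is
// missing or does not match the job name exactly.
function findServiceNameProblems(jobs: JobEnv[]): string[] {
  const problems: string[] = [];
  for (const { jobName, env } of jobs) {
    const value = env["OTEL_SERVICE_NAME"];
    if (value === undefined || value === "") {
      problems.push(
        `${jobName}: OTEL_SERVICE_NAME is not set (spans would land under unknown_service)`,
      );
    } else if (value !== jobName) {
      problems.push(
        `${jobName}: OTEL_SERVICE_NAME is "${value}", expected "${jobName}"`,
      );
    }
  }
  return problems;
}
```

An empty result means every job carries a service name equal to its job name.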

Step 5 — trace field format must be the full resource path

GCP Cloud Logging only draws the link between a log entry and a Cloud Trace span if the logging.googleapis.com/trace field contains the full resource path — not just the trace ID hex string. The OTel API returns a 32-character hex string from spanContext().traceId. Step 5 must construct the full path before writing it to the log entry.

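import { isSpanContextValid, trace, TraceFlags } from "@opentelemetry/api";

// Project ID comes from the env var already set on all Cloud Run jobs.
const PROJECT_ID = Deno.env.get("GOOGLE_CLOUD_PROJECT_ID");

function getTraceContext(): Record<string, string | boolean> {
  const ctx = trace.getActiveSpan()?.spanContext();
  // No active span, an invalid (all-zero) context, or no project ID:
  // return nothing so the entry is written without trace fields.
  if (!ctx || !isSpanContextValid(ctx) || !PROJECT_ID) return {};

  return {
    // full path required; the hex string alone does not work
    "logging.googleapis.com/trace":
      `projects/${PROJECT_ID}/traces/${ctx.traceId}`,
    "logging.googleapis.com/spanId": ctx.spanId,
    "logging.googleapis.com/trace_sampled":
      (ctx.traceFlags & TraceFlags.SAMPLED) === TraceFlags.SAMPLED,
  };
}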

This function is called inside packages/core/logging.ts and monitoring/logger.ts at the point each entry is serialised. If no span is active (e.g. code running outside a withSpan() block), it returns an empty object and the log entry is written without trace fields — this is correct behaviour, not an error. PROJECT_ID is read from the GOOGLE_CLOUD_PROJECT_ID env var already present on all Cloud Run jobs.
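The serialisation step described above can be sketched as follows. This is an illustration, not the code in packages/core/logging.ts or monitoring/logger.ts: SpanContextLike, traceFields and serialiseEntry are hypothetical names, and the span context is passed in as a parameter so the example runs without the OTel package.

```typescript
// Hypothetical minimal span context. In the real code this would come from
// the active OTel span rather than being passed in.
interface SpanContextLike {
  traceId: string; // 32 hex characters
  spanId: string; // 16 hex characters
  traceFlags: number; // bit 0 set means "sampled"
}

// Builds the three special fields Cloud Logging recognises, or an empty
// object when there is no span context.
function traceFields(
  projectId: string,
  ctx?: SpanContextLike,
): Record<string, string | boolean> {
  if (!ctx) return {};
  return {
    "logging.googleapis.com/trace": `projects/${projectId}/traces/${ctx.traceId}`,
    "logging.googleapis.com/spanId": ctx.spanId,
    "logging.googleapis.com/trace_sampled": (ctx.traceFlags & 1) === 1,
  };
}

// Serialises one entry, merging the trace fields in at the top level of the
// JSON object, next to the entry's own fields.
function serialiseEntry(
  entry: Record<string, unknown>,
  projectId: string,
  ctx?: SpanContextLike,
): string {
  return JSON.stringify({ ...entry, ...traceFields(projectId, ctx) });
}
```

Spreading the trace fields last means they cannot be overwritten by a same-named key in the entry itself.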

Silent failure risk

Both of these details fail silently. If OTEL_SERVICE_NAME is omitted, Cloud Trace still receives spans — they just all appear under unknown_service with no indication anything is wrong. If the trace field is written as a bare hex string instead of the full path, Cloud Logging still accepts the log entry — it just never draws the link to Cloud Trace. Neither produces an error. Verify both in a staging run before treating the feature as working.
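For the staging check on the trace field, a format test over a sample of emitted log lines is one option. The sketch below assumes the entries have already been parsed from JSON and that trace IDs are lowercase hex, as the OTel API emits them; isLinkableTraceField and findUnlinkableEntries are hypothetical helpers.

```typescript
// A trace field is treated as linkable only in the form
// projects/<project-id>/traces/<32 lowercase hex characters>.
const TRACE_PATH = /^projects\/[^/]+\/traces\/[0-9a-f]{32}$/;

function isLinkableTraceField(value: unknown): boolean {
  return typeof value === "string" && TRACE_PATH.test(value);
}

// Returns the parsed log entries that carry a trace field in a form Cloud
// Logging would accept but never link. Entries without a trace field are
// skipped, since logging outside a span is expected.
function findUnlinkableEntries(
  entries: Array<Record<string, unknown>>,
): Array<Record<string, unknown>> {
  return entries.filter((entry) => {
    const value = entry["logging.googleapis.com/trace"];
    return value !== undefined && !isLinkableTraceField(value);
  });
}
```

A non-empty result from a staging run points at entries written with a bare hex string or another malformed path.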