Logging & Monitoring Audit

i.

Two systems, one codebase.

monitoring/logger.ts · monitoring/schema.ts · packages/core/logging.ts

Informational

Logging is split across two independent systems that serve different execution contexts. They share the same GCP destination and the same physical output method (console.log / console.error), but the entries they emit are structurally different: one is a typed, job-scoped schema and the other is a flat JSON object.

System A — JobLogger

Factory-based structured logger for batch and cron jobs. Created via createLogger(meta: JobMeta) in monitoring/logger.ts. Every entry carries full job identity threaded through: job name, job type, run UUID, environment, a typed LogEvent enum value, a LogPhase, optional entity context IDs, quantitative metrics, and structured error detail. The schema is defined exhaustively in monitoring/schema.ts (12 KB).

Interface

JobLogger — 5 methods: debug(), info(), warn(), error(), critical()

Identity fields

jobName · jobType · runId (UUID per execution) · environment

Classification

LogEvent (22 event types) · LogPhase (5 phases)

Sub-objects

context (entity IDs) · metrics (counts, durations, tokens) · error (code + message + stack)

Routing

DEBUG/INFO → console.log (stdout) · WARNING/ERROR/CRITICAL → console.error (stderr)
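
As a rough illustration of the factory pattern described above, the sketch below binds job identity once and stamps it onto every entry. It is not the code in monitoring/logger.ts: the field types, the injectable `Sink`, and the `emit` helper are assumptions made so the example is self-contained.

```typescript
// Hypothetical sketch: names mirror the audit's description, not the real monitoring/ module.
type Severity = "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";

interface JobMeta {
  jobName: string;
  jobType: "batch" | "cron";
  runId: string; // one UUID per execution
  environment: string;
}

interface LogFields {
  event?: string; // a LogEvent value such as "JOB_START"
  phase?: string; // a LogPhase value such as "INIT"
  context?: Record<string, string>;
  metrics?: Record<string, number>;
  error?: { code: string; message: string; stack?: string };
}

// Assumed seam for testing; receives the target stream and the serialized line.
type Sink = (stream: "stdout" | "stderr", line: string) => void;

const consoleSink: Sink = (stream, line) =>
  stream === "stdout" ? console.log(line) : console.error(line);

function createLogger(meta: JobMeta, sink: Sink = consoleSink) {
  const emit = (severity: Severity, message: string, fields: LogFields = {}) => {
    // DEBUG/INFO go to stdout; WARNING and above go to stderr.
    const stream = severity === "DEBUG" || severity === "INFO" ? "stdout" : "stderr";
    const entry = { severity, message, timestamp: new Date().toISOString(), ...meta, ...fields };
    sink(stream, JSON.stringify(entry));
  };
  return {
    debug: (m: string, f?: LogFields) => emit("DEBUG", m, f),
    info: (m: string, f?: LogFields) => emit("INFO", m, f),
    warn: (m: string, f?: LogFields) => emit("WARNING", m, f),
    error: (m: string, f?: LogFields) => emit("ERROR", m, f),
    critical: (m: string, f?: LogFields) => emit("CRITICAL", m, f),
  };
}
```

Because identity is captured in the closure, call sites only supply what changes per entry, for example `logger.info("Starting job", { event: "JOB_START", phase: "INIT" })`.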

System B — log(LogLevel)

A lightweight generic logger in packages/core/logging.ts for application-layer code — API middleware, service packages, library utilities. Output is a flat JSON object: severity, message, timestamp, and any additional fields spread from the data argument. No job identity, no event taxonomy, no structured sub-objects.

Signature

log(level, message, data?, context?)

Levels

debug · info · warn · error (4, no critical)

Output shape

Flat JSON — severity + message + timestamp + spread data

Routing

error/warn → stderr · info/debug → stdout
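
For contrast, a flat logger of this kind can be sketched in a few lines. This is an assumption-laden illustration, not the contents of packages/core/logging.ts: the `buildEntry` helper is hypothetical, and the fourth `context` argument from the signature above is left out because the audit does not describe what it carries.

```typescript
// Hypothetical sketch of a flat logger; the real packages/core/logging.ts may differ.
type LogLevel = "debug" | "info" | "warn" | "error";

const GCP_SEVERITY: Record<LogLevel, string> = {
  debug: "DEBUG",
  info: "INFO",
  warn: "WARNING",
  error: "ERROR",
};

// Builds the flat entry. Caller data is spread first so it cannot overwrite
// the reserved severity, message, and timestamp fields.
function buildEntry(
  level: LogLevel,
  message: string,
  data: Record<string, unknown> = {},
): Record<string, unknown> {
  return { ...data, severity: GCP_SEVERITY[level], message, timestamp: new Date().toISOString() };
}

function log(level: LogLevel, message: string, data?: Record<string, unknown>): void {
  const line = JSON.stringify(buildEntry(level, message, data));
  // error/warn go to stderr; info/debug go to stdout.
  if (level === "error" || level === "warn") console.error(line);
  else console.log(line);
}
```

Nothing here knows which job is running, which is why every identity field has to be passed by hand at each call site.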

Critical: JobLogger is not yet in production

The monitoring/ module — JobLogger, schema.ts, all 22 LogEvent types — exists on feat/logging-monitoring but has not been imported or called by any running job. Every batch and cron job in production today uses packages/core/logging.ts, the flat generic logger. The rich schema described above is the design target, not the current state.

The generic log() function is a zero-ceremony escape hatch for code that isn't job-scoped. The problem is that the batch jobs themselves still use it. SnapTrade packages also use the generic logger despite being primary consumers of job context.

JobName enum — 9 predefined jobs

Constant	Job type	Description
BATCH_PORTFOLIO_INSIGHTS	batch	LLM-powered portfolio analysis via Anthropic Batch API
BATCH_RISK_ANALYSIS	batch	LLM-powered risk analysis via Anthropic Batch API
CACHE_MARKET_MOVERS	cron	Refresh market movers cache
DAILY_SNAPTRADE_SNAPSHOT	cron	Daily SnapTrade portfolio snapshot
FETCH_OXR_FX_RATES	cron	Sync OpenExchangeRates FX data
PORTFOLIO_METRICS	cron	Compute portfolio performance metrics
SNAPTRADE_SYNC_RECONCILE	cron	Reconcile SnapTrade sync state
SYNC_BROKERAGE_INFO	cron	Sync brokerage metadata from SnapTrade
UPDATE_RECAP_PAGE_PATHS	cron	Refresh recap page path index

a.

Level taxonomy and distribution.

358 total call sites across 49 files · INFO + WARN dominate

Informational

Five levels are defined across the two systems. Four are in active use. The distribution is heavily weighted toward INFO and WARN — diagnostic depth at the DEBUG tier is nearly absent, and the highest severity tier (CRITICAL) is dead code.

Level	Call sites	Share
INFO	132	36.9%
WARN	116	32.4%
ERROR	104	29.1%
DEBUG	6	1.7%
CRITICAL	0	0%
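
The shares follow directly from the call-site counts. A small check, using only the numbers reported in this section:

```typescript
// Call-site counts per level, as reported in this section.
const counts: Record<string, number> = { INFO: 132, WARN: 116, ERROR: 104, DEBUG: 6, CRITICAL: 0 };
const total = Object.keys(counts).reduce((sum, level) => sum + counts[level], 0);

// Percentage share of one level, rounded to one decimal place.
function share(level: string): number {
  return Math.round((counts[level] / total) * 1000) / 10;
}

console.log(total, share("INFO"), share("WARN"), share("ERROR"), share("DEBUG"));
// prints: 358 36.9 32.4 29.1 1.7
```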

INFO — phase markers and operation summaries

INFO is used for high-level lifecycle events: job start/end, phase transitions, batch submission and completion, account sync milestones. It always includes entity IDs and summary metrics. It never fires per-item in tight loops — that's left to DEBUG (or not logged at all).

WARN — recoverable degradation

WARN is the second-busiest level and covers three distinct patterns: external system degradation (SnapTrade 429s, network retries), concurrent state machine race conditions in the sync pipeline, and data integrity gaps where processing continues with partial data (missing FX rates, unknown tickers, null brokerage IDs). Every WARN means the operation continued.

ERROR — terminal failures

ERROR marks operations that stopped. It is the signal for incidents and alerting. Always includes machine-readable error codes, entity IDs for correlation, and consequence language where data loss is possible. No ERROR should be informational — if the operation recovered, it was a WARN.

DEBUG — success signals on high-frequency paths

The six DEBUG calls cover narrow cases: market holiday cache refreshes, batch polling status, per-portfolio save confirmations, bulk-save outcomes, empty retry sets, and skipped-item notices. These are correct targets for DEBUG. The gap is that almost no high-frequency path produces any diagnostic signal at all.

Level-to-GCP severity mapping

Logger level	GCP severity	Stream	Alert eligible
DEBUG	DEBUG	stdout	No
INFO / info	INFO	stdout	No
WARN / warn	WARNING	stderr	Optional
ERROR / error	ERROR	stderr	Yes
CRITICAL	CRITICAL	stderr	Yes — immediate
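
The mapping above can be read as a lookup table. The sketch below encodes it as one; the `Route` shape and the `routeFor` name are illustrative assumptions, not identifiers from either logger.

```typescript
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR" | "CRITICAL";

interface Route {
  gcpSeverity: string;
  stream: "stdout" | "stderr";
  alert: "no" | "optional" | "yes" | "immediate";
}

// One row per line of the mapping table.
const ROUTES: Record<Level, Route> = {
  DEBUG: { gcpSeverity: "DEBUG", stream: "stdout", alert: "no" },
  INFO: { gcpSeverity: "INFO", stream: "stdout", alert: "no" },
  WARN: { gcpSeverity: "WARNING", stream: "stderr", alert: "optional" },
  ERROR: { gcpSeverity: "ERROR", stream: "stderr", alert: "yes" },
  CRITICAL: { gcpSeverity: "CRITICAL", stream: "stderr", alert: "immediate" },
};

// Accepts either logger's spelling ("WARN" from JobLogger, "warn" from log()).
function routeFor(level: string): Route {
  const route = ROUTES[level.toUpperCase() as Level];
  if (!route) throw new Error(`Unknown log level: ${level}`);
  return route;
}
```

Keeping the table in one place means a new level only needs one new row, rather than a change in each logger.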

WARN breakdown by pattern

Pattern	Count	Example
External system retries (rate limits, network errors)	~28	SnapTrade 429, upstream 5xx, network timeout
Concurrent state race conditions	~18	Sync state machine: `complete UPDATE no-op`, `row disappeared`
Data integrity gaps (partial data, missing FX)	~32	Missing FX rate, null `brokerage_id`, skipped SnapTrade activity
Fallback mechanisms engaged	~14	Bulk save → per-item fallback, `brokerage_id` left NULL
Graceful shutdown signals	~8	SIGTERM received, forced shutdown
Unknown / unrecognised data	~16	Unknown `custom_id` in batch result, unsupported ticker

b.

CRITICAL is dead code.

monitoring/schema.ts:Severity · monitoring/logger.ts:JobLogger · 0 call sites

Gap

The Severity.CRITICAL level and the logger.critical() method are fully implemented. The schema documents their intended contract precisely: reserved for log entries emitted immediately before exit(1), mapping to GCP's highest alert tier. But the level has zero call sites across the entire codebase.

The actual pattern for fatal exits in every batch job is to log at ERROR and then call Deno.exit(1) separately. The two signals are disconnected: GCP sees an ERROR entry and then a container exit, but never a CRITICAL entry that would let an alerting policy fire specifically on "job crashed, not just errored."

Current exit pattern (batch-portfolio-insights)

// What actually happens today
catch (err) {
  log(LogLevel.ERROR, "Batch job crashed", { message: err.message });
  Deno.exit(1);
}

// What the schema intended
catch (err) {
  logger.critical("Batch job crashed", {
    event: LogEvent.JOB_FATAL,
    phase: LogPhase.TEARDOWN,
    error: toErrorDetail("JOB_CRASHED", err),
  });
  Deno.exit(1);
}

Why this matters

GCP Cloud Logging alerting policies distinguish severity tiers. An ERROR policy fires on all errors, including non-fatal ones; a CRITICAL policy fires only on the most severe failures. Without CRITICAL call sites, a severity-based "job crashed" alert cannot be separated from recoverable per-item errors. The only alternative is to filter on jsonPayload.event="JOB_FATAL" rather than on severity, and that field is itself absent until JobLogger is connected.

Scope

The fix is mechanical: replace the terminal log(LogLevel.ERROR) + Deno.exit(1) pair in each job's top-level catch block with logger.critical() using LogEvent.JOB_FATAL. Eight jobs are affected. No schema changes required — the infrastructure is already there.
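
One way to keep the change mechanical across all affected jobs is to move the catch block into a shared wrapper. The sketch below is an assumption, not existing code: `runJob`, `FatalLogger`, and this version of `toErrorDetail` are hypothetical stand-ins, and the exit function is injected so the wrapper can be tested without terminating the process.

```typescript
// Hypothetical minimal logger surface; the real JobLogger has more methods and typed fields.
interface FatalLogger {
  critical(message: string, fields: Record<string, unknown>): void;
}

// Hypothetical stand-in for the toErrorDetail() helper named in the audit.
function toErrorDetail(code: string, err: unknown) {
  const e = err instanceof Error ? err : new Error(String(err));
  return { code, message: e.message, stack: e.stack };
}

// Runs a job's main function and turns any uncaught failure into
// exactly one CRITICAL entry followed by exit(1).
async function runJob(
  logger: FatalLogger,
  main: () => Promise<void>,
  exit: (code: number) => void, // assumed to be Deno.exit in a real job
): Promise<void> {
  try {
    await main();
  } catch (err) {
    logger.critical("Batch job crashed", {
      event: "JOB_FATAL",
      phase: "TEARDOWN",
      error: toErrorDetail("JOB_CRASHED", err),
    });
    exit(1);
  }
}
```

With a wrapper like this, each job's entry point reduces to a single `runJob(logger, main, Deno.exit)` call, and the CRITICAL entry cannot be forgotten in one job while present in another.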

c.

DEBUG covers 1.7% of calls.

6 total debug sites · no tracing in API, SnapTrade client, or DB operations

Attention

The six existing DEBUG calls are correctly placed — they cover success signals on high-cardinality operations that don't warrant INFO noise. The problem is scope: whole subsystems that would benefit from per-operation tracing produce no diagnostic output between the WARN/ERROR conditions they do log.

Existing DEBUG call sites

File	Message	Data
trading_day_service.ts:163	Refreshed market holiday cache	exchange · holidayCount
llm-batch/anthropic.ts	Batch still processing	batchId · status
portfolio/service.ts	Saved portfolio insights	userId · accountId · targetDate
portfolio/service.ts	Bulk-saved portfolio insights	portfolios · rows
llm-batch/retry.ts	No retryable failed requests	batchJobId
update-recap-page-paths/main.ts	Page not found, skipping	isin · status

Where DEBUG is absent but needed

SnapTrade API client (packages/snaptrade/client.ts): Outbound requests are invisible unless they fail. A per-request DEBUG entry (method, URL, status, latency) would make rate limit patterns and latency regressions visible without adding INFO noise.
API middleware (packages/api/middleware/logging.ts): Only logs procedure entry/exit. Sub-operations within a procedure (DB queries, cache hits, external calls) are silent, so there is no way to trace where time is spent inside a slow oRPC call.
Database operations: No query-level DEBUG anywhere. Failed queries are logged at WARN/ERROR, but slow or unexpectedly large queries are invisible.
Webhook processing (packages/snaptrade/webhook.ts): Zero log output on success paths. A received-and-processed confirmation at DEBUG would make the webhook pipeline observable without polluting INFO.
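
For the SnapTrade client case, the per-request DEBUG entry could be added with a thin wrapper around whatever function performs the request. This is a sketch under assumptions: `withRequestLogging`, the `DebugLog` callback, and the injected clock are hypothetical, and the real client's request function may have a different shape.

```typescript
type DebugLog = (message: string, data: Record<string, unknown>) => void;

interface RequestOptions {
  method?: string;
}

// Wraps a fetch-like function so every completed request emits one DEBUG entry
// with method, URL, status, and latency. Thrown errors pass through unlogged
// here, on the assumption that existing WARN/ERROR paths already cover them.
function withRequestLogging<R extends { status: number }>(
  fetchFn: (url: string, init?: RequestOptions) => Promise<R>,
  debug: DebugLog,
  now: () => number = Date.now,
): (url: string, init?: RequestOptions) => Promise<R> {
  return async (url, init) => {
    const startedAt = now();
    const response = await fetchFn(url, init);
    debug("Outbound request completed", {
      method: init?.method ?? "GET",
      url,
      status: response.status,
      latencyMs: now() - startedAt,
    });
    return response;
  };
}
```

The wrapper logs 429s and slow responses the same way as fast 200s, which is what makes rate-limit and latency patterns visible before they escalate to WARN.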

Note on filtering

DEBUG entries are routed to stdout and excluded from GCP alerting by severity filter. Adding DEBUG logging to high-frequency paths (per-request, per-query) carries no alerting cost — the risk is log volume in production. A LOG_LEVEL env var or DEBUG=true flag should gate these calls before they are added.
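
A level gate of the kind suggested here can be very small. The sketch below assumes a LOG_LEVEL-style string and a default of "info"; both the helper names and the default are illustrative choices, not existing configuration.

```typescript
type Level = "debug" | "info" | "warn" | "error";

// Levels in ascending order of severity.
const ORDER: readonly Level[] = ["debug", "info", "warn", "error"];

// Parses a LOG_LEVEL-style string; unknown or missing values fall back to the default.
function parseLevel(raw: string | undefined, fallback: Level = "info"): Level {
  const value = (raw ?? "").toLowerCase();
  const match = ORDER.find((level) => level === value);
  return match ?? fallback;
}

// True when an entry at `level` should be emitted given the configured minimum.
function isEnabled(level: Level, minimum: Level): boolean {
  return ORDER.indexOf(level) >= ORDER.indexOf(minimum);
}
```

In a job, the minimum would be read once at startup (for example from `Deno.env.get("LOG_LEVEL")`) and checked before each DEBUG call, so production stays quiet unless the variable is set.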

d.

Output format and GCP routing.

JSON structured · stdout/stderr · Cloud Logging jsonPayload

Sound

Both logging systems emit newline-delimited JSON to standard streams. GCP Cloud Run captures these streams and forwards them to Cloud Logging, where the severity, message, and timestamp fields are promoted to first-class indexed columns. Everything else lands in jsonPayload.

JobLogger entry shape

{
  "severity":    "INFO",
  "message":     "Starting portfolio insights batch job",
  "timestamp":   "2026-05-25T14:32:18.123Z",
  "jobName":     "batch-portfolio-insights",
  "jobType":     "batch",
  "environment": "production",
  "runId":       "550e8400-e29b-41d4-a716-446655440000",
  "event":       "JOB_START",
  "phase":       "INIT",
  "context": {
    "targetDate": "2026-05-23"
  },
  "metrics": {
    "total": 450,
    "durationMs": 5432
  }
}

log(LogLevel) entry shape

{
  "severity":     "INFO",
  "message":      "-> user.deleteAccount",
  "functionName": "oRPC_procedure",
  "timestamp":    "2026-05-25T14:32:18.123Z"
}

GCP filter examples

Top-level fields in the JobLogger output are indexed and filterable directly in GCP Log Explorer. The event and errorCode fields were specifically designed for this — their values were chosen to be useful as alert conditions, not just human-readable labels.

-- Only job-fatal events
jsonPayload.event="JOB_FATAL"

-- All errors from one job
severity>=ERROR AND jsonPayload.jobName="batch-portfolio-insights"

-- Trace all logs for one run
jsonPayload.runId="550e8400-e29b-41d4-a716-446655440000"

-- Follow a specific user across all jobs
jsonPayload.context.userId="<uuid>"

What's working well

All entries are valid JSON — no mixed text/JSON lines, no unstructured console.log calls.
The runId UUID provides a complete per-execution trace — all logs for one job run can be isolated with a single filter.
The event enum vocabulary is well-designed and GCP-filter-ready.
The errorCode field is a flat top-level field specifically for fast alert condition matching.
Stderr/stdout routing respects GCP's severity inference — entries at WARNING and above are flagged automatically.

LogEvent taxonomy — 22 defined event types

Category	Events
Job lifecycle	JOB_START · JOB_SUCCESS · JOB_PARTIAL_FAILURE · JOB_FATAL
Step-level	STEP_START · STEP_SUCCESS · STEP_FAILURE
Item-level	ITEM_PROCESSED · ITEM_FAILED · ITEM_SKIPPED
External systems	EXTERNAL_API_REQUEST · EXTERNAL_API_ERROR · DB_OPERATION · DB_ERROR
LLM batch	LLM_BATCH_SUBMITTED · LLM_BATCH_COMPLETE · LLM_BATCH_ERROR
Process signals	SHUTDOWN_SIGNAL

e.

Where logs actually land.

GCP Cloud Logging · Supabase Dashboard · Sentry · system_logs · nowhere

Informational

All ten Cloud Run jobs and their orchestration layers log to GCP Cloud Logging. They arrive under different resource.type labels — so they appear separated in Log Explorer — but they are all in the same logging system. Cloud Scheduler and Cloud Workflows are not separate log destinations; they are separate resource types within Cloud Logging.

Full product × destination matrix

Product	Trigger	Runtime	Destination	Where to look
Batch & cron jobs (Cloud Run)
batch-portfolio-insights	LLM batch · weekdays 2am	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.type="cloud_run_job"
batch-risk-analysis	LLM batch · weekdays	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.type="cloud_run_job"
news-ingest	cron · weekdays 6am	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.labels.job_name="news-ingest"
news-sentiment-submit	workflow-triggered	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.labels.job_name="news-sentiment-submit"
batch-poll	workflow loop · every 5min	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.labels.job_name="batch-poll"
news-sentiment-write	workflow-triggered	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.labels.job_name="news-sentiment-write"
snaptrade-sync-reconcile	cron · every 15min	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.labels.job_name="snaptrade-sync-reconcile"
cache-market-movers	cron · weekdays 9pm	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.labels.job_name="cache-market-movers"
fetch-openexchangerates-fx-rates	cron · daily 2am	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.labels.job_name="fetch-openexchangerates-fx-rates"
update-recap-page-paths	cron · daily 2am	Cloud Run Job	GCP Cloud Logging	Logging → Log Explorer · resource.labels.job_name="update-recap-page-paths"
Workflow orchestration
news-sentiment workflow	Cloud Workflows YAML	Cloud Workflows	GCP Cloud Logging	resource.type="workflows.googleapis.com/Workflow" · plus the Executions tab for step-state UI
Cloud Scheduler jobs	all job triggers	Cloud Scheduler	GCP Cloud Logging	resource.type="cloud_scheduler_job" · the View Logs button is a shortcut to this filter

f.

GCP Cloud Logging: what's queryable.

resource.type="cloud_run_job" · project: castello-backend · region: us-central1

Informational

All ten Cloud Run jobs write structured JSON to stdout and stderr. Cloud Run captures these streams automatically and forwards them to GCP Cloud Logging — no SDK or explicit client required. The GCP project is castello-backend, region us-central1, and all images are stored in Artifact Registry under castello-batch.

What the logs look like today

Because the monitoring module's JobLogger is not yet connected, every job currently emits flat JSON via packages/core/logging.ts, not the rich schema defined in monitoring/schema.ts. There are no jobName, runId, event, or phase fields in production logs today. The fields available for filtering are limited to those spread from each log call's data argument, which vary by call site.

// What GCP receives today (flat, core/logging.ts)
{
  "severity": "INFO",
  "message":  "Fetched X brokerages from SnapTrade",
  "timestamp":"2026-05-25T14:32:18.123Z",
  "count":    42
}

// What GCP will receive after monitoring module is connected
{
  "severity":  "INFO",
  "message":   "Fetched X brokerages from SnapTrade",
  "timestamp": "2026-05-25T14:32:18.123Z",
  "jobName":   "sync-brokerage-info",
  "runId":     "550e8400-e29b-41d4-a716-446655440000",
  "event":     "STEP_SUCCESS",
  "phase":     "PREFETCH",
  "metrics":   { "total": 42, "durationMs": 812 }
}

How to navigate to job logs in GCP Console

Log Explorer URL

GCP Console → Logging → Log Explorer (project: castello-backend)

All job logs

resource.type="cloud_run_job"

Specific job

resource.type="cloud_run_job" AND resource.labels.job_name="portfolio-insights"

Errors only

resource.type="cloud_run_job" AND severity>=ERROR

One execution

resource.labels.execution_id="portfolio-insights-abc12"

Cloud Workflows logs

Cloud Workflows writes to Cloud Logging under resource.type="workflows.googleapis.com/Workflow". By default only errors are logged. Setting call_log_level = "LOG_ALL_CALLS" on the Terraform resource enables step-level entries — which steps executed, what arguments they passed to each Cloud Run job, and what the response was. Custom messages can also be written from within the workflow YAML using sys.log(), which land in the same stream.

Log filter

resource.type="workflows.googleapis.com/Workflow" AND resource.labels.workflow_id="news-sentiment"

Enable detail

Set call_log_level = "LOG_ALL_CALLS" in infra/jobs/news-sentiment-workflow.tf

Executions tab

The Workflows console Executions tab is a separate step-state UI — not a log store. It shows the execution graph, inputs, outputs, and current step. Useful alongside Cloud Logging, not instead of it.

Polling loop

batch-poll runs up to 288 times (24h max, every 5min) — each invocation is a separate Cloud Run Job execution with its own log stream under resource.type="cloud_run_job"

Cloud Scheduler logs

Cloud Scheduler writes to Cloud Logging under resource.type="cloud_scheduler_job" on every invocation — the time it fired, the HTTP target it called, and the response status code it received. The "View Logs" button in the Cloud Scheduler console is a shortcut link that opens Cloud Logging with this filter pre-applied. It is not a separate log store.

Log filter

resource.type="cloud_scheduler_job" AND resource.labels.job_name="portfolio-insights-daily"

What you see

Did the scheduler fire? Did Cloud Run accept the request? — not what the job did internally. Follow the Cloud Run execution ID to find the job's own logs.

What GCP Cloud Logging does well here

Zero-config ingestion — stdout is all that's needed; no SDK, no sidecar.
Each Cloud Run execution gets a distinct execution_id label, so you can isolate one run's logs with a single filter even before runId is added.
Severity routing is correct — WARN/ERROR already land on stderr and are flagged at the right GCP tier.
30-day retention by default; configurable log sinks to BigQuery for longer-term analysis.

g.

Three systems producing no signal.

JobLogger unconnected · OpenTelemetry no-op · Cloud Monitoring absent

Gap

Beyond the silent code paths identified in §iii, three pieces of observability infrastructure exist in the codebase but produce no output in production. Each is fully designed, partially implemented, and then stopped short of being activated.

1. JobLogger — fully designed, never deployed

The monitoring/ module defines a complete structured logging schema: 22 LogEvent types, 5 LogPhase values, typed sub-objects for context, metrics, and error detail, and a createLogger() factory that produces a JobLogger bound to a specific job's identity. This is the infrastructure this branch (feat/logging-monitoring) was created to build.

No batch job or cron job currently imports or calls createLogger(). GCP Cloud Logging receives flat JSON today. The monitoring module exists only on disk.

Impact

Without runId, isolating one job execution in GCP Log Explorer requires filtering by Cloud Run's execution_id label — which works, but isn't surfaced in the log entries themselves. Without event and phase fields, GCP alerting policies cannot distinguish a per-item failure from a job-fatal crash. Without errorCode, programmatic alert routing is impossible.

2. OpenTelemetry — wired but dormant

packages/llm-batch/tracing.ts wraps every Anthropic Batch API call in an OpenTelemetry span via a tracedFetch() helper. It adds gen_ai.* semantic attributes, sanitises URLs, and sets span status on HTTP errors. The implementation is complete and correct.

Activation requires two things that are not currently set in production: the OTEL_DENO=true environment variable (gates all tracing calls behind a no-op check), and an OTEL exporter endpoint (OTEL_EXPORTER_OTLP_ENDPOINT or equivalent). Without both, every tracedFetch() call is a plain fetch() — no spans are emitted anywhere.

What would activate it

Set OTEL_DENO=true + configure an OTEL collector endpoint in Cloud Run job env vars

What it would produce

Per-request spans for every Anthropic Batch API call, with token usage, latency, and HTTP status as span attributes

Where spans would go

Wherever the OTEL exporter is pointed — GCP Cloud Trace, Jaeger, Honeycomb, etc.
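
The gating behaviour described above, where a disabled flag reduces the traced call to a plain call, can be sketched without any OTel dependency. Everything here is hypothetical: `SpanLike`, `StartSpan`, and `makeTraced` are stand-ins for illustration and do not reflect the actual contents of packages/llm-batch/tracing.ts or the OpenTelemetry API.

```typescript
// Hypothetical minimal span surface, standing in for a real tracer's span.
interface SpanLike {
  setAttribute(key: string, value: string | number): void;
  setError(message: string): void;
  end(): void;
}

type StartSpan = (name: string) => SpanLike;

// Returns a wrapper that traces a call only when `enabled` is true.
// When disabled, the call runs untouched and no span is created.
function makeTraced(enabled: boolean, startSpan: StartSpan) {
  return async function traced<T extends { status: number }>(
    name: string,
    call: () => Promise<T>,
  ): Promise<T> {
    if (!enabled) return call();
    const span = startSpan(name);
    try {
      const response = await call();
      span.setAttribute("http.response.status_code", response.status);
      if (response.status >= 400) span.setError(`HTTP ${response.status}`);
      return response;
    } catch (err) {
      span.setError(String(err));
      throw err;
    } finally {
      span.end(); // always close the span, success or failure
    }
  };
}
```

The disabled branch is why the instrumentation is safe to ship before any collector exists: it costs one boolean check per call.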

3. Cloud Monitoring — no metrics layer configured

There is no Cloud Monitoring (formerly Stackdriver Monitoring) setup in the Terraform configuration. No custom dashboards, no uptime checks, no alerting policies, no log-based metrics. Job success and failure are recorded as log events but never elevated to time-series metrics.

This means there is no way to answer operational questions like "how many jobs failed this week?" or "what is the P95 duration of the portfolio insights batch?" without writing ad-hoc Log Explorer queries. There are no alerting policies that fire when a job fails — failure is visible in logs, but only if someone is watching.

No alerting on job failure: A batch-portfolio-insights crash produces an ERROR log in Cloud Logging, but no notification fires. Detection relies on downstream effects (missing insights) or someone manually checking logs.
No duration tracking: Job execution time is not tracked as a metric. Duration regressions, such as a job that used to take 20 minutes now taking 90, are invisible unless someone compares log timestamps manually.
No success rate dashboard: There is no view of job success/failure rates over time. Individual runs are auditable but aggregate trends are not surfaced anywhere.

iii.

Four subsystems produce no logs.

webhook.ts · portfolio_intraday.ts · watchlist/ · news/sentiment

Attention

The following modules handle real user data and real-time operations but emit nothing to stdout or stderr under any condition. When they fail silently, the only signal is a downstream symptom — stale portfolio data, missing webhook updates, absent sentiment scores — with no log trail to explain the gap.

Package-level breakdown

Package	Debug	Info	Warn	Error	Total	Status
apps/ (batch jobs)	1	68	28	46	143	Well logged
packages/snaptrade/	0	15	36	24	75	Well logged
packages/llm-batch/	1	6	0	0	7	Partial
packages/portfolio/	2	4	0	0	6	Partial
packages/risk-analysis/	1	2	0	0	3	Partial
packages/trading-days/	1	0	1	1	3	Partial
packages/api/	0	2	0	0	2	Minimal
packages/snaptrade/webhook.ts	—	—	—	—	0	Silent
packages/snaptrade/portfolio_intraday.ts	—	—	—	—	0	Silent
packages/watchlist/service.ts	—	—	—	—	0	Silent
jobs/news/sentiment/	—	—	—	—	0	Silent

Risk surface

Webhook processing (webhook.ts) is the highest-risk silent area. Webhooks are the primary mechanism by which SnapTrade notifies Castello of account changes — a silent failure here means user portfolio state diverges from reality without any observable log trail. Intraday portfolio compute (portfolio_intraday.ts) is similarly high-frequency and user-visible.

Minimum viable instrumentation

For each silent module: one INFO at entry (event type, entity ID), one WARN on any non-fatal skip or data gap, one ERROR on any unrecoverable failure. That's three log sites minimum to make a module observable. Comprehensive tracing can follow incrementally.
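
Applied to a webhook handler, the three-site minimum might look like the sketch below. The event shape, the `apply` callback, and the `Log` signature are assumptions for illustration; the real webhook.ts may be structured quite differently.

```typescript
type Log = (
  level: "info" | "warn" | "error",
  message: string,
  data?: Record<string, unknown>,
) => void;

// Hypothetical event shape; real SnapTrade webhook payloads carry more fields.
interface WebhookEvent {
  type: string;
  accountId?: string;
}

function handleWebhook(
  event: WebhookEvent,
  apply: (accountId: string) => void,
  log: Log,
): "processed" | "skipped" | "failed" {
  // Site 1: INFO at entry, with event type and entity ID.
  log("info", "Webhook received", { eventType: event.type, accountId: event.accountId });

  if (!event.accountId) {
    // Site 2: WARN on a non-fatal skip or data gap.
    log("warn", "Webhook skipped: missing accountId", { eventType: event.type });
    return "skipped";
  }

  try {
    apply(event.accountId);
    return "processed";
  } catch (err) {
    // Site 3: ERROR on an unrecoverable failure.
    log("error", "Webhook processing failed", {
      eventType: event.type,
      accountId: event.accountId,
      message: err instanceof Error ? err.message : String(err),
    });
    return "failed";
  }
}
```

Even this minimal version answers the two questions that are unanswerable today: did the webhook arrive, and what happened to it.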

iv.

Eleven products, three pillars.

logs · traces · metrics · current implementation status per product

Attention

Every product in _next has some logging. Almost none have traces or metrics. The two pillars that make logging actionable — knowing where time was spent and whether the system is healthy over time — are either no-ops in production or absent entirely.

Product	Logs	Traces	Metrics
LLM batch jobs
batch-portfolio-insights (apps/batch-portfolio-insights)	Partial: Flat JSON via `core/logging`. Good lifecycle coverage. No `runId`, no event types. JobLogger not connected.	No-op in prod: Anthropic calls traced indirectly via `llm-batch/anthropic.ts → tracedFetch()`. No job-level root span. OTel SDK not configured.	No-op in prod: `BatchMetricsTracker` fires OTel counter + histogram on finalize. Token counts also persisted to DB. OTel instruments no-op without SDK.
batch-risk-analysis (apps/batch-risk-analysis)	Partial: Same as portfolio-insights. Good coverage, flat schema, no JobLogger.	No-op in prod: Same indirect tracing via `llm-batch/anthropic.ts`. No root span.	No-op in prod: Same `BatchMetricsTracker` pattern. Token counts to DB.
News pipeline
news-ingest (jobs/news/ingest)	Partial: Progress every 500 tickers, per-ticker failures. Silent on individual successes. No JobLogger.	None: Finnhub API calls use raw `fetch()` with no tracing wrapper. Zero span coverage.	None: Counts tracked as local variables and logged at job end. No OTel instruments.
news-sentiment-submit (jobs/news/sentiment/submit)	Partial: Thin main.ts delegates to shared runner. Logging behaviour is inside the runner and not directly auditable from main.	Likely no-op: Runner likely uses `llm-batch` for Anthropic submission. If so, indirect tracing applies. SDK still not configured.	Likely no-op: Runner likely uses `BatchMetricsTracker`. OTel no-op without SDK.
news-sentiment-write (jobs/news/sentiment/write)	Partial: Same delegation pattern as submit. Logging inside shared runner.	None: Write phase reads Anthropic results from DB, no outbound API calls. No spans.	Likely no-op: Runner likely uses `BatchMetricsTracker` for write-phase token counts.
batch-poll (jobs/batch/poll)	Partial: Thin main.ts delegates to `runBatchPoll()`. Per-poll cycle logging inside runner.	Likely no-op: Polls Anthropic batch status via `llm-batch`. Likely indirect tracing on status calls. SDK not configured.	None: Poll phase does not accumulate result metrics; those belong to the write phase.
Cron jobs
snaptrade-sync-reconcile (apps/snaptrade-sync-reconcile)	Partial: Best-logged cron job. Per-pass structured counts (recovered, retried, stuck, timed-out). Silent on successful state transitions.	None: SnapTrade API calls go through `@castello/snaptrade` package with raw fetch and retry logic. No tracing wrapper.	None: Pass-level counts logged as structured fields. No OTel instruments anywhere in this job.
cache-market-movers (apps/cache-market-movers)	Partial: Checkpoint-only. Start, fetch failures, no-rows abort, RPC call, final success. No per-mover detail.	None: Yahoo Finance fetched via raw `fetch()` with header signing. No tracing.	None: No metrics of any kind.
fetch-openexchangerates-fx-rates (apps/fetch-openexchangerates-fx-rates)	Partial: Checkpoint-only. Trading day check, fetch failure, invalid/overflow rates, success. Silent on per-rate processing.	None: OXR API called via raw `fetch()`. No tracing.	None: No metrics of any kind.
update-recap-page-paths (apps/update-recap-page-paths)	Partial: Per-ISIN failures and skips logged. Successful path upserts are silent. Summary count at end.	None: HEAD requests to recap server via `safeFetch()`. No span instrumentation.	None: No metrics of any kind.
API layer
packages/api, oRPC (handler · middleware · procedures)	Partial: Procedure entry/exit + duration via logging middleware. No domain-level logging at the handler boundary. Error codes logged at WARN for unmapped errors.	None: No span instrumentation at the handler or middleware level. Domain packages called from procedures have their own logging but no traces.	None: No request rate, error rate, or latency metrics. No OTel instruments anywhere in the API layer.

Cross-cutting shortcomings by pillar

Logs

No correlation between logs and traces: No traceId or spanId is injected into log entries. When a traced Anthropic call fails, you cannot jump from the span to the log entries from that same execution. The two signals are parallel but unlinked.
JobLogger not connected to any product: The rich schema (runId, event, phase, errorCode, structured context and metrics sub-objects) exists in monitoring/ but no product imports it. All 11 products emit flat JSON with no shared identity fields.
All products share the same structural gap: Every product logs failures well. None log successful per-item or per-step operations at DEBUG level. The only diagnostic signal is the absence of errors, not positive confirmation that work completed.
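
Cloud Logging links a structured log entry to a trace when the entry carries the special fields `logging.googleapis.com/trace` and `logging.googleapis.com/spanId`. A minimal sketch of adding them is shown below; the `withTraceFields` helper is hypothetical, and how the current trace and span IDs would be obtained from the OTel context is left out.

```typescript
// Adds Cloud Logging's trace-correlation fields to an existing log entry.
// The trace field must be the full resource name, not the bare trace ID.
function withTraceFields(
  entry: Record<string, unknown>,
  projectId: string,
  traceId: string,
  spanId: string,
): Record<string, unknown> {
  return {
    ...entry,
    "logging.googleapis.com/trace": `projects/${projectId}/traces/${traceId}`,
    "logging.googleapis.com/spanId": spanId,
  };
}
```

With these two fields present, Log Explorer and Cloud Trace can cross-link an entry and its span, which closes the gap between the two signals described above.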

Traces

Most products have zero trace coverage: Seven of the eleven are marked None in the table above: news-ingest, news-sentiment-write, the four cron jobs (snaptrade-sync-reconcile, cache-market-movers, fetch-openexchangerates-fx-rates, update-recap-page-paths), and the API. Those that make external API calls do so via raw fetch() with no instrumentation, so latency, errors, and retry patterns are invisible as traces.
No job-level root span on the two instrumented products: batch-portfolio-insights and batch-risk-analysis get Anthropic call spans indirectly through llm-batch/anthropic.ts, but there is no parent span for the job execution itself. Individual Anthropic calls are trace events, but there is no trace that represents the full job run.
OTel SDK not configured, so all spans are no-op in production: Even the instrumented Anthropic call spans are silently discarded. OTEL_DENO=true and an exporter endpoint are required but not set in any Cloud Run job environment.

Metrics

9 of 11 products have no confirmed metrics: Only the two LLM batch jobs are confirmed to collect OTel metrics (via BatchMetricsTracker); news-sentiment-submit and news-sentiment-write likely inherit it from the shared runner. Every cron job, news-ingest, batch-poll, and the API produce no time-series metrics of any kind.
OTel metrics are also no-op in production: The token counter and batch duration histogram in BatchMetricsTracker are correctly defined but silently discarded (the same SDK/exporter gap as traces). Token counts are also written to the DB as a workaround, but that is queryable data, not a metrics system.
No operational health metrics anywhere: There are no metrics for job success rate, job duration distribution, external API error rate, or API request throughput. These are the signals that would feed dashboards and alerting, and none of them exist as metrics today.

v.

The implementation plan.

6 steps · 3 infra-only · 3 code changes · GCP-native + OTel Collector

Roadmap

The plan activates all three observability pillars using only GCP services and the instrumentation already built on this branch. Steps 1–3 are pure infrastructure — no application code changes, no risk to running jobs. Steps 4–6 are code changes that deepen what the infrastructure can surface.

6 total steps · 3 infra-only · 3 code changes · 0 new vendors

1

Deploy the OTel Collector.

Deploy a Cloud Run Service running the OpenTelemetry Collector Contrib image. Configure it with a single YAML: accept OTLP/HTTP on port 4318, export to GCP via the googlecloud exporter. Set ingress to internal-only so it is reachable by other Cloud Run services but not the public internet.

Create a dedicated otel-collector service account and grant it roles/cloudtrace.agent and roles/monitoring.metricWriter. The collector handles all GCP authentication — application jobs never touch credentials for telemetry again. Token refresh is automatic via ADC.

New: infra/collector.tf
New: infra/collector-config.yaml
Edit: infra/iam.tf — otel-collector SA + role bindings

Infra only · No app risk

2

Activate OTel on all jobs.

Add three environment variables to every Cloud Run job spec in Terraform. OTEL_DENO=true activates Deno's built-in OTel SDK. OTEL_EXPORTER_OTLP_ENDPOINT points to the collector's internal Cloud Run URL. OTEL_SERVICE_NAME labels spans by job for grouping in Cloud Trace.

This single Terraform apply gives you: every fetch() call in every job as a trace span — Finnhub, Yahoo Finance, OXR, SnapTrade, Anthropic — all automatically, with no code changes. The existing tracedFetch() and withSpan() calls in llm-batch activate on top. BatchMetricsTracker token and duration metrics start flowing to Cloud Monitoring.

Edit: infra/jobs/cloud-run.tf — batch-portfolio-insights
Edit: infra/jobs/batch-risk-analysis.tf
Edit: infra/jobs/news-ingest.tf
Edit: infra/jobs/news-sentiment-*.tf (3 files)
Edit: infra/jobs/snaptrade-sync-reconcile.tf
Edit: infra/jobs/cache-market-movers.tf
Edit: infra/jobs/fetch-openexchangerates-fx-rates.tf
Edit: infra/jobs/update-recap-page-paths.tf

Infra only · No app risk

3

Metrics, alerting, and a dashboard.

Cloud Run Jobs already emit free execution metrics to Cloud Monitoring (an execution count labelled by result, succeeded or failed, and execution latency), but nothing is watching them. Create a google_monitoring_dashboard that surfaces these alongside the custom OTel metrics arriving from step 2.

Create two alerting policies: one on run.googleapis.com/job/completed_execution_count filtered to failed executions (which is how an exit code 1 fatal crash surfaces), and one on the rate of ERROR-severity log entries per job, using a google_logging_metric. This is the first time a job failure will produce a notification rather than silent log entries.
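As a rough illustration of the log-based metric, the sketch below builds the Cloud Logging filter string that such a metric could use for one job. The helper name jobErrorFilter and the FilterPrecision type are hypothetical, and the cloud_run_job resource type and job_name label are assumptions about how these jobs appear in Cloud Logging; the second mode mirrors the JOB_FATAL refinement described in step 4.

```typescript
// Hypothetical helper, not part of the codebase: builds the filter for a
// per-job log-based metric. Assumes job names contain no quote characters.
type FilterPrecision = "severity" | "fatalEvent";

function jobErrorFilter(
  jobName: string,
  precision: FilterPrecision = "severity",
): string {
  const clauses = [
    'resource.type="cloud_run_job"',
    `resource.labels.job_name="${jobName}"`,
    // Step 3 counts every ERROR-or-worse entry; step 4 narrows to JOB_FATAL.
    precision === "severity"
      ? "severity>=ERROR"
      : 'jsonPayload.event="JOB_FATAL"',
  ];
  return clauses.join(" AND ");
}
```

The resulting string would go into the filter argument of the google_logging_metric resource; pasting it into Log Explorer first is a cheap way to confirm it matches the entries you expect.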

New: infra/monitoring.tf
  — google_logging_metric (error rate per job)
  — google_monitoring_alert_policy (exit code 1)
  — google_monitoring_alert_policy (error rate spike)
  — google_monitoring_dashboard

Infra only · No app risk

4

Connect the JobLogger.

Replace log() calls with createLogger() from @castello/monitoring in each job's main.ts. Every log entry gains runId (a UUID stable across one execution), event, phase, and errorCode — the fields the schema was designed around. Switch fatal catch blocks from log(LogLevel.ERROR) to logger.critical() before Deno.exit(1), activating the CRITICAL severity tier in GCP for the first time.

Once deployed, update the log-based metric filter in Terraform from severity>=ERROR to jsonPayload.event="JOB_FATAL" — a more precise signal that eliminates false positives from per-item errors.
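The fatal-exit pattern could look roughly like the sketch below. The real JobMeta and JobLogger types live in monitoring/schema.ts and monitoring/logger.ts and are not reproduced in this audit, so the interfaces, the buildFatalEntry and runJob helpers, and the "UNHANDLED" error code are stand-ins for illustration only; the CRITICAL severity and JOB_FATAL event are taken from the plan above.

```typescript
// Stand-in for the real JobMeta in monitoring/schema.ts (may differ).
interface JobMeta {
  jobName: string;
  jobType: string;
  runId: string;
  environment: string;
}

// Assumed shape of the single entry written when a job dies.
interface FatalEntry extends JobMeta {
  severity: "CRITICAL";
  event: "JOB_FATAL";
  message: string;
  error: { code: string; message: string; stack?: string | undefined };
}

function buildFatalEntry(
  meta: JobMeta,
  err: unknown,
  code = "UNHANDLED", // hypothetical error code
): FatalEntry {
  const e = err instanceof Error ? err : new Error(String(err));
  return {
    ...meta,
    severity: "CRITICAL",
    event: "JOB_FATAL",
    message: `${meta.jobName} terminated: ${e.message}`,
    error: { code, message: e.message, stack: e.stack },
  };
}

// Runs the job body. On failure it writes one CRITICAL entry and returns
// exit code 1; the caller decides how to exit the process.
async function runJob(
  meta: JobMeta,
  main: () => Promise<void>,
  write: (line: string) => void,
): Promise<number> {
  try {
    await main();
    return 0;
  } catch (err) {
    write(JSON.stringify(buildFatalEntry(meta, err)));
    return 1;
  }
}
```

In a job's main.ts this would end with something like Deno.exit(await runJob(meta, main, console.error)), so the CRITICAL entry is written to stderr before the process exits.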

Edit: apps/batch-portfolio-insights/main.ts
Edit: apps/batch-risk-analysis/main.ts
Edit: apps/snaptrade-sync-reconcile/main.ts
Edit: apps/cache-market-movers/main.ts
Edit: apps/fetch-openexchangerates-fx-rates/main.ts
Edit: apps/update-recap-page-paths/main.ts
Edit: jobs/news/ingest/src/main.ts
Edit: infra/monitoring.tf — refine alert filter

Code change

5

Link logs to traces.

Modify packages/core/logging.ts and monitoring/logger.ts to read the active OTel span context at the moment each log entry is written. If a span is active, inject two fields that GCP Cloud Logging recognises natively: logging.googleapis.com/trace (the full trace resource path) and logging.googleapis.com/spanId.

After this change, every log entry written inside a traced operation displays a "View in Trace" link in Cloud Logging's Log Explorer. You can jump from a specific WARN or ERROR log line directly to the Cloud Trace span that contains it — closing the gap between the two previously unlinked signals.

Edit: packages/core/logging.ts — inject span context
Edit: monitoring/logger.ts — inject span context
Edit: monitoring/schema.ts — add trace fields to LogEntry type

Code change

6

Add a root span per job.

Wrap each job's top-level execution in a withSpan() call. This creates a single parent span representing the full job run, under which all auto-instrumented fetch() spans and explicit withSpan() blocks become children. Without this, Cloud Trace shows individual spans but no trace that represents "this was one execution of batch-portfolio-insights."

Attach the runId from the JobLogger as a span attribute so the trace and the log entries for one execution share a common identifier — making cross-signal correlation possible without relying on timestamps or Cloud Run's execution_id label.
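A minimal sketch of the root-span wrapper follows. The repository's withSpan() helper is not shown in this audit, so the sketch uses a hypothetical withRootSpan() and two small structural interfaces (SpanLike, TracerLike) that stand in for the parts of the OpenTelemetry tracer API it needs; in a real job the tracer would come from the OTel API rather than being passed in. The attribute key "job.run_id" is also an assumption.

```typescript
// Structural stand-ins for the span and tracer methods used below.
interface SpanLike {
  setAttribute(key: string, value: string): unknown;
  recordException(error: Error): void;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Wraps one job execution in a single parent span and tags it with runId.
async function withRootSpan<T>(
  tracer: TracerLike,
  jobName: string,
  runId: string,
  main: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(jobName, async (span) => {
    // Same identifier the JobLogger writes on every log entry.
    span.setAttribute("job.run_id", runId);
    try {
      return await main();
    } catch (err) {
      span.recordException(err instanceof Error ? err : new Error(String(err)));
      throw err;
    } finally {
      // Always end the span, or the root span is never exported.
      span.end();
    }
  });
}
```

Ending the span in a finally block matters for short-lived jobs: a root span that is never ended is not exported, which would leave the child spans without the parent this step is meant to provide.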

Edit: apps/batch-portfolio-insights/main.ts
Edit: apps/batch-risk-analysis/main.ts
Edit: apps/snaptrade-sync-reconcile/main.ts
Edit: apps/cache-market-movers/main.ts
Edit: apps/fetch-openexchangerates-fx-rates/main.ts
Edit: apps/update-recap-page-paths/main.ts
Edit: jobs/news/ingest/src/main.ts

Code change

What becomes available after each step

After steps 1–3 · infra only

Operational visibility

Every fetch() call in every job appears as a span in Cloud Trace
Token usage and batch duration flow into Cloud Monitoring
Job success/failure dashboard live
Alerts fire on fatal job crashes and error rate spikes
Zero application code changed

After step 4 · JobLogger

Structured signal

Log entries carry runId, event, phase, errorCode
Alert filter narrows from all ERRORs to JOB_FATAL, removing false positives from per-item errors
CRITICAL severity tier active — GCP can distinguish a job crash from a per-item failure
Log-based metrics become more precise

After steps 5–6 · correlation

Unified observability

Log entries link directly to their trace span in Cloud Logging UI
Each job execution is one traceable unit from start to finish
runId ties logs and traces together by a shared identifier
All three pillars unified — logs, traces, metrics cross-reference each other

Key principle

Steps 1–3 are independently shippable and carry no application risk. They give you most of the operational value — alerting, dashboards, trace coverage — without touching a single line of job code. Steps 4–6 deepen the signal but depend on steps 1–3 being in place first: there is no point injecting trace IDs into log entries before traces are flowing.

Implementation details

Step 2 — OTEL_SERVICE_NAME is required, not optional

OTEL_SERVICE_NAME is the label Cloud Trace uses to group and filter spans. Without it, every span from every job appears under unknown_service — all ten jobs' traces merged into one unnavigable list. It must be set per job in Terraform, matching the job name exactly.

Job	OTEL_SERVICE_NAME value
batch-portfolio-insights	batch-portfolio-insights
batch-risk-analysis	batch-risk-analysis
news-ingest	news-ingest
news-sentiment-submit	news-sentiment-submit
news-sentiment-write	news-sentiment-write
batch-poll	batch-poll
snaptrade-sync-reconcile	snaptrade-sync-reconcile
cache-market-movers	cache-market-movers
fetch-openexchangerates-fx-rates	fetch-openexchangerates-fx-rates
update-recap-page-paths	update-recap-page-paths
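Because a missing or mismatched service name produces no error, a small startup check can make the problem visible. The sketch below is one way to do that, assuming the three variables from step 2; checkOtelEnv is a hypothetical helper, and the environment is passed in as a plain object so the function stays runtime-agnostic.

```typescript
// The three variables step 2 adds to every job spec.
const REQUIRED_OTEL_VARS = [
  "OTEL_DENO",
  "OTEL_EXPORTER_OTLP_ENDPOINT",
  "OTEL_SERVICE_NAME",
] as const;

// Hypothetical helper: returns a list of human-readable problems,
// or an empty list when the OTel environment looks right.
function checkOtelEnv(
  env: Record<string, string | undefined>,
  expectedJobName: string,
): string[] {
  const problems: string[] = [];
  for (const name of REQUIRED_OTEL_VARS) {
    if (!env[name]) problems.push(`${name} is not set`);
  }
  const serviceName = env["OTEL_SERVICE_NAME"];
  if (serviceName && serviceName !== expectedJobName) {
    problems.push(
      `OTEL_SERVICE_NAME is "${serviceName}", expected "${expectedJobName}"`,
    );
  }
  return problems;
}
```

In a Deno job this could be called as checkOtelEnv(Deno.env.toObject(), "news-ingest") and the returned problems logged at WARN, so a misconfigured job is visible in the logs even when its traces are not.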

Step 5 — trace field format must be the full resource path

GCP Cloud Logging only draws the link between a log entry and a Cloud Trace span if the logging.googleapis.com/trace field contains the full resource path — not just the trace ID hex string. The OTel API returns a 32-character hex string from spanContext().traceId. Step 5 must construct the full path before writing it to the log entry.

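import { trace } from "@opentelemetry/api";

// Read once at module load; set on all Cloud Run jobs.
const PROJECT_ID = Deno.env.get("GOOGLE_CLOUD_PROJECT_ID");

function getTraceContext(): Record<string, string | boolean> {
  const span = trace.getActiveSpan();
  // No active span, or no project ID to build the path: omit trace fields.
  if (!span || !PROJECT_ID) return {};

  const { traceId, spanId, traceFlags } = span.spanContext();
  return {
    // full path required; the hex string alone does not work
    "logging.googleapis.com/trace":
      `projects/${PROJECT_ID}/traces/${traceId}`,
    "logging.googleapis.com/spanId": spanId,
    // bit 0 of traceFlags is the W3C "sampled" flag
    "logging.googleapis.com/trace_sampled": (traceFlags & 1) === 1,
  };
}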

This function is called inside packages/core/logging.ts and monitoring/logger.ts at the point each entry is serialised. If no span is active (e.g. code running outside a withSpan() block), it returns an empty object and the log entry is written without trace fields — this is correct behaviour, not an error. PROJECT_ID is read from the GOOGLE_CLOUD_PROJECT_ID env var already present on all Cloud Run jobs.
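To make the serialisation point concrete, the sketch below shows how trace fields might be merged into a flat log entry. It is not the code in packages/core/logging.ts; serialiseEntry and the TraceFields type are hypothetical, and the entry shape (severity, message, timestamp, spread data) follows the description of the generic logger earlier in this audit.

```typescript
// Trace fields as produced by a getTraceContext()-style function;
// an empty object means "no active span".
type TraceFields = Record<string, string | boolean>;

// Hypothetical serialiser: spreads caller data and trace fields first so
// they can never overwrite severity, message, or timestamp.
function serialiseEntry(
  severity: string,
  message: string,
  data: Record<string, unknown> = {},
  traceFields: TraceFields = {},
): string {
  return JSON.stringify({
    ...data,
    ...traceFields,
    severity,
    message,
    timestamp: new Date().toISOString(),
  });
}
```

Because an empty traceFields object spreads to nothing, entries written outside a span come out exactly as they do today, which matches the "correct behaviour, not an error" case described above.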

Silent failure risk

Both of these details fail silently. If OTEL_SERVICE_NAME is omitted, Cloud Trace still receives spans — they just all appear under unknown_service with no indication anything is wrong. If the trace field is written as a bare hex string instead of the full path, Cloud Logging still accepts the log entry — it just never draws the link to Cloud Trace. Neither produces an error. Verify both in a staging run before treating the feature as working.
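One way to check the trace field during a staging run is to validate a sample of emitted log entries against the expected formats. The sketch below assumes entries have already been parsed from JSON; traceFieldProblems is a hypothetical helper, and the patterns encode the full resource path described above plus the standard 32-character trace ID and 16-character span ID hex lengths.

```typescript
// Expected formats: full resource path for the trace, 16 hex chars for the span.
const TRACE_PATH = /^projects\/[^/]+\/traces\/[0-9a-f]{32}$/;
const SPAN_ID = /^[0-9a-f]{16}$/;

// Hypothetical checker: returns problems found in one parsed log entry.
// Entries without trace fields are valid (no span was active).
function traceFieldProblems(entry: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const traceValue = entry["logging.googleapis.com/trace"];
  const spanValue = entry["logging.googleapis.com/spanId"];

  if (traceValue !== undefined) {
    if (typeof traceValue !== "string" || !TRACE_PATH.test(traceValue)) {
      problems.push("trace is not a full projects/<id>/traces/<hex> path");
    }
  }
  if (spanValue !== undefined) {
    if (typeof spanValue !== "string" || !SPAN_ID.test(spanValue)) {
      problems.push("spanId is not a 16-character hex string");
    }
  }
  return problems;
}
```

A bare hex trace ID, the failure mode described above, is reported as a problem, while an entry with no trace fields at all passes.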
