A full survey of logging infrastructure across _next — what levels exist, where they fire, what stays silent, and where the architecture diverges from its own schema.
Logging is split across two independent systems that serve different execution contexts and produce structurally different output. They share the same GCP destination and the same physical output method (console.log / console.error), but the shape of what they emit is quite different.
Factory-based structured logger for batch and cron jobs. Created via createLogger(meta: JobMeta) in monitoring/logger.ts. Every entry carries full job identity threaded through: job name, job type, run UUID, environment, a typed LogEvent enum value, a LogPhase, optional entity context IDs, quantitative metrics, and structured error detail. The schema is defined exhaustively in monitoring/schema.ts (12 KB).
- JobLogger with 5 methods: debug(), info(), warn(), error(), critical()
- jobName · jobType · runId (UUID per execution) · environment
- LogEvent (22 event types) · LogPhase (5 phases)
- context (entity IDs) · metrics (counts, durations, tokens) · error (code + message + stack)
- DEBUG/INFO → console.log (stdout) · WARNING/ERROR/CRITICAL → console.error (stderr)

A lightweight generic logger in packages/core/logging.ts for application-layer code: API middleware, service packages, library utilities. Output is a flat JSON object: severity, message, timestamp, and any additional fields spread from the data argument. No job identity, no event taxonomy, no structured sub-objects.

- log(level, message, data?, context?)
- debug · info · warn · error (4, no critical)
- severity + message + timestamp + spread data
- error/warn → stderr · info/debug → stdout

The monitoring/ module (JobLogger, schema.ts, all 22 LogEvent types) exists on feat/logging-monitoring but has not been imported or called by any running job. Every batch and cron job in production today uses packages/core/logging.ts, the flat generic logger. The rich schema described above is the design target, not the current state.
The generic log() function is a zero-ceremony escape hatch for code that isn't job-scoped. The problem is that the batch jobs themselves still use it. SnapTrade packages also use the generic logger despite being primary consumers of job context.
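The difference between the two systems can be sketched as follows. This is a minimal illustration, not the contents of monitoring/logger.ts or packages/core/logging.ts: the field names follow the ones listed above, while `flatLog`, `makeJobLogger`, and the injectable `sink` parameter are hypothetical names chosen for the example.

```typescript
// Hypothetical sketch: names below are illustrative, not the real module API.
type Severity = "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";

interface JobMeta {
  jobName: string;
  jobType: "batch" | "cron";
  runId: string;
  environment: string;
}

type Entry = Record<string, unknown>;
type Sink = (entry: Entry) => void;

// Generic logger: one flat object, caller data spread at the top level,
// no job identity attached.
function flatLog(sink: Sink, severity: Severity, message: string, data: Entry = {}): void {
  sink({ severity, message, timestamp: new Date().toISOString(), ...data });
}

// Job logger: identity is bound once by the factory and stamped on every entry.
function makeJobLogger(meta: JobMeta, sink: Sink) {
  const emit = (severity: Severity) => (message: string, fields: Entry = {}): void =>
    sink({ severity, message, timestamp: new Date().toISOString(), ...meta, ...fields });
  return {
    debug: emit("DEBUG"),
    info: emit("INFO"),
    warn: emit("WARNING"),
    error: emit("ERROR"),
    critical: emit("CRITICAL"),
  };
}
```

With the factory, a call site such as `logger.info("Fetched brokerages", { event: "STEP_SUCCESS" })` carries jobName and runId without the caller passing them, which is what the flat logger cannot offer.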
| Constant | Job type | Description |
|---|---|---|
| BATCH_PORTFOLIO_INSIGHTS | batch | LLM-powered portfolio analysis via Anthropic Batch API |
| BATCH_RISK_ANALYSIS | batch | LLM-powered risk analysis via Anthropic Batch API |
| CACHE_MARKET_MOVERS | cron | Refresh market movers cache |
| DAILY_SNAPTRADE_SNAPSHOT | cron | Daily SnapTrade portfolio snapshot |
| FETCH_OXR_FX_RATES | cron | Sync OpenExchangeRates FX data |
| PORTFOLIO_METRICS | cron | Compute portfolio performance metrics |
| SNAPTRADE_SYNC_RECONCILE | cron | Reconcile SnapTrade sync state |
| SYNC_BROKERAGE_INFO | cron | Sync brokerage metadata from SnapTrade |
| UPDATE_RECAP_PAGE_PATHS | cron | Refresh recap page path index |
Seven subsections covering the current state of logging in _next: level distribution, dead code, diagnostic gaps, output format and GCP routing, per-product log destinations, Cloud Logging queryability, and unactivated observability infrastructure.
Five levels are defined across the two systems. Four are in active use. The distribution is heavily weighted toward INFO and WARN — diagnostic depth at the DEBUG tier is nearly absent, and the highest severity tier (CRITICAL) is dead code.
INFO is used for high-level lifecycle events: job start/end, phase transitions, batch submission and completion, account sync milestones. It always includes entity IDs and summary metrics. It never fires per-item in tight loops — that's left to DEBUG (or not logged at all).
WARN is the second-busiest level and covers three distinct patterns: external system degradation (SnapTrade 429s, network retries), concurrent state machine race conditions in the sync pipeline, and data integrity gaps where processing continues with partial data (missing FX rates, unknown tickers, null brokerage IDs). Every WARN means the operation continued.
ERROR marks operations that stopped. It is the signal for incidents and alerting. Always includes machine-readable error codes, entity IDs for correlation, and consequence language where data loss is possible. No ERROR should be informational — if the operation recovered, it was a WARN.
The six DEBUG calls cover narrow cases: market holiday cache refreshes, batch polling status, per-portfolio save confirmations, bulk-save outcomes, empty retry queues, and skipped-item notices. These are correct targets for DEBUG. The gap is that almost no high-frequency path produces any diagnostic signal at all.
| Logger level | GCP severity | Stream | Alert eligible |
|---|---|---|---|
| DEBUG | DEBUG | stdout | No |
| INFO / info | INFO | stdout | No |
| WARN / warn | WARNING | stderr | Optional |
| ERROR / error | ERROR | stderr | Yes |
| CRITICAL | CRITICAL | stderr | Yes — immediate |
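
The routing in the table above can be expressed as a single lookup. The following is a sketch under that assumption only; `ROUTES` and `writeEntry` are hypothetical names, and the real loggers may structure this differently.

```typescript
// Hypothetical sketch of the level -> GCP severity -> stream routing.
type Level = "debug" | "info" | "warn" | "error" | "critical";

interface Route {
  severity: "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";
  stream: "stdout" | "stderr";
}

const ROUTES: Record<Level, Route> = {
  debug: { severity: "DEBUG", stream: "stdout" },
  info: { severity: "INFO", stream: "stdout" },
  warn: { severity: "WARNING", stream: "stderr" },
  error: { severity: "ERROR", stream: "stderr" },
  critical: { severity: "CRITICAL", stream: "stderr" },
};

// Serialises one newline-delimited JSON entry and writes it to the routed stream.
// Returns the serialised line so callers (and tests) can inspect it.
function writeEntry(level: Level, message: string, data: Record<string, unknown> = {}): string {
  const { severity, stream } = ROUTES[level];
  const line = JSON.stringify({ severity, message, timestamp: new Date().toISOString(), ...data });
  if (stream === "stderr") {
    console.error(line);
  } else {
    console.log(line);
  }
  return line;
}
```

Note that the logger-side name `warn` maps to the GCP severity string `WARNING`; Cloud Logging does not recognise `WARN` as a severity value.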
| Pattern | Count | Example |
|---|---|---|
| External system retries (rate limits, network errors) | ~28 | SnapTrade 429, upstream 5xx, network timeout |
| Concurrent state race conditions | ~18 | Sync state machine: complete UPDATE no-op, row disappeared |
| Data integrity gaps (partial data, missing FX) | ~32 | Missing FX rate, null brokerage_id, skipped SnapTrade activity |
| Fallback mechanisms engaged | ~14 | Bulk save → per-item fallback, brokerage_id left NULL |
| Graceful shutdown signals | ~8 | SIGTERM received, forced shutdown |
| Unknown / unrecognised data | ~16 | Unknown custom_id in batch result, unsupported ticker |
The Severity.CRITICAL level and the logger.critical() method are fully implemented. The schema documents their intended contract precisely: reserved for log entries emitted immediately before exit(1), mapping to GCP's highest alert tier. But the level has zero call sites across the entire codebase.
The actual pattern for fatal exits in every batch job is to log at ERROR and then call Deno.exit(1) separately. The two signals are disconnected: GCP sees an ERROR entry and then a container exit, but never a CRITICAL entry that would let an alerting policy fire specifically on "job crashed, not just errored."
```ts
// What actually happens today
try {
  // job body
} catch (err) {
  log(LogLevel.ERROR, "Batch job crashed", { message: err.message });
  Deno.exit(1);
}

// What the schema intended
try {
  // job body
} catch (err) {
  logger.critical("Batch job crashed", {
    event: LogEvent.JOB_FATAL,
    phase: LogPhase.TEARDOWN,
    error: toErrorDetail("JOB_CRASHED", err),
  });
  Deno.exit(1);
}
```
GCP Cloud Logging alerting policies distinguish severity tiers. An ERROR policy fires on all errors including non-fatal ones. A CRITICAL policy fires only on the most severe failures. Without CRITICAL, there is no GCP-native way to build a "job crashed" alert that doesn't also fire on recoverable per-item errors — or you have to filter by jsonPayload.event="JOB_FATAL" manually rather than by severity.
The fix is mechanical: replace the terminal log(LogLevel.ERROR) + Deno.exit(1) pair in each job's top-level catch block with logger.critical() using LogEvent.JOB_FATAL. Eight jobs are affected. No schema changes required — the infrastructure is already there.
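Because the same catch block repeats in every job, the fix could also be centralised in one wrapper. The sketch below assumes a minimal structural logger type and an injectable exit function so it can be exercised outside Deno; `runJob` and `CriticalLogger` are hypothetical names, and the literal event and phase strings stand in for the LogEvent and LogPhase enum members.

```typescript
// Hypothetical wrapper: the real JobLogger has a richer signature than this.
interface CriticalLogger {
  critical(message: string, fields: Record<string, unknown>): void;
}

// Runs the job body; on any uncaught error, emits one CRITICAL entry
// and then exits with code 1, so the two signals are no longer disconnected.
async function runJob(
  main: () => Promise<void>,
  logger: CriticalLogger,
  exit: (code: number) => void,
): Promise<void> {
  try {
    await main();
  } catch (err) {
    const e = err instanceof Error ? err : new Error(String(err));
    logger.critical("Batch job crashed", {
      event: "JOB_FATAL",
      phase: "TEARDOWN",
      error: { code: "JOB_CRASHED", message: e.message, stack: e.stack },
    });
    exit(1);
  }
}
```

In a job's main.ts this would be called as `await runJob(main, logger, Deno.exit)`.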
The six existing DEBUG calls are correctly placed — they cover success signals on high-cardinality operations that don't warrant INFO noise. The problem is scope: whole subsystems that would benefit from per-operation tracing produce no diagnostic output between the WARN/ERROR conditions they do log.
| File | Message | Data |
|---|---|---|
| trading_day_service.ts:163 | Refreshed market holiday cache | exchange · holidayCount |
| llm-batch/anthropic.ts | Batch still processing | batchId · status |
| portfolio/service.ts | Saved portfolio insights | userId · accountId · targetDate |
| portfolio/service.ts | Bulk-saved portfolio insights | portfolios · rows |
| llm-batch/retry.ts | No retryable failed requests | batchJobId |
| update-recap-page-paths/main.ts | Page not found, skipping | isin · status |
DEBUG entries are routed to stdout and excluded from GCP alerting by severity filter. Adding DEBUG logging to high-frequency paths (per-request, per-query) carries no alerting cost — the risk is log volume in production. A LOG_LEVEL env var or DEBUG=true flag should gate these calls before they are added.
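A level gate of the kind suggested above might look like the following. This is a sketch, not existing code: the function names are hypothetical, and defaulting to `info` when LOG_LEVEL is unset or unrecognised is an assumption about the desired production behaviour.

```typescript
// Hypothetical LOG_LEVEL gate. Ranks follow the five levels in this survey.
type GateLevel = "debug" | "info" | "warn" | "error" | "critical";

const RANK: Record<GateLevel, number> = { debug: 0, info: 1, warn: 2, error: 3, critical: 4 };

function isGateLevel(v: string): v is GateLevel {
  return Object.prototype.hasOwnProperty.call(RANK, v);
}

// Parses the raw env value; falls back when it is missing or not a known level.
function parseMinLevel(raw: string | undefined, fallback: GateLevel = "info"): GateLevel {
  const v = (raw ?? "").trim().toLowerCase();
  return isGateLevel(v) ? v : fallback;
}

// True when an entry at `level` should be emitted under the configured minimum.
function shouldLog(level: GateLevel, min: GateLevel): boolean {
  return RANK[level] >= RANK[min];
}
```

A logger would resolve the minimum once at startup, for example `const min = parseMinLevel(Deno.env.get("LOG_LEVEL"))`, and check `shouldLog(level, min)` before serialising, so suppressed DEBUG calls cost one comparison rather than a JSON.stringify.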
Both logging systems emit newline-delimited JSON to standard streams. GCP Cloud Run captures these streams and forwards them to Cloud Logging, where the severity, message, and timestamp fields are promoted to first-class indexed columns. Everything else lands in jsonPayload.
JobLogger entry (design target):

```json
{
  "severity": "INFO",
  "message": "Starting portfolio insights batch job",
  "timestamp": "2026-05-25T14:32:18.123Z",
  "jobName": "batch-portfolio-insights",
  "jobType": "batch",
  "environment": "production",
  "runId": "550e8400-e29b-41d4-a716-446655440000",
  "event": "JOB_START",
  "phase": "INIT",
  "context": {
    "targetDate": "2026-05-23"
  },
  "metrics": {
    "total": 450,
    "durationMs": 5432
  }
}
```

Generic logger entry (flat):

```json
{
  "severity": "INFO",
  "message": "-> user.deleteAccount",
  "functionName": "oRPC_procedure",
  "timestamp": "2026-05-25T14:32:18.123Z"
}
```
Top-level fields in the JobLogger output are indexed and filterable directly in GCP Log Explorer. The event and errorCode fields were specifically designed for this — their values were chosen to be useful as alert conditions, not just human-readable labels.
```
-- Only job-fatal events
jsonPayload.event="JOB_FATAL"

-- All errors from one job
severity>=ERROR AND jsonPayload.jobName="batch-portfolio-insights"

-- Trace all logs for one run
jsonPayload.runId="550e8400-e29b-41d4-a716-446655440000"

-- Follow a specific user across all jobs
jsonPayload.context.userId="<uuid>"
```

- No logging SDK is required: entries are plain JSON written via console.log calls.
- The runId UUID provides a complete per-execution trace: all logs for one job run can be isolated with a single filter.
- The event enum vocabulary is well-designed and GCP-filter-ready.
- The errorCode field is a flat top-level field specifically for fast alert condition matching.

| Category | Events |
|---|---|
| Job lifecycle | JOB_START · JOB_SUCCESS · JOB_PARTIAL_FAILURE · JOB_FATAL |
| Step-level | STEP_START · STEP_SUCCESS · STEP_FAILURE |
| Item-level | ITEM_PROCESSED · ITEM_FAILED · ITEM_SKIPPED |
| External systems | EXTERNAL_API_REQUEST · EXTERNAL_API_ERROR · DB_OPERATION · DB_ERROR |
| LLM batch | LLM_BATCH_SUBMITTED · LLM_BATCH_COMPLETE · LLM_BATCH_ERROR |
| Process signals | SHUTDOWN_SIGNAL |
All ten Cloud Run jobs and their orchestration layers log to GCP Cloud Logging. They arrive under different resource.type labels — so they appear separated in Log Explorer — but they are all in the same logging system. Cloud Scheduler and Cloud Workflows are not separate log destinations; they are separate resource types within Cloud Logging.
All ten Cloud Run jobs write structured JSON to stdout and stderr. Cloud Run captures these streams automatically and forwards them to GCP Cloud Logging — no SDK or explicit client required. The GCP project is castello-backend, region us-central1, and all images are stored in Artifact Registry under castello-batch.
Because the monitoring module's JobLogger is not yet connected, every job currently emits flat JSON via packages/core/logging.ts — not the rich schema defined in monitoring/schema.ts. There are no jobName, runId, event, or phase fields in production logs today. The fields available for filtering are limited to those spread from each log call's data argument, which varies by callsite.
What GCP receives today (flat, core/logging.ts):

```json
{
  "severity": "INFO",
  "message": "Fetched X brokerages from SnapTrade",
  "timestamp": "2026-05-25T14:32:18.123Z",
  "count": 42
}
```

What GCP will receive after the monitoring module is connected:

```json
{
  "severity": "INFO",
  "message": "Fetched X brokerages from SnapTrade",
  "timestamp": "2026-05-25T14:32:18.123Z",
  "jobName": "sync-brokerage-info",
  "runId": "550e8400-e29b-41d4-a716-446655440000",
  "event": "STEP_SUCCESS",
  "phase": "PREFETCH",
  "metrics": {
    "total": 42,
    "durationMs": 812
  }
}
```
Log Explorer filters for Cloud Run jobs (project castello-backend):

- All jobs: resource.type="cloud_run_job"
- One job: resource.type="cloud_run_job" AND resource.labels.job_name="portfolio-insights"
- Errors only: resource.type="cloud_run_job" AND severity>=ERROR
- One execution: resource.labels.execution_id="portfolio-insights-abc12"

Cloud Workflows writes to Cloud Logging under resource.type="workflows.googleapis.com/Workflow". By default only errors are logged. Setting call_log_level = "LOG_ALL_CALLS" on the Terraform resource enables step-level entries: which steps executed, what arguments they passed to each Cloud Run job, and what the response was. Custom messages can also be written from within the workflow YAML using sys.log(), which land in the same stream.

- Filter: resource.type="workflows.googleapis.com/Workflow" AND resource.labels.workflow_id="news-sentiment"
- call_log_level = "LOG_ALL_CALLS" in infra/jobs/news-sentiment-workflow.tf
- batch-poll runs up to 288 times (24h max, every 5min). Each invocation is a separate Cloud Run Job execution with its own log stream under resource.type="cloud_run_job".

Cloud Scheduler writes to Cloud Logging under resource.type="cloud_scheduler_job" on every invocation: the time it fired, the HTTP target it called, and the response status code it received. The "View Logs" button in the Cloud Scheduler console is a shortcut link that opens Cloud Logging with this filter pre-applied. It is not a separate log store.

- Filter: resource.type="cloud_scheduler_job" AND resource.labels.job_name="portfolio-insights-daily"
- Cloud Run attaches an execution_id label to each job execution, so you can isolate one run's logs with a single filter even before runId is added.

Beyond the silent code paths identified in §v, three pieces of observability infrastructure exist in the codebase but produce no output in production. Each is fully designed, partially implemented, and then stopped short of being activated.
The monitoring/ module defines a complete structured logging schema: 22 LogEvent types, 5 LogPhase values, typed sub-objects for context, metrics, and error detail, and a createLogger() factory that produces a JobLogger bound to a specific job's identity. This is the infrastructure this branch (feat/logging-monitoring) was created to build.
No batch job or cron job currently imports or calls createLogger(). GCP Cloud Logging receives flat JSON today. The monitoring module exists only on disk.
Without runId, isolating one job execution in GCP Log Explorer requires filtering by Cloud Run's execution_id label — which works, but isn't surfaced in the log entries themselves. Without event and phase fields, GCP alerting policies cannot distinguish a per-item failure from a job-fatal crash. Without errorCode, programmatic alert routing is impossible.
packages/llm-batch/tracing.ts wraps every Anthropic Batch API call in an OpenTelemetry span via a tracedFetch() helper. It adds gen_ai.* semantic attributes, sanitises URLs, and sets span status on HTTP errors. The implementation is complete and correct.
Activation requires two things that are not currently set in production: the OTEL_DENO=true environment variable (gates all tracing calls behind a no-op check), and an OTEL exporter endpoint (OTEL_EXPORTER_OTLP_ENDPOINT or equivalent). Without both, every tracedFetch() call is a plain fetch() — no spans are emitted anywhere.
Set OTEL_DENO=true and configure an OTEL collector endpoint in the Cloud Run job env vars.

There is no Cloud Monitoring (formerly Stackdriver Metrics) setup in the Terraform configuration. No custom dashboards, no uptime checks, no alerting policies, no log-based metrics. Job success and failure are recorded as log events but never elevated to time-series metrics.
This means there is no way to answer operational questions like "how many jobs failed this week?" or "what is the P95 duration of the portfolio insights batch?" without writing ad-hoc Log Explorer queries. There are no alerting policies that fire when a job fails — failure is visible in logs, but only if someone is watching.
A batch-portfolio-insights crash produces an ERROR log in Cloud Logging, but no notification fires. Detection relies on downstream effects (missing insights) or someone manually checking logs.
The following modules handle real user data and real-time operations but emit nothing to stdout or stderr under any condition. When they fail silently, the only signal is a downstream symptom — stale portfolio data, missing webhook updates, absent sentiment scores — with no log trail to explain the gap.
| Package | Debug | Info | Warn | Error | Total | Status |
|---|---|---|---|---|---|---|
| apps/ (batch jobs) | 1 | 68 | 28 | 46 | 143 | Well logged |
| packages/snaptrade/ | 0 | 15 | 36 | 24 | 75 | Well logged |
| packages/llm-batch/ | 1 | 6 | 0 | 0 | 7 | Partial |
| packages/portfolio/ | 2 | 4 | 0 | 0 | 6 | Partial |
| packages/risk-analysis/ | 1 | 2 | 0 | 0 | 3 | Partial |
| packages/trading-days/ | 1 | 0 | 1 | 1 | 3 | Partial |
| packages/api/ | 0 | 2 | 0 | 0 | 2 | Minimal |
| packages/snaptrade/webhook.ts | — | — | — | — | 0 | Silent |
| packages/snaptrade/portfolio_intraday.ts | — | — | — | — | 0 | Silent |
| packages/watchlist/service.ts | — | — | — | — | 0 | Silent |
| jobs/news/sentiment/ | — | — | — | — | 0 | Silent |
Webhook processing (webhook.ts) is the highest-risk silent area. Webhooks are the primary mechanism by which SnapTrade notifies Castello of account changes — a silent failure here means user portfolio state diverges from reality without any observable log trail. Intraday portfolio compute (portfolio_intraday.ts) is similarly high-frequency and user-visible.
For each silent module: one INFO at entry (event type, entity ID), one WARN on any non-fatal skip or data gap, one ERROR on any unrecoverable failure. That's three log sites minimum to make a module observable. Comprehensive tracing can follow incrementally.
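The three-site minimum can be sketched on a webhook-style handler. Everything here is illustrative: the `WebhookEvent` shape, the `apply` callback, and the injected `log` function are assumptions for the example, not the contents of packages/snaptrade/webhook.ts.

```typescript
// Hypothetical handler showing the three minimum log sites: entry, skip, failure.
type Log = (
  severity: "INFO" | "WARNING" | "ERROR",
  message: string,
  data?: Record<string, unknown>,
) => void;

interface WebhookEvent {
  type: string;
  accountId?: string;
}

async function handleWebhook(
  event: WebhookEvent,
  apply: (accountId: string, type: string) => Promise<void>,
  log: Log,
): Promise<"applied" | "skipped" | "failed"> {
  // Site 1: INFO at entry with event type and entity ID.
  log("INFO", "Webhook received", { eventType: event.type, accountId: event.accountId });

  if (!event.accountId) {
    // Site 2: WARN on a non-fatal skip; processing continues with the next event.
    log("WARNING", "Webhook skipped: no account ID", { eventType: event.type });
    return "skipped";
  }

  try {
    await apply(event.accountId, event.type);
    return "applied";
  } catch (err) {
    // Site 3: ERROR on an unrecoverable failure for this event.
    const message = err instanceof Error ? err.message : String(err);
    log("ERROR", "Webhook processing failed", {
      eventType: event.type,
      accountId: event.accountId,
      message,
    });
    return "failed";
  }
}
```

The successful path stays quiet beyond the entry INFO, which matches the level conventions described earlier: INFO for lifecycle, WARN when the operation continued, ERROR when it stopped.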
Every product in _next has some logging. Almost none have traces or metrics. The two pillars that make logging actionable — knowing where time was spent and whether the system is healthy over time — are either no-ops in production or absent entirely.
| Product | Logs | Traces | Metrics |
|---|---|---|---|
| LLM batch jobs | | | |
| batch-portfolio-insights (apps/batch-portfolio-insights) | Partial. Flat JSON via core/logging. Good lifecycle coverage. No runId, no event types. JobLogger not connected. | No-op in prod. Anthropic calls traced indirectly via llm-batch/anthropic.ts → tracedFetch(). No job-level root span. OTel SDK not configured. | No-op in prod. BatchMetricsTracker fires OTel counter + histogram on finalize. Token counts also persisted to DB. OTel instruments no-op without SDK. |
| batch-risk-analysis (apps/batch-risk-analysis) | Partial. Same as portfolio-insights. Good coverage, flat schema, no JobLogger. | No-op in prod. Same indirect tracing via llm-batch/anthropic.ts. No root span. | No-op in prod. Same BatchMetricsTracker pattern. Token counts to DB. |
| News pipeline | | | |
| news-ingest (jobs/news/ingest) | Partial. Progress every 500 tickers, per-ticker failures. Silent on individual successes. No JobLogger. | None. Finnhub API calls use raw fetch() with no tracing wrapper. Zero span coverage. | None. Counts tracked as local variables and logged at job end. No OTel instruments. |
| news-sentiment-submit (jobs/news/sentiment/submit) | Partial. Thin main.ts delegates to shared runner. Logging behaviour is inside the runner, not directly auditable from main. | Likely no-op. Runner likely uses llm-batch for Anthropic submission. If so, indirect tracing applies. SDK still not configured. | Likely no-op. Runner likely uses BatchMetricsTracker. OTel no-op without SDK. |
| news-sentiment-write (jobs/news/sentiment/write) | Partial. Same delegation pattern as submit. Logging inside shared runner. | None. Write phase reads Anthropic results from DB, no outbound API calls. No spans. | Likely no-op. Runner likely uses BatchMetricsTracker for write-phase token counts. |
| batch-poll (jobs/batch/poll) | Partial. Thin main.ts delegates to runBatchPoll(). Per-poll cycle logging inside runner. | Likely no-op. Polls Anthropic batch status via llm-batch. Likely indirect tracing on status calls. SDK not configured. | None. Poll phase does not accumulate result metrics; those belong to the write phase. |
| Cron jobs | | | |
| snaptrade-sync-reconcile (apps/snaptrade-sync-reconcile) | Partial. Best-logged cron job. Per-pass structured counts (recovered, retried, stuck, timed-out). Silent on successful state transitions. | None. SnapTrade API calls go through @castello/snaptrade package with raw fetch and retry logic. No tracing wrapper. | None. Pass-level counts logged as structured fields. No OTel instruments anywhere in this job. |
| cache-market-movers (apps/cache-market-movers) | Partial. Checkpoint-only. Start, fetch failures, no-rows abort, RPC call, final success. No per-mover detail. | None. Yahoo Finance fetched via raw fetch() with header signing. No tracing. | None. No metrics of any kind. |
| fetch-openexchangerates-fx-rates (apps/fetch-openexchangerates-fx-rates) | Partial. Checkpoint-only. Trading day check, fetch failure, invalid/overflow rates, success. Silent on per-rate processing. | None. OXR API called via raw fetch(). No tracing. | None. No metrics of any kind. |
| update-recap-page-paths (apps/update-recap-page-paths) | Partial. Per-ISIN failures and skips logged. Successful path upserts are silent. Summary count at end. | None. HEAD requests to recap server via safeFetch(). No span instrumentation. | None. No metrics of any kind. |
| API layer | | | |
| packages/api (oRPC): handler · middleware · procedures | Partial. Procedure entry/exit + duration via logging middleware. No domain-level logging at the handler boundary. Error codes logged at WARN for unmapped errors. | None. No span instrumentation at the handler or middleware level. Domain packages called from procedures have their own logging but no traces. | None. No request rate, error rate, or latency metrics. No OTel instruments anywhere in the API layer. |

Logs

- No traceId or spanId is injected into log entries. When a traced Anthropic call fails, you cannot jump from the span to the log entries from that same execution. The two signals are parallel but unlinked.
- The structured schema (runId, event, phase, errorCode, structured context and metrics sub-objects) exists in monitoring/ but no product imports it. All 11 products emit flat JSON with no shared identity fields.

Traces

- Outside llm-batch, external API calls go through fetch() with no instrumentation. Latency, errors, and retry patterns are invisible as traces.
- Anthropic calls are wrapped in spans via llm-batch/anthropic.ts, but there is no parent span for the job execution itself. Individual Anthropic calls are trace events, but there is no trace that represents the full job run.
- OTEL_DENO=true and an exporter endpoint are required but not set in any Cloud Run job environment.

Metrics

- Only the LLM batch jobs define metrics (via BatchMetricsTracker). Every cron job, the news pipeline jobs, and the API produce no time-series metrics of any kind.
- The OTel instruments in BatchMetricsTracker are correctly defined, but silently discarded: same SDK/exporter gap as traces. Token counts are also written to the DB as a workaround, but that is queryable data, not a metrics system.

The plan activates all three observability pillars using only GCP services and the instrumentation already built on this branch. Steps 1–3 are pure infrastructure — no application code changes, no risk to running jobs. Steps 4–6 are code changes that deepen what the infrastructure can surface.
Deploy a Cloud Run Service running the OpenTelemetry Collector Contrib image. Configure it with a single YAML: accept OTLP/HTTP on port 4318, export to GCP via the googlecloud exporter. Set ingress to internal-only so it is reachable by other Cloud Run services but not the public internet.
Create a dedicated otel-collector service account and grant it roles/cloudtrace.agent and roles/monitoring.metricWriter. The collector handles all GCP authentication — application jobs never touch credentials for telemetry again. Token refresh is automatic via ADC.
Add three environment variables to every Cloud Run job spec in Terraform. OTEL_DENO=true activates Deno's built-in OTel SDK. OTEL_EXPORTER_OTLP_ENDPOINT points to the collector's internal Cloud Run URL. OTEL_SERVICE_NAME labels spans by job for grouping in Cloud Trace.
This single Terraform apply gives you: every fetch() call in every job as a trace span — Finnhub, Yahoo Finance, OXR, SnapTrade, Anthropic — all automatically, with no code changes. The existing tracedFetch() and withSpan() calls in llm-batch activate on top. BatchMetricsTracker token and duration metrics start flowing to Cloud Monitoring.
Cloud Run Jobs already emit free execution metrics to Cloud Monitoring — execution count labelled by exit code, and execution latency — but nothing is watching them. Create a google_monitoring_dashboard that surfaces these alongside the custom OTel metrics arriving from step 2.
Create two alerting policies: one on run.googleapis.com/job/completed_execution_count filtered to exit code 1 (fatal crash), and one on log-severity ERROR rate per job using a google_logging_metric. This is the first time a job failure will produce a notification rather than silent log entries.
Replace log() calls with createLogger() from @castello/monitoring in each job's main.ts. Every log entry gains runId (a UUID stable across one execution), event, phase, and errorCode — the fields the schema was designed around. Switch fatal catch blocks from log(LogLevel.ERROR) to logger.critical() before Deno.exit(1), activating the CRITICAL severity tier in GCP for the first time.
Once deployed, update the log-based metric filter in Terraform from severity>=ERROR to jsonPayload.event="JOB_FATAL" — a more precise signal that eliminates false positives from per-item errors.
Modify packages/core/logging.ts and monitoring/logger.ts to read the active OTel span context at the moment each log entry is written. If a span is active, inject two fields that GCP Cloud Logging recognises natively: logging.googleapis.com/trace (the full trace resource path) and logging.googleapis.com/spanId.
After this change, every log entry written inside a traced operation displays a "View in Trace" link in Cloud Logging's Log Explorer. You can jump from a specific WARN or ERROR log line directly to the Cloud Trace span that contains it — closing the gap between the two previously unlinked signals.
Wrap each job's top-level execution in a withSpan() call. This creates a single parent span representing the full job run, under which all auto-instrumented fetch() spans and explicit withSpan() blocks become children. Without this, Cloud Trace shows individual spans but no trace that represents "this was one execution of batch-portfolio-insights."
Attach the runId from the JobLogger as a span attribute so the trace and the log entries for one execution share a common identifier — making cross-signal correlation possible without relying on timestamps or Cloud Run's execution_id label.
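A root span with the runId attached might be sketched as below. The signature of the existing withSpan() helper is not shown in this survey, so the example uses small structural stand-ins modelled on the `startActiveSpan`, `setAttribute`, `recordException`, `setStatus`, and `end` methods of @opentelemetry/api. The `withRootSpan` name and the `job.run_id` attribute key are hypothetical.

```typescript
// Structural stand-ins for the parts of the OTel API this sketch relies on.
interface SpanLike {
  setAttribute(key: string, value: string): unknown;
  recordException(err: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_ERROR = 2;

// Wraps one job execution in a parent span and tags it with the logger's runId,
// so traces and log entries for the same run share an identifier.
function withRootSpan<T>(
  tracer: TracerLike,
  jobName: string,
  runId: string,
  run: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(jobName, async (span) => {
    span.setAttribute("job.run_id", runId);
    try {
      return await run();
    } catch (err) {
      const e = err instanceof Error ? err : new Error(String(err));
      span.recordException(e);
      span.setStatus({ code: STATUS_ERROR, message: e.message });
      throw err;
    } finally {
      // Always end the span, or it is never exported.
      span.end();
    }
  });
}
```

The span is ended in `finally` so that a crashing job still exports its root span; the error is rethrown so the job's own fatal handling still runs.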
- Every fetch() call in every job appears as a span in Cloud Trace
- Every log entry carries runId, event, phase, errorCode
- Alerts key on JOB_FATAL: no more false positives
- runId ties logs and traces together by a shared identifier

Steps 1–3 are independently shippable and carry no application risk. They give you most of the operational value (alerting, dashboards, trace coverage) without touching a single line of job code. Steps 4–6 deepen the signal but depend on steps 1–3 being in place first: there is no point injecting trace IDs into log entries before traces are flowing.
Step 2 — OTEL_SERVICE_NAME is required, not optional
OTEL_SERVICE_NAME is the label Cloud Trace uses to group and filter spans. Without it, every span from every job appears under unknown_service — all ten jobs' traces merged into one unnavigable list. It must be set per job in Terraform, matching the job name exactly.
| Job | OTEL_SERVICE_NAME value |
|---|---|
| batch-portfolio-insights | batch-portfolio-insights |
| batch-risk-analysis | batch-risk-analysis |
| news-ingest | news-ingest |
| news-sentiment-submit | news-sentiment-submit |
| news-sentiment-write | news-sentiment-write |
| batch-poll | batch-poll |
| snaptrade-sync-reconcile | snaptrade-sync-reconcile |
| cache-market-movers | cache-market-movers |
| fetch-openexchangerates-fx-rates | fetch-openexchangerates-fx-rates |
| update-recap-page-paths | update-recap-page-paths |
Step 5 — trace field format must be the full resource path
GCP Cloud Logging only draws the link between a log entry and a Cloud Trace span if the logging.googleapis.com/trace field contains the full resource path — not just the trace ID hex string. The OTel API returns a 32-character hex string from spanContext().traceId. Step 5 must construct the full path before writing it to the log entry.
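```ts
import { trace } from "@opentelemetry/api";

// Project ID for the trace resource path, read from the env var present on all Cloud Run jobs.
const PROJECT_ID = Deno.env.get("GOOGLE_CLOUD_PROJECT_ID") ?? "";

function getTraceContext(): Record<string, string | boolean> {
  const span = trace.getActiveSpan();
  // No active span, or no project ID to build the path from: emit no trace fields.
  if (!span || !PROJECT_ID) return {};
  const { traceId, spanId, traceFlags } = span.spanContext();
  return {
    // Full path required; the hex string alone does not work.
    "logging.googleapis.com/trace": `projects/${PROJECT_ID}/traces/${traceId}`,
    "logging.googleapis.com/spanId": spanId,
    // Bit 0 of traceFlags is the sampled flag.
    "logging.googleapis.com/trace_sampled": (traceFlags & 1) === 1,
  };
}
```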
This function is called inside packages/core/logging.ts and monitoring/logger.ts at the point each entry is serialised. If no span is active (e.g. code running outside a withSpan() block), it returns an empty object and the log entry is written without trace fields — this is correct behaviour, not an error. PROJECT_ID is read from the GOOGLE_CLOUD_PROJECT_ID env var already present on all Cloud Run jobs.
Both of these details fail silently. If OTEL_SERVICE_NAME is omitted, Cloud Trace still receives spans — they just all appear under unknown_service with no indication anything is wrong. If the trace field is written as a bare hex string instead of the full path, Cloud Logging still accepts the log entry — it just never draws the link to Cloud Trace. Neither produces an error. Verify both in a staging run before treating the feature as working.
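Since the trace-field mistake produces no error, a staging check could assert the format directly on emitted entries. The helper below is a hypothetical sketch: `checkTraceFields` is not part of the codebase, and it assumes the standard OTel ID lengths (32 hex characters for a trace ID, 16 for a span ID).

```typescript
// Hypothetical staging check for the two special Cloud Logging trace fields.
const TRACE_PATH = /^projects\/[^/]+\/traces\/[0-9a-f]{32}$/;
const SPAN_ID = /^[0-9a-f]{16}$/;

// Returns a list of problems; an empty list means the entry is well-formed.
function checkTraceFields(entry: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const traceField = entry["logging.googleapis.com/trace"];
  const spanField = entry["logging.googleapis.com/spanId"];

  // Entries written outside any span carry no trace fields; that is valid.
  if (traceField === undefined && spanField === undefined) return problems;

  if (typeof traceField !== "string" || !TRACE_PATH.test(traceField)) {
    problems.push("trace must be projects/<project>/traces/<32 hex chars>");
  }
  if (typeof spanField !== "string" || !SPAN_ID.test(spanField)) {
    problems.push("spanId must be 16 hex chars");
  }
  return problems;
}
```

Running this over a sample of parsed log lines from a staging execution would surface the bare-hex-string case that Cloud Logging itself accepts silently.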