
Diagnostics & Process Pools

Monitor Bifrost worker containers, fork pools, queue depth, and memory usage from the Diagnostics dashboard.

The Diagnostics dashboard is the operator-facing view into Bifrost’s worker containers, fork-based process pools, queue depth, and live memory usage. It’s the first place to look when executions are slow, queueing up, or crashing.

This guide is for self-hosted operators and on-call engineers who need to:

  • Confirm worker containers are connected and healthy
  • Spot a hot fork that’s chewing memory before it gets OOM-killed
  • Triage backlog when the queue depth alert fires
  • Trace which worker ran a specific execution

If you’re a workflow author looking at a single failed run, use the Execution History page instead.

Navigate to Settings → Diagnostics, or go directly to /diagnostics.

Diagnostics dashboard overview

The page has one tab today (Process Pools). The header shows a Live indicator when the WebSocket worker stream is connected — if it says “Connecting…” instead, REST snapshots will still populate the page but you won’t get sub-second updates.

The right-hand summary reports N container(s) · M fork(s):

  • Container — one Bifrost worker pod / container. Each container registers itself on startup over Redis.
  • Fork — one OS-level worker process inside a container. Workflows run in forks; each fork executes one task at a time.

If you’re sized for 4 containers × 8 forks, you should see 4 containers · 32 forks. Anything lower means a pod has died or hasn’t registered yet — check the pod’s logs.
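The sizing check above is simple arithmetic, and can be sketched as a small helper. This is only an illustration: the function name and its parameters are hypothetical, and the numbers are read off the dashboard summary by hand rather than fetched from any Bifrost API.

```python
def capacity_gap(expected_containers, forks_per_container,
                 seen_containers, seen_forks):
    """Compare the sizing plan with what the dashboard summary reports.

    All names here are hypothetical; the inputs are typed in from the
    "N container(s) · M fork(s)" summary, not read from Bifrost.
    """
    expected_forks = expected_containers * forks_per_container
    return {
        "expected_forks": expected_forks,
        "missing_containers": max(0, expected_containers - seen_containers),
        "missing_forks": max(0, expected_forks - seen_forks),
    }


# Sized for 4 containers x 8 forks, but the summary shows 3 containers and 24 forks.
print(capacity_gap(4, 8, seen_containers=3, seen_forks=24))
# {'expected_forks': 32, 'missing_containers': 1, 'missing_forks': 8}
```

A non-zero missing_containers value points at a pod that has died or not yet registered; a non-zero missing_forks value with all containers present suggests a container running with fewer forks than planned.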

The badge next to the Live indicator shows the current depth of the execution queue (RabbitMQ). A small backlog (single digits) is normal during bursts. Sustained double-digit backlogs mean you don’t have enough fork capacity for the incoming rate — scale workers, raise pool size, or look for a slow workflow.
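To make the difference between a burst and a sustained backlog concrete, here is a minimal sketch that classifies a series of queue-depth readings. The threshold of 10 mirrors the "double-digit" rule of thumb above; the sample window (min_consecutive) and the way samples are collected are assumptions, not Bifrost settings.

```python
def is_sustained_backlog(depth_samples, threshold=10, min_consecutive=6):
    """Return True if the latest `min_consecutive` samples are all >= threshold.

    `depth_samples` is a list of queue-depth readings, oldest first.
    `threshold` and `min_consecutive` are illustrative defaults, not
    values taken from Bifrost.
    """
    if len(depth_samples) < min_consecutive:
        return False
    recent = depth_samples[-min_consecutive:]
    return all(depth >= threshold for depth in recent)


burst = [0, 3, 25, 4, 1, 0, 2]          # one spike that drains on its own
backlog = [4, 12, 15, 11, 18, 22, 30]   # stays in double digits

print(is_sustained_backlog(burst))    # False
print(is_sustained_backlog(backlog))  # True
```

Requiring several consecutive readings above the threshold avoids paging on a single spike that the pool drains by itself.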

Memory chart pane

A live time-series of total RSS per container, color-coded per worker. The chart retains the last few minutes of data and updates as WebSocket heartbeats arrive.

Patterns worth alerting on:

  • Sustained climb on a single container while others are flat — usually a workflow leaking memory in a fork. The Container Table will show which fork is holding the bag.
  • All containers near their pod memory limit — you’re undersized for the workload. Add containers or reduce WORKER_POOL_SIZE.
  • Sawtooth pattern — forks are being recycled (good — process-pool memory pressure is being reclaimed). Steep teeth mean a single workflow is grabbing a lot of RSS per run; consider chunking it or tightening its data handling.
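The first pattern in the list above (one container climbing while the others stay flat) can be checked mechanically if you export RSS samples per container. The sketch below assumes a hypothetical input shape (container name mapped to a list of RSS readings in MB, oldest first) and illustrative thresholds; none of these come from Bifrost.

```python
def lone_climber(rss_by_container, climb_mb=200.0, flat_mb=50.0):
    """Return the one container whose RSS climbed while all others stayed flat.

    `rss_by_container` maps a container name to RSS samples in MB, oldest
    first. `climb_mb` and `flat_mb` are illustrative thresholds. Returns
    None when the pattern does not match.
    """
    growth = {
        name: samples[-1] - samples[0]
        for name, samples in rss_by_container.items()
        if len(samples) >= 2
    }
    climbers = [name for name, g in growth.items() if g >= climb_mb]
    flat = [name for name, g in growth.items() if abs(g) <= flat_mb]
    # Exactly one climber, and every other container is flat.
    if len(climbers) == 1 and len(flat) == len(growth) - 1:
        return climbers[0]
    return None


samples = {
    "worker-a": [500, 520, 510],
    "worker-b": [500, 650, 800],
    "worker-c": [480, 470, 490],
}
print(lone_climber(samples))  # worker-b
```

Because the check compares only the first and last sample, a sawtooth that ends near where it started reads as flat, which matches the guidance that recycling is healthy.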

Container table with fork details

One row per container, expandable to show its forks. Per fork you see:

Column             Meaning
PID                OS process ID inside the container.
State              idle, busy, terminating, or killed. A fork stuck in busy for a long time is the most common smell.
Memory (RSS)       Resident set size in MB. Compare to the container’s pod memory limit.
Current execution  Execution ID being processed, if any. Click through to the execution detail page.
Uptime             How long this fork has been alive since the last spawn/recycle.

A fork that’s been busy for longer than your WORKFLOW_TIMEOUT is wedged: either the workflow is in a tight loop or the asyncio event loop is blocked by a synchronous call. Kill the pod, then inspect the worker logs (capture them first if they are not shipped off the pod).
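The wedged-fork rule can be sketched as a filter over fork records. The record shape is an assumption: the dashboard shows state and current execution, but the busy_seconds field below is hypothetical (you would need to derive how long a fork has been busy yourself), and the timeout is assumed to be expressed in seconds.

```python
def wedged_forks(forks, workflow_timeout_s):
    """Return forks that have been busy longer than the workflow timeout.

    `forks` is a list of dicts with hypothetical keys: pid, state,
    busy_seconds, execution_id. `workflow_timeout_s` is your
    WORKFLOW_TIMEOUT, assumed here to be in seconds.
    """
    return [
        fork for fork in forks
        if fork["state"] == "busy" and fork["busy_seconds"] > workflow_timeout_s
    ]


forks = [
    {"pid": 101, "state": "busy", "busy_seconds": 45, "execution_id": "exec-a"},
    {"pid": 102, "state": "busy", "busy_seconds": 1900, "execution_id": "exec-b"},
    {"pid": 103, "state": "idle", "busy_seconds": 0, "execution_id": None},
]

for fork in wedged_forks(forks, workflow_timeout_s=900):
    print(fork["pid"], fork["execution_id"])  # 102 exec-b
```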

The page gets its data through two paths:

  • REST snapshot — the page mounts with a one-shot /api/workers/pools call so the table is populated immediately even if the WebSocket hasn’t connected yet.
  • WebSocket heartbeats — each container streams its state every few seconds; the table merges in updates by worker_id. The Live dot pulses green while connected.

Click the Refresh button in the top-right to force a REST snapshot if you suspect the live stream is stale.
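The snapshot-then-merge behaviour described above can be illustrated with a keyed table. This is a sketch of the general pattern, not the dashboard's actual code: the payload shape (a list of container dicts, each carrying a worker_id) is an assumption, and the function names are hypothetical.

```python
def table_from_snapshot(snapshot):
    """Build the table from a one-shot snapshot, keyed by worker_id.

    `snapshot` is assumed to be a list of container dicts, each with a
    "worker_id" key; the real payload shape may differ.
    """
    return {container["worker_id"]: container for container in snapshot}


def merge_heartbeat(table, heartbeat):
    """Return a new table where the heartbeat replaces that worker's entry.

    Unknown worker_ids are added, so a container that registers after the
    snapshot still shows up.
    """
    merged = dict(table)
    merged[heartbeat["worker_id"]] = heartbeat
    return merged


table = table_from_snapshot([
    {"worker_id": "w1", "fork_count": 8},
    {"worker_id": "w2", "fork_count": 8},
])
table = merge_heartbeat(table, {"worker_id": "w2", "fork_count": 7})
table = merge_heartbeat(table, {"worker_id": "w3", "fork_count": 8})
print(sorted(table))               # ['w1', 'w2', 'w3']
print(table["w2"]["fork_count"])   # 7
```

Keying by worker_id is what lets a forced Refresh and the live stream coexist: whichever arrives last for a given container overwrites that row only.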

“Queue depth is climbing and won’t drain.” Check the container table for forks stuck busy. If most forks are idle, your bottleneck isn’t workers — look at the API or RabbitMQ. If most are busy with the same workflow path, that workflow is the choke point.
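The triage logic in the paragraph above can be written down as a small decision function. It is a sketch under assumptions: the fork records and their "workflow" key are hypothetical (the dashboard shows an execution ID, so you would look up the workflow path from the execution detail page), and the 50% cut-off for "most" is an illustrative choice.

```python
from collections import Counter


def triage_backlog(forks, majority=0.5):
    """Classify a climbing queue from the state of the forks.

    `forks` is a list of dicts with hypothetical keys "state" and
    "workflow" (the workflow path of the current execution, or None).
    `majority` is the illustrative share that counts as "most".
    """
    if not forks:
        return "no forks registered"
    idle = sum(1 for fork in forks if fork["state"] == "idle")
    if idle / len(forks) > majority:
        return "workers mostly idle: look at the API or RabbitMQ"
    busy_paths = Counter(
        fork["workflow"] for fork in forks
        if fork["state"] == "busy" and fork.get("workflow")
    )
    if busy_paths:
        path, count = busy_paths.most_common(1)[0]
        if count / len(forks) > majority:
            return f"choke point: {path}"
    return "no single cause: check overall fork capacity"


forks = [
    {"state": "busy", "workflow": "reports/nightly"},
    {"state": "busy", "workflow": "reports/nightly"},
    {"state": "busy", "workflow": "reports/nightly"},
    {"state": "idle", "workflow": None},
]
print(triage_backlog(forks))  # choke point: reports/nightly
```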

“A pod was just OOM-killed.” Pull up the memory chart; you’ll see the line climbing right up to the kill. Cross-reference the fork that was busy at that timestamp with the Current execution column to identify the workflow. File a ticket with that workflow’s author and include the execution ID.

“The Live indicator is stuck on ‘Connecting…’.” The browser tab talks to the API, and the API talks to Redis to gather worker heartbeats. If the dot never goes green, run redis-cli ping from the API pod, then restart the API pod.

The diagnostics surface intentionally only covers the execution worker pool. It does not show:

  • Scheduler health (use scheduler logs / the deferred execution promoter, see Scheduled Executions)
  • API request rates (use your reverse-proxy / APM)
  • Database / RabbitMQ / S3 health (those are infrastructure dashboards)
  • Per-workflow latency or token spend (use Usage Reports)