Operating Roost

This page covers the day-2 operational surface — pausing queues, cancelling jobs, recovering from worker crashes, and bulk-reviving the dead-letter pile.

Queue pause / resume

Halt a queue without killing workers — useful when a downstream service is down:

roost queue pause emails
roost queue resume emails
roost queue list

Workers will not fetch jobs from a paused queue. Jobs already in flight finish normally.

Cancellation

roost cancel 42

If the job is available or retryable, it’s moved straight to cancelled. If it’s currently executing, Roost flips cancel_requested = true and fires NOTIFY roost_cancel_requested. Every running worker listens on that channel and cancels the matching asyncio.Task. The handler sees an asyncio.CancelledError and the row finalizes to cancelled.

Handlers that need to clean up gracefully should catch CancelledError, do their cleanup, and re-raise.

Per-job timeouts

await roost.enqueue(send_invoice, args={"id": 7}, timeout_seconds=30)

A timeout uses asyncio.wait_for around the handler. The job goes to retryable (or discarded if attempts exhausted) with the TimeoutError recorded in the errors array.

Worker heartbeats and orphan recovery

Every worker writes its identity (host, pid, queues, in-flight count) into roost.workers on the heartbeat interval (default 15s). Stale rows are GC’d by peer workers.

If a worker is SIGKILL’d while a job is executing, the row stays executing until the orphan reaper running in another worker picks it up. Default staleness window is 5 minutes (--orphan-stale-after). On reap:

  • attempt < max_attempts → back to retryable, scheduled to run immediately.

  • attempt = max_attemptsdiscarded, with WorkerCrash recorded in errors.

Mass requeue

Resurrect everything in the dead-letter pile:

roost requeue --discarded

This zeroes the attempt counter and clears cancel_requested. Use sparingly.

Workers list

roost workers

Shows current heartbeats — host, pid, queues, last seen.