Workflow runtime
The PostGrip workflow runtime is durable: a workflow body can run for hours or days, survive worker restarts, recover from failed activities with retries, and react to signals delivered while it was paused. This page explains how that works in the Go SDK so you can write workflows that behave correctly under all of those conditions.
The replay model
Every time the runtime service hands a workflow task to a worker, the worker:
- Fetches the full durable history of the workflow from the runtime service.
- Builds an in-memory cursor (
internal/replay.Replay) over that history. - Constructs a fresh
workflow.Contextand runs your workflow body from the top, passing it that context. - Each call from your body to
ctx.Sleep,ctx.ExecuteActivity,ctx.ExecuteChildWorkflow, orctx.GetSignalChannel().Receive()consults the replay cursor before doing anything new.
The cursor advances by one event per call (per command type). What happens at each call:
| Replay state for this command | What happens |
|---|---|
| History records this exact command, and it completed | The persisted result is decoded into your target. |
| History records this exact command, still in flight | The body returns *workflow.Suspended. |
| History is exhausted past this point | The worker enqueues a fresh command and the body returns *workflow.Suspended. |
| History records a different command at this position | The worker raises a non-retryable WorkflowDeterminismViolation failure. |
When the body returns Suspended, the worker calls BlockTask on the runtime service. The workflow task moves to the blocked state — not failed. The runtime service redelivers the task whenever a dependency resolves (an activity completes, a timer fires, a signal arrives), at which point the worker re-runs your body from the top with fuller history.
Workflow bodies must be deterministic. Because the body re-runs on every redelivery, anything that varies between runs — calling
time.Now()directly, reading wall-clock-driven random IDs, iterating a map and using order to schedule commands — will eventually produce aWorkflowDeterminismViolation. Usectx.Now()for time, deterministic IDs for any per-step identifiers, and stable iteration order (e.g. sort keys) when looping.
Activities
Activities are the right place for non-deterministic work: HTTP calls, database queries, anything that touches the outside world or wall-clock state.
func MyWorkflow(ctx workflow.Context, args []any) (any, error) {
var resp ApiResponse
err := ctx.ExecuteActivity("FetchUser", []any{userID}, &resp, &workflow.ActivityOptions{
LeaseTimeoutSeconds: 30,
Retry: &workflow.RetryPolicy{MaximumAttempts: 5},
})
if err != nil {
return nil, err
}
return resp.Name, nil
}
The runtime service handles retries based on RetryPolicy. From the workflow body’s perspective, ExecuteActivity either eventually returns the activity’s result or returns the failure that exhausted retries — it doesn’t matter how many attempts happened underneath.
If the activity returned a failure.Application with NonRetryable: true, the runtime service skips retries and surfaces the failure immediately. Use failure.NewNonRetryable(...) from inside an activity for permanent errors (validation failures, “not found”, etc.).
Timers
ctx.Sleep(d) is not time.Sleep. It enqueues a durable timer task with the runtime service:
if err := ctx.Sleep(10 * time.Minute); err != nil {
return nil, err
}
The first time your body reaches that line, the timer is enqueued and the body returns Suspended. When the timer fires, the runtime service redelivers, your body re-runs, and on the second pass ctx.Sleep sees the recorded TimerFired event and returns nil immediately so execution continues past the sleep.
The duration argument is part of the determinism check — varying it between runs raises a violation.
Child workflows
ctx.ExecuteChildWorkflow schedules a separate workflow execution and waits for its result. The semantics mirror ExecuteActivity (history-recorded, suspends, returns result on the lease that observes completion) but the child runs its own replay loop.
Signals
Signals are inputs sent into a running workflow from outside. Workflow code receives them through a channel:
ch := ctx.GetSignalChannel("ready")
args, err := ch.Receive(ctx)
if err != nil {
return nil, err
}
Receive is non-blocking in the traditional sense. On replay, the channel is pre-seeded with every signal recorded in history; Receive drains the buffer one entry per call. When the buffer empties, Receive returns Suspended. The worker blocks the task; the runtime service redelivers when a new signal arrives; the next replay re-seeds the buffer with the additional signal and the body proceeds.
This means: Receive doesn’t actually wait for new signals during a single workflow execution. It returns whatever’s already buffered, then suspends until the runtime service has more.
ContinueAsNew
Long-running workflows accumulate history. Eventually that history gets big enough to make replay slow and history-fetch round-trips expensive. The fix is ContinueAsNew: end the current run and atomically schedule a new run with a fresh history.
func LongRunner(ctx workflow.Context, args []any) (any, error) {
counter, _ := args[0].(int)
for i := 0; i < 1000; i++ {
// ... do work, schedule activities, etc.
counter++
}
if counter < 1_000_000 {
return ctx.ContinueAsNew(workflow.ContinueAsNewOptions{
Args: []any{counter},
})
}
return counter, nil
}
ContinueAsNew is modeled as an error sentinel internally so workflow code can return it directly. The worker recognizes the sentinel and tells the runtime service to schedule the new run instead of marking the current one failed. Customer code typically just writes return ctx.ContinueAsNew(...).
Cancellation
When the runtime service receives WorkflowCancellationRequested, the next replay will see that history event. ctx.Sleep, ctx.ExecuteActivity, and ctx.ExecuteChildWorkflow all check for cancellation before scheduling new commands and return *failure.Cancelled if requested. Customer code can check with failure.IsCancelled(err).
To cancel from the client side: handle.Cancel(ctx, "reason").
For activities to know about cancellation, the activity body should respect ctx.Done() on its context.Context — that’s the standard Go cancellation channel, propagated through.
What happens on worker crash
If the worker crashes mid-task, the runtime service notices via heartbeat-loss and redelivers the task to another worker. Replay does the rest: the new worker runs your body from the top, sees the same history, and continues from where the previous worker left off.
This is why workflow bodies must be idempotent under re-invocation. If your body has a side effect outside of ctx.ExecuteActivity (e.g. directly hitting a database from the workflow body), it will run again on every redelivery.
Querying state
Workflow bodies can register read-only query handlers:
ctx.SetQueryHandler("status", func(args []any) (any, error) {
return currentStatus, nil
})
Queries don’t trigger any commands — they read in-memory state during replay. The runtime service routes a query:<name> task to the worker, which replays the workflow up to the registration point and invokes the handler.
Query / update task dispatch isn’t yet wired through
Workerin this SDK version. The handlers register fine but executing them requires a workflow runtime that keeps state across queries — that’s on the roadmap. For now, query/update task types surface as unsupported inWorker.dispatch.
Updates
Updates are like signals but synchronous from the client’s perspective. The client blocks until the update handler returns; the handler may schedule commands. Like queries, registration works in this SDK but dispatch is gated behind the same future Worker capability.