Crash recovery and broker-gap reconciliation

Two problems sit next to each other but are distinct.

Problem | Who solves it
"I lost SDK state because my process died." | LocalTransport + strix.resume() replays the event log.
"Executions happened on the broker while I was gone." | strix.reconcile(broker) calls into your BrokerReadAdapter to pull missed fills and sanity-check state.

BrokerReadAdapter is a Protocol you implement against your broker's SDK; Strix ships only the contract plus a StaticBrokerReadAdapter test fixture. There's also a "no adapter, manual loop" fallback shown later in this guide: same primitives, more code.

The default 3-line pattern (with BrokerReadAdapter)

import strix

transport = strix.LocalTransport(data_dir="./strix_data")
broker = MyBrokerReadAdapter(...)   # your own implementation of strix.BrokerReadAdapter

# 1. Restore SDK state from the prior session's event log.
strix.resume(transport=transport)

# 2. Pull anything the broker filled while we were down + sanity-check state.
strix.reconcile(broker)

# 3. Close the prior session, start a fresh live window.
strix.init(transport=transport)

What this does:

  • Step 1 rebuilds positions, open orders, and risk config in memory from the event log. No-op if the prior process exited cleanly with all events flushed (which it should — LocalTransport.append fsyncs).
  • Step 2 fetches executions from your broker via BrokerReadAdapter.fetch_executions, ingests them, then sanity-checks Strix's resumed positions and open orders against fetch_positions / fetch_open_orders. Any mismatch raises BrokerReconciliationError by default (see Mismatch modes below for the warn and trust alternatives). Executions are deduped by execution_id, so re-fetches are safe.
  • Step 3 closes the prior session (it's now caught up to reality), opens a new session for live trading, and carries forward open orders + positions automatically.

The gap fills are attributed to the prior session, the one that placed the orders. After step 3 the new session's dashboards start fresh; the prior session's dashboards show the full lifecycle including the gap fills.
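Step 1 can be pictured as a fold over the event log. The following is a minimal sketch of that idea, not Strix's implementation: the event shapes (`OrderAccepted`, `Fill`, the `closes_order` flag) and the `replay` helper are hypothetical and exist only for illustration.

```python
from decimal import Decimal

# Hypothetical event shapes; the real Strix event log schema may differ.
events = [
    {"seq": 1, "type": "OrderAccepted", "order_id": "o1", "symbol": "AAPL"},
    {"seq": 2, "type": "OrderAccepted", "order_id": "o2", "symbol": "MSFT"},
    {"seq": 3, "type": "Fill", "order_id": "o1", "symbol": "AAPL",
     "qty": "10", "closes_order": True},
]


def replay(log):
    """Fold events, in seq order, into positions and a set of open order ids."""
    positions = {}
    open_orders = set()
    for ev in sorted(log, key=lambda e: e["seq"]):
        if ev["type"] == "OrderAccepted":
            open_orders.add(ev["order_id"])
        elif ev["type"] == "Fill":
            qty = Decimal(ev["qty"])
            positions[ev["symbol"]] = positions.get(ev["symbol"], Decimal(0)) + qty
            if ev["closes_order"]:
                open_orders.discard(ev["order_id"])
    # The next event appended after a resume would use max(seq) + 1.
    next_seq = max((e["seq"] for e in log), default=0) + 1
    return positions, open_orders, next_seq


positions, open_orders, next_seq = replay(events)
```

Because the in-memory state is derived only from the log, replaying the same log twice yields the same state, which is what makes resume safe to run after any crash.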

What is the BrokerReadAdapter Protocol?

class BrokerReadAdapter(Protocol):
    def fetch_executions(self, *, since: str | None) -> strix.ExecutionBatch: ...
    def fetch_open_orders(self) -> Iterable[strix.Order]: ...
    def fetch_positions(self) -> Iterable[strix.Position]: ...

Implement it against your broker's REST/FIX/websocket SDK. The since cursor is opaque — Strix passes back whatever your adapter put in ExecutionBatch.next_marker on the previous call. Your adapter must set per-request timeouts on its own HTTP client; Strix does not wrap the calls in a timeout.
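To illustrate the opaque cursor, here is a minimal sketch of an adapter-like class that serves executions from an in-memory list. `Batch` is a stand-in for strix.ExecutionBatch (only the next_marker field is taken from this guide), and `InMemoryAdapter` and its list-offset cursor are assumptions made for the example. A real adapter would derive the marker from whatever paging token or timestamp its broker API provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Batch:
    """Stand-in for strix.ExecutionBatch; field names other than next_marker are assumed."""
    executions: list
    next_marker: Optional[str]


class InMemoryAdapter:
    """Serves executions from a list, using the list offset as an opaque cursor."""

    def __init__(self, fills: list) -> None:
        self._fills = fills

    def fetch_executions(self, *, since: Optional[str]) -> Batch:
        # since=None means "from the beginning"; otherwise it is our own marker.
        start = int(since) if since is not None else 0
        new = self._fills[start:]
        # The caller never parses the marker; it only hands it back next time.
        return Batch(executions=new, next_marker=str(start + len(new)))


adapter = InMemoryAdapter([{"execution_id": "e1"}, {"execution_id": "e2"}])
first = adapter.fetch_executions(since=None)
second = adapter.fetch_executions(since=first.next_marker)
```

The second call returns an empty batch and the same marker, so repeating it is harmless.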

Order instances returned by fetch_open_orders should carry status=OrderStatus.NEW (or PARTIALLY_FILLED when filled_qty > 0). The other statuses (PENDING_NEW, PENDING_CANCEL, REJECTED, FILLED, CANCELLED) are Strix-internal or terminal and don't apply to a broker-side "currently working" view.
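The status rule above is small enough to write as a helper. This is a sketch only: `Status` stands in for the two relevant strix.OrderStatus members, and `working_status` is a hypothetical name.

```python
from decimal import Decimal
from enum import Enum


class Status(Enum):
    """Stand-in for the two strix.OrderStatus members relevant to open orders."""
    NEW = "NEW"
    PARTIALLY_FILLED = "PARTIALLY_FILLED"


def working_status(filled_qty: Decimal) -> Status:
    """Pick the status for an order the broker still reports as open."""
    return Status.PARTIALLY_FILLED if filled_qty > 0 else Status.NEW


untouched = working_status(Decimal(0))
partly_done = working_status(Decimal("2.5"))
```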

For tests and examples, strix.StaticBrokerReadAdapter(executions=..., open_orders=..., positions=..., next_marker=...) is a frozen-data fixture.

Stateful adapters in tests

Production adapters wrap a broker SDK and are stateless from your code's perspective; the broker is the source of truth. But tests, simulators, and examples that need a broker-shaped sandbox have to persist their own view across runs. The pattern that works:

import json
from dataclasses import dataclass, field
from decimal import Decimal
from pathlib import Path
from typing import Iterable
from strix import BrokerReadAdapter, Execution, ExecutionBatch, Order, OrderStatus, Position, Side

class SimulatedBroker:
    """JSON-on-disk broker simulator. Survives process restarts."""

    def __init__(self, *, state_path: str) -> None:
        self._path = Path(state_path)
        if self._path.exists():
            self._state = json.loads(self._path.read_text())
        else:
            self._state = {"orders": {}, "executions": [], "positions": {}, "next_seq": 1}

    def save(self) -> None:
        tmp = self._path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._state, default=str))
        tmp.replace(self._path)

    # BrokerReadAdapter Protocol
    def fetch_executions(self, *, since: str | None) -> ExecutionBatch: ...
    def fetch_open_orders(self) -> "Iterable[Order]": ...
    def fetch_positions(self) -> "Iterable[Position]": ...

Two things to keep separate from LocalTransport:

  1. Don't put the broker state file inside the data_dir. LocalTransport enforces a .strix-storage marker layout and treats unexpected files as corruption. Put the broker JSON next to data_dir, not inside it.
  2. Save on every mutation, not at exit. If the test simulates a crash, "exit handlers" don't run. Mirror what a real broker does: durable on accept.
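A minimal sketch of "durable on accept", using only the standard library. `DurableState` and its `accept_order` method are hypothetical names for this example; the write-to-temp, fsync, then rename sequence is one common way to make each save atomic, not necessarily what the bundled example does.

```python
import json
import os
import tempfile
from pathlib import Path


class DurableState:
    """Tiny JSON store that persists on every mutation, not at exit."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._state = json.loads(path.read_text()) if path.exists() else {"orders": {}}

    def accept_order(self, order_id: str, symbol: str) -> None:
        self._state["orders"][order_id] = {"symbol": symbol}
        self._save()  # durable on accept

    def _save(self) -> None:
        tmp = self._path.with_suffix(".tmp")
        with open(tmp, "w") as fh:
            json.dump(self._state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # force bytes to disk before the rename
        os.replace(tmp, self._path)  # atomic swap on the same filesystem

    @property
    def orders(self) -> dict:
        return self._state["orders"]


# Keep the state file next to the data directory, not inside it.
root = Path(tempfile.mkdtemp())
state_file = root / "broker_state.json"
DurableState(state_file).accept_order("o1", "AAPL")
reloaded = DurableState(state_file)  # simulates a restart with no exit handler
```

Nothing here relies on atexit or a finally block, so a simulated crash immediately after accept_order still leaves the order on disk.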

A complete working example lives at examples/python/broker_gap_reconcile/ — ~300 lines of broker_sim plus a CLI that drives the resume + reconcile cookbook end-to-end.

Marker auto-tracking

You don't have to track last_seen between runs. Strix records the next_marker your adapter returns in a ReconciliationCompleted event, and the next reconcile(broker) call without since= passes the recorded marker back to your adapter. The recorded marker survives resume.

Explicit override is still supported: strix.reconcile(broker, since="my-cursor") forwards "my-cursor" for this call only without affecting the tracked marker.

If a reconcile call fails (adapter raises, mismatch raises under on_mismatch="raise"), the marker doesn't advance — the next call re-fetches from the prior marker, and the dedupe makes the overlap a no-op.
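The three marker rules (auto-tracking, explicit override, no advance on failure) can be modelled in a few lines. This is a toy model under stated assumptions, not Strix's code: `MarkerTracker` and the `fetch(cursor)` callable returning an `(executions, next_marker)` pair are invented for the sketch.

```python
class MarkerTracker:
    """Toy model: remember the last marker, advance it only after success."""

    def __init__(self) -> None:
        self.marker = None

    def reconcile(self, fetch, since=None):
        # An explicit since= is used for this call only.
        cursor = since if since is not None else self.marker
        executions, next_marker = fetch(cursor)  # may raise; marker untouched if so
        if since is None:
            self.marker = next_marker
        return executions


seen_cursors = []


def flaky_fetch(cursor):
    raise TimeoutError("broker unreachable")


def good_fetch(cursor):
    seen_cursors.append(cursor)
    return (["e1"], "m1")


tracker = MarkerTracker()
try:
    tracker.reconcile(flaky_fetch)
except TimeoutError:
    pass
marker_after_failure = tracker.marker

tracker.reconcile(good_fetch)                     # tracked marker becomes "m1"
tracker.reconcile(good_fetch, since="my-cursor")  # one-off override
tracker.reconcile(good_fetch)                     # back to the tracked marker
```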

Mismatch modes

strix.reconcile(broker, on_mismatch=...) accepts three values:

Mode | What happens on mismatch
"raise" (default) | Raise BrokerReconciliationError with the full lists. You inspect, fix, retry.
"warn" | Log per-mismatch warnings; return the ReconcileResult with mismatches populated. No state change.
"trust" | Adopt the broker's view: emit PositionAdjusted events to bring positions in line, and automatically mark as CANCELLED any orders that the broker no longer reports as open. Other order mismatch cases (status divergence, broker-only orders) stay resolution="unresolved" and still raise at the end.

All modes ingest executions before the mismatch check. fetch_executions runs first; any new fills are applied to positions and orders, and any ExecutionAnomaly from those fills is recorded. The mismatch check then compares the post-ingest state against fetch_positions / fetch_open_orders. So under "raise", when the call raises, Strix's positions already reflect the gap fills — only the divergence check failed. A retry under "trust" will see those fills as skipped_duplicate and only needs to resolve the remaining position/order mismatches.

PositionAdjusted is a new event that mutates the position book directly — no fill explains the change, the event itself is the audit trail. Use trust mode only when the broker is the source of truth for your positions (manual trades off-system, broker-side corrections).
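To make the three modes concrete for the position check, here is a sketch over plain symbol-to-quantity dicts. `check_positions` and `MismatchError` are hypothetical stand-ins (the latter for BrokerReconciliationError), and the in-place dict update only approximates what a PositionAdjusted event would record.

```python
import logging
from decimal import Decimal

log = logging.getLogger("reconcile_sketch")


class MismatchError(Exception):
    """Stand-in for BrokerReconciliationError."""


def check_positions(local, broker, on_mismatch="raise"):
    """Compare two symbol->qty dicts and react according to on_mismatch."""
    zero = Decimal(0)
    mismatches = [
        (s, local.get(s, zero), broker.get(s, zero))
        for s in sorted(set(local) | set(broker))
        if local.get(s, zero) != broker.get(s, zero)
    ]
    if not mismatches:
        return []
    if on_mismatch == "raise":
        raise MismatchError(mismatches)
    if on_mismatch == "warn":
        for s, mine, theirs in mismatches:
            log.warning("position mismatch on %s: local=%s broker=%s", s, mine, theirs)
        return mismatches  # local is left untouched
    if on_mismatch == "trust":
        for s, _, theirs in mismatches:
            local[s] = theirs  # the analogue of a PositionAdjusted event
        return mismatches
    raise ValueError(f"unknown on_mismatch: {on_mismatch!r}")


local = {"AAPL": Decimal(10)}
broker = {"AAPL": Decimal(6), "MSFT": Decimal(3)}

warned = check_positions(local, broker, on_mismatch="warn")
after_warn = dict(local)
check_positions(local, broker, on_mismatch="trust")
```

After the "trust" call the local dict equals the broker's, while the "warn" call reported the same two mismatches without changing anything.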

check_positions=False / check_open_orders=False

Skip individual checks when one of the broker's endpoints is flaky or you don't care about that dimension:

strix.reconcile(broker, check_positions=False, check_open_orders=False)
# just the execution backfill

Cross-session caveat

Execution dedupe is intra-session. If you reconcile in session A, close A via strix.init(...), then reconcile in session B with overlapping since, the broker may return executions already ingested by A. Session B's dedupe set is empty, so those would be applied again, double-counting the position. Mitigation: pass since= accurately enough that the broker doesn't return already-ingested fills across the session boundary — or use the 3-line pattern above (which keeps the reconcile inside the prior session).
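The double-count is easy to reproduce in a toy model. The `Session` class below is hypothetical; it only mirrors the two properties described above: positions carry forward across the session boundary, but the dedupe set does not.

```python
from decimal import Decimal


class Session:
    """Toy session: its dedupe set starts empty; positions are carried forward."""

    def __init__(self, carried_positions=None):
        self.positions = dict(carried_positions or {})
        self.seen_ids = set()

    def ingest(self, execution_id, symbol, qty):
        if execution_id in self.seen_ids:
            return  # duplicate within this session: ignored
        self.seen_ids.add(execution_id)
        self.positions[symbol] = self.positions.get(symbol, Decimal(0)) + qty


fill = ("e1", "AAPL", Decimal(10))

session_a = Session()
session_a.ingest(*fill)
session_a.ingest(*fill)  # same session: deduped

session_b = Session(carried_positions=session_a.positions)
session_b.ingest(*fill)  # new session, overlapping since: applied again
```

Session A ends at 10 shares; session B, fed the same execution once more, ends at 20.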

When the gap matters: 3-session pattern

Sometimes you want the gap-fill executions to live in their own bucket — a "what happened while I was down" session distinct from both pre-crash and live trading. This makes per-session dashboards cleaner if the gap was large.

transport = strix.LocalTransport(data_dir="./strix_data")

# 1. Restore.
strix.resume(transport=transport)

# 2. Cut a fresh "gap reconciliation" session. The prior session closes.
strix.init(transport=transport)

# 3. Pull gap fills into the gap session.
strix.reconcile(broker)

# 4. Close the gap session, open the live one.
strix.init(transport=transport)

Same primitives, more init() calls. Three sessions in the log: pre-crash, gap-reconciliation, live.

The trade-off:

  • Default 3-line: simpler, gap fills attributed to the session that placed the orders.
  • 3-session: more init boundaries, clean per-session attribution if the gap is meaningful in its own right.

Use the default 3-line pattern unless you have a specific reason to isolate the gap.

Manual fallback (no BrokerReadAdapter)

If you don't want to implement a BrokerReadAdapter yet, the for-loop pattern still works:

transport = strix.LocalTransport(data_dir="./strix_data")
strix.resume(transport=transport)
for ex in my_broker.fetch_executions(since=last_seen):   # your code
    strix.ingest_execution(ex)
strix.init(transport=transport)

Trade-offs versus strix.reconcile(...):

  • You track last_seen yourself between runs (Strix doesn't see it).
  • No automatic sanity-check against fetch_positions / fetch_open_orders — code your own (see "Sanity checks worth running on resume" below).
  • Execution dedupe still works as long as your Execution objects carry execution_id.

Why not auto-resume?

You might wonder why strix.init() doesn't just resume if it finds an open session. Two reasons:

  1. Surprise. A user who calls init expecting fresh state and instead gets yesterday's resumed state has a hard-to-debug problem. "I called init, why are these positions here?"
  2. Intent mismatch. init says "new analytics window"; resume says "continue the one I was in". Bundling both into one call hides which one you meant.

So the model is: two functions, two intents. init for new windows, resume for crash recovery. The 3-line cookbook uses both: resume first, then (after reconcile) init.

Handling NoActiveSessionError

strix.resume raises NoActiveSessionError if no open session exists in storage, for example on the very first run against a fresh data_dir, or if the prior session closed cleanly before the crash.

import strix
from strix import NoActiveSessionError

transport = strix.LocalTransport(data_dir="./strix_data")
broker = MyBrokerReadAdapter(...)
try:
    strix.resume(transport=transport)
    strix.reconcile(broker)
    strix.init(transport=transport)
except NoActiveSessionError:
    # Nothing to resume — just start fresh.
    strix.init(transport=transport)

This is the boot pattern most algos want: try to resume, fall through to fresh-init.

What resume does not do

  • Does not contact your broker. Strix has no broker — resume only replays the local event log. Use strix.reconcile(broker) separately to pull broker-side state.
  • Does not change the session_id. Resume continues the same session. The next event picks up at max(seq) + 1.
  • Does not auto-cancel orders that the broker dropped. strix.reconcile(broker, on_mismatch="trust") will cancel Strix's view of any order the broker no longer reports as open, or you can call strix.cancel(order_id=...) yourself once you've identified them.

Sanity checks worth running on resume (without reconcile)

If you're using the manual fallback flow (no BrokerReadAdapter), reproduce reconcile's sanity-check by hand:

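import logging
from decimal import Decimal

log = logging.getLogger(__name__)

strix.resume(transport=transport)

open_orders = strix.open_orders()
positions = strix.positions()

log.info("resumed with %d open orders, %d positions", len(open_orders), len(positions))

# Compare against the broker's view.
# my_broker.get_positions() is your code; here it returns {symbol: Decimal qty}.
broker_positions = my_broker.get_positions()
for p in positions:
    broker_qty = broker_positions.get(p.symbol, Decimal(0))
    if broker_qty != p.qty:
        log.warning(
            "position mismatch on %s: strix=%s, broker=%s",
            p.symbol, p.qty, broker_qty,
        )

# Also flag non-zero positions the broker holds that Strix does not know about.
known = {p.symbol for p in positions}
for symbol, qty in broker_positions.items():
    if symbol not in known and qty != 0:
        log.warning("broker-only position on %s: broker=%s", symbol, qty)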

If broker and Strix disagree, something happened that the gap-fill loop didn't capture (a cancel, an expiry, a manual broker-side adjustment). Resolve before going live. The BrokerReadAdapter path makes this automatic — strix.reconcile(broker) raises BrokerReconciliationError with the same information by default.
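Open orders can be checked by hand in the same spirit, by comparing order ids on both sides. A minimal sketch follows; `diff_open_orders` is a hypothetical helper, and how you obtain the two id lists (from strix.open_orders() and from your broker client) is left to your code.

```python
def diff_open_orders(local_ids, broker_ids):
    """Return (ids only the local view holds open, ids only the broker reports)."""
    local_ids, broker_ids = set(local_ids), set(broker_ids)
    return sorted(local_ids - broker_ids), sorted(broker_ids - local_ids)


dropped_by_broker, unknown_locally = diff_open_orders(
    local_ids=["o1", "o2", "o3"],
    broker_ids=["o2", "o4"],
)
```

Ids in the first list are candidates for a cancel or expiry that happened during the gap; ids in the second list are orders the broker is working that the resumed state does not know about.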

On disk, after the recovery flow

After the default 3-line pattern, ./strix_data/sessions/ contains two session directories — the prior one (closed) and the new live one (open). Both have full event logs. The active_session pointer points at the new live one.

That's the audit trail. You can cat sessions/<prior_id>/events.jsonl to see exactly what happened up to and including the gap fills.
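If cat is too noisy, a few lines of Python can summarize a JSONL log. This sketch assumes each line is a JSON object with a "type" field, which is a guess about the event schema rather than something this guide specifies. The demo runs on a throwaway file with made-up events.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_event_types(path: Path) -> Counter:
    """Tally the "type" field of every JSON line in an event log."""
    counts = Counter()
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                counts[json.loads(line).get("type", "<unknown>")] += 1
    return counts


# Demo on a temporary file; event names here are invented.
demo = Path(tempfile.mkdtemp()) / "events.jsonl"
demo.write_text(
    '{"seq": 1, "type": "OrderAccepted"}\n'
    '{"seq": 2, "type": "Fill"}\n'
    '{"seq": 3, "type": "Fill"}\n'
)
counts = count_event_types(demo)
```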