Crash recovery and broker-gap reconciliation¶
Two problems sit next to each other but are distinct.
| Problem | Who solves it |
|---|---|
| "I lost SDK state because my process died." | LocalTransport + strix.resume() replays the event log. |
| "Executions happened on the broker while I was gone." | strix.reconcile(broker) calls into your BrokerReadAdapter to pull missed fills and sanity-check state. |
BrokerReadAdapter is a Protocol you implement against your broker's SDK; Strix ships only the contract plus a StaticBrokerReadAdapter test fixture. There's also a "no adapter, manual loop" fallback shown later in this guide (see Manual fallback below): same primitives, more code.
The default 3-line pattern (with BrokerReadAdapter)¶
import strix
transport = strix.LocalTransport(data_dir="./strix_data")
broker = MyBrokerReadAdapter(...) # your own implementation of strix.BrokerReadAdapter
# 1. Restore SDK state from the prior session's event log.
strix.resume(transport=transport)
# 2. Pull anything the broker filled while we were down + sanity-check state.
strix.reconcile(broker)
# 3. Close the prior session, start a fresh live window.
strix.init(transport=transport)
What this does:
- Step 1 rebuilds positions, open orders, and risk config in memory from the event log. No-op if the prior process exited cleanly with all events flushed (which it should, because LocalTransport.append fsyncs).
- Step 2 fetches executions from your broker via BrokerReadAdapter.fetch_executions, ingests them, then sanity-checks Strix's resumed positions and open orders against fetch_positions / fetch_open_orders. Any mismatch raises BrokerReconciliationError by default (see Mismatch modes below for the warn and trust alternatives). Executions are deduped by execution_id, so re-fetches are safe.
- Step 3 closes the prior session (it's now caught up to reality), opens a new session for live trading, and carries forward open orders + positions automatically.
The gap fills are attributed to the prior session, the one that placed the orders. After step 3 the new session's dashboards start fresh; the prior session's dashboards show the full lifecycle including the gap fills.
What is the BrokerReadAdapter Protocol?¶
from typing import Iterable, Protocol

class BrokerReadAdapter(Protocol):
    def fetch_executions(self, *, since: str | None) -> strix.ExecutionBatch: ...
    def fetch_open_orders(self) -> Iterable[strix.Order]: ...
    def fetch_positions(self) -> Iterable[strix.Position]: ...
Implement it against your broker's REST/FIX/websocket SDK. The since cursor is opaque — Strix passes back whatever your adapter put in ExecutionBatch.next_marker on the previous call. Your adapter must set per-request timeouts on its own HTTP client; Strix does not wrap the calls in a timeout.
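To make the opaque cursor concrete, here is a minimal sketch of an adapter that uses a list offset as its marker. FakeExecution and FakeBatch are stand-ins invented for this example so it runs without Strix installed; of the real ExecutionBatch, only the next_marker attribute is named in this guide, and the executions field is an assumption. A real adapter would more likely use a broker-issued timestamp or sequence number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeExecution:
    """Stand-in for strix.Execution (hypothetical fields)."""
    execution_id: str
    symbol: str
    qty: int


@dataclass(frozen=True)
class FakeBatch:
    """Stand-in for strix.ExecutionBatch (the executions field is assumed)."""
    executions: tuple
    next_marker: Optional[str]


class CursorAdapter:
    """Serves executions from an append-only list, using the list offset as the cursor."""

    def __init__(self, broker_log: list) -> None:
        self._log = broker_log

    def fetch_executions(self, *, since: Optional[str]) -> FakeBatch:
        # since is None on the first call; afterwards it is whatever this
        # adapter returned as next_marker on the previous call.
        start = 0 if since is None else int(since)
        new = tuple(self._log[start:])
        return FakeBatch(executions=new, next_marker=str(start + len(new)))


broker_log = [FakeExecution("e1", "AAPL", 10), FakeExecution("e2", "AAPL", -4)]
adapter = CursorAdapter(broker_log)

first = adapter.fetch_executions(since=None)
broker_log.append(FakeExecution("e3", "MSFT", 5))  # a fill arrives between calls
second = adapter.fetch_executions(since=first.next_marker)
```

The caller never interprets the marker; only the adapter that produced it parses it, which is what lets you change the cursor format later without touching the calling code.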
Order instances returned by fetch_open_orders should carry status=OrderStatus.NEW (or PARTIALLY_FILLED when filled_qty > 0). The other statuses (PENDING_NEW, PENDING_CANCEL, REJECTED, FILLED, CANCELLED) are Strix-internal or terminal and don't apply to a broker-side "currently working" view.
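The status rule above is small enough to capture in a helper. This is a sketch only: the OrderStatus enum below is a local stand-in limited to the two broker-visible values, and working_status is a hypothetical helper name, not part of Strix.

```python
from decimal import Decimal
from enum import Enum


class OrderStatus(Enum):
    """Stand-in for strix.OrderStatus, limited to the two broker-visible values."""
    NEW = "NEW"
    PARTIALLY_FILLED = "PARTIALLY_FILLED"


def working_status(filled_qty: Decimal) -> OrderStatus:
    """Pick the status for an order the broker still reports as working."""
    if filled_qty < 0:
        raise ValueError("filled_qty cannot be negative")
    return OrderStatus.PARTIALLY_FILLED if filled_qty > 0 else OrderStatus.NEW
```

Calling this from fetch_open_orders keeps terminal or Strix-internal statuses out of the broker-side view by construction.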
For tests and examples, strix.StaticBrokerReadAdapter(executions=..., open_orders=..., positions=..., next_marker=...) is a frozen-data fixture.
Stateful adapters in tests¶
Production adapters wrap a broker SDK and are stateless from your code's perspective, because the broker is the source of truth. But tests, simulators, and examples that need a broker-shaped sandbox have to persist their own view across runs. The pattern that works:
import json
from dataclasses import dataclass, field
from decimal import Decimal
from pathlib import Path
from typing import Iterable

from strix import BrokerReadAdapter, Execution, ExecutionBatch, Order, OrderStatus, Position, Side

class SimulatedBroker:
    """JSON-on-disk broker simulator. Survives process restarts."""

    def __init__(self, *, state_path: str) -> None:
        self._path = Path(state_path)
        if self._path.exists():
            self._state = json.loads(self._path.read_text())
        else:
            self._state = {"orders": {}, "executions": [], "positions": {}, "next_seq": 1}

    def save(self) -> None:
        # Write to a temp file, then rename it over the real state file.
        tmp = self._path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._state, default=str))
        tmp.replace(self._path)

    # BrokerReadAdapter Protocol
    def fetch_executions(self, *, since: str | None) -> ExecutionBatch: ...
    def fetch_open_orders(self) -> Iterable[Order]: ...
    def fetch_positions(self) -> Iterable[Position]: ...
Two things to keep separate from LocalTransport:
- Don't put the broker state file inside the data_dir. LocalTransport enforces a .strix-storage marker layout and treats unexpected files as corruption. Put the broker JSON next to data_dir, not inside it.
- Save on every mutation, not at exit. If the test simulates a crash, "exit handlers" don't run. Mirror what a real broker does: durable on accept.
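Both rules can be seen in a trimmed-down simulator. The sketch below is illustrative and independent of Strix: TinyBroker and accept_order are hypothetical names, and the state layout is reduced to what the example needs.

```python
import json
import tempfile
from pathlib import Path


class TinyBroker:
    """Trimmed-down simulator: every mutation is written to disk before returning."""

    def __init__(self, *, state_path: str) -> None:
        self._path = Path(state_path)
        if self._path.exists():
            self._state = json.loads(self._path.read_text())
        else:
            self._state = {"orders": {}, "next_seq": 1}

    def _save(self) -> None:
        # Write to a sibling temp file, then rename it over the real file so
        # a reader never sees a half-written state file.
        tmp = self._path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._state))
        tmp.replace(self._path)

    def accept_order(self, symbol: str, qty: int) -> str:
        order_id = f"sim-{self._state['next_seq']}"
        self._state["next_seq"] += 1
        self._state["orders"][order_id] = {"symbol": symbol, "qty": qty}
        self._save()  # persist on accept, not at exit
        return order_id


with tempfile.TemporaryDirectory() as root:
    data_dir = Path(root) / "strix_data"            # would be owned by LocalTransport
    state_path = Path(root) / "broker_state.json"   # sibling of data_dir, not inside it
    data_dir.mkdir()

    broker = TinyBroker(state_path=str(state_path))
    order_id = broker.accept_order("AAPL", 10)
    del broker  # simulate a crash: no exit handler runs, no explicit save

    reloaded = TinyBroker(state_path=str(state_path))
    survived = order_id in reloaded._state["orders"]
    inside_data_dir = data_dir in state_path.parents
```

Because the save happens inside accept_order, the reloaded instance sees the order even though the first instance was dropped without any shutdown step.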
A complete working example lives at examples/python/broker_gap_reconcile/ — ~300 lines of broker_sim plus a CLI that drives the resume + reconcile cookbook end-to-end.
Marker auto-tracking¶
You don't have to track last_seen between runs. Strix records the next_marker your adapter returns in a ReconciliationCompleted event, and the next reconcile(broker) call without since= passes the recorded marker back to your adapter. The tracked marker survives resume.
Explicit override is still supported: strix.reconcile(broker, since="my-cursor") forwards "my-cursor" for this call only without affecting the tracked marker.
If a reconcile call fails (adapter raises, mismatch raises under on_mismatch="raise"), the marker doesn't advance — the next call re-fetches from the prior marker, and the dedupe makes the overlap a no-op.
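The failure behaviour is easier to see in a toy model. The class below is not Strix's implementation; it only mimics the ordering described here (ingest with dedupe, then check, then advance the marker), and every name in it is hypothetical.

```python
from typing import Callable, Optional


class ToyReconciler:
    """Toy model of marker tracking; not Strix's implementation."""

    def __init__(self) -> None:
        self.marker: Optional[str] = None
        self.seen_ids: set = set()
        self.applied: list = []

    def reconcile(self, fetch: Callable, check: Callable[[], None]) -> None:
        executions, next_marker = fetch(self.marker)
        for execution_id in executions:
            if execution_id not in self.seen_ids:  # dedupe by execution_id
                self.seen_ids.add(execution_id)
                self.applied.append(execution_id)
        check()                    # may raise, like on_mismatch="raise"
        self.marker = next_marker  # only reached when the call succeeds


def fetch(since: Optional[str]):
    log = ["e1", "e2", "e3"]
    start = 0 if since is None else int(since)
    return log[start:], str(len(log))


def failing_check() -> None:
    raise RuntimeError("mismatch")


toy = ToyReconciler()
try:
    toy.reconcile(fetch, failing_check)
except RuntimeError:
    pass
marker_after_failure = toy.marker   # the marker did not advance

toy.reconcile(fetch, lambda: None)  # retry re-fetches e1..e3; all are deduped
```

After the retry each execution has been applied exactly once, even though the broker returned the same three fills twice.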
Mismatch modes¶
strix.reconcile(broker, on_mismatch=...) accepts three values:
| Mode | What happens on mismatch |
|---|---|
| "raise" (default) | Raise BrokerReconciliationError with the full lists. You inspect, fix, retry. |
| "warn" | Log per-mismatch warnings; return the ReconcileResult with mismatches populated. No state change. |
| "trust" | Adopt the broker's view: emit PositionAdjusted events to bring positions in line; auto-CANCELLED orders that the broker no longer reports as open. Other order mismatch cases (status divergence, broker-only orders) stay resolution="unresolved" and still raise at the end. |
All modes ingest executions before the mismatch check. fetch_executions runs first; any new fills are applied to positions and orders, and any ExecutionAnomaly from those fills is recorded. The mismatch check then compares the post-ingest state against fetch_positions / fetch_open_orders. So under "raise", when the call raises, Strix's positions already reflect the gap fills — only the divergence check failed. A retry under "trust" will see those fills as skipped_duplicate and only needs to resolve the remaining position/order mismatches.
PositionAdjusted is a new event that mutates the position book directly — no fill explains the change, the event itself is the audit trail. Use trust mode only when the broker is the source of truth for your positions (manual trades off-system, broker-side corrections).
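As a rough illustration of how the three modes differ for positions, the sketch below compares two symbol-to-quantity books. It is a toy model under stated assumptions, not the SDK's logic: check_positions and ToyReconciliationError are hypothetical names, and order mismatches are left out entirely.

```python
from decimal import Decimal


class ToyReconciliationError(Exception):
    """Stand-in for BrokerReconciliationError."""


def check_positions(local: dict, broker: dict, on_mismatch: str = "raise") -> list:
    """Compare two {symbol: qty} books and apply the chosen mismatch mode."""
    if on_mismatch not in ("raise", "warn", "trust"):
        raise ValueError(f"unknown on_mismatch mode: {on_mismatch!r}")
    zero = Decimal(0)
    mismatches = [
        (symbol, local.get(symbol, zero), broker.get(symbol, zero))
        for symbol in sorted(set(local) | set(broker))
        if local.get(symbol, zero) != broker.get(symbol, zero)
    ]
    if mismatches and on_mismatch == "raise":
        raise ToyReconciliationError(mismatches)
    if on_mismatch == "trust":
        for symbol, _, broker_qty in mismatches:
            # Adopt the broker's quantity; the guide says Strix records this
            # kind of change as a PositionAdjusted event.
            local[symbol] = broker_qty
    return mismatches  # under "warn" the list is returned and local is untouched


local = {"AAPL": Decimal(10), "MSFT": Decimal(5)}
broker = {"AAPL": Decimal(10), "MSFT": Decimal(3)}

warned = check_positions(local, broker, on_mismatch="warn")
after_warn = dict(local)                              # snapshot: still the old book
check_positions(local, broker, on_mismatch="trust")   # now local follows the broker
```

The union of symbols matters: a position that exists on only one side is compared against zero rather than silently skipped.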
check_positions=False / check_open_orders=False¶
Skip individual checks when one of the broker's endpoints is flaky or you don't care about that dimension:
strix.reconcile(broker, check_positions=False, check_open_orders=False)
# just the execution backfill
Cross-session caveat¶
Execution dedupe is intra-session. If you reconcile in session A, close A via strix.init(...), then reconcile in session B with overlapping since, the broker may return executions already ingested by A. Session B's dedupe set is empty, so those would be applied again, double-counting the position. Mitigation: pass since= accurately enough that the broker doesn't return already-ingested fills across the session boundary — or use the 3-line pattern above (which keeps the reconcile inside the prior session).
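The double-counting can be reproduced with a few lines of arithmetic. ToySession below is a hypothetical model of a session with its own dedupe set; it assumes, as described above, that the position carries forward across init while the dedupe set does not.

```python
class ToySession:
    """Toy session with its own dedupe set; not a Strix class."""

    def __init__(self, position: int = 0) -> None:
        self.position = position
        self.seen_ids: set = set()

    def ingest(self, execution_id: str, qty: int) -> bool:
        """Apply a fill once per session; return False for a duplicate."""
        if execution_id in self.seen_ids:
            return False
        self.seen_ids.add(execution_id)
        self.position += qty
        return True


fills = [("e1", 10), ("e2", 5)]

session_a = ToySession()
for execution_id, qty in fills:
    session_a.ingest(execution_id, qty)
# Re-fetching inside session A is harmless: every fill is a duplicate.
repeat_in_a = [session_a.ingest(execution_id, qty) for execution_id, qty in fills]

# Closing A and opening B carries the position forward but starts with an
# empty dedupe set.
session_b = ToySession(position=session_a.position)
for execution_id, qty in fills:  # overlapping since: the broker returns e1, e2 again
    session_b.ingest(execution_id, qty)
```

Session A ends at 15 shares, while session B ends at 30 because the same two fills were applied a second time.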
When the gap matters: 3-session pattern¶
Sometimes you want the gap-fill executions to live in their own bucket — a "what happened while I was down" session distinct from both pre-crash and live trading. This makes per-session dashboards cleaner if the gap was large.
transport = strix.LocalTransport(data_dir="./strix_data")
# 1. Restore.
strix.resume(transport=transport)
# 2. Cut a fresh "gap reconciliation" session. The prior session closes.
strix.init(transport=transport)
# 3. Pull gap fills into the gap session.
strix.reconcile(broker)
# 4. Close the gap session, open the live one.
strix.init(transport=transport)
Same primitives, more init() calls. Three sessions in the log: pre-crash, gap-reconciliation, live.
The trade-off:
- Default 3-line: simpler, gap fills attributed to the session that placed the orders.
- 3-session: more init boundaries, clean per-session attribution if the gap is meaningful in its own right.
Use the default pattern unless you have a reason not to.
Manual fallback (no BrokerReadAdapter)¶
If you don't want to implement a BrokerReadAdapter yet, the for-loop pattern still works:
transport = strix.LocalTransport(data_dir="./strix_data")
strix.resume(transport=transport)
for ex in my_broker.fetch_executions(since=last_seen): # your code
    strix.ingest_execution(ex)
strix.init(transport=transport)
Trade-offs versus strix.reconcile(...):
- You track last_seen yourself between runs (Strix doesn't see it).
- No automatic sanity-check against fetch_positions / fetch_open_orders; code your own (see "Sanity checks worth running on resume" below).
- Execution dedupe still works as long as your Execution objects carry execution_id.
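One way to track last_seen between runs is a small cursor file kept next to data_dir. This is a sketch with hypothetical helper names (load_cursor, save_cursor) and a made-up cursor value; it uses the same write-then-rename approach as the simulator's save method.

```python
import tempfile
from pathlib import Path
from typing import Optional


def load_cursor(path: Path) -> Optional[str]:
    """Return the saved cursor, or None on the first run."""
    if not path.exists():
        return None
    text = path.read_text().strip()
    return text or None


def save_cursor(path: Path, cursor: str) -> None:
    """Persist the cursor via a temp file and rename."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(cursor)
    tmp.replace(path)


with tempfile.TemporaryDirectory() as root:
    cursor_path = Path(root) / "last_seen.txt"  # keep it outside data_dir

    first_run = load_cursor(cursor_path)         # nothing saved yet
    # ... fetch and ingest executions here, then record the new cursor ...
    save_cursor(cursor_path, "2024-01-01T00:00:00Z")
    second_run = load_cursor(cursor_path)
```

Saving the cursor only after the ingest loop finishes means a crash mid-loop causes a re-fetch rather than a skipped fill, and the execution_id dedupe absorbs the overlap within the session.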
Why not auto-resume?¶
You might wonder why strix.init() doesn't just resume if it finds an open session. Two reasons:
- Surprise. A user who calls init expecting fresh state and instead gets yesterday's resumed state has a hard-to-debug problem. "I called init, why are these positions here?"
- Intent mismatch. init says "new analytics window"; resume says "continue the one I was in". Bundling both into one call hides which one you meant.
So the model is: two functions, two intents. init for new windows, resume for crash recovery. The 3-line cookbook uses both: resume first, then init.
Handling NoActiveSessionError¶
strix.resume raises NoActiveSessionError if no open session exists in storage: for example, on the very first run against a fresh data_dir, or if a prior session closed cleanly before the crash.
import strix
from strix import NoActiveSessionError
transport = strix.LocalTransport(data_dir="./strix_data")
broker = MyBrokerReadAdapter(...)
try:
    strix.resume(transport=transport)
    strix.reconcile(broker)
    strix.init(transport=transport)
except NoActiveSessionError:
    # Nothing to resume — just start fresh.
    strix.init(transport=transport)
This is the boot pattern most algos want: try to resume, fall through to fresh-init.
What resume does not do¶
- Does not contact your broker. Strix has no broker; resume only replays the local event log. Use strix.reconcile(broker) separately to pull broker-side state.
- Does not change the session_id. Resume continues the same session. The next event picks up at max(seq) + 1.
- Does not auto-cancel orders that the broker dropped. strix.reconcile(broker, on_mismatch="trust") will cancel Strix's view of any order the broker no longer reports as open, or you can call strix.cancel(order_id=...) yourself once you've identified them.
Sanity checks worth running on resume (without reconcile)¶
If you're using the manual fallback flow (no BrokerReadAdapter), reproduce reconcile's sanity-check by hand:
import logging
from decimal import Decimal

log = logging.getLogger(__name__)

strix.resume(transport=transport)
open_orders = strix.open_orders()
positions = strix.positions()
log.info("resumed with %d open orders, %d positions", len(open_orders), len(positions))
# Compare against the broker's view.
broker_positions = my_broker.get_positions()  # your code; assumed to map symbol -> Decimal qty
for p in positions:
    broker_qty = broker_positions.get(p.symbol, Decimal(0))
    if broker_qty != p.qty:
        log.warning(
            "position mismatch on %s: strix=%s, broker=%s",
            p.symbol, p.qty, broker_qty,
        )
If broker and Strix disagree, something happened that the gap-fill loop didn't capture (a cancel, an expiry, a manual broker-side adjustment). Resolve before going live. The BrokerReadAdapter path makes this automatic — strix.reconcile(broker) raises BrokerReconciliationError with the same information by default.
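The position loop above has an open-orders counterpart. A minimal sketch, assuming you can extract an order id from each entry on both sides (the helper name diff_open_orders and the sample ids are hypothetical):

```python
from typing import Iterable


def diff_open_orders(strix_ids: Iterable[str], broker_ids: Iterable[str]) -> dict:
    """Return the order ids that appear on only one side."""
    strix_set, broker_set = set(strix_ids), set(broker_ids)
    return {
        # Strix still thinks these are open, but the broker no longer lists
        # them (cancelled, expired, or filled during the gap).
        "strix_only": sorted(strix_set - broker_set),
        # The broker lists these as working, but Strix has no record of them.
        "broker_only": sorted(broker_set - strix_set),
    }


diff = diff_open_orders(
    strix_ids=["ord-1", "ord-2", "ord-3"],
    broker_ids=["ord-2", "ord-3", "ord-9"],
)
```

Here diff["strix_only"] holds the candidates for strix.cancel(order_id=...), while anything in diff["broker_only"] needs manual investigation before going live.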
On disk, after the recovery flow¶
After the default 3-line pattern, ./strix_data/sessions/ contains two session directories — the prior one (closed) and the new live one (open). Both have full event logs. The active_session pointer points at the new live one.
That's the audit trail. You can cat sessions/<prior_id>/events.jsonl to see exactly what happened up to and including the gap fills.
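For a quick summary instead of reading the raw log, a short script can tally events per kind. The sketch below assumes only that events.jsonl holds one JSON object per line; the key that names the event kind is a guess (here "type"), so pass whatever key your log actually uses. The sample log is made up for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_events(path: Path, key: str = "type") -> Counter:
    """Count events in a JSONL log, grouped by the value stored under key."""
    counts: Counter = Counter()
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event.get(key, "<missing>")] += 1
    return counts


# Made-up sample log; real Strix events will carry other fields.
with tempfile.TemporaryDirectory() as root:
    log_path = Path(root) / "events.jsonl"
    sample = [
        {"type": "PositionAdjusted", "seq": 1},
        {"type": "PositionAdjusted", "seq": 2},
        {"type": "ReconciliationCompleted", "seq": 3},
    ]
    log_path.write_text("\n".join(json.dumps(event) for event in sample) + "\n")
    counts = count_events(log_path)
```

Pointing count_events at sessions/<prior_id>/events.jsonl gives a one-glance view of how much reconciliation activity the prior session absorbed.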