Control Path¶
The control path covers the non-IO exchanges that keep an RMR pool coherent: admitting and removing sessions, propagating membership, reconciling dirty maps after a disruption, and admitting storage nodes back into service. Control messages share the RTRS session used for IO and are identified by the rmr_msg_cmd_type enum (RMR_CMD_*). For the IO flow itself, see Data Path.
Backend store registration¶
An RMR server pool cannot serve IO until a backend store is registered with it. On each storage node, the brmr-server module opens the block device and calls rmr_srv_register() to attach it as the pool’s io_store. Registration is the server-side event that moves the server pool out of EMPTY, establishes the member’s dirty map, and marks the pool ready to accept session joins. rmr_srv_unregister() reverses it.
Registration carries an rmr_srv_register_disk_mode that selects one of three behaviors (RMR_SRV_DISK_CREATE, RMR_SRV_DISK_ADD, RMR_SRV_DISK_REPLACE). Unregistration carries a delete flag. These map to the four user-facing sysfs entries on brmr-server: create_store, add_store, remove_store, delete_store. See Cluster Management for the user-facing walkthrough; the per-mode semantics are described in Attach and detach modes below.
Store state on the brmr side is tracked in brmr_srv_blk_dev->state as BRMR_SRV_STORE_OPEN / BRMR_SRV_STORE_MAPPED. Only an open and mapped store passes io_allowed(); this is the check used by the recovery thread’s store probe (see Store check).
Pool session lifecycle¶
A client session attaches to a storage node via RMR_CMD_JOIN_POOL, is admitted to service via RMR_CMD_ENABLE_POOL, re-attaches after a link disruption via RMR_CMD_REJOIN_POOL, and detaches via RMR_CMD_LEAVE_POOL. Each attach and detach is also propagated to every other non-FAILED, non-REMOVING peer via RMR_CMD_POOL_INFO (rmr_clt_send_pool_info()) so all storage nodes keep a consistent view of membership.
- RMR_CMD_JOIN_POOL carries per-pool parameters (chunk_size, queue_depth), a create flag, and rmr_pool_member_info describing the peer member_ids the client is aware of. The server replies with the member_id it has assigned to this session along with pool-wide properties (protocol version, mapped_size).
- RMR_CMD_REJOIN_POOL uses the same message body as join but with rejoin=true; it does not assign a new member_id because the server-side pool state is preserved across the disruption. The session participates in recovery but is not yet allowed to serve IO.
- RMR_CMD_ENABLE_POOL admits the session into service and is the transition that brings an rmr_clt_pool_sess to NORMAL. Enable is also embedded in the final RMR_CMD_MAP_DONE of a map update, which is how reconnected sessions get promoted at the end of reconciliation (see Map update).
- RMR_CMD_LEAVE_POOL carries a delete flag selecting whether the removal is permanent or transient (see below).
- RMR_CMD_POOL_INFO carries the affected member_id, an ADD/REMOVE operation, and a mode (CREATE/ASSEMBLE/DELETE/DISASSEMBLE) matching the join/leave that triggered it.
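A minimal sketch of what the join exchange might carry, in illustrative userspace C: only chunk_size, queue_depth, the create/rejoin flags, rmr_pool_member_info, member_id, the protocol version, and mapped_size come from the description above; all other names, field widths, and limits are assumptions made for the sketch, not the real wire layout.

```c
/*
 * Illustrative layout only: the real rmr_msg structures are not shown in this
 * document. Field widths, the member limit, and any struct/field names not
 * mentioned in the text are assumptions made for the sketch.
 */
#include <stdbool.h>
#include <stdint.h>

#define RMR_SKETCH_MAX_MEMBERS 16          /* assumed limit, for the sketch only */

struct rmr_pool_member_info {
	uint8_t  num_members;              /* peer member_ids the client is aware of */
	uint16_t member_ids[RMR_SKETCH_MAX_MEMBERS];
};

/* Body of RMR_CMD_JOIN_POOL / RMR_CMD_REJOIN_POOL (same body, rejoin differs). */
struct rmr_join_pool_req {
	uint32_t chunk_size;               /* per-pool parameters */
	uint32_t queue_depth;
	bool     create;                   /* Create vs. Assemble */
	bool     rejoin;                   /* true for RMR_CMD_REJOIN_POOL */
	struct rmr_pool_member_info members;
};

/* Reply: the identity assigned to this session plus pool-wide properties. */
struct rmr_join_pool_rsp {
	uint16_t member_id;                /* preserved, not reassigned, on rejoin */
	uint32_t proto_ver;
	uint64_t mapped_size;
};
```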
Attach and detach modes¶
Attach and detach each have two variants, paired across the brmr-store layer, the session-level command, and the peer propagation layer so every layer agrees on intent.
Create vs. Assemble¶
Used when a node joins a pool, chosen to match whether the pool already exists on that node’s disk.
Create establishes the pool on the storage node for the first time.
- Store (create_store, RMR_SRV_DISK_CREATE): brmr-server writes new on-disk pool metadata on the block device and rmr-server creates a fresh dirty map for the member. Rejected if sessions or a map for this member_id already exist. The server pool records marked_create=true for validation of the first joining client session.
- Session (RMR_CMD_JOIN_POOL with create=true): accepted only if the server pool was registered with marked_create. The server uses rmr_pool_member_info in the message to populate stg_members and create dirty maps for the peers the client is aware of.
- Propagation (RMR_CMD_POOL_INFO with ADD+CREATE): each peer calls rmr_srv_add_store_member() to create a stg_members entry and dirty map for the new member. If the dirty flag is set, the peer marks the new member's map fully dirty so subsequent piggyback IOs build up real dirty entries the new node will need to catch up on.
Assemble is used when the pool’s state already exists on the storage node’s disk (e.g. after a compute-client crash where the storage retains data but the client view is gone).
- Store (add_store, RMR_SRV_DISK_ADD): brmr-server validates the existing on-disk pool metadata; rmr-server refreshes pool_md from disk (rmr_srv_refresh_md()) and preserves the existing dirty map.
- Session (RMR_CMD_JOIN_POOL with create=false): the server pool is unchanged by the handshake. The client separately reads pool_md from the server (via RMR_CMD_MD_SEND) to learn membership and rebuild its client-side maps to match. The session goes to RECONNECTING and waits for a map update before being promoted to NORMAL.
- Propagation (RMR_CMD_POOL_INFO with ADD+ASSEMBLE): peers verify that the member's stg_members entry and dirty map already exist; no new state is created. A missing entry is an error. The peer-side handling of both ADD modes is sketched after this list.
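A short sketch of the peer-side ADD handling under the two modes. rmr_srv_add_store_member() is the function named above; its signature, the helpers pool_find_member() and pool_mark_member_fully_dirty(), and the error codes are assumptions for illustration.

```c
/*
 * Peer-side ADD handling, simplified. rmr_srv_add_store_member() is named in
 * the text; its signature, pool_find_member(), pool_mark_member_fully_dirty(),
 * and the error codes are assumptions for this sketch.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct rmr_srv_pool;                       /* opaque server-side pool state */

enum rmr_pool_info_mode { POOL_INFO_CREATE, POOL_INFO_ASSEMBLE };

int  rmr_srv_add_store_member(struct rmr_srv_pool *pool, uint16_t member_id);
bool pool_find_member(struct rmr_srv_pool *pool, uint16_t member_id);
void pool_mark_member_fully_dirty(struct rmr_srv_pool *pool, uint16_t member_id);

static int handle_pool_info_add(struct rmr_srv_pool *pool, uint16_t member_id,
				enum rmr_pool_info_mode mode, bool dirty)
{
	int ret;

	switch (mode) {
	case POOL_INFO_CREATE:
		/* New member: create its stg_members entry and dirty map. */
		ret = rmr_srv_add_store_member(pool, member_id);
		if (ret)
			return ret;
		if (dirty)
			/* Start fully dirty: anything this node holds may
			 * still be missing on the new member. */
			pool_mark_member_fully_dirty(pool, member_id);
		return 0;
	case POOL_INFO_ASSEMBLE:
		/* Returning member: entry and dirty map must already exist. */
		return pool_find_member(pool, member_id) ? 0 : -ENOENT;
	}
	return -EINVAL;
}
```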
A third store-only mode, Replace (RMR_SRV_DISK_REPLACE), covers the case where the old disk is gone and a new empty disk is inserted. Server-side, the existing map is erased, a fresh map is created, and the RMR_STORE_IS_REPLACE bit is set on map_ver so that peers discover the replacement and coordinate discards of the dirty entries they still hold for this member (see Discard coordination). There is no corresponding session mode — the session simply assembles on top of the replaced store.
Note
Replace is currently disabled. The user-facing add_store mode=replace path in brmr-server rejects the request, and the surrounding flows that depend on it (discard coordination and the replace-triggered parts of last-IO reconciliation) are not exercised in normal operation. Most of the underlying code exists but has known edge cases and missing peer-to-peer info exchange that need further work.
Delete vs. Disassemble¶
Used when a node is removed from a pool, chosen to match whether the removal is permanent or whether the node is expected to come back.
Delete is permanent removal (decommissioning).
- Store (delete_store, rmr_srv_unregister(delete=true)): brmr-server closes the block device and wipes the pool metadata from disk. The disk must be reformatted with create_store before it can be reused in any pool.
- Session (RMR_CMD_LEAVE_POOL with delete=true): the server deletes the dirty maps of all other members on this node (rmr_srv_process_leave_delete()) — this node no longer needs to track dirty data for anyone.
- Propagation (RMR_CMD_POOL_INFO with REMOVE+DELETE): peers call rmr_srv_delete_store_member(), erasing the member's stg_members entry and dirty map.
Disassemble is transient removal (maintenance, graceful shutdown with planned return).
- Store (remove_store, rmr_srv_unregister(delete=false)): brmr-server closes the block device but preserves the on-disk pool metadata so the disk can be reattached later via add_store.
- Session (RMR_CMD_LEAVE_POOL with delete=false): no map changes on the server side — state is preserved for a subsequent reassemble.
- Propagation (RMR_CMD_POOL_INFO with REMOVE+DISASSEMBLE): peers keep the member's stg_members entry and dirty map intact. IOs arriving while the member is away continue to accumulate dirty entries for it via the piggyback mechanism, so the state needed for resync on reassembly is built up during the detachment. The peer-side handling of both REMOVE modes is sketched after this list.
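A companion sketch for the REMOVE propagation, under the same caveats: rmr_srv_delete_store_member() is named above, while its signature and the explicit no-op for DISASSEMBLE are assumptions for illustration.

```c
/*
 * Peer-side REMOVE handling, simplified. rmr_srv_delete_store_member() is
 * named in the text; its signature and the explicit DISASSEMBLE no-op are
 * assumptions for this sketch.
 */
#include <errno.h>
#include <stdint.h>

struct rmr_srv_pool;                       /* opaque server-side pool state */

enum rmr_pool_remove_mode { POOL_INFO_DELETE, POOL_INFO_DISASSEMBLE };

int rmr_srv_delete_store_member(struct rmr_srv_pool *pool, uint16_t member_id);

static int handle_pool_info_remove(struct rmr_srv_pool *pool,
				   uint16_t member_id,
				   enum rmr_pool_remove_mode mode)
{
	switch (mode) {
	case POOL_INFO_DELETE:
		/* Permanent removal: drop the stg_members entry and dirty map. */
		return rmr_srv_delete_store_member(pool, member_id);
	case POOL_INFO_DISASSEMBLE:
		/* Transient removal: keep the entry and dirty map so resync
		 * state keeps accumulating while the member is away. */
		return 0;
	}
	return -EINVAL;
}
```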
Map update¶
When a session is brought back into service, its dirty map must be reconciled against the pool before it is allowed to serve IO. The client picks an authoritative session (a NORMAL one, or the single was_last_authoritative session surviving a full-pool failure) and orchestrates a three-command handshake per receiving session:
- RMR_CMD_MAP_READY — sent to the receiving session to prepare it to accept a map.
- RMR_CMD_MAP_SEND — sent to the authoritative session with the receiver's member_id, instructing it to transfer its map in chunks (via RMR_CMD_SEND_MAP_BUF / RMR_CMD_MAP_BUF_DONE).
- RMR_CMD_MAP_DONE — sent to the receiving session once the transfer is complete. It carries an enable flag that controls whether the session transitions to NORMAL at the end.
The entry point is rmr_clt_spread_map(). The enable flag lets the same handshake cover two cases: admitting a freshly reconnected session to service (enable=true) and propagating an up-to-date map to sessions that must not yet be admitted (enable=false).
If IOs are already flowing through a NORMAL session at the time of the spread, the client freezes IO for the duration of the exchange so new writes cannot race with reconciliation.
If any step fails, the receiving session is sent RMR_CMD_MAP_DISABLE to discard the partial state.
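Putting the three commands and the failure path together, a hedged sketch of one receiver's handshake might look like the following; send_cmd() and the message bodies are placeholders for whatever the real client transport provides, and the actual rmr_clt_spread_map() may structure this differently.

```c
/*
 * One receiver's map-spread handshake, simplified. send_cmd() and the message
 * bodies are placeholders; the real rmr_clt_spread_map() may differ in shape.
 */
#include <stdbool.h>
#include <stdint.h>

struct rmr_clt_pool_sess;                  /* opaque client pool session */

enum { CMD_MAP_READY, CMD_MAP_SEND, CMD_MAP_DONE, CMD_MAP_DISABLE };

/* Placeholder: send one control command and wait for its reply. */
int send_cmd(struct rmr_clt_pool_sess *sess, int cmd, const void *body);

static int spread_map_to(struct rmr_clt_pool_sess *authoritative,
			 struct rmr_clt_pool_sess *receiver,
			 uint16_t receiver_member_id, bool enable)
{
	int ret;

	/* 1. Prepare the receiver to accept a map. */
	ret = send_cmd(receiver, CMD_MAP_READY, NULL);
	if (ret)
		goto fail;

	/* 2. Ask the authoritative session to push its map for this member;
	 * the transfer itself runs as RMR_CMD_SEND_MAP_BUF chunks closed by
	 * RMR_CMD_MAP_BUF_DONE. */
	ret = send_cmd(authoritative, CMD_MAP_SEND, &receiver_member_id);
	if (ret)
		goto fail;

	/* 3. Close the handshake; enable=true also promotes the receiver to
	 * NORMAL. */
	ret = send_cmd(receiver, CMD_MAP_DONE, &enable);
	if (ret)
		goto fail;

	return 0;
fail:
	/* Any failure discards the partial state on the receiver. */
	send_cmd(receiver, CMD_MAP_DISABLE, NULL);
	return ret;
}
```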
Map version¶
Every pool carries a monotonically advancing map_ver in struct rmr_pool_md. It advances whenever the authoritative view changes and is used during recovery to pick the most up-to-date node. RMR_CMD_MAP_GET_VER and RMR_CMD_MAP_SET_VER read and write the version on a storage node. When no session is obviously authoritative — for example after a pserver crash — the client queries every session for its map_ver and picks the node with the highest value as the source for the subsequent spread. See Map Version Handling for the full design.
Note
The current u64 representation of map_ver is a temporary choice. It multiplexes ordering, a state-carrying flag (RMR_STORE_IS_REPLACE), and peer comparison into a single integer, which does not scale to future needs. A refactor to a richer representation is planned.
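As an illustration of the selection step, the sketch below queries each session for its map_ver and keeps the highest value. The query helper is a placeholder, and masking RMR_STORE_IS_REPLACE out of the ordering comparison (including its bit position) is an assumption suggested by the note above rather than something confirmed by the text.

```c
/*
 * Illustration only: query_map_ver() and the session indexing are placeholders,
 * and RMR_STORE_IS_REPLACE is assumed here to be a high bit masked out of the
 * ordering comparison, which is not spelled out in the document.
 */
#include <stdint.h>

#define RMR_STORE_IS_REPLACE (1ULL << 63)  /* assumed bit position */

/* Returns the index of the session holding the highest map_ver, or -1. */
static int pick_authoritative(uint64_t (*query_map_ver)(int sess_idx),
			      int nr_sess)
{
	uint64_t best_ver = 0;
	int best = -1;

	for (int i = 0; i < nr_sess; i++) {
		uint64_t ver = query_map_ver(i) & ~RMR_STORE_IS_REPLACE;

		if (best < 0 || ver > best_ver) {
			best_ver = ver;
			best = i;
		}
	}
	return best;
}
```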
Discard coordination¶
Some failure modes leave the pool with dirty entries for a member_id whose underlying data no longer exists — most notably after a disk replacement, where the replaced node returns in a cleared state and the RMR_STORE_IS_REPLACE bit is set on its map_ver. Such entries have to be discarded across the pool, and the discard must be coordinated so every surviving node processes it before the state is treated as settled.
The client issues a two-step protocol:
1. RMR_CMD_SEND_DISCARD to every NORMAL session in the pool, identifying the member_id whose tracked entries should be dropped.
2. RMR_CMD_DISCARD_CLEAR_FLAG once all sessions have acknowledged. This clears the per-member discard_entries flag in pool_md.
Splitting the exchange in two ensures a surviving node cannot clear its own discard flag before peers have processed the discard. The current trigger is inside the last-IO update: a node whose map_ver has RMR_STORE_IS_REPLACE set triggers the two-step discard, after which the map spread proceeds.
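The ordering argument can be sketched as follows; send_cmd() is a placeholder, and the assumption that the clear-flag command is also sent to each session individually is illustrative rather than confirmed by the text.

```c
/*
 * Two-step discard, simplified. send_cmd() is a placeholder; sending the
 * clear-flag command per session is an assumption made for the sketch.
 */
#include <stdint.h>

struct rmr_clt_pool_sess;

enum { CMD_SEND_DISCARD, CMD_DISCARD_CLEAR_FLAG };

int send_cmd(struct rmr_clt_pool_sess *sess, int cmd, const void *body);

static int discard_member_entries(struct rmr_clt_pool_sess **normal_sess,
				  int nr_sess, uint16_t member_id)
{
	int ret;

	/* Step 1: every NORMAL session drops its tracked entries first. */
	for (int i = 0; i < nr_sess; i++) {
		ret = send_cmd(normal_sess[i], CMD_SEND_DISCARD, &member_id);
		if (ret)
			return ret;
	}

	/* Step 2: only after all acknowledgements is the per-member
	 * discard_entries flag cleared, so no node settles the state before
	 * its peers have processed the discard. */
	for (int i = 0; i < nr_sess; i++) {
		ret = send_cmd(normal_sess[i], CMD_DISCARD_CLEAR_FLAG, &member_id);
		if (ret)
			return ret;
	}
	return 0;
}
```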
Note
Because Replace is currently disabled (see Create vs. Assemble), the RMR_STORE_IS_REPLACE bit is never set in normal operation, so this coordination path is not exercised today. The protocol and handlers are in place for when Replace is re-enabled.
Last-IO reconciliation¶
If the pserver itself goes down while IOs are in flight, an individual IO may have completed on some storage nodes and not others. No surviving session can be trusted as authoritative on its own.
rmr_clt_start_last_io_update() handles this case. It runs from rmr_clt_pool_try_enable() when every member_id in pool_md is present and in RECONNECTING:
1. Query each session's map_ver; pick the node with the highest value as the authoritative source, applying any pending discards (RMR_STORE_IS_REPLACE) first.
2. Spread that map across the pool so every session shares a common baseline.
3. Send RMR_CMD_LAST_IO_TO_MAP to every session. Each storage node turns its persisted last_io array (the IDs of the most recently processed IOs — see Terminology, and Last IO update for the full design) into dirty entries on every peer's map, so any IO that was incomplete at the time of the crash is now marked dirty wherever it could still be missing.
4. Spread the resulting maps again and promote the sessions to NORMAL.
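The condition that gates this path can be sketched compactly; the state enum and the per-member state array are simplifications, not the real data structures.

```c
/*
 * Gate for the last-IO path, simplified: it runs only when every member_id in
 * pool_md has a session and all of them are RECONNECTING. The enum and the
 * per-member state array are simplifications for the sketch.
 */
#include <stdbool.h>

enum sess_state { SESS_NORMAL, SESS_RECONNECTING, SESS_FAILED, SESS_MISSING };

static bool need_last_io_update(const enum sess_state *state_by_member,
				int nr_members)
{
	for (int i = 0; i < nr_members; i++)
		if (state_by_member[i] != SESS_RECONNECTING)
			return false;   /* a missing or differently-stated
					 * member means another path applies */
	return nr_members > 0;
}
```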
Recovery thread¶
Lifecycle, map update, discard, and last-IO reconciliation are event-driven — each runs in response to a specific trigger (join, reconnect, replacement, crash). The recovery thread is the part of the control path that runs on its own schedule.
Each client pool has its own recovery worker (recover_dwork on recover_wq in rmr_clt_pool, entry point recover_work()) that wakes every RMR_RECOVER_INTERVAL_MS. On every tick it walks the pool’s sessions and performs three tasks: check whether dirty map entries held on the pserver can be cleared, probe failed sessions to see if IO can resume, and push the latest client-side pool metadata to the storage nodes.
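The overall shape of the tick, sketched with the kernel delayed-work API: the three task helpers, the interval value, and the reduced struct layout are placeholders for the behavior described above, not the real code.

```c
/*
 * Shape of the recovery tick, sketched with the kernel delayed-work API. The
 * helper names, the interval value, and the reduced struct rmr_clt_pool are
 * placeholders for the three tasks described in the text.
 */
#include <linux/kernel.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

#define RECOVER_INTERVAL_MS 1000           /* placeholder for RMR_RECOVER_INTERVAL_MS */

struct rmr_clt_pool {
	struct workqueue_struct *recover_wq;
	struct delayed_work recover_dwork;
	/* ... */
};

void pool_map_check(struct rmr_clt_pool *pool);   /* clear pserver-held dirty entries */
void pool_store_check(struct rmr_clt_pool *pool); /* probe FAILED sessions */
void pool_send_md(struct rmr_clt_pool *pool);     /* push pool_md to storage nodes */

static void recover_work(struct work_struct *work)
{
	struct rmr_clt_pool *pool = container_of(to_delayed_work(work),
						 struct rmr_clt_pool,
						 recover_dwork);

	pool_map_check(pool);
	pool_store_check(pool);
	pool_send_md(pool);

	/* Re-arm for the next tick. */
	queue_delayed_work(pool->recover_wq, &pool->recover_dwork,
			   msecs_to_jiffies(RECOVER_INTERVAL_MS));
}
```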
Map check¶
When a storage node has gone through an error while IOs are running, dirty entries are added to the map. The map is stored on all storage nodes and on the pserver. As chunks are synced — either through the sync thread or while servicing IOs — the dirty map entries are cleared from the storage nodes. The pserver does not take part in syncing, so it never sees the clears directly; its entries have to be cleared explicitly.
To do that, the recovery worker sends RMR_CMD_MAP_CHECK to each storage node for which the pserver still holds dirty entries. The check is only issued for sessions whose rmr_clt_pool_sess state is NORMAL and whose client-side map is non-empty.
If the storage node’s server pool is not itself in the NORMAL state, it does not inspect its map and unconditionally replies that the map is non-empty. This prevents the pserver from clearing state against a storage node not yet ready to vouch for it. A NORMAL but still-degraded storage node answers honestly — its own map is non-empty and it says so.
After a storage node reports an empty map, the pserver does not clear its entries immediately. It waits for RMR_MAP_CLEAN_DELAY_MS and requires the “empty” answer to persist across ticks before unsetting the dirty bits.
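A sketch of that hysteresis, with the bookkeeping fields and the delay value invented for illustration:

```c
/*
 * Delayed clear on the pserver side, simplified: an "empty" reply only leads
 * to clearing after it has persisted for RMR_MAP_CLEAN_DELAY_MS across ticks.
 * The fields, helper, and delay value are inventions for the sketch.
 */
#include <stdbool.h>
#include <stdint.h>

#define MAP_CLEAN_DELAY_MS 5000            /* placeholder for RMR_MAP_CLEAN_DELAY_MS */

struct member_map_state {
	bool     clean_pending;            /* last reply said "empty" */
	uint64_t clean_since_ms;           /* when the first empty reply was seen */
};

/* Called on each recovery tick with the latest RMR_CMD_MAP_CHECK reply. */
static bool should_clear_map(struct member_map_state *st, bool reply_empty,
			     uint64_t now_ms)
{
	if (!reply_empty) {
		/* Any non-empty reply resets the countdown. */
		st->clean_pending = false;
		return false;
	}
	if (!st->clean_pending) {
		st->clean_pending = true;
		st->clean_since_ms = now_ms;
		return false;
	}
	return now_ms - st->clean_since_ms >= MAP_CLEAN_DELAY_MS;
}
```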
RMR_CMD_MAP_CHECK is also triggered internally in the reverse direction, from one storage node to another. Each server pool runs its own delayed worker (clean_dwork on clean_wq, scheduled every RMR_SRV_CHECK_MAPS_INTERVAL_MS, entry point rmr_srv_check_map_clear()) that walks the dirty maps the node holds for its peers — one map per member_id other than its own. For each peer whose map is non-empty, the worker sends RMR_CMD_MAP_CHECK to that peer through the attached sync client pool (via rmr_clt_pool_member_synced()). If the peer replies that its own map is empty, the storage node clears its local tracking for that peer with rmr_srv_clear_map(). The server-side gating rules on the response — NORMAL-only inspection, conservative non-empty reply otherwise — apply identically in this direction.
Store check¶
When the RMR server gets an error from the backend while sending IOs, it propagates the error to the RMR client and the relevant session moves to FAILED. The client stops sending IOs on that session until the error is resolved.
To detect when IO can resume, the recovery worker sends RMR_CMD_STORE_CHECK to sessions that are FAILED and whose underlying rmr_clt_sess is CONNECTED. The server queries its backend via io_store->ops->io_allowed() — for brmr-server this checks that the block device is both OPEN and MAPPED — and replies.
A positive response does not put the session back in service directly. It transitions the session from FAILED to RECONNECTING and calls rmr_clt_pool_try_enable(), which drives the map update (or the last-IO update, if all members are reconnecting) that ultimately promotes the session to NORMAL. See Pool and Session States for the full state machine.
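A sketch of how the probe result feeds the state machine; the helper names are placeholders, and the precondition that the underlying rmr_clt_sess is CONNECTED is omitted for brevity.

```c
/*
 * Feeding a positive store check back into the state machine, simplified. The
 * helper names are placeholders; the check that the underlying rmr_clt_sess
 * is CONNECTED is omitted here.
 */
#include <stdbool.h>

enum pool_sess_state { POOL_SESS_NORMAL, POOL_SESS_RECONNECTING, POOL_SESS_FAILED };

struct rmr_clt_pool;
struct rmr_clt_pool_sess { enum pool_sess_state state; };

bool store_check(struct rmr_clt_pool_sess *sess);        /* RMR_CMD_STORE_CHECK round trip */
void rmr_clt_pool_try_enable(struct rmr_clt_pool *pool); /* drives the map / last-IO update */

static void probe_failed_session(struct rmr_clt_pool *pool,
				 struct rmr_clt_pool_sess *sess)
{
	if (sess->state != POOL_SESS_FAILED)
		return;
	if (!store_check(sess))
		return;                    /* backend still not io_allowed() */

	/* Not back in service yet: reconciliation promotes it to NORMAL. */
	sess->state = POOL_SESS_RECONNECTING;
	rmr_clt_pool_try_enable(pool);
}
```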
Sessions with maintenance_mode set skip the map check but remain eligible for store check and state progression, so they can be brought back online in a controlled way.
Metadata send¶
The client pool maintains pool_md (see Terminology), which the storage nodes read back to assemble or reconcile pool state. At the end of each recovery tick, the worker refreshes the local copy and pushes it to every storage node via RMR_CMD_SEND_MD_BUF. A failed send is not retried within the tick — the next tick will try again.