Memory Management
- Memory Management
Overview
This document provides the technical details of memory management in wasm-dbms, also known as Layer 0 (the Memory Layer). Understanding this layer is useful for:
- Performance optimization
- Debugging memory issues
- Contributing to wasm-dbms
- Understanding storage costs
How Internet Computer Memory Works
On the Internet Computer, canisters have access to stable memory that persists across upgrades. Key characteristics:
- Page-based: Memory is divided into 64 KiB (65,536 bytes) pages
- Growable: Canisters start small and can allocate additional pages
- Persistent: Survives canister upgrades
- Limited: Subject to subnet memory limits
wasm-dbms uses stable memory directly (not the heap) to ensure data persistence.
Memory Model
┌──────────────────────────────────────────────────┐
│ Page 0: Schema Registry (65 KiB) │
│ - Table name hashes → table registry pages │
├──────────────────────────────────────────────────┤
│ Page 1: ACL Table (65 KiB) │
│ - List of allowed principals │
├──────────────────────────────────────────────────┤
│ Page 2: Unclaimed Pages Ledger (65 KiB) │
│ - Stack of pages released by destructive ops │
├──────────────────────────────────────────────────┤
│ Page 3: Table "users" Schema Snapshot │
├──────────────────────────────────────────────────┤
│ Page 4: Table "users" Page Ledger │
├──────────────────────────────────────────────────┤
│ Page 5: Table "users" Free Segments Ledger │
├──────────────────────────────────────────────────┤
│ Page 6: Table "users" Index Ledger │
├──────────────────────────────────────────────────┤
│ Page 7: Table "users" Autoincrement Ledger (*) │
├──────────────────────────────────────────────────┤
│ Page 8: Table "posts" Schema Snapshot │
├──────────────────────────────────────────────────┤
│ Page 9: Table "posts" Page Ledger │
├──────────────────────────────────────────────────┤
│ Page 10: Table "posts" Free Segments Ledger │
├──────────────────────────────────────────────────┤
│ Page 11: Table "posts" Index Ledger │
├──────────────────────────────────────────────────┤
│ Page 12: Table "users" Records - Page 1 │
├──────────────────────────────────────────────────┤
│ Page 13: Table "users" Records - Page 2 │
├──────────────────────────────────────────────────┤
│ Page 14: B-Tree Node (index on users.id) │
├──────────────────────────────────────────────────┤
│ Page 15: B-Tree Node (index on users.email) │
├──────────────────────────────────────────────────┤
│ Page 16: Table "posts" Records - Page 1 │
├──────────────────────────────────────────────────┤
│ ... │
└──────────────────────────────────────────────────┘
(*) Only allocated for tables with #[autoincrement] columns
Layout characteristics:
- Reserved pages (0-2) are allocated at initialization
- Each table gets a Schema Snapshot, Page Ledger, Free Segments Ledger, and Index Ledger
- Tables with
#[autoincrement]columns also get an Autoincrement Ledger page - Record pages and B-tree node pages are allocated on demand
- Pages can be interleaved between tables
- Pages released by destructive ops (e.g.
DropTable) are returned to the Unclaimed Pages Ledger and reused by subsequentclaim_pagecalls before the high-water mark is bumped
Memory Provider
The MemoryProvider trait abstracts memory access:
#![allow(unused)]
fn main() {
pub trait MemoryProvider {
/// Size of a memory page in bytes (64 KiB for IC)
const PAGE_SIZE: u64;
/// Current memory size in bytes
fn size(&self) -> u64;
/// Number of allocated pages
fn pages(&self) -> u64;
/// Grow memory by new_pages
/// Returns previous size on success
fn grow(&mut self, new_pages: u64) -> MemoryResult<u64>;
/// Read bytes from memory at offset
fn read(&mut self, offset: u64, buf: &mut [u8]) -> MemoryResult<()>;
/// Write bytes to memory at offset
fn write(&mut self, offset: u64, buf: &[u8]) -> MemoryResult<()>;
}
}
Implementations:
| Implementation | Use Case |
|---|---|
IcMemoryProvider | IC production (uses ic_cdk::stable::*) |
WasiMemoryProvider | WASI production (file-backed, single flat file) |
HeapMemoryProvider | Testing (uses Vec<u8>) |
#![allow(unused)]
fn main() {
// Production: Uses IC stable memory APIs
pub struct IcMemoryProvider;
#[cfg(target_family = "wasm")]
impl MemoryProvider for IcMemoryProvider {
const PAGE_SIZE: u64 = ic_cdk::stable::WASM_PAGE_SIZE_IN_BYTES;
fn grow(&mut self, new_pages: u64) -> MemoryResult<u64> {
ic_cdk::stable::stable_grow(new_pages)
.map_err(MemoryError::ProviderError)
}
fn read(&mut self, offset: u64, buf: &mut [u8]) -> MemoryResult<()> {
ic_cdk::stable::stable_read(offset, buf);
Ok(())
}
fn write(&mut self, offset: u64, buf: &[u8]) -> MemoryResult<()> {
ic_cdk::stable::stable_write(offset, buf);
Ok(())
}
}
// Testing: Uses heap memory
pub struct HeapMemoryProvider {
memory: Vec<u8>,
}
}
Memory Manager and MemoryAccess
The MemoryManager builds on MemoryProvider to handle page allocation. Its page-level
read/write operations are exposed through the MemoryAccess trait, which allows the DBMS layer
to substitute a journaled writer for atomic transactions (see Atomicity).
#![allow(unused)]
fn main() {
/// Abstracts page-level read/write operations.
///
/// `MemoryManager` implements this trait directly. The DBMS layer provides
/// `JournaledWriter`, which wraps a `MemoryManager` and records original
/// bytes before each write for rollback support.
pub trait MemoryAccess {
fn page_size(&self) -> u64;
/// Hand out a page — pops from the unclaimed-pages ledger if one is
/// available, otherwise grows the underlying provider.
fn claim_page(&mut self) -> MemoryResult<Page>;
/// Return a page to the unclaimed-pages ledger after zeroing it.
fn unclaim_page(&mut self, page: Page) -> MemoryResult<()>;
/// Grow the provider by exactly one zero-initialized page (primitive).
fn grow_one_page(&mut self) -> MemoryResult<Page>;
/// Zero an entire allocated page (primitive).
fn zero_page(&mut self, page: Page) -> MemoryResult<()>;
fn read_at<D: Encode>(&mut self, page: Page, offset: PageOffset) -> MemoryResult<D>;
fn write_at<E: Encode>(&mut self, page: Page, offset: PageOffset, data: &E) -> MemoryResult<()>;
fn zero<E: Encode>(&mut self, page: Page, offset: PageOffset, data: &E) -> MemoryResult<()>;
fn read_at_raw(&mut self, page: Page, offset: PageOffset, buf: &mut [u8]) -> MemoryResult<usize>;
}
}
claim_page and unclaim_page ship as default trait methods built on
top of the four primitives — every implementor automatically inherits the
unclaimed-pages-aware allocation strategy. JournaledWriter overrides
only grow_one_page (intentionally not journaled, since extending
the high-water mark cannot be replayed in reverse) and zero_page
(records the full pre-zero page contents so a rollback restores them).
#![allow(unused)]
fn main() {
pub struct MemoryManager<P: MemoryProvider> {
provider: P,
}
// Global instance (thread-local for IC)
// All state is consolidated in a single DbmsContext:
thread_local! {
pub static DBMS_CONTEXT: DbmsContext<IcMemoryProvider> =
DbmsContext::new(IcMemoryProvider::default());
}
impl<P: MemoryProvider> MemoryManager<P> {
/// Initialize and allocate reserved pages
fn init(provider: P) -> Self;
/// ACL page number (always 1)
pub const fn acl_page(&self) -> Page;
/// Schema registry page (always 0)
pub const fn schema_page(&self) -> Page;
}
// MemoryAccess is implemented for MemoryManager<P>,
// delegating directly to the underlying MemoryProvider.
impl<P: MemoryProvider> MemoryAccess for MemoryManager<P> { /* ... */ }
}
All table-registry and ledger functions are generic over impl MemoryAccess rather than
taking &[mut] MemoryManager directly. This makes it possible to intercept writes at the
DBMS layer without modifying any memory-crate code.
Unclaimed Pages Ledger
The unclaimed-pages ledger lives on reserved page 2 (UNCLAIMED_PAGES_PAGE).
It is a LIFO stack of [Page] numbers that destructive operations have
released. claim_page consults this stack before bumping the high-water
mark; unclaim_page zeroes the page and pushes it onto the stack.
Serialization format:
Offset Size Field
0 4 Number of unclaimed pages (u32, little-endian)
4+ var Sequence of page numbers (u32, little-endian) — newest last
The single reserved page can hold up to UNCLAIMED_PAGES_CAPACITY = 16382
entries (the encoded ledger size must fit in MSize = u16). Pushing
beyond capacity returns MemoryError::UnclaimedPagesFull. A future
extension may chain additional pages once a real workload exhausts the
single-page budget; for now the v1 limit is enough to absorb typical
drop-and-recreate cycles.
Behavior:
claim_pagepops the most recently unclaimed page (LIFO ordering keeps hot pages cache-warm). When the ledger is empty it grows the provider by exactly one zero-initialized page.unclaim_pagezeroes the page in full before pushing — released pages never leak residual record bytes. The zero is journaled when invoked throughJournaledWriter, so a rolled-back transaction restores the page contents and the ledger update.- The default
MemoryAccessimpls ofclaim_pageandunclaim_pagedrive the ledger entirely through the trait’s read/write methods, so every interceptor (journal, future overlays) automatically participates. MigrationOp::DropTablewalks every page owned by the dropped table — record pages, page-ledger / free-segments / index-ledger pages, every B-tree node, schema-snapshot and (optional) autoincrement pages — and hands each one tounclaim_pagebefore clearing the table from the schema registry.
Rollback semantics:
Inside an atomic block:
- A successful
unclaim_pagewhose surrounding transaction rolls back is fully reversed: the page contents reappear and the ledger does not contain the page. - A
claim_pagethat hits the ledger pops a page; on rollback the ledger is restored and the page is “back” in the unclaimed pool. Any data the caller wrote into that page is also reverted by the journal. - A
claim_pagethat grows the provider returns a page whose existence is not journaled; on rollback the page stays grown but unreferenced (it leaks until the next process lifetime). This matches the previous behavior ofallocate_pageand is unchanged by this design.
Encode Trait
All data stored in memory implements the Encode trait:
#![allow(unused)]
fn main() {
pub trait Encode {
/// Size characteristic: Fixed or Dynamic
const SIZE: DataSize;
/// Memory alignment in bytes
/// - For Fixed: must equal size
/// - For Dynamic: minimum 8, default 32
const ALIGNMENT: PageOffset;
/// Encode to bytes
fn encode(&'_ self) -> Cow<'_, [u8]>;
/// Decode from bytes
fn decode(data: Cow<[u8]>) -> MemoryResult<Self>
where
Self: Sized;
/// Size of encoded data
fn size(&self) -> MSize;
}
pub enum DataSize {
/// Fixed size in bytes (e.g., integers)
Fixed(MSize),
/// Variable size (e.g., strings, blobs)
Dynamic,
}
}
Examples:
| Type | SIZE | ALIGNMENT |
|---|---|---|
Uint32 | Fixed(4) | 4 |
Int64 | Fixed(8) | 8 |
Text | Dynamic | 32 (default) |
Blob | Dynamic | 32 (default) |
| User-defined record | Dynamic | Configurable (default 32) |
Schema Registry
The Schema Registry maps tables to their storage pages:
#![allow(unused)]
fn main() {
/// Information about a table's storage pages
pub struct TableRegistryPage {
pub schema_snapshot_page: Page, // Schema Snapshot Ledger location
pub pages_list_page: Page, // Page Ledger location
pub free_segments_page: Page, // Free Segments Ledger location
pub index_registry_page: Page, // Index Ledger location
pub autoincrement_registry_page: Option<Page>, // Autoincrement Ledger (if needed)
}
/// Maps table fingerprints to storage locations
pub struct SchemaRegistry {
tables: HashMap<TableFingerprint, TableRegistryPage>,
}
}
The autoincrement_registry_page is only allocated when a table has at least one column
with the #[autoincrement] attribute. For tables without autoincrement columns, this
field is None, avoiding unnecessary page allocation.
Table Fingerprint:
- Hash of
TableSchema::table_name()— stable across rebuilds and schema evolution - Used as the key into the schema registry’s
HashMap - Enables multiple tables in one canister
Name collision detection:
SchemaRegistry::register_table is collision-aware. When the fingerprint slot is already
occupied, the registry loads the persisted Schema Snapshot from
that table’s schema_snapshot_page and compares its name against the candidate’s
TableSchema::table_name():
- Same name: the entry is the same logical table — return the existing pages, no allocation
- Different name: two distinct names hashed to the same value — return
MemoryError::NameCollision { candidate, existing }without allocating any page
ACL Storage
The Access Control List is stored in Page 1. Access control is abstracted
behind the AccessControl trait, which allows different runtimes to use
different identity types (e.g., Principal on IC, Vec<u8> for generic use).
#![allow(unused)]
fn main() {
pub trait AccessControl: Default {
type Id;
fn load<M>(mm: &MemoryManager<M>) -> MemoryResult<Self>
where
M: MemoryProvider,
Self: Sized;
fn is_allowed(&self, identity: &Self::Id) -> bool;
fn allowed_identities(&self) -> Vec<Self::Id>;
fn add_identity<M>(&mut self, identity: Self::Id, mm: &mut MemoryManager<M>) -> MemoryResult<()>
where
M: MemoryProvider;
fn remove_identity<M>(&mut self, identity: &Self::Id, mm: &mut MemoryManager<M>) -> MemoryResult<()>
where
M: MemoryProvider;
}
}
The default implementation AccessControlList uses Vec<u8> as its identity
type. NoAccessControl is a no-op implementation (with type Id = ()) for
runtimes that don’t require ACL. The IC layer provides IcAccessControlList
which wraps AccessControlList and uses Principal as its identity type.
Table Registry
Each table has a TableRegistry managing its records, plus an optional
AutoincrementLedger for tables with autoincrement columns:
#![allow(unused)]
fn main() {
pub struct TableRegistry {
schema_snapshot_ledger: SchemaSnapshotLedger,
page_ledger: PageLedger,
free_segments_ledger: FreeSegmentsLedger,
index_ledger: IndexLedger,
auto_increment_ledger: Option<AutoincrementLedger>,
}
}
Schema Snapshot Ledger
The SchemaSnapshotLedger persists a single TableSchemaSnapshot on the table’s
schema_snapshot_page. The snapshot is the frozen, comparable view of the table’s
compile-time schema — name, primary key, alignment, columns, and indexes — used for
drift detection and migration planning.
#![allow(unused)]
fn main() {
pub struct SchemaSnapshotLedger {
snapshot: TableSchemaSnapshot, // cached copy of the on-disk snapshot
}
impl SchemaSnapshotLedger {
pub fn init<Schema: TableSchema>(page: Page, mm: &mut impl MemoryAccess) -> MemoryResult<()>;
pub fn load(page: Page, mm: &mut impl MemoryAccess) -> MemoryResult<Self>;
pub fn write(&mut self, page: Page, snapshot: TableSchemaSnapshot, mm: &mut impl MemoryAccess) -> MemoryResult<()>;
pub fn get(&self) -> &TableSchemaSnapshot;
}
}
Behavior:
initis called exactly once per table bySchemaRegistry::register_table, capturing the snapshot fromTableSchema::schema_snapshot()and writing it to the dedicated pageloaddecodes the persisted snapshot and caches it in memorywritereplaces the persisted snapshot (used after a successful migration) and updates the cache; on write error the cache is left untouchedgetreturns the cached snapshot — no I/O on the hot path
Serialization format (TableSchemaSnapshot):
Offset Size Field
0 1 Snapshot format version (u8, current = 0x01)
1 1 Table name length (u8)
2+ N1 UTF-8 table name
+ 1 Primary key column name length (u8)
+ N2 UTF-8 primary key column name
+ 4 Record alignment (u32, little-endian)
+ 2 Column count (u16, little-endian)
+ var For each column:
- 2 bytes: encoded column size (u16, little-endian)
- var bytes: encoded `ColumnSnapshot`
+ 2 Index count (u16, little-endian)
+ var For each index:
- 2 bytes: encoded index size (u16, little-endian)
- var bytes: encoded `IndexSnapshot`
ColumnSnapshot encodes name, data-type tag (with optional payload for Custom),
nullable / auto-increment / unique / primary-key flags, optional foreign key, and
optional default value. IndexSnapshot encodes the covered column names plus the
unique flag.
Stability rules:
DataTypeSnapshotdiscriminants are frozen — never reordered, never reused- New fields append at the tail and bump the container version; old readers stop at the previous length prefix
- Removed fields leave their slot reserved; later fields do not shift
See Schema Reference for the full per-field layout.
Page Ledger
Tracks which pages contain records for this table:
#![allow(unused)]
fn main() {
pub struct PageLedger {
ledger_page: Page, // Where this ledger is stored
pages: PageTable, // List of data pages with free space info
}
impl PageLedger {
/// Load from memory
pub fn load(page: Page) -> MemoryResult<Self>;
/// Get page for writing a record
/// Returns existing page with space or allocates new
pub fn get_page_for_record<R: Encode>(&mut self, record: &R) -> MemoryResult<Page>;
/// Commit allocation (update free space tracking)
pub fn commit<R: Encode>(&mut self, page: Page, record: &R) -> MemoryResult<()>;
}
}
Free Segments Ledger
Tracks free space from deleted/moved records:
#![allow(unused)]
fn main() {
pub struct FreeSegmentsLedger {
free_segments_page: Page,
tables: PagesTable, // Pages containing FreeSegmentsTables
}
pub struct FreeSegment {
pub page: Page,
pub offset: PageOffset,
pub size: MSize,
}
impl FreeSegmentsLedger {
/// Insert a free segment (when record is deleted)
pub fn insert_free_segment<E: Encode>(
&mut self,
page: Page,
offset: PageOffset,
record: &E,
) -> MemoryResult<()>;
/// Find reusable space for a record
pub fn find_reusable_segment<E: Encode>(
&self,
record: &E,
) -> MemoryResult<Option<FreeSegmentTicket>>;
/// Commit reused space
pub fn commit_reused_space<E: Encode>(
&mut self,
record: &E,
segment: FreeSegmentTicket,
) -> MemoryResult<()>;
}
}
Space reuse logic:
- When a record is deleted, its space is added to free segments
- When inserting, check for suitable free segment first
- If found, reuse the space; remaining space becomes new free segment
- Adjacent free segments are merged to reduce fragmentation
Autoincrement Ledger
Tables with #[autoincrement] columns have a dedicated page storing the current counter
value for each autoincrement column. The AutoincrementLedger manages these counters:
#![allow(unused)]
fn main() {
pub struct AutoincrementLedger {
page: Page,
registry: AutoincrementRegistry, // column name → current Value
}
}
Serialization format:
Offset Size Field
0 1 Number of entries (u8)
1+ var For each entry:
- 1 byte: column name length (u8)
- N bytes: UTF-8 column name
- var bytes: encoded Value (type-tagged)
Behavior:
- Initialized with zero values matching each column’s integer type when a table is registered
next()increments the counter by one and persists the updated value to memory- Uses
checked_add— returnsMemoryError::AutoincrementOverflowwhen a column reaches its type’s maximum value, preventing duplicate key generation - Each column’s counter is independent; advancing one does not affect others
- State survives across
load()/save()cycles (persisted to the dedicated page)
Supported types:
| Type | Range |
|---|---|
Int8 | -128 to 127 |
Int16 | -32,768 to 32,767 |
Int32 | -2,147,483,648 to 2,147,483,647 |
Int64 | -9.2 × 10¹⁸ to 9.2 × 10¹⁸ |
Uint8 | 0 to 255 |
Uint16 | 0 to 65,535 |
Uint32 | 0 to 4,294,967,295 |
Uint64 | 0 to 18.4 × 10¹⁸ |
Record Storage
Record Encoding
Records are wrapped in RawRecord with a length header:
┌─────────────────────────────────────────┐
│ 2 bytes: Data length (little-endian) │
├─────────────────────────────────────────┤
│ N bytes: Encoded data │
├─────────────────────────────────────────┤
│ Padding to alignment boundary │
└─────────────────────────────────────────┘
Dynamic size example (alignment=32, data=24 bytes):
Bytes 0-1: Data length (24)
Bytes 2-25: Data (24 bytes)
Bytes 26-31: Padding (6 bytes)
Total: 32 bytes (aligned)
Fixed size example (size=14 bytes):
Bytes 0-1: Data length (14)
Bytes 2-15: Data (14 bytes)
Total: 16 bytes (no padding for fixed)
Record Alignment
Alignment ensures efficient memory access:
#![allow(unused)]
fn main() {
impl<E: Encode> Alignment for E {
fn alignment() -> usize {
match E::SIZE {
DataSize::Fixed(size) => size as usize,
DataSize::Dynamic => E::ALIGNMENT as usize,
}
}
}
fn align_up<E: Encode>(size: usize) -> usize {
let align = E::alignment();
(size + align - 1) / align * align
}
}
Configuring alignment:
#![allow(unused)]
fn main() {
#[derive(Table, ...)]
#[table = "large_records"]
#[alignment = 64] // Custom alignment for this table
pub struct LargeRecord {
// ...
}
}
Table Reader
Reading records from a table:
#![allow(unused)]
fn main() {
impl<E: Encode> TableRegistry<E> {
pub fn read_all(&self) -> MemoryResult<Vec<E>> {
let mut records = Vec::new();
for page in self.page_ledger.pages() {
let mut offset = 0;
while offset < PAGE_SIZE {
// Read length header
let len = read_u16_le(page, offset);
if len == 0 {
// Skip empty slot
offset += E::alignment();
continue;
}
// Read and decode record
let data = read_bytes(page, offset + 2, len);
let record = E::decode(data)?;
records.push(record);
// Move to next aligned position
offset += align_up::<E>(len + 2);
}
}
Ok(records)
}
}
}
Read process:
- Read 2 bytes at offset for data length
- If length is 0, skip to next aligned position
- Read
lengthbytes of data - Decode data into record
- Move to next aligned position
- Repeat until end of page
Index Registry
Each table has an IndexLedger that maps index definitions (column sets) to B-tree root pages.
Indexes are always B+ trees where each node occupies exactly one memory page (64 KiB).
Every table automatically gets an index on its primary key. Additional indexes can be declared
with the #[index] attribute (see Schema Reference).
Index Ledger
The IndexLedger is stored in a single page per table and maps column sets to B-tree root pages:
#![allow(unused)]
fn main() {
pub struct IndexLedger {
ledger_page: Page,
tables: HashMap<Vec<String>, Page>, // column names → root page
}
}
Serialization format:
Offset Size Field
0-7 8 Number of indexes (u64)
8+ var For each index:
- 8 bytes: column count (u64)
- For each column name:
- 1 byte: name length (u8)
- N bytes: UTF-8 column name
- 4 bytes: root page (u32)
When a table is registered via SchemaRegistry::register_table(), the index ledger is
initialized by allocating one root page per index definition. The ledger supports
insert, delete, update, exact-match search, and range scan operations — all delegated
to the underlying B-tree for the appropriate column set.
B-Tree Structure
Indexes use a B+ tree where values (record pointers) are stored only in leaf nodes. Internal nodes contain separator keys that guide traversal. Each node is a single page.
#![allow(unused)]
fn main() {
struct RecordAddress {
page: Page, // 4 bytes, u32
offset: PageOffset, // 2 bytes, u16
}
}
RecordAddress is the pointer stored in leaf entries, pointing to the exact location
of the record in the table’s data pages. It is 6 bytes when serialized.
Key characteristics:
- Variable-size keys: Entries are packed as many as fit in a 64 KiB page
- Non-unique: The same key can map to multiple
RecordAddressvalues - Linked leaves: Leaf nodes form a doubly-linked list for range scans
- Node type tag: Byte 0 distinguishes internal (0x00) from leaf (0x01) nodes
Internal Node Layout
┌──────────────────────────────────────────────────────┐
│ Byte 0: Node type (0x00 = INTERNAL) │
│ Bytes 1-4: Parent page (u32, u32::MAX if root) │
│ Bytes 5-6: Entry count (u16) │
│ Bytes 7-10: Rightmost child page (u32) │
├──────────────────────────────────────────────────────┤
│ Entry 0: │
│ Bytes 0-1: Key size (u16) │
│ Bytes 2+: Key data (variable) │
│ Next 4: Child page (u32) │
├──────────────────────────────────────────────────────┤
│ Entry 1: ... │
├──────────────────────────────────────────────────────┤
│ ... │
└──────────────────────────────────────────────────────┘
Header size: 11 bytes. Entries are sorted by key. A search for key K routes to the
child page of the first entry whose key is >= K, or to rightmost_child if K
is greater than all entries.
Leaf Node Layout
┌──────────────────────────────────────────────────────┐
│ Byte 0: Node type (0x01 = LEAF) │
│ Bytes 1-4: Parent page (u32, u32::MAX if root) │
│ Bytes 5-6: Entry count (u16) │
│ Bytes 7-10: Previous leaf page (u32, u32::MAX=none) │
│ Bytes 11-14: Next leaf page (u32, u32::MAX=none) │
├──────────────────────────────────────────────────────┤
│ Entry 0: │
│ Bytes 0-1: Key size (u16) │
│ Bytes 2+: Key data (variable) │
│ Next 6: RecordAddress (4-byte page + 2-byte │
│ offset) │
├──────────────────────────────────────────────────────┤
│ Entry 1: ... │
├──────────────────────────────────────────────────────┤
│ ... │
└──────────────────────────────────────────────────────┘
Header size: 15 bytes. Entries are sorted by (key, record address). The
prev_leaf / next_leaf pointers form a doubly-linked list across all leaves,
enabling efficient forward and backward range scans.
Index Maintenance
Indexes are updated eagerly on every write operation:
- INSERT: After writing the record and obtaining its
RecordAddress, the key is inserted into every index defined on the table. - DELETE: After removing the record, the key-pointer pair is removed from every index.
- UPDATE: If indexed columns changed, those indexes are updated (delete old
key + insert new key). If the record moved (size change), all indexes are updated
with the new
RecordAddress.
When a leaf node overflows during insertion, it splits at its midpoint. The first key of the new right sibling is promoted to the parent internal node. If the parent also overflows, the split propagates upward. When the root splits, a new root is created and the tree height increases by one.
When a leaf becomes empty after deletion (and is not the root), it is unlinked from the leaf chain and its parent is updated.
Index Tree Walker
Range scans use an IndexTreeWalker that iterates through leaf entries across
linked leaf pages:
#![allow(unused)]
fn main() {
pub struct IndexTreeWalker<K: Encode + Ord> {
entries: Vec<LeafEntry<K>>, // Current leaf's entries
cursor: usize, // Position within current leaf
next_leaf: Option<Page>, // Next leaf page for continuation
end_key: Option<K>, // Optional upper bound (inclusive)
}
}
The walker starts at the first leaf entry >= start_key and advances through the
linked-leaf chain until it reaches an entry > end_key (or exhausts all leaves).
This provides efficient iteration for range queries without revisiting internal
nodes.