Documentation Index
Fetch the complete documentation index at: https://docs.springtail.io/llms.txt
Use this file to discover all available pages before exploring further.
Springtail Storage Layer
This document describes the storage layout and organization of data in Springtail’s storage engine, from the low-level extent structure up through tables and indexes.Overview
Springtail uses a multi-version concurrency control (MVCC) storage system with copy-on-write semantics. Data is organized hierarchically:- Extents: Fixed-size blocks containing rows of data with a fixed-width schema
- Fields: Type-specific accessors for reading/writing column values within rows
- BTrees: B+ tree indexes composed of extents, used for both primary and secondary indexes
- Tables: Logical collections of data organized by a primary BTree and zero or more secondary indexes
- Cache: Unified page cache managing access to extents across all tables
Extent Structure
An extent is the fundamental storage unit in Springtail. Each extent consists of three components stored sequentially on disk:1. Extent Header
The header (serialized viaExtentHeader::pack()) contains metadata about the extent:
LEAF(0x00): Contains actual data rowsBRANCH(0x01): Contains index keys and child extent pointers- Root flag (0x02): Indicates this extent is a tree root
2. Fixed Data
The fixed data section contains a contiguous array of fixed-width rows. Each row has exactlyrow_size bytes and stores:
- Fixed-width values: Integers, floats stored inline at specific byte offsets (at the beginning of the row)
- Variable-data offsets: 4-byte offsets into the variable data section for TEXT, BINARY, NUMERIC, and EXTENSION types
- Boolean fields: Packed as bits after all fixed-width fields
- Null bitmaps: Packed bits indicating which nullable fields are NULL (after boolean fields)
- Undefined bitmaps: Packed bits for replication tracking (indicates unchanged fields). Note: undefined bitmaps are only present in write cache extents, not in on-disk table extents.
ExtentSchema (see src/storage/schema.cc:79-162), which places fixed-width fields first, then boolean bits, null bits, and finally undefined bits (if applicable) at the end of the row.
Example row structure for on-disk extents (simplified):
3. Variable Data
Variable-length values (TEXT, BINARY, NUMERIC, EXTENSION types) are stored in a separate variable data section managed by theVariableData class. This provides:
- Length-prefixed storage: Each value stored as
[4-byte length][data] - Automatic deduplication: Hash table detects duplicate values within an extent
- Chunked allocation: Data organized in 4KB chunks to avoid vector reallocation
- Global offset addressing: Fixed data stores 4-byte offsets into this global space
Extent::_variable_hash) maps string_view → offset, ensuring identical strings share storage.
Field Abstraction
TheField class hierarchy provides type-safe access to column values within logical rows (often extent rows, but also replication log messages, constant values, etc.):
Field Types
ExtentField (include/storage/field.hh:508): Accesses data in an Extent::Row by:
- Storing byte offset within the fixed row data
- For booleans: storing bit position and bitmask
- For nullable fields: tracking null bitmap offset and bit
- For variable data: reading offset from fixed data, then dereferencing into variable section
include/storage/field.hh:325): Extends Field with set_* methods for writing values
Field Operations
Fields support:- Type-specific
get_*andset_*methods (e.g.,get_int64(),set_text()) - Comparison operations (
less_than(),equal()) used for index searches - Null and undefined checks via bitmask operations
- Copying values between different row sources
BTree Organization
Tables and indexes are organized as B+ trees composed of extents:BTree Structure (include/storage/btree.hh)
Read-Only BTree:
- Immutable snapshot at a specific XID
- Root extent identified by offset
- Branch extents contain:
[sort_key_fields...][child_extent_offset: uint64] - Leaf extents contain: actual table data or index entries
- Binary search within extents for O(log n) lookups
include/storage/mutable_btree.hh):
- Copy-on-write modifications at target XID
- In-memory page cache with configurable size
- Dirty pages flushed hierarchically (children before parents)
- Supports insert, remove, upsert operations
- Auto-splits extents when they exceed size threshold
BTree Extent Types
Branch Extents:- Schema: sort key columns +
child_offsetfield - Each row represents separator key + pointer to child extent
- Used for navigation during tree traversal
- For primary index: full table rows with all columns
- For secondary index: index columns + extent_id + row_id for lookup
BTree Iterator
TheBTree::Iterator performs in-order traversal:
- Maintains path from root to current leaf (via
Nodeobjects) ++operator: advances within extent or navigates to next leaf via parent--operator: similar backward traversal- Supports binary search via
lower_bound(),upper_bound(),find()
Table Organization
Table Structure (include/sys_tbl_mgr/table.hh)
A Table represents a logical collection of rows with:
Primary Index (_primary_index):
- BTree storing full row data, sorted by primary key
- Schema includes all table columns
- Branch extents use primary key fields for navigation
- Leaf extents contain complete rows
_secondary_indexes):
- Map from index_id → (BTree, column_positions)
- Schema: index columns + internal_row_id
- Used for queries with non-primary-key predicates
- Maintained alongside primary index
_look_aside_index):
- Maps internal_row_id → (extent_id, row_position)
- Enables secondary index lookups to find actual rows
- Schema:
[internal_row_id][extent_id][row_offset]
Table Iterator
Table::Iterator supports multiple iteration modes:
Primary iteration: Traverses BTree, loads data extents from extent_id pointers
Secondary iteration: Uses secondary index + look-aside for data access
Index-only iteration: Returns index entries without fetching full rows
Table Metadata
Tables maintain metadata in a “roots” file:Mutable Tables
MutableTable (include/sys_tbl_mgr/mutable_table.hh)
Provides write operations on tables:
Operations:
insert(): Add new row (append or positioned by primary key)update(): Modify existing row (copy-on-write)remove(): Delete row from extent and indexesupsert(): Insert or update based on primary key existence
- Empty table: Use
_empty_pagefor first extent - Append: Add to last extent if sorted correctly
- Lookup: Use primary index to find target extent
- Direct: Write to specified extent_id (from write cache)
- Dirty pages trigger
_flush_handler()callback - Invalidates old index entries via
_invalidate_indexes() - Populates new entries via
_flush_and_populate_indexes() - Maintains consistency between primary and secondary indexes
Finalization
MutableTable::finalize():
- Flushes all dirty pages (primary data + all index BTrees)
- Returns
TableMetadatawith new root extent offsets - Optionally calls
sync_data_and_indexes()for durability
Storage Cache
Cache Architecture (include/storage/cache.hh)
The StorageCache provides unified page management:
Page States:
CLEAN: Read from disk, unmodified, can be evictedDIRTY: Modified by MutableTable, needs flushMUTABLE: Actively being modifiedFLUSHING: Asynchronous write in progressINVALID: Evicted or replaced
- Wraps one or more
Extentobjects - Tracks extent_id, cache_id, file path
- Reference counting for multi-user access
- LRU eviction for clean pages
get(): Retrieve page by (database_id, file, extent_id, xid)get_or_create(): Get existing or create new mutable page- Copy-on-write: CLEAN pages copied to MUTABLE on first write
- Hierarchical flushing: Dirty pages written with updated child pointers
SafePagePtr
RAII wrapper managing page lifecycle:- Acquires page from cache on construction
- Releases back to cache on destruction
- Supports read (shared) and write (exclusive) locks
- Transparent access to underlying
Pageobject
MVCC and Versioning
Copy-on-Write Semantics
All modifications create new extent versions:- Read extent at access_xid
- Copy to new MUTABLE extent
- Apply modifications
- Flush with target_xid in header
- Update parent extent with new child pointer
Extent Chains
Extents form version chains viaprev_offset:
Snapshot Isolation
Tableobjects snapshot at specific XID- BTree navigation uses extent headers to find correct version
- Time-travel queries access historical extent versions
- Write transactions create new snapshot at target_xid
Schema Evolution
Schemas are versioned alongside data: ExtentSchema (include/storage/schema.hh):
- Defines column order, types, nullability
- Calculates fixed row layout and field offsets
- Generates Field accessors for reading/writing
- New columns: Added with default values or NULL
- Dropped columns: Marked as non-existent in schema
- Type changes: Handled via schema conversion during page reads
- Schema metadata tracked per XID range
File Organization
Tables stored in directories:Summary
Springtail’s storage layer provides:- Extent-level: Compact fixed+variable row storage with deduplication
- Field-level: Type-safe column access with null/undefined tracking
- Index-level: B+ tree organization with copy-on-write modifications
- Table-level: Primary + secondary indexes with look-aside support
- Cache-level: Unified page management with MVCC snapshots