Signed-off-by: Fabian Reinartz <freinartz@google.com>
WAL Disk Format
The write ahead log operates in segments that are numbered and sequential,
e.g. 000000, 000001, 000002, etc., and are limited to 128MB by default.
A segment is written to in pages of 32KB. Only the last page of the most recent segment
may be partial. A WAL record is an opaque byte slice that gets split up into sub-records
should it exceed the remaining space of the current page. Records are never split across
segment boundaries. If a single record exceeds the default segment size, a segment with
a larger size will be created.
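As a rough illustration of the layout described above, the sketch below derives a segment file name and the space left in the current page. The identifiers (`segmentName`, `pageRemaining`) are hypothetical, and it assumes that "32KB" and "128MB" mean binary units (1024-based) and that names are zero-padded to six digits as in the examples.

```go
package main

import "fmt"

const (
	// Assumption: KB and MB are binary units (1024-based).
	pageSize           = 32 * 1024         // pages of 32KB
	defaultSegmentSize = 128 * 1024 * 1024 // default segment limit of 128MB
)

// segmentName formats a segment index as a zero-padded, six-digit name,
// matching the examples 000000, 000001, 000002.
func segmentName(index int) string {
	return fmt.Sprintf("%06d", index)
}

// pageRemaining returns how many bytes are still free in the page that
// contains the given write offset within a segment.
func pageRemaining(offset int) int {
	return pageSize - offset%pageSize
}
```

For example, `segmentName(2)` yields `000002`, and a write offset ten bytes into the second page leaves `pageSize - 10` bytes before the next page boundary.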
The encoding of pages is largely borrowed from LevelDB's/RocksDB's write ahead log.
The notable deviation is that each record fragment is encoded as:
┌───────────┬──────────┬────────────┬──────────────┐
│ type <1b> │ len <2b> │ CRC32 <4b> │ data <bytes> │
└───────────┴──────────┴────────────┴──────────────┘
The type flag has the following states:
0: rest of page will be empty
1: a full record encoded in a single fragment
2: first fragment of a record
3: middle fragment of a record
4: final fragment of a record
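The fragment layout and type flags can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names are hypothetical, and it assumes big-endian integers and the Castagnoli CRC32 polynomial, neither of which is stated in the text above.

```go
package main

import (
	"encoding/binary"
	"hash/crc32"
)

const (
	pageSize   = 32 * 1024
	headerSize = 7 // type <1b> + len <2b> + CRC32 <4b>
)

// Fragment type flags as listed above.
const (
	recPageTerm byte = 0 // rest of page will be empty
	recFull     byte = 1 // full record in a single fragment
	recFirst    byte = 2 // first fragment of a record
	recMiddle   byte = 3 // middle fragment of a record
	recLast     byte = 4 // final fragment of a record
)

// Assumption: the checksum uses the Castagnoli polynomial.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// encodeFragment prepends the 7-byte header to the fragment data.
// Assumption: len and CRC32 are stored big-endian.
func encodeFragment(typ byte, data []byte) []byte {
	buf := make([]byte, headerSize+len(data))
	buf[0] = typ
	binary.BigEndian.PutUint16(buf[1:3], uint16(len(data)))
	binary.BigEndian.PutUint32(buf[3:7], crc32.Checksum(data, castagnoli))
	copy(buf[headerSize:], data)
	return buf
}

// splitRecord cuts rec into encoded fragments, given the number of bytes
// still free in the current page. Page padding is left to the caller.
func splitRecord(rec []byte, free int) [][]byte {
	var frags [][]byte
	first := true
	for {
		if free <= headerSize {
			free = pageSize // no room for header plus data: continue on a fresh page
		}
		n := free - headerSize
		if n > len(rec) {
			n = len(rec)
		}
		last := n == len(rec)
		var typ byte
		switch {
		case first && last:
			typ = recFull
		case first:
			typ = recFirst
		case last:
			typ = recLast
		default:
			typ = recMiddle
		}
		frags = append(frags, encodeFragment(typ, rec[:n]))
		rec = rec[n:]
		free -= headerSize + n
		first = false
		if last {
			return frags
		}
	}
}
```

A record that fits into the remaining page space becomes a single fragment of type 1. A larger record becomes a type 2 fragment, zero or more type 3 fragments, and a closing type 4 fragment.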
Record encoding
The records written to the write ahead log are encoded as follows:
Series records
Series records encode the labels that identify a series and its unique ID.
┌────────────────────────────────────────────┐
│ type = 1 <1b> │
├────────────────────────────────────────────┤
│ ┌─────────┬──────────────────────────────┐ │
│ │ id <8b> │ n = len(labels) <uvarint> │ │
│ ├─────────┴────────────┬─────────────────┤ │
│ │ len(str_1) <uvarint> │ str_1 <bytes> │ │
│ ├──────────────────────┴─────────────────┤ │
│ │ ... │ │
│ ├───────────────────────┬────────────────┤ │
│ │ len(str_2n) <uvarint> │ str_2n <bytes> │ │
│ └───────────────────────┴────────────────┘ │
│ . . . │
└────────────────────────────────────────────┘
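A possible encoder for this layout is sketched below. The types and function names are hypothetical. It assumes that the 8-byte id is big-endian and that the 2n strings are the label names and values in alternating order (name, value, name, value, ...). It relies on `binary.AppendUvarint` and `binary.BigEndian.AppendUint64`, which require Go 1.19 or later.

```go
package main

import "encoding/binary"

// label and series are illustrative types, not taken from the source.
type label struct{ name, value string }

type series struct {
	id     uint64
	labels []label
}

// appendString writes len(s) as a uvarint followed by the raw bytes of s.
func appendString(b []byte, s string) []byte {
	b = binary.AppendUvarint(b, uint64(len(s)))
	return append(b, s...)
}

// encodeSeries builds a series record: the type byte 1, then one entry
// per series holding the id, the label count n, and 2n strings.
func encodeSeries(list []series) []byte {
	b := []byte{1} // type = 1
	for _, s := range list {
		b = binary.BigEndian.AppendUint64(b, s.id) // assumption: big-endian id
		b = binary.AppendUvarint(b, uint64(len(s.labels)))
		for _, l := range s.labels {
			// Assumption: name first, then value.
			b = appendString(b, l.name)
			b = appendString(b, l.value)
		}
	}
	return b
}
```

Under these assumptions, a single series with id 1 and one label `a="b"` encodes to 14 bytes: the type byte, the 8-byte id, the label count, and two length-prefixed one-byte strings.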
Sample records
Sample records encode samples as a list of triples (series_id, timestamp, value).
Series reference and timestamp are encoded as deltas w.r.t. the first sample.
┌──────────────────────────────────────────────────────────────────┐
│ type = 2 <1b> │
├──────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id <8b> │ timestamp <8b> │ value <8b> │ │
│ └────────────────────┴───────────────────────────┴─────────────┘ │
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b> │ │
│ └────────────────────┴───────────────────────────┴─────────────┘ │
│ . . . │
└──────────────────────────────────────────────────────────────────┘
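The delta scheme in the diagram can be sketched like this. The names are hypothetical, and several details are assumptions rather than facts from the text: fixed-width fields are big-endian, values are stored as IEEE 754 float64 bits, and samples are ordered so that id and timestamp deltas relative to the first sample are non-negative (a uvarint cannot represent a negative delta compactly).

```go
package main

import (
	"encoding/binary"
	"math"
)

// sample is an illustrative type, not taken from the source.
type sample struct {
	id uint64
	t  int64
	v  float64
}

// encodeSamples builds a sample record: the type byte 2, the first sample
// in full, then every later sample as deltas relative to the first one.
func encodeSamples(list []sample) []byte {
	b := []byte{2} // type = 2
	if len(list) == 0 {
		return b
	}
	first := list[0]
	b = binary.BigEndian.AppendUint64(b, first.id)
	b = binary.BigEndian.AppendUint64(b, uint64(first.t))
	b = binary.BigEndian.AppendUint64(b, math.Float64bits(first.v))
	for _, s := range list[1:] {
		// Assumption: deltas are non-negative.
		b = binary.AppendUvarint(b, s.id-first.id)
		b = binary.AppendUvarint(b, uint64(s.t-first.t))
		b = binary.BigEndian.AppendUint64(b, math.Float64bits(s.v))
	}
	return b
}
```

With small deltas, each additional sample costs 10 bytes (two one-byte uvarints plus the 8-byte value) instead of the 24 bytes used by the first sample.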
Tombstone records
Tombstone records encode tombstones as a list of triples (series_id, min_time, max_time)
and specify an interval for which samples of a series got deleted.
┌─────────────────────────────────────────────────────┐
│ type = 3 <1b> │
├─────────────────────────────────────────────────────┤
│ ┌─────────┬───────────────────┬───────────────────┐ │
│ │ id <8b> │ min_time <varint> │ max_time <varint> │ │
│ └─────────┴───────────────────┴───────────────────┘ │
│ . . . │
└─────────────────────────────────────────────────────┘
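A tombstone record could be produced along these lines. The type and function names are hypothetical, and the 8-byte id is assumed to be big-endian. The time bounds use Go's signed varint encoding (`binary.AppendVarint`, Go 1.19 or later), in line with the `<varint>` fields in the diagram.

```go
package main

import "encoding/binary"

// tombstone is an illustrative type, not taken from the source.
type tombstone struct {
	id         uint64
	mint, maxt int64 // deleted interval
}

// encodeTombstones builds a tombstone record: the type byte 3, then one
// (id, min_time, max_time) triple per tombstone.
func encodeTombstones(list []tombstone) []byte {
	b := []byte{3} // type = 3
	for _, t := range list {
		b = binary.BigEndian.AppendUint64(b, t.id) // assumption: big-endian id
		b = binary.AppendVarint(b, t.mint)         // signed varint, so negative times are allowed
		b = binary.AppendVarint(b, t.maxt)
	}
	return b
}
```

Because the bounds are signed varints, small positive and negative timestamps both stay compact, while the series id always occupies 8 bytes.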