WAL Disk Format

This document describes the official Prometheus WAL format.

The write ahead log operates in segments that are versioned, numbered and sequential, and are limited to 128MB by default.

Segment filename

Both the sequence number and the version are captured in the segment filename, e.g. 000000, 000001-v2, 000002-v4. The exact format is:

<uint>[-v<uint>]

The first unsigned integer is the sequence number of the segment, typically encoded with six digits. The second unsigned integer, following the -v string, is the segment version. If the filename does not contain -v<uint>, the segment is version 1.
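
The filename rule above can be sketched as a small parser. This is an illustration only: the function name and the regular expression are hypothetical and are not taken from the Prometheus code base. It assumes the filename consists of ASCII digits, optionally followed by -v and more digits.

```python
import re
from typing import Tuple

# Hypothetical pattern mirroring the <uint>[-v<uint>] format described above.
_SEGMENT_NAME = re.compile(r"([0-9]+)(?:-v([0-9]+))?")


def parse_segment_name(name: str) -> Tuple[int, int]:
    """Return (sequence_number, version) for a WAL segment filename."""
    m = _SEGMENT_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"not a WAL segment filename: {name!r}")
    seq = int(m.group(1))
    # A missing -v<uint> suffix means version 1.
    version = int(m.group(2)) if m.group(2) is not None else 1
    return seq, version
```

For example, parse_segment_name("000001-v2") yields the pair (1, 2), while "000000" is treated as sequence 0, version 1.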

Segment v1

This section describes version 1 of the segment encoding.

A segment encodes an array of records. It does not contain any header. A segment is written in pages of 32KB. Only the last page of the most recent segment may be partial. A WAL record is an opaque byte slice that is split into sub-records (fragments) if it exceeds the remaining space of the current page. Records are never split across segment boundaries. If a single record exceeds the default segment size, a segment with a larger size is created.
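
To make the page-splitting rule concrete, here is a minimal sketch that plans how one record would be cut into fragments. It rests on several assumptions that this document does not spell out: that 32KB means 32 * 1024 bytes, that every fragment carries the 7-byte header (type, length, CRC32) described under "Records encoding", and that a page with fewer than 7 bytes left is padded and skipped. Segment boundaries are ignored, and the function name is hypothetical.

```python
PAGE_SIZE = 32 * 1024    # assumption: "32KB" means 32 * 1024 bytes
HEADER_SIZE = 1 + 2 + 4  # type <1b> + len <2b> + CRC32 <4b>

# Fragment type values as listed under "Records encoding".
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4


def plan_fragments(record_len, page_offset=0):
    """Return [(fragment_type, data_len), ...] for a record of record_len
    bytes whose first fragment starts at page_offset within a page."""
    fragments = []
    remaining = record_len
    first = True
    while first or remaining > 0:
        space = PAGE_SIZE - page_offset
        if space < HEADER_SIZE:
            # Assumption: no room for a header, so the rest of the page is
            # padding and the fragment starts on a fresh page.
            page_offset = 0
            space = PAGE_SIZE
        n = min(remaining, space - HEADER_SIZE)
        last = n == remaining
        if first and last:
            typ = FULL
        elif first:
            typ = FIRST
        elif last:
            typ = LAST
        else:
            typ = MIDDLE
        fragments.append((typ, n))
        remaining -= n
        page_offset = (page_offset + HEADER_SIZE + n) % PAGE_SIZE
        first = False
    return fragments
```

Under these assumptions, a 70000-byte record starting on an empty page is planned as a first fragment, one middle fragment, and a final fragment, while a 100-byte record fits into a single full fragment.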

The encoding of pages is largely borrowed from LevelDB's/RocksDB's write ahead log.

Records encoding

Each record fragment is encoded as:

┌───────────┬──────────┬────────────┬──────────────┐
│ type <1b> │ len <2b> │ CRC32 <4b> │ data <bytes> │
└───────────┴──────────┴────────────┴──────────────┘

The initial type byte is made up of four components: a 3-bit reserved field, a 1-bit zstd compression flag, a 1-bit snappy compression flag, and a 3-bit type flag.

┌─────────────────┬──────────────────┬────────────────────┬──────────────────┐
│ reserved <3bit> │ zstd_flag <1bit> │ snappy_flag <1bit> │ type_flag <3bit> │
└─────────────────┴──────────────────┴────────────────────┴──────────────────┘

The 3-bit type flag (the lowest 3 bits of the type byte) identifies the kind of fragment as follows:

  • 0: rest of page will be empty
  • 1: a full record encoded in a single fragment
  • 2: first fragment of a record
  • 3: middle fragment of a record
  • 4: final fragment of a record
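
The bit layout of the type byte can be unpacked with a few masks. This sketch assumes the diagram lists the fields from the most significant bit to the least significant bit, which puts the snappy flag at bit 3 and the zstd flag at bit 4. The function name and the descriptive strings are hypothetical.

```python
# Descriptive names for the fragment type values listed above.
FRAGMENT_TYPES = {
    0: "rest of page empty",
    1: "full",
    2: "first",
    3: "middle",
    4: "last",
}


def decode_type_byte(b):
    """Split the type byte into its four components.

    Assumes fields are ordered MSB to LSB as drawn in the diagram:
    reserved (bits 7-5), zstd (bit 4), snappy (bit 3), type (bits 2-0).
    """
    return {
        "type": b & 0b111,
        "snappy": bool(b & 0b1000),
        "zstd": bool(b & 0b10000),
        "reserved": b >> 5,
    }
```

With this layout, the byte 0b00001001 decodes as a full record in a single fragment with the snappy flag set.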

After the type byte come a 2-byte length and then a 4-byte CRC32 checksum of the data that follows.
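
A fragment reader following this layout might look like the sketch below. This document does not state the byte order or the CRC32 polynomial. The sketch assumes big-endian integers and the Castagnoli polynomial (CRC-32C), and both should be checked against the implementation before relying on them. The function names are hypothetical, and the bitwise CRC routine is written for clarity rather than speed.

```python
import struct

# type <1b>, len <2b>, CRC32 <4b>; big-endian byte order is an assumption.
HEADER = struct.Struct(">BHI")


def crc32c(data):
    """Bitwise CRC-32C (Castagnoli). The polynomial is an assumption here."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def read_fragment(page, offset=0):
    """Return (type_byte, data, next_offset) for the fragment at offset."""
    type_byte, length, crc = HEADER.unpack_from(page, offset)
    start = offset + HEADER.size
    data = bytes(page[start:start + length])
    if len(data) != length:
        raise ValueError("truncated fragment")
    if crc32c(data) != crc:
        raise ValueError("checksum mismatch")
    return type_byte, data, start + length
```

The returned next_offset lets a caller walk the fragments of a page one after another until it meets a type value of 0 or reaches the end of the page.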

All float values are represented using the IEEE 754 format.

Record types

The following sections describe all known record types. New types can be added in the future within the same segment version. Removing an existing type, or changing it in a breaking way, requires a new segment version.

Series records

Series records encode the labels that identify a series and its unique ID.

┌────────────────────────────────────────────┐
│ type = 1 <1b>                              │
├────────────────────────────────────────────┤
│ ┌─────────┬──────────────────────────────┐ │
│ │ id <8b> │ n = len(labels) <uvarint>    │ │
│ ├─────────┴────────────┬─────────────────┤ │
│ │ len(str_1) <uvarint> │ str_1 <bytes>   │ │
│ ├──────────────────────┴─────────────────┤ │
│ │  ...                                   │ │
│ ├───────────────────────┬────────────────┤ │
│ │ len(str_2n) <uvarint> │ str_2n <bytes> │ │
│ └───────────────────────┴────────────────┘ │
│                  . . .                     │
└────────────────────────────────────────────┘
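
As an illustration of the layout above, the following sketch decodes a series record that has already been reassembled from its fragments and decompressed. It assumes that the 8-byte id is big-endian, that uvarint is the unsigned base-128 varint used by Go's encoding/binary package, and that the 2n strings alternate between label name and label value. All function names are hypothetical.

```python
import struct


def read_uvarint(buf, pos):
    """Decode an unsigned base-128 varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7


def decode_series(rec):
    """Return [(series_id, {label_name: label_value}), ...]."""
    if rec[0] != 1:
        raise ValueError("not a series record")
    pos = 1
    series = []
    while pos < len(rec):
        (series_id,) = struct.unpack_from(">Q", rec, pos)  # big-endian assumed
        pos += 8
        n, pos = read_uvarint(rec, pos)
        strings = []
        for _ in range(2 * n):
            length, pos = read_uvarint(rec, pos)
            strings.append(rec[pos:pos + length].decode("utf-8"))
            pos += length
        # Assumption: strings alternate name, value, name, value, ...
        labels = dict(zip(strings[0::2], strings[1::2]))
        series.append((series_id, labels))
    return series
```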

Sample records

Sample records encode samples as a list of triples (series_id, timestamp, value). Series reference and timestamp are encoded as deltas relative to the first sample. The first row stores the starting id and the starting timestamp. The first sample record begins at the second row.

┌──────────────────────────────────────────────────────────────────┐
│ type = 2 <1b>                                                    │
├──────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐               │
│ │ id <8b>            │ timestamp <8b>            │               │
│ └────────────────────┴───────────────────────────┘               │
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b>  │ │
│ └────────────────────┴───────────────────────────┴─────────────┘ │
│                              . . .                               │
└──────────────────────────────────────────────────────────────────┘

Tombstone records

Tombstone records encode tombstones as a list of triples (series_id, min_time, max_time) and specify an interval for which samples of a series got deleted.

┌─────────────────────────────────────────────────────┐
│ type = 3 <1b>                                       │
├─────────────────────────────────────────────────────┤
│ ┌─────────┬───────────────────┬───────────────────┐ │
│ │ id <8b> │ min_time <varint> │ max_time <varint> │ │
│ └─────────┴───────────────────┴───────────────────┘ │
│                        . . .                        │
└─────────────────────────────────────────────────────┘
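
The tombstone layout can be decoded along the same lines. This sketch assumes that the 8-byte id is big-endian and that varint is the zigzag-encoded signed varint of Go's encoding/binary package. The input is a reassembled, decompressed record, and the function names are hypothetical.

```python
import struct


def read_uvarint(buf, pos):
    """Decode an unsigned base-128 varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7


def read_varint(buf, pos):
    """Decode a zigzag-encoded signed varint; return (value, next_pos)."""
    ux, pos = read_uvarint(buf, pos)
    x = ux >> 1
    if ux & 1:
        x = ~x
    return x, pos


def decode_tombstones(rec):
    """Return [(series_id, min_time, max_time), ...]."""
    if rec[0] != 3:
        raise ValueError("not a tombstone record")
    pos = 1
    stones = []
    while pos < len(rec):
        (series_id,) = struct.unpack_from(">Q", rec, pos)  # big-endian assumed
        pos += 8
        min_time, pos = read_varint(rec, pos)
        max_time, pos = read_varint(rec, pos)
        stones.append((series_id, min_time, max_time))
    return stones
```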

Exemplar records

Exemplar records encode exemplars as a list of triples (series_id, timestamp, value) plus the length of the labels list, and all the labels. The first row stores the starting id and the starting timestamp. Series reference and timestamp are encoded as deltas relative to the first exemplar. The first exemplar record begins at the second row.

See: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exemplars

┌──────────────────────────────────────────────────────────────────┐
│ type = 4 <1b>                                                    │
├──────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐               │
│ │ id <8b>            │ timestamp <8b>            │               │
│ └────────────────────┴───────────────────────────┘               │
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b>  │ │
│ ├────────────────────┴───────────────────────────┴─────────────┤ │
│ │  n = len(labels) <uvarint>                                   │ │
│ ├──────────────────────┬───────────────────────────────────────┤ │
│ │ len(str_1) <uvarint> │ str_1 <bytes>                         │ │
│ ├──────────────────────┴───────────────────────────────────────┤ │
│ │  ...                                                         │ │
│ ├───────────────────────┬──────────────────────────────────────┤ │
│ │ len(str_2n) <uvarint> │ str_2n <bytes>                       │ │
│ └───────────────────────┴──────────────────────────────────────┘ │
│                              . . .                               │
└──────────────────────────────────────────────────────────────────┘

Metadata records

Metadata records encode the metadata updates associated with a series.

┌────────────────────────────────────────────┐
│ type = 6 <1b>                              │
├────────────────────────────────────────────┤
│ ┌────────────────────────────────────────┐ │
│ │ series_id <uvarint>                    │ │
│ ├────────────────────────────────────────┤ │
│ │ metric_type <1b>                       │ │
│ ├────────────────────────────────────────┤ │
│ │ num_fields <uvarint>                   │ │
│ ├───────────────────────┬────────────────┤ │
│ │ len(name_1) <uvarint> │ name_1 <bytes> │ │
│ ├───────────────────────┼────────────────┤ │
│ │ len(val_1) <uvarint>  │ val_1 <bytes>  │ │
│ ├───────────────────────┴────────────────┤ │
│ │                . . .                   │ │
│ ├───────────────────────┬────────────────┤ │
│ │ len(name_n) <uvarint> │ name_n <bytes> │ │
│ ├───────────────────────┼────────────────┤ │
│ │ len(val_n) <uvarint>  │ val_n <bytes>  │ │
│ └───────────────────────┴────────────────┘ │
│                  . . .                     │
└────────────────────────────────────────────┘
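
A decoder for the metadata layout above could be sketched as follows. It assumes uvarint is the unsigned base-128 varint of Go's encoding/binary package and takes a reassembled, decompressed record as input. The meaning of the metric_type byte is not described here, so it is returned as a raw integer. All function names are hypothetical.

```python
def read_uvarint(buf, pos):
    """Decode an unsigned base-128 varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7


def read_string(buf, pos):
    """Read a uvarint length followed by that many UTF-8 bytes."""
    length, pos = read_uvarint(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def decode_metadata(rec):
    """Return [(series_id, metric_type, {field_name: field_value}), ...]."""
    if rec[0] != 6:
        raise ValueError("not a metadata record")
    pos = 1
    entries = []
    while pos < len(rec):
        series_id, pos = read_uvarint(rec, pos)
        metric_type = rec[pos]
        pos += 1
        num_fields, pos = read_uvarint(rec, pos)
        fields = {}
        for _ in range(num_fields):
            name, pos = read_string(rec, pos)
            value, pos = read_string(rec, pos)
            fields[name] = value
        entries.append((series_id, metric_type, fields))
    return entries
```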

Histogram records

Histogram records encode the integer and float native histogram samples.

A record with integer native histograms using exponential bucketing:

┌───────────────────────────────────────────────────────────────────────┐
│ type = 7 <1b>                                                         │
├───────────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐                    │
│ │ id <8b>            │ timestamp <8b>            │                    │
│ └────────────────────┴───────────────────────────┘                    │
│ ┌────────────────────┬──────────────────────────────────────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint>                    │ │
│ ├────────────────────┴────┬─────────────────────────────────────────┤ │
│ │ counter_reset_hint <1b> │ schema <varint>                         │ │
│ ├─────────────────────────┴────┬────────────────────────────────────┤ │
│ │ zero_threshold (float) <8b>  │   zero_count <uvarint>             │ │
│ ├─────────────────┬────────────┴────────────────────────────────────┤ │
│ │ count <uvarint> │ sum (float) <8b>                                │ │
│ ├─────────────────┴─────────────────────────────────────────────────┤ │
│ │ positive_spans_num <uvarint>                                      │ │
│ ├─────────────────────────────────┬─────────────────────────────────┤ │
│ │ positive_span_offset_1 <varint> │ positive_span_len_1 <uvarint32> │ │
│ ├─────────────────────────────────┴─────────────────────────────────┤ │
│ │ . . .                                                             │ │   
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ negative_spans_num <uvarint>                                      │ │
│ ├───────────────────────────────┬───────────────────────────────────┤ │
│ │ negative_span_offset <varint> │ negative_span_len <uvarint32>     │ │
│ ├───────────────────────────────┴───────────────────────────────────┤ │
│ │ . . .                                                             │ │   
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ positive_bkts_num <uvarint>                                       │ │
│ ├─────────────────────────┬───────┬─────────────────────────────────┤ │
│ │ positive_bkt_1 <varint> │ . . . │ positive_bkt_n <varint>         │ │
│ ├─────────────────────────┴───────┴─────────────────────────────────┤ │
│ │ negative_bkts_num <uvarint>                                       │ │
│ ├─────────────────────────┬───────┬─────────────────────────────────┤ │
│ │ negative_bkt_1 <varint> │ . . . │ negative_bkt_n <varint>         │ │
│ └─────────────────────────┴───────┴─────────────────────────────────┘ │
│                              . . .                                    │
└───────────────────────────────────────────────────────────────────────┘

A record with float histograms:

┌───────────────────────────────────────────────────────────────────────┐
│ type = 8 <1b>                                                         │
├───────────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐                    │
│ │ id <8b>            │ timestamp <8b>            │                    │
│ └────────────────────┴───────────────────────────┘                    │
│ ┌────────────────────┬──────────────────────────────────────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint>                    │ │
│ ├────────────────────┴────┬─────────────────────────────────────────┤ │
│ │ counter_reset_hint <1b> │ schema <varint>                         │ │
│ ├─────────────────────────┴────┬────────────────────────────────────┤ │
│ │ zero_threshold (float) <8b>  │   zero_count (float) <8b>          │ │
│ ├────────────────────┬─────────┴────────────────────────────────────┤ │
│ │ count (float) <8b> │ sum (float) <8b>                             │ │
│ ├────────────────────┴──────────────────────────────────────────────┤ │
│ │ positive_spans_num <uvarint>                                      │ │
│ ├─────────────────────────────────┬─────────────────────────────────┤ │
│ │ positive_span_offset_1 <varint> │ positive_span_len_1 <uvarint32> │ │
│ ├─────────────────────────────────┴─────────────────────────────────┤ │
│ │ . . .                                                             │ │   
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ negative_spans_num <uvarint>                                      │ │
│ ├───────────────────────────────┬───────────────────────────────────┤ │
│ │ negative_span_offset <varint> │ negative_span_len <uvarint32>     │ │
│ ├───────────────────────────────┴───────────────────────────────────┤ │
│ │ . . .                                                             │ │   
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ positive_bkts_num <uvarint>                                       │ │
│ ├─────────────────────────────┬───────┬─────────────────────────────┤ │
│ │ positive_bkt_1 (float) <8b> │ . . . │ positive_bkt_n (float) <8b> │ │
│ ├─────────────────────────────┴───────┴─────────────────────────────┤ │
│ │ negative_bkts_num <uvarint>                                       │ │
│ ├─────────────────────────────┬───────┬─────────────────────────────┤ │
│ │ negative_bkt_1 (float) <8b> │ . . . │ negative_bkt_n (float) <8b> │ │
│ └─────────────────────────────┴───────┴─────────────────────────────┘ │
│                              . . .                                    │
└───────────────────────────────────────────────────────────────────────┘