# WAL Disk Format

The write ahead log operates in segments that are numbered and sequential,
e.g. `000000`, `000001`, `000002`, etc., and are limited to 128MB by default.
A segment is written to in pages of 32KB. Only the last page of the most recent segment
may be partial. A WAL record is an opaque byte slice that gets split up into sub-records
should it exceed the remaining space of the current page. Records are never split across
segment boundaries. If a single record exceeds the default segment size, a segment with
a larger size will be created.
The encoding of pages is largely borrowed from [LevelDB's/RocksDB's write ahead log](https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format).

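As a quick check of the sizes above, the following Python sketch computes how many pages fit in a default segment and how much room is left in the page containing a given write offset. It assumes binary units (1KB = 1024 bytes), which this document does not state, and the function names are illustrative only.

```python
# Assumption: "32KB" and "128MB" are binary units (1024-based).
PAGE_SIZE = 32 * 1024
DEFAULT_SEGMENT_SIZE = 128 * 1024 * 1024


def pages_per_segment(segment_size: int = DEFAULT_SEGMENT_SIZE) -> int:
    """Number of whole pages in a segment of the given size."""
    return segment_size // PAGE_SIZE


def remaining_in_page(segment_offset: int) -> int:
    """Bytes left in the page that contains segment_offset."""
    return PAGE_SIZE - (segment_offset % PAGE_SIZE)
```

A record larger than `remaining_in_page(offset)` at the current write offset is the case in which it has to be split into sub-records.
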
The notable deviation is that a record fragment is encoded as:

```
┌───────────┬──────────┬────────────┬──────────────┐
│ type <1b> │ len <2b> │ CRC32 <4b> │ data <bytes> │
└───────────┴──────────┴────────────┴──────────────┘
```

The type flag has the following states:

* `0`: rest of page will be empty
* `1`: a full record encoded in a single fragment
* `2`: first fragment of a record
* `3`: middle fragment of a record
* `4`: final fragment of a record

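The fragmenting rules above can be sketched in Python as follows. This is a minimal illustration, not the Prometheus implementation: `split_record` and its constants are hypothetical names, big-endian header fields are assumed, and `zlib.crc32` (IEEE polynomial) stands in for whichever CRC32 variant the WAL actually uses, which this document does not specify.

```python
import struct
import zlib

PAGE_SIZE = 32 * 1024    # assumption: 32KB means 32 * 1024 bytes
HEADER_SIZE = 1 + 2 + 4  # type <1b> + len <2b> + CRC32 <4b>

# Type flags from the list above (0 marks an empty rest of page).
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4


def split_record(record: bytes, page_offset: int = 0) -> list:
    """Split one record into encoded fragments.

    page_offset is the write position inside the current page.
    Each later fragment starts at the beginning of a fresh page.
    """
    fragments = []
    pos = 0
    first = True
    while True:
        space = PAGE_SIZE - page_offset - HEADER_SIZE
        if space <= 0:
            # The caller is expected to zero-fill the page (type 0) first.
            raise ValueError("no room for a fragment in the current page")
        chunk = record[pos:pos + space]
        pos += len(chunk)
        done = pos >= len(record)
        if first and done:
            typ = FULL
        elif first:
            typ = FIRST
        elif done:
            typ = LAST
        else:
            typ = MIDDLE
        # Assumed layout: big-endian type, length and checksum, then data.
        header = struct.pack(">BHI", typ, len(chunk), zlib.crc32(chunk))
        fragments.append(header + chunk)
        if done:
            return fragments
        first = False
        page_offset = 0
```

For example, `split_record(b"x" * 40000)` yields a first fragment that fills a whole page and a final fragment with the rest, while a 3-byte record yields a single full fragment.
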
## Record encoding

The records written to the write ahead log are encoded as follows:

### Series records

Series records encode the labels that identify a series, together with its unique ID.

```
┌────────────────────────────────────────────┐
│ type = 1 <1b>                              │
├────────────────────────────────────────────┤
│ ┌─────────┬──────────────────────────────┐ │
│ │ id <8b> │ n = len(labels) <uvarint>    │ │
│ ├─────────┴────────────┬─────────────────┤ │
│ │ len(str_1) <uvarint> │ str_1 <bytes>   │ │
│ ├──────────────────────┴─────────────────┤ │
│ │ ...                                    │ │
│ ├───────────────────────┬────────────────┤ │
│ │ len(str_2n) <uvarint> │ str_2n <bytes> │ │
│ └───────────────────────┴────────────────┘ │
│                  . . .                     │
└────────────────────────────────────────────┘
```

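A rough Python sketch of the series layout in the diagram is shown below. The names `put_uvarint` and `encode_series` are hypothetical, the 8-byte ID is assumed to be big-endian, and `<uvarint>` is taken to be the usual 7-bits-per-byte unsigned varint (as produced by Go's `binary.PutUvarint`). Labels are written as name and value strings in turn, which gives the `2n` strings shown.

```python
import struct

SERIES_TYPE = 1  # "type = 1" in the diagram


def put_uvarint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("uvarint cannot encode negative numbers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # set the continuation bit
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_series(series) -> bytes:
    """series: iterable of (id, [(label_name, label_value), ...])."""
    buf = bytearray([SERIES_TYPE])
    for series_id, labels in series:
        buf += struct.pack(">Q", series_id)  # id <8b>, big-endian assumed
        buf += put_uvarint(len(labels))      # n = len(labels)
        for name, value in labels:           # str_1 .. str_2n
            for s in (name, value):
                raw = s.encode("utf-8")
                buf += put_uvarint(len(raw)) + raw
    return bytes(buf)
```

Encoding a single series with ID 1 and the label `job="api"` produces the type byte, eight ID bytes, the label count `1`, and then the two length-prefixed strings.
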
### Sample records

Sample records encode samples as a list of triples `(series_id, timestamp, value)`.
Series reference and timestamp are encoded as deltas relative to the first sample.
The first row stores the starting id and the starting timestamp.
The first sample record begins at the second row.

```
┌──────────────────────────────────────────────────────────────────┐
│ type = 2 <1b>                                                    │
├──────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐               │
│ │ id <8b>            │ timestamp <8b>            │               │
│ └────────────────────┴───────────────────────────┘               │
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b>  │ │
│ └────────────────────┴───────────────────────────┴─────────────┘ │
│                             . . .                                │
└──────────────────────────────────────────────────────────────────┘
```

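The delta scheme can be illustrated with the following Python sketch. It follows the diagram literally: deltas are written as unsigned varints, so it assumes samples are ordered such that no delta is negative. The function names are hypothetical, and big-endian fixed-width fields and an IEEE 754 double for `value` are assumptions not stated in this document.

```python
import struct

SAMPLES_TYPE = 2  # "type = 2" in the diagram


def put_uvarint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("uvarint cannot encode negative numbers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_samples(samples) -> bytes:
    """samples: non-empty list of (series_id, timestamp, value)."""
    first_id, first_ts, _ = samples[0]
    buf = bytearray([SAMPLES_TYPE])
    # First row: starting id and starting timestamp, 8 bytes each.
    buf += struct.pack(">Qq", first_id, first_ts)
    # Every sample, including the first, is stored relative to the first row.
    for series_id, ts, value in samples:
        buf += put_uvarint(series_id - first_id)
        buf += put_uvarint(ts - first_ts)
        buf += struct.pack(">d", value)  # value <8b>, float64 assumed
    return bytes(buf)
```

With samples `(10, 1000, 1.5)` and `(12, 1015, 2.0)`, the first sample row carries deltas `0, 0` and the second carries `2, 15`, each followed by its 8-byte value.
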
### Tombstone records

Tombstone records encode tombstones as a list of triples `(series_id, min_time, max_time)`.
Each triple specifies a time interval for which the samples of a series were deleted.

```
┌─────────────────────────────────────────────────────┐
│ type = 3 <1b>                                       │
├─────────────────────────────────────────────────────┤
│ ┌─────────┬───────────────────┬───────────────────┐ │
│ │ id <8b> │ min_time <varint> │ max_time <varint> │ │
│ └─────────┴───────────────────┴───────────────────┘ │
│                        . . .                        │
└─────────────────────────────────────────────────────┘
```

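Unlike the other records, tombstones use signed `<varint>` fields. The sketch below assumes this means the zigzag encoding used by Go's `binary.PutVarint`, and that the 8-byte ID is big-endian; `put_varint` and `encode_tombstones` are illustrative names.

```python
import struct

TOMBSTONES_TYPE = 3  # "type = 3" in the diagram


def put_varint(x: int) -> bytes:
    """Signed varint: zigzag-map to unsigned, then 7 bits per byte."""
    ux = (x << 1) ^ (x >> 63)  # zigzag, valid for the signed 64-bit range
    out = bytearray()
    while ux >= 0x80:
        out.append((ux & 0x7F) | 0x80)
        ux >>= 7
    out.append(ux)
    return bytes(out)


def encode_tombstones(stones) -> bytes:
    """stones: iterable of (series_id, min_time, max_time)."""
    buf = bytearray([TOMBSTONES_TYPE])
    for series_id, min_time, max_time in stones:
        buf += struct.pack(">Q", series_id)  # id <8b>, big-endian assumed
        buf += put_varint(min_time)
        buf += put_varint(max_time)
    return bytes(buf)
```

Zigzag keeps small negative numbers short: `-1` encodes to the single byte `0x01`, and `1` to `0x02`.
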
### Exemplar records

Exemplar records encode exemplars as a list of triples `(series_id, timestamp, value)`
plus the length of the labels list, and all the labels.
The first row stores the starting id and the starting timestamp.
Series reference and timestamp are encoded as deltas relative to the first exemplar.
The first exemplar record begins at the second row.

See: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exemplars

```
┌──────────────────────────────────────────────────────────────────┐
│ type = 5 <1b>                                                    │
├──────────────────────────────────────────────────────────────────┤
│ ┌────────────────────┬───────────────────────────┐               │
│ │ id <8b>            │ timestamp <8b>            │               │
│ └────────────────────┴───────────────────────────┘               │
│ ┌────────────────────┬───────────────────────────┬─────────────┐ │
│ │ id_delta <uvarint> │ timestamp_delta <uvarint> │ value <8b>  │ │
│ ├────────────────────┴───────────────────────────┴─────────────┤ │
│ │ n = len(labels) <uvarint>                                    │ │
│ ├──────────────────────┬───────────────────────────────────────┤ │
│ │ len(str_1) <uvarint> │ str_1 <bytes>                         │ │
│ ├──────────────────────┴───────────────────────────────────────┤ │
│ │ ...                                                          │ │
│ ├───────────────────────┬──────────────────────────────────────┤ │
│ │ len(str_2n) <uvarint> │ str_2n <bytes>                       │ │
│ └───────────────────────┴──────────────────────────────────────┘ │
│                             . . .                                │
└──────────────────────────────────────────────────────────────────┘
```
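
An exemplar row combines the delta scheme of sample records with the label encoding of series records. The Python sketch below is one reading of the diagram, not the actual encoder: the type byte is copied from the diagram, the function names are hypothetical, and big-endian fixed-width fields, a float64 value, and non-negative deltas are assumptions.

```python
import struct

EXEMPLARS_TYPE = 5  # type byte as shown in the diagram


def put_uvarint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("uvarint cannot encode negative numbers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_exemplars(exemplars) -> bytes:
    """exemplars: non-empty list of (series_id, timestamp, value, labels),
    where labels is a list of (name, value) string pairs."""
    first_id, first_ts = exemplars[0][0], exemplars[0][1]
    buf = bytearray([EXEMPLARS_TYPE])
    buf += struct.pack(">Qq", first_id, first_ts)  # first row
    for series_id, ts, value, labels in exemplars:
        buf += put_uvarint(series_id - first_id)   # id_delta
        buf += put_uvarint(ts - first_ts)          # timestamp_delta
        buf += struct.pack(">d", value)            # value <8b>
        buf += put_uvarint(len(labels))            # n = len(labels)
        for name, val in labels:                   # str_1 .. str_2n
            for s in (name, val):
                raw = s.encode("utf-8")
                buf += put_uvarint(len(raw)) + raw
    return bytes(buf)
```

An exemplar with a single `trace_id` label therefore occupies its two delta bytes, the 8-byte value, a label count of `1`, and two length-prefixed strings.
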