# Chunks Disk Format
The following describes the format of a chunks file,
which is created in the `chunks/` directory of a block.
The maximum size per segment file is 512MiB.
Chunks in the files are referenced from the index by a uint64 composed of the
in-file offset (lower 4 bytes) and the segment sequence number (upper 4 bytes).
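
As a minimal sketch of the reference layout described above, the following Go snippet packs and unpacks such a uint64. The function names `packChunkRef` and `unpackChunkRef` are made up for this illustration and are not taken from the Prometheus code base.

```go
package main

import "fmt"

// packChunkRef combines a segment sequence number (upper 4 bytes) and an
// in-file offset (lower 4 bytes) into one uint64 reference.
// Hypothetical helper, named for this example only.
func packChunkRef(segment, offset uint32) uint64 {
	return uint64(segment)<<32 | uint64(offset)
}

// unpackChunkRef splits a reference back into its two halves.
func unpackChunkRef(ref uint64) (segment, offset uint32) {
	return uint32(ref >> 32), uint32(ref)
}

func main() {
	ref := packChunkRef(3, 8)
	seg, off := unpackChunkRef(ref)
	fmt.Printf("ref=%#x segment=%d offset=%d\n", ref, seg, off)
}
```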
```
┌──────────────────────────────┐
│  magic(0x85BD40DD) <4 byte>  │
├──────────────────────────────┤
│    version(1) <1 byte>       │
├──────────────────────────────┤
│    padding(0) <3 byte>       │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │         Chunk 1          │ │
│ ├──────────────────────────┤ │
│ │          ...             │ │
│ ├──────────────────────────┤ │
│ │         Chunk N          │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘
```
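
The 8-byte file header shown above could be validated roughly as follows. This is a sketch under assumptions: the magic number is read in big-endian byte order (the layout above does not state the byte order), the padding bytes are not checked, and `checkSegmentHeader` is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	chunksMagic   = 0x85BD40DD
	chunksVersion = 1
	headerSize    = 8 // 4 bytes magic + 1 byte version + 3 bytes padding
)

// checkSegmentHeader validates the start of a chunks segment file.
// Assumption: the magic number is stored big-endian.
func checkSegmentHeader(b []byte) error {
	if len(b) < headerSize {
		return errors.New("segment too short for header")
	}
	if m := binary.BigEndian.Uint32(b[:4]); m != chunksMagic {
		return fmt.Errorf("unexpected magic number %#x", m)
	}
	if v := b[4]; v != chunksVersion {
		return fmt.Errorf("unsupported version %d", v)
	}
	// The 3 padding bytes are skipped without inspection in this sketch.
	return nil
}

func main() {
	header := []byte{0x85, 0xBD, 0x40, 0xDD, 0x01, 0x00, 0x00, 0x00}
	fmt.Println(checkSegmentHeader(header))
}
```

Chunk 1 would then start at byte offset 8 of the file.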
# Chunk
```
┌───────────────┬───────────────────┬─────────────┬────────────────┐
│ len <uvarint> │ encoding <1 byte> │ data <data> │ CRC32 <4 byte> │
└───────────────┴───────────────────┴─────────────┴────────────────┘
```
Notes:
* `<uvarint>` has 1 to 10 bytes.
* `encoding`: Currently either `XOR`, `histogram`, or `floathistogram`, see
  [code for numerical values](https://github.com/prometheus/prometheus/blob/02d0de9987ad99dee5de21853715954fadb3239f/tsdb/chunkenc/chunk.go#L28-L47).
* `data`: See below for each encoding.
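
To illustrate the chunk framing, here is a hedged Go sketch that writes and reads one chunk. Several details are assumptions not stated in this document: `len` counts only the `data` bytes, the CRC32 uses the Castagnoli polynomial, it covers the encoding byte plus the data, and it is stored big-endian. The names `appendChunk`, `readChunk`, and `rawChunk` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// Assumption: the checksum uses the Castagnoli polynomial.
var castagnoliTable = crc32.MakeTable(crc32.Castagnoli)

// rawChunk is a hypothetical holder for one decoded chunk frame.
type rawChunk struct {
	Encoding byte
	Data     []byte
}

// appendChunk frames data as: len <uvarint>, encoding, data, CRC32.
func appendChunk(dst []byte, enc byte, data []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(data)))
	start := len(dst)
	dst = append(dst, enc)
	dst = append(dst, data...)
	// Assumption: the checksum covers the encoding byte and the data.
	sum := crc32.Checksum(dst[start:], castagnoliTable)
	return binary.BigEndian.AppendUint32(dst, sum)
}

// readChunk decodes the chunk at the start of b and returns it together
// with the number of bytes consumed.
func readChunk(b []byte) (rawChunk, int, error) {
	dataLen, n := binary.Uvarint(b)
	if n <= 0 {
		return rawChunk{}, 0, errors.New("invalid uvarint length")
	}
	rest := uint64(len(b) - n)
	// Besides the data, 1 encoding byte and 4 checksum bytes are needed.
	if dataLen > rest || rest-dataLen < 5 {
		return rawChunk{}, 0, errors.New("truncated chunk")
	}
	end := n + 1 + int(dataLen)
	want := binary.BigEndian.Uint32(b[end : end+4])
	if got := crc32.Checksum(b[n:end], castagnoliTable); got != want {
		return rawChunk{}, 0, fmt.Errorf("checksum mismatch: got %#x, want %#x", got, want)
	}
	return rawChunk{Encoding: b[n], Data: b[n+1 : end]}, end + 4, nil
}

func main() {
	// The encoding value 1 is only an example here.
	buf := appendChunk(nil, 1, []byte{0xAA, 0xBB})
	c, n, err := readChunk(buf)
	fmt.Println(c.Encoding, len(c.Data), n, err)
}
```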
## XOR chunk data
```
┌──────────────────────┬───────────────┬───────────────┬──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┬─────┬──────────────────────┬──────────────────────┬──────────────────┐
│ num_samples <uint16> │ ts_0 <varint> │ v_0 <float64> │ ts_1_delta <uvarint> │ v_1_xor <varbit_xor> │ ts_2_dod <varbit_ts> │ v_2_xor <varbit_xor> │ ... │ ts_n_dod <varbit_ts> │ v_n_xor <varbit_xor> │ padding <x bits> │
└──────────────────────┴───────────────┴───────────────┴──────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴─────┴──────────────────────┴──────────────────────┴──────────────────┘
```
### Notes:
* `ts` is the timestamp, `v` is the value.
* `...` means to repeat the previous two fields as needed, with `n` starting at 2 and going up to `num_samples` – 1.
* `<uint16>` has 2 bytes in big-endian order.
* `<varint>` and `<uvarint>` have 1 to 10 bytes each.
* `ts_1_delta` is `ts_1` – `ts_0`.
* `ts_n_dod` is the “delta of deltas” of timestamps, i.e. (`ts_n` – `ts_n-1`) – (`ts_n-1` – `ts_n-2`).
* `v_n_xor` is the result of `v_n` XOR `v_n-1`.
* `<varbit_xor>` is a specific variable bitwidth encoding of the result of XORing the current and the previous value. It has between 1 bit and 77 bits.
  See [code for details](https://github.com/prometheus/prometheus/blob/7309c20e7e5774e7838f183ec97c65baa4362edc/tsdb/chunkenc/xor.go#L220-L253).
* `<varbit_ts>` is a specific variable bitwidth encoding for the “delta of deltas” of timestamps (signed integers that are ideally small).
  It has between 1 and 68 bits.
  See [code for details](https://github.com/prometheus/prometheus/blob/7309c20e7e5774e7838f183ec97c65baa4362edc/tsdb/chunkenc/xor.go#L179-L205).
* `padding` of 0 to 7 bits so that the whole chunk data is byte-aligned.
* The chunk can have as few as one sample, i.e. `ts_1`, `v_1`, etc. are optional.
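
The following Go sketch computes the logical field values from the notes above (`ts_0`, `ts_1_delta`, `ts_n_dod`, and the XORed value bits) for a short series. It deliberately stops before the `<varbit_ts>` and `<varbit_xor>` bit packing. The types `sample` and `xorField` and the function `deriveXORFields` are hypothetical names for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// sample is a hypothetical (timestamp, value) pair.
type sample struct {
	T int64
	V float64
}

// xorField holds the logical values before any varbit encoding.
type xorField struct {
	TS    int64  // ts_0, ts_1_delta, or ts_n_dod
	VBits uint64 // bits of v_0, or bits of v_n XOR v_n-1
}

// deriveXORFields turns raw samples into the values an XOR chunk stores.
func deriveXORFields(samples []sample) []xorField {
	out := make([]xorField, len(samples))
	var prevDelta int64
	for i, s := range samples {
		if i == 0 {
			out[i] = xorField{TS: s.T, VBits: math.Float64bits(s.V)}
			continue
		}
		delta := s.T - samples[i-1].T
		ts := delta // sample 1 stores the plain delta
		if i >= 2 {
			ts = delta - prevDelta // later samples store the delta of deltas
		}
		xor := math.Float64bits(s.V) ^ math.Float64bits(samples[i-1].V)
		out[i] = xorField{TS: ts, VBits: xor}
		prevDelta = delta
	}
	return out
}

func main() {
	fields := deriveXORFields([]sample{
		{1000, 1.5}, {1015, 1.5}, {1030, 2.5}, {1046, 2.5},
	})
	for i, f := range fields {
		fmt.Printf("sample %d: ts field=%d value bits=%#x\n", i, f.TS, f.VBits)
	}
}
```

With a regular scrape interval and a rarely changing value, most `ts_n_dod` and `v_n_xor` fields are zero, which is the case the variable bitwidth encodings are designed to store compactly.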
## Histogram chunk data
```
┌──────────────────────┬──────────────────────────┬───────────────────────────────┬─────────────────────┬──────────────────┬──────────────────┬──────────────────────┬────────────────┬──────────────────┐
│ num_samples <uint16> │ histogram_flags <1 byte> │ zero_threshold <1 or 9 bytes> │ schema <varbit_int> │ pos_spans <data> │ neg_spans <data> │ custom_values <data> │ samples <data> │ padding <x bits> │
└──────────────────────┴──────────────────────────┴───────────────────────────────┴─────────────────────┴──────────────────┴──────────────────┴──────────────────────┴────────────────┴──────────────────┘
```
### Positive and negative spans data:
```
┌─────────────────────────┬────────────────────────┬───────────────────────┬────────────────────────┬───────────────────────┬─────┬────────────────────────┬───────────────────────┐
│ num_spans <varbit_uint> │ length_0 <varbit_uint> │ offset_0 <varbit_int> │ length_1 <varbit_uint> │ offset_1 <varbit_int> │ ... │ length_n <varbit_uint> │ offset_n <varbit_int> │
└─────────────────────────┴────────────────────────┴───────────────────────┴────────────────────────┴───────────────────────┴─────┴────────────────────────┴───────────────────────┘
```
### Custom values data:
The `custom_values` data is currently only used for schema -53 (custom bucket boundaries). For other schemas, it is empty (length of zero).
```
┌──────────────────────────┬──────────────────┬──────────────────┬─────┬──────────────────┐
│ num_values <varbit_uint> │ value_0 <custom> │ value_1 <custom> │ ... │ value_n <custom> │
└──────────────────────────┴──────────────────┴──────────────────┴─────┴──────────────────┘
```
### Samples data:
```
┌──────────────────────────┐
│     sample_0 <data>      │
├──────────────────────────┤
│     sample_1 <data>      │
├──────────────────────────┤
│     sample_2 <data>      │
├──────────────────────────┤
│           ...            │
├──────────────────────────┤
│     sample_n <data>      │
└──────────────────────────┘
```
#### Sample 0 data:
```
┌─────────────────┬─────────────────────┬──────────────────────────┬───────────────┬───────────────────────────┬─────┬───────────────────────────┬───────────────────────────┬─────┬───────────────────────────┐
│ ts <varbit_int> │ count <varbit_uint> │ zero_count <varbit_uint> │ sum <float64> │ pos_bucket_0 <varbit_int> │ ... │ pos_bucket_n <varbit_int> │ neg_bucket_0 <varbit_int> │ ... │ neg_bucket_n <varbit_int> │
└─────────────────┴─────────────────────┴──────────────────────────┴───────────────┴───────────────────────────┴─────┴───────────────────────────┴───────────────────────────┴─────┴───────────────────────────┘
```
#### Sample 1 data:
```
┌───────────────────────┬──────────────────────────┬───────────────────────────────┬──────────────────────┬─────────────────────────────────┬─────┬─────────────────────────────────┬─────────────────────────────────┬─────┬─────────────────────────────────┐
│ ts_delta <varbit_int> │ count_delta <varbit_int> │ zero_count_delta <varbit_int> │ sum_xor <varbit_xor> │ pos_bucket_0_delta <varbit_int> │ ... │ pos_bucket_n_delta <varbit_int> │ neg_bucket_0_delta <varbit_int> │ ... │ neg_bucket_n_delta <varbit_int> │
└───────────────────────┴──────────────────────────┴───────────────────────────────┴──────────────────────┴─────────────────────────────────┴─────┴─────────────────────────────────┴─────────────────────────────────┴─────┴─────────────────────────────────┘
```
#### Sample 2 data and following:
```
┌─────────────────────┬────────────────────────┬─────────────────────────────┬──────────────────────┬───────────────────────────────┬─────┬───────────────────────────────┬───────────────────────────────┬─────┬───────────────────────────────┐
│ ts_dod <varbit_int> │ count_dod <varbit_int> │ zero_count_dod <varbit_int> │ sum_xor <varbit_xor> │ pos_bucket_0_dod <varbit_int> │ ... │ pos_bucket_n_dod <varbit_int> │ neg_bucket_0_dod <varbit_int> │ ... │ neg_bucket_n_dod <varbit_int> │
└─────────────────────┴────────────────────────┴─────────────────────────────┴──────────────────────┴───────────────────────────────┴─────┴───────────────────────────────┴───────────────────────────────┴─────┴───────────────────────────────┘
```
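
The three sample layouts above store an integer field (such as `ts`, `count`, `zero_count`, or a single bucket) as an absolute value in sample 0, as a delta in sample 1, and as a delta of deltas from sample 2 onwards. As a rough sketch, assuming the stored values have already been decoded from their varbit form, a reader could rebuild the original series like this. The function name `restoreFromDoD` is hypothetical.

```go
package main

import "fmt"

// restoreFromDoD rebuilds absolute values from what the samples store for
// one integer field: stored[0] is absolute, stored[1] is a delta, and
// stored[2:] are deltas of deltas.
func restoreFromDoD(stored []int64) []int64 {
	out := make([]int64, len(stored))
	var delta int64
	for i, s := range stored {
		switch i {
		case 0:
			out[i] = s
		case 1:
			delta = s
			out[i] = out[i-1] + delta
		default:
			delta += s // the delta of deltas adjusts the running delta
			out[i] = out[i-1] + delta
		}
	}
	return out
}

func main() {
	// A count that grows by 10, 10, 12, and then 10 again.
	fmt.Println(restoreFromDoD([]int64{100, 10, 0, 2, -2}))
}
```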
### Notes:
* `histogram_flags` is a byte of which currently only the first two bits are used:
  * `10`: Counter reset between the previous chunk and this one.
  * `01`: No counter reset between the previous chunk and this one.
  * `00`: Counter reset status unknown.
  * `11`: Chunk is part of a gauge histogram, no counter resets are happening.
* `zero_threshold` has a specific encoding:
  * If 0, it is a single zero byte.
  * If a power of two between 2^-243 and 2^10, it is a single byte between 1 and 254.
  * Otherwise, it is a byte with all bits set (255), followed by a float64, resulting in 9 bytes length.
* `schema` is a specific value defined by the exposition format. Currently
  valid values are either -4 <= n <= 8 (standard exponential schemas) or -53
  (custom bucket boundaries).
* `<varbit_int>` is a variable bitwidth encoding for signed integers, optimized for “delta of deltas” of bucket deltas. It has between 1 bit and 9 bytes.
  See [code for details](https://github.com/prometheus/prometheus/blob/8c1507ebaa4ca552958ffb60c2d1b21afb7150e4/tsdb/chunkenc/varbit.go#L31-L60).
* `<varbit_uint>` is a variable bitwidth encoding for unsigned integers with the same bit-bucketing as `<varbit_int>`.
  See [code for details](https://github.com/prometheus/prometheus/blob/8c1507ebaa4ca552958ffb60c2d1b21afb7150e4/tsdb/chunkenc/varbit.go#L136-L165).
* `<varbit_xor>` is a specific variable bitwidth encoding of the result of XORing the current and the previous value. It has between 1 bit and 77 bits.
  See [code for details](https://github.com/prometheus/prometheus/blob/8c1507ebaa4ca552958ffb60c2d1b21afb7150e4/tsdb/chunkenc/histogram.go#L538-L574).
* `padding` of 0 to 7 bits so that the whole chunk data is byte-aligned.
* Note that buckets are inherently deltas between the current bucket and the previous bucket. Only `bucket_0` is an absolute count.
* The chunk can have as few as one sample, i.e. sample 1 and following are optional.
* Similarly, there could be down to zero spans and down to zero buckets.
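
The `zero_threshold` rules in the notes above can be sketched in Go as follows. This is an illustration under assumptions, not the reference implementation: it assumes the single-byte values 1 to 254 map in increasing order to the exponents -243 to 10, and that the float64 in the 9-byte form is written big-endian. The function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// encodeZeroThreshold returns the 1-byte or 9-byte form of a threshold.
func encodeZeroThreshold(t float64) []byte {
	if t == 0 {
		return []byte{0}
	}
	// Frexp normalizes to frac in [0.5, 1), so 2^k yields frac == 0.5 and
	// exp == k+1. Exponents -243..10 therefore appear as exp -242..11.
	frac, exp := math.Frexp(t)
	if frac == 0.5 && exp >= -242 && exp <= 11 {
		return []byte{byte(exp + 243)} // assumed mapping: 2^-243 -> 1, 2^10 -> 254
	}
	out := make([]byte, 9)
	out[0] = 255
	binary.BigEndian.PutUint64(out[1:], math.Float64bits(t))
	return out
}

// decodeZeroThreshold returns the threshold and the number of bytes read.
func decodeZeroThreshold(b []byte) (float64, int, error) {
	if len(b) == 0 {
		return 0, 0, errors.New("empty input")
	}
	switch b[0] {
	case 0:
		return 0, 1, nil
	case 255:
		if len(b) < 9 {
			return 0, 0, errors.New("truncated float64")
		}
		return math.Float64frombits(binary.BigEndian.Uint64(b[1:9])), 9, nil
	default:
		return math.Ldexp(0.5, int(b[0])-243), 1, nil
	}
}

func main() {
	for _, t := range []float64{0, 1, 1024, 0.001} {
		enc := encodeZeroThreshold(t)
		dec, n, err := decodeZeroThreshold(enc)
		fmt.Println(t, len(enc), dec, n, err)
	}
}
```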

The `<custom>` encoding within the custom values data depends on the schema.
For schema -53 (custom bucket boundaries, currently the only use case for
custom values), the values to encode are bucket boundaries in the form of
floats. The encoding of a given float value _x_ works as follows:

1. Create an intermediate value _y_ = _x_ * 1000.
2. If 0 ≤ _y_ ≤ 33554430 _and_ if the decimal value of _y_ is integer, store
   _y_ + 1 as `<varbit_uint>`.
3. Otherwise, store a 0 bit, followed by the 64 bits of the original _x_
   encoded as plain `<float64>`.

Note that values stored as per (2) will always start with a 1 bit, which allows
decoders to recognize this case in contrast to values stored as per (3), which
always start with a 0 bit.

The rationale behind this encoding is that most custom bucket boundaries are set
by humans as decimal numbers with not very many decimal places. In most cases,
the encoding will therefore result in a short varbit representation. The upper
bound of 33554430 is picked so that the varbit encoded value will take at most
4 bytes.
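
The decision between the compact and the full form can be sketched in Go as below. This only models the choice and the integer that would be handed to the `<varbit_uint>` encoder. It does not write any bits, and the integer check via `math.Trunc` is an assumption about how "the decimal value of _y_ is integer" might be tested. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

const maxCompactCustomValue = 33554430

// compactCustomValue reports whether x qualifies for the compact form and,
// if so, returns the integer y+1 that would be stored as <varbit_uint>.
// When ok is false, x would be stored as a 0 bit plus a plain float64.
func compactCustomValue(x float64) (stored uint64, ok bool) {
	y := x * 1000
	if y >= 0 && y <= maxCompactCustomValue && y == math.Trunc(y) {
		return uint64(y) + 1, true
	}
	return 0, false
}

// restoreCustomValue inverts the compact form.
func restoreCustomValue(stored uint64) float64 {
	return float64(stored-1) / 1000
}

func main() {
	for _, x := range []float64{0, 0.25, 10, 0.0001, -1, 1e9} {
		stored, ok := compactCustomValue(x)
		fmt.Println(x, stored, ok)
	}
}
```

Adding 1 before storing means the compact form never encodes the integer zero, which fits the statement above that such values always start with a 1 bit.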
## Float histogram chunk data
Float histograms have the same layout as histograms apart from the encoding of samples.
### Samples data:
```
┌──────────────────────────┐
│     sample_0 <data>      │
├──────────────────────────┤
│     sample_1 <data>      │
├──────────────────────────┤
│     sample_2 <data>      │
├──────────────────────────┤
│           ...            │
├──────────────────────────┤
│     sample_n <data>      │
└──────────────────────────┘
```
#### Sample 0 data:
```
┌─────────────────┬─────────────────┬──────────────────────┬───────────────┬────────────────────────┬─────┬────────────────────────┬────────────────────────┬─────┬────────────────────────┐
│ ts <varbit_int> │ count <float64> │ zero_count <float64> │ sum <float64> │ pos_bucket_0 <float64> │ ... │ pos_bucket_n <float64> │ neg_bucket_0 <float64> │ ... │ neg_bucket_n <float64> │
└─────────────────┴─────────────────┴──────────────────────┴───────────────┴────────────────────────┴─────┴────────────────────────┴────────────────────────┴─────┴────────────────────────┘
```
#### Sample 1 data:
```
┌───────────────────────┬────────────────────────┬─────────────────────────────┬──────────────────────┬───────────────────────────────┬─────┬───────────────────────────────┬───────────────────────────────┬─────┬───────────────────────────────┐
│ ts_delta <varbit_int> │ count_xor <varbit_xor> │ zero_count_xor <varbit_xor> │ sum_xor <varbit_xor> │ pos_bucket_0_xor <varbit_xor> │ ... │ pos_bucket_n_xor <varbit_xor> │ neg_bucket_0_xor <varbit_xor> │ ... │ neg_bucket_n_xor <varbit_xor> │
└───────────────────────┴────────────────────────┴─────────────────────────────┴──────────────────────┴───────────────────────────────┴─────┴───────────────────────────────┴───────────────────────────────┴─────┴───────────────────────────────┘
```
#### Sample 2 data and following:
```
┌─────────────────────┬────────────────────────┬─────────────────────────────┬──────────────────────┬───────────────────────────────┬─────┬───────────────────────────────┬───────────────────────────────┬─────┬───────────────────────────────┐
│ ts_dod <varbit_int> │ count_xor <varbit_xor> │ zero_count_xor <varbit_xor> │ sum_xor <varbit_xor> │ pos_bucket_0_xor <varbit_xor> │ ... │ pos_bucket_n_xor <varbit_xor> │ neg_bucket_0_xor <varbit_xor> │ ... │ neg_bucket_n_xor <varbit_xor> │
└─────────────────────┴────────────────────────┴─────────────────────────────┴──────────────────────┴───────────────────────────────┴─────┴───────────────────────────────┴───────────────────────────────┴─────┴───────────────────────────────┘
```
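
In float histogram samples after the first one, every float field (`count`, `zero_count`, `sum`, and each bucket) is stored as an XOR against the same field of the previous sample. The sketch below shows that relationship on the raw float64 bits, leaving out the `<varbit_xor>` bit packing. It assumes the fields of one sample are flattened into a slice in layout order, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// xorAgainstPrev returns, per field, the bits of cur XOR the bits of prev.
// Both slices must have the same length.
func xorAgainstPrev(prev, cur []float64) []uint64 {
	out := make([]uint64, len(cur))
	for i := range cur {
		out[i] = math.Float64bits(cur[i]) ^ math.Float64bits(prev[i])
	}
	return out
}

// restoreFromXOR recovers the current fields from the previous sample and
// the stored XOR values.
func restoreFromXOR(prev []float64, xors []uint64) []float64 {
	out := make([]float64, len(xors))
	for i, x := range xors {
		out[i] = math.Float64frombits(math.Float64bits(prev[i]) ^ x)
	}
	return out
}

func main() {
	// Fields in layout order: count, zero_count, sum, one positive bucket.
	prev := []float64{10, 0, 42.5, 10}
	cur := []float64{12, 0, 47.5, 12}
	xors := xorAgainstPrev(prev, cur)
	fmt.Println(xors)
	fmt.Println(restoreFromXOR(prev, xors))
}
```

An unchanged field, like `zero_count` in this example, produces an XOR of zero.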