Add basic initial developer docs for TSDB (#9451)

* Add basic initial developer docs for TSDB

There's a decent amount of content already out there (blog posts,
conference talks, etc), but:
* when they get stale, they don't tend to get updated
* they still leave me with questions that I'd like to answer
  for developers (like me) who want to use, or work with, TSDB

What I propose is developer docs inside the Prometheus
repository: easy to find, and able to harness the power of the
community to expand them and keep them up to date.

* Perfect is the enemy of good. Let's have a base and improve it incrementally.
* Markdown docs should be broad but not too deep.  Source code comments
  can complement them, and are the ideal place for implementation details.

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* use example code that works out of the box

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Apply suggestions from code review

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* PR feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* more docs

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* PR feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Apply suggestions from code review

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>

* Apply suggestions from code review

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>

* feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Update tsdb/docs/usage.md

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>

* final tweaks

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* workaround docs versioning issue

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Move example code to real executable, testable example.

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* cleanup example test and make sure it always reproduces

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* obtain temp dir in a way that works with older Go versions

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* Fix Ganesh's comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Dieter Plaetinck 2021-11-17 05:21:27 -05:00 committed by GitHub
parent 49d8f02c1f
commit 0fac9bb859
13 changed files with 261 additions and 42 deletions


@ -111,7 +111,7 @@ For more information on building, running, and developing on the new React-based
## More information
* The source code is periodically indexed: [Prometheus Core](https://pkg.go.dev/github.com/prometheus/prometheus).
* The source code is periodically indexed, but due to an issue with versioning, the "latest" docs shown on Godoc are outdated. Instead, you can use [the docs for v2.31.1](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab).
* You will find a CircleCI configuration in [`.circleci/config.yml`](.circleci/config.yml).
* See the [Community page](https://prometheus.io/community) for how to reach the Prometheus developers and users on various communication channels.


@ -131,7 +131,7 @@ the Prometheus server will be able to see them.
### The SD interface
A Service Discovery (SD) mechanism has to discover targets and provide them to Prometheus. We expect similar targets to be grouped together, in the form of a [target group](https://pkg.go.dev/github.com/prometheus/prometheus/discovery/targetgroup#Group). The SD mechanism sends the targets down to prometheus as list of target groups.
A Service Discovery (SD) mechanism has to discover targets and provide them to Prometheus. We expect similar targets to be grouped together, in the form of a [target group](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/discovery/targetgroup#Group). The SD mechanism sends the targets down to Prometheus as a list of target groups.
An SD mechanism has to implement the `Discoverer` Interface:
```go


@ -173,9 +173,9 @@ func (f QueryableFunc) Querier(ctx context.Context, mint, maxt int64) (Querier,
// Operations on the Appender interface are not goroutine-safe.
type Appender interface {
// Append adds a sample pair for the given series.
// An optional reference number can be provided to accelerate calls.
// A reference number is returned which can be used to add further
// samples in the same or later transactions.
// An optional series reference can be provided to accelerate calls.
// A series reference number is returned which can be used to add further
// samples to the given series in the same or later transactions.
// Returned reference numbers are ephemeral and may be rejected in calls
// to Append() at any point. Adding the sample via Append() returns a new
// reference number.
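To illustrate how these series references are meant to be used, here is a minimal sketch against the `storage.Appender` interface (not part of this change; the helper function, label set, and sample values are hypothetical, and error handling is reduced to returning the error):

```go
package example

import (
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/storage"
)

// appendTwoSamples appends two samples for the same series, reusing the
// series reference returned by the first Append call.
func appendTwoSamples(app storage.Appender, t0, t1 int64) error {
	lbls := labels.FromStrings("foo", "bar")

	// No series reference is known yet, so pass 0.
	ref, err := app.Append(0, lbls, t0, 1.0)
	if err != nil {
		return err
	}

	// Reuse the returned reference to skip the series lookup. If the
	// reference has become stale, Append simply returns a fresh one.
	if _, err := app.Append(ref, lbls, t1, 2.0); err != nil {
		return err
	}

	return app.Commit()
}
```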


@ -1,16 +1,25 @@
# TSDB
[![GoPkg](https://pkg.go.dev/badge/github.com/prometheus/prometheus/tsdb.svg)](https://pkg.go.dev/github.com/prometheus/prometheus/tsdb)
[![GoPkg](https://pkg.go.dev/badge/github.com/prometheus/prometheus/tsdb.svg)](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb)
This directory contains the Prometheus storage layer that is used in its 2.x releases.
This directory contains the Prometheus TSDB (Time Series DataBase) library,
which handles storage and querying of all Prometheus v2 data.
A writeup of its design can be found [here](https://fabxc.org/blog/2017-04-10-writing-a-tsdb/).
Due to an issue with versioning, the "latest" docs shown on Godoc are outdated.
Instead, you may use [the docs for v2.31.1](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab).
Based on the Gorilla TSDB [white papers](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
## Documentation
Video: [Storing 16 Bytes at Scale](https://youtu.be/b_pEevMAC3I) from [PromCon 2017](https://promcon.io/2017-munich/).
* [Data format](docs/format/README.md).
* [Usage](docs/usage.md).
* [Bstream details](docs/bstream.md).
## External resources
* A writeup of the original design can be found [here](https://fabxc.org/blog/2017-04-10-writing-a-tsdb/).
* Video: [Storing 16 Bytes at Scale](https://youtu.be/b_pEevMAC3I) from [PromCon 2017](https://promcon.io/2017-munich/).
* Compression is based on the Gorilla TSDB [white paper](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
See also the [format documentation](docs/format/README.md) and [bstream details](docs/bstream.md).
A series of blog posts explaining different components of TSDB:
* [The Head Block](https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/)


@ -109,7 +109,7 @@ type Meta struct {
MinTime, MaxTime int64
}
// Iterator iterates over the chunk of a time series.
// Iterator iterates over the chunks of a single time series.
type Iterator interface {
// At returns the current meta.
// It depends on implementation if the chunk is populated or not.


@ -51,7 +51,7 @@ const (
MintMaxtSize = 8
// SeriesRefSize is the size of series reference on disk.
SeriesRefSize = 8
// HeadChunkFileHeaderSize is the total size of the header for the head chunk file.
// HeadChunkFileHeaderSize is the total size of the header for a head chunk file.
HeadChunkFileHeaderSize = SegmentHeaderSize
// MaxHeadChunkFileSize is the max size of a head chunk file.
MaxHeadChunkFileSize = 128 * 1024 * 1024 // 128 MiB.
@ -124,19 +124,20 @@ type ChunkDiskMapper struct {
// from which chunks are served till they are flushed and are ready for m-mapping.
chunkBuffer *chunkBuffer
// If 'true', it indicated that the maxt of all the on-disk files were set
// after iterating through all the chunks in those files.
// Whether the maxt field is set for all mmapped chunk files tracked within the mmappedChunkFiles map.
// This is done after iterating through all the chunks in those files using the IterateAllChunks method.
fileMaxtSet bool
closed bool
}
// mmappedChunkFile provides mmapped access to an entire head chunk file that holds many chunks.
type mmappedChunkFile struct {
byteSlice ByteSlice
maxt int64
maxt int64 // Max timestamp among all of this file's chunks.
}
// NewChunkDiskMapper returns a new writer against the given directory
// NewChunkDiskMapper returns a new ChunkDiskMapper against the given directory
// using the default head chunk file duration.
// NOTE: 'IterateAllChunks' method needs to be called at least once after creating ChunkDiskMapper
// to set the maxt of all the files.
@ -172,6 +173,7 @@ func NewChunkDiskMapper(dir string, pool chunkenc.Pool, writeBufferSize int) (*C
return m, m.openMMapFiles()
}
// openMMapFiles opens all files within dir for mmapping.
func (cdm *ChunkDiskMapper) openMMapFiles() (returnErr error) {
cdm.mmappedChunkFiles = map[int]*mmappedChunkFile{}
cdm.closers = map[int]io.Closer{}
@ -254,7 +256,7 @@ func listChunkFiles(dir string) (map[int]string, error) {
}
// repairLastChunkFile deletes the last file if it's empty.
// Because we don't fsync when creating these file, we could end
// Because we don't fsync when creating these files, we could end
// up with an empty file at the end during an abrupt shutdown.
func repairLastChunkFile(files map[int]string) (_ map[int]string, returnErr error) {
lastFile := -1
@ -350,7 +352,7 @@ func (cdm *ChunkDiskMapper) WriteChunk(seriesRef HeadSeriesRef, mint, maxt int64
return chkRef, nil
}
// shouldCutNewFile decides the cutting of a new file based on time and size retention.
// shouldCutNewFile returns whether a new file should be cut, based on time and size retention.
// Size retention: because depending on the system architecture, there is a limit on how big of a file we can m-map.
// Time retention: so that we can delete old chunks with some time guarantee in low load environments.
func (cdm *ChunkDiskMapper) shouldCutNewFile(chunkSize int) bool {
@ -465,7 +467,7 @@ func (cdm *ChunkDiskMapper) flushBuffer() error {
// Chunk returns a chunk from a given reference.
func (cdm *ChunkDiskMapper) Chunk(ref ChunkDiskMapperRef) (chunkenc.Chunk, error) {
cdm.readPathMtx.RLock()
// We hold this read lock for the entire duration because if the Close()
// We hold this read lock for the entire duration because if Close()
// is called, the data in the byte slice will get corrupted as the mmapped
// file will be closed.
defer cdm.readPathMtx.RUnlock()
@ -575,8 +577,8 @@ func (cdm *ChunkDiskMapper) Chunk(ref ChunkDiskMapperRef) (chunkenc.Chunk, error
return chk, nil
}
// IterateAllChunks iterates on all the chunks in its byte slices in the order of the head chunk file sequence
// and runs the provided function on each chunk. It returns on the first error encountered.
// IterateAllChunks iterates all mmappedChunkFiles (in order of head chunk file name/number) and all the chunks within them
// and runs the provided function with information about each chunk. It returns on the first error encountered.
// NOTE: This method needs to be called at least once after creating ChunkDiskMapper
// to set the maxt of all the files.
func (cdm *ChunkDiskMapper) IterateAllChunks(f func(seriesRef HeadSeriesRef, chunkRef ChunkDiskMapperRef, mint, maxt int64, numSamples uint16) error) (err error) {
@ -825,7 +827,7 @@ func closeAllFromMap(cs map[int]io.Closer) error {
const inBufferShards = 128 // 128 is a randomly chosen number.
// chunkBuffer is a thread safe buffer for chunks.
// chunkBuffer is a thread safe lookup table for chunks by their ref.
type chunkBuffer struct {
inBufferChunks [inBufferShards]map[ChunkDiskMapperRef]chunkenc.Chunk
inBufferChunksMtxs [inBufferShards]sync.RWMutex


@ -596,6 +596,9 @@ func validateOpts(opts *Options, rngs []int64) (*Options, []int64) {
return opts, rngs
}
// open returns a new DB in the given directory.
// It initializes the lockfile, WAL, compactor, and Head (by replaying the WAL), and runs the database.
// It is not safe to open more than one DB in the same directory.
func open(dir string, l log.Logger, r prometheus.Registerer, opts *Options, rngs []int64, stats *DBStats) (_ *DB, returnedErr error) {
if err := os.MkdirAll(dir, 0o777); err != nil {
return nil, err

tsdb/docs/usage.md Normal file

@ -0,0 +1,71 @@
# Usage
TSDB can be - and is - used by other applications such as [Cortex](https://cortexmetrics.io/) and [Thanos](https://thanos.io/).
This directory contains documentation for any developers who wish to work on or with TSDB.
For a full example of instantiating a database, adding and querying data, see the [tsdb example in the docs](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb).
`tsdb/db_test.go` also demonstrates various specific usages of the TSDB library.
## Instantiating a database
Callers should use [`tsdb.Open`](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb#Open) to open a TSDB
(the directory may be new or pre-existing).
This returns a [`*tsdb.DB`](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb#DB) which is the actual database.
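As a minimal sketch, condensed from the example test added in this change (the `"data"` directory name is arbitrary):

```go
package main

import (
	"log"

	"github.com/prometheus/prometheus/tsdb"
)

func main() {
	// Open (or create) a TSDB in the "data" directory with default options.
	// Passing nil uses no logger, no registerer and no stats.
	db, err := tsdb.Open("data", nil, nil, tsdb.DefaultOptions(), nil)
	if err != nil {
		log.Fatal(err)
	}

	// Always close the DB when done with it.
	defer db.Close()
}
```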
A `DB` has the following main components:
* Compactor: a [leveled compactor](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb#LeveledCompactor). Note: it is currently the only compactor implementation. It runs automatically.
* [`Head`](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb#DB.Head)
* [Blocks (persistent blocks)](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb#DB.Blocks)
The `Head` is responsible for a lot. Here are its main components:
* [WAL](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb/wal#WAL) (Write Ahead Log).
* [`stripeSeries`](https://github.com/prometheus/prometheus/blob/411021ada9ab41095923b8d2df9365b632fd40c3/tsdb/head.go#L1292):
this holds all the active series by linking to [`memSeries`](https://github.com/prometheus/prometheus/blob/411021ada9ab41095923b8d2df9365b632fd40c3/tsdb/head.go#L1462)
by an ID (aka "ref") and by labels hash.
* Postings list (reverse index): For any label-value pair, holds all the corresponding series refs. Used for queries.
* Tombstones.
## Adding data
Use [`db.Appender()`](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb#DB.Appender) to obtain an "appender".
The [golang docs](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/storage#Appender) speak mostly for themselves.
Remember:
* Use `Commit()` to add the samples to the DB and update the WAL.
* Create a new appender each time you commit.
* Appenders are not concurrency safe, but scrapes run concurrently and, as such, use multiple appenders concurrently.
This reduces contention, although Commit() calls still contend on the same critical section (writing to the WAL is serialized), and may
inflate append tail latency if multiple appenders try to commit at the same time.
Append may reject data due to these conditions:
1) `timestamp < minValidTime` where `minValidTime` is the highest of:
* the maxTime of the last block (i.e. the last truncation time of Head) - updated via [`Head.Truncate()`](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb#Head.Truncate) and [`DB.compactHead()`](https://github.com/prometheus/prometheus/blob/411021ada9ab41095923b8d2df9365b632fd40c3/tsdb/db.go#L968)
* `tsdb.min-block-duration/2` older than the max time in the Head block. Note that while technically `storage.tsdb.min-block-duration` is configurable, it's a hidden option and changing it is discouraged, so we can assume this value to be 2h.
Breaching this condition results in "out of bounds" errors.
The first condition ensures that the block that will be generated doesn't overlap with the previous one (which simplifies querying).
The second condition ensures the sample won't go into the so-called "compaction window", that is, the section of the data that might be in the process of being saved into a persistent block on disk (because that logic runs concurrently with ingestion, without a lock).
2) The labels don't validate (i.e. the set is empty or contains duplicate label names).
3) The sample, for its respective series (identified by all of its labels), is out of order, or has a different value than the one already stored for the latest (highest) timestamp seen. (This results in `storage.ErrOutOfOrderSample` and `storage.ErrDuplicateSampleForTimestamp` respectively.)
`Commit()` may also refuse data that is out of order with respect to samples that were added via a different appender.
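A minimal sketch of this append flow (assuming an already-opened `*tsdb.DB`; the label set and value are hypothetical, and the timestamp follows the example test in using Unix seconds):

```go
package example

import (
	"context"
	"errors"
	"time"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/storage"
	"github.com/prometheus/prometheus/tsdb"
)

// addSample appends a single sample and commits it.
func addSample(db *tsdb.DB) error {
	app := db.Appender(context.Background())

	lbls := labels.FromStrings("job", "example")
	if _, err := app.Append(0, lbls, time.Now().Unix(), 42.0); err != nil {
		// Out-of-order and duplicate samples are reported with dedicated errors.
		if errors.Is(err, storage.ErrOutOfOrderSample) || errors.Is(err, storage.ErrDuplicateSampleForTimestamp) {
			// Decide here whether to drop the sample or fail the whole batch.
		}
		_ = app.Rollback() // Discard whatever this appender has buffered.
		return err
	}

	// Commit writes to the WAL and makes the samples visible to queriers.
	// To append more data afterwards, obtain a new appender via db.Appender().
	return app.Commit()
}
```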
## Querying data
Use [`db.Querier()`](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/tsdb#DB.Querier) to obtain a "querier".
The [golang docs](https://pkg.go.dev/github.com/prometheus/prometheus@v1.8.2-0.20211105201321-411021ada9ab/storage#Querier) speak mostly for themselves.
Remember:
* A querier can only see data that was committed when it was created. This limits the lifetime of a querier.
* A querier should be closed when you're done with it.
* Use mint/maxt to avoid loading unneeded data.
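A minimal sketch of the query flow, condensed from the example test added in this change (the `foo="bar"` matcher is just an example; in practice use a narrower time range than the full int64 span):

```go
package example

import (
	"context"
	"fmt"
	"math"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/tsdb"
)

// printSeries prints every sample of every series matching foo="bar".
func printSeries(db *tsdb.DB) error {
	// The querier only sees data committed before this point.
	querier, err := db.Querier(context.Background(), math.MinInt64, math.MaxInt64)
	if err != nil {
		return err
	}
	defer querier.Close() // Close the querier when done with it.

	ss := querier.Select(false, nil, labels.MustNewMatcher(labels.MatchEqual, "foo", "bar"))
	for ss.Next() {
		series := ss.At()
		fmt.Println("series:", series.Labels().String())

		it := series.Iterator()
		for it.Next() {
			ts, v := it.At()
			fmt.Println("sample", ts, v)
		}
		if err := it.Err(); err != nil {
			return err
		}
	}
	return ss.Err()
}
```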
## Example code
Find the example code for ingesting samples and querying them in [`tsdb/example_test.go`](../example_test.go)

tsdb/example_test.go Normal file

@ -0,0 +1,112 @@
// Copyright 2021 The Prometheus Authors
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package tsdb
import (
"context"
"fmt"
"math"
"testing"
"time"
"github.com/stretchr/testify/require"
"github.com/prometheus/prometheus/model/labels"
)
func TestExample(t *testing.T) {
// Create a random dir to work in. Open() doesn't require a pre-existing dir, but
// we want to make sure not to make a mess where we shouldn't.
dir := t.TempDir()
// Open a TSDB for reading and/or writing.
db, err := Open(dir, nil, nil, DefaultOptions(), nil)
require.NoError(t, err)
// Open an appender for writing.
app := db.Appender(context.Background())
lbls := labels.FromStrings("foo", "bar")
var appendedSamples []sample
// Ref is 0 for the first append since we don't know the reference for the series.
ts, v := time.Now().Unix(), 123.0
ref, err := app.Append(0, lbls, ts, v)
require.NoError(t, err)
appendedSamples = append(appendedSamples, sample{ts, v})
// Another append for a second later.
// Reusing the ref from above, since it's the same series, makes the append faster.
time.Sleep(time.Second)
ts, v = time.Now().Unix(), 124
_, err = app.Append(ref, lbls, ts, v)
require.NoError(t, err)
appendedSamples = append(appendedSamples, sample{ts, v})
// Commit to storage.
err = app.Commit()
require.NoError(t, err)
// In case you want to do more appends after app.Commit(),
// you need a new appender.
// app = db.Appender(context.Background())
//
// ... adding more samples.
//
// Commit to storage.
// err = app.Commit()
// require.NoError(t, err)
// Open a querier for reading.
querier, err := db.Querier(context.Background(), math.MinInt64, math.MaxInt64)
require.NoError(t, err)
ss := querier.Select(false, nil, labels.MustNewMatcher(labels.MatchEqual, "foo", "bar"))
var queriedSamples []sample
for ss.Next() {
series := ss.At()
fmt.Println("series:", series.Labels().String())
it := series.Iterator()
for it.Next() {
ts, v := it.At()
fmt.Println("sample", ts, v)
queriedSamples = append(queriedSamples, sample{ts, v})
}
require.NoError(t, it.Err())
fmt.Println("it.Err():", it.Err())
}
require.NoError(t, ss.Err())
fmt.Println("ss.Err():", ss.Err())
ws := ss.Warnings()
if len(ws) > 0 {
fmt.Println("warnings:", ws)
}
err = querier.Close()
require.NoError(t, err)
// Clean up any last resources when done.
err = db.Close()
require.NoError(t, err)
require.Equal(t, appendedSamples, queriedSamples)
// Output:
// series: {foo="bar"}
// sample <ts1> 123
// sample <ts2> 124
// it.Err(): <nil>
// ss.Err(): <nil>
}


@ -992,12 +992,15 @@ func (h *Head) Stats(statsByLabelName string) *Stats {
}
}
// RangeHead allows querying Head via an IndexReader, ChunkReader and tombstones.Reader
// but only within a restricted range. Used for queries and compactions.
type RangeHead struct {
head *Head
mint, maxt int64
}
// NewRangeHead returns a *RangeHead.
// There are no restrictions on mint/maxt.
func NewRangeHead(head *Head, mint, maxt int64) *RangeHead {
return &RangeHead{
head: head,
@ -1284,15 +1287,17 @@ const (
DefaultStripeSize = 1 << 14
)
// stripeSeries locks modulo ranges of IDs and hashes to reduce lock contention.
// stripeSeries holds series by HeadSeriesRef ("ID") and also by hash of their labels.
// ID-based lookups via (getByID()) are preferred over getByHash() for performance reasons.
// It locks modulo ranges of IDs and hashes to reduce lock contention.
// The locks are padded to not be on the same cache line. Filling the padded space
// with the maps was profiled to be slower likely due to the additional pointer
// dereferences.
type stripeSeries struct {
size int
series []map[chunks.HeadSeriesRef]*memSeries
hashes []seriesHashmap
locks []stripeLock
series []map[chunks.HeadSeriesRef]*memSeries // Sharded by ref. A series ref is the value of `size` when the series was being newly added.
hashes []seriesHashmap // Sharded by label hash.
locks []stripeLock // Sharded by ref for series access, by label hash for hashes access.
seriesLifecycleCallback SeriesLifecycleCallback
}
@ -1466,17 +1471,24 @@ type memSeries struct {
ref chunks.HeadSeriesRef
lset labels.Labels
mmappedChunks []*mmappedChunk
mmMaxTime int64 // Max time of any mmapped chunk, only used during WAL replay.
headChunk *memChunk
mmappedChunks []*mmappedChunk // Immutable chunks on disk that have not yet gone into a block, in order of ascending time stamps.
mmMaxTime int64 // Max time of any mmapped chunk, only used during WAL replay.
headChunk *memChunk // Most recent chunk in memory that's still being built.
chunkRange int64
firstChunkID int
nextAt int64 // Timestamp at which to cut the next chunk.
sampleBuf [4]sample
nextAt int64 // Timestamp at which to cut the next chunk.
// We keep the last 4 samples here (in addition to appending them to the chunk) so we don't need coordination between appender and querier.
// Even the most compact encoding of a sample takes 2 bits, so the last byte is not contended.
sampleBuf [4]sample
pendingCommit bool // Whether there are samples waiting to be committed to this series.
app chunkenc.Appender // Current appender for the chunk.
// Current appender for the head chunk. Set when a new head chunk is cut.
// It is nil only if headChunk is nil. E.g. if there was an appender that created a new series, but rolled back the commit
// (the first sample would create a headChunk, hence appender, but rollback skipped it while the Append() call would create a series).
app chunkenc.Appender
memChunkPool *sync.Pool


@ -114,7 +114,7 @@ func (h *Head) Appender(_ context.Context) storage.Appender {
}
func (h *Head) appender() *headAppender {
appendID, cleanupAppendIDsBelow := h.iso.newAppendID()
appendID, cleanupAppendIDsBelow := h.iso.newAppendID() // Every appender gets an ID that is cleared upon commit/rollback.
// Allocate the exemplars buffer only if exemplars are enabled.
var exemplarsBuf []exemplarWithSeriesRef
@ -224,10 +224,10 @@ type headAppender struct {
minValidTime int64 // No samples below this timestamp are allowed.
mint, maxt int64
series []record.RefSeries
samples []record.RefSample
exemplars []exemplarWithSeriesRef
sampleSeries []*memSeries
series []record.RefSeries // New series held by this appender.
samples []record.RefSample // New samples held by this appender.
exemplars []exemplarWithSeriesRef // New exemplars held by this appender.
sampleSeries []*memSeries // Series corresponding to the samples held by this appender (using corresponding slice indices - same series may appear more than once).
appendID, cleanupAppendIDsBelow uint64
closed bool
@ -361,6 +361,7 @@ func (a *headAppender) GetRef(lset labels.Labels) (storage.SeriesRef, labels.Lab
return storage.SeriesRef(s.ref), s.lset
}
// log writes all headAppender's data to the WAL.
func (a *headAppender) log() error {
if a.head.wal == nil {
return nil
@ -412,6 +413,7 @@ func exemplarsForEncoding(es []exemplarWithSeriesRef) []record.RefExemplar {
return ret
}
// Commit writes to the WAL and adds the data to the Head.
func (a *headAppender) Commit() (err error) {
if a.closed {
return ErrAppenderClosed
@ -481,7 +483,7 @@ func (s *memSeries) append(t int64, v float64, appendID uint64, chunkDiskMapper
if c == nil {
if len(s.mmappedChunks) > 0 && s.mmappedChunks[len(s.mmappedChunks)-1].maxTime >= t {
// Out of order sample. Sample timestamp is already in the mmaped chunks, so ignore it.
// Out of order sample. Sample timestamp is already in the mmapped chunks, so ignore it.
return false, false
}
// There is no chunk in this series yet, create the first chunk for the sample.
@ -583,6 +585,7 @@ func (s *memSeries) mmapCurrentHeadChunk(chunkDiskMapper *chunks.ChunkDiskMapper
})
}
// Rollback removes the samples and exemplars from headAppender and writes any series to WAL.
func (a *headAppender) Rollback() (err error) {
if a.closed {
return ErrAppenderClosed


@ -175,6 +175,8 @@ func (h *headIndexReader) Series(ref storage.SeriesRef, lbls *labels.Labels, chk
return nil
}
// chunkID returns the ID corresponding to .mmappedChunks[pos]
// (head chunk if pos==len(mmappedChunks))
func (s *memSeries) chunkID(pos int) int {
return pos + s.firstChunkID
}
@ -288,9 +290,9 @@ func (h *headChunkReader) Chunk(ref chunks.ChunkRef) (chunkenc.Chunk, error) {
}, nil
}
// chunk returns the chunk for the chunk id from memory or by m-mapping it from the disk.
// chunk returns the chunk for the chunkID from memory or by m-mapping it from the disk.
// If garbageCollect is true, it means that the returned *memChunk
// (and not the chunkenc.Chunk inside it) can be garbage collected after it's usage.
// (and not the chunkenc.Chunk inside it) can be garbage collected after its usage.
func (s *memSeries) chunk(id int, chunkDiskMapper *chunks.ChunkDiskMapper) (chunk *memChunk, garbageCollect bool, err error) {
// ix represents the index of chunk in the s.mmappedChunks slice. The chunk id's are
// incremented by 1 when new chunk is created, hence (id - firstChunkID) gives the slice index.
@ -335,7 +337,7 @@ func (c *safeChunk) Iterator(reuseIter chunkenc.Iterator) chunkenc.Iterator {
return it
}
// iterator returns a chunk iterator.
// iterator returns a chunk iterator for the requested chunkID.
// It is unsafe to call this concurrently with s.append(...) without holding the series lock.
func (s *memSeries) iterator(id int, isoState *isolationState, chunkDiskMapper *chunks.ChunkDiskMapper, it chunkenc.Iterator) chunkenc.Iterator {
c, garbageCollect, err := s.chunk(id, chunkDiskMapper)
@ -439,6 +441,8 @@ func (s *memSeries) iterator(id int, isoState *isolationState, chunkDiskMapper *
}
}
// memSafeIterator returns values from the wrapped stopIterator
// except the last 4, which come from buf.
type memSafeIterator struct {
stopIterator
@ -482,6 +486,8 @@ func (it *memSafeIterator) At() (int64, float64) {
return s.t, s.v
}
// stopIterator wraps an Iterator, but only returns the first
// stopAfter values, if initialized with i=-1.
type stopIterator struct {
chunkenc.Iterator


@ -12,6 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.
// Package record contains the various record types used for encoding various Head block data in the WAL and in-memory snapshot.
package record
import (