Fix locks in db.reloadBlocks()

This partially reverts ae3d392aa9. ae3d392aa9 added a call to db.mtx.Lock() that lasts for the entire duration of db.reloadBlocks(), previous db.mtx would be locked only during critical part of db.reloadBlocks(). The motivation was to protect against races: 9e0351e161 (r555699794) The 'reloads' being mentioned are (I think) reloadBlocks() calls, rather than db.reload() or other methods. TestTombstoneCleanRetentionLimitsRace was added to catch this but I wasn't able to ever get any error out of it, even after disabling all calls to db.mtx in reloadBlocks() and CleanTombstones(). To make things more complicated CleanupTombstones() itself calls reloadBlocks(), so it seems that the real issue is that we might have concurrent calls to reloadBlocks(). The problem with this change is that db.reloadBlocks() can take a very long time, that's because it might need to load very large blocks from disk, which is slow. While db.mtx is locked a large chunk of the db is locked, including queries, since db.mtx read lock is needed for db.Querier() call. One of the issues this manifests itself as is a gap in all metrics and blocked queries just after a large block compaction happens. When compaction merges multiple day-or-more blocks into a week-or-more block it create a single very big block. After that block is written it needs to be loaded and that seems to be taking many seconds (30-45), during which mtx is held and everything is blocked. Turns out that there is another lock that is more fine grained and aimed at this specific use case: // cmtx ensures that compactions and deletions don't run simultaneously. cmtx sync.Mutex All calls to reloadBlocks() are wrapped inside cmtx lock. The only exception is db.reload() which this change fixes. We can't add cmtx lock inside reloadBlocks() itself because it's called by a number of functions, some of which are already holding cmtx. Looking at the code I think it is sufficient to hold cmtx and skip a reloadBlocks() wide mtx call. Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2025-03-05 20:59:13 -08:00 · 2023-10-02 14:47:00 +01:00 · 2023-10-02 14:47:00 +01:00 · b880cea613
parent b3e30d52ce
commit b880cea613
1 changed files with 9 additions and 6 deletions
--- a/tsdb/db.go
+++ b/tsdb/db.go
@ -992,9 +992,14 @@ func open(dir string, l *slog.Logger, r prometheus.Registerer, opts *Options, rn
 	db.metrics.maxBytes.Set(float64(maxBytes))
 	db.metrics.retentionDuration.Set((time.Duration(opts.RetentionDuration) * time.Millisecond).Seconds())

+	// Calling db.reload() calls db.reloadBlocks() which requires cmtx to be locked.
+	db.cmtx.Lock()
 	if err := db.reload(); err != nil {
+		db.cmtx.Unlock()
 		return nil, err
 	}
+	db.cmtx.Unlock()
+
 	// Set the min valid time for the ingested samples
 	// to be no lower than the maxt of the last block.
 	minValidTime := int64(math.MinInt64)
@ -1568,13 +1573,9 @@ func (db *DB) reloadBlocks() (err error) {
 		db.metrics.reloads.Inc()
 	}()

-	// Now that we reload TSDB every minute, there is a high chance for a race condition with a reload
-	// triggered by CleanTombstones(). We need to lock the reload to avoid the situation where
-	// a normal reload and CleanTombstones try to delete the same block.
-	db.mtx.Lock()
-	defer db.mtx.Unlock()
-
+	db.mtx.RLock()
 	loadable, corrupted, err := openBlocks(db.logger, db.dir, db.blocks, db.chunkPool, db.opts.PostingsDecoderFactory)
+	db.mtx.RUnlock()
 	if err != nil {
 		return err
 	}
@ -1643,8 +1644,10 @@ func (db *DB) reloadBlocks() (err error) {
 	})

 	// Swap new blocks first for subsequently created readers to be seen.
+	db.mtx.Lock()
 	oldBlocks := db.blocks
 	db.blocks = toLoad
+	db.mtx.Unlock()

 	// Only check overlapping blocks when overlapping compaction is enabled.
 	if db.opts.EnableOverlappingCompaction {