WIP: This needs more tests.
It now gets a from and through value, which it may opportunistically
use to optimize the retrieval. With possible future range indices,
this could be used in a very efficient way. This change merely applies
some easy checks, which should nevertheless solve the use case of
heavy rule evaluations on servers with a lot of series churn.
Idea is the following:
- Only archive series that are at least as old as the headChunkTimeout
(which was already extremely unlikely to happen).
- Then maintain a high watermark for the last archival, i.e. no
archived series has a sample more recent than that watermark.
- Any query that doesn't reach to a time before that watermark doesn't
have to touch the archive index at all. (A production server at
Soundcloud with the aforementioned series churn and heavy rule
evaluations spends 50% of its CPU time in archive index
lookups. Since rule evaluations usually only touch very recent
values, most of those lookup should disappear with this change.)
- Federation with a very broad label matcher will profit from this,
too.
As a byproduct, the un-needed MetricForFingerprint method was removed
from the Storage interface.
This fixes https://github.com/prometheus/prometheus/issues/1059 , but
not in the obvious way (simply not updating the persist watermark,
because that's actually not that simple - we don't really know what
has gone wrong exactly). As any errors relevant here are most likely
caused by severe and unrecoverable problems with the series file,
Using the now quarantine feature is the right step. We don't really
have to be worried about any inconsistent state of the series because
it will be removed for good ASAP. Another plus is that we don't have
to declare the whole storage dirty anymore.
This requires all the panic calls upon unexpected data to be converted
into errors returned. This pollute the function signatures quite
lot. Well, this is Go...
The ideas behind this are the following:
- panic only if it's a programming error. Data corruptions happen, and
they are not programming errors.
- If we detect a data corruption, we "quarantine" the series,
essentially removing it from the database and putting its data into
a separate directory for forensics.
- Failure during writing to a series file is not considered corruption
automatically. It will call setDirty, though, so that a
crashrecovery upon the next restart will commence and check for
that.
- Series quarantining and setDirty calls are logged and counted in
metrics, but are hidden from the user of the interfaces in
interface.go, whith the notable exception of Append(). The reasoning
is that we treat corruption by removing the corrupted series, i.e. a
query for it will return no results on its next call anyway, so
return no results right now. In the case of Append(), we want to
tell the user that no data has been appended, though.
Minor side effects:
- Now consistently using filepath.* instead of path.*.
- Introduced structured logging where I touched it. This makes things
less consistent, but a complete change to structured logging would
be out of scope for this PR.
Fixes https://github.com/prometheus/prometheus/issues/1401
This remove the last (and in fact bogus) use of BoundaryValues.
Thus, a whole lot of unused (and arguably sub-optimal / ugly) code can
be removed here, too.
In a way, our instants were also ranges, just with the staleness delta
as range length. They are no treated equally, just that in one case,
the range length is set as range, in the other the staleness
delta. However, there are "real" instants where start and and time of
a query is the same. In those cases, we only want to return a single
value (the one closest before or at the equal start and end time). If
that value is the last sample in the series, odds are we have it
already in the series object. In that case, there is no need to pin or
load any chunks. A special singleSampleSeriesIterator is created for
that. This should greatly speed up instant queries as they happen
frequently for rule evaluations.
This implies a slight change of behavior as only samples added to the
respective instance of a memorySeries are returned. However, this is
most likely anyway what we want.
Following cases:
- Server has been restarted: Given the time it takes to cleanly
shutdown and start up a server, the series are now stale anyway. An
improved staleness handling (still to be implemented) will be based
on tracking if a given target is continuing to expose samples for a
given time series. In that case, we need a full scrape cycle to
decide about staleness. So again, it makes sense to consider
everything stale directly after a server restart.
- Series unarchived due to a read request: The series is definitely
stale so we don't want to return anything anyway.
- Freshly created time series or series unarchived because of a sample
append: That happens because appending a sample is imminent. Before
the fingerprint lock is released, the series will have received a
sample, and lastSamplePair will always returned the expected value.
Formalize ZeroSamplePair as return value for non-existing samples.
Change LastSamplePairForFingerprint to return a SamplePair (and not a
pointer to it), which saves allocations in a potentially extremely
frequent call.
This will fix issue #1035 and will also help to make issue #1264 less
bad.
The fundamental problem in the current code:
In the preload phase, we quite accurately determine which chunks will
be used for the query being executed. However, in the subsequent step
of creating series iterators, the created iterators are referencing
_all_ in-memory chunks in their series, even the un-pinned ones. In
iterator creation, we copy a pointer to each in-memory chunk of a
series into the iterator. While this creates a certain amount of
allocation churn, the worst thing about it is that copying the chunk
pointer out of the chunkDesc requires a mutex acquisition. (Remember
that the iterator will also reference un-pinned chunks, so we need to
acquire the mutex to protect against concurrent eviction.) The worst
case happens if a series doesn't even contain any relevant samples for
the query time range. We notice that during preloading but then we
will still create a series iterator for it. But even for series that
do contain relevant samples, the overhead is quite bad for instant
queries that retrieve a single sample from each series, but still go
through all the effort of series iterator creation. All of that is
particularly bad if a series has many in-memory chunks.
This commit addresses the problem from two sides:
First, it merges preloading and iterator creation into one step,
i.e. the preload call returns an iterator for exactly the preloaded
chunks.
Second, the required mutex acquisition in chunkDesc has been greatly
reduced. That was enabled by a side effect of the first step, which is
that the iterator is only referencing pinned chunks, so there is no
risk of concurrent eviction anymore, and chunks can be accessed
without mutex acquisition.
To simplify the code changes for the above, the long-planned change of
ValueAtTime to ValueAtOrBefore time was performed at the same
time. (It should have been done first, but it kind of accidentally
happened while I was in the middle of writing the series iterator
changes. Sorry for that.) So far, we actively filtered the up to two
values that were returned by ValueAtTime, i.e. we invested work to
retrieve up to two values, and then we invested more work to throw one
of them away.
The SeriesIterator.BoundaryValues method can be removed once #1401 is
fixed. But I really didn't want to load even more changes into this
PR.
Benchmarks:
The BenchmarkFuzz.* benchmarks run 83% faster (i.e. about six times
faster) and allocate 95% fewer bytes. The reason for that is that the
benchmark reads one sample after another from the time series and
creates a new series iterator for each sample read.
To find out how much these improvements matter in practice, I have
mirrored a beefy Prometheus server at SoundCloud that suffers from
both issues #1035 and #1264. To reach steady state that would be
comparable, the server needs to run for 15d. So far, it has run for
1d. The test server currently has only half as many memory time series
and 60% of the memory chunks the main server has. The 90th percentile
rule evaluation cycle time is ~11s on the main server and only ~3s on
the test server. However, these numbers might get much closer over
time.
In addition to performance improvements, this commit removes about 150
LOC.
This gives up on the idea to communicate throuh the Append() call (by
either not returning as it is now or returning an error as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append()'s. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.
Logging of throttled mode has been streamlined and will create at most
one message per minute.
"Rushed mode" is formerly known as "degraded mode", which is changed
with this commit, too. The name "degraded" was very misleading.
Also, switch into rushed mode if we have too many chunks in memory and
an at least reasonable amount of chunks to persist so that speeding up
persisting chunks can help.
If only very few chunks are to be truncated from a very large series
file, the rewrite of the file is a lorge overhead. With this change, a
certain ratio of the file has to be dropped to make it happen. While
only causing disk overhead at about the same ratio (by default 10%),
it will cut down I/O by a lot in above scenario.
Perhaps it would be even better to still warn in case the sample value has
changed but the timestamps are equal, but we don't have efficient access
to the last value.
For the label matching index-based preselection phase, don't do an OR
between equality and non-equality matchers. Execute only one of the two
(with equality matchers preferred when present).
Fixes https://github.com/prometheus/prometheus/issues/924
If all samples in consecutive chunks have the same timestamp, the way
we used to load chunks will fail. With this change, the persist
watermark is used to load the right amount of chunkDescs from disk.
This bug is a possible reason for the rare storage corruption we have
observed.
Fixes https://github.com/prometheus/prometheus/issues/481
While doing so, clean up and fix a few other things:
- Fix `go vet` warnings (@fabxc to blame ;).
- Fix a racey problem with unarchiving: Whenever we unarchive a
series, we essentially want to do something with it. However, until
we have done something with it, it appears like a series that is
ready to be archived or even purged. So e.g. it would be ignored
during checkpointing. With this fix, we always load the chunkDescs
upon unarchiving. This is wasteful if we only want to add a new
sample to an archived time series, but the (presumably more common)
case where we access an archived time series in a query doesn't
become more expensive.
- The change above streamlined the getOrCreateSeries ond
newMemorySeries flow. Also, the modTime is now always set correctly.
- Fix the leveldb-backed implementation of KeyValueStore.Delete. It
had the wrong behavior of still returning true, nil if a
non-existing key has been passed in.
This commit adds the honor_labels and params arguments to the scrape
config. This allows to specify query parameters used by the scrapers
and handling scraped labels with precedence.
Change #704 introduced a regression that started reading the queue only
after potential crash recovery. When more than the queue capacity was
indexed, Prometheus deadlocked.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.
Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...
Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.
The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.
Also, remove lint and vet warnings in ast.go.
A number of mostly minor things:
- Rename chunk type -> chunk encoding.
- After all, do not carry around the chunk encoding to all parts of
the system, but just have one place where the encoding for new
chunks is set based on the flag. The new approach has caveats as
well, but the polution of so many method signatures is worse.
- Use the default chunk encoding for new chunks of existing
series. (Previously, only new _series_ would get chunks with the
default encoding.)
- Use an enum for chunk encoding. (But keep the version number for the
flag, for reasons discussed previously.)
- Add encoding() to the chunk interface (so that a chunk knows its own
encoding - no need to have that in a different top-level function).
- Got rid of newFollowUpChunk (which would keep the existing encoding
for all chunks of a time series). Now only use newChunk(), which
will create a chunk encoding according to the flag.
- Simplified transcodeAndAdd.
- Reordered methods of deltaEncodedChunk and doubleDeltaEncoded chunk
to match the order in the chunk interface.
- Only transcode if the chunk is not yet half full. If more than half
full, add a new chunk instead.
The capacity is basically how many persisted head chunks we will count
at most while doing other things, in particular checkpointing. To
limit the amount of already counted head chunks, keep this number low,
otherwise we will easily checkpoint too often if checkpoints take long
anyway.
In that commit, the 'maintainSeries' call was accidentally removed.
This commit refactors things a bit so that there is now a clean
'maintainMemorySeries' and a 'maintainArchivedSeries' call.
Straighten the nomenclature a bit (consistently use 'drop' for
chunks and 'purge' for series/metrics).
Remove the annoying 'Completed maintenance sweep through archived
fingerprints' message if there were no archived fingerprints to do
maintenance on.
This is done by bucketing chunks by fingerprint. If the persisting to
disk falls behind, more and more chunks are in the queue. As soon as
there are "double hits", we will now persist both chunks in one go,
doubling the disk throughput (assuming it is limited by disk
seeks). Should even more pile up so that we end wit "triple hits", we
will persist those first, and so on.
Even if we have millions of time series, this will still help,
assuming not all of them are growing with the same speed. Series that
get many samples and/or are not very compressable will accumulate
chunks faster, and they will soon get double- or triple-writes.
To improve the chance of double writes,
-storage.local.persistence-queue-capacity could be set to a higher
value. However, that will slow down shutdown a lot (as the queue has
to be worked through). So we leave it to the user to set it to a
really high value. A more fundamental solution would be to checkpoint
not only head chunks, but also chunks still in the persist queue. That
would be quite complicated for a rather limited use-case (running many
time series with high ingestion rate on slow spinning disks).
Starting a goroutine takes 1-2µs on my laptop. From the "numbers every
Go programmer should know", I had 300ns for a channel send in my
mind. Turns out, on my laptop, it takes only 60ns. That's fast enough
to warrant the machinery of yet another channel with a fixed set of
worker goroutines feeding from it. The number chosen (8 for now) is
low enough to not really afflict a measurable overhead (a big
Prometheus server has >1000 goroutines running), but high enough to
not make sample ingestion a bottleneck.
- Parallelize AppendSamples as much as possible without breaking the
contract about temporal order.
- Allocate more fingerprint locker slots.
- Do not run early checkpoints if we are behind on chunk persistence.
- Increase fpMinWaitDuration to give the disk more time for more
important things.
Also, switch math.MaxInt64 and math.MinInt64 to the new constants.
Also, set a much higher default value.
Chunk persist requests can be quite spiky. If you collect a large
number of time series that are very similar, they will tend to finish
up a chunk at about the same time. There is no reason we need to back
up scraping just because of that. The rationale of the new default
value is "1/8 of the chunks in memory".
If queries are still running when the shutdown is initiated, they will
finish _during_ the shutdown. In that case, they might request chunk
eviction upon unpinning their pinned chunks. That might completely
fill the evict request queue _after_ draining it during storage
shutdown. If that ever happens (which is the case if there are _many_
queries still running during shutdown), the affected queries will be
stuck while keeping a fingerprint locked. The checkpointing can then
not process that fingerprint (or one that shares the same lock). And
then we are deadlocked.
- Move CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
Usually, if you unarchive a series, it is to add something to it,
which will create a new head chunk. However, if a series in
unarchived, and before anything is added to it, it is handled by the
maintenance loop, it will be archived again. In that case, we have to
load the chunkDescs to know the lastTime of the series to be
archived. Usually, this case will happen only rarely (as a race, has
never happened so far, possibly because the locking around unarchiving
and the subsequent sample append is smart enough). However, during
crash recovery, we sometimes treat series as "freshly unarchived"
without directly appending a sample. We might add more cases of that
type later, so better deal with archiving properly and load chunkDescs
if required.
- Documented checkpoint file format.
- High-level description of series sanitation.
- Replace fp.LoadFromString panic with an error.
(Change in client_golang already submitted.)
- Introduced checks for series file size where appropriate.
- Removed two Law of Demeter violations.
Change-Id: I555d97a2c8f4769820c2fc8bf5d6f4e160222abc
- Delete unneeded file view_adapter.go.
- Assessed that we still need the fingerprints in nodes
(to create iterators).
- Turned numMemChunkDescs into a metric.
Change-Id: I29be963c795a075ec00c095f76bf26405535609d