This is a fairly easy attempt to dynamically evict chunks based on the
heap size. A target heap size has to be set as a command line flage,
so that users can essentially say "utilize 4GiB of RAM, and please
don't OOM".
The -storage.local.max-chunks-to-persist and
-storage.local.memory-chunks flags are deprecated by this
change. Backwards compatibility is provided by ignoring
-storage.local.max-chunks-to-persist and use
-storage.local.memory-chunks to set the new
-storage.local.target-heap-size to a reasonable (and conservative)
value (both with a warning).
This also makes the metrics intstrumentation more consistent (in
naming and implementation) and cleans up a few quirks in the tests.
Answers to anticipated comments:
There is a chance that Go 1.9 will allow programs better control over
the Go memory management. I don't expect those changes to be in
contradiction with the approach here, but I do expect them to
complement them and allow them to be more precise and controlled. In
any case, once those Go changes are available, this code has to be
revisted.
One might be tempted to let the user specify an estimated value for
the RSS usage, and then internall set a target heap size of a certain
fraction of that. (In my experience, 2/3 is a fairly safe bet.)
However, investigations have shown that RSS size and its relation to
the heap size is really really complicated. It depends on so many
factors that I wouldn't even start listing them in a commit
description. It depends on many circumstances and not at least on the
risk trade-off of each individual user between RAM utilization and
probability of OOMing during a RAM usage peak. To not add even more to
the confusion, we need to stick to the well-defined number we also use
in the targeting here, the sum of the sizes of heap objects.
Currently, if a series stops to exist, its head chunk will be kept
open for an hour. That prevents it from being persisted. Which
prevents it from being evicted. Which prevents the series from being
archived.
Most of the time, once no sample has been added to a series within the
staleness limit, we can be pretty confident that this series will not
receive samples anymore. The whole chain as described above can be
started after 5m instead of 1h. In the relaxed case, this doesn't
change a lot as the head chunk timeout is only checked during series
maintenance, and usually, a series is only maintained every six
hours. However, there is the typical scenario where a large service is
deployed, the deoply turns out to be bad, and then it is deployed
again within minutes, and quite quickly the number of time series has
tripled. That's the point where the Prometheus server is stressed and
switches (rightfully) into rushed mode. In that mode, time series are
processed as quickly as possible, but all of that is in vein if all of
those recently ended time series cannot be persisted yet for another
hour. In that scenario, this change will help most, and it's exactly
the scenario where help is most desperately needed.
These more specific methods have replaced `metricForLabelMatchers`
in cases where its `map[fingerprint]metric` result type was
not necessary or was used as an intermediate step
Avoids duplicated calls to `seriesForRange` from
`QueryRange` and `QueryInstant` methods.
This is a followup to https://github.com/prometheus/prometheus/pull/2011.
This publishes more of the methods and other names of the chunk code and
moves the chunk code to its own package. There's some unavoidable
ugliness: the chunk and chunkDesc metrics are used by both packages, so
I had to move them to the chunk package. That isn't great, but I don't
see how to do it better without a larger redesign of everything. Same
for the evict requests and some other types.
This is based on https://github.com/prometheus/prometheus/pull/1997.
This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
See discussion in
https://groups.google.com/forum/#!topic/prometheus-developers/bkuGbVlvQ9g
The main idea is that the user of a storage shouldn't have to deal with
fingerprints anymore, and should not need to do an individual preload
call for each metric. The storage interface needs to be made more
high-level to not expose these details.
This also makes it easier to reuse the same storage interface for remote
storages later, as fewer roundtrips are required and the fingerprint
concept doesn't work well across the network.
NOTE: this deliberately gets rid of a small optimization in the old
query Analyzer, where we dedupe instants and ranges for the same series.
This should have a minor impact, as most queries do not have multiple
selectors loading the same series (and at the same offset).
tl;dr: This is not a fundamental solution to the indexing problem
(like tindex is) but it at least avoids utilizing the intersection
problem to the greatest possible amount.
In more detail:
Imagine the following query:
nicely:aggregating:rule{job="foo",env="prod"}
While it uses a nicely aggregating recording rule (which might have a
very low cardinality), Prometheus still intersects the low number of
fingerprints for `{__name__="nicely:aggregating:rule"}` with the many
thousands of fingerprints matching `{job="foo"}` and with the millions
of fingerprints matching `{env="prod"}`. This totally innocuous query
is dead slow if the Prometheus server has a lot of time series with
the `{env="prod"}` label. Ironically, if you make the query more
complicated, it becomes blazingly fast:
nicely:aggregating:rule{job=~"foo",env=~"prod"}
Why so? Because Prometheus only intersects with non-Equal matchers if
there are no Equal matchers. That's good in this case because it
retrieves the few fingerprints for
`{__name__="nicely:aggregating:rule"}` and then starts right ahead to
retrieve the metric for those FPs and checking individually if they
match the other matchers.
This change is generalizing the idea of when to stop intersecting FPs
and go into "retrieve metrics and check them individually against
remaining matchers" mode:
- First, sort all matchers by "expected cardinality". Matchers
matching the empty string are always worst (and never used for
intersections). Equal matchers are in general consider best, but by
using some crude heuristics, we declare some better than others
(instance labels or anything that looks like a recording rule).
- Then go through the matchers until we hit a threshold of remaining
FPs in the intersection. This threshold is higher if we are already
in the non-Equal matcher area as intersection is even more expensive
here.
- Once the threshold has been reached (or we have run out of matchers
that do not match the empty string), start with "retrieve metrics
and check them individually against remaining matchers".
A beefy server at SoundCloud was spending 67% of its CPU time in index
lookups (fingerprintsForLabelPairs), serving mostly a dashboard that
is exclusively built with recording rules. With this change, it spends
only 35% in fingerprintsForLabelPairs. The CPU usage dropped from 26
cores to 18 cores. The median latency for query_range dropped from 14s
to 50ms(!). As expected, higher percentile latency didn't improve that
much because the new approach is _occasionally_ running into the worst
case while the old one was _systematically_ doing so. The 99th
percentile latency is now about as high as the median before (14s)
while it was almost twice as high before (26s).
PromQL only requires a much narrower interface than local.Storage in
order to run queries. Narrower interfaces are easier to replace and
test, too.
We could also change the web interface to use local.Querier, except that
we'll probably use appending functions from there in the future.
Before, we checkpointed after every newly detected fingerprint
collision, which is not a problem as long as collisions are
rare. However, with a sufficient number of metrics or particular
nature of the data set, there might be a lot of collisions, all to be
detected upon the first set of scrapes, and then the checkpointing
after each detection will take a quite long time (it's O(n²),
essentially).
Since we are rebuilding the fingerprint mapping during crash recovery,
the previous, very conservative approach didn't even buy us
anything. We only ever read from the checkpoint file after a clean
shutdown, so the only time we need to write the checkpoint file is
during a clean shutdown.
The chunk encoding was hardcoded there because it mostly doesn't
matter what encoding is chosen in that test. Since type 1 is
battle-hardened enough, I'm switching to type 2 here so that we can
catch unexpected problems as a byproduct. My expectation is that the
chunk encoding doesn't matter anyway, as said, but then "unexpected
problems" contains the word "unexpected".
This improves fuzz testing in two ways:
(1) More realistic time stamps. So far, the most common case in
practice was very rare in the test: Completely regular increases of
the timestamp.
(2) Verify samples by scanning through the whole relevant section of
the series.
For Gorilla-like chunks, this showed two things:
(1) With more regularly increasing time stamps, BenchmarkFuzz is
essentially as fast as with the traditional chunks:
```
BenchmarkFuzzChunkType0-8 2 972514684 ns/op 83426196 B/op 2500044 allocs/op
BenchmarkFuzzChunkType1-8 2 971478001 ns/op 82874660 B/op 2512364 allocs/op
BenchmarkFuzzChunkType2-8 2 999339453 ns/op 76670636 B/op 2366116 allocs/op
```
(2) There was a bug related to when and how the chunk footer is
overwritten to make use for the last sample. This wasn't exposed by
random access as the last sample of a chunk is retrieved from the
values in the header in that case.
This is not a verbatim implementation of the Gorilla encoding. First
of all, it could not, even if we wanted, because Prometheus has a
different chunking model (constant size, not constant time). Second,
this adds a number of changes that improve the encoding in general or
at least for the specific use case of Prometheus (and are partially
only possible in the context of Prometheus). See comments in the code
for details.
Only return an error where callers are doing something with it except
simply logging and ignoring.
All the errors touched in this commit flag the storage as dirty
anyway, and that fact is logged anyway. So most of what is being
removed here is just log spam.
As discussed earlier, the class of errors that flags the storage as
dirty signals fundamental corruption, no even bubbling up a one-time
warning to the user (e.g. about incomplete results) isn't helping much
because _anything_ happening in the storage has to be doubted from
that point on (and in fact retroactively into the past, too). Flagging
the storage dirty, and alerting on it (plus marking the state in the
web UI) is the only way I can see right now.
As a byproduct, I cleaned up the setDirty method a bit and improved
the logged errors.
WIP: This needs more tests.
It now gets a from and through value, which it may opportunistically
use to optimize the retrieval. With possible future range indices,
this could be used in a very efficient way. This change merely applies
some easy checks, which should nevertheless solve the use case of
heavy rule evaluations on servers with a lot of series churn.
Idea is the following:
- Only archive series that are at least as old as the headChunkTimeout
(which was already extremely unlikely to happen).
- Then maintain a high watermark for the last archival, i.e. no
archived series has a sample more recent than that watermark.
- Any query that doesn't reach to a time before that watermark doesn't
have to touch the archive index at all. (A production server at
Soundcloud with the aforementioned series churn and heavy rule
evaluations spends 50% of its CPU time in archive index
lookups. Since rule evaluations usually only touch very recent
values, most of those lookup should disappear with this change.)
- Federation with a very broad label matcher will profit from this,
too.
As a byproduct, the un-needed MetricForFingerprint method was removed
from the Storage interface.
For one, remove unneeded methods.
Then, instead of using a channel for all values, use a
bufio.Scanner-like interface. This removes the need for creating a
goroutine and avoids the (unnecessary) locking performed by channel
sending and receiving.
This will make it much easier to write new chunk implementations (like
Gorilla-style encoding).
Obviously, it's really bad to depend on timing here. The proper fix
would be to have something like WaitForIndexing for other things to
wait for, too.
For now, let's see if the wait time increase fixes the issue.
This requires all the panic calls upon unexpected data to be converted
into errors returned. This pollute the function signatures quite
lot. Well, this is Go...
The ideas behind this are the following:
- panic only if it's a programming error. Data corruptions happen, and
they are not programming errors.
- If we detect a data corruption, we "quarantine" the series,
essentially removing it from the database and putting its data into
a separate directory for forensics.
- Failure during writing to a series file is not considered corruption
automatically. It will call setDirty, though, so that a
crashrecovery upon the next restart will commence and check for
that.
- Series quarantining and setDirty calls are logged and counted in
metrics, but are hidden from the user of the interfaces in
interface.go, whith the notable exception of Append(). The reasoning
is that we treat corruption by removing the corrupted series, i.e. a
query for it will return no results on its next call anyway, so
return no results right now. In the case of Append(), we want to
tell the user that no data has been appended, though.
Minor side effects:
- Now consistently using filepath.* instead of path.*.
- Introduced structured logging where I touched it. This makes things
less consistent, but a complete change to structured logging would
be out of scope for this PR.
Fixes https://github.com/prometheus/prometheus/issues/1401
This remove the last (and in fact bogus) use of BoundaryValues.
Thus, a whole lot of unused (and arguably sub-optimal / ugly) code can
be removed here, too.
In a way, our instants were also ranges, just with the staleness delta
as range length. They are no treated equally, just that in one case,
the range length is set as range, in the other the staleness
delta. However, there are "real" instants where start and and time of
a query is the same. In those cases, we only want to return a single
value (the one closest before or at the equal start and end time). If
that value is the last sample in the series, odds are we have it
already in the series object. In that case, there is no need to pin or
load any chunks. A special singleSampleSeriesIterator is created for
that. This should greatly speed up instant queries as they happen
frequently for rule evaluations.
This will fix issue #1035 and will also help to make issue #1264 less
bad.
The fundamental problem in the current code:
In the preload phase, we quite accurately determine which chunks will
be used for the query being executed. However, in the subsequent step
of creating series iterators, the created iterators are referencing
_all_ in-memory chunks in their series, even the un-pinned ones. In
iterator creation, we copy a pointer to each in-memory chunk of a
series into the iterator. While this creates a certain amount of
allocation churn, the worst thing about it is that copying the chunk
pointer out of the chunkDesc requires a mutex acquisition. (Remember
that the iterator will also reference un-pinned chunks, so we need to
acquire the mutex to protect against concurrent eviction.) The worst
case happens if a series doesn't even contain any relevant samples for
the query time range. We notice that during preloading but then we
will still create a series iterator for it. But even for series that
do contain relevant samples, the overhead is quite bad for instant
queries that retrieve a single sample from each series, but still go
through all the effort of series iterator creation. All of that is
particularly bad if a series has many in-memory chunks.
This commit addresses the problem from two sides:
First, it merges preloading and iterator creation into one step,
i.e. the preload call returns an iterator for exactly the preloaded
chunks.
Second, the required mutex acquisition in chunkDesc has been greatly
reduced. That was enabled by a side effect of the first step, which is
that the iterator is only referencing pinned chunks, so there is no
risk of concurrent eviction anymore, and chunks can be accessed
without mutex acquisition.
To simplify the code changes for the above, the long-planned change of
ValueAtTime to ValueAtOrBefore time was performed at the same
time. (It should have been done first, but it kind of accidentally
happened while I was in the middle of writing the series iterator
changes. Sorry for that.) So far, we actively filtered the up to two
values that were returned by ValueAtTime, i.e. we invested work to
retrieve up to two values, and then we invested more work to throw one
of them away.
The SeriesIterator.BoundaryValues method can be removed once #1401 is
fixed. But I really didn't want to load even more changes into this
PR.
Benchmarks:
The BenchmarkFuzz.* benchmarks run 83% faster (i.e. about six times
faster) and allocate 95% fewer bytes. The reason for that is that the
benchmark reads one sample after another from the time series and
creates a new series iterator for each sample read.
To find out how much these improvements matter in practice, I have
mirrored a beefy Prometheus server at SoundCloud that suffers from
both issues #1035 and #1264. To reach steady state that would be
comparable, the server needs to run for 15d. So far, it has run for
1d. The test server currently has only half as many memory time series
and 60% of the memory chunks the main server has. The 90th percentile
rule evaluation cycle time is ~11s on the main server and only ~3s on
the test server. However, these numbers might get much closer over
time.
In addition to performance improvements, this commit removes about 150
LOC.
If only very few chunks are to be truncated from a very large series
file, the rewrite of the file is a lorge overhead. With this change, a
certain ratio of the file has to be dropped to make it happen. While
only causing disk overhead at about the same ratio (by default 10%),
it will cut down I/O by a lot in above scenario.
The test had become flaky with Go1.5.
Theory here is that with Go1.5.x, sleeping for 10ms might not be
enough to wake up another goroutine, possibly because it is used for
GC. 50ms should always be enough due to GC pause guarantees with the
new GC.