Commit graph

190 commits

Author SHA1 Message Date
Björn Rabenstein 50e4f49b7e Merge pull request #2561 from prometheus/beorn7/storage2
storage: Evict unused chunk.Descs in crash recovery
2017-04-04 00:05:03 +02:00
beorn7 08fc6cbd39 storage: Evict unused chunk.Descs in crash recovery
This is in line with the v1.5 change in paradigm to not keep
chunk.Descs without chunks around after a series maintenance.

It's mainly motivated by avoiding excessive amounts of RAM usage
during crash recovery.

The code avoids creating memory time series with zero chunk.Descs as
that is prone to trigger weird effects. (Series maintenance would
archive series with zero chunk.Descs, but we cannot do that here
because the archive indices still have to be checked.)
2017-04-04 00:04:22 +02:00
beorn7 d284ffab03 storage: Replace fpIter by sortedFPs
The fpIter was kind of cumbersome to use and required a lock for each
iteration (which wasn't even needed for the iteration at startup after
loading the checkpoint).

The new implementation here has an obvious penalty in memory, but it's
only 8 bytes per series, so 80MiB for a beefy server with 10M memory
time series (which would probably need ~100GiB RAM, so the memory
penalty is only 0.1% of the total memory need).

The big advantage is that now series maintenance happens in order,
which leads to the time between two maintenances of the same series
being less random. Ideally, after each maintenance, the next
maintenance would tackle the series with the largest number of
non-persisted chunks. That would be quite an effort to find out or
track, but with the approach here, the next maintenance will tackle
the series whose previous maintenance is longest ago, which is a good
approximation.

While this commit won't change the _average_ number of chunks
persisted per maintenance, it will reduce the mean time a given chunk
has to wait for its persistence and thus reduce the steady-state
number of chunks waiting for persistence.

Also, the map iteration in Go is non-deterministic but not truly
random. In practice, the iteration appears to be somewhat "bucketed".
You can often observe a bunch of series with similar duration since
their last maintenance, i.e. you see batches of series with similar
number of chunks persisted per maintenance. If that batch is
relatively young, a whole lot of series are maintained with very few
chunks to persist. (See screenshot in PR for a better explanation.)
2017-04-03 15:34:46 +02:00
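A minimal Go sketch of the idea, with illustrative names rather than the actual Prometheus types: keep every series fingerprint in a sorted slice (the 8 bytes per series mentioned above) and walk it in order, so the series whose previous maintenance is longest ago always comes up next.

    package main

    import (
        "fmt"
        "sort"
    )

    // fingerprint stands in for the real model.Fingerprint (a uint64).
    type fingerprint uint64

    // sortedFPs holds every memory series' fingerprint in ascending order:
    // 8 bytes per series, but a deterministic maintenance order.
    type sortedFPs []fingerprint

    // add inserts fp at its sorted position.
    func (s sortedFPs) add(fp fingerprint) sortedFPs {
        i := sort.Search(len(s), func(i int) bool { return s[i] >= fp })
        s = append(s, 0)
        copy(s[i+1:], s[i:])
        s[i] = fp
        return s
    }

    func main() {
        var fps sortedFPs
        for _, fp := range []fingerprint{42, 7, 19} {
            fps = fps.add(fp)
        }
        // Each maintenance pass visits the series in the same order, so the
        // time between two maintenances of the same series stays roughly constant.
        for _, fp := range fps {
            fmt.Println("maintaining series", fp)
        }
    }
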
beorn7 434ab2a6a3 storage: Evict chunks and calculate persistence pressure based on target heap size
This is a fairly easy attempt to dynamically evict chunks based on the
heap size. A target heap size has to be set as a command-line flag,
so that users can essentially say "utilize 4GiB of RAM, and please
don't OOM".

The -storage.local.max-chunks-to-persist and
-storage.local.memory-chunks flags are deprecated by this
change. Backwards compatibility is provided by ignoring
-storage.local.max-chunks-to-persist and using
-storage.local.memory-chunks to set the new
-storage.local.target-heap-size to a reasonable (and conservative)
value (both with a warning).

This also makes the metrics instrumentation more consistent (in
naming and implementation) and cleans up a few quirks in the tests.

Answers to anticipated comments:

There is a chance that Go 1.9 will allow programs better control over
the Go memory management. I don't expect those changes to be in
contradiction with the approach here, but I do expect them to
complement it and allow it to be more precise and controlled. In
any case, once those Go changes are available, this code has to be
revisited.

One might be tempted to let the user specify an estimated value for
the RSS usage, and then internally set a target heap size of a certain
fraction of that. (In my experience, 2/3 is a fairly safe bet.)
However, investigations have shown that RSS size and its relation to
the heap size is really really complicated. It depends on so many
factors that I wouldn't even start listing them in a commit
description. It depends on many circumstances, not least on the
risk trade-off of each individual user between RAM utilization and
probability of OOMing during a RAM usage peak. To not add even more to
the confusion, we need to stick to the well-defined number we also use
in the targeting here, the sum of the sizes of heap objects.
2017-03-27 14:33:50 +02:00
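A hedged Go sketch of the mechanism: the flag name matches the commit, but the urgency calculation is only illustrative. HeapAlloc from runtime.MemStats is exactly the "sum of the sizes of heap objects" referred to above.

    package main

    import (
        "flag"
        "fmt"
        "runtime"
    )

    // targetHeapSize mirrors the new -storage.local.target-heap-size flag.
    var targetHeapSize = flag.Uint64("storage.local.target-heap-size", 2<<30,
        "Bytes of heap the local storage aims to stay below.")

    // persistenceUrgency returns a score in [0, 1]: 0 means plenty of
    // headroom, 1 means the heap has reached the target and chunks should be
    // evicted and persisted as aggressively as possible.
    func persistenceUrgency(target uint64) float64 {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms) // ms.HeapAlloc is the sum of sizes of heap objects.
        u := float64(ms.HeapAlloc) / float64(target)
        if u > 1 {
            return 1
        }
        return u
    }

    func main() {
        flag.Parse()
        fmt.Printf("persistence urgency: %.2f\n", persistenceUrgency(*targetHeapSize))
    }
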
beorn7 96a303b348 storage: Use staleness delta as head chunk timeout
Currently, if a series ceases to exist, its head chunk will be kept
open for an hour. That prevents it from being persisted. Which
prevents it from being evicted. Which prevents the series from being
archived.

Most of the time, once no sample has been added to a series within the
staleness limit, we can be pretty confident that this series will not
receive samples anymore. The whole chain as described above can be
started after 5m instead of 1h. In the relaxed case, this doesn't
change a lot as the head chunk timeout is only checked during series
maintenance, and usually, a series is only maintained every six
hours. However, there is the typical scenario where a large service is
deployed, the deploy turns out to be bad, and then it is deployed
again within minutes, and quite quickly the number of time series has
tripled. That's the point where the Prometheus server is stressed and
switches (rightfully) into rushed mode. In that mode, time series are
processed as quickly as possible, but all of that is in vain if all of
those recently ended time series cannot be persisted yet for another
hour. In that scenario, this change will help most, and it's exactly
the scenario where help is most desperately needed.
2017-03-26 23:44:50 +02:00
Jeremy Meulemans 025c828976 Changed to open_head_chunks to address review.
Now incrementing numHeadChunks directly.
2017-02-17 07:10:13 -06:00
Jeremy Meulemans 074050b8c0 Updating for failed codeclimate check. 2017-02-16 18:04:28 -06:00
Jeremy Meulemans f70b52d0b6 Adding gauge for number of open head chunks.
Fixes #1710
2017-02-16 17:56:45 -06:00
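A sketch of such a gauge with the client_golang library; the metric name and labels are inferred from the commit title and may not match the final code exactly.

    package main

    import "github.com/prometheus/client_golang/prometheus"

    // numHeadChunks is incremented when a head chunk is opened and
    // decremented when it is closed; the exact metric name is assumed here.
    var numHeadChunks = prometheus.NewGauge(prometheus.GaugeOpts{
        Namespace: "prometheus",
        Subsystem: "local_storage",
        Name:      "open_head_chunks",
        Help:      "The current number of open head chunks.",
    })

    func init() {
        prometheus.MustRegister(numHeadChunks)
    }

    func main() {
        numHeadChunks.Inc() // a new head chunk was opened
        numHeadChunks.Dec() // a head chunk was closed/persisted
    }
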
beorn7 bed4934224 storage: One more persist error code path discovered
Also, in that code path, set chunkDescsOffset to 0 rather than -1 in
case of "dropped more chunks from persistence than from memory" so
that no other weird things happen before the series is quarantined for
good.
2017-02-09 11:51:40 +01:00
beorn7 8c8baaa558 storage: writeMemorySeries needs to return true for quarantined series
This is another fallout of my bug hunt.
2017-02-08 16:28:56 +01:00
beorn7 65dc8f44d3 storage: Test for errors returned by MaybePopulateLastTime 2017-02-01 23:43:58 +01:00
Brian Brazil f64c231dad Allow checkpoints and maintenance to happen concurrently. (#2321)
This is essential on larger Prometheus servers, as otherwise
checkpoints prevent sufficient persisting of chunks to disk.
2017-01-13 17:24:19 +00:00
Brian Brazil 1dcb7637f5 Add various persistence related metrics (#2333)
Add metrics around checkpointing and persistence

* Add a metric to say if checkpointing is happening,
and another to track total checkpoint time and count.

This breaks the existing prometheus_local_storage_checkpoint_duration_seconds
by renaming it to prometheus_local_storage_checkpoint_last_duration_seconds
as the former name is more appropriate for a summary.

* Add metric for last checkpoint size.

* Add metric for series/chunks processed by checkpoints.

For long checkpoints it'd be useful to see how they're progressing.

* Add metric for dirty series

* Add metric for number of chunks persisted per series.

You can get the number of chunks from chunk_ops,
but not the matching number of series. This helps determine
the size of the writes being made.

* Add metric for chunks queued for persistence

Chunks created includes both chunks that'll need persistence
and chunks read in for queries. This only includes chunks created
for persistence.

* Code review comments on new persistence metrics.
2017-01-11 15:11:19 +00:00
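In the same spirit, a hedged client_golang sketch of a "checkpointing in progress" gauge plus a total-duration counter wrapped around the checkpoint write; the metric names here are illustrative, not necessarily the ones added by this PR.

    package main

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    var (
        // checkpointInProgress is 1 while a checkpoint is being written, 0 otherwise.
        checkpointInProgress = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "checkpoint_in_progress",
            Help: "1 if a checkpoint is currently under way, 0 otherwise.",
        })
        // checkpointSeconds accumulates the total time spent checkpointing.
        checkpointSeconds = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "checkpoint_seconds_total",
            Help: "Total time spent checkpointing, in seconds.",
        })
    )

    func init() {
        prometheus.MustRegister(checkpointInProgress, checkpointSeconds)
    }

    // checkpoint wraps the actual checkpoint write with the instrumentation.
    func checkpoint(write func()) {
        checkpointInProgress.Set(1)
        defer checkpointInProgress.Set(0)
        start := time.Now()
        defer func() { checkpointSeconds.Add(time.Since(start).Seconds()) }()
        write()
    }

    func main() {
        checkpoint(func() { time.Sleep(10 * time.Millisecond) })
    }
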
Mitsuhiro Tanda 7e369b9318 expose max memory chunks metrics (#2303)
* expose max memory chunks metrics
2016-12-27 18:34:07 +00:00
beorn7 253be23c00 storage: Sanity-check number of loaded chunk descs
Two cases:

- An unarchived metric must have at least one chunk desc loaded upon
  unarchival. Otherwise, the file is gone or has size 0, which is an
  inconsistency (because the series is still indexed in the archive
  index). Hence, quarantining is triggered.

- When loading the chunk descs of a series with a known chunkDescsOffset
  (i.e. != -1), the number of chunks loaded must be equal to
  chunkDescsOffset. If not, there is data corruption. An error is
  returned, which leads to quarantining.

In any case, there is a guard added to not access the 1st element of
an empty chunkDescs slice. (That's what triggered the crashes in issue
2249.)  A time series with unknown chunkDescsOffset and no chunks in
memory and no chunks on disk either could trigger that case. I would
assume such a "null series" doesn't exist, but it's not entirely
unthinkable that it could happen (perhaps in future uses of the
storage). (Create a series, and then something tries to preload chunks
before the first sample is added.)
2016-12-13 23:19:39 +01:00
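A simplified Go sketch of the two sanity checks and the guard, with stand-in types rather than Prometheus's real ones:

    package main

    import (
        "errors"
        "fmt"
    )

    // chunkDesc and memorySeries are simplified stand-ins for the real types.
    type chunkDesc struct{ firstTime int64 }

    type memorySeries struct {
        chunkDescs       []chunkDesc
        chunkDescsOffset int // -1 means the number of chunks on disk is unknown
    }

    // firstTime guards against indexing into an empty chunkDescs slice,
    // the crash described for issue 2249.
    func (s *memorySeries) firstTime() (int64, error) {
        if len(s.chunkDescs) == 0 {
            return 0, errors.New("series has no chunk descs in memory")
        }
        return s.chunkDescs[0].firstTime, nil
    }

    // checkLoadedChunkDescs mirrors the two cases above: an unarchived series
    // must have loaded at least one chunk desc, and a known offset must match
    // the number of chunks loaded from disk.
    func checkLoadedChunkDescs(s *memorySeries, loadedFromDisk int) error {
        if len(s.chunkDescs) == 0 {
            return errors.New("no chunk descs loaded for unarchived series: quarantine")
        }
        if s.chunkDescsOffset != -1 && loadedFromDisk != s.chunkDescsOffset {
            return fmt.Errorf("data corruption: loaded %d chunks, expected %d",
                loadedFromDisk, s.chunkDescsOffset)
        }
        return nil
    }

    func main() {
        s := &memorySeries{chunkDescsOffset: -1}
        if _, err := s.firstTime(); err != nil {
            fmt.Println("guarded:", err)
        }
    }
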
Fabian Reinartz 6703404cb4 Merge remote-tracking branch 'origin/release-1.2' 2016-11-01 16:35:22 +01:00
beorn7 c5bd178b93 Protect exported Querier interface method against negative time ranges 2016-11-01 15:05:01 +01:00
Fabian Reinartz 8fa18d564a storage: enhance Querier interface usage
This extracts Querier as an instantiable and closeable object
rather than just defining extending methods of the storage interface.
This improves composability and allows abstracting query transactions,
which can be useful for transaction-level caches, consistent data views,
and encapsulating teardown.
2016-10-16 10:39:29 +02:00
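A hypothetical sketch of the pattern (not Prometheus's actual API): the storage hands out a closeable Querier per query transaction instead of exposing the query methods directly, so teardown and per-transaction caching stay encapsulated.

    package main

    import "fmt"

    // All types here are illustrative stand-ins, not Prometheus's actual API.

    type Sample struct {
        Timestamp int64
        Value     float64
    }

    // Querier is an instantiable, closeable query transaction.
    type Querier interface {
        Query(from, through int64) ([]Sample, error)
        Close() error // releases per-transaction resources (caches, pinned chunks, ...)
    }

    // Storage hands out a Querier per transaction instead of exposing query
    // methods on the storage itself.
    type Storage interface {
        Querier() (Querier, error)
    }

    // memQuerier and memStorage are toy implementations so the sketch runs.
    type memQuerier struct{ samples []Sample }

    func (q *memQuerier) Query(from, through int64) ([]Sample, error) {
        var out []Sample
        for _, s := range q.samples {
            if s.Timestamp >= from && s.Timestamp <= through {
                out = append(out, s)
            }
        }
        return out, nil
    }

    func (q *memQuerier) Close() error { return nil }

    type memStorage struct{ samples []Sample }

    func (s *memStorage) Querier() (Querier, error) { return &memQuerier{s.samples}, nil }

    func main() {
        var s Storage = &memStorage{samples: []Sample{{10, 1.5}, {20, 2.5}}}
        q, err := s.Querier()
        if err != nil {
            panic(err)
        }
        defer q.Close() // teardown happens regardless of how the query goes
        res, _ := q.Query(0, 15)
        fmt.Println(len(res), "sample(s) in range")
    }
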
Björn Rabenstein 1e2f03f668 Merge pull request #2005 from redbaron/microoptimise-matching
Microoptimise matching
2016-10-05 17:26:56 +02:00
Maxim Ivanov e6db9f8159 New fpsForLabelMatchers and seriesForLabelMatchers methods
These more specific methods have replaced `metricForLabelMatchers`
in cases where its `map[fingerprint]metric` result type was
not necessary or was used as an intermediate step.

Avoids duplicated calls to `seriesForRange` from
`QueryRange` and `QueryInstant` methods.
2016-10-05 15:15:54 +01:00
Julius Volz c9d4526428 Unpublish accidentally published series methods
There were some more accidentally published methods of the memorySeries
type which I didn't notice when reviewing https://github.com/prometheus/prometheus/pull/2011
2016-10-03 00:04:56 +02:00
Maxim Ivanov 4978a65495 Extract initial FP candidate build logic into candidateFPsForLabelMatchers method
No functional changes otherwise
2016-10-02 17:35:02 +01:00
Maxim Ivanov c048a0cde8 Add metrics to result after checking all matchers
Should be marginally faster and somewhat more GC friendly
2016-10-02 17:35:02 +01:00
Julius Volz c25f0de5ae Remove local.ZeroSample{,Pair}, use model definitions 2016-09-28 23:42:45 +02:00
Julius Volz 044ebce779 Review fixups. 2016-09-28 23:42:44 +02:00
Julius Volz d30a3c7c0f Fix accidental publishing of memorySeries.firstTime() 2016-09-26 13:25:27 +02:00
Julius Volz ab80ced756 storage: separate chunk package, publish more names
This is a followup to https://github.com/prometheus/prometheus/pull/2011.

This publishes more of the methods and other names of the chunk code and
moves the chunk code to its own package. There's some unavoidable
ugliness: the chunk and chunkDesc metrics are used by both packages, so
I had to move them to the chunk package. That isn't great, but I don't
see how to do it better without a larger redesign of everything. Same
for the evict requests and some other types.
2016-09-26 13:25:11 +02:00
Matthew Campbell 67d76e3a5d timeseries: store varbit encoded data into cassandra 2016-09-21 17:56:55 +02:00
Julius Volz c187308366 storage: Contextify storage interfaces.
This is based on https://github.com/prometheus/prometheus/pull/1997.

This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation is supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
2016-09-19 16:29:07 +02:00
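An illustrative sketch (a stand-in function, not the real storage method) of what contextifying buys: per-query timeouts and cancellations attached by PromQL can now propagate into the storage layer.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // queryRange is a stand-in for a contextified storage query method: it
    // checks ctx between units of work so cancellations and timeouts reach
    // the storage instead of letting the query run to completion.
    func queryRange(ctx context.Context, fps []uint64) ([]string, error) {
        var res []string
        for _, fp := range fps {
            select {
            case <-ctx.Done():
                return nil, ctx.Err()
            default:
            }
            time.Sleep(30 * time.Millisecond) // stand-in for loading chunks for fp
            res = append(res, fmt.Sprintf("series %d", fp))
        }
        return res, nil
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
        defer cancel()
        if _, err := queryRange(ctx, []uint64{1, 2, 3}); errors.Is(err, context.DeadlineExceeded) {
            fmt.Println("query aborted:", err)
        }
    }
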
Julius Volz 3bfec97d46 Make the storage interface higher-level.
See discussion in
https://groups.google.com/forum/#!topic/prometheus-developers/bkuGbVlvQ9g

The main idea is that the user of a storage shouldn't have to deal with
fingerprints anymore, and should not need to do an individual preload
call for each metric. The storage interface needs to be made more
high-level to not expose these details.

This also makes it easier to reuse the same storage interface for remote
storages later, as fewer roundtrips are required and the fingerprint
concept doesn't work well across the network.

NOTE: this deliberately gets rid of a small optimization in the old
query Analyzer, where we dedupe instants and ranges for the same series.
This should have a minor impact, as most queries do not have multiple
selectors loading the same series (and at the same offset).
2016-07-25 13:59:22 +02:00
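A rough before/after sketch with made-up method names (the real interfaces differ) to illustrate the shift away from fingerprints and per-series preloading:

    package main

    // All types and method names below are illustrative, not Prometheus's API.

    type series struct {
        labels  map[string]string
        samples []float64
    }

    // Before: the caller resolved fingerprints itself and issued one preload
    // call per series before it could read any data.
    type lowLevelQuerier interface {
        FingerprintsForMatchers(matchers map[string]string) []uint64
        PreloadChunksForRange(fp uint64, from, through int64) error
        SeriesForFingerprint(fp uint64) series
    }

    // After: one high-level call per selector; fingerprints and preloading
    // stay internal, which also suits remote storages (fewer round trips,
    // and the fingerprint concept never has to cross the network).
    type highLevelQuerier interface {
        QueryRange(from, through int64, matchers map[string]string) ([]series, error)
    }

    func main() {} // interfaces only; nothing to run in this sketch
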
beorn7 fc6737b7fb storage: improve index lookups
tl;dr: This is not a fundamental solution to the indexing problem
(like tindex is), but it at least avoids the intersection problem to
the greatest possible extent.

In more detail:

Imagine the following query:

    nicely:aggregating:rule{job="foo",env="prod"}

While it uses a nicely aggregating recording rule (which might have a
very low cardinality), Prometheus still intersects the low number of
fingerprints for `{__name__="nicely:aggregating:rule"}` with the many
thousands of fingerprints matching `{job="foo"}` and with the millions
of fingerprints matching `{env="prod"}`. This totally innocuous query
is dead slow if the Prometheus server has a lot of time series with
the `{env="prod"}` label. Ironically, if you make the query more
complicated, it becomes blazingly fast:

    nicely:aggregating:rule{job=~"foo",env=~"prod"}

Why so? Because Prometheus only intersects with non-Equal matchers if
there are no Equal matchers. That's good in this case because it
retrieves the few fingerprints for
`{__name__="nicely:aggregating:rule"}` and then goes right ahead to
retrieve the metrics for those FPs, checking individually whether they
match the other matchers.

This change generalizes the idea of when to stop intersecting FPs
and go into "retrieve metrics and check them individually against
remaining matchers" mode:

- First, sort all matchers by "expected cardinality". Matchers
  matching the empty string are always worst (and never used for
  intersections). Equal matchers are in general considered best, but by
  using some crude heuristics, we declare some better than others
  (instance labels or anything that looks like a recording rule).

- Then go through the matchers until we hit a threshold of remaining
  FPs in the intersection. This threshold is higher if we are already
  in the non-Equal matcher area as intersection is even more expensive
  here.

- Once the threshold has been reached (or we have run out of matchers
  that do not match the empty string), start with "retrieve metrics
  and check them individually against remaining matchers".

A beefy server at SoundCloud was spending 67% of its CPU time in index
lookups (fingerprintsForLabelPairs), serving mostly a dashboard that
is exclusively built with recording rules. With this change, it spends
only 35% in fingerprintsForLabelPairs. The CPU usage dropped from 26
cores to 18 cores. The median latency for query_range dropped from 14s
to 50ms(!). As expected, higher percentile latency didn't improve that
much because the new approach is _occasionally_ running into the worst
case while the old one was _systematically_ doing so. The 99th
percentile latency is now about as high as the median before (14s)
while it was almost twice as high before (26s).
2016-07-20 17:35:53 +02:00
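A toy Go version of the strategy, with made-up cardinality heuristics and a single threshold (the real code uses more refined estimates and a higher threshold once only non-Equal matchers remain):

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // matcher is a simplified label matcher; isEqual distinguishes Equal
    // matchers from regex and other non-Equal matchers.
    type matcher struct {
        name, value string
        isEqual     bool
    }

    // cardinalityRank is a crude stand-in for the heuristics described above:
    // lower rank means "expected to match fewer FPs", so it is intersected first.
    func cardinalityRank(m matcher) int {
        switch {
        case m.isEqual && strings.Contains(m.value, ":"): // looks like a recording rule
            return 0
        case m.isEqual && m.name == "instance":
            return 1
        case m.isEqual:
            return 2
        default: // non-Equal matchers are intersected last, if at all
            return 3
        }
    }

    // selectFPs intersects posting lists until the candidate set is small enough,
    // then leaves the remaining matchers to be checked per retrieved metric.
    func selectFPs(index map[matcher][]uint64, matchers []matcher, threshold int) []uint64 {
        sort.Slice(matchers, func(i, j int) bool {
            return cardinalityRank(matchers[i]) < cardinalityRank(matchers[j])
        })
        fps := index[matchers[0]]
        for _, m := range matchers[1:] {
            if len(fps) <= threshold {
                break // few enough FPs: check remaining matchers individually
            }
            fps = intersect(fps, index[m])
        }
        return fps
    }

    func intersect(a, b []uint64) []uint64 {
        in := make(map[uint64]bool, len(b))
        for _, fp := range b {
            in[fp] = true
        }
        var out []uint64
        for _, fp := range a {
            if in[fp] {
                out = append(out, fp)
            }
        }
        return out
    }

    func main() {
        index := map[matcher][]uint64{
            {"__name__", "nicely:aggregating:rule", true}: {1, 2},
            {"job", "foo", true}:                          {1, 2, 3, 4},
            {"env", "prod", true}:                         {1, 2, 3, 4, 5, 6},
        }
        ms := []matcher{{"env", "prod", true}, {"job", "foo", true}, {"__name__", "nicely:aggregating:rule", true}}
        fmt.Println(selectFPs(index, ms, 3)) // only the cheap FP set is intersected
    }
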
beorn7 064b57858e Consistently use the Seconds() method for conversion of durations
This also fixes one remaining case of recording integral numbers
of seconds only for a metric, i.e. this will probably fix #1796.
2016-07-07 15:24:35 +02:00
Julius Volz 91401794fa storage: Make MemorySeriesStorage a public type
See https://twitter.com/fabxc/status/748032597876482048
2016-06-29 08:14:23 +02:00
Fabian Reinartz 425736a377 *: remove last remainers of non-second metrics 2016-06-23 17:50:39 +02:00
Julius Volz b7b6717438 Separate query interface out of local.Storage.
PromQL only requires a much narrower interface than local.Storage in
order to run queries. Narrower interfaces are easier to replace and
test, too.

We could also change the web interface to use local.Querier, except that
we'll probably use appending functions from there in the future.
2016-06-23 15:14:38 +02:00
beorn7 99881ded63 Make the number of fingerprint mutexes configurable
With a lot of series accessed in a short timeframe (by a query, a
large scrape, checkpointing, ...), there is actually quite a
significant amount of lock contention if something similar is running
at the same time.

In those cases, the number of locks needs to be increased.

On the same front, as our fingerprints don't have a lot of entropy, I
introduced some additional shuffling. With the current state, only
changes in the least significant bits of an FP would matter.
2016-06-02 19:18:00 +02:00
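A minimal sketch of the idea (lock striping with a configurable stripe count plus bit shuffling), not the actual Prometheus fingerprint locker:

    package main

    import (
        "fmt"
        "sync"
    )

    // fpLocker guards per-fingerprint critical sections with a configurable
    // number of striped mutexes.
    type fpLocker struct {
        mutexes []sync.Mutex
    }

    func newFPLocker(n uint) *fpLocker {
        return &fpLocker{mutexes: make([]sync.Mutex, n)}
    }

    // shuffle mixes the fingerprint's bits so the chosen stripe doesn't depend
    // only on the least significant bits of low-entropy fingerprints.
    func shuffle(fp uint64) uint64 {
        fp ^= fp >> 33
        fp *= 0xff51afd7ed558ccd // mixing constant borrowed from murmur-style finalizers
        fp ^= fp >> 33
        return fp
    }

    func (l *fpLocker) Lock(fp uint64)   { l.mutexes[shuffle(fp)%uint64(len(l.mutexes))].Lock() }
    func (l *fpLocker) Unlock(fp uint64) { l.mutexes[shuffle(fp)%uint64(len(l.mutexes))].Unlock() }

    func main() {
        locker := newFPLocker(1024) // the stripe count is the now-configurable knob
        fp := uint64(0xdeadbeef)
        locker.Lock(fp)
        fmt.Println("holding lock for fingerprint", fp)
        locker.Unlock(fp)
    }
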
beorn7 b2ef4dc52d Correctly identify no-op appends if the value is NaN
This requires updating the vendored common.model package, which
I will do once https://github.com/prometheus/common/pull/40 is merged.
2016-05-19 18:32:47 +02:00
beorn7 07a294ac15 Doc comment fixes 2016-04-26 01:05:56 +02:00
beorn7 20cba1ed8f Initialize metric vectors in memorySeriesStorage 2016-04-25 17:08:07 +02:00
beorn7 d566808d40 Bring back logging of discarded samples
But only on DEBUG level.

Also, count and report the two cases of out-of-order timestamps on the
one hand and same timestamp but different value on the other hand
separately.
2016-04-25 16:43:52 +02:00
beorn7 a90d645378 Checkpoint fingerprint mappings only upon shutdown
Before, we checkpointed after every newly detected fingerprint
collision, which is not a problem as long as collisions are
rare. However, with a sufficient number of metrics or a particular
nature of the data set, there might be a lot of collisions, all to be
detected upon the first set of scrapes, and then the checkpointing
after each detection will take quite a long time (it's O(n²),
essentially).

Since we are rebuilding the fingerprint mapping during crash recovery,
the previous, very conservative approach didn't even buy us
anything. We only ever read from the checkpoint file after a clean
shutdown, so the only time we need to write the checkpoint file is
during a clean shutdown.
2016-04-15 01:03:28 +02:00
beorn7 199f309a39 Resurrect and rename invalid preload requests count metric.
It is now also used in label matching, so the name of the metric
changed from `prometheus_local_storage_invalid_preload_requests_total`
to `non_existent_series_matches_total`.
2016-03-13 11:54:24 +01:00
beorn7 e8c1f30ab2 Merge the parallel logic of getSeriesForRange and metricForFingerprint 2016-03-09 21:56:15 +01:00
beorn7 9445c7053d Add tests for range-limited label matching
While doing so, improve getSeriesForRange.
2016-03-09 21:01:03 +01:00
beorn7 47e3c90f9b Clean up error propagation
Only return an error where callers are doing something with it other
than simply logging and ignoring it.

All the errors touched in this commit flag the storage as dirty
anyway, and that fact is logged anyway. So most of what is being
removed here is just log spam.

As discussed earlier, the class of errors that flags the storage as
dirty signals fundamental corruption, so even bubbling up a one-time
warning to the user (e.g. about incomplete results) isn't helping much
because _anything_ happening in the storage has to be doubted from
that point on (and in fact retroactively into the past, too). Flagging
the storage dirty, and alerting on it (plus marking the state in the
web UI) is the only way I can see right now.

As a byproduct, I cleaned up the setDirty method a bit and improved
the logged errors.
2016-03-09 18:56:30 +01:00
beorn7 99854a84d7 Merge branch 'beorn7/storage6' into beorn7/storage7 2016-03-09 17:23:25 +01:00
beorn7 b343e65907 Merge branch 'beorn7/storage4' into beorn7/storage5
Merge is necessary.
2016-03-09 17:14:42 +01:00
beorn7 d0a4477446 Merge branch 'beorn7/storage3' into beorn7/storage4
Conflicts:
	storage/local/preload.go
	storage/local/storage.go
	storage/local/storage_test.go
2016-03-09 17:13:16 +01:00
beorn7 55eddab25f Merge branch 'beorn7/storage2' into beorn7/storage3 2016-03-09 16:48:46 +01:00
beorn7 beb36df4bb De-flag preloadChunksForRange
Now there are preloadChunksForRange and preloadChunksForInstant in
both the series and the storage.
2016-03-09 14:50:09 +01:00