Commit graph

263 commits

Author SHA1 Message Date
beorn7 c72979e3ed Remove a redundancy from Gorilla-style chunks
So far, the last sample in a chunk was saved twice. That's required
for adding more samples as we need to know the last sample added to
add more samples without iterating through the whole chunk. However,
once the last sample was added to the chunk before it's full, there is
no need to save it twice. Thus, the very last sample added to a chunk
can _only_ be saved in the header fields for the last sample. The
chunk has to be identifiable as closed, then. This information has
been added to the flags byte.
2016-03-20 23:09:48 +01:00
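To illustrate the "closed" marker described in commit c72979e3ed, here is a minimal sketch of keeping such a bit in a flags byte. The bit position and the helper names `markClosed` and `isClosed` are assumptions for illustration; the actual layout is defined in the Prometheus code and is not reproduced here.

```go
package main

import "fmt"

// flagClosed is a hypothetical bit in the chunk's flags byte. The real
// bit position used by the Gorilla-style chunk is not stated in the commit.
const flagClosed byte = 1 << 0

// markClosed sets the closed bit and leaves all other flag bits untouched.
func markClosed(flags byte) byte { return flags | flagClosed }

// isClosed reports whether the closed bit is set.
func isClosed(flags byte) bool { return flags&flagClosed != 0 }

func main() {
	var flags byte = 0x02 // some other, unrelated flag is already set
	fmt.Println(isClosed(flags))
	flags = markClosed(flags)
	fmt.Println(isClosed(flags), flags)
}
```

Once a chunk is marked as closed, a reader knows that the last sample lives only in the header fields and must not expect a second copy in the chunk body.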
beorn7 b6dbb826ae Improve fuzz testing and fix a bug exposed
This improves fuzz testing in two ways:

(1) More realistic time stamps. So far, the most common case in
practice was very rare in the test: Completely regular increases of
the timestamp.

(2) Verify samples by scanning through the whole relevant section of
the series.

For Gorilla-like chunks, this showed two things:

(1) With more regularly increasing time stamps, BenchmarkFuzz is
essentially as fast as with the traditional chunks:

```
BenchmarkFuzzChunkType0-8              2         972514684 ns/op        83426196 B/op    2500044 allocs/op
BenchmarkFuzzChunkType1-8              2         971478001 ns/op        82874660 B/op    2512364 allocs/op
BenchmarkFuzzChunkType2-8              2         999339453 ns/op        76670636 B/op    2366116 allocs/op
```

(2) There was a bug related to when and how the chunk footer is
overwritten to make use for the last sample. This wasn't exposed by
random access as the last sample of a chunk is retrieved from the
values in the header in that case.
2016-03-20 17:21:28 +01:00
beorn7 9d8fbbe822 Review improvements 2016-03-17 17:31:56 +01:00
beorn7 8cdced3850 Implement Gorilla-inspired chunk encoding
This is not a verbatim implementation of the Gorilla encoding.  First
of all, it could not, even if we wanted, because Prometheus has a
different chunking model (constant size, not constant time).  Second,
this adds a number of changes that improve the encoding in general or
at least for the specific use case of Prometheus (and are partially
only possible in the context of Prometheus). See comments in the code
for details.
2016-03-17 14:47:08 +01:00
beorn7 8e64e8dfca Fix return statement. 2016-03-17 14:43:00 +01:00
Björn Rabenstein 98c8560851 Merge pull request #1477 from prometheus/beorn7/storage7
Solve the series churn problem...
2016-03-17 14:39:28 +01:00
beorn7 e7ac9c6863 Improvements based on review
- Moved returns into the default section of switch statement that can
  only happen then.

- Fix typo.
2016-03-17 14:37:24 +01:00
beorn7 199f309a39 Resurrect and rename invalid preload requests count metric.
It is now also used in label matching, so the name of the metric
changed from `prometheus_local_storage_invalid_preload_requests_total`
to `non_existent_series_matches_total`.
2016-03-13 11:54:24 +01:00
beorn7 e8c1f30ab2 Merge the parallel logic of getSeriesForRange and metricForFingerprint 2016-03-09 21:56:15 +01:00
beorn7 9445c7053d Add tests for range-limited label matching
While doing so, improve getSeriesForRange.
2016-03-09 21:01:03 +01:00
beorn7 47e3c90f9b Clean up error propagation
Only return an error where callers do something with it other than
simply logging and ignoring it.

All the errors touched in this commit flag the storage as dirty
anyway, and that fact is logged anyway. So most of what is being
removed here is just log spam.

As discussed earlier, the class of errors that flags the storage as
dirty signals fundamental corruption, so even bubbling up a one-time
warning to the user (e.g. about incomplete results) isn't helping much
because _anything_ happening in the storage has to be doubted from
that point on (and in fact retroactively into the past, too). Flagging
the storage dirty, and alerting on it (plus marking the state in the
web UI) is the only way I can see right now.

As a byproduct, I cleaned up the setDirty method a bit and improved
the logged errors.
2016-03-09 18:56:30 +01:00
beorn7 99854a84d7 Merge branch 'beorn7/storage6' into beorn7/storage7 2016-03-09 17:23:25 +01:00
beorn7 5e4fa96719 Merge branch 'beorn7/storage5' into beorn7/storage6 2016-03-09 17:21:32 +01:00
beorn7 b343e65907 Merge branch 'beorn7/storage4' into beorn7/storage5
erge is necessary,
2016-03-09 17:14:42 +01:00
beorn7 d0a4477446 Merge branch 'beorn7/storage3' into beorn7/storage4
Conflicts:
	storage/local/preload.go
	storage/local/storage.go
	storage/local/storage_test.go
2016-03-09 17:13:16 +01:00
beorn7 55eddab25f Merge branch 'beorn7/storage2' into beorn7/storage3 2016-03-09 16:48:46 +01:00
beorn7 161eada3ad Make chunkIterator even leaner. 2016-03-09 16:20:39 +01:00
beorn7 beb36df4bb De-flag preloadChunksForRange
Now there are preloadChunksForRange and preloadChunksForInstant in
both the series and the storage.
2016-03-09 14:50:09 +01:00
beorn7 836f1db04c Improve MetricsForLabelMatchers
WIP: This needs more tests.

It now gets a from and through value, which it may opportunistically
use to optimize the retrieval. With possible future range indices,
this could be used in a very efficient way. This change merely applies
some easy checks, which should nevertheless solve the use case of
heavy rule evaluations on servers with a lot of series churn.

The idea is the following:

- Only archive series that are at least as old as the headChunkTimeout
  (which was already extremely unlikely to happen).

- Then maintain a high watermark for the last archival, i.e. no
  archived series has a sample more recent than that watermark.

- Any query that doesn't reach to a time before that watermark doesn't
  have to touch the archive index at all. (A production server at
  Soundcloud with the aforementioned series churn and heavy rule
  evaluations spends 50% of its CPU time in archive index
  lookups. Since rule evaluations usually only touch very recent
  values, most of those lookups should disappear with this change.)

- Federation with a very broad label matcher will profit from this,
  too.

As a byproduct, the un-needed MetricForFingerprint method was removed
from the Storage interface.
2016-03-09 00:25:59 +01:00
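The watermark idea from commit 836f1db04c can be sketched as follows. This is a simplified illustration, not the Prometheus implementation: the `archiveIndex` type, its fields, and its method names are all hypothetical, and a plain map stands in for the on-disk archive index.

```go
package main

import "fmt"

// archiveIndex is a hypothetical, simplified stand-in for the archive index.
type archiveIndex struct {
	lastTimes     map[string]int64 // series ID -> timestamp of its last sample
	highWatermark int64            // no archived series has a sample newer than this
	lookups       int              // counts the (expensive) index scans performed
}

// archive records a series and raises the watermark if necessary.
func (a *archiveIndex) archive(id string, lastTime int64) {
	a.lastTimes[id] = lastTime
	if lastTime > a.highWatermark {
		a.highWatermark = lastTime
	}
}

// seriesFrom returns the archived series that have samples at or after from.
// A query that starts after the watermark never touches the index.
func (a *archiveIndex) seriesFrom(from int64) []string {
	if from > a.highWatermark {
		return nil // cheap path: nothing archived can be this recent
	}
	a.lookups++
	var ids []string
	for id, last := range a.lastTimes {
		if last >= from {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	a := &archiveIndex{lastTimes: map[string]int64{}}
	a.archive("series_a", 100)
	a.archive("series_b", 200)
	fmt.Println(a.seriesFrom(250), a.lookups) // recent query, index skipped
	fmt.Println(a.seriesFrom(150), a.lookups) // older query, index scanned
}
```

Rule evaluations mostly resemble the first call: they ask for very recent data, so the comparison against the watermark is the only work done for archived series.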
beorn7 167b83695c Merge branch 'beorn7/storage5' into beorn7/storage6 2016-03-08 00:20:44 +01:00
beorn7 01795382c9 Merge branch 'beorn7/storage4' into beorn7/storage5 2016-03-08 00:20:13 +01:00
beorn7 f7fc542db6 Merge branch 'master' into beorn7/storage4
Conflicts:
	storage/local/persistence.go
2016-03-08 00:14:00 +01:00
beorn7 3d86130d8c Merge branch 'master' into beorn7/storage3 2016-03-07 23:39:12 +01:00
beorn7 1f30c8de8d Merge branch 'master' into beorn7/storage2 2016-03-07 23:38:42 +01:00
beorn7 c13b1ecfe9 Make chunk iterators more DRY
This finally extracts all the common code of the two chunk iterators
into one. Any future chunk encodings with fast access by index can use
the same iterator by simply providing an indexAccessor. Other future
chunk encodings without fast index access (like Gorilla-style) can
still implement the chunkIterator interface as usual.
2016-03-07 20:23:14 +01:00
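As a rough sketch of what commit c13b1ecfe9 describes, the following shows how shared lookup logic can be written once against a small index-access interface. Only the name `indexAccessor` comes from the commit message; its methods, the `sliceAccessor` type, and `valueAtOrBefore` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// indexAccessor gives random access to the samples of a chunk.
// The method set is hypothetical; the commit only names the type.
type indexAccessor interface {
	numSamples() int
	timestampAt(i int) int64
	valueAt(i int) float64
}

// sliceAccessor is a toy encoding that stores samples in plain slices.
type sliceAccessor struct {
	ts []int64
	vs []float64
}

func (s sliceAccessor) numSamples() int         { return len(s.ts) }
func (s sliceAccessor) timestampAt(i int) int64 { return s.ts[i] }
func (s sliceAccessor) valueAt(i int) float64   { return s.vs[i] }

// valueAtOrBefore works for any encoding with fast index access. It binary
// searches for the first sample after t and returns the one preceding it.
func valueAtOrBefore(acc indexAccessor, t int64) (float64, bool) {
	i := sort.Search(acc.numSamples(), func(i int) bool {
		return acc.timestampAt(i) > t
	})
	if i == 0 {
		return 0, false // every sample is later than t
	}
	return acc.valueAt(i - 1), true
}

func main() {
	acc := sliceAccessor{ts: []int64{10, 20, 30}, vs: []float64{1, 2, 3}}
	fmt.Println(valueAtOrBefore(acc, 25))
	fmt.Println(valueAtOrBefore(acc, 5))
}
```

An encoding without fast index access would not fit this interface and would need its own iterator, which matches the distinction drawn in the commit message.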
beorn7 32f280a3cd Slim down the chunkIterator interface
For one, remove unneeded methods.

Then, instead of using a channel for all values, use a
bufio.Scanner-like interface. This removes the need for creating a
goroutine and avoids the (unnecessary) locking performed by channel
sending and receiving.

This will make it much easier to write new chunk implementations (like
Gorilla-style encoding).
2016-03-07 19:50:13 +01:00
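The bufio.Scanner-like pattern mentioned in commit 32f280a3cd can be sketched like this. The interface and type names below are hypothetical and only show the general shape of the pattern; the actual chunkIterator methods are defined in the Prometheus source.

```go
package main

import "fmt"

// samplePair is a hypothetical stand-in for a (timestamp, value) sample.
type samplePair struct {
	Timestamp int64
	Value     float64
}

// sampleIterator follows the bufio.Scanner pattern: call scan until it
// returns false, read the current element with value, then check err.
type sampleIterator interface {
	scan() bool
	value() samplePair
	err() error
}

// sliceIterator is a toy implementation backed by a slice.
type sliceIterator struct {
	samples []samplePair
	pos     int // index of the next sample to hand out
}

func (it *sliceIterator) scan() bool {
	if it.pos >= len(it.samples) {
		return false
	}
	it.pos++
	return true
}

func (it *sliceIterator) value() samplePair { return it.samples[it.pos-1] }
func (it *sliceIterator) err() error        { return nil }

// sum drains an iterator without any goroutine or channel involved.
func sum(it sampleIterator) (float64, error) {
	var total float64
	for it.scan() {
		total += it.value().Value
	}
	return total, it.err()
}

func main() {
	it := &sliceIterator{samples: []samplePair{{1, 1.5}, {2, 2.5}, {3, 4}}}
	fmt.Println(sum(it))
}
```

Compared with sending every sample over a channel, the caller drives the iteration directly, so no producer goroutine has to be started or cleaned up.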
beorn7 b6fdb355d7 Move dump-heads into its own tool 2016-03-07 16:30:19 +01:00
beorn7 f193f2b8ef Add a command to promtool that dumps metadata of heads.db
I needed this today for debugging. It can certainly be improved, but
it's already quite helpful.

I refactored the reading of heads.db files out of persistence, which
is an improvement, too.

I made minor changes to the cli package to allow outputting via the
io.Writer interface.
2016-03-07 16:21:57 +01:00
beorn7 fc7de5374a Quarantine series upon problem writing to the series file
This fixes https://github.com/prometheus/prometheus/issues/1059 , but
not in the obvious way (simply not updating the persist watermark,
because that's actually not that simple - we don't really know what
has gone wrong exactly). As any errors relevant here are most likely
caused by severe and unrecoverable problems with the series file,
using the new quarantine feature is the right step. We don't really
have to be worried about any inconsistent state of the series because
it will be removed for good ASAP. Another plus is that we don't have
to declare the whole storage dirty anymore.
2016-03-03 13:15:02 +01:00
beorn7 0ea5801e47 Handle errors caused by data corruption more gracefully
This requires all the panic calls upon unexpected data to be converted
into errors returned. This pollutes the function signatures quite a
lot. Well, this is Go...

The ideas behind this are the following:

- panic only if it's a programming error. Data corruptions happen, and
  they are not programming errors.

- If we detect a data corruption, we "quarantine" the series,
  essentially removing it from the database and putting its data into
  a separate directory for forensics.

- Failure during writing to a series file is not considered corruption
  automatically. It will call setDirty, though, so that a
  crashrecovery upon the next restart will commence and check for
  that.

- Series quarantining and setDirty calls are logged and counted in
  metrics, but are hidden from the user of the interfaces in
  interface.go, with the notable exception of Append(). The reasoning
  is that we treat corruption by removing the corrupted series, i.e. a
  query for it will return no results on its next call anyway, so
  return no results right now. In the case of Append(), we want to
  tell the user that no data has been appended, though.

Minor side effects:

- Now consistently using filepath.* instead of path.*.

- Introduced structured logging where I touched it. This makes things
  less consistent, but a complete change to structured logging would
  be out of scope for this PR.
2016-03-02 23:02:34 +01:00
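The error-handling policy of commit 0ea5801e47 can be illustrated with a small sketch. Everything here is hypothetical (the `store` type, `decode`, `read`, and the in-memory "quarantine" map that stands in for the separate forensics directory); it only shows the control flow of returning an error on corrupt data and hiding it from the reader by quarantining the series.

```go
package main

import (
	"errors"
	"fmt"
)

var errCorrupt = errors.New("chunk data corrupted")

// store is a hypothetical in-memory stand-in for the series storage.
type store struct {
	series      map[string][]byte
	quarantined map[string][]byte // stands in for the forensics directory
}

// decode returns an error on unexpected data instead of panicking, since
// data corruption is not a programming error.
func decode(b []byte) (byte, error) {
	if len(b) == 0 {
		return 0, errCorrupt
	}
	return b[0], nil
}

// read hides corruption from the caller: a corrupted series is moved to
// quarantine and the query simply yields no result.
func (s *store) read(id string) (byte, bool) {
	b, ok := s.series[id]
	if !ok {
		return 0, false
	}
	v, err := decode(b)
	if err != nil {
		s.quarantined[id] = b
		delete(s.series, id)
		return 0, false
	}
	return v, true
}

func main() {
	s := &store{
		series:      map[string][]byte{"ok": {7}, "bad": {}},
		quarantined: map[string][]byte{},
	}
	fmt.Println(s.read("ok"))
	fmt.Println(s.read("bad"))
	fmt.Println(len(s.quarantined))
}
```

A write path would differ from this read path: as the commit message notes for Append(), the caller should be told that no data was appended.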
beorn7 b6840997a7 Merge branch 'beorn7/storage2' into beorn7/storage3 2016-03-02 16:11:25 +01:00
beorn7 ce58fd357b Merge branch 'beorn7/storage' into beorn7/storage2
Conflicts:
	storage/local/chunk.go
	storage/local/interface.go
2016-03-02 16:09:32 +01:00
beorn7 2581648f70 Separate iterators by offset
Add test that exposes the problem.
2016-03-02 16:01:03 +01:00
beorn7 c740789ce3 Improve predict_linear
Fixes https://github.com/prometheus/prometheus/issues/1401

This removes the last (and in fact bogus) use of BoundaryValues.

Thus, a whole lot of unused (and arguably sub-optimal / ugly) code can
be removed here, too.
2016-02-25 12:10:55 +01:00
beorn7 4b503ed9a5 Merge branch 'master' into beorn7/storage2 2016-02-24 14:03:49 +01:00
beorn7 059295332f Merge remote-tracking branch 'origin/master' into beorn7/storage 2016-02-24 14:02:27 +01:00
beorn7 53005c3085 Merge branch 'beorn7/storage' into beorn7/storage2 2016-02-24 14:00:56 +01:00
beorn7 28e9bbc15f Populate chunkDesc.chunkLastTime during checkpoint loading, too 2016-02-24 13:58:34 +01:00
Björn Rabenstein a8c79f0a0c Merge pull request #1422 from prometheus/release-0.17
Merge more commits from 0.17.
2016-02-23 23:07:44 +01:00
beorn7 8fa1560e48 Fix a very special case of handling the checkpoint timer 2016-02-23 16:48:35 +01:00
beorn7 41e44f6ab9 Merge branch 'master' into beorn7/storage2 2016-02-22 16:54:33 +01:00
Björn Rabenstein d9eb624322 Merge pull request #1415 from prometheus/release-0.17
Forward-merge release-0.17 into master
2016-02-22 16:39:48 +01:00
beorn7 4d1f7b49b6 Fix a race condition in calculatePersistenceUrgencyScore 2016-02-22 15:48:39 +01:00
beorn7 454ecf3f52 Rework the way ranges and instants are handled
In a way, our instants were also ranges, just with the staleness delta
as range length. They are now treated equally, just that in one case,
the range length is set as range, in the other the staleness
delta. However, there are "real" instants where start and end time of
a query is the same. In those cases, we only want to return a single
value (the one closest before or at the equal start and end time). If
that value is the last sample in the series, odds are we have it
already in the series object. In that case, there is no need to pin or
load any chunks. A special singleSampleSeriesIterator is created for
that. This should greatly speed up instant queries as they happen
frequently for rule evaluations.
2016-02-22 01:47:18 +01:00
beorn7 b876f8e6a5 Move lastSamplePair method up to memorySeries
This implies a slight change of behavior as only samples added to the
respective instance of a memorySeries are returned. However, this is
most likely anyway what we want.

Following cases:

- Server has been restarted: Given the time it takes to cleanly
  shutdown and start up a server, the series are now stale anyway. An
  improved staleness handling (still to be implemented) will be based
  on tracking if a given target is continuing to expose samples for a
  given time series. In that case, we need a full scrape cycle to
  decide about staleness. So again, it makes sense to consider
  everything stale directly after a server restart.

- Series unarchived due to a read request: The series is definitely
  stale so we don't want to return anything anyway.

- Freshly created time series or series unarchived because of a sample
  append: That happens because appending a sample is imminent. Before
  the fingerprint lock is released, the series will have received a
  sample, and lastSamplePair will always return the expected value.
2016-02-19 18:16:41 +01:00
beorn7 1e13f89039 Return SamplePair instead of *SamplePair consistently
Formalize ZeroSamplePair as return value for non-existing samples.

Change LastSamplePairForFingerprint to return a SamplePair (and not a
pointer to it), which saves allocations in a potentially extremely
frequent call.
2016-02-19 17:00:40 +01:00
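A minimal sketch of the value-return pattern from commit 1e13f89039 follows. The names `SamplePair` and `ZeroSamplePair` come from the commit message, but the field layout, the choice of `math.MinInt64` as the sentinel timestamp, and the `lastSample` helper are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// SamplePair is a small value type, cheap to return by value.
type SamplePair struct {
	Timestamp int64
	Value     float64
}

// ZeroSamplePair signals "no such sample". Using the earliest possible
// timestamp as the sentinel is an assumption made for this sketch.
var ZeroSamplePair = SamplePair{Timestamp: math.MinInt64}

// lastSample returns the last sample by value, so no pointer has to be
// handed out and the result does not need to live on the heap.
func lastSample(samples []SamplePair) SamplePair {
	if len(samples) == 0 {
		return ZeroSamplePair
	}
	return samples[len(samples)-1]
}

func main() {
	fmt.Println(lastSample(nil) == ZeroSamplePair)
	fmt.Println(lastSample([]SamplePair{{1, 2}, {3, 4}}))
}
```

Callers compare the result against the sentinel instead of checking for nil, which keeps the hot path free of pointer indirection.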
beorn7 d290340367 Fix and improve chunkDesc locking 2016-02-19 16:24:38 +01:00
beorn7 0e202dacb4 Streamline series iterator creation
This will fix issue #1035 and will also help to make issue #1264 less
bad.

The fundamental problem in the current code:

In the preload phase, we quite accurately determine which chunks will
be used for the query being executed. However, in the subsequent step
of creating series iterators, the created iterators are referencing
_all_ in-memory chunks in their series, even the un-pinned ones. In
iterator creation, we copy a pointer to each in-memory chunk of a
series into the iterator. While this creates a certain amount of
allocation churn, the worst thing about it is that copying the chunk
pointer out of the chunkDesc requires a mutex acquisition. (Remember
that the iterator will also reference un-pinned chunks, so we need to
acquire the mutex to protect against concurrent eviction.) The worst
case happens if a series doesn't even contain any relevant samples for
the query time range. We notice that during preloading but then we
will still create a series iterator for it. But even for series that
do contain relevant samples, the overhead is quite bad for instant
queries that retrieve a single sample from each series, but still go
through all the effort of series iterator creation. All of that is
particularly bad if a series has many in-memory chunks.

This commit addresses the problem from two sides:

First, it merges preloading and iterator creation into one step,
i.e. the preload call returns an iterator for exactly the preloaded
chunks.

Second, the required mutex acquisition in chunkDesc has been greatly
reduced. That was enabled by a side effect of the first step, which is
that the iterator is only referencing pinned chunks, so there is no
risk of concurrent eviction anymore, and chunks can be accessed
without mutex acquisition.

To simplify the code changes for the above, the long-planned change of
ValueAtTime to ValueAtOrBefore time was performed at the same
time. (It should have been done first, but it kind of accidentally
happened while I was in the middle of writing the series iterator
changes. Sorry for that.) So far, we actively filtered the up to two
values that were returned by ValueAtTime, i.e. we invested work to
retrieve up to two values, and then we invested more work to throw one
of them away.

The SeriesIterator.BoundaryValues method can be removed once #1401 is
fixed. But I really didn't want to load even more changes into this
PR.

Benchmarks:

The BenchmarkFuzz.* benchmarks take 83% less time (i.e. they run about
six times faster) and allocate 95% fewer bytes. The reason is that the
benchmark reads one sample after another from the time series and
creates a new series iterator for each sample read.

To find out how much these improvements matter in practice, I have
mirrored a beefy Prometheus server at SoundCloud that suffers from
both issues #1035 and #1264. To reach steady state that would be
comparable, the server needs to run for 15d. So far, it has run for
1d. The test server currently has only half as many memory time series
and 60% of the memory chunks the main server has. The 90th percentile
rule evaluation cycle time is ~11s on the main server and only ~3s on
the test server. However, these numbers might get much closer over
time.

In addition to performance improvements, this commit removes about 150
LOC.
2016-02-19 16:24:38 +01:00
beorn7 ef3ab96111 Populate first and last time in the chunk descriptor earlier
The first time is kind of trivial as we always know it when we create
a new chunkDesc.

The last time is only known when the chunk is closed, so we have to set
it at that time.

The change saves a lot of digging down into the chunk
itself. Especially the last time is relatively expensive as it involves
the creation of an iterator. The first time access now doesn't require
locking, which is also a nice gain.
2016-02-15 14:06:09 +01:00
beorn7 9a3edea477 Remove race condition from TestRetentionCutoff 2016-02-12 12:13:19 +01:00