In a way, our instants were also ranges, just with the staleness delta
as the range length. They are now treated equally, the only difference
being that in one case the range length is given explicitly as a range,
while in the other it is the staleness delta. However, there are "real"
instants where the start and end time of a query are the same. In those
cases, we only want to return a single
value (the one closest before or at the equal start and end time). If
that value is the last sample in the series, odds are we have it
already in the series object. In that case, there is no need to pin or
load any chunks. A special singleSampleSeriesIterator is created for
that. This should greatly speed up instant queries as they happen
frequently for rule evaluations.
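Roughly, the idea looks like this (the types and names below are
simplified stand-ins, not the actual storage code):

    package main

    import "fmt"

    // Simplified stand-ins for the real sample pair type.
    type SamplePair struct {
        Timestamp int64
        Value     float64
    }

    var ZeroSamplePair = SamplePair{Timestamp: -1}

    // singleSampleSeriesIterator serves an instant query (start == end)
    // straight from the last sample kept in the series object, so no
    // chunks have to be pinned or loaded.
    type singleSampleSeriesIterator struct {
        samplePair SamplePair
    }

    // ValueAtOrBefore returns the single sample if it is not newer than
    // t, and the zero sample pair otherwise.
    func (it singleSampleSeriesIterator) ValueAtOrBefore(t int64) SamplePair {
        if it.samplePair.Timestamp > t {
            return ZeroSamplePair
        }
        return it.samplePair
    }

    func main() {
        it := singleSampleSeriesIterator{
            samplePair: SamplePair{Timestamp: 1000, Value: 42},
        }
        fmt.Println(it.ValueAtOrBefore(1000)) // {1000 42}
        fmt.Println(it.ValueAtOrBefore(999))  // {-1 0}, i.e. no sample
    }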
This implies a slight change of behavior as only samples added to the
respective instance of a memorySeries are returned. However, this is
most likely what we want anyway.
Consider the following cases:
- Server has been restarted: Given the time it takes to cleanly
shutdown and start up a server, the series are now stale anyway. An
improved staleness handling (still to be implemented) will be based
on tracking if a given target is continuing to expose samples for a
given time series. In that case, we need a full scrape cycle to
decide about staleness. So again, it makes sense to consider
everything stale directly after a server restart.
- Series unarchived due to a read request: The series is definitely
stale so we don't want to return anything anyway.
- Freshly created time series or series unarchived because of a sample
append: That happens because appending a sample is imminent. Before
the fingerprint lock is released, the series will have received a
sample, and lastSamplePair will always return the expected value.
Formalize ZeroSamplePair as return value for non-existing samples.
Change LastSamplePairForFingerprint to return a SamplePair (and not a
pointer to it), which saves allocations in a potentially extremely
frequent call.
This will fix issue #1035 and will also help to make issue #1264 less
bad.
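A minimal illustration of the new return convention (again with
simplified stand-in types, not the actual function):

    package main

    import "fmt"

    type SamplePair struct {
        Timestamp int64
        Value     float64
    }

    // ZeroSamplePair is the formalized "no such sample" return value.
    var ZeroSamplePair = SamplePair{Timestamp: -1}

    // lastSamplePair stands in for LastSamplePairForFingerprint: it
    // returns the sample by value (no pointer, hence no per-call heap
    // allocation) and ZeroSamplePair if the series has no samples.
    func lastSamplePair(samples []SamplePair) SamplePair {
        if len(samples) == 0 {
            return ZeroSamplePair
        }
        return samples[len(samples)-1]
    }

    func main() {
        fmt.Println(lastSamplePair(nil)) // {-1 0}
        fmt.Println(lastSamplePair([]SamplePair{{Timestamp: 1000, Value: 42}}))
    }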
The fundamental problem in the current code:
In the preload phase, we quite accurately determine which chunks will
be used for the query being executed. However, in the subsequent step
of creating series iterators, the created iterators are referencing
_all_ in-memory chunks in their series, even the un-pinned ones. In
iterator creation, we copy a pointer to each in-memory chunk of a
series into the iterator. While this creates a certain amount of
allocation churn, the worst thing about it is that copying the chunk
pointer out of the chunkDesc requires a mutex acquisition. (Remember
that the iterator will also reference un-pinned chunks, so we need to
acquire the mutex to protect against concurrent eviction.) The worst
case happens if a series doesn't even contain any relevant samples for
the query time range. We notice that during preloading but then we
will still create a series iterator for it. But even for series that
do contain relevant samples, the overhead is quite bad for instant
queries that retrieve a single sample from each series, but still go
through all the effort of series iterator creation. All of that is
particularly bad if a series has many in-memory chunks.
This commit addresses the problem from two sides:
First, it merges preloading and iterator creation into one step,
i.e. the preload call returns an iterator for exactly the preloaded
chunks.
Second, the required mutex acquisition in chunkDesc has been greatly
reduced. That was enabled by a side effect of the first step, which is
that the iterator is only referencing pinned chunks, so there is no
risk of concurrent eviction anymore, and chunks can be accessed
without mutex acquisition.
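Sketched with made-up names and signatures, the merged step looks
roughly like this:

    package main

    import "fmt"

    type chunk struct{ firstTime, lastTime int64 }

    // chunkIterator iterates over exactly the chunks it was handed.
    // Because those chunks are pinned by the caller, no mutex is needed
    // to guard against concurrent eviction.
    type chunkIterator struct{ chunks []chunk }

    // preloadChunksForRange pins the chunks overlapping [from, through]
    // and returns an iterator over exactly those chunks, merging what
    // used to be two separate steps (preload, then iterator creation).
    func preloadChunksForRange(all []chunk, from, through int64) chunkIterator {
        var pinned []chunk
        for _, c := range all {
            if c.lastTime >= from && c.firstTime <= through {
                // In the real code the chunk would be pinned here.
                pinned = append(pinned, c)
            }
        }
        return chunkIterator{chunks: pinned}
    }

    func main() {
        all := []chunk{{0, 9}, {10, 19}, {20, 29}}
        it := preloadChunksForRange(all, 12, 22)
        fmt.Println(len(it.chunks)) // 2 - only the relevant, pinned chunks
    }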
To simplify the code changes for the above, the long-planned change of
ValueAtTime to ValueAtOrBeforeTime was performed at the same
time. (It should have been done first, but it kind of accidentally
happened while I was in the middle of writing the series iterator
changes. Sorry for that.) So far, we actively filtered the up to two
values that were returned by ValueAtTime, i.e. we invested work to
retrieve up to two values, and then we invested more work to throw one
of them away.
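The semantics of the new method, roughly (simplified stand-ins, not the
real implementation):

    package main

    import (
        "fmt"
        "sort"
    )

    type SamplePair struct {
        Timestamp int64
        Value     float64
    }

    var ZeroSamplePair = SamplePair{Timestamp: -1}

    // valueAtOrBefore returns the latest sample at or before t - exactly
    // one value, instead of the up-to-two values ValueAtTime used to
    // return and that callers then had to filter again.
    func valueAtOrBefore(samples []SamplePair, t int64) SamplePair {
        // Index of the first sample *after* t.
        i := sort.Search(len(samples), func(i int) bool {
            return samples[i].Timestamp > t
        })
        if i == 0 {
            return ZeroSamplePair // no sample at or before t
        }
        return samples[i-1]
    }

    func main() {
        s := []SamplePair{{10, 1}, {20, 2}, {30, 3}}
        fmt.Println(valueAtOrBefore(s, 25)) // {20 2}
        fmt.Println(valueAtOrBefore(s, 5))  // {-1 0}
    }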
The SeriesIterator.BoundaryValues method can be removed once #1401 is
fixed. But I really didn't want to load even more changes into this
PR.
Benchmarks:
The BenchmarkFuzz.* benchmarks complete in 83% less time (i.e. they run
about six times faster) and allocate 95% fewer bytes. The reason is that the
benchmark reads one sample after another from the time series and
creates a new series iterator for each sample read.
To find out how much these improvements matter in practice, I have
mirrored a beefy Prometheus server at SoundCloud that suffers from
both issues #1035 and #1264. To reach steady state that would be
comparable, the server needs to run for 15d. So far, it has run for
1d. The test server currently has only half as many memory time series
and 60% of the memory chunks the main server has. The 90th percentile
rule evaluation cycle time is ~11s on the main server and only ~3s on
the test server. However, these numbers might get much closer over
time.
In addition to performance improvements, this commit removes about 150
LOC.
The first time is kind of trivial as we always know it when we create
a new chunkDesc.
The last time is only known when the chunk is closed, so we have to set
it at that time.
The change saves a lot of digging down into the chunk
itself. Retrieving the last time in particular is relatively expensive
as it involves the creation of an iterator. Accessing the first time
now doesn't require locking, which is also a nice gain.
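Schematically (field and method names simplified, not the actual code):

    package main

    import "fmt"

    type chunk struct {
        // Real chunks hold encoded sample data; omitted here.
    }

    // chunkDesc caches the first and last timestamp so that callers
    // don't have to dig into the chunk (and, for the last time, create
    // an iterator) to find them.
    type chunkDesc struct {
        c         *chunk
        firstTime int64 // known as soon as the chunkDesc is created
        lastTime  int64 // only known (and set) once the chunk is closed
        closed    bool
    }

    func newChunkDesc(first int64) *chunkDesc {
        return &chunkDesc{c: &chunk{}, firstTime: first}
    }

    // close records the last timestamp once no more samples will be
    // added to this chunk.
    func (cd *chunkDesc) close(last int64) {
        cd.lastTime = last
        cd.closed = true
    }

    func main() {
        cd := newChunkDesc(100)
        cd.close(199)
        fmt.Println(cd.firstTime, cd.lastTime) // 100 199
    }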
This gives up on the idea of communicating through the Append() call
(by either not returning, as it is now, or returning an error, as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append() calls. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.
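A sketch of how a scraper could use such a call (the interface below is
illustrative, not the actual one):

    package main

    import (
        "errors"
        "fmt"
    )

    type Sample struct {
        Timestamp int64
        Value     float64
    }

    // appender is an illustrative version of the storage interface
    // described above: Throttled() is checked once per batch, Append()
    // never blocks or fails because of throttling.
    type appender interface {
        Throttled() bool
        Append(Sample)
    }

    // ingestScrape appends either the whole scrape or nothing at all.
    func ingestScrape(a appender, scrape []Sample) error {
        if a.Throttled() {
            return errors.New("storage throttled, dropping scrape")
        }
        for _, s := range scrape {
            a.Append(s)
        }
        return nil
    }

    type fakeStorage struct{ throttled bool }

    func (f *fakeStorage) Throttled() bool { return f.throttled }
    func (f *fakeStorage) Append(Sample)   {}

    func main() {
        fmt.Println(ingestScrape(&fakeStorage{throttled: true}, []Sample{{Timestamp: 1, Value: 2}}))
        fmt.Println(ingestScrape(&fakeStorage{}, []Sample{{Timestamp: 1, Value: 2}}))
    }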
Logging of throttled mode has been streamlined and will create at most
one message per minute.
Since we are not overestimating the number of chunks to persist
anymore, this commit also adjusts the default value for
-storage.local.memory-chunks. Update of documentation will follow.
"Rushed mode" is formerly known as "degraded mode", which is changed
with this commit, too. The name "degraded" was very misleading.
Also, switch into rushed mode if we have too many chunks in memory and
at least a reasonable number of chunks to persist, so that speeding up
chunk persistence can actually help.
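The additional condition, sketched with made-up names and a made-up
ratio:

    package main

    import "fmt"

    // shouldEnterRushedMode is an illustrative version of the decision
    // described above: enter rushed mode if memory pressure is high AND
    // there is a reasonable backlog of chunks to persist, i.e. speeding
    // up persistence can actually relieve the pressure.
    func shouldEnterRushedMode(memChunks, maxMemChunks, chunksToPersist int) bool {
        tooManyInMemory := memChunks > maxMemChunks
        worthSpeedingUp := chunksToPersist > maxMemChunks/10 // hypothetical ratio
        return tooManyInMemory && worthSpeedingUp
    }

    func main() {
        fmt.Println(shouldEnterRushedMode(1200000, 1048576, 200000)) // true
        fmt.Println(shouldEnterRushedMode(1200000, 1048576, 100))    // false
    }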
If only very few chunks are to be truncated from a very large series
file, the rewrite of the file is a large overhead. With this change, a
certain ratio of the file has to be dropped to make the rewrite happen.
While only causing disk overhead at about the same ratio (by default
10%), it will cut down I/O by a lot in the above scenario.
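In essence (illustrative names, with the 10% default mentioned above):

    package main

    import "fmt"

    // shouldRewriteSeriesFile decides whether dropping chunksToDrop out
    // of totalChunks justifies rewriting the whole series file. Below
    // the ratio, the old chunks are simply kept in the file for now,
    // trading some disk space for far fewer rewrites.
    func shouldRewriteSeriesFile(chunksToDrop, totalChunks int, minRatio float64) bool {
        if totalChunks == 0 {
            return false
        }
        return float64(chunksToDrop)/float64(totalChunks) >= minRatio
    }

    func main() {
        fmt.Println(shouldRewriteSeriesFile(5, 1000, 0.1))   // false - not worth it
        fmt.Println(shouldRewriteSeriesFile(200, 1000, 0.1)) // true
    }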
The test had become flaky with Go1.5.
The theory here is that with Go1.5.x, sleeping for 10ms might not be
enough to wake up another goroutine, possibly because it is used for
GC. 50ms should always be enough due to GC pause guarantees with the
new GC.
This is with `golint -min_confidence=0.5`.
I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.
If users see the crash recovery error, the chances are
they aren't shutting down Prometheus correctly. Telling
them how to do so will help them debug and fix the problem.
Perhaps it would be even better to still warn in case the sample value has
changed but the timestamps are equal, but we don't have efficient access
to the last value.
For the label matching index-based preselection phase, don't do an OR
between equality and non-equality matchers. Execute only one of the two
(with equality matchers preferred when present).
Fixes https://github.com/prometheus/prometheus/issues/924
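Schematically (with a simplified matcher representation, not the actual
code):

    package main

    import "fmt"

    type matcher struct {
        name, value string
        isEquality  bool
    }

    // selectMatchersForIndexLookup picks the matchers used for the
    // index-based preselection: only the equality matchers if there are
    // any, otherwise only the non-equality matchers - never the OR of
    // both result sets. The remaining matchers are applied as a filter
    // later.
    func selectMatchersForIndexLookup(ms []matcher) []matcher {
        var eq, neq []matcher
        for _, m := range ms {
            if m.isEquality {
                eq = append(eq, m)
            } else {
                neq = append(neq, m)
            }
        }
        if len(eq) > 0 {
            return eq
        }
        return neq
    }

    func main() {
        ms := []matcher{
            {name: "job", value: "api", isEquality: true},
            {name: "instance", value: "web.+", isEquality: false},
        }
        fmt.Println(selectMatchersForIndexLookup(ms)) // only the equality matcher
    }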
If all samples in consecutive chunks have the same timestamp, the way
we used to load chunks will fail. With this change, the persist
watermark is used to load the right amount of chunkDescs from disk.
This bug is a possible reason for the rare storage corruption we have
observed.
Fixes https://github.com/prometheus/prometheus/issues/481
While doing so, clean up and fix a few other things:
- Fix `go vet` warnings (@fabxc to blame ;).
- Fix a racy problem with unarchiving: Whenever we unarchive a
series, we essentially want to do something with it. However, until
we have done something with it, it appears like a series that is
ready to be archived or even purged. So e.g. it would be ignored
during checkpointing. With this fix, we always load the chunkDescs
upon unarchiving. This is wasteful if we only want to add a new
sample to an archived time series, but the (presumably more common)
case where we access an archived time series in a query doesn't
become more expensive.
- The change above streamlined the getOrCreateSeries and
newMemorySeries flow. Also, the modTime is now always set correctly.
- Fix the leveldb-backed implementation of KeyValueStore.Delete. It
had the wrong behavior of still returning (true, nil) if a
non-existing key has been passed in.
See https://github.com/prometheus/prometheus/issues/887, which will at
least be partially fixed by this.
From the spec https://golang.org/ref/spec#Conversions:
"In all non-constant conversions involving floating-point or complex
values, if the result type cannot represent the value the conversion
succeeds but the result value is implementation-dependent."
This ended up setting the converted values to 0 on Debian's Go 1.4.2
compiler, at least on 32-bit Debians.
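For illustration, a guard along the following lines avoids relying on
the implementation-dependent behavior (this is a generic sketch, not
the code in this commit):

    package main

    import (
        "fmt"
        "math"
    )

    // safeInt64 converts f to int64 only if the result is representable;
    // otherwise it clamps (or picks 0 for NaN) instead of relying on the
    // implementation-dependent result of an out-of-range conversion.
    func safeInt64(f float64) int64 {
        switch {
        case math.IsNaN(f):
            return 0 // pick a defined value for NaN
        case f >= math.MaxInt64:
            return math.MaxInt64
        case f <= math.MinInt64:
            return math.MinInt64
        default:
            return int64(f)
        }
    }

    func main() {
        fmt.Println(safeInt64(1e300))  // clamped to MaxInt64
        fmt.Println(safeInt64(-1e300)) // clamped to MinInt64
        fmt.Println(safeInt64(42.9))   // 42
    }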
This commit adds the honor_labels and params arguments to the scrape
config. This allows specifying query parameters used by the scrapers
and controlling the precedence of scraped labels.
Change #704 introduced a regression that started reading the queue only
after potential crash recovery. When more than the queue capacity was
indexed, Prometheus deadlocked.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
Also, clean up some things in the code (especially introduction of the
chunkLenWithHeader constant to avoid the same expression all over the place).
Benchmark results:
BEFORE
BenchmarkLoadChunksSequentially     5000    283580 ns/op    152143 B/op    312 allocs/op
BenchmarkLoadChunksRandomly        20000     82936 ns/op     39310 B/op     99 allocs/op
BenchmarkLoadChunkDescs            10000    110833 ns/op     15092 B/op    345 allocs/op
AFTER
BenchmarkLoadChunksSequentially    10000    146785 ns/op    152285 B/op    315 allocs/op
BenchmarkLoadChunksRandomly        20000     67598 ns/op     39438 B/op    103 allocs/op
BenchmarkLoadChunkDescs            20000     99631 ns/op     12636 B/op    192 allocs/op
Note that everything is obviously loaded from the page cache (as the
benchmark runs thousands of times with very small series files). In a
real-world scenario, I expect a larger impact, as the disk operations
will more often actually hit the disk. To load ~50 sequential chunks,
this reduces the iops from 100 seeks and 100 reads to 1 seek and 1
read.
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.
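Conceptually, the tee looks like this (simplified interface; names are
illustrative):

    package main

    import "fmt"

    type Sample struct {
        Timestamp int64
        Value     float64
    }

    // SampleAppender is an illustrative version of the interface that
    // targets and rule evaluation now call directly.
    type SampleAppender interface {
        Append(Sample)
    }

    // Tee forks every Append into two appenders, e.g. local storage and
    // a remote backend like OpenTSDB.
    type Tee struct {
        Appender1, Appender2 SampleAppender
    }

    func (t Tee) Append(s Sample) {
        t.Appender1.Append(s)
        t.Appender2.Append(s)
    }

    type printAppender string

    func (p printAppender) Append(s Sample) { fmt.Println(p, s) }

    func main() {
        var a SampleAppender = Tee{
            Appender1: printAppender("local"),
            Appender2: printAppender("remote"),
        }
        a.Append(Sample{Timestamp: 1, Value: 2})
    }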
Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...
Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.
The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.
Also, remove lint and vet warnings in ast.go.
A number of mostly minor things:
- Rename chunk type -> chunk encoding.
- After all, do not carry around the chunk encoding to all parts of
the system, but just have one place where the encoding for new
chunks is set based on the flag. The new approach has caveats as
well, but the pollution of so many method signatures is worse.
- Use the default chunk encoding for new chunks of existing
series. (Previously, only new _series_ would get chunks with the
default encoding.)
- Use an enum for chunk encoding (see the sketch after this list). (But
keep the version number for the flag, for reasons discussed
previously.)
- Add encoding() to the chunk interface (so that a chunk knows its own
encoding - no need to have that in a different top-level function).
- Got rid of newFollowUpChunk (which would keep the existing encoding
for all chunks of a time series). Now only use newChunk(), which
will create a chunk encoding according to the flag.
- Simplified transcodeAndAdd.
- Reordered methods of deltaEncodedChunk and doubleDeltaEncodedChunk
to match the order in the chunk interface.
- Only transcode if the chunk is not yet half full. If more than half
full, add a new chunk instead.
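The enum could look roughly like this (a sketch; the concrete values
and names are illustrative, with the flag keeping its numeric
"version"):

    package main

    import "fmt"

    // chunkEncoding is the enum for chunk encodings. The numeric values
    // are meant to line up with the version numbers used by the flag.
    type chunkEncoding byte

    const (
        delta chunkEncoding = iota
        doubleDelta
    )

    func (ce chunkEncoding) String() string {
        switch ce {
        case delta:
            return "delta"
        case doubleDelta:
            return "doubleDelta"
        default:
            return "unknown"
        }
    }

    // defaultChunkEncoding is set once from the flag; newChunk consults
    // it for every new chunk, also for new chunks of existing series.
    var defaultChunkEncoding = doubleDelta

    func main() {
        fmt.Println(defaultChunkEncoding) // doubleDelta
    }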
This checks the basic behaviour of GetFingerprintsForLabelMatchers:
whether the different matcher types filter the correct fingerprints and
whether intersections are correct.
The capacity is basically how many persisted head chunks we will count
at most while doing other things, in particular checkpointing. To
limit the number of already counted head chunks, keep this number low,
otherwise we will easily checkpoint too often if checkpoints take long
anyway.
In that commit, the 'maintainSeries' call was accidentally removed.
This commit refactors things a bit so that there is now a clean
'maintainMemorySeries' and a 'maintainArchivedSeries' call.
Straighten the nomenclature a bit (consistently use 'drop' for
chunks and 'purge' for series/metrics).
Remove the annoying 'Completed maintenance sweep through archived
fingerprints' message if there were no archived fingerprints to do
maintenance on.
This is done by bucketing chunks by fingerprint. If the persisting to
disk falls behind, more and more chunks are in the queue. As soon as
there are "double hits", we will now persist both chunks in one go,
doubling the disk throughput (assuming it is limited by disk
seeks). Should even more pile up so that we end up with "triple hits", we
will persist those first, and so on.
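Schematically, bucketing by fingerprint turns the queue into something
like this (illustrative types, not the actual implementation):

    package main

    import "fmt"

    type fingerprint uint64

    type chunk struct{ id int }

    // persistQueue buckets queued chunks by fingerprint, so that once
    // the queue backs up and a fingerprint accumulates several chunks,
    // they can all be persisted with a single pass over the series file.
    type persistQueue struct {
        byFP map[fingerprint][]chunk
    }

    func newPersistQueue() *persistQueue {
        return &persistQueue{byFP: map[fingerprint][]chunk{}}
    }

    func (q *persistQueue) add(fp fingerprint, c chunk) {
        q.byFP[fp] = append(q.byFP[fp], c)
    }

    // next pops all chunks queued for one fingerprint - a "double hit"
    // (or more) results in one write for several chunks.
    func (q *persistQueue) next() (fingerprint, []chunk, bool) {
        for fp, chunks := range q.byFP {
            delete(q.byFP, fp)
            return fp, chunks, true
        }
        return 0, nil, false
    }

    func main() {
        q := newPersistQueue()
        q.add(7, chunk{id: 1})
        q.add(7, chunk{id: 2}) // double hit: both persisted in one go
        fp, chunks, _ := q.next()
        fmt.Println(fp, len(chunks)) // 7 2
    }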
Even if we have millions of time series, this will still help,
assuming not all of them are growing with the same speed. Series that
get many samples and/or are not very compressible will accumulate
chunks faster, and they will soon get double- or triple-writes.
To improve the chance of double writes,
-storage.local.persistence-queue-capacity could be set to a higher
value. However, that will slow down shutdown a lot (as the queue has
to be worked through). So we leave it to the user to set it to a
really high value. A more fundamental solution would be to checkpoint
not only head chunks, but also chunks still in the persist queue. That
would be quite complicated for a rather limited use-case (running many
time series with high ingestion rate on slow spinning disks).
Starting a goroutine takes 1-2µs on my laptop. From the "numbers every
Go programmer should know", I had 300ns for a channel send in my
mind. Turns out, on my laptop, it takes only 60ns. That's fast enough
to warrant the machinery of yet another channel with a fixed set of
worker goroutines feeding from it. The number chosen (8 for now) is
low enough to not add any measurable overhead (a big
Prometheus server has >1000 goroutines running), but high enough to
not make sample ingestion a bottleneck.
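A generic sketch of such a fixed worker pool (not the actual code;
only the worker count of 8 is taken from the discussion above):

    package main

    import (
        "fmt"
        "sync"
    )

    type task func()

    // workerPool runs tasks on a small, fixed set of goroutines. A
    // channel send costs on the order of tens of nanoseconds, which is
    // cheap compared to the ~1-2µs of starting a fresh goroutine for
    // every piece of work.
    type workerPool struct {
        tasks chan task
        wg    sync.WaitGroup
    }

    func newWorkerPool(workers int) *workerPool {
        p := &workerPool{tasks: make(chan task)}
        p.wg.Add(workers)
        for i := 0; i < workers; i++ {
            go func() {
                defer p.wg.Done()
                for t := range p.tasks {
                    t()
                }
            }()
        }
        return p
    }

    func (p *workerPool) run(t task) { p.tasks <- t }

    // stop closes the task channel and waits for the workers to drain it.
    func (p *workerPool) stop() {
        close(p.tasks)
        p.wg.Wait()
    }

    func main() {
        p := newWorkerPool(8) // 8 workers, as discussed above
        var (
            mtx   sync.Mutex
            count int
        )
        for i := 0; i < 100; i++ {
            p.run(func() {
                mtx.Lock()
                count++
                mtx.Unlock()
            })
        }
        p.stop()
        fmt.Println(count) // 100
    }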
- Parallelize AppendSamples as much as possible without breaking the
contract about temporal order.
- Allocate more fingerprint locker slots.
- Do not run early checkpoints if we are behind on chunk persistence.
- Increase fpMinWaitDuration to give the disk more time for more
important things.
Also, switch math.MaxInt64 and math.MinInt64 to the new constants.
Also, set a much higher default value.
Chunk persist requests can be quite spiky. If you collect a large
number of time series that are very similar, they will tend to finish
up a chunk at about the same time. There is no reason we need to back
up scraping just because of that. The rationale of the new default
value is "1/8 of the chunks in memory".
persistence.go is way too long anyway, and a lot of code is just crash
recovery, which is not important for understanding the normal operation.
Also, remove unused `exists` function.
Previously, it would return an error instead. Now we can distinguish
the cases 'error while deleting known key' vs. 'key not in index'
without testing for leveldb-internal kinds of errors.
If queries are still running when the shutdown is initiated, they will
finish _during_ the shutdown. In that case, they might request chunk
eviction upon unpinning their pinned chunks. That might completely
fill the evict request queue _after_ draining it during storage
shutdown. If that ever happens (which is the case if there are _many_
queries still running during shutdown), the affected queries will be
stuck while keeping a fingerprint locked. The checkpointing can then
not process that fingerprint (or one that shares the same lock). And
then we are deadlocked.
- Move CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
This mimics the locking leveldb is performing anyway. Advantages of
doing it separately:
- Should we ever replace the leveldb implementation by one without
double-start protection, we are still good.
- In contrast to leveldb, the new code creates a meaningful error
message.
Usually, if you unarchive a series, it is to add something to it,
which will create a new head chunk. However, if a series is
unarchived and then handled by the maintenance loop before anything
has been added to it, it will be archived again. In that case, we have to
load the chunkDescs to know the lastTime of the series to be
archived. Usually, this case will happen only rarely (as a race, it has
never happened so far, possibly because the locking around unarchiving
and the subsequent sample append is smart enough). However, during
crash recovery, we sometimes treat series as "freshly unarchived"
without directly appending a sample. We might add more cases of that
type later, so better deal with archiving properly and load chunkDescs
if required.
- Documented checkpoint file format.
- High-level description of series sanitation.
- Replace fp.LoadFromString panic with an error.
(Change in client_golang already submitted.)
- Introduced checks for series file size where appropriate.
- Removed two Law of Demeter violations.
Change-Id: I555d97a2c8f4769820c2fc8bf5d6f4e160222abc
- Delete unneeded file view_adapter.go.
- Assessed that we still need the fingerprints in nodes
(to create iterators).
- Turned numMemChunkDescs into a metric.
Change-Id: I29be963c795a075ec00c095f76bf26405535609d