This will fix issue #1035 and will also mitigate issue #1264.
The fundamental problem in the current code:
In the preload phase, we quite accurately determine which chunks will
be used for the query being executed. However, in the subsequent step
of creating series iterators, the created iterators reference _all_
in-memory chunks in their series, even the un-pinned ones. During
iterator creation, we copy a pointer to each in-memory chunk of a
series into the iterator. While this creates a certain amount of
allocation churn, the worst thing about it is that copying the chunk
pointer out of the chunkDesc requires a mutex acquisition. (Remember
that the iterator will also reference un-pinned chunks, so we need to
acquire the mutex to protect against concurrent eviction.) The worst
case happens if a series doesn't contain any relevant samples for the
query time range at all. We notice that during preloading, but we still
create a series iterator for it. Even for series that do contain
relevant samples, the overhead is considerable for instant queries that
retrieve only a single sample from each series yet still go through the
full effort of series iterator creation. All of that is
particularly bad if a series has many in-memory chunks.
This commit addresses the problem from two sides:
First, it merges preloading and iterator creation into one step,
i.e. the preload call returns an iterator for exactly the preloaded
chunks.
Second, the required mutex acquisition in chunkDesc has been greatly
reduced. That was enabled by a side effect of the first step, which is
that the iterator is only referencing pinned chunks, so there is no
risk of concurrent eviction anymore, and chunks can be accessed
without mutex acquisition.
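A rough sketch of the resulting shape, with illustrative names only (the
actual types and methods in the local storage package differ in detail):

    // The preload call pins exactly the chunks overlapping the query range
    // and hands those pinned chunks to the iterator. Pinned chunks cannot
    // be evicted, so the iterator can read them without taking the
    // chunkDesc mutex.
    func (s *memorySeries) preloadChunksForRange(from, through model.Time) (SeriesIterator, error) {
        pinnedChunks, err := s.pinChunksInRange(from, through) // hypothetical helper
        if err != nil {
            return nopIterator{}, err
        }
        return newMemorySeriesIterator(pinnedChunks), nil
    }
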
To simplify the code changes for the above, the long-planned change of
ValueAtTime to ValueAtOrBeforeTime was performed at the same
time. (It should have been done first, but it kind of accidentally
happened while I was in the middle of writing the series iterator
changes. Sorry for that.) So far, we actively filtered the up to two
values that were returned by ValueAtTime, i.e. we invested work to
retrieve up to two values, and then we invested more work to throw one
of them away.
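To illustrate the difference in a sketch (method names as described above;
the concrete iterator interface may differ):

    type SeriesIterator interface {
        // Old: ValueAtTime(t model.Time) []model.SamplePair returned up to
        // two values straddling t, one of which every caller threw away.

        // New: return only the latest sample at or before t.
        ValueAtOrBeforeTime(t model.Time) model.SamplePair

        // ... other methods elided ...
    }
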
The SeriesIterator.BoundaryValues method can be removed once #1401 is
fixed. But I really didn't want to load even more changes into this
PR.
Benchmarks:
The BenchmarkFuzz.* benchmarks run 83% faster (i.e. about six times
faster) and allocate 95% fewer bytes. The reason for that is that the
benchmark reads one sample after another from the time series and
creates a new series iterator for each sample read.
To find out how much these improvements matter in practice, I have
mirrored a beefy Prometheus server at SoundCloud that suffers from
both issues #1035 and #1264. To reach a comparable steady state, the
server needs to run for 15d. So far, it has run for
1d. The test server currently has only half as many memory time series
and 60% of the memory chunks the main server has. The 90th percentile
rule evaluation cycle time is ~11s on the main server and only ~3s on
the test server. However, these numbers might get much closer over
time.
In addition to performance improvements, this commit removes about 150
LOC.
The first time is kind of trivial as we always know it when we create
a new chunkDesc.
The last time is only known when the chunk is closed, so we have to set
it at that time.
The change saves a lot of digging down into the chunk
itself. Especially the last time is relatively expensive to retrieve as
it involves the creation of an iterator. Access to the first time now
doesn't require locking, which is also a nice gain.
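A minimal sketch of the cached fields, with simplified names (assumptions,
not the exact code):

    type chunkDesc struct {
        sync.Mutex
        c *chunk

        // Cached so callers no longer have to dig into the chunk itself.
        chunkFirstTime model.Time // known at creation, immutable, read without locking
        chunkLastTime  model.Time // only valid once the chunk has been closed
    }

    func newChunkDesc(c *chunk, firstTime model.Time) *chunkDesc {
        return &chunkDesc{c: c, chunkFirstTime: firstTime}
    }

    // Called when the chunk is closed; only then is the last time known.
    func (cd *chunkDesc) setClosed() {
        cd.Lock()
        defer cd.Unlock()
        cd.chunkLastTime = lastSampleTime(cd.c) // hypothetical helper
    }
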
This gives up on the idea of communicating through the Append() call
(by either not returning, as it is now, or by returning an error, as
suggested/explored elsewhere). Here, I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_ of
Append() calls. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.
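A sketch of how a caller can use this, with illustrative names (the real
scrape and rule-evaluation code differs):

    // Check throttling once per batch, so a scrape (or a rule group
    // evaluation) is ingested completely or not at all.
    if storage.Throttled() {
        return errSkippedScrape // hypothetical sentinel; the whole batch is dropped
    }
    for _, s := range samples {
        storage.Append(s)
    }
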
Logging of throttled mode has been streamlined and will create at most
one message per minute.
Since we are not overestimating the number of chunks to persist
anymore, this commit also adjusts the default value for
-storage.local.memory-chunks. Update of documentation will follow.
"Rushed mode" is formerly known as "degraded mode", which is changed
with this commit, too. The name "degraded" was very misleading.
Also, switch into rushed mode if we have too many chunks in memory and
at least a reasonable number of chunks to persist, so that speeding up
chunk persistence can actually help.
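A sketch of the added condition, with hypothetical names for the counters
and thresholds involved:

    // Enter rushed mode not only when persistence is urgent for other
    // reasons, but also when memory pressure is high and enough chunks are
    // waiting for persistence that rushing can actually relieve it.
    if memChunks > maxMemChunks && chunksToPersist > minChunksToPersistForRush {
        s.enterRushedMode()
    }
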
If only very few chunks are to be truncated from a very large series
file, the rewrite of the file is a large overhead. With this change, a
certain ratio of the file has to be dropped for the rewrite to happen.
While only causing disk overhead at about the same ratio (by default
10%), it will cut down I/O by a lot in the above scenario.
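A sketch of the check (names are illustrative; only the 10% default above
is taken from the change itself):

    // Only rewrite the series file if at least a minimum fraction of it
    // would be dropped; otherwise accept some disk overhead and skip the
    // rewrite entirely.
    const minChunksToDropRatio = 0.1 // default: 10%

    if float64(numChunksToDrop)/float64(numChunksInFile) < minChunksToDropRatio {
        return nil // too little to gain from rewriting
    }
    // ... rewrite the series file without the dropped chunks ...
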
Allows using Graphite over TCP or UDP. Metric labels
and values are used to construct a valid Graphite path
in a way that will allow us to eventually read them back
and reconstruct the metrics.
For example, this metric:

    model.Metric{
        model.MetricNameLabel: "test:metric",
        "testlabel":           "test:value",
        "testlabel2":          "test:value",
    }

will become:

    test:metric.testlabel=test:value.testlabel2=test:value
escape.go takes care of escaping values to match the Graphite
character set; it basically uses percent-encoding as a fallback, which
will work pretty well in the Graphite/Grafana world.
The remote storage module also has an optional 'prefix' parameter
to prefix all metrics with a path (for example, 'prometheus.').
Graphite URLs are simply in the form tcp://host:port or
udp://host:port.
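A sketch of the path construction described above (illustrative; escape()
stands in for the percent-encoding fallback implemented in escape.go):

    // Build "name.label1=value1.label2=value2", with labels sorted so the
    // resulting Graphite path is stable across writes.
    func pathFromMetric(m model.Metric, prefix string) string {
        buf := bytes.NewBufferString(prefix)
        buf.WriteString(escape(string(m[model.MetricNameLabel])))

        labels := make([]string, 0, len(m))
        for ln := range m {
            if ln != model.MetricNameLabel {
                labels = append(labels, string(ln))
            }
        }
        sort.Strings(labels)
        for _, ln := range labels {
            fmt.Fprintf(buf, ".%s=%s", escape(ln), escape(string(m[model.LabelName(ln)])))
        }
        return buf.String()
    }
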
Because the InfluxDB client library currently pulls in multiple MBs of
unnecessary dependencies, I have modified and cut up the vendored
version to only pull in the few pieces that are actually needed.
On InfluxDB's side, this dependency issue is tracked in:
https://github.com/influxdb/influxdb/issues/3447
Hopefully, it will be resolved soon.
If a password is needed for InfluxDB, it may be supplied via the
INFLUXDB_PW environment variable.
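For illustration (the exact wiring into the client configuration is not
shown here):

    // Read the InfluxDB password from the environment, as described above.
    influxDBPassword := os.Getenv("INFLUXDB_PW")
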
The test had become flaky with Go 1.5.
The theory here is that with Go 1.5.x, sleeping for 10ms might not be
enough to wake up another goroutine, possibly because the time is spent
on GC. 50ms should always be enough, given the pause guarantees of the
new GC.
This is with `golint -min_confidence=0.5`.
I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.
If users see the crash recovery error, the chances are
they aren't shutting down Prometheus correctly. Telling
them how to do so will help them debug and fix the problem.
Perhaps it would be even better to still warn in case the sample value has
changed but the timestamps are equal, but we don't have efficient access
to the last value.
Allow scrape_configs to have an optional proxy_url option which specifies
a proxy to be used for all connections to hosts in that config.
Internally, this modifies the various client functions to take a *url.URL
pointer which currently must point to an HTTP proxy (but has been left
open-ended to allow the URL format to be extended to support others,
such as SOCKS, if needed).
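For illustration, this is how such a proxy URL is typically wired into an
HTTP client using the standard library (the concrete helper in the code
base may differ):

    // proxyURL comes from the scrape config's proxy_url option; if it is
    // nil, http.ProxyURL(nil) results in no proxy being used.
    func newHTTPClient(proxyURL *url.URL) *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                Proxy: http.ProxyURL(proxyURL),
            },
        }
    }
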
For the label matching index-based preselection phase, don't do an OR
between equality and non-equality matchers. Execute only one of the two
(with equality matchers preferred when present).
Fixes https://github.com/prometheus/prometheus/issues/924
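A sketch of the selection, with illustrative names:

    // Use the index-based preselection for only one class of matchers:
    // prefer equality matchers, and fall back to non-equality matchers only
    // if no equality matchers are present. All remaining matchers are then
    // applied by filtering the preselected series.
    indexMatchers := equalityMatchers
    if len(indexMatchers) == 0 {
        indexMatchers = nonEqualityMatchers
    }
    fps := fingerprintsForMatchers(indexMatchers) // hypothetical index lookup
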
If all samples in consecutive chunks have the same timestamp, the way
we used to load chunks will fail. With this change, the persist
watermark is used to load the right amount of chunkDescs from disk.
This bug is a possible reason for the rare storage corruption we have
observed.
Fixes https://github.com/prometheus/prometheus/issues/481
While doing so, clean up and fix a few other things:
- Fix `go vet` warnings (@fabxc to blame ;).
- Fix a racey problem with unarchiving: Whenever we unarchive a
series, we essentially want to do something with it. However, until
we have done something with it, it appears like a series that is
ready to be archived or even purged. So e.g. it would be ignored
during checkpointing. With this fix, we always load the chunkDescs
upon unarchiving. This is wasteful if we only want to add a new
sample to an archived time series, but the (presumably more common)
case where we access an archived time series in a query doesn't
become more expensive.
- The change above streamlined the getOrCreateSeries and
newMemorySeries flow. Also, the modTime is now always set correctly.
- Fix the leveldb-backed implementation of KeyValueStore.Delete. It
had the wrong behavior of still returning true, nil if a
non-existing key was passed in (see the sketch below).
See https://github.com/prometheus/prometheus/issues/887, which will at
least be partially fixed by this.
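A sketch of the corrected Delete semantics mentioned in the list above,
assuming a goleveldb-backed store (the wrapper's signature is simplified
here):

    // Report whether the key actually existed instead of unconditionally
    // returning true on success.
    func (l *LevelDB) Delete(key []byte) (bool, error) {
        has, err := l.storage.Has(key, nil)
        if err != nil {
            return false, err
        }
        if !has {
            return false, nil // nothing to delete
        }
        return true, l.storage.Delete(key, nil)
    }
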
From the spec https://golang.org/ref/spec#Conversions:
"In all non-constant conversions involving floating-point or complex
values, if the result type cannot represent the value the conversion
succeeds but the result value is implementation-dependent."
This ended up setting the converted values to 0 on Debian's Go 1.4.2
compiler, at least on 32-bit Debians.
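For illustration, the defensive pattern is to clamp explicitly before
converting (values and limits here are generic, not those of the original
bug):

    // int64(f) is implementation-dependent if f does not fit into int64,
    // so clamp first. Note that math.MaxInt64 rounds up to 2^63 when used
    // as a float64, hence the >= comparison.
    func clampToInt64(f float64) int64 {
        switch {
        case math.IsNaN(f):
            return 0
        case f >= math.MaxInt64:
            return math.MaxInt64
        case f <= math.MinInt64:
            return math.MinInt64
        default:
            return int64(f)
        }
    }
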
This commit adds the honor_labels and params arguments to the scrape
config. This allows specifying query parameters to be used by the
scrapers and handling scraped labels with precedence.
Change #704 introduced a regression that started reading the queue only
after potential crash recovery. When more than the queue capacity was
indexed, Prometheus deadlocked.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
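For illustration, the mapping is mechanical (the message below is made up;
log here is github.com/prometheus/log):

    // Before, with glog:
    glog.V(1).Infof("Dropping chunk %d from series %v.", i, fp)

    // After:
    log.Debugf("Dropping chunk %d from series %v.", i, fp)
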
This commit creates a (so far unused) package. It contains a custom
lexer/parser for the query language.
ast.go: New AST that interacts well with the parser.
lex.go: Custom lexer (new).
lex_test.go: Lexer tests (new).
parse.go: Custom parser (new).
parse_test.go: Parser tests (new).
functions.go: Changed function type, dummies for parser testing (barely changed/dummies).
printer.go: Adapted from rules/ and adjusted to new AST (mostly unchanged, few additions).
Also, clean up some things in the code (especially introduction of the
chunkLenWithHeader constant to avoid the same expression all over the place).
Benchmark results:
BEFORE
BenchmarkLoadChunksSequentially 5000 283580 ns/op 152143 B/op 312 allocs/op
BenchmarkLoadChunksRandomly 20000 82936 ns/op 39310 B/op 99 allocs/op
BenchmarkLoadChunkDescs 10000 110833 ns/op 15092 B/op 345 allocs/op
AFTER
BenchmarkLoadChunksSequentially 10000 146785 ns/op 152285 B/op 315 allocs/op
BenchmarkLoadChunksRandomly 20000 67598 ns/op 39438 B/op 103 allocs/op
BenchmarkLoadChunkDescs 20000 99631 ns/op 12636 B/op 192 allocs/op
Note that everything is obviously loaded from the page cache (as the
benchmark runs thousands of times with very small series files). In a
real-world scenario, I expect a larger impact, as the disk operations
will more often actually hit the disk. To load ~50 sequential chunks,
this reduces the iops from 100 seeks and 100 reads to 1 seek and 1
read.
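A sketch of why the sequential case collapses to a single seek and read
(names are illustrative; chunkLenWithHeader is the constant introduced
above):

    // Read n consecutive chunks of one series with a single ReadAt call
    // instead of one seek and one read per chunk.
    buf := make([]byte, n*chunkLenWithHeader)
    if _, err := seriesFile.ReadAt(buf, offsetOfChunk(firstChunkIndex)); err != nil {
        return nil, err
    }
    for i := 0; i < n; i++ {
        data := buf[i*chunkLenWithHeader : (i+1)*chunkLenWithHeader]
        // ... decode the chunk header and payload from data ...
        _ = data
    }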