This is done by bucketing chunks by fingerprint. If the persisting to
disk falls behind, more and more chunks are in the queue. As soon as
there are "double hits", we will now persist both chunks in one go,
doubling the disk throughput (assuming it is limited by disk
seeks). Should even more pile up so that we end wit "triple hits", we
will persist those first, and so on.
Even if we have millions of time series, this will still help,
assuming not all of them are growing with the same speed. Series that
get many samples and/or are not very compressable will accumulate
chunks faster, and they will soon get double- or triple-writes.
To improve the chance of double writes,
-storage.local.persistence-queue-capacity could be set to a higher
value. However, that will slow down shutdown a lot (as the queue has
to be worked through). So we leave it to the user to set it to a
really high value. A more fundamental solution would be to checkpoint
not only head chunks, but also chunks still in the persist queue. That
would be quite complicated for a rather limited use-case (running many
time series with high ingestion rate on slow spinning disks).
Otherwise it will fail to "go get" the "github.com/golang/protobuf"
package because its dir already exists in the vendored packages,
but the copy isn't a git repository.
We were using Godep incorrectly (cloning repos from the internet during
build time instead of including Godeps/_workspace in the GOPATH via
"godep go"). However, to avoid even having to fetch "godeps" from the
internet during build, this now just copies the vendored files into the
GOPATH.
Also, the protocol buffer library moved from Google Code to GitHub,
which is reflected in these updates.
This fixes https://github.com/prometheus/prometheus/issues/525
This dramatically decreases the needed time and memory for building the
blob files. The memory numbers are measured via the
memory.max_usage_in_bytes value from cgroups.
* generating files.go:
OLD: 466MB 19s
NEW: 80MB 1s
* building files.go:
OLD: 1210MB 2.25s
NEW: 7MB 0.05s
Starting a goroutine takes 1-2µs on my laptop. From the "numbers every
Go programmer should know", I had 300ns for a channel send in my
mind. Turns out, on my laptop, it takes only 60ns. That's fast enough
to warrant the machinery of yet another channel with a fixed set of
worker goroutines feeding from it. The number chosen (8 for now) is
low enough to not really afflict a measurable overhead (a big
Prometheus server has >1000 goroutines running), but high enough to
not make sample ingestion a bottleneck.
- Parallelize AppendSamples as much as possible without breaking the
contract about temporal order.
- Allocate more fingerprint locker slots.
- Do not run early checkpoints if we are behind on chunk persistence.
- Increase fpMinWaitDuration to give the disk more time for more
important things.
Also, switch math.MaxInt64 and math.MinInt64 to the new constants.
- Increase samplesQueueCapacity.
- Improve docstring for the above.
- Accept a short waiting period for the ingest channel to become
ready. This should depend on the http timeout, but 100ms is probably
good enough to cushion bursts bigger than samplesQueueCapacity,
while it is unlikely that anybody ever will set an HTTP timeout
similarly short.
This is now not even trying to throttle in a benign way, but creates a
fully-fledged error. Advantage: It shows up very visible on the status
page. Disadvantage: The server does not really adjusts to a lower
scraping rate. However, if your ingestion backs up, you are in a very
irregulare state, I'd say it _should_ be considered an error and not
dealt with in a more graceful way.
In different news: I'll work on optimizing ingestion so that we will
not as easily run into that situation in the first place.
- original series data is saved so it can be re-transformed after
Rickshaw's stacking modified the series data
- always reconstruct graphs from scratch instead of updating the
settings of an existing one (simplification)
- always wipe and recreate all graph-related DOM elements completely so
that no left-over event handlers cause background event handlers
The simple algorithm applied here will increase the actual interval
incrementally, whenever and as long as the scrape itself takes longer
than the configured interval. Once it takes shorter again, the actual
interval will iteratively decrease again.
Also, set a much higher default value.
Chunk persist requests can be quite spiky. If you collect a large
number of time series that are very similar, they will tend to finish
up a chunk at about the same time. There is no reason we need to back
up scraping just because of that. The rationale of the new default
value is "1/8 of the chunks in memory".
This is related to #454. Queries now timeout after a duration set by
the -query.timeout flag. The TotalEvalTimer is now started/stopped
inside any of the ast.Eval* functions.
When Rickshaw was updated to 1.5.1 in
fd43daf82e,
the Rickshaw upstream package now contained 3 different D3 files:
d3.min.js
d3.v2.js
d3.v3.js
For details on why that is, see
https://groups.google.com/forum/#!topic/d3-js/lXQgKA7mtEw
For the 1.5.1 Rickshaw to work properly (being able to format dates with
D3 without causing a JS error), it needs d3.v2.js or d3.v3.js, not the
d3.min.js one. I chose to update us to d3.v3.js now, since that is the
most recent and minified version, and I didn't see any problems with it
(also, the current Rickshaw examples are using that D3 version).
Currently, displaying graphs with a range >14d is broken. This fixes
that.