Also, fix problems in shutdown.
Starting serving and shutdown still has to be cleaned up properly.
It's a mess.
Change-Id: I51061db12064e434066446e6fceac32741c4f84c
Some other improvements on the way, in particular codec -> codable
renaming and addition of LookupSet methods.
Change-Id: I978f8f3f84ca8e4d39a9d9f152ae0ad274bbf4e2
Most important, the heads file will now persist all the chunk descs,
too. Implicitly, it will serve as the persisted form of the
fp-to-series map.
Change-Id: Ic867e78f2714d54c3b5733939cc5aef43f7bd08d
BinaryMarshaler instead of encodable.
BinaryUnmarshaler instead of decodable.
Left 'codable' in place for lack of a better word.
Change-Id: I8a104be7d6db916e8dbc47ff95e6ff73b845ac22
Large delta values often imply a difference between a large base value
and the large delta value, potentially resulting in small numbers with
a huge precision error. Since large delta values need 8 bytes anyway,
we are not even saving memory.
As a solution, always save the absoluto value rather than a delta once
8 bytes would be needed for the delta. Timestamps are then saved as 8
byte integers, while values are always saved as float64 in that case.
Change-Id: I01100d600515e16df58ce508b50982ffd762cc49
Go downloads moved to a different URL and require following redirects
(curl's '-L' option) now.
Go 1.3 deliberately randomizes ranges over maps, which uncovered some
bugs in our tests. These are fixed too.
Change-Id: Id2d9e185d8d2379a9b7b8ad5ba680024565d15f4
- Always spell out the time unit (e.g. milliseconds instead of ms).
- Remove "_total" from the names of metrics that are not counters.
- Make use of the "Namespace" and "Subsystem" fields in the options.
- Removed the "capacity" facet from all metrics about channels/queues.
These are all fixed via command line flags and will never change
during the runtime of a process. Also, they should not be part of
the same metric family. I have added separate metrics for the
capacity of queues as convenience. (They will never change and are
only set once.)
- I left "metric_disk_latency_microseconds" unchanged, although that
metric measures the latency of the storage device, even if it is not
a spinning disk. "SSD" is read by many as "solid state disk", so
it's not too far off. (It should be "solid state drive", of course,
but "metric_drive_latency_microseconds" is probably confusing.)
- Brian suggested to not mix "failure" and "success" outcome in the
same metric family (distinguished by labels). For now, I left it as
it is. We are touching some bigger issue here, especially as other
parts in the Prometheus ecosystem are following the same
principle. We still need to come to terms here and then change
things consistently everywhere.
Change-Id: If799458b450d18f78500f05990301c12525197d3
The first sort in groupByFingerprint already ensures that all resulting sample
lists contain only one fingerprint. We also already assume that all
samples passed into AppendSamples (and thus groupByFingerprint) are
chronologically sorted within each fingerprint.
The extra chronological sort is thus superfluous. Furthermore, this
second sort didn't only sort chronologically, but also compared all
metric fingerprints again (although we already know that we're only
sorting within samples for the same fingerprint). This caused a huge
memory and runtime overhead.
In a heavily loaded real Prometheus, this brought down disk flush times
from ~9 minutes to ~1 minute.
OLD:
BenchmarkLevelDBAppendRepeatingValues 5 331391808 ns/op 44542953 B/op 597788 allocs/op
BenchmarkLevelDBAppendsRepeatingValues 5 329893512 ns/op 46968288 B/op 3104373 allocs/op
NEW:
BenchmarkLevelDBAppendRepeatingValues 5 299298635 ns/op 43329497 B/op 567616 allocs/op
BenchmarkLevelDBAppendsRepeatingValues 20 92204601 ns/op 1779454 B/op 70975 allocs/op
Change-Id: Ie2d8db3569b0102a18010f9e106e391fda7f7883
This fixes the problem where samples become temporarily unavailable for
queries while they are being flushed to disk. Although the entire
flushing code could use some major refactoring, I'm explicitly trying to
do the minimal change to fix the problem since there's a whole new
storage implementation in the pipeline.
Change-Id: I0f5393a30b88654c73567456aeaea62f8b3756d9