Right now the /alerts page of Prometheus sorts alerts by severity
(firing, pending, inactive). Once multiple alerts have the same
severity, their order seems to correlate to how they are placed in the
configuration files, but not always. Looking at the code, we make use of
sort.Sort(), which is documented not to provide a stable sort. The
Less() function also only takes the alert state into account.
This change extends the Less() function to provide a lexicographic order
on both the alert state and the name. This means I can finally find the
alerts I'm looking for without using my browser's search feature.
Keeping these around has two problems:
1) Each desc takes 64 bytes, 10 of them is 640B. This is a lot of
overhead on a 1024 byte chunk.
2) It can take well over a week to reach a point where this and thus
Prometheus memory usage as a whole enters steady state. This makes RAM
estimation very hard for users, and makes it difficult to investigate
things like memory fragmentation.
Instead we'll wipe them during each memory series maintenance cycle, and
if a query pulls them in they'll hang around as cache until the next
cycle.
This imposes a hard limit on the number of samples ingested from the
target. This is counted after metric relabelling, to allow dropping of
problemtic metrics.
This is intended as a very blunt tool to prevent overload due to
misbehaving targets that suddenly jump in sample count (e.g. adding
a label containing email addresses).
Add metric to track how often this happens.
Fixes#2137
Introduce two new relabel actions. labeldrop, and labelkeep.
These can be used to filter the set of labels by matching regex
- labeldrop: drops all labels that match the regex
- labelkeep: drops all labels that do not match the regex
Two cases:
- An unarchived metric must have at least one chunk desc loaded upon
unarchival. Otherwise, the file is gone or has size 0, which is an
inconsistency (because the series is still indexed in the archive
index). Hence, quarantining is triggered.
- If loading the chunk descs of a series with a known chunkDescsOffset
(i.e. != -1), the number of chunks loaded must be equal to
chunkDescsOffset. If not, there is a data corruption. An error is
returned, which leads to qurantining.
In any case, there is a guard added to not access the 1st element of
an empty chunkDescs slice. (That's what triggered the crashes in issue
2249.) A time series with unknown chunkDescsOffset and no chunks in
memory and no chunks on disk either could trigger that case. I would
assume such a "null series" doesn't exist, but it's not entirely
unthinkable and unreasonable to happen (perhaps in future uses of the
storage). (Create a series, and then something tries to preload chunks
before the first sample is added.)
We are writing federation responses streaming. So after
the first byte we wrote, the status header is fixed. We cannot
return an HTTP error for intermediate error but should just abort
and log instead.