Rationale:
* When the config is reloaded and the provider context is canceled, we need to
exit the current ZK `TargetProvider.Run` method as a new provider will be
instantiated.
* When `Stop` is called on the `ZookeeperTreeCache`, the update/events
channel must not be closed, as it is shared by multiple caches and would
thus be closed twice.
* Stopping all `zookeeperTreeCacheNode`s on teardown ensures all associated
watcher goroutines are stopped eagerly rather than implicitly on
connection close events.
The Kubernetes pod SD creates a target for each declared port, as documented:
https://prometheus.io/docs/operating/configuration/#pod
> The pod role discovers all pods and exposes their containers as targets. For
> each declared port of a container, a single target is generated. If a
> container has no specified ports, a port-free target per container is created
> for manually adding a port via relabeling.
This results in the default port being the declared port, or no port if none are
declared.
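The rule quoted above can be sketched in Go as follows. The `container` type
and `podTargets` function are hypothetical names used for illustration only
and are not the actual Kubernetes SD code.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// container is a hypothetical, simplified view of a pod container.
type container struct {
	name  string
	ports []int
}

// podTargets returns one address per declared port. A container without
// declared ports yields a single port-free address, to which a port can be
// added later via relabeling.
func podTargets(podIP string, containers []container) []string {
	var targets []string
	for _, c := range containers {
		if len(c.ports) == 0 {
			targets = append(targets, podIP)
			continue
		}
		for _, p := range c.ports {
			targets = append(targets, net.JoinHostPort(podIP, strconv.Itoa(p)))
		}
	}
	return targets
}

func main() {
	containers := []container{
		{name: "app", ports: []int{8080, 9090}},
		{name: "sidecar"}, // no declared ports
	}
	for _, t := range podTargets("10.0.0.1", containers) {
		fmt.Println(t)
	}
}
```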
* Force buckets in a histogram to be monotonic for quantile estimation
The assumption that bucket counts increase monotonically with increasing
upperBound may be violated during:
* Recording rule evaluation of histogram_quantile, especially when rate()
has been applied to the underlying bucket timeseries.
* Evaluation of histogram_quantile computed over federated bucket
timeseries, especially when rate() has been applied.
This is because scraped data is not made available to recording rule (RR)
evaluation or federation atomically, so some buckets are computed with
data from the N most recent scrapes, but the other buckets are missing
the most recent observations.
Monotonicity is usually guaranteed because if a bucket with upper bound
u1 has count c1, then any bucket with a higher upper bound u > u1 must
have counted all c1 observations and perhaps more, so that c >= c1.
Randomly interspersed partial sampling breaks that guarantee, and rate()
exacerbates it. Specifically, suppose bucket le=1000 has a count of 10 from
4 samples but the bucket with le=2000 has a count of 7, from 3 samples. The
monotonicity is broken. It is exacerbated by rate() because under normal
operation, cumulative counting of buckets will cause the bucket counts to
diverge such that small differences from missing samples are not a problem.
rate() removes this divergence.
bucketQuantile depends on that monotonicity to do a binary search for the
bucket with the qth percentile count, so breaking the monotonicity
guarantee causes bucketQuantile() to return undefined (nonsense) results.
As a somewhat hacky solution until the Prometheus project is ready to
accept the changes required to make scrapes atomic, we calculate the
"envelope" of the histogram buckets, essentially removing any decreases
in the count between successive buckets.
* Fix up comment docs for ensureMonotonic
* ensureMonotonic: Use switch statement
Use switch statement rather than if/else for better readability.
Process the most frequent cases first.
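A minimal Go sketch of the "envelope" computation described above, using the
example counts from this message. The `bucket` type and its field names are
assumptions made for illustration; only the function name `ensureMonotonic`
comes from this change, and the body is not claimed to match it line for line.

```go
package main

import "fmt"

// bucket is a hypothetical, simplified histogram bucket.
type bucket struct {
	upperBound float64
	count      float64
}

// ensureMonotonic walks the buckets in order of increasing upper bound and
// raises any count that falls below the highest count seen so far, so that
// the counts never decrease from one bucket to the next.
func ensureMonotonic(buckets []bucket) {
	if len(buckets) == 0 {
		return
	}
	highest := buckets[0].count
	for i := 1; i < len(buckets); i++ {
		switch {
		case buckets[i].count > highest:
			// Expected case: the cumulative count grew.
			highest = buckets[i].count
		case buckets[i].count < highest:
			// Monotonicity violated: lift the count up to the envelope.
			buckets[i].count = highest
		}
	}
}

func main() {
	buckets := []bucket{
		{upperBound: 1000, count: 10},
		{upperBound: 2000, count: 7}, // missing the most recent sample
		{upperBound: 3000, count: 12},
	}
	ensureMonotonic(buckets)
	for _, b := range buckets {
		fmt.Printf("le=%g count=%g\n", b.upperBound, b.count)
	}
}
```

With the input above, the le=2000 bucket is lifted from 7 to 10 while the
other two buckets are left unchanged.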
* Test Longer Tests in Travis
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
* Make test Target Run All Tests
* Add test-short to run short tests
The test target now runs all the tests, since we run make tests in
CircleCI and I think the base image is shared across the Prometheus org.
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
* Remove Empty Line
Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
- checkpointSeriesMapAndHeads accepts a context now to allow
cancelling.
- If a shutdown is initiated, cancel the ongoing checkpoint. (We will
create a final checkpoint anyway.)
- Always wait for at least as long as the last checkpoint took before
starting the next checkpoint (to cap the time spent checkpointing
at 50%).
- If an error has occurred during checkpointing, don't bother to sync
the write.
- Make sure the temporary checkpoint file is deleted, even if an error
has occurred.
- Clean up the checkpoint loop a bit. (The concurrent Timer.Reset(0)
call might have caused a race.)
Fixes #2480. For certain definitions of "fixes".
This is something that should never happen. Sadly, it does happen,
albeit extremely rarely. This could be some weird corner case we
haven't covered yet. Or it happens as a consequence of data
corruption or a crash recovery gone bad.
This is not a "real" fix as we don't know the root cause of the
incident reported in #2480. However, this makes sure the server does
not crash, but deals gracefully with the problem: The series in
question is quarantined, which even makes it available for forensics.
An unopenable archived_fingerprint_to_timerange is simply deleted and
will be rebuilt during crash recovery (which can then take quite some time).
An unopenable archived_fingerprint_to_metric is not deleted but
instructions to the user are logged. A deletion has to be done by the
user explicitly as it means losing all archived series (and a repair
with a 3rd party tool might still be possible).