The Client type is already exposed, but it can't be used without its
config also being exposed. Using the remote.Client from other programs is
useful for running full end-to-end tests of Prometheus's remote protocol
against adapter implementations.
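A minimal sketch of what such an external end-to-end test program could
look like. The constructor, config fields, and Store method used here are
assumptions for illustration (check the actually exported API for the real
names); the adapter URL is a placeholder:

    package main

    import (
        "log"
        "net/url"
        "time"

        "github.com/prometheus/common/model"
        "github.com/prometheus/prometheus/storage/remote"
    )

    func main() {
        // Placeholder: the write endpoint of the adapter under test.
        u, err := url.Parse("http://localhost:9201/write")
        if err != nil {
            log.Fatal(err)
        }

        // Field and method names below are assumptions, not the actual API.
        client, err := remote.NewClient(0, &remote.ClientConfig{
            URL:     u,
            Timeout: 30 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }

        // Send one sample through the full remote write path and fail the
        // test run on any error.
        err = client.Store(model.Samples{
            &model.Sample{
                Metric:    model.Metric{model.MetricNameLabel: "e2e_test_metric"},
                Value:     42,
                Timestamp: model.TimeFromUnixNano(time.Now().UnixNano()),
            },
        })
        if err != nil {
            log.Fatalf("remote write against adapter failed: %v", err)
        }
    }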
We would overscan when a requested timestamp hit a sample value directly,
with further samples interspersed between the requested timestamps.
Apparently, that happens rarely enough that it was only noticed recently.
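Purely to illustrate the boundary condition, a generic lookup sketch (not
the actual iterator code): when the requested timestamp hits a stored
sample exactly, the lookup returns that sample directly instead of
scanning past it.

    package main

    import (
        "fmt"
        "sort"
    )

    type sample struct {
        t int64   // timestamp
        v float64 // value
    }

    // valueAt returns the sample at timestamp t, or the closest sample
    // before it. The bool is false if no sample exists at or before t.
    func valueAt(samples []sample, t int64) (sample, bool) {
        // Index of the first sample with a timestamp after t.
        i := sort.Search(len(samples), func(i int) bool { return samples[i].t > t })
        if i == 0 {
            return sample{}, false
        }
        // samples[i-1] is the exact hit or the closest preceding sample;
        // nothing beyond index i-1 ever needs to be inspected.
        return samples[i-1], true
    }

    func main() {
        s := []sample{{10, 1}, {20, 2}, {30, 3}}
        fmt.Println(valueAt(s, 20)) // exact hit: {20 2} true
        fmt.Println(valueAt(s, 25)) // closest preceding sample: {20 2} true
        fmt.Println(valueAt(s, 5))  // nothing at or before t: {0 0} false
    }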
This can happen when the system scales up the number of shards massively
(to deal with some backlog) and then scales them down again, because the
number of samples sent during that period is less than the number received.
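A simplified sketch of rate-based shard scaling of that kind; the real
queue manager uses smoothed rates and more state, so the names and numbers
here are illustrative only:

    package main

    import "fmt"

    // desiredShards scales the shard count with the ratio of incoming to
    // outgoing sample rates, clamped to [1, maxShards].
    func desiredShards(inRate, outRate float64, current, maxShards int) int {
        if outRate <= 0 {
            return current
        }
        d := int(float64(current) * inRate / outRate)
        if d < 1 {
            d = 1
        }
        if d > maxShards {
            d = maxShards
        }
        return d
    }

    func main() {
        // Backlog: far more samples received than sent, so scale up sharply...
        fmt.Println(desiredShards(5000, 500, 2, 100)) // 20
        // ...then sending catches up and the shard count comes down again.
        fmt.Println(desiredShards(500, 1000, 20, 100)) // 10
    }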
* Compress remote storage requests and responses with unframed/raw snappy, for compatibility with other languages (see the snappy sketch below).
* Remove backwards compatibility code from remote_storage_adapter, update example_write_adapter
* Add /documentation/examples/remote_storage/example_write_adapter/example_writer_adapter to .gitignore
* Use request.Context() instead of a global map of contexts (see the handler sketch below).
* Add some basic opentracing instrumentation on the query path.
* Remove tracehandler endpoint.
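To illustrate the snappy point from above, a rough sketch using the
golang/snappy package: the unframed/raw block format (snappy.Encode and
snappy.Decode) is what snappy libraries in most other languages implement,
whereas the framed stream format (snappy.NewWriter) is far less widely
supported. The endpoint and payload are placeholders:

    package main

    import (
        "bytes"
        "log"
        "net/http"

        "github.com/golang/snappy"
    )

    func main() {
        // Placeholder for a serialized WriteRequest protobuf.
        payload := []byte("serialized WriteRequest goes here")

        // Unframed/raw block format: this is what most snappy
        // implementations in other languages speak. (snappy.NewWriter
        // would produce the framed stream format instead.)
        compressed := snappy.Encode(nil, payload)

        // Placeholder adapter URL.
        resp, err := http.Post("http://localhost:9201/write",
            "application/x-protobuf", bytes.NewReader(compressed))
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        log.Println("status:", resp.Status)
    }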
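And a minimal sketch of the handler pattern behind the context and tracing
items, assuming the opentracing-go package; the handler path and operation
name are made up for illustration:

    package main

    import (
        "log"
        "net/http"

        opentracing "github.com/opentracing/opentracing-go"
    )

    func queryHandler(w http.ResponseWriter, r *http.Request) {
        // The request's own context replaces the old global map of
        // contexts; it is cancelled automatically when the client goes away.
        ctx := r.Context()

        // Basic opentracing instrumentation on the query path.
        span, ctx := opentracing.StartSpanFromContext(ctx, "query")
        defer span.Finish()

        select {
        case <-ctx.Done():
            http.Error(w, ctx.Err().Error(), http.StatusServiceUnavailable)
        default:
            // ... evaluate the query using ctx ...
            w.Write([]byte("ok"))
        }
    }

    func main() {
        http.HandleFunc("/api/v1/query", queryHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }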
- checkpointSeriesMapAndHeads now accepts a context to allow
cancellation (see the sketch after this list).
- If a shutdown is initiated, cancel the ongoing checkpoint. (We will
create a final checkpoint anyway.)
- Always wait for at least as long as the last checkpoint took before
starting the next checkpoint (to cap the time spent checkpointing
at 50%).
- If an error has occurred during checkpointing, don't bother to sync
the write.
- Make sure the temporary checkpoint file is deleted, even if an error
has occurred.
- Clean up the checkpoint loop a bit. (The concurrent Timer.Reset(0)
call might have caused a race.)
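A condensed sketch of the loop shape the list above describes; names and
file handling are simplified and this is not the actual storage code:

    package main

    import (
        "context"
        "log"
        "os"
        "time"
    )

    func checkpointLoop(ctx context.Context, interval time.Duration) {
        wait := interval
        for {
            select {
            case <-ctx.Done():
                // Shutdown initiated: stop here; a final checkpoint is
                // created during shutdown anyway.
                return
            case <-time.After(wait):
            }

            start := time.Now()
            if err := checkpoint(ctx); err != nil {
                log.Println("checkpoint failed:", err)
            }
            took := time.Since(start)

            // Wait at least as long as this checkpoint took before starting
            // the next one, capping the time spent checkpointing at ~50%.
            wait = interval
            if took > wait {
                wait = took
            }
        }
    }

    func checkpoint(ctx context.Context) (err error) {
        f, err := os.Create("checkpoint.tmp")
        if err != nil {
            return err
        }
        defer func() {
            f.Close()
            if err != nil {
                // Make sure the temporary file is deleted even on error.
                os.Remove("checkpoint.tmp")
            }
        }()

        // ... write the series map and heads here, checking ctx.Err()
        // periodically so an ongoing checkpoint can be cancelled ...
        if err = ctx.Err(); err != nil {
            return err
        }

        // Had an error occurred above, we would have returned already and
        // not bothered to sync the write.
        if err = f.Sync(); err != nil {
            return err
        }
        return os.Rename("checkpoint.tmp", "checkpoint")
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        go checkpointLoop(ctx, 5*time.Second)
        time.Sleep(12 * time.Second) // let a couple of checkpoints run
        cancel()                     // simulate shutdown
    }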
Fixes #2480. For a certain definition of "fixes".
This is something that should never happen. Sadly, it does happen,
albeit extremely rarely. This could be some weird corner case we
haven't covered yet. Or it happens as a consequence of data
corruption or a crash recovery gone bad.
This is not a "real" fix as we don't know the root cause of the
incident reported in #2480. However, this makes sure the server does
not crash, but deals gracefully with the problem: The series in
question is quarantined, which even makes it available for forensics.
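A generic sketch of the "don't crash, quarantine" pattern, with made-up
names rather than the actual storage types:

    package main

    import (
        "errors"
        "log"
    )

    var errInconsistentState = errors.New("inconsistent series state")

    // handleSeries sketches the pattern: instead of panicking on a state
    // that "should never happen", quarantine the series so the server keeps
    // running and the data stays available for forensics.
    func handleSeries(name string, consistent bool, quarantine func(string, error)) {
        if !consistent {
            // Previously this state would have crashed the server.
            quarantine(name, errInconsistentState)
            return
        }
        // ... normal processing ...
    }

    func main() {
        quarantine := func(name string, err error) {
            log.Printf("quarantining series %q: %v (kept on disk for forensics)", name, err)
        }
        handleSeries("http_requests_total", false, quarantine)
    }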
An unopenable archived_fingerprint_to_timerange is simply deleted and
will be rebuilt during crash recovery (which can then take quite some time).
An unopenable archived_fingerprint_to_metric is not deleted but
instructions to the user are logged. A deletion has to be done by the
user explicitly as it means losing all archived series (and a repair
with a 3rd party tool might still be possible).
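A rough sketch of that decision logic, with made-up paths and a
placeholder open function rather than the real index code:

    package main

    import (
        "log"
        "os"
    )

    // openIndex stands in for opening one of the LevelDB-backed indexes.
    func openIndex(path string) error {
        _, err := os.Stat(path)
        return err
    }

    func main() {
        // archived_fingerprint_to_timerange can be rebuilt from other data,
        // so an unopenable copy is simply deleted; crash recovery rebuilds
        // it (which can take quite some time).
        if err := openIndex("data/archived_fingerprint_to_timerange"); err != nil {
            log.Printf("cannot open archived_fingerprint_to_timerange: %v; deleting it, it will be rebuilt during crash recovery", err)
            os.RemoveAll("data/archived_fingerprint_to_timerange")
        }

        // archived_fingerprint_to_metric cannot be rebuilt; deleting it
        // means losing all archived series, so only log instructions and
        // leave the deletion (or a repair with a 3rd-party tool) to the user.
        if err := openIndex("data/archived_fingerprint_to_metric"); err != nil {
            log.Printf("cannot open archived_fingerprint_to_metric: %v; delete it manually to drop all archived series, or try a LevelDB repair tool first", err)
            os.Exit(1)
        }
    }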
Sadly, we have a number of places where we use varint encoding for
numbers that cannot be negative. We could have saved a bit by using
uvarint encoding. On the bright side, we now have a 50% chance to
detect data corruption. :-/
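To illustrate with encoding/binary: varint (zig-zag) encoding spends a bit
on the sign that uvarint would have used for the value, but a decoded
negative value for a field that can never be negative is a clear sign of
corruption:

    package main

    import (
        "encoding/binary"
        "fmt"
        "log"
    )

    func main() {
        const n = 100 // e.g. a length field; can never be negative

        // Varint zig-zag-encodes, spending a bit on the sign...
        vbuf := make([]byte, binary.MaxVarintLen64)
        fmt.Println("varint bytes: ", binary.PutVarint(vbuf, n)) // 2

        // ...while uvarint would have used that bit for the value.
        ubuf := make([]byte, binary.MaxVarintLen64)
        fmt.Println("uvarint bytes:", binary.PutUvarint(ubuf, n)) // 1

        // The silver lining: when decoding a field that must not be
        // negative, a negative result flags corruption (roughly a 50%
        // chance of catching a corrupted value).
        v, _ := binary.Varint(vbuf)
        if v < 0 {
            log.Fatal("data corruption: negative value in non-negative field")
        }
        fmt.Println("decoded:", v)
    }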
Fixes #1800 and #2492.
This is in line with the v1.5 change in paradigm to not keep
chunk.Descs without chunks around after a series maintenance.
It's mainly motivated by avoiding excessive amounts of RAM usage
during crash recovery.
The code avoids creating memory time series with zero chunk.Descs, as
that is prone to trigger weird effects. (Series maintenance would
archive series with zero chunk.Descs, but we cannot do that here
because the archive indices still have to be checked.)
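A generic sketch of that guard, with made-up types (the real recovery code
is more involved): only materialize an in-memory series if it still has
chunk descriptors, and keep track of the rest so the archive indexes can
still be checked for them:

    package main

    import "fmt"

    type chunkDesc struct{ firstTime, lastTime int64 }

    type memorySeries struct {
        name       string
        chunkDescs []chunkDesc
    }

    // recoverSeries does not create in-memory series with zero chunk
    // descriptors (prone to trigger weird effects); it returns their names
    // separately so the archive indexes can still be checked for them.
    func recoverSeries(descs map[string][]chunkDesc) (inMemory []*memorySeries, checkArchive []string) {
        for name, cds := range descs {
            if len(cds) == 0 {
                checkArchive = append(checkArchive, name)
                continue
            }
            inMemory = append(inMemory, &memorySeries{name: name, chunkDescs: cds})
        }
        return inMemory, checkArchive
    }

    func main() {
        series := map[string][]chunkDesc{
            "up":                  {{10, 20}},
            "http_requests_total": {}, // nothing recovered: don't create in memory
        }
        mem, archive := recoverSeries(series)
        fmt.Println(len(mem), archive) // 1 [http_requests_total]
    }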