This is another corner-case that was previously never exercised
because the rewriting of a series file was never prevented by the
shrink ratio.
Scenario: There is an existing series on disk, which is archived. If a
new sample comes in for that file, a new chunk in memory is created,
and the chunkDescsOffset is set to -1. If series maintenance happens
before the series has at least one chunk to persist _and_ an
insufficient chunks on disk is old enough for purging (so that the
shrink ratio kicks in), dropAndPersistChunks would return 0, but it
should return the chunk length of the series file.
Also, in that code path, set chunkDescsOffset to 0 rather than -1 in
case of "dropped more chunks from persistence than from memory" so
that no other weird things happen before the series is quarantined for
good.
The append call may reuse cds, and thus change its len.
(In practice, this wouldn't happen as cds should have len==cap.
Still, the previous order of lines was problematic.)
This decreases checkpoint size by not checkpointing things
that don't actually need checkpointing.
This is fully compatible with the v2 checkpoint format,
as it makes series appear as though the only chunksdescs
in memory are those that need persisting.
With this change the scraping caches series references and only
allocates label sets if it has to retrieve a new reference.
pkg/textparse is used to do the conditional parsing and reduce
allocations from 900B/sample to 0 in the general case.
Add metrics around checkpointing and persistence
* Add a metric to say if checkpointing is happening,
and another to track total checkpoint time and count.
This breaks the existing prometheus_local_storage_checkpoint_duration_seconds
by renaming it to prometheus_local_storage_checkpoint_last_duration_seconds
as the former name is more appropriate for a summary.
* Add metric for last checkpoint size.
* Add metric for series/chunks processed by checkpoints.
For long checkpoints it'd be useful to see how they're progressing.
* Add metric for dirty series
* Add metric for number of chunks persisted per series.
You can get the number of chunks from chunk_ops,
but not the matching number of series. This helps determine
the size of the writes being made.
* Add metric for chunks queued for persistence
Chunks created includes both chunks that'll need persistence
and chunks read in for queries. This only includes chunks created
for persistence.
* Code review comments on new persistence metrics.
When a large Prometheus starts up fresh it can take many minutes
to warmup and clear out the index queue. A larger queue means less
blocking, bigger batches and cuts down startup time by ~50%.
Keeping these around has two problems:
1) Each desc takes 64 bytes, 10 of them is 640B. This is a lot of
overhead on a 1024 byte chunk.
2) It can take well over a week to reach a point where this and thus
Prometheus memory usage as a whole enters steady state. This makes RAM
estimation very hard for users, and makes it difficult to investigate
things like memory fragmentation.
Instead we'll wipe them during each memory series maintenance cycle, and
if a query pulls them in they'll hang around as cache until the next
cycle.
Two cases:
- An unarchived metric must have at least one chunk desc loaded upon
unarchival. Otherwise, the file is gone or has size 0, which is an
inconsistency (because the series is still indexed in the archive
index). Hence, quarantining is triggered.
- If loading the chunk descs of a series with a known chunkDescsOffset
(i.e. != -1), the number of chunks loaded must be equal to
chunkDescsOffset. If not, there is a data corruption. An error is
returned, which leads to qurantining.
In any case, there is a guard added to not access the 1st element of
an empty chunkDescs slice. (That's what triggered the crashes in issue
2249.) A time series with unknown chunkDescsOffset and no chunks in
memory and no chunks on disk either could trigger that case. I would
assume such a "null series" doesn't exist, but it's not entirely
unthinkable and unreasonable to happen (perhaps in future uses of the
storage). (Create a series, and then something tries to preload chunks
before the first sample is added.)
This extracts Querier as an instantiateable and closeable object
rather than just defining extending methods of the storage interface.
This improves composability and allows abstracting query transactions,
which can be useful for transaction-level caches, consistent data views,
and encapsulating teardown.
When using the chunking code in other projects (both Weave Prism and
ChronixDB ingester), you sometimes want to know how well you are
utilizing your chunks when closing/storing them.
These more specific methods have replaced `metricForLabelMatchers`
in cases where its `map[fingerprint]metric` result type was
not necessary or was used as an intermediate step
Avoids duplicated calls to `seriesForRange` from
`QueryRange` and `QueryInstant` methods.
This is a followup to https://github.com/prometheus/prometheus/pull/2011.
This publishes more of the methods and other names of the chunk code and
moves the chunk code to its own package. There's some unavoidable
ugliness: the chunk and chunkDesc metrics are used by both packages, so
I had to move them to the chunk package. That isn't great, but I don't
see how to do it better without a larger redesign of everything. Same
for the evict requests and some other types.
* Add config, HTTP Basic Auth and TLS support to the generic write path.
- Move generic write path configuration to the config file
- Factor out config.TLSConfig -> tlf.Config translation
- Support TLSConfig for generic remote storage
- Rename Run to Start, and make it non-blocking.
- Dedupe code in httputil for TLS config.
- Make remote queue metrics global.
This is based on https://github.com/prometheus/prometheus/pull/1997.
This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
CPUs have to serialise write access to a single cache line
effectively reducing level of possible parallelism. Placing
mutexes on different cache lines avoids this problem.
Most gains will be seen on NUMA servers where CPU interconnect
traffic is especially expensive
Before:
go test . -run none -bench BenchmarkFingerprintLocker
BenchmarkFingerprintLockerParallel-4 2000000 932 ns/op
BenchmarkFingerprintLockerSerial-4 30000000 49.6 ns/op
After:
go test . -run none -bench BenchmarkFingerprintLocker
BenchmarkFingerprintLockerParallel-4 3000000 569 ns/op
BenchmarkFingerprintLockerSerial-4 30000000 51.0 ns/op
My aim is to support the new grpc generic write path in Frankenstein. On the surface this seems easy - however I've hit a number of problems that make me think it might be better to not use grpc just yet.
The explanation of the problems requires a little background. At weave, traffic to frankenstein need to go through a couple of services first, for SSL and to be authenticated. So traffic goes:
internet -> frontend -> authfe -> frankenstein
- The frontend is Nginx, and adds/removes SSL. Its done this way for legacy reasons, so the certs can be managed in one place, although eventually we imagine we'll merge it with authfe. All traffic from frontend is sent to authfe.
- Authfe checks the auth tokens / cookie etc and then picks the service to forward the RPC to.
- Frankenstein accepts the reads and does the right thing with them.
First problem I hit was Nginx won't proxy http2 requests - it can accept them, but all calls downstream are http1 (see https://trac.nginx.org/nginx/ticket/923). This wasn't such a big deal, so it now looks like:
internet --(grpc/http2)--> frontend --(grpc/http1)--> authfe --(grpc/http1)--> frankenstein
Next problem was golang grpc server won't accept http1 requests (see https://groups.google.com/forum/#!topic/grpc-io/JnjCYGPMUms). It is possible to link a grpc server in with a normal go http mux, as long as the mux server is serving over SSL, as the golang http client & server won't do http2 over anything other than an SSL connection. This would require making all our service to service comms SSL. So I had a go a writing a grpc http1 server, and got pretty far. But is was a bit of a mess.
So finally I thought I'd make a separate grpc frontend for this, running in parallel with the frontend/authfe combo on a different port - and first up I'd need a grpc reverse proxy. Ideally we'd have some nice, generic reverse proxy that only knew about a map from service names -> downstream service, and didn't need to decode & re-encode every request as it went through. It seems like this can't be done with golang's grpc library - see https://github.com/mwitkow/grpc-proxy/issues/1.
And then I was surprised to find you can't do grpc from browsers! See http://www.grpc.io/faq/ - not important to us, but I'm starting to question why we decided to use grpc in the first place?
It would seem we could have most of the benefits of grpc with protos over HTTP, and this wouldn't preclude moving to grpc when its a bit more mature? In fact, the grcp FAQ even admits as much:
> Why is gRPC better than any binary blob over HTTP/2?
> This is largely what gRPC is on the wire.
This adds a flag -storage.local.engine which allows turning off local
storage in Prometheus. Instead of adding if-conditions and nil checks to
all parts of Prometheus that deal with Prometheus's local storage
(including the web interface), disabling local storage simply means
replacing the normal local storage with a noop version that throws
samples away and returns empty query results. We also don't add the noop
storage to the fanout appender to decrease internal overhead.
Instead of returning empty results, an alternate behavior could be to
return errors on any query that point out that the local storage is
disabled. Not sure which one is more preferable, so I went with the
empty result option for now.
By splitting the single queue into multiple queues and flushing each individual queue in serially (and all queues in parallel), we can guarantee to preserve the order of timestampsin samples sent to downstream systems.
- fold metric name into labels
- return initialization errors back to main
- add snappy compression
- better context handling
- pre-allocation of labels
- remove generic naming
- other cleanups
This uses a new proto format, with scope for multiple samples per
timeseries in future. This will allow users to pump samples out to
whatever they like without having to change the core Prometheus code.
There's also an example receiver to save users figuring out the
boilerplate themselves.
Turns out its valid to have an overall chunk which is smaller than the
full doubleDeltaHeaderBytes size -- if it has a single sample, it
doesn't fill the whole header. Updated unmarshalling check to respect
this.
This is (hopefully) a fix for #1653
Specifically, this makes it so that if the length for the stored
delta/doubleDelta is somehow corrupted to be too small, the attempt to
unmarshal will return an error.
The current (broken) behavior is to return a malformed chunk, which can
then lead to a panic when there is an attempt to read header values.
The referenced issue proposed creating chunks with a minimum length -- I
instead opted to just error on the attempt to unmarshal, since I'm not
clear on how it could be safe to proceed when the length is
incorrect/unknown.
The issue also talked about possibly "quarantining series", but I don't
know the surrounding code well enough to understand how to make that
happen.