This change switches the remote_write API to use the TSDB WAL. This should reduce memory usage and prevent sample loss when the remote end point is down.
We use the new LiveReader from TSDB to tail WAL segments. Logic for finding the tracking segment is included in this PR. The WAL is tailed once for each remote_write endpoint specified. Reading from the segment is based on a ticker rather than relying on fsnotify write events, which were found to be complicated and unreliable in early prototypes.
Enqueuing a sample for sending via remote_write can now block, to provide back pressure. Queues are still required to acheive parallelism and batching. We have updated the queue config based on new defaults for queue capacity and pending samples values - much smaller values are now possible. The remote_write resharding code has been updated to prevent deadlocks, and extra tests have been added for these cases.
As part of this change, we attempt to guarantee that samples are not lost; however this initial version doesn't guarantee this across Prometheus restarts or non-retryable errors from the remote end (eg 400s).
This changes also includes the following optimisations:
- only marshal the proto request once, not once per retry
- maintain a single copy of the labels for given series to reduce GC pressure
Other minor tweaks:
- only reshard if we've also successfully sent recently
- add pending samples, latest sent timestamp, WAL events processed metrics
Co-authored-by: Chris Marchbanks <csmarchbanks.com> (initial prototype)
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com> (sharding changes)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
1. Added an ability to resize text area on mouseclick
2. Remember selected target status button on page reload
Signed-off-by: Maria Nemtinova <nemtinovamasha@gmail.com>
* web: updated bootstrap3-typeahead file to work with bootstrap 4.0.0
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: Replaced bootstrap-3.3.1 with bootstrap 4.0.0
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: Added bootstrap4-glyphicons as 4.0.0 doesnt include bootstrap3 glyphicons
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated js jquery to 3.3.1
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated _base.html to import new bootstrap 4.0.0, jquery3.3.1 and bootstrap class tags to be 4.0 compatible
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: _base.html missed word out in title tag (Server).
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated alerts.html class names and tags to be bootstrap 4 compatible.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated config.html class names and tags to be bootstrap 4 compatible.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated flags.html class names and tags to be bootstrap 4 compatible.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated service-discovery.html class names and tags to be bootstrap 4 compatible.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated status.html class names and tags to be bootstrap 4 compatible.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated targets.html class names and tags to be bootstrap 4 compatible.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: updated graph_template.handlebar class names and tags to be bootstrap 4 compatible.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: alerts.css fix for button color inheritance on alerts page.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: graph.css fix for color inheritance.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: prometheus.css updated to fix nav bar.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* web: previous merge conflict not fixed correctly on _base.html
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* menu.lib and prom.lib imports updated
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* bootstrap 4.1.3 imported
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Bootstrap 4.1.3 imported into _base.html
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* bootstrap 4.1.3 imported into prom.lib
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* menu.lib style adjusted to view sidebar
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Alert colour uplifted to bootstrap 4.1.3
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Alerts display code reformatted similarly to config
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Consoles pages adjusted to account for new navbar
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* LHS Menu fixed in console pages
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Minor changes to prom_console to adjust lhs nav
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Prom.lib and some css updated to fix console graph controls
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Bootstrap 4.0.0 files removed
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Consoles configured so that the graph fits with the new side bar, css files also adjusted
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Import popper.min.js for dropdowns
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Popper.min.js imported locally
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Re-added #4764 and fixed css
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Removed .DS_Store
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Rebuilt assets
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* Spaces between buttons and inputs on graph page removed
Signed-off-by: ksherryBAE <kieran.sherry@baesystems.com>
* fixed spacing in buttons on /targets
Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>
* Updated vfsdata.go
Signed-off-by: Pritam Bhudia <pritam.bhudia@baesystems.com>
* fixed typeahead issue
Signed-off-by: James Ritchie <james.g.ritchie@baesystems.com>
* added css for dropdown
Signed-off-by: James Ritchie <james.g.ritchie@baesystems.com>
* changed order of css imports
Signed-off-by: James Ritchie <james.g.ritchie@baesystems.com>
* tinkered with CSS changes to make keyboard select and mouseover match
Signed-off-by: James Ritchie <james.g.ritchie@baesystems.com>
* *: bump gRPC dependencies
This change updates the gRPC dependencies to more recent versions:
* github.com/gogo/protobuf => v1.2.0
* github.com/grpc-ecosystem/grpc-gateway => v1.6.3
* google.golang.org/grpc => v1.17.0
In addition scripts/genproto.sh leverages Go modules information instead of
hardcoding SHA1 commits. This ensures that the code is generated from
the exact same sources.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Run 'make proto' in CI
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Revert tabs -> spaces change
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Fix 'make proto' step
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* 'go get' grpc/protobuf dependencies
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Prepopulate cache with go mod download
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* *: use latest release of staticcheck
It also fixes a couple of things in the code flagged by the additional
checks.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Use official release of staticcheck
Also run 'go list' before staticcheck to avoid failures when downloading packages.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* added `Copy to clipboard` button
Signed-off-by: Stafford Williams <stafford.williams@gmail.com>
* generate vsfdata
Signed-off-by: Stafford Williams <stafford.williams@gmail.com>
* new lines
Signed-off-by: Stafford Williams <stafford.williams@gmail.com>
* single newline
Signed-off-by: Stafford Williams <stafford.williams@gmail.com>
When a metric has a null value, number formatters like
`humanizeNoSmallPrefix` will throw "Uncaught TypeError: Cannot read
property 'toPrecision' of null".
This is fixed by explicitly checking for `null` and returning the string
"null".
Note: This is usually not seen as rickshaw doesn't show annotations for
null values, but still calls the formatter.
Signed-off-by: David Coles <coles.david@gmail.com>
* update promlog to latest version
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* Update api tests, fix main setup
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* tidy go.sum
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* revendor prometheus/common
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* only initialize config; use kingpin for remote_storage_adapter
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* actually parse the flags
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* clean up imports
Signed-off-by: Alex Yu <yu.alex96@gmail.com>
* web: added ability to set page title through flag.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Reformatted variable names and Flag description for readability.
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* assets_vfsdata.go
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* Flag name changed from web.ui-title to web.page-title
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
* make assets
Signed-off-by: Andrew Chiu <andrew.chiu2@baesystems.com>
By default the gRPC client of the REST API gateway relies on the
HTTP_PROXY variable to connect to the local gRPC server which isn't
desired as the server runs in the same process. This change uses a
custom dialer that connects directly to the server's address.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* *: move to go 1.11
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Reduce number of places where we specify the Go version
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Add evaluationTimestamp (Last Evaluation) column to display on /rules
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
* Add lastScrapeDuration ("Scrape Duration") to display on /targets
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
* Updates based on Julius' feedback
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
* Update to set timestamp to when eval started (after eval completes)
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
* Update /rules to display time since last evaluation
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
* Re-order Last Eval/Eval Time to be consistent with targets page
Signed-off-by: Will Hegedus <wbhegedus@liberty.edu>
With the addition of the errors in the views list, it is now difficult
to have a view on all the rules in a screen witdh.
This commit adds wrapping to improve the overall display of the rules
page.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
The scrape manage receiver's channel now just saves the target sets
and another backgorund runner updates the scrape loops every 5 seconds.
This is so that the scrape manager doesn't block the receiving channel
when it does the long background reloading of the scrape loops.
Active and dropped targets are now saved in each scrape pool instead of
the scrape manager. This is mainly to avoid races when getting the
targets via the web api.
When reloading the scrape loops now happens in parallel to speed up the
final disared state and this also speeds up the prometheus's shutting
down.
Also updated some funcs signatures in the web package for consistency.
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
* web: fix asset paths for Windows platforms
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* web: add tests
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Limit the number of samples remote read can return.
- Return 413 entity too large.
- Limit can be set be a flag. Allow 0 to mean no limit.
- Include limit in error message.
- Set default limit to 50M (* 16 bytes = 800MB).
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
When prom2 came out the storage querier interface consolidated to a
single Select() method. While doing this it makes it impossible as the
implementer of the querier to know if you are being called for metadata
or actual data. The workaround has been to check if the SelectParams are
nil, which the federation call is always nil. This has 2 negative
consequences (1) remote implementations interpret this as a metadata
call, which makes the federation endpoint return nothing. (2) this means
that the storage implementations don't get the same information passed
down to them as far as SelectParams goes.
This diff simply adds SelectParams to the Select() call in the
federation handler
Mitigation for #4057
Signed-off-by: Thomas Jackson <jacksontj.89@gmail.com>
Looking at https://tech.townsourced.com/post/embedding-static-files-in-go/ (which was mentioned in the issue), vfsgen has all the needed features.
In particular:
- Reproducible builds (no issue with timestamping).
- Well maintained and relatively popular.
- Integration with go generate.
- Self-contained (no external dependency).
* [WIP] Replace go-bindata by vfsgen
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Add license + remove doc.go
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Generate templates assets
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Use new templates assets
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* split static assets
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Idempotent make assets
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update vendor/
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* vendor vfsgendev
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Update README.md
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Simplify assets generation
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Fix README.md
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Use generate helper program instead of vfsgen
This avoids installing vfsgendev in the target environment.
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Remove unused vfsgen package
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Fix Makefile
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* vendoring shurcooL/vfsgen
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Fix go generate command
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Sync web/ui/assets_vfsdata.go
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
There are many more (mostly finalizers like Close/Stop/etc.), but most of
the others seemed like one couldn't do much about them anyway.
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* adding information about the health and errors for Rules
adding Health() and LastError() to the Rule interface. This will allow
us to easily surface information about rules.
Signed-off-by: noqcks <benny@noqcks.io>
* updating rules.html with fields for Rule errors and health state
Signed-off-by: noqcks <benny@noqcks.io>
* fix code comment grammar & access Rule health/error info using a mutex
Signed-off-by: noqcks <benny@noqcks.io>
* s/Errors/Error/ in rules.html to remain consistent with targets.html
Signed-off-by: noqcks <benny@noqcks.io>
* adding periods to code comments in reporting/alerting
Signed-off-by: noqcks <benny@noqcks.io>
* putting health/error below mutex in struct field
Signed-off-by: noqcks <benny@noqcks.io>
It was added 5 years ago by Matt and I'm not sure anyone ever used
it after public release (since we have /debug/pprof/heap as well).
It also lacked error checking and allows people to write to disk over HTTP.
Signed-off-by: Julius Volz <julius.volz@gmail.com>
* Allow for BufferedSeriesIterator instances to be created without an underlying iterator, to simplify their usage.
Signed-off-by: Alin Sinpalean <alin.sinpalean@gmail.com>
* Add Start/End to SelectParams
* Make remote read use the new selectParams for start/end
This commit will continue sending the start/end time of the remote read
query as the overarching promql time and the specific range of data that
the query is intersted in receiving a response to is now part of the
ReadHints (upstream discussion in #4226).
* Remove unused vendored code
The genproto.sh script was updated, but the code wasn't regenerated.
This simply removes the vendored deps that are no longer part of the
codegen output.
Signed-off-by: Thomas Jackson <jacksontj.89@gmail.com>
This adds a per-target cache of scraped metadata. The metadata is only
available for the lifecycle of the attached target. An API endpoint allows
to select metadata by metric name and a label selection of targets.
Signed-off-by: Fabian Reinartz <freinartz@google.com>
Displaying all the dropped targets in the service-discovery page hurts
the Prometheus server as well as the browser when thousands of dropped
targets exist. This change limits this number to 1,000 and display the
number of active/total targets per scrape configuration.
Add warning when more than 100 targets are dropped
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
* Move range logic to 'eval'
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make aggregegate range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* PromQL is statically typed, so don't eval to find the type.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Extend rangewrapper to multiple exprs
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Start making function evaluation ranged
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make instant queries a special case of range queries
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Eliminate evalString
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Evaluate range vector functions one series at a time
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make unary operators range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make binops range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Pass time to range-aware functions.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make simple _over_time functions range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Reduce allocs when working with matrix selectors
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Add basic benchmark for range evaluation
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Reuse objects for function arguments
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Do dropmetricname and allocating output vector only once.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Add range-aware support for range vector functions with params
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Optimise holt_winters, cut cpu and allocs by ~25%
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make rate&friends range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make more functions range aware. Document calling convention.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make date functions range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make simple math functions range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Convert more functions to be range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make more functions range aware
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Specialcase timestamp() with vector selector arg for range awareness
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Remove transition code for functions
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Remove the rest of the engine transition code
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Remove more obselete code
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Remove the last uses of the eval* functions
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Remove engine finalizers to prevent corruption
The finalizers set by matrixSelector were being called
just before the value they were retruning to the pool
was then being provided to the caller. Thus a concurrent query
could corrupt the data that the user has just been returned.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Add new benchmark suite for range functinos
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Migrate existing benchmarks to new system
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Expand promql benchmarks
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Simply test by removing unused range code
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* When testing instant queries, check range queries too.
To protect against subsequent steps in a range query being
affected by the previous steps, add a test that evaluates
an instant query that we know works again as a range query
with the tiimestamp we care about not being the first step.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Reuse ring for matrix iters. Put query results back in pool.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Reuse buffer when iterating over matrix selectors
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Unary minus should remove metric name
Cut down benchmarks for faster runs.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Reduce repetition in benchmark test cases
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Work series by series when doing normal vectorSelectors
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Optimise benchmark setup, cuts time by 60%
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Have rangeWrapper use an evalNodeHelper to cache across steps
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Use evalNodeHelper with functions
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Cache dropMetricName within a node evaluation.
This saves both the calculations and allocs done by dropMetricName
across steps.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Reuse input vectors in rangewrapper
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Reuse the point slices in the matrixes input/output by rangeWrapper
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make benchmark setup faster using AddFast
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Simplify benchmark code.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Add caching in VectorBinop
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Use xor to have one-level resultMetric hash key
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Add more benchmarks
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Call Query.Close in apiv1
This allows point slices allocated for the response data
to be reused by later queries, saving allocations.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Optimise histogram_quantile
It's now 5-10% faster with 97% less garbage generated for 1k steps
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make the input collection in rangeVector linear rather than quadratic
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Optimise label_replace, for 1k steps 15x fewer allocs and 3x faster
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Optimise label_join, 1.8x faster and 11x less memory for 1k steps
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Expand benchmarks, cleanup comments, simplify numSteps logic.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Address Fabian's comments
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Comments from Alin.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Address jrv's comments
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Remove dead code
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Address Simon's comments.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Rename populateIterators, pre-init some sizes
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Handle case where function has non-matrix args first
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Split rangeWrapper out to rangeEval function, improve comments
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Cleanup and make things more consistent
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Make EvalNodeHelper public
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
* Fabian's comments.
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
Fix race by properly locking access to scrape pools. Use separate mutex for information needed by UI so that UI isn't blocked when targets are being updated.
* web: replace deprecated InstrumentHandler()
This change replaces the deprecated InstrumentHandler function by the
equivalent functions from the promhttp package.
The following metrics are removed:
* http_request_duration_microseconds (Summary).
* http_request_size_bytes (Summary).
* http_requests_total (Counter).
And the following metrics are added instead:
* prometheus_http_request_duration_seconds (Histogram).
* prometheus_http_response_size_bytes (Histogram).
* promhttp_metric_handler_requests_in_flight (Gauge).
* promhttp_metric_handler_requests_total (Counter).
* Update github.com/prometheus/common/route package
* web: refactor using the new prometheus/common/route package
After removing the checkbox in #3913 the only remaining element that
looked like it was the new Show Annotations checkbox on the Alerts page.
Which in turn didn't look like the Enable query history checkout on the
graph page. So:
1. This takes the Enable query history button as canonical.
2. Updates the show annotations button code to match it.
3. Simplifies the JS for the checkbox.
The new Service Discovery page uses the CSS/JS from the Targets page but
used slightly differently. This makes the job header match in the
Service Discovery page for a more consistent look-n-feel.
* Added only healthy to Targets
This adds a "Only heathly" button to supplement the "Only unhealthy"
button. The two are mutually exclusive.
I've also added a red/green text color to the buttons.
Arguably this could be a toggle instead if folks think this is
worthwhile... Happy to modify it.
* Moved functions above init
* Simplifed code and made prettier
* Appeased codeacy
* Made buttons square
* Fix JS error: cannot read source of undefined
When the page was refreshed with queries on the page,
the updateTypeaheadMetricsSet function was called before
the typeahead had been initialized.
* Fix: updates URL when query submits
When queries were submitted by pressing enter, the URL did not update
to reflect the change. Not sure why, but this was only the case when
the queries were non-simple, meaning when either labels werre specified
or other promql functions were used.
* Rebase master and make assets
This is a very minor UX change. The current "No Alert rules" present
table row has the `alert_header` class attached. This changes the cursor
and some other stuff and makes sense with the populated table but less
sense with the unpopulated table. So removing it the latter case.
This adds a parameter to the storage selection interface which allows
query engine(s) to pass information about the operations surrounding a
data selection.
This can for example be used by remote storage backends to infer the
correct downsampling aggregates that need to be provided.
When you have no alerting rules defined you get a screen sharing this
information in the WebUI. If no rules are defined then you instead see
an empty white screen. This adds a "No rules" defined `else` clause and
a `Rules` header to the page.
* Do not autoselect the first item in the dropdown
* Historical queries only show in dropdown when toggled on
* Move shared behavior to queryHistory.isEnabled function
* Do not auto submit selected history queries
net.Listener converts 0.0.0.0 to :: which fails for hosts where IPv6 is
disabled. This change uses the original listen address parameter instead
of grpcl.Addr().String().
Federation makes use of dedupedSeriesSet to merge SeriesSets for every
query into one output stream. If many match[] arguments are provided,
many dedupedSeriesSet objects will get chained. This has the downside of
causing a potential O(n*k) running time, where n is the number of series
and k the number of match[] arguments.
In the mean time, the storage package provides a mergeSeriesSet that
accomplishes the same with an O(n*log(k)) running time by making use of
a binary heap. Let's just get rid of dedupedSeriesSet and change all
existing callers to use mergeSeriesSet.
When there is an empty result set, the Prometheus server replies with
{"status":"success","data":{"resultType":"vector","result":null}}
That "null" reply was not handled correctly by the graphing library.
This commit handles that case and shows "no data" in the UI console view
instead of throwing an error in the browser javascript console.
Fixes#3515
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
API consumers should be able to get insight into the query run times.
The UI currently measures total roundtrip times. This PR allows for more
fine grained metrics to be exposed.
* adds new timer for total execution time (queue + eval)
* expose new timer, queue timer, and eval timer in stats field of the
range query response:
```json
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [],
"stats": {
"execQueueTimeNs": 4683,
"execTotalTimeNs": 2086587,
"totalEvalTimeNs": 2077851
}
}
}
```
* stats field is optional, only set when query parameter `stats` is not
empty
Try it via
```sh
curl 'http://localhost:9090/api/v1/query_range?query=up&start=1486480279&end=1486483879&step=14000&stats=true'
```
Review feedback
* moved query stats json generation to query_stats.go
* use seconds for all query timers
* expose all timers available
* Changed ExecTotalTime string representation from Exec queue total time to Exec total time
This PR fixes#3072 by providing POST endpoints for `query` and `query_range`.
POST request must be made with `Content-Type: application/x-www-form-urlencoded` header.
* Add UI warning for time drift >30 seconds
* Yellow time drift warning & better warning message
* Set warning threshold to 30 sec
* Include changed assets
* Re-add contexts to storage.Storage.Querier()
These are needed when replacing the storage by a multi-tenant
implementation where the tenant is stored in the context.
The 1.x query interfaces already had contexts, but they got lost in 2.x.
* Convert promql.Engine to use native contexts
No matter how we refactor docs, `/docs/` will stay the prefix, so there's not long-term risk in changing this.
One we version docs, we should probably try and keep link & version in sync.
Whenever a route prefix is applied, the router prepends the prefix to
the URL path on the request. For most handlers, this is not an issue
because the request's path is only used for routing and is not actually
needed by the handler itself. However, Prometheus delegates the handling
of the /debug/* endpoints to the http.DefaultServeMux which has it's own
routing logic that depends on the url.Path. As a result, whenever a
prefix is applied, the prefixed URL is passed to the DefaultServeMux
which has no awareness of the prefix and returns a 404.
This change fixes the issue by creating a new serveDebug handler which
routes requests /debug/* requests to appropriate net/http/pprof handler
and removing the net/http/pprof import in cmd/prometheus since it is no
longer necessary.
Fixes#2183.
This PR adds the `/status/config` endpoint which exposes the currently
loaded Prometheus config. This is the same config that is displayed on
`/config` in the UI in YAML format. The response payload looks like
such:
```
{
"status": "success",
"data": {
"yaml": <CONFIG>
}
}
```
Issue #3046 is triggered by html/template changes in go1.9.
See https://tip.golang.org/pkg/html/template. Quote:
// To ease migration to Go 1.9 and beyond, "html" and "urlquery" will
// continue to be allowed as the last command in a pipeline. However, if the
// pipeline occurs in an unquoted attribute value context, "html" is
// disallowed. Avoid using "html" and "urlquery" entirely in new templates.
The commit also includes a trivial whitespace fix.
To cover the cases where stale markers may not be available,
we need to infer the interval and mark series stale based on that.
As we're lacking stale markers this is less accurate, however
it should be good enough for these cases.
We need 4 intervals as if say we had data at t=0 and t=10,
coming via federation. The next data point should be at t=20 however it
could take up to t=30 for it actually to be ingested, t=40 for it to be
scraped via federation and t=50 for it to be ingested.
We then add 10% on to that for slack, as we do elsewhere.
* Use request.Context() instead of a global map of contexts.
* Add some basic opentracing instrumentation on the query path.
* Remove tracehandler endpoint.
This is needed for federating non-instance level metrics, so they don't
end up with the instance label of the prometheus target.
Also sort external labels, so label output order is consistent.
* Fixed int64 overflow for timestamp in v1/api parseDuration and parseTime
This led to unexpected results on wrong query with "(...)&start=148966367200.372&end=1489667272.372"
That query is wrong because of `start > end` but actually internal int64 overflow caused start to be something around MinInt64 (huge negative value) and was passing validation.
BTW: Not sure if negative timestamp makes sense even.. But model.Earliest is actually MinInt64, can someone explain me why?
Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
* Added missing trailing periods on comments.
Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
* MOved to only `<` and `>`. Removed equal.
Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
Expose buildQueryUrl, refactor dispatch to use
buildQueryUrl will allow users to execute queries over the range of an
existing graph. This will be helpful to select data series they wish to
annotate the graph with, for example.
The fuzzy library didn't try to find a "best match", but settled on the
first fuzzy match that exists. This patch includes a modified version of
the fuzzy library, which recursivley tries on the rest of the search
string to find a better match. If found, returns that one.
Another small modification is that if a pattern fully matches, it
skips the lookup entirley and returns the highest score possible for
that match.
For some of the queries, the fuzzy lookup was not filtering properly.
The problem is due to the "replace" beind made on the query itself. It
accidently removes only the first underscore. This patch changes it so
that it removes all of the whitespaces, letting the fuzzy algorithm do
its magic, also fixing this problem.
Originally, the underscore were replaced by a space for this specific
reason, to let the user type a space and have the lookup treat it as the
word break.
Fixes#2380
retreival.Target contains a mutex. It was copied in the Targets()
call. This potentially can wreak a lot of havoc.
It might even have caused the issues reported as #2266 and #2262 .
Right now the /alerts page of Prometheus sorts alerts by severity
(firing, pending, inactive). Once multiple alerts have the same
severity, their order seems to correlate to how they are placed in the
configuration files, but not always. Looking at the code, we make use of
sort.Sort(), which is documented not to provide a stable sort. The
Less() function also only takes the alert state into account.
This change extends the Less() function to provide a lexicographic order
on both the alert state and the name. This means I can finally find the
alerts I'm looking for without using my browser's search feature.
We are writing federation responses streaming. So after
the first byte we wrote, the status header is fixed. We cannot
return an HTTP error for intermediate error but should just abort
and log instead.
Adds also the moment.js library, which is a dependency of it.
Following conventions in the web/ui directory, I am not including the original
sources or LICENSE files.
If an existing request is aborted due to a new request, ignore the completion of the initial request.
Example:
1. Chrome dev tools: enable 5 second network latency
2. Execute query
3. A second later, execute the query again
4. Currently, the spinner will hide, and the stats will immediately display, as if the request had completed. Instead, the spinner and stats should wait until the 2nd execution finishes.
* Add fuzzy search to /graph textarea
We have a few thousands different metrics and looking up some of them
can be quite annoying with the simple string matching.
This patch adds a fuzzy search to the textarea lookup box on the /graph
page. It uses a small neat library from github.com/mattyork/fuzzy.
* Add fuzzy lib to NOTICE and re-build assets
Previously built assets changed the mode.
This extracts Querier as an instantiateable and closeable object
rather than just defining extending methods of the storage interface.
This improves composability and allows abstracting query transactions,
which can be useful for transaction-level caches, consistent data views,
and encapsulating teardown.
If an existing request is aborted due to a new request, ignore the completion of the initial request.
Example:
1. Chrome dev tools: enable 5 second network latency
2. Execute query
3. A second later, execute the query again
4. Currently, the spinner will hide, and the stats will immediately display, as if the request had completed. Instead, the spinner and stats should wait until the 2nd execution finishes.
This is based on https://github.com/prometheus/prometheus/pull/1997.
This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
For Weaveworks' Frankenstein, we need to support multitenancy. In
Frankenstein, we initially solved this without modifying the promql
package at all: we constructed a new promql.Engine for every
query and injected a storage implementation into that engine which would
be primed to only collect data for a given user.
This is problematic to upstream, however. Prometheus assumes that there
is only one engine: the query concurrency gate is part of the engine,
and the engine contains one central cancellable context to shut down all
queries. Also, creating a new engine for every query seems like overkill.
Thus, we want to be able to pass per-query contexts into a single engine.
This change gets rid of the promql.Engine's built-in base context and
allows passing in a per-query context instead. Central cancellation of
all queries is still possible by deriving all passed-in contexts from
one central one, but this is now the responsibility of the caller. The
central query context is now created in main() and passed into the
relevant components (web handler / API, rule manager).
In a next step, the per-query context would have to be passed to the
storage implementation, so that the storage can implement multi-tenancy
or other features based on the contextual information.
This will avoid duplicate MetricFamilies, thereby shrinking the size
of the federation payload and also creating legal text format.
Also, add unit tests for federation. They were also needed for the
previous state of the code, but were missing.
This reverts commit aa43d34a86.
This brings back the /graph changes so that @grandbora can continue to
work on the redirect for backwards compatibility. And other changes
can already take the new /graph parameters into account.
This revert will be reverted once v1.1 is released and has its own
release branch. Since we had already change on top of this, there was
no cleaner way of cutting those changes out.
This commit reverts the following commits:
Revert "Update backend helpers and templates to new url schema"
This reverts commit fc6cdd0611.
Revert "Refactor graph.js"
This reverts commit 445fac56e0.
Revert "Use query parameters in the url"
This reverts commit 3e18d86d8a.
Revert "Point to correct place for GraphLinkForExpression"
This reverts commit 3da825fc76.
Assets are also updated.
There's no corresponding table column for this table header. The
placeholder link for silences was removed in e8800730.
Accordingly, regenerate `web/ui/bindata.go` by running:
make assets format
See discussion in
https://groups.google.com/forum/#!topic/prometheus-developers/bkuGbVlvQ9g
The main idea is that the user of a storage shouldn't have to deal with
fingerprints anymore, and should not need to do an individual preload
call for each metric. The storage interface needs to be made more
high-level to not expose these details.
This also makes it easier to reuse the same storage interface for remote
storages later, as fewer roundtrips are required and the fingerprint
concept doesn't work well across the network.
NOTE: this deliberately gets rid of a small optimization in the old
query Analyzer, where we dedupe instants and ranges for the same series.
This should have a minor impact, as most queries do not have multiple
selectors loading the same series (and at the same offset).
I got feedback from different sources about rules and targets being
too heavy in the status tab if their are lots of them.
This change also allows for more fine-granular locking.
Prometheus is Apache 2 licensed, and most source files have the
appropriate copyright license header, but some were missing it without
apparent reason. Correct that by adding it.
The chunk encoding was hardcoded there because it mostly doesn't
matter what encoding is chosen in that test. Since type 1 is
battle-hardened enough, I'm switching to type 2 here so that we can
catch unexpected problems as a byproduct. My expectation is that the
chunk encoding doesn't matter anyway, as said, but then "unexpected
problems" contains the word "unexpected".
WIP: This needs more tests.
It now gets a from and through value, which it may opportunistically
use to optimize the retrieval. With possible future range indices,
this could be used in a very efficient way. This change merely applies
some easy checks, which should nevertheless solve the use case of
heavy rule evaluations on servers with a lot of series churn.
Idea is the following:
- Only archive series that are at least as old as the headChunkTimeout
(which was already extremely unlikely to happen).
- Then maintain a high watermark for the last archival, i.e. no
archived series has a sample more recent than that watermark.
- Any query that doesn't reach to a time before that watermark doesn't
have to touch the archive index at all. (A production server at
Soundcloud with the aforementioned series churn and heavy rule
evaluations spends 50% of its CPU time in archive index
lookups. Since rule evaluations usually only touch very recent
values, most of those lookup should disappear with this change.)
- Federation with a very broad label matcher will profit from this,
too.
As a byproduct, the un-needed MetricForFingerprint method was removed
from the Storage interface.
This commit simplifies the TargetHealth type and moves the target
status into the target itself. This also removes a race where error
and last scrape time could have been out of sync.
Formalize ZeroSamplePair as return value for non-existing samples.
Change LastSamplePairForFingerprint to return a SamplePair (and not a
pointer to it), which saves allocations in a potentially extremely
frequent call.
It's actually happening in several places (and for flags, we use the
standard Go time.Duration...). This at least reduces all our
home-grown parsing to one place (in model).
This enables metric name autocompletion for every word in an expression,
not just the very first one. It would be great to also support all
language keywords during autocompletion in the future.
This adapts some functionality from the Go standard library for string
literal lexing and unquoting/unescaping.
The following string types are now supported:
Double- or single-quoted strings:
These support all escape sequences that Go supports in double-quoted
string literals. The difference is that Prometheus also has
single-quoted strings (instead of single-quoted runes in Go). Raw
newlines are not allowed.
Backtick-quoted raw strings:
Strings quoted in backticks are treated as raw strings just like in Go
and may contain raw newlines and other special characters directly.
Fixes https://github.com/prometheus/prometheus/issues/1122
Fixes https://github.com/prometheus/prometheus/issues/1121
This is with `golint -min_confidence=0.5`.
I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.