Commit graph

90 commits

Author SHA1 Message Date
beorn7 5de2df752f Hacky implementation of protobuf parsing
This "brings back" protobuf parsing, with the only goal to play with
the new sparse histograms.

The Prom-2.x style parser is highly adapted to the structure of the
Prometheus text format (and later OpenMetrics). Some jumping through
hoops is required to feed protobuf into it.

This is not meant to be a model for the final implementation. It
should just enable sparse histogram ingestion at a reasonable
efficiency.

Known shortcomings and flaws:

- No tests yet.

- Summaries and legacy histograms, i.e. without sparse buckets, are
  ignored.

- Staleness doesn't work (but this could be fixed in the appender, to
  be discussed).

- No tricks have been tried that would be similar to the tricks the
  text parsers do (like direct pointers into the HTTP response
  body). That makes things weird here. Tricky optimizations only make
  sense once the final format is specified, which will almost
  certainly not be the old protobuf format. (Interestingly, I expect
  this implementation to be in fact much more efficient than the
  original protobuf ingestion in Prom-1.x.)

- This is using a proto3 version of metrics.proto (mostly to be
  consistent with the other protobuf uses). However, proto3 sees no
  difference between an unset field and a field set to its zero
  value. We would need that difference to distinguish between an
  unset timestamp and the timestamp 0 (1970-01-01, 00:00:00 UTC). In
  this experimental code, we just assume that the timestamp is never
  specified and therefore a timestamp of 0 is always interpreted as
  "not set".
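
The last point can be illustrated without any protobuf code. The following is a minimal sketch, assuming two hypothetical struct types (they are not the generated metrics.proto types): a pointer field can express "unset", while a plain value field cannot tell "unset" apart from 0.

```go
package main

import "fmt"

// pointerStyle mimics an optional field represented as a pointer,
// where nil means "not set". Hypothetical type for illustration.
type pointerStyle struct {
	TimestampMs *int64
}

// valueStyle mimics a plain scalar field without presence information.
// Hypothetical type for illustration.
type valueStyle struct {
	TimestampMs int64
}

// hasTimestampPtr can tell "unset" apart from an explicit 0.
func hasTimestampPtr(m pointerStyle) bool { return m.TimestampMs != nil }

// hasTimestampVal applies the workaround from the text: 0 is read as
// "not set", so an explicit timestamp of 0 cannot be expressed.
func hasTimestampVal(m valueStyle) bool { return m.TimestampMs != 0 }

func main() {
	zero := int64(0)
	fmt.Println(hasTimestampPtr(pointerStyle{}))                   // false
	fmt.Println(hasTimestampPtr(pointerStyle{TimestampMs: &zero})) // true
	fmt.Println(hasTimestampVal(valueStyle{}))                     // false
	fmt.Println(hasTimestampVal(valueStyle{TimestampMs: 0}))       // false
}
```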

Signed-off-by: beorn7 <beorn@grafana.com>
2021-07-01 01:35:11 +02:00
Julius Volz 9d495afd2c Remove trailing zeros in scrape timeout header
See https://twitter.com/AviKivity/status/1405147699557638145 and
https://twitter.com/juliusvolz/status/1405790211670515712
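
As a rough illustration of the difference, the sketch below formats a timeout in seconds with and without trailing zeros. The function names are hypothetical and this is not necessarily how the commit implements it.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// withTrailingZeros formats seconds with a fixed six decimals.
func withTrailingZeros(seconds float64) string {
	return fmt.Sprintf("%f", seconds)
}

// withoutTrailingZeros uses the shortest decimal representation that
// still round-trips to the same float64.
func withoutTrailingZeros(seconds float64) string {
	return strconv.FormatFloat(seconds, 'f', -1, 64)
}

func main() {
	timeout := 10 * time.Second
	fmt.Println(withTrailingZeros(timeout.Seconds()))    // 10.000000
	fmt.Println(withoutTrailingZeros(timeout.Seconds())) // 10
	fmt.Println(withoutTrailingZeros(1.5))               // 1.5
}
```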

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2021-06-18 09:38:12 +02:00
Levi Harrison b5f6f8fb36 Switched to go-kit/log
Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-06-11 12:28:36 -04:00
hanjm 1df05bfd49 Add body_size_limit to prevent large response bodies from bad targets causing Prometheus server OOM (#8827)
Signed-off-by: hanjm <hanjinming@outlook.com>
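
A body size limit of this kind can be sketched with `io.LimitReader`. The helper names and the error value below are hypothetical, and treating a limit of 0 as "no limit" is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errBodySizeLimit is a hypothetical sentinel error for this sketch.
var errBodySizeLimit = errors.New("body size limit exceeded")

// readLimited reads the body but never buffers more than limit+1 bytes.
// A limit of 0 (or less) is treated as "no limit" in this sketch.
func readLimited(r io.Reader, limit int64) ([]byte, error) {
	if limit <= 0 {
		return io.ReadAll(r)
	}
	// Read one byte past the limit so an oversized body can be detected.
	b, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(b)) > limit {
		return nil, errBodySizeLimit
	}
	return b, nil
}

// readLimitedString is a convenience wrapper for in-memory input.
func readLimitedString(s string, limit int64) (string, error) {
	b, err := readLimited(strings.NewReader(s), limit)
	return string(b), err
}

func main() {
	_, err := readLimitedString("0123456789", 4)
	fmt.Println(err) // body size limit exceeded
	body, err := readLimitedString("0123", 4)
	fmt.Println(body, err) // 0123 <nil>
}
```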
2021-05-29 07:05:42 +08:00
Levi Harrison 2826fbeeb7
SD: Add target creation failure counter and change failure handling (#8786)
* Added metric and changed failure/drop strategy

Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-05-28 23:50:59 +02:00
Callum Styan 8fd73b1d28
Add Exemplar Remote Write support (#8296)
* Write exemplars to the WAL and send them over remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Update example for exemplars, print data in a more obvious format.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add metrics for remote write of exemplars.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix incorrect slices passed to send in remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* We need to unregister the new metrics.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address review comments

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Order of exemplar append vs write exemplar to WAL needs to change.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Several fixes to prevent sending uninitialized or incorrect samples with an exemplar. Fix dropping exemplar for missing series. Add tests for queue_manager sending exemplars

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Store both samples and exemplars in the same timeseries buffer to remove the alloc when building final request, keep sub-slices in separate buffers for re-use

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Condense sample/exemplar delivery tests to parameterized sub-tests

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Rename test methods for clarity now that they also handle exemplars

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Rename counter variable. Fix instances where metrics were not updated correctly

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Add exemplars to LoadWAL benchmark

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* last exemplars timestamp metric needs to convert value to seconds with
ms precision

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Process exemplar records in a separate go routine when loading the WAL.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address review comments related to clarifying comments and variable
names. Also refactor sample/exemplar to enqueue prompb types.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Regenerate types proto with comments, update protoc version again.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Put remote write of exemplars behind a feature flag.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address some of Ganesh's review comments.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Move exemplar remote write feature flag to a config file field.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address Bartek's review comments.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Don't allocate exemplar buffers in queue_manager if we're not going to
send exemplars over remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add ValidateExemplar function, validate exemplars when appending to head
and log them all to WAL before adding them to exemplar storage.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address more review comments from Ganesh.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add exemplar total label length check.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address a few last review comments

Signed-off-by: Callum Styan <callumstyan@gmail.com>
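
The total label length check mentioned in the list above could look roughly like the following. This is a sketch only: the 128-character limit is an assumption based on the OpenMetrics rule for exemplar label sets, and the type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// maxExemplarLabelSetLength is an assumption of this sketch, following
// the OpenMetrics limit on the combined length of exemplar label names
// and values.
const maxExemplarLabelSetLength = 128

// label is a hypothetical stand-in for a name/value pair.
type label struct{ Name, Value string }

var errExemplarLabelLength = errors.New("exemplar label set exceeds maximum length")

// validateExemplarLabels sums the character counts of all names and
// values and fails as soon as the total exceeds the limit.
func validateExemplarLabels(ls []label) error {
	total := 0
	for _, l := range ls {
		total += utf8.RuneCountInString(l.Name) + utf8.RuneCountInString(l.Value)
		if total > maxExemplarLabelSetLength {
			return errExemplarLabelLength
		}
	}
	return nil
}

func main() {
	fmt.Println(validateExemplarLabels([]label{{"trace_id", "abc123"}})) // <nil>
}
```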

Co-authored-by: Martin Disibio <mdisibio@gmail.com>
2021-05-06 13:53:52 -07:00
Damien Grisonnet b50f9c1c84
Add label scrape limits (#8777)
* scrape: add label limits per scrape

Add three new limits to the scrape configuration to provide some
mechanism to defend against an unbounded number of labels and excessive
label lengths. If any of these limits is exceeded by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.

The `label_limit` configuration will provide a mechanism to bound the
number of labels of each sample in a scrape to a user-defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.

The `label_name_length_limit` and `label_value_length_limit` will
prevent having labels of excessive lengths. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name or value whose
length exceeds the configured limit.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: add metrics and alert to label limits

Add three gauges, one for each label limit, to easily access the
limit set by a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: apply label limits to __name__ label

Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very very long.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: remove label limits gauges and refactor

Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
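
The three limits described in this commit can be sketched as a single check over a sample's labels. The types and names are hypothetical; the sketch follows the later bullet in applying the limits to `__name__` as well, and treats zero as "no limit" as stated above.

```go
package main

import "fmt"

// label is a hypothetical stand-in for a name/value pair.
type label struct{ Name, Value string }

// labelLimits mirrors the three scrape options; zero means no limit.
type labelLimits struct {
	labelLimit            int
	labelNameLengthLimit  int
	labelValueLengthLimit int
}

// verifyLabelLimits returns an error if the label set breaks any limit.
// The caller would then fail the whole scrape.
func verifyLabelLimits(ls []label, lim labelLimits) error {
	if lim.labelLimit > 0 && len(ls) > lim.labelLimit {
		return fmt.Errorf("label_limit exceeded: %d labels, limit %d", len(ls), lim.labelLimit)
	}
	for _, l := range ls {
		if lim.labelNameLengthLimit > 0 && len(l.Name) > lim.labelNameLengthLimit {
			return fmt.Errorf("label_name_length_limit exceeded: length %d, limit %d", len(l.Name), lim.labelNameLengthLimit)
		}
		if lim.labelValueLengthLimit > 0 && len(l.Value) > lim.labelValueLengthLimit {
			return fmt.Errorf("label_value_length_limit exceeded: length %d, limit %d", len(l.Value), lim.labelValueLengthLimit)
		}
	}
	return nil
}

func main() {
	ls := []label{{"__name__", "up"}, {"job", "prometheus"}}
	fmt.Println(verifyLabelLimits(ls, labelLimits{}) == nil)              // true
	fmt.Println(verifyLabelLimits(ls, labelLimits{labelLimit: 1}) == nil) // false
}
```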
2021-05-06 09:56:21 +01:00
Marco Pracucci 4da5c25ea4
Upgrade prometheus/common to v0.21.0
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2021-04-21 12:19:16 +02:00
Julien Pivotto e14176756f
Merge pull request #8601 from dgl/fix-8243
Ensure that timestamp comparison uses wall clock time
2021-03-16 16:00:25 +01:00
Callum Styan 289ba11b79
Add circular in-memory exemplars storage (#6635)
* Add circular in-memory exemplars storage

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Signed-off-by: Martin Disibio <mdisibio@gmail.com>

Co-authored-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com>
Co-authored-by: Martin Disibio <mdisibio@gmail.com>

* Fix some comments, clean up exemplar metrics struct and exemplar tests.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix exemplar query api null vs empty array issue.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

Co-authored-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Co-authored-by: Tom Wilkie <tom.wilkie@gmail.com>
Co-authored-by: Martin Disibio <mdisibio@gmail.com>
2021-03-16 15:17:45 +05:30
David Leadbeater 21a282fabe Ensure that timestamp comparison uses wall clock time
It's not possible to assume subtraction and addition of a time.Time will
result in consistent values.
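
For background, a `time.Time` obtained from `time.Now()` carries a monotonic clock reading in addition to the wall clock time, and `t.Round(0)` is the documented way to strip it. The sketch below only demonstrates that mechanism; the helper names are hypothetical and it does not claim to reproduce the actual fix.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// wallClock strips the monotonic clock reading from t, so later
// arithmetic and comparisons use wall clock time only.
func wallClock(t time.Time) time.Time {
	return t.Round(0)
}

// hasMonotonic reports whether t carries a monotonic reading, relying
// on the "m=" field that Time.String appends in that case.
func hasMonotonic(t time.Time) bool {
	return strings.Contains(t.String(), " m=")
}

// demo checks a fresh time.Now() before and after stripping.
func demo() (before, after bool) {
	now := time.Now()
	return hasMonotonic(now), hasMonotonic(wallClock(now))
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // true false
}
```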

Signed-off-by: David Leadbeater <dgl@dgl.cx>
2021-03-15 13:05:17 +00:00
Tom Wilkie 7369561305
Combine Appender.Add and AddFast into a single Append method. (#8489)
This moves the label lookup into TSDB, whilst still keeping the cached-ref optimisation for repeated Appends.

This makes the API easier to consume and implement.  In particular this change is motivated by the scrape-time-aggregation work, which I don't think is possible to implement without it as it needs access to label values.
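
The idea of a single append call that still benefits from a cached reference can be sketched with a toy in-memory appender. Everything here is hypothetical and simplified (labels are a plain string, references are counters); it is not the TSDB implementation.

```go
package main

import "fmt"

type sample struct {
	t int64
	v float64
}

// toyAppender is a hypothetical stand-in for a storage appender.
type toyAppender struct {
	refs    map[string]uint64   // label string -> series reference
	series  map[uint64][]sample // series reference -> samples
	lookups int                 // number of label lookups performed
}

func newToyAppender() *toyAppender {
	return &toyAppender{refs: map[string]uint64{}, series: map[uint64][]sample{}}
}

// Append adds a sample. If ref is known, the label lookup is skipped;
// otherwise (for example ref == 0) the series is found or created by
// its labels. The returned reference can be cached by the caller.
func (a *toyAppender) Append(ref uint64, lset string, t int64, v float64) (uint64, error) {
	if _, known := a.series[ref]; !known {
		a.lookups++
		existing, ok := a.refs[lset]
		if !ok {
			existing = uint64(len(a.refs) + 1)
			a.refs[lset] = existing
		}
		ref = existing
	}
	a.series[ref] = append(a.series[ref], sample{t: t, v: v})
	return ref, nil
}

func main() {
	app := newToyAppender()
	ref, _ := app.Append(0, `{job="demo"}`, 1000, 1)
	ref, _ = app.Append(ref, `{job="demo"}`, 2000, 2) // uses the cached ref
	fmt.Println(ref, app.lookups, len(app.series[ref])) // 1 1 2
}
```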

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2021-02-18 17:37:00 +05:30
Brian Brazil ebe0da7a72
Protect sp.loops from concurrent access. (#8176)
Manager.reload takes the mutex that would make it safe, but
releases it before the goroutines it spawns are finished with it.
Thus more explicit locking of scrapePool.Sync/stop/reload is needed.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2020-11-12 16:06:25 +00:00
Brian Brazil 3f8e51738c
More granular locking for scrapeLoop. (#8104)
Don't lock for all of Sync/stop/reload as that holds up /metrics and the
UI when they want a list of active/dropped targets. Instead take
advantage of the fact that Sync/stop/reload cannot be called
concurrently by the scrape Manager and lock just on the targets
themselves.
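
A minimal sketch of this kind of narrow locking, with hypothetical names: the slow part of a sync runs without any lock, and a dedicated mutex is held only while the target list is swapped or copied, so readers are never blocked for long.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical, heavily simplified scrape pool.
type pool struct {
	targetMtx sync.Mutex // guards only the active target list
	active    []string
}

// Sync prepares the new target list without holding the lock and takes
// the lock only for the final swap. It assumes, as the text does, that
// Sync is never called concurrently with itself.
func (p *pool) Sync(targets []string) {
	next := make([]string, len(targets))
	copy(next, targets) // stands in for the slow part of a sync

	p.targetMtx.Lock()
	p.active = next
	p.targetMtx.Unlock()
}

// ActiveTargets returns a copy, holding the lock only while copying.
func (p *pool) ActiveTargets() []string {
	p.targetMtx.Lock()
	defer p.targetMtx.Unlock()
	out := make([]string, len(p.active))
	copy(out, p.active)
	return out
}

func main() {
	p := &pool{}
	p.Sync([]string{"a:9090", "b:9090"})
	fmt.Println(p.ActiveTargets()) // [a:9090 b:9090]
}
```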

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2020-10-26 14:46:20 +00:00
Julien Pivotto be5ba1a62d Fix wordings
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-07 21:44:36 +02:00
Julien Pivotto 671f7c66e5 Adjust comment
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-07 18:28:02 +02:00
Julien Pivotto 627ff84599 Adjust flag
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-07 18:25:52 +02:00
Julien Pivotto 536dfb6234 Add an experimental, hidden flag
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-07 17:31:46 +02:00
Julien Pivotto b90c7a55da Simplify logic
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-06 21:17:16 +02:00
Julien Pivotto ccc1df3140 Fix comment
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-06 13:48:24 +02:00
Julien Pivotto 98e14611a5 Move the tolerance logic in the loop function.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-05 18:20:10 +02:00
Julien Pivotto 6544f95403 Introduce timestamp tolerance in scrapes
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-10-05 18:20:10 +02:00
iurii bd53b5ff37
Unnecessary go routine spawn. (#7879)
* Unnecessary go routine spawn.
* Remove unnecessary local variable creation.

Signed-off-by: iurii <iurii@coins.ph>
Co-authored-by: iurii <iurii@coins.ph>
2020-09-02 16:26:42 +01:00
Andy Bursavich 4e6a94a27d
Invert service discovery dependencies (#7701)
This also fixes a bug in query_log_file, which now is relative to the config file like all other paths.

Signed-off-by: Andy Bursavich <abursavich@gmail.com>
2020-08-20 13:48:26 +01:00
Julien Pivotto 2899773b01
Do not stop scrapes in progress during reload (#7752)
* Do not stop scrapes in progress during reload.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-08-07 15:58:16 +02:00
johncming 5578c96307
scrape: fix typo. (#7712)
Signed-off-by: johncming <johncming@yahoo.com>
2020-08-01 09:56:21 +01:00
Julien Pivotto 7b5507ce4b
Scrape: defer report (#7700)
When I started working on target_limit, scrapeAndReport did not exist
yet. Then I simply rebased my work without thinking.

It appears that there is a lot that can be inlined if I defer() the
report.
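
The pattern can be sketched as follows: with a named error result and a deferred report, every return path reports exactly once. The signature, the `report` type, and the limit check are hypothetical and only serve the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// report records what a scrape produced. Hypothetical type.
type report struct {
	samples int
	err     error
}

var errLimit = errors.New("sample limit exceeded")

// scrapeAndReport runs scrape and always records a report, whichever
// return path is taken, because the report is deferred.
func scrapeAndReport(scrape func() (int, error), limit int, reports *[]report) (err error) {
	var samples int
	defer func() {
		*reports = append(*reports, report{samples: samples, err: err})
	}()

	samples, err = scrape()
	if err != nil {
		return err
	}
	if limit > 0 && samples > limit {
		return errLimit
	}
	return nil
}

func main() {
	var reports []report
	err := scrapeAndReport(func() (int, error) { return 3, nil }, 2, &reports)
	fmt.Println(err, len(reports), reports[0].samples) // sample limit exceeded 1 3
}
```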

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-07-31 19:11:08 +02:00
Annanay ec562f152b Merge branch 'master' into appender-context
Signed-off-by: Annanay <annanayagarwal@gmail.com>
2020-07-31 13:03:56 +05:30
Julien Pivotto f482c7bdd7
Add per scrape-config targets limit (#7554)
* Add per scrape-config targets limit

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-07-30 14:20:24 +02:00
Annanay 7f98a744e5 Add context to Appender interface
Signed-off-by: Annanay <annanayagarwal@gmail.com>
2020-07-24 19:40:51 +05:30
johncming 490f9c664e
scrape: remove two blank lines. (#7610)
Signed-off-by: johncming <johncming@yahoo.com>
2020-07-19 07:34:04 +02:00
Julien Pivotto 754461b74f
Reuse the same appender for report and scrape (#7562)
Additionally, implement isolation in collectResultAppender.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-07-16 13:53:39 +02:00
Julien Pivotto 190addffd8
Change Scrape Loop mtx to Mutex (#7553)
It was still an RWMutex, but we never use the read lock.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-07-11 15:37:13 +02:00
Brian Brazil f9d21f10ec
Only relabelling should apply for scrape_samples_scraped_post_relabelling. (#7342)
More consistent variable names.

Fixes #7298

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2020-06-04 16:00:37 +01:00
Brian Brazil c9565f08aa
Pass reference to checkAddError so appendErrors is updated. (#7294)
This was preventing the warnings from being logged.
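
The underlying Go behaviour is easy to reproduce: a struct passed by value is copied, so increments inside the callee are lost. The sketch below uses hypothetical names and is not the actual `checkAddError` code.

```go
package main

import (
	"errors"
	"fmt"
)

// appendErrs counts append failures by kind. Hypothetical type.
type appendErrs struct {
	numOutOfOrder int
}

var errOutOfOrder = errors.New("out of order sample")

// countByValue receives a copy, so the caller never sees the update.
func countByValue(err error, errs appendErrs) {
	if errors.Is(err, errOutOfOrder) {
		errs.numOutOfOrder++
	}
}

// countByPointer updates the caller's struct.
func countByPointer(err error, errs *appendErrs) {
	if errors.Is(err, errOutOfOrder) {
		errs.numOutOfOrder++
	}
}

func main() {
	var errs appendErrs
	countByValue(errOutOfOrder, errs)
	fmt.Println(errs.numOutOfOrder) // 0
	countByPointer(errOutOfOrder, &errs)
	fmt.Println(errs.numOutOfOrder) // 1
}
```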

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2020-05-26 15:14:55 +01:00
Marek Slabicki 8224ddec23
Capitalizing first letter of all log lines (#7043)
Signed-off-by: Marek Slabicki <thaniri@gmail.com>
2020-04-11 09:22:18 +01:00
Callum Styan c453def8c5
Separate scrape add error checking out into its own function. (#6930)
* Separate scrape add error checking out into its own function.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* pass sampleLimitError to checkAddError instead of returning an error

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Return bool, error from checkAddError so we can properly handle
ErrNotFound for AddFast. This should in theory never happen, but the
previous code path handled this case. Adds a test for this, which master
passes and the previous commit fails.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address comment changes.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Move sampleAdded inside the loop iteration within append, since that's
the only block the variable is used in.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2020-03-25 19:31:48 -07:00
Julien Pivotto d6ad5551c9
Scrape: do not put staleness marker when cache is reused (#7011)
* Scrape: do not put staleness marker when cache is reused

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-03-20 17:43:26 +01:00
Julien Pivotto 8907ba6235 Make TSDB use storage errors
This fixes #6992, which was introduced by #6777. There was an
intermediate component which translated TSDB errors into storage errors,
but that component was deleted and this bug went unnoticed until we
were looking at the Prombench results. Without this, scrape will fail
instead of dropping samples or using "Add" when the series have been
garbage collected.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-03-17 22:24:25 +01:00
Björn Rabenstein d80b0810c1
Move crucial actions to defer (#6918)
With defer having less of a performance penalty, there is no reason
not to do those crucial operations via defer.

Context: With isolation in place, if we forget to Commit/Rollback, the
low watermark will get stuck forever.

The current code should not have any bugs, but moving to defer helps
to avoid future bugs.

This is also moving the `closeAppend` in the `Commit` implementation
itself to defer. If logging to the WAL fails, we would have missed the
`closeAppend`.
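
A minimal sketch of the idea, using a hypothetical appender interface and a fake implementation: the deferred function guarantees that exactly one of Commit or Rollback runs, whichever way the function returns.

```go
package main

import (
	"errors"
	"fmt"
)

// appender is a hypothetical, reduced appender interface.
type appender interface {
	Commit() error
	Rollback() error
}

// fakeAppender records which of the two calls happened.
type fakeAppender struct{ committed, rolledBack bool }

func (f *fakeAppender) Commit() error   { f.committed = true; return nil }
func (f *fakeAppender) Rollback() error { f.rolledBack = true; return nil }

var errAppend = errors.New("append failed")

// appendWith runs add and then either commits or rolls back in a
// deferred function, so no return path can forget to do so.
func appendWith(app appender, add func() error) (err error) {
	defer func() {
		if err != nil {
			_ = app.Rollback()
			return
		}
		err = app.Commit()
	}()
	return add()
}

func main() {
	ok := &fakeAppender{}
	fmt.Println(appendWith(ok, func() error { return nil }), ok.committed, ok.rolledBack)
	// <nil> true false
	bad := &fakeAppender{}
	fmt.Println(appendWith(bad, func() error { return errAppend }), bad.committed, bad.rolledBack)
	// append failed false true
}
```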

Signed-off-by: beorn7 <beorn@grafana.com>
2020-03-13 20:54:47 +01:00
Brian Brazil 5da8990053
Log scrape append failures as debug rather than warn. (#6852)
This is most likely due to an endpoint not producing valid
metrics output, which we should treat the same as a failed
scrape, and thus not spam the application logs with it.

Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>
2020-03-06 00:46:03 +00:00
李国忠 52025bd7a9
[comments] change word ‘wheter’ to ‘whether’ (#6912)
* [comments] change word ‘wheter’ to ‘whether’
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>

* [comments] change word ‘wheter’ to ‘whether’
Signed-off-by: fuling <fuling.lgz@alibaba-inc.com>
2020-03-02 13:51:24 +05:30
Julien Pivotto ed623f69e2
tsdb: don't allow ingesting empty labelsets (#6891)
* tsdb: don't allow ingesting empty labelsets

When we ingest an empty labelset in the head, further blocks can not be
compacted, with the error:

```
level=error ts=2020-02-27T21:26:58.379Z caller=db.go:659 component=tsdb
msg="compaction failed" err="persist head block: write compaction:
add series: out-of-order series added with label set \"{}\" / prev:
\"{}\""
```

We should therefore reject those invalid empty labelsets upfront.

This can be reproduced with the following:

```
cat << END > prometheus.yml
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 1s
    basic_auth:
      username: test
      password: test
    metric_relabel_configs:
    - regex: ".*"
      action: labeldrop

    static_configs:
    - targets:
      - 127.0.1.1:9090
END
./prometheus --storage.tsdb.min-block-duration=1m
```
And wait a few minutes.
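
The reproduction above boils down to a relabel rule that drops every label, leaving an empty label set. The sketch below models that with a plain map and hypothetical helper names, and shows the kind of upfront check the commit describes; it is not the TSDB code.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

var errEmptyLabelSet = errors.New("empty labelset")

// labelDrop removes every label whose name fully matches pattern,
// similar in spirit to the labeldrop relabel action.
func labelDrop(ls map[string]string, pattern string) map[string]string {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	out := make(map[string]string, len(ls))
	for name, value := range ls {
		if !re.MatchString(name) {
			out[name] = value
		}
	}
	return out
}

// validate rejects label sets that ended up empty.
func validate(ls map[string]string) error {
	if len(ls) == 0 {
		return errEmptyLabelSet
	}
	return nil
}

func main() {
	ls := map[string]string{"__name__": "up", "job": "prometheus"}
	fmt.Println(validate(labelDrop(ls, ".*")))  // empty labelset
	fmt.Println(validate(labelDrop(ls, "job"))) // <nil>
}
```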

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-03-02 07:18:05 +00:00
Bartlomiej Plotka 34426766d8 Unify Iterator interfaces. All point to storage now.
This is part of https://github.com/prometheus/prometheus/pull/5882 that can be done to simplify things.
All todos I added will be fixed in follow up PRs.

* querier.Querier, querier.Appender, querier.SeriesSet, and querier.Series interfaces merged
with storage interface.go. All imports were adjusted to that.
* querier.SeriesIterator replaced by chunkenc.Iterator
* Added chunkenc.Iterator.Seek method and tests for xor implementation (?)
* Since we properly handle SelectParams for Select methods I adjusted min max
based on that. This should help in terms of performance for queries with functions like offset.
* added Seek to deletedIterator and test.
* storage/tsdb was removed as it was only unnecessary glue with incompatible structs.

No logic was changed, only different source of abstractions, so no need for benchmarks.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2020-02-17 18:03:54 +00:00
gotjosh 8b49c9285d
scrape: Add metrics to track bytes and entries in the metadata cache (#6675)
Signed-off-by: gotjosh <josue@grafana.com>
2020-01-29 11:13:18 +00:00
Julien Pivotto fafb7940b1 Pass over scrape cache to the next scrape (#6670)
* Pass over scrape cache to the next scrape

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2020-01-22 12:13:47 +00:00
gotjosh 05842176a6 Make the scrape.metricMetadataStore interface public
To test the implementation of our metric metadata API, we need to represent various states of metadata in the scrape metadata store. That is currently not possible as the interface and method to set the store are private.

This changes the interface, list and get methods, and the SetMetadataStore function to be public.

Incidentally, the scrapeCache implementation needs to be renamed to match the new signature.

Signed-off-by: gotjosh <josue@grafana.com>
2019-12-05 10:29:58 +00:00
Geoffrey Beausire 5cb7987314 Fix relabeling collision when using exported label
When using both a label and the suffix+label in the
relabel config, it's possible that Prometheus removes
the suffix+label for no obvious reason. It's due to a
collision when merging labels from the target and from
the sample.

Signed-off-by: Geoffrey Beausire <g.beausire@criteo.com>
2019-11-26 11:03:11 +01:00
Dustin Hooten ca60bf298c React UI: Implement /targets page (#6276)
* Add LastScrapeDuration to targets endpoint

Signed-off-by: Dustin Hooten <dhooten@splunk.com>

* Add Scrape job name to targets endpoint

Signed-off-by: Dustin Hooten <dhooten@splunk.com>

* Implement the /targets page in react

Signed-off-by: Dustin Hooten <dhooten@splunk.com>

* Add state query param to targets endpoint

Signed-off-by: Dustin Hooten <dhooten@splunk.com>

* Use state filter in api call

Signed-off-by: Dustin Hooten <dhooten@splunk.com>

* api feedback

Signed-off-by: Dustin Hooten <dhooten@splunk.com>

* pr feedback frontend

Signed-off-by: Dustin Hooten <dhooten@splunk.com>

* Implement and use localstorage hook

Signed-off-by: Dustin Hooten <dhooten@splunk.com>

* PR feedback

Signed-off-by: Dustin Hooten <dhooten@splunk.com>
2019-11-11 22:42:24 +01:00
johncming 1fa5a75a3a Ctx name (#5961)
* scrape: rename ctx name for readability

Signed-off-by: johncming <johncming@yahoo.com>

* scrape: use self ctx instead of parent ctx.

Signed-off-by: johncming <johncming@yahoo.com>
2019-08-28 15:55:09 +02:00