Commit graph

179 commits

Author SHA1 Message Date
beorn7 4210aac74a Merge branch 'main' into sparsehistogram 2022-03-22 14:47:42 +01:00
Alvin Lin 8b5eb562b1 Re-generate test cert to fix test_windows test failures
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
2022-03-17 19:37:18 +01:00
Goutham Veeramachaneni c4f8020dca
Embed MetadaStore in scrape context (#10450)
This will allow downstream users to easily access metadata required.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
2022-03-16 09:45:15 +01:00
Robert Fratto f0ec619eec
scrape: allow providing a custom Dialer for scraping (#10415)
* scrape: allow providing a custom Dialer for scraping

This commit extends config.ScrapeConfig with an optional field to
override how HTTP connections to targets are created. This field is not
set directly in Prometheus, and is only added for the convenience of
downstream importers.

Closes #9706

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* scrape: move custom dial function to scrape.Options

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
2022-03-09 00:48:47 +01:00
Jayapriya Pai edfe657b54
scrape: Fix label_limits cache usage (#10370)
Fixes #10344

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
2022-03-03 18:37:53 +01:00
Julien Pivotto f695df843f Improve content-type error handling
- Call err everywhere
- Change log message to underscore-separated field

Followup on #10186

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2022-02-08 11:02:51 +01:00
Matheus Pimenta 8d8ce641a4
error for invalid media type should not be completely swallowed (#10186)
* error for invalid media type should not be completely swallowed

Signed-off-by: Matheus Pimenta <matheuscscp@gmail.com>
2022-02-08 10:57:56 +01:00
Jonatan Ivanov b6df3b6f67
Prefer 1.0.0 in the accept header for application/openmetrics-text (#9431)
related: https://github.com/prometheus/client_java/issues/702
fixes gh-9430

Signed-off-by: Jonatan Ivanov <jonatan.ivanov@gmail.com>
2022-01-28 00:37:51 +01:00
beorn7 86cc83b13c storage: iterator fixes after merge
Signed-off-by: beorn7 <beorn@grafana.com>
2021-12-18 14:12:01 +01:00
beorn7 64c7bd2b08 Merge branch 'main' into sparsehistogram 2021-12-18 14:04:25 +01:00
Julius Volz fa552b98bb
Merge pull request #9996 from roidelapluie/fixreportlimit
Fix reporting metrics when sample limit is reached during the report
2021-12-17 13:17:07 +01:00
Julien Pivotto 67a64ee092
Remove check against cfg so interval/ timeout are always set (#10023) (#10031)
Signed-off-by: Nicholas Blott <blottn@tcd.ie>

Co-authored-by: Nicholas Blott <blottn@tcd.ie>
2021-12-16 16:46:14 +01:00
Julien Pivotto e94a0b28e1 Append reporting metrics without limit
If reporting metrics fails due to reaching the limit, this makes the
target appear as UP in the UI, but the metrics are missing.

This commit bypasses that limit for report metrics.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-12-16 13:26:53 +01:00
Björn Rabenstein 7e42acd3b1
tsdb: Rework iterators (#9877)
- Pick At... method via return value of Next/Seek.
- Do not clobber returned buckets.
- Add partial FloatHistogram suppert.

Note that the promql package is now _only_ dealing with
FloatHistograms, following the idea that PromQL only knows float
values.

As a byproduct, I have removed the histogramSeries metric. In my
understanding, series can have both float and histogram samples, so
that metric doesn't make sense anymore.

As another byproduct, I have converged the sampleBuf and the
histogramSampleBuf in memSeries into one. The sample type stored in
the sampleBuf has been extended to also contain histograms even before
this commit.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-29 13:24:23 +05:30
beorn7 5d4db805ac Merge branch 'main' into sparsehistogram 2021-11-17 19:57:31 +01:00
beorn7 4c28d9fac7 Move to histogram.Histogram pointers
This is to avoid copying the many fields of a histogram.Histogram all
the time.

This also fixes a bunch of formerly broken tests.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-12 23:17:35 +01:00
beorn7 c954cd9d1d Move packages out of deprecated pkg directory
This creates a new `model` directory and moves all data-model related
packages over there:
  exemplar labels relabel rulefmt textparse timestamp value

All the others are more or less utilities and have been moved to `util`:
  gate logging modetimevfs pool runtime

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-09 08:03:10 +01:00
Dieter Plaetinck cda025b5b5
TSDB: demistify SeriesRefs and ChunkRefs (#9536)
* TSDB: demistify seriesRefs and ChunkRefs

The TSDB package contains many types of series and chunk references,
all shrouded in uint types.  Often the same uint value may
actually mean one of different types, in non-obvious ways.

This PR aims to clarify the code and help navigating to relevant docs,
usage, etc much quicker.

Concretely:

* Use appropriately named types and document their semantics and
  relations.
* Make multiplexing and demuxing of types explicit
  (on the boundaries between concrete implementations and generic
  interfaces).
* Casting between different types should be free.  None of the changes
  should have any impact on how the code runs.

TODO: Implement BlockSeriesRef where appropriate (for a future PR)

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* agent: demistify seriesRefs and ChunkRefs

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
2021-11-06 15:40:04 +05:30
Mateusz Gozdek 1a6c2283a3 Format Go source files using 'gofumpt -w -s -extra'
Part of #9557

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-02 19:52:34 +01:00
Darshan Chaudhary a7e554b158
add check service-discovery command (#8970)
Signed-off-by: darshanime <deathbullet@gmail.com>
2021-11-01 14:42:12 +01:00
DrAuYueng 69e309d202
Expose TargetsFromGroup/AlertmanagerFromGroup func and reuse this for (#9343)
static/file sd config check in promtool

Signed-off-by: DrAuYueng <ouyang1204@gmail.com>
2021-10-28 02:01:28 +02:00
Furkan Türkal a6e6011d55
Add scrape_body_size_bytes metric (#9569)
Fixes #9520

Signed-off-by: Furkan <furkan.turkal@trendyol.com>
2021-10-24 23:45:31 +02:00
Levi Harrison 5d409b0637
Remove interval and timeout parameters (#9578) 2021-10-24 10:38:21 -04:00
Julien Pivotto b0c98e01c8
Include scrape labels in the hash (#9551)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-10-20 23:44:45 +02:00
beorn7 a9008f5423 Merge branch 'main' into sparsehistogram 2021-10-19 17:14:23 +02:00
beorn7 b8d953a5a0 scrape: Avoid creating a label map during conflict resolution
This also avoids the recursive function call. I think it is quite
readable. And much less code.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-10-15 21:56:48 +02:00
Shirley Leu c890ea407f
Resolve conflicts between multiple exported label prefixes (#9479)
Resolve conflicts between multiple exported label prefixes

Signed-off-by: Shirley Leu <shirley.w.leu@gmail.com>
2021-10-15 20:31:03 +02:00
beorn7 7a8bb8222c Style cleanup of all the changes in sparsehistogram so far
A lot of this code was hacked together, literally during a
hackathon. This commit intends not to change the code substantially,
but just make the code obey the usual style practices.

A (possibly incomplete) list of areas:

* Generally address linter warnings.

* The `pgk` directory is deprecated as per dev-summit. No new packages should
  be added to it. I moved the new `pkg/histogram` package to `model`
  anticipating what's proposed in #9478.

* Make the naming of the Sparse Histogram more consistent. Including
  abbreviations, there were just too many names for it: SparseHistogram,
  Histogram, Histo, hist, his, shs, h. The idea is to call it "Histogram" in
  general. Only add "Sparse" if it is needed to avoid confusion with
  conventional Histograms (which is rare because the TSDB really has no notion
  of conventional Histograms). Use abbreviations only in local scope, and then
  really abbreviate (not just removing three out of seven letters like in
  "Histo"). This is in the spirit of
  https://github.com/golang/go/wiki/CodeReviewComments#variable-names

* Several other minor name changes.

* A lot of formatting of doc comments. For one, following
  https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences
  , but also layout question, anticipating how things will look like
  when rendered by `godoc` (even where `godoc` doesn't render them
  right now because they are for unexported types or not a doc comment
  at all but just a normal code comment - consistency is queen!).

* Re-enabled `TestQueryLog` and `TestEndopints` (they pass now,
  leaving them disabled was presumably an oversight).

* Bucket iterator for histogram.Histogram is now created with a
  method.

* HistogramChunk.iterator now allows iterator recycling. (I think
  @dieterbe only commented it out because he was confused by the
  question in the comment.)

* HistogramAppender.Append panics now because we decided to treat
  staleness marker differently.

Signed-off-by: beorn7 <beorn@grafana.com>
2021-10-11 13:02:03 +02:00
beorn7 fd5ea4e0b5 Merge branch 'main' into sparsehistogram 2021-10-07 23:16:42 +02:00
Julien Pivotto 63b3e4e5ec
Enable HTTP2 again (#9398)
We are re-enabling HTTP 2 again. There has been a few bugfixes upstream
in go, and we have also enabled ReadIdleTimeout.

Fix #7588
Fix #9068

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-09-26 23:16:12 +02:00
Robert Fratto daf2887fd4 expose scrape.userAgentHeader like remote.UserAgent
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
2021-09-13 14:10:34 -04:00
Julien Pivotto 48a101be1b
Allow to tune the scrape tolerance (#9283)
* Allow to tune the scrape tolerance

In most of the classic monitoring use cases, a few milliseconds
difference can be omitted.

In Prometheus, a few millisecond difference can however make a big
difference.

Currently, Prometheus will ignore up to 2 ms difference in the
alignments.

It turns out that for users who can afford a 10ms difference, there is a
lot of resources and disk space to win, as shown in this graph, which
shows the bytes / samples over a production Prometheus server. You can
clearly see the switch from 2ms to 10ms tolerance.

This pull request enables the adjustment of the scrape timestamp
alignment tolerance.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>

* Fix golint

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2021-09-08 17:27:33 +05:30
Bryan Boreham 92a3eeac55
Create less garbage when parsing metrics (#9299)
* Refactor: extract function to make scrapeLoop for testing

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Add benchmarks for ScrapeLoopAppend

For Prometheus and OpenMetrics

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Create less garbage when parsing metrics

Exemplar escapes to heap due to being passed through text-parser
interface, but we can reduce the impact by hoisting it out of the loop
and resetting it after every use.

(Note the cost was paid on every line even when exemplars were disabled)

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

* Create less garbage when parsing OpenMetrics

After calling parseLVals() we always append the return value, so pass in
what we want to append it to and save garbage.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
2021-09-08 13:39:21 +05:30
Łukasz Mierzwa f0a26266c0 Add scrape_sample_limit metric
This adds a new metric exposing per target scrape sample_limit value. Metrics are only exposed if extra-scrape-metrics feature flag is enabled.
scrape_sample_limit will make it easy to monitor and alert on targets getting close to configured sample_limit, which is important given than exceeding sample_limit results in the entire scrape results being rejected.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2021-09-03 15:42:41 +01:00
SuperQ 31f4108758
Add scrape_timeout_seconds metric
Add a new built-in metric `scrape_timeout_seconds` to allow monitoring
of the ratio of scrape duration to the scrape timeout. Hide behind a
feature flag to avoid additional cardinality by default.

Signed-off-by: SuperQ <superq@gmail.com>
2021-09-02 12:15:35 +02:00
Levi Harrison 70f597b033
Configure Scrape Interval and Timeout Via Relabeling (#8911)
* Configure scrape interval and timeout with labels

Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-08-31 17:37:32 +02:00
Ganesh Vernekar 8b70e87ab9
Merge remote-tracking branch 'upstream/main' into sparse-refactor
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2021-08-05 12:16:08 +05:30
Arunprasad Rajkumar 5527e26efc
scrape: fix 'target_limit exceeded error' when reloading conf with 0
Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
2021-07-27 17:34:22 +05:30
austin ce 5bdfba1d20
Extract and export GetFQDN()
Signed-off-by: austin ce <austin.cawley@gmail.com>
2021-07-21 12:55:02 -04:00
Naka Masato a1c1313b3c
fix typo in comment for scrape manager (#9094)
Signed-off-by: Masato Naka <masatonaka1989@gmail.com>
2021-07-19 15:55:13 +05:30
beorn7 5de2df752f Hacky implementation of protobuf parsing
This "brings back" protobuf parsing, with the only goal to play with
the new sparse histograms.

The Prom-2.x style parser is highly adapted to the structure of the
Prometheus text format (and later OpenMetrics). Some jumping through
hoops is required to feed protobuf into it.

This is not meant to be a model for the final implementation. It
should just enable sparse histogram ingestion at a reasonable
efficiency.

Following known shortcomings and flaws:

- No tests yet.

- Summaries and legacy histograms, i.e. without sparse buckets, are
  ignored.

- Staleness doesn't work (but this could be fixed in the appender, to
  be discussed).

- No tricks have been tried that would be similar to the tricks the
  text parsers do (like direct pointers into the HTTP response
  body). That makes things weird here. Tricky optimizations only make
  sense once the final format is specified, which will almost
  certainly not be the old protobuf format. (Interestingly, I expect
  this implementation to be in fact much more efficient than the
  original protobuf ingestion in Prom-1.x.)

- This is using a proto3 version of metrics.proto (mostly to be
  consistent with the other protobuf uses). However, proto3 sees no
  difference between an unset field. We depend on that to distinguish
  between an unset timestamp and the timestamp 0 (1970-01-01, 00:00:00
  UTC). In this experimental code, we just assume that timestamp is
  never specified and therefore a timestamp of 0 always is interpreted
  as "not set".

Signed-off-by: beorn7 <beorn@grafana.com>
2021-07-01 01:35:11 +02:00
Ganesh Vernekar 04ad56d9b8
Append sparse histograms into the Head block (#9013)
* Append sparse histograms into the Head block

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Add AtHistogram() to Iterator interface. Make HistoChunk conform to Chunk interface.

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2021-06-29 20:08:46 +05:30
Ganesh Vernekar 64bea6999e
HistogramAppender interface for sparse histograms (#9007)
Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>
2021-06-28 20:30:55 +05:30
Julius Volz 9d495afd2c Remove trailing zeros in scrape timeout header
See https://twitter.com/AviKivity/status/1405147699557638145 and
https://twitter.com/juliusvolz/status/1405790211670515712

Signed-off-by: Julius Volz <julius.volz@gmail.com>
2021-06-18 09:38:12 +02:00
Levi Harrison b5f6f8fb36 Switched to go-kit/log
Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-06-11 12:28:36 -04:00
hanjm 1df05bfd49 Add body_size_limit to prevent bad targets response large body cause Prometheus server OOM (#8827)
Signed-off-by: hanjm <hanjinming@outlook.com>
2021-05-29 07:05:42 +08:00
Levi Harrison 2826fbeeb7
SD: Add target creation failure counter and change failure handling (#8786)
* Added metric and changed failure/drop strategy

Signed-off-by: Levi Harrison <git@leviharrison.dev>
2021-05-28 23:50:59 +02:00
Callum Styan 8fd73b1d28
Add Exemplar Remote Write support (#8296)
* Write exemplars to the WAL and send them over remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Update example for exemplars, print data in a more obvious format.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add metrics for remote write of exemplars.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Fix incorrect slices passed to send in remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* We need to unregister the new metrics.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address review comments

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Order of exemplar append vs write exemplar to WAL needs to change.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Several fixes to prevent sending uninitialized or incorrect samples with an exemplar. Fix dropping exemplar for missing series. Add tests for queue_manager sending exemplars

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Store both samples and exemplars in the same timeseries buffer to remove the alloc when building final request, keep sub-slices in separate buffers for re-use

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Condense sample/exemplar delivery tests to parameterized sub-tests

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Rename test methods for clarity now that they also handle exemplars

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Rename counter variable. Fix instances where metrics were not updated correctly

Signed-off-by: Martin Disibio <mdisibio@gmail.com>

* Add exemplars to LoadWAL benchmark

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* last exemplars timestamp metric needs to convert value to seconds with
ms precision

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Process exemplar records in a separate go routine when loading the WAL.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address review comments related to clarifying comments and variable
names. Also refactor sample/exemplar to enqueue prompb types.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Regenerate types proto with comments, update protoc version again.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Put remote write of exemplars behind a feature flag.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address some of Ganesh's review comments.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Move exemplar remote write feature flag to a config file field.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address Bartek's review comments.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Don't allocate exemplar buffers in queue_manager if we're not going to
send exemplars over remote write.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add ValidateExemplar function, validate exemplars when appending to head
and log them all to WAL before adding them to exemplar storage.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address more reivew comments from Ganesh.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add exemplar total label length check.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Address a few last review comments

Signed-off-by: Callum Styan <callumstyan@gmail.com>

Co-authored-by: Martin Disibio <mdisibio@gmail.com>
2021-05-06 13:53:52 -07:00
Damien Grisonnet b50f9c1c84
Add label scrape limits (#8777)
* scrape: add label limits per scrape

Add three new limits to the scrape configuration to provide some
mechanism to defend against unbound number of labels and excessive
label lengths. If any of these limits are broken by a sample from a
scrape, the whole scrape will fail. For all of these configuration
options, a zero value means no limit.

The `label_limit` configuration will provide a mechanism to bound the
number of labels per-scrape of a certain sample to a user defined limit.
This limit will be tested against the sample labels plus the discovery
labels, but it will exclude the __name__ from the count since it is a
mandatory Prometheus label to which applying constraints isn't
meaningful.

The `label_name_length_limit` and `label_value_length_limit` will
prevent having labels of excessive lengths. These limits also skip the
__name__ label for the same reasons as the `label_limit` option and will
also make the scrape fail if any sample has a label name/value length
that exceed the predefined limits.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: add metrics and alert to label limits

Add three gauge, one for each label limit to easily access the
limit set by a certain scrape target.
Also add a counter to count the number of targets that exceeded the
label limits and thus were dropped. This is useful for the
`PrometheusLabelLimitHit` alert that will notify the users that scraping
some targets failed because they had samples exceeding the label limits
defined in the scrape configuration.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: apply label limits to __name__ label

Apply limits to the __name__ label that was previously skipped and
truncate the label names and values in the error messages as they can be
very very long.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

* scrape: remove label limits gauges and refactor

Remove `prometheus_target_scrape_pool_label_limit`,
`prometheus_target_scrape_pool_label_name_length_limit`, and
`prometheus_target_scrape_pool_label_value_length_limit` as they are not
really useful since we don't have the information on the labels in it.

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
2021-05-06 09:56:21 +01:00
Marco Pracucci 4da5c25ea4
Upgrade prometheus/common to v0.21.0
Signed-off-by: Marco Pracucci <marco@pracucci.com>
2021-04-21 12:19:16 +02:00