Commit graph

580 commits

Author SHA1 Message Date
Ben Kochie 9570924511
Merge pull request #9638 from prometheus/superq/agentMode
Add agent mode identifier
2022-05-24 10:11:21 +02:00
Matthieu MOREL 36eee11434
refactor (package cmd): move from github.com/pkg/errors to 'errors' and 'fmt' packages (#10733)
Signed-off-by: Matthieu MOREL <mmorel-35@users.noreply.github.com>

Co-authored-by: Matthieu MOREL <mmorel-35@users.noreply.github.com>
2022-05-24 16:58:59 +10:00
Łukasz Mierzwa 44e5f220c0
Move prometheus_ready metric to web package (#10729)
This moves prometheus_ready to the web package and links it with the ready variable that decides if HTTP requests should return 200 or 503.
This is a follow up change from #10682

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2022-05-23 16:00:59 +02:00
Łukasz Mierzwa 070e409dba
Add prometheus_ready metric (#10682)
When Prometheus starts it can take a long time before WAL is replayed and it can do anything useful. While it's starting it exposes metrics and other Prometheus servers can scrape it.
We do have alerts that fire if any Prometheus server is not ingesting samples and so far we've been dealing with instances that are starting for a long time by adding a check on Prometheus process uptime. Relying on uptime isn't ideal because the time needed to start depends on the number of metrics scraped, and so on the amount of data in WAL.
To help write better alerts it would be great if Prometheus exposed a metric that tells us it's fully started, that way any alert that suppose to notify us about any runtime issue can filter out starting instances.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2022-05-23 11:42:01 +02:00
Ben Ye af5ea213f7
promtool: support matchers when querying label values (#10727)
* promtool: support matchers when querying label values

Signed-off-by: Ben Ye <ben.ye@bytedance.com>

* address review comment

Signed-off-by: Ben Ye <ben.ye@bytedance.com>
2022-05-23 11:10:45 +10:00
Łukasz Mierzwa d3c9c4f574
Stop rule manager before TSDB is stopped (#10680)
During shutdown TSDB is stopped before rule manager is stopped. Since  TSDB shutdown can take a long time (minutes or 10s of minutes) it keeps rule manager running while parts of Prometheus are already stopped (most notebly scrape manager). This can cause false positive alerts to fire, mostly those that rely on absent() calls since new sample appends will stop while alert queries are still evaluated.
Stop rules before stopping TSDB and scrape manager to avoid this problem.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2022-05-20 23:26:06 +02:00
Alban Hurtaud 41630b8e88
Add hidden flag to configure discovery loop interval (#10634)
* Add hidden flag to configure discovery loop interval

Signed-off-by: Alban HURTAUD <alban.hurtaud@amadeus.com>
2022-05-06 00:42:04 +02:00
Matthieu MOREL e2ede285a2
refactor: move from io/ioutil to io and os packages (#10528)
* refactor: move from io/ioutil to io and os packages
* use fs.DirEntry instead of os.FileInfo after os.ReadDir

Signed-off-by: MOREL Matthieu <matthieu.morel@cnp.fr>
2022-04-27 11:24:36 +02:00
Filip Petkovski 1c1b174a8e
Add a --lint flag to the promtool check rules and check config commands (#10435)
* Add a --lint flag to the promtool check rules and check config commands

Checking rules with promtool emits warnings in the case of duplicate rules.
These warnings do not result in a non-zero exit code and are difficult to
spot in CI environments. Additionally, checking for duplicates is closer
to a lint check rather than a syntax check.

This commit adds a --lint flag to commands which include checking rules.
The flag can be used to enable or disable certain linting options
and cause the execution to return a non-zero exit code in case
those options are not met.

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>

* Exit with status 3 on lint error

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
2022-04-06 00:05:11 -04:00
Julien Pivotto 390956d317
Log gomaxprocs messages (#10506)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2022-03-30 19:16:22 +02:00
TomasKohout c0fd228bad
Add dependency on go.uber.org/automaxprocs (#10498)
* add dependency on go.uber.org/automaxprocs

Signed-off-by: Tomas Kohout <tomas.kohout1995@gmail.com>

Co-authored-by: Peter Bourgon <peterbourgon@users.noreply.github.com>
Co-authored-by: Julien Pivotto <roidelapluie@gmail.com>
2022-03-30 12:50:11 +02:00
Julien Pivotto f9d8e5245a
Plugins support (#10495)
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2022-03-29 14:44:39 +02:00
Wilbert Guo 83a2e52bc2
Add SyncForState Implementation for Ruler HA (#10070)
* continuously syncing activeAt for alerts

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* add import

Signed-off-by: Yijie Qin <qinyijie@amazon.com>
Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Refactor SyncForState and add unit tests

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Format code

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* Add hook for syncForState

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix go lint

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Refactor syncForState override implementation

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Add syncForState override func as argument to Update()

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix go formatting

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Fix circleci test errors

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

Remove overrideFunc as argument to run()

Signed-off-by: Wilbert Guo <wilbeguo@amazon.com>

* remove the syncForState

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* use the override function to decide if need to replace the activeAt or not

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix test case

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix format

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* Trigger build

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fixing comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* return the result of map of alerts instead of single one

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* upper case the QueryforStateSeries

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* use a more generic rule group post process function type

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix indentation

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix gofmt

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix lint

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fixing naming

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fix comments

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* add the lastEvalTimestamp as parameter

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* fmt

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

* change funcType to func

Signed-off-by: Yijie Qin <qinyijie@amazon.com>

Co-authored-by: Yijie Qin <qinyijie@amazon.com>
Co-authored-by: Yijie Qin <63399121+qinxx108@users.noreply.github.com>
2022-03-29 02:16:46 +02:00
Alan Protasio 606ef33d91 Track and report Samples Queried per query
We always track total samples queried and add those to the standard set
of stats queries can report.

We also allow optionally tracking per-step samples queried. This must be
enabled both at the engine and query level to be tracked and rendered.
The engine flag is exposed via a Prometheus feature flag, while the
query flag is set when stats=all.

Co-authored-by: Alan Protasio <approtas@amazon.com>
Co-authored-by: Andrew Bloomgarden <blmgrdn@amazon.com>
Co-authored-by: Harkishen Singh <harkishensingh@hotmail.com>
Signed-off-by: Andrew Bloomgarden <blmgrdn@amazon.com>
2022-03-21 23:49:17 +01:00
Mauro Stettler b025390cb4
Disable chunk write queue by default, allow user to configure the exact size (#10425)
* Disable chunk write queue by default

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* update flag description

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
2022-03-11 17:26:59 +01:00
ian woolf 025528a5d6
cmd: use os.MkdirTemp instead of ioutil.TempDir (#10320)
Signed-off-by: ianwoolf <btw515wolf2@gmail.com>
2022-03-08 14:08:53 +01:00
Łukasz Mierzwa a4317bf0ec
Run gofumpt on all files (#10392)
* Run gofumpt on all files

Getting golangci-lint errors when building on my laptop, possibly because I have newer version of gofumpt then what it was formatted with.
Run gofumpt -w -extra on all files as it will be needed in the future anyway.

* Update golangci-lint to v1.44.2

v1.44.0 upgraded gofumpt so bumping version in CI will help keep formatting correct for everyone

* Address golangci-lint error

Getting 'error-strings: error strings should not be capitalized or end with punctuation or a newline' from revive here.
Drop new line.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
2022-03-03 17:21:05 +01:00
SuperQ b297520666
Add agent mode identifier
Identify in the logs and liveness endpoints if the server is running in
Agent mode or not.

Signed-off-by: SuperQ <superq@gmail.com>
2022-02-17 05:27:09 +01:00
Tobias Klausmann b998636893
Improve error logging for missing config and QL dir (#10260)
* Improve error logging for missing config and QL dir

Currently, when Prometheus can't open its config file or the query
logging dir under the data dir, it only logs what it has been given
default or commandline/config. Depending on the environment this can be
less than helpful, since the working directory may be unclear to the
user. I have specifically kept the existing error messages as intact as
possible to a) still log the parameter as given and b) cause as little
disruption for log-parsers/-analyzers as possible.

So in case of the config file or the data dir being non-absolute paths,
I use os.GetWd to find the working dir and assemble an absolute path for
error logging purposes. If GetWd fails, we just log "unknown", as
recovering from an error there would be very complex measure, likely not
worth the code/effort.

Example errors:

```
$ ./prometheus
ts=2022-02-06T16:00:53.034Z caller=main.go:445 level=error msg="Error loading config (--config.file=prometheus.yml)" fullpath=/home/klausman/src/prometheus/prometheus.yml err="open prometheus.yml: no such file or directory"
$ touch prometheus.yml
$ ./prometheus
[...]
ts=2022-02-06T16:01:00.992Z caller=query_logger.go:99 level=error component=activeQueryTracker msg="Error opening query log file" file=data/queries.active fullpath=/home/klausman/src/prometheus/data/queries.active err="open data/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
[...]
$
```

Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>

* Replace our own logic with just using filepath.Abs()

Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>

* Further simplification

Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>

* Review edits

Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>

* Review edits

Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>

* Review edits

Signed-off-by: Tobias Klausmann <klausman@schwarzvogel.de>
2022-02-16 17:43:15 +01:00
Julien Pivotto 9a2e93228e
Switch to grafana/regexp everywhere (#10268)
Let's have a consistent library for regexp.

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
2022-02-13 00:58:27 +01:00
Matej Gera 2c61d29b2a
Tracing: Migrate to OpenTelemetry library (#9724)
Signed-off-by: Matej Gera <matejgera@gmail.com>
2022-01-25 11:08:04 +01:00
Rodrigo Queiro 70c1446a64
Clarify units of --storage.tsdb.retention.size (#10154)
The flag uses Base2Bytes: 129ed4ec8b/cmd/prometheus/main.go (L1476)

Signed-off-by: Rodrigo Queiro <rodrigoq@google.com>
2022-01-13 00:55:57 +01:00
beorn7 b39f2739e5 PromQL: Always enable negative offset and @ modifier
This follows the line of argument that the invariant of not looking
ahead of the query time was merely emerging behavior and not a
documented stable feature. Any query that looks ahead of the query
time was simply invalid before the introduction of the negative offset
and the @ modifier.

Signed-off-by: beorn7 <beorn@grafana.com>
2022-01-11 17:08:55 +01:00
beorn7 61509fc840 PromQL: Promote negative offset and @ modifer to stable
Following the argument that breaking the invariant that PromQL does
not look ahead of the evaluation time implies a breaking change, we
still need to keep the feature flag around, but at least we can
communicate that the feature is considered stable, and that the
feature flags will be ignored from v3 on.

Signed-off-by: beorn7 <beorn@grafana.com>
2022-01-11 00:34:33 +01:00
Björn Rabenstein 0f4a1e6eac
Merge pull request #10119 from prometheus/beorn7/remote
API: Promote remote-write-receiver to stable
2022-01-10 15:55:10 +01:00
chenlujjj 2ce94ac196
Add '--weight' flag to 'promtool check metrics' command (#10045) 2022-01-07 16:58:28 -05:00
beorn7 8fdfa52976 API: Promote remote-write-receiver to stable
Since `/api/v1/write` is a mutating endpoint, we should still activate
the remote-write-receiver explicitly. But we should do it in the same
way as the other mutating endpoints, i.e. via a flag
`--web.enable-remote-write-receiver`.

This commit marks the feature flag as deprecated, i.e. it still works
but logs a warning on startup. This enables users to seamlessly
migrate. With the next minor release, we can start ignoring the
feature flag (but still warn a user that is trying to use it).

Signed-off-by: beorn7 <beorn@grafana.com>
2022-01-05 15:36:07 +01:00
David Leadbeater a961062c37
Disable time based retention in tests (#8818)
Fixes #7699.

Signed-off-by: David Leadbeater <dgl@dgl.cx>
2022-01-02 23:46:03 +01:00
Jessica G 174a1147d5
Merge pull request #9861 from JessicaGreben/minor-prom-improvements
Add exit code constants in promtool
2021-12-31 12:07:02 -08:00
jessicagreben 4b03fa3100 replace config exit code with failure exit code
Signed-off-by: jessicagreben <jessicagrebens@gmail.com>
2021-12-30 05:37:57 -08:00
jessicagreben 59f7ef06d0 update exit code for sd
Signed-off-by: jessicagreben <jessicagrebens@gmail.com>
2021-12-18 04:45:15 -08:00
Nicholas Blott c92673fb14
Remove check against cfg so interval/ timeout are always set (#10023)
Signed-off-by: Nicholas Blott <blottn@tcd.ie>
2021-12-16 13:28:46 +01:00
Julien Pivotto db1551bd21
Merge pull request #10016 from prometheus/release-2.32
Merge back release 2.32
2021-12-14 20:58:58 +01:00
Ben Ye d9bbe7f3dd
mention agent mode in enable-feature flag help description (#9939)
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
2021-12-04 21:13:24 +01:00
zzehring 42628899b5 promtool: Add --syntax-only flag for check config
This commit adds a `--syntax-only` flag for `promtool check config`.
When passing in this flag, promtool will omit various file existence
checks that would cause the check to fail (e.g. the check would not
fail if `rule_files` files don't exist at their respective paths).

This functionality will allow CI systems to check the syntax of
configs without worrying about referenced files.

Fixes: #5222
Signed-off-by: zzehring <zack.zehring@grafana.com>
2021-12-02 15:33:11 -05:00
jessicagreben 99bb56fc46 add errcodes from sd file
Signed-off-by: jessicagreben <jessicagrebens@gmail.com>
2021-12-01 04:45:18 -08:00
Filip Petkovski 5849521e90
promtool: Fix credentials file check (#9883)
The promtool check config command still uses the bearer_token_file
field which is deprecated in favour of authorization.credentials_file.

This commit modifies the command to use the new field insted.

Fixes #9874

Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
2021-11-30 15:02:07 +11:00
jessicagreben 764f2d03a5 add const for exit codes
Signed-off-by: jessicagreben <jessicagrebens@gmail.com>
2021-11-24 09:17:49 -08:00
Matheus Alcantara d9a8c453a0
cmd: use t.TempDir instead of ioutil.TempDir on tests (#9852) 2021-11-23 20:09:28 -05:00
Robert Fratto 72a9f7fee9
Share TSDB locker code with agent (#9623)
* share tsdb db locker code with agent

Closes #9616

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* add flag to disable lockfile for agent

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* use agentOnlySetting instead of PreAction

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* tsdb: address review feedback

1. Rename Locker to DirLocker
2. Move DirLocker to tsdb/tsdbutil
3. Name metric using fmt.Sprintf
4. Refine error checking in DirLocker test

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* tsdb: create test utilities to assert expected DirLocker behavior

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* tsdb/tsdbutil: fix lint errors

Signed-off-by: Robert Fratto <robertfratto@gmail.com>

* tsdb/agent: fix windows test failure

Use new DB variable instead of overriding the old one.

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
2021-11-11 11:45:25 -05:00
Mateusz Gozdek c08bb86be0 cmd/prometheus: use random listen port in TestStartupInterrupt test
So it can be run in parallel safely.

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-11 01:37:24 +01:00
Mateusz Gozdek 7bd7573891 cmd/prometheus/main_unix_test.go: fix unix test styling
* Formatting of error message is missing a space after ':'.
* t.Fatalf should be used instead of  t.Errorf+return.

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-11 01:37:24 +01:00
Robert Fratto 4a83e6f453
Remove agent mode warnings when loading configs (#9622)
PR #9618 introduced failing to load the config file when agent mode is
configured to run with unspported settings. This made the block that
logs a warning on their configuration no-op, which is now removed.

Signed-off-by: Robert Fratto <robertfratto@gmail.com>
2021-11-10 19:39:30 +05:30
Mateusz Gozdek fa1b14e146 cmd/prometheus: randomize test port and isolate test data directory
Between the tests. This enables parallelizing those tests, which should
cut the test execution time.

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-10 09:40:43 +01:00
beorn7 c954cd9d1d Move packages out of deprecated pkg directory
This creates a new `model` directory and moves all data-model related
packages over there:
  exemplar labels relabel rulefmt textparse timestamp value

All the others are more or less utilities and have been moved to `util`:
  gate logging modetimevfs pool runtime

Signed-off-by: beorn7 <beorn@grafana.com>
2021-11-09 08:03:10 +01:00
Dieter Plaetinck cda025b5b5
TSDB: demistify SeriesRefs and ChunkRefs (#9536)
* TSDB: demistify seriesRefs and ChunkRefs

The TSDB package contains many types of series and chunk references,
all shrouded in uint types.  Often the same uint value may
actually mean one of different types, in non-obvious ways.

This PR aims to clarify the code and help navigating to relevant docs,
usage, etc much quicker.

Concretely:

* Use appropriately named types and document their semantics and
  relations.
* Make multiplexing and demuxing of types explicit
  (on the boundaries between concrete implementations and generic
  interfaces).
* Casting between different types should be free.  None of the changes
  should have any impact on how the code runs.

TODO: Implement BlockSeriesRef where appropriate (for a future PR)

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* feedback

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>

* agent: demistify seriesRefs and ChunkRefs

Signed-off-by: Dieter Plaetinck <dieter@grafana.com>
2021-11-06 15:40:04 +05:30
Bartlomiej Plotka 789274bf9c cmd: Fixed storage flag regression introduced in #9660
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2021-11-06 00:16:43 +01:00
Sunil Thaha 4bdaea7663
fix: storage.tsdb.path randomly initialised to data-agent/ (#9660)
Using the same variable for storage.tsdb.path and storage.agent.path
as below in main.go causes cfg.localStoragePath to be data/ or
data-agent/ at random.

  a.Flag("storage.tsdb.path", "Base path for metrics storage.").
      PreAction(serverOnlySetting()).
      Default("data/").StringVar(&cfg.localStoragePath)

  a.Flag("storage.agent.path", "Base path for metrics storage.").
      PreAction(agentOnlySetting()).
      Default("data-agent/").StringVar(&cfg.localStoragePath)
This patch fixes it by using a different variable for storage.agent.path

Signed-off-by: Sunil Thaha sthaha@redhat.com

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
2021-11-04 10:08:01 +00:00
Bartlomiej Plotka e68ccc7708
Fix misleading agent-only/server-only check messages. (#9650)
* Fix misleading agent-only/server-only check messages.

Issue:

```
[root@host01 ~]# docker run -it --net=host --rm -v /root/editor/prom-agent-batcopter.yaml:/etc/prometheus/prometheus.yaml -v /root/prom-batcopter-data:/prometheus -u root --name prom-agent-batcopter quay.io/prometheus/prometheus:main --enable-feature=agent --config.file=/etc/prometheus/prometheus.yaml --storage.tsdb.path=/prometheus --web.listen-address=:9091
ts=2021-11-02T16:00:59.789Z caller=main.go:205 level=info msg="Experimental agent mode enabled."
The following flag(s) can not be used in agent mode: ["--enable-feature"]
```

Problem was that PreAction gives us all parsed flag. Context does not give us any info on what flag clause it was defined.

Also added info for flag help about being server or agent only.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

* gofumpt.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
2021-11-04 09:08:53 +00:00
Mateusz Gozdek c3beca72e2 cmd/prometheus: wait for Prometheus to shutdown in tests
So temporary data directory can be successfully removed, as on Windows,
directory cannot be in used while removal.

Signed-off-by: Mateusz Gozdek <mgozdekof@gmail.com>
2021-11-02 20:14:19 +01:00