Commit graph

140 commits

Author SHA1 Message Date
Fabian Reinartz dc7d27ab9a retrieval: add honor label handling and parametrized querying.
This commit adds the honor_labels and params arguments to the scrape
config. These allow specifying query parameters used by the scrapers
and controlling whether scraped labels take precedence over colliding
server-side labels.
2015-06-23 13:45:14 +02:00
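
A minimal sketch of what such a scrape configuration could look like on the Go side; the struct, job name, host, and parameter values here are illustrative assumptions, not code from the commit:

    package main

    import (
        "fmt"
        "net/url"
    )

    // scrapeConfig is a hypothetical, trimmed-down scrape configuration showing
    // only the two options this commit adds; field and key names are assumptions.
    type scrapeConfig struct {
        JobName     string     `yaml:"job_name"`
        HonorLabels bool       `yaml:"honor_labels"` // scraped labels win over colliding server-side labels
        Params      url.Values `yaml:"params"`       // extra query parameters appended to the scrape URL
    }

    func main() {
        cfg := scrapeConfig{
            JobName:     "blackbox",
            HonorLabels: true,
            Params:      url.Values{"module": []string{"http_2xx"}},
        }

        // The params end up in the query string the scraper requests.
        u := url.URL{Scheme: "http", Host: "probe.example.com:9115", Path: "/probe", RawQuery: cfg.Params.Encode()}
        fmt.Println(cfg.JobName, "scrapes", u.String(), "honor_labels =", cfg.HonorLabels)
    }
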
Brian Brazil 0dbae36d36 Allow ingested metrics to be relabeled.
The main purpose of this is to allow for blacklisting
of expensive metrics as a tactical option.
It could also be useful for renaming labels and for removing
labels from federation.
2015-06-13 15:18:27 +01:00
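
A rough sketch of the kind of drop rule this enables, assuming a hypothetical dropRule type rather than the project's actual relabelling code:

    package main

    import (
        "fmt"
        "regexp"
    )

    // dropRule is a hypothetical stand-in for a single metric relabelling rule
    // with action "drop": samples whose __name__ matches are discarded.
    type dropRule struct{ nameRegex *regexp.Regexp }

    func (r dropRule) keep(labels map[string]string) bool {
        return !r.nameRegex.MatchString(labels["__name__"])
    }

    func main() {
        // Blacklist an expensive metric family before ingestion.
        rule := dropRule{nameRegex: regexp.MustCompile(`^apiserver_request_duration_seconds.*`)}

        samples := []map[string]string{
            {"__name__": "up", "job": "node"},
            {"__name__": "apiserver_request_duration_seconds_bucket", "le": "0.5"},
        }
        for _, s := range samples {
            fmt.Println(s["__name__"], "kept:", rule.keep(s))
        }
    }
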
Brian Brazil 58ceae82bc Revert "Allow ingested metrics to be relabeled."
This reverts commit f2f26ca08f.

Was accidentally pushed to master instead of a branch for PR.
2015-06-12 22:12:26 +01:00
Brian Brazil f2f26ca08f Allow ingested metrics to be relabeled.
The main purpose of this is to allow for blacklisting
of expensive metrics as a tactical option.
It could also be useful for renaming labels and for removing
labels from federation.
2015-06-12 22:06:30 +01:00
Brian Brazil b8b1d3cbac Web: Add pre-relabel labels to status page.
Figuring out what's going on with the new service discovery
and labels is difficult. Add a popover with the labels
to the target table to make things simpler, and help
discovery of potentially useful labels.
2015-06-08 12:19:01 +01:00
Fabian Reinartz 0de6edbdfc Move pkg/ to util/ 2015-06-01 21:12:32 +02:00
Fabian Reinartz dfaf31a1da Move web/httputils to pkg/httputil and add DeadlineClient to it 2015-06-01 21:12:31 +02:00
Julius Volz 267fd34156 Switch Prometheus to use github.com/prometheus/log.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
2015-05-20 18:19:32 +02:00
Fabian Reinartz 385919a65a Avoid inter-component blocking if ingestion/scraping blocks.
Appending to the storage can block for a long time. Timing out
scrapes can also cause longer blocks. This commit ensures that such
blocking does not affect components other than the target itself.
Also the Target interface was removed.
2015-05-18 17:58:51 +02:00
Fabian Reinartz 1a2d57b45c Move template functionality out of target.
The target implementation and interface contain methods that serve only
a specific purpose of the templates. They were moved into the template
code, as they operate on more fundamental target data.
2015-05-18 13:35:43 +02:00
Fabian Reinartz dbc08d390e Move target status data into its own object 2015-05-18 11:15:42 +02:00
Fabian Reinartz 93548a8882 Add initial file based service discovery.
This commit adds file-based service discovery, which reads target
groups from specified files. It detects changes based on file watches
and regular refreshes.
2015-05-15 14:44:54 +02:00
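
A simplified sketch of the idea, assuming a hypothetical targetGroup JSON shape and plain polling (the commit also reacts to file watches):

    package main

    import (
        "encoding/json"
        "log"
        "os"
        "time"
    )

    // targetGroup is a hypothetical, minimal shape of one file-SD entry.
    type targetGroup struct {
        Targets []string          `json:"targets"`
        Labels  map[string]string `json:"labels"`
    }

    // watchFile polls the file on an interval; the real implementation also
    // uses filesystem notifications instead of only periodic refreshes.
    func watchFile(path string, every time.Duration, out chan<- []targetGroup) {
        for range time.Tick(every) {
            b, err := os.ReadFile(path)
            if err != nil {
                log.Println("read:", err)
                continue
            }
            var groups []targetGroup
            if err := json.Unmarshal(b, &groups); err != nil {
                log.Println("parse:", err)
                continue
            }
            out <- groups
        }
    }

    func main() {
        updates := make(chan []targetGroup)
        go watchFile("targets.json", 30*time.Second, updates)
        for groups := range updates {
            log.Printf("got %d target groups", len(groups))
        }
    }
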
Fabian Reinartz d5aa012fd0 Make HTTP basic auth configurable for scrape targets. 2015-05-15 12:47:50 +02:00
Fabian Reinartz bb540fd9fd Implement config reloading on SIGHUP.
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
2015-05-13 16:49:46 +02:00
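
A minimal sketch of the signal-handling pattern, with loadConfig standing in as a hypothetical placeholder for the real config loader:

    package main

    import (
        "log"
        "os"
        "os/signal"
        "syscall"
    )

    // loadConfig is a placeholder for re-reading and applying the config file.
    func loadConfig(path string) error {
        _, err := os.ReadFile(path)
        return err
    }

    func main() {
        hup := make(chan os.Signal, 1)
        signal.Notify(hup, syscall.SIGHUP)

        if err := loadConfig("prometheus.yml"); err != nil {
            log.Fatal(err)
        }
        for range hup {
            // On SIGHUP, reload the config; keep running on failure so a bad
            // edit doesn't take the server down.
            if err := loadConfig("prometheus.yml"); err != nil {
                log.Println("reload failed:", err)
            } else {
                log.Println("config reloaded")
            }
        }
    }
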
Fabian Reinartz 5fbde88919 Switch config to YAML format. 2015-05-07 16:52:14 +02:00
Fabian Reinartz 945c49a2dd Add relabelling to target management.
This commit adds a relabelling stage on the set of base
labels from which a target is created. It allows dropping
targets and rewriting any regular or internal label.
2015-04-30 18:46:33 +02:00
Fabian Reinartz 0b619b46d6 Change JobConfig to ScrapeConfig.
This commit changes the configuration interface from job configs to scrape
configs. This includes allowing multiple ways of defining targets at once
and moving DNS SD to its own config message. DNS SD can now contain multiple
DNS names per configured discovery.
2015-04-28 23:18:55 +02:00
Fabian Reinartz 5015c2a0e8 Make target manager source based.
This commit shifts responsibility for maintaining targets from providers and
pools to the target manager. Target groups have a source name that identifies
them for updates.
2015-04-24 15:49:35 +02:00
Fabian Reinartz 4f8673aa88 Simplify update sync for targets, format config fixtures. 2015-04-19 10:36:26 +02:00
beorn7 fa1935a644 Remove /api/targets call and do not show job and instance labels on status.
/api/targets was undocumented, never used, and also broken.

Showing instance and job labels on the status page (next to targets)
does not make sense as those labels are set in an obvious way.

Also add a doc comment to TargetStateToClass.
2015-03-18 18:53:43 +01:00
beorn7 be11cb2b07 Remove the sample ingestion channel.
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.

Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...

Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.

The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.

Also, remove lint and vet warnings in ast.go.
2015-03-15 14:08:22 +01:00
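
A toy sketch of the tee idea with hypothetical sample and appender types; the real storage.Tee and Append signatures may differ:

    package main

    import "fmt"

    // sample and appender are hypothetical, minimal versions of the storage
    // interfaces this commit talks about.
    type sample struct {
        name  string
        value float64
    }

    type appender interface {
        Append(sample)
    }

    // tee forks every Append into two backends, e.g. local storage and OpenTSDB.
    type tee struct{ a, b appender }

    func (t tee) Append(s sample) {
        t.a.Append(s)
        t.b.Append(s)
    }

    type printer struct{ prefix string }

    func (p printer) Append(s sample) { fmt.Println(p.prefix, s.name, s.value) }

    func main() {
        storage := tee{a: printer{"local:"}, b: printer{"opentsdb:"}}
        storage.Append(sample{name: "up", value: 1})
    }
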
Julius Volz 140eede5e0 Rename UNREACHABLE to UNHEALTHY.
The current wording suggests that a target is not reachable at all,
although it might also get set when the target was reachable, but there
was some other error during the scrape (invalid headers or invalid
scrape content). UNHEALTHY is a more general wording that includes all
these cases.

For consistency, ALIVE is also renamed to HEALTHY.
2015-03-07 23:18:18 +01:00
Sergiusz 'q3k' Bazański 0d0bb3c030 Change instance identifiers to be host:port
This changes the PublicURL function into InstanceIdentifier, which now
returns a simple <host>:<port> string instead of a full URL.
2015-02-20 16:21:13 +01:00
Sergiusz 'q3k' Bazański bb69a3d284 Hide HTTP auth parts from URL
This introduces an extra function in the Target interface (PublicURL),
which is used to populate the instance field in scraped metrics.
2015-02-19 18:58:47 +01:00
beorn7 0f191629c6 Next try to deal with backed-up ingestion.
This is now not even trying to throttle in a benign way, but creates a
fully-fledged error. Advantage: it shows up very visibly on the status
page. Disadvantage: the server does not really adjust to a lower
scraping rate. However, if your ingestion backs up, you are in a very
irregular state; I'd say it _should_ be considered an error and not
dealt with in a more graceful way.

In other news: I'll work on optimizing ingestion so that we will
not run into that situation as easily in the first place.
2015-02-09 17:32:47 +01:00
beorn7 16a1a6d324 Add another check for stopped scraper. 2015-02-06 18:30:33 +01:00
beorn7 5678a86924 Throttle scraping if a scrape took longer than the configured interval.
The simple algorithm applied here increases the actual interval
incrementally whenever, and for as long as, the scrape itself takes
longer than the configured interval. Once scrapes become faster again,
the actual interval iteratively decreases back towards the configured one.
2015-02-06 16:44:56 +01:00
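
A sketch of such an incremental adjustment; the step size and the nextInterval helper are assumptions for illustration, not the commit's actual values:

    package main

    import (
        "fmt"
        "time"
    )

    // nextInterval nudges the actual scrape interval up while scrapes overrun
    // the configured interval and back down once they fit again.
    func nextInterval(actual, configured, lastScrapeDuration time.Duration) time.Duration {
        step := configured / 10
        if lastScrapeDuration > configured {
            return actual + step // back off incrementally
        }
        if actual > configured {
            actual -= step // recover towards the configured interval
            if actual < configured {
                actual = configured
            }
        }
        return actual
    }

    func main() {
        configured := 15 * time.Second
        actual := configured
        for _, took := range []time.Duration{20 * time.Second, 25 * time.Second, 5 * time.Second, 5 * time.Second} {
            actual = nextInterval(actual, configured, took)
            fmt.Println("scrape took", took, "-> next interval", actual)
        }
    }
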
Bjoern Rabenstein 5859b74f1b Clean up license issues.
- Move CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
2015-01-21 20:07:45 +01:00
Bjoern Rabenstein b09453af1d Adjust to new client_golang API. 2015-01-21 15:42:25 +01:00
Julius Volz d6b9e97655 Remove extraction.Result type, simplify code. 2015-01-08 16:34:01 +01:00
Brian Brazil e56786b221 Have scrape time as a pseudovariable, not a prometheus variable.
This ensures it has the right timestamp, and is easier to work with.

Switch sd variable away from 'outcome', using total/failed instead.
2014-12-27 00:39:33 +00:00
Johannes 'fish' Ziemke ff95a52b0f Rename Address to URL
The "Address" is actually a URL which may contain username and
password. Calling this Address is misleading so we rename it.

Change-Id: I441c7ab9dfa2ceedc67cde7a47e6843a65f60511
2014-12-18 12:18:16 +01:00
Bjoern Rabenstein b1e4956142 Apply a giant code cleanup.
Essentially:

- Remove unused code.

- Make it 'go vet' clean. The only remaining warnings are in generated code.

- Make it 'golint' clean. The only remaining warnings are in generated code.

- Smoothed out some minor things.

Change-Id: I3fe5c1fbead27b0e7a9c247fee2f5a45bc2d42c6
2014-12-10 16:16:49 +01:00
Bjoern Rabenstein 89bb376bce Reduce lock-protected area during scrape.
Change-Id: Iaa7faa7c916b1890b568d05bd8bfff6299b6767d
2014-12-05 19:40:41 +01:00
Bjoern Rabenstein fee88a7a77 Remove the remaining races, new and old.
Also, resolve a few other TODOs.

Change-Id: Icb39b5a5e8ca22ebcb48771cd8951c5d9e112691
2014-12-03 18:07:23 +01:00
Bjoern Rabenstein 14bda4180c Changes after pair code review.
Change-Id: Ib72d40f8e9027818cfbbd32a7a7201eebda07455
2014-11-25 17:12:59 +01:00
Bjoern Rabenstein a2feed343a Convert another occurrence from chan bool to chan struct{}.
Change-Id: I11ba127a934ee3aec0fcd139ad32a7751cff77a0
2014-11-25 17:10:39 +01:00
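
For context, the struct{} channel idiom being converted to: the empty struct carries no data, which makes it the natural element type for pure signalling channels.

    package main

    import "fmt"

    func main() {
        done := make(chan struct{})
        go func() {
            // ... do work ...
            close(done) // signal completion without sending any data
        }()
        <-done
        fmt.Println("worker finished")
    }
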
Bjoern Rabenstein 74c143c4c9 Improve scraper shutdown time.
- Stop target pools in parallel.
- Stop individual scrapers in goroutines, too.
- Timing tweaks.

Change-Id: I9dff1ee18616694f14b04408eaf1625d0f989696
2014-11-25 17:10:39 +01:00
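
A sketch of stopping pools in parallel with a WaitGroup; the pool type here is a hypothetical stand-in for the real target pools:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type pool struct{ name string }

    // Stop is a stand-in for a target pool's shutdown, which may take a while.
    func (p pool) Stop() {
        time.Sleep(100 * time.Millisecond)
        fmt.Println("stopped", p.name)
    }

    func main() {
        pools := []pool{{"job:node"}, {"job:api"}, {"job:web"}}

        // Stop all pools in parallel instead of one after another, so total
        // shutdown time is roughly the slowest pool rather than the sum.
        var wg sync.WaitGroup
        for _, p := range pools {
            wg.Add(1)
            go func(p pool) {
                defer wg.Done()
                p.Stop()
            }(p)
        }
        wg.Wait()
    }
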
Bjoern Rabenstein 92156ee89d Drain the newBaseLabels channel upon shutdown.
This should help cut down shutdown times.

Change-Id: I6e70a598a9e49aa6eeeb2034105b1bc6e9014324
2014-11-25 17:10:39 +01:00
Bjoern Rabenstein 6b37e47f9e Remove unused metrics.
Change-Id: Icf03ba4ce92a5e38daf12930f9661daba79c83bb
2014-11-25 17:09:03 +01:00
Bjoern Rabenstein 4447708c9f Fix a race in target.go.
Also, fix problems in shutdown.
Starting to serve and shutting down still have to be cleaned up properly.
It's a mess.

Change-Id: I51061db12064e434066446e6fceac32741c4f84c
2014-11-25 17:08:45 +01:00
Bjoern Rabenstein e0a6cb281e Fix the accept header.
A '/' is a separator and has to be in a quoted string.

Change-Id: If7a3a847f84f8f709074d05dc98b5b21e954030c
2014-11-25 17:02:00 +01:00
Brian Brazil 5edf689133 Stagger scrapes to spread out load.
Change-Id: Ib141b271e4adfb817886871f86051c207b05cf35
2014-11-25 17:02:00 +01:00
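
One possible staggering scheme, hashing a target identifier into an initial offset; this is an assumption for illustration and not necessarily what the commit implemented:

    package main

    import (
        "fmt"
        "hash/fnv"
        "time"
    )

    // initialDelay spreads targets of the same job over the scrape interval by
    // hashing an identifier into an offset within [0, interval).
    func initialDelay(id string, interval time.Duration) time.Duration {
        h := fnv.New64a()
        h.Write([]byte(id))
        return time.Duration(h.Sum64() % uint64(interval))
    }

    func main() {
        interval := 15 * time.Second
        for _, t := range []string{"host-a:9100", "host-b:9100", "host-c:9100"} {
            fmt.Printf("%s starts scraping after %v\n", t, initialDelay(t, interval))
        }
    }
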
Bjoern Rabenstein 1909686789 Make metrics exported by the Prometheus server itself more consistent.
- Always spell out the time unit (e.g. milliseconds instead of ms).

- Remove "_total" from the names of metrics that are not counters.

- Make use of the "Namespace" and "Subsystem" fields in the options.

- Removed the "capacity" facet from all metrics about channels/queues.
  These are all fixed via command line flags and will never change
  during the runtime of a process. Also, they should not be part of
  the same metric family. I have added separate metrics for the
  capacity of queues as a convenience. (They will never change and are
  only set once.)

- I left "metric_disk_latency_microseconds" unchanged, although that
  metric measures the latency of the storage device, even if it is not
  a spinning disk. "SSD" is read by many as "solid state disk", so
  it's not too far off. (It should be "solid state drive", of course,
  but "metric_drive_latency_microseconds" is probably confusing.)

- Brian suggested not mixing "failure" and "success" outcomes in the
  same metric family (distinguished by labels). For now, I left it as
  it is. We are touching some bigger issue here, especially as other
  parts in the Prometheus ecosystem are following the same
  principle. We still need to come to terms here and then change
  things consistently everywhere.

Change-Id: If799458b450d18f78500f05990301c12525197d3
2014-11-25 17:02:00 +01:00
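
An example of the Namespace and Subsystem fields mentioned above, written against today's client_golang API (which differs from the client used at the time); the metric shown is illustrative:

    package main

    import (
        "fmt"

        "github.com/prometheus/client_golang/prometheus"
    )

    func main() {
        // Namespace and Subsystem are joined with the name, yielding a
        // consistent full metric name with a spelled-out time unit:
        // prometheus_target_interval_length_seconds.
        g := prometheus.NewGauge(prometheus.GaugeOpts{
            Namespace: "prometheus",
            Subsystem: "target",
            Name:      "interval_length_seconds",
            Help:      "Actual interval between scrapes.",
        })
        prometheus.MustRegister(g)
        g.Set(14.8)
        fmt.Println("registered prometheus_target_interval_length_seconds")
    }
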
Brian Brazil 4a2b96f848 Remove backoff on scrape failure.
Having metrics with variable timestamps inconsistently
spaced when things fail will make it harder to write correct rules.

Update status page, requires some refactoring to insert a function.

Change-Id: Ie1c586cca53b8f3b318af8c21c418873063738a8
2014-11-25 17:02:00 +01:00
Julius Volz 1bb7074fec Fix HTTP connection leak upon non-OK status.
Change-Id: Ie7fbd7dcc089b8306b40631be3e3d736c23c1cd3
2014-11-25 17:02:00 +01:00
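
A sketch of the rule underlying this fix: drain and close the response body even when the status is not OK, otherwise the connection cannot be reused and may leak. The scrape helper and URL here are illustrative:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func scrape(url string) error {
        resp, err := http.Get(url)
        if err != nil {
            return err
        }
        // Always drain and close the body, even on a non-2xx status; otherwise
        // the underlying connection cannot be reused and may be leaked.
        defer func() {
            io.Copy(io.Discard, resp.Body)
            resp.Body.Close()
        }()

        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("scrape returned HTTP status %s", resp.Status)
        }
        _, err = io.ReadAll(resp.Body)
        return err
    }

    func main() {
        if err := scrape("http://localhost:9100/metrics"); err != nil {
            fmt.Println("scrape failed:", err)
        }
    }
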
Bjoern Rabenstein bacc31d5cc Remove work-around that required copying all bytes of a scrape.
Now that the subtle bug in matttproud/golang_protobuf_extensions is
fixed, we do not need to copy the bytes of a scrape into a buffer
first before starting to parse it.

Change-Id: Ib73ecae16173ddd219cda56388a8f853332f8853
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein 8956faeccb Migrate to new client_golang.
This change will only be submitted when the new client_golang has been
moved to the new version.

Change-Id: Ifceb59333072a08286a8ac910709a8ba2e3a1581
2014-11-25 17:01:59 +01:00
Bjoern Rabenstein 814e479723 Treat non-200 HTTP response as error.
Change-Id: I2a9f3b47012b3c4839be53aa44c66d16dd41a24a
2014-11-25 17:01:59 +01:00
Brian Brazil 23255f1499 Fix negative Next Retrieval on status page.
Change-Id: Ifa754034660a251fee71f166dbf057697ec4e872
2014-05-12 15:24:34 +01:00
Bjoern Rabenstein 64811caaec Make Prometheus announce its new super-power: text format!
Change-Id: Ia2ddfb28999c145e4d46c395381a9bf89d43148c
2014-04-22 18:44:52 +02:00
Julius Volz 740d448983 Use custom timestamp type for sample timestamps and related code.
So far we've been using Go's native time.Time for anything related to sample
timestamps. Since the range of time.Time is much bigger than what we need, this
has created two problems:

- there could be time.Time values which were out of the range/precision of the
  time type that we persist to disk, therefore causing incorrectly ordered keys.
  One bug caused by this was:

  https://github.com/prometheus/prometheus/issues/367

  It would be good to use a timestamp type that's more closely aligned with
  what the underlying storage supports.

- sizeof(time.Time) is 192, while Prometheus should be ok with a single 64-bit
  Unix timestamp (possibly even a 32-bit one). Since we store samples in large
  numbers, this seriously affects memory usage. Furthermore, copying/working
  with the data will be faster if it's smaller.

*MEMORY USAGE RESULTS*
Initial memory usage comparisons for a running Prometheus with 1 timeseries and
100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my
tests, this advantage for some reason decreased a bit the more samples the
timeseries had (to 5-7% for millions of samples). This I can't fully explain,
but perhaps garbage collection issues were involved.

*WHEN TO USE THE NEW TIMESTAMP TYPE*
The new clientmodel.Timestamp type should be used whenever time
calculations are either directly or indirectly related to sample
timestamps.

For example:
- the timestamp of a sample itself
- all kinds of watermarks
- anything that may become or is compared to a sample timestamp (like the timestamp
  passed into Target.Scrape()).

When to still use time.Time:
- for measuring durations/times not related to sample timestamps, like duration
  telemetry exporting, timers that indicate how frequently to execute some
  action, etc.

*NOTE ON OPERATOR OPTIMIZATION TESTS*
We don't use operator optimization code anymore, but it still lives in
the code as dead code. It still has tests, but I couldn't get all of them to
pass with the new timestamp format. I commented out the failing cases for now,
but we should probably remove the dead code soon. I just didn't want to do that
in the same change as this.

Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f
2013-12-03 09:11:28 +01:00
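
A toy version of such a millisecond timestamp type; the name and conversions are assumptions in the spirit of clientmodel.Timestamp, not its actual definition:

    package main

    import (
        "fmt"
        "time"
    )

    // Timestamp is a 64-bit millisecond Unix timestamp: far smaller than
    // time.Time and aligned with what the storage layer persists.
    type Timestamp int64

    func TimestampFromTime(t time.Time) Timestamp {
        return Timestamp(t.UnixNano() / int64(time.Millisecond))
    }

    func (t Timestamp) Time() time.Time {
        return time.Unix(int64(t)/1000, (int64(t)%1000)*int64(time.Millisecond))
    }

    func main() {
        now := time.Now()
        ts := TimestampFromTime(now)
        fmt.Println(ts, ts.Time().Format(time.RFC3339))
    }
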
Julius Volz d69b85e6c9 Add global label support via Ingesters. 2013-08-13 16:54:15 +02:00
Julius Volz 0003027dce Add needed trailing spaces in logs. 2013-08-12 18:22:48 +02:00
Julius Volz aa5d251f8d Use github.com/golang/glog for all logging. 2013-08-12 17:54:36 +02:00
Matt T. Proud a5141e4d0a Depointerize storage conf. and chain ingester.
The storage builders need to work with the assumption that they have
a copy of the underlying configuration data if any mutations are made.
2013-08-12 17:07:03 +02:00
Julius Volz f8b20f30ac Make retrieval work with client's new Ingester interface. 2013-08-12 15:15:41 +02:00
Julius Volz 3b970c5133 Add variable interpolation to notification messages.
This includes required refactorings to enable replacing the http client (for
testing) and moving the NotificationReq type definitions to the "notifications"
package, so that this package doesn't need to depend on "rules" anymore and
that it can instead use a representation of the required data which only
includes the necessary fields.
2013-08-12 12:29:08 +02:00
Julius Volz 35ee2cd3cb Add alertmanager notification support to Prometheus.
Alert definitions now also have mandatory SUMMARY and DESCRIPTION fields
that get sent along with a firing alert to the alert manager.
2013-07-30 17:23:41 +02:00
Matt T. Proud f7704af4f8 Code Review: Formatting comments. 2013-07-15 15:12:01 +02:00
Matt T. Proud 06b4a40661 Represent targets in a tabular interface.
This commit represents a target group's endpoints in a tabular fashion for better differentiation
of their state in a concise manner.
2013-07-15 15:12:01 +02:00
Matt T. Proud e20e6980e9 Completely extract response payload for decoding.
This commit forces the extraction framework to read the entire response payload
into a buffer before attempting to decode it, for the underlying Protocol Buffer
message readers do not block on partial messages.
2013-07-14 23:04:08 +02:00
Matt T. Proud b8c7fd8c34 Include Accept header for telemetry request.
This pull request introduces an HTTP Accept header to indicate a
preference for Protocol Buffer-encoded messages.
2013-06-27 18:32:28 +02:00
Matt T. Proud 30b1cf80b5 WIP - Snapshot of Moving to Client Model. 2013-06-25 15:52:42 +02:00
Matt T. Proud 9cde48754b Fix race conditions in TargetPool.
The race condition detector found a few anomalies whereby a
TargetPool could be read during a mutation.  This has been fixed.
2013-06-05 14:44:20 +02:00
Julius Volz 081191afb8 Remember and display last scrape errors in web UI. 2013-05-21 15:31:27 +02:00
Matt T. Proud 8f4c7ece92 Destroy naked returns in half of corpus.
The use of naked return values is frowned upon.  This is the first
of two bulk updates to remove them.
2013-05-16 10:53:25 +03:00
Julius Volz d8110fcd9c Send sample arrays instead of single samples over channels. 2013-04-29 17:24:17 +02:00
Bernerd Schaefer 3929582892 Target uses HTTP transport with deadlines
Instead of externally handling timeouts when scraping a target, we set
timeouts on the HTTP connection. This ensures that we don't leak
goroutines on timeouts.

[fixes #181]
2013-04-29 09:46:40 +02:00
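
A sketch of a client with connection-level deadlines; the specific transport knobs are assumptions and may not match the commit's implementation:

    package main

    import (
        "fmt"
        "net"
        "net/http"
        "time"
    )

    // newDeadlineClient builds an HTTP client whose connections time out on
    // their own, so a stuck scrape cannot leak a goroutine waiting forever.
    func newDeadlineClient(timeout time.Duration) *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                DialContext: (&net.Dialer{
                    Timeout: timeout,
                }).DialContext,
                ResponseHeaderTimeout: timeout,
            },
        }
    }

    func main() {
        client := newDeadlineClient(10 * time.Second)
        resp, err := client.Get("http://localhost:9100/metrics")
        if err != nil {
            fmt.Println("scrape failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }
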
Matt T. Proud a48ab34dd0 Refresh Prometheus client API usage.
The client API has been updated per https://github.com/prometheus/client_golang/pull/9.
2013-04-28 19:40:30 +02:00
Julius Volz d4ff85db5a Add instance label to health (up) timeseries. 2013-04-24 21:50:49 +02:00
Julius Volz ae316415fe Add missing argument to Printf call. 2013-04-19 16:29:58 +02:00
Julius Volz 8c9e9632a8 Record scrape health timeseries per target. 2013-04-16 19:01:26 +02:00
Julius Volz a1ba23038e Fix scrape timestamps to reduce sample time jitter.
We're currently timestamping samples with the time at the end of a scrape
iteration. It makes more sense to use a timestamp from the beginning of the
scrape for two reasons:

a) this time is more relevant to the scraped values than the time at the
   end of the HTTP-GET + JSON decoding work.
b) it reduces sample timestamp jitter if we measure at the beginning, and
   not at the completion of a scrape.
2013-04-13 03:45:37 +02:00
Johannes 'fish' Ziemke 14407a076a Convert addresses pointing to localhost in status.
Until now, targets pointing to localhost in the status view are linked to localhost, so you can't follow those links by clicking on them.
This change converts the links to point to the hostname of the prometheus server.

Before:
<a href="http://localhost:9090/metrics.json">http://localhost:9090/metrics.json</a>

After:
<a href="http://hostname-of-prometheus-server:9090/metrics.json">http://localhost:9090/metrics.json</a>
2013-04-12 15:14:04 +02:00
Matt T. Proud 5a9417f80a Include humanized target state strings.
In the current /status implementation, we cannot divine what the
target's state is but rather get an integer constant for it. This
commit stringifies the constants.
2013-03-21 11:52:42 +01:00
Julius Volz f1fc7d717a Allow replacing job targets via HTTP API.
This roughly comprises the following changes:

- index target pools by job instead of scrape interval
- make targets within a pool exchangeable while preserving existing
  health state for targets
- allow exchanging targets via HTTP API (PUT)
- show target lists in /status (experimental, for own debug use)
2013-02-28 21:33:29 +01:00
Julius Volz 047eb219e4 Fix target health state update.
Right now, futureState is only used to give hints to the health scheduler, but
nowhere is this future state persisted into the target's state field, so we
don't actually track a target's state over time.
2013-02-25 02:52:52 +01:00
Julius Volz 5d55785936 Fix broken target scrape error propagation. 2013-02-21 19:48:54 +01:00
Matt T. Proud ea54751431 Update import paths to new location.
This repository moved from matttproud/prometheus to
prometheus/prometheus, and all import paths need to be updated.
2013-01-27 18:49:45 +01:00
Matt T. Proud f2ded515b7 Support versioned telemetry providers.
client_golang was updated to support full label-oriented telemetry,
which introduced interface incompatibilities with the previous
version of Prometheus.  To alleviate this, a general fetching and
processing dispatching system has been created, which discriminates
and processes according to the version of input.
2013-01-27 17:45:50 +01:00
Matt T. Proud 88d15373c5 Upgrade Prometheus to new API. 2013-01-23 17:18:45 +01:00
Julius Volz 80cdb0121d Add support for configured job base labels. 2013-01-18 01:10:09 +01:00
Matt T. Proud 9752f1e61d Refactor Target as interface for testability.
Future tests around the ``TargetPool`` and ``TargetManager`` and
friends will be a lot easier when the concrete behaviors of
``Target`` can be extracted out.  Plus, each ``Target``, I suspect,
will have its own resolution and query strategy.
2013-01-15 16:19:51 +01:00
Matt T. Proud efe61c18fa Refactor target scheduling to separate facility.
``Target`` will be refactored down the road to support various
nuanced endpoint types.  Thus, incorporating the scheduling
behavior within it will be problematic.  To that end, the scheduling
behavior has been moved into a separate assistance type to improve
conciseness and testability.

``make format`` was also run.
2013-01-13 10:43:37 +01:00
Julius Volz 56384bf42a Add initial config and rule language implementation. 2013-01-07 23:43:36 +01:00
Matt T. Proud 52f52a7ee2 Include nascent instrumentation of stack. 2013-01-04 23:32:46 +01:00
Matt T. Proud 2922def8d0 Use the `TargetManager` for targets. 2013-01-04 17:17:23 +01:00
Julius Volz 45a3e0a182 Rename Target Frequency -> Interval. 2013-01-04 13:03:32 +01:00
Matt T. Proud 7a9777b4b5 Create `TargetPool` priority queue.
``TargetPool`` is a pool of targets pending scraping.  For now, it
uses the ``heap.Interface`` from ``container/heap`` to provide a
priority queue for the system to scrape from the next target.

It is my supposition that we'll use a model whereby we create a
``TargetPool`` for each scrape interval, into which ``Target``
instances are registered.
2013-01-04 12:17:31 +01:00
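
A toy TargetPool built on container/heap, ordered by the next scrape due time; the target type and its fields are assumptions for the sketch:

    package main

    import (
        "container/heap"
        "fmt"
        "time"
    )

    // target is a hypothetical, minimal target: an address and the time its
    // next scrape is due.
    type target struct {
        address string
        nextDue time.Time
    }

    // targetPool is a priority queue of targets, ordered by which scrape is
    // due soonest, in the spirit of the heap-based TargetPool described above.
    type targetPool []*target

    func (p targetPool) Len() int            { return len(p) }
    func (p targetPool) Less(i, j int) bool  { return p[i].nextDue.Before(p[j].nextDue) }
    func (p targetPool) Swap(i, j int)       { p[i], p[j] = p[j], p[i] }
    func (p *targetPool) Push(x interface{}) { *p = append(*p, x.(*target)) }
    func (p *targetPool) Pop() interface{} {
        old := *p
        n := len(old)
        t := old[n-1]
        *p = old[:n-1]
        return t
    }

    func main() {
        now := time.Now()
        pool := &targetPool{}
        heap.Init(pool)
        heap.Push(pool, &target{"host-b:9100", now.Add(10 * time.Second)})
        heap.Push(pool, &target{"host-a:9100", now.Add(2 * time.Second)})

        // The pool always hands out the target whose scrape is due next.
        next := heap.Pop(pool).(*target)
        fmt.Println("scrape next:", next.address)
    }
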
Renamed from retrieval/type.go