This fixes the case where a target provider closes the update
channel and exits before the context is canceled.
This should only happen for the static provider, but it's safer
to handle this case in general.
This commit simplifies the TargetHealth type and moves the target
status into the target itself. This also removes a race where error
and last scrape time could have been out of sync.
This commit removes the scrapeConfig entirely from Target.
All identity-defining parameters are thus immutable now and the mutex
can be removed.
Target identity is now correctly defined by the labels and the full URL.
This in particular includes URL parameters that are not specified in the
label set.
Fingerprint is also removed from hash to remove an unnecessary tight coupling
to the common/model package.
With this commit the scrape pool deduplicates incoming
targets before scraping them. This way multiple target providers
can produce the same target but it will be scraped only once.
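A minimal, self-contained sketch of the idea; the Target type and the FNV hash here are illustrative stand-ins, not the actual retrieval code:

    package main

    import (
    	"fmt"
    	"hash/fnv"
    	"sort"
    )

    // Target is a minimal stand-in for the real retrieval target; only the
    // fields needed to illustrate identity-based deduplication are included.
    type Target struct {
    	URL    string
    	Labels map[string]string
    }

    // hash derives an identity from the full URL and the label set, mirroring
    // the idea that identity is defined by the labels and the full URL.
    func (t Target) hash() uint64 {
    	h := fnv.New64a()
    	h.Write([]byte(t.URL))
    	keys := make([]string, 0, len(t.Labels))
    	for k := range t.Labels {
    		keys = append(keys, k)
    	}
    	sort.Strings(keys)
    	for _, k := range keys {
    		h.Write([]byte(k))
    		h.Write([]byte(t.Labels[k]))
    	}
    	return h.Sum64()
    }

    // dedupe keeps one target per identity hash, so the same target reported
    // by several providers is scraped only once.
    func dedupe(targets []Target) []Target {
    	seen := map[uint64]Target{}
    	for _, t := range targets {
    		seen[t.hash()] = t
    	}
    	out := make([]Target, 0, len(seen))
    	for _, t := range seen {
    		out = append(out, t)
    	}
    	return out
    }

    func main() {
    	a := Target{URL: "http://10.0.0.1:9100/metrics", Labels: map[string]string{"job": "node"}}
    	b := a // the same target, reported by a second provider
    	fmt.Println(len(dedupe([]Target{a, b}))) // 1
    }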
This commit updates a target set's scrape configuration
on reload. This will cause all running scrape loops to be
stopped and started again with new parameters.
This commit changes the scraper interface to accept a timestamp
so that the timestamp reported by the caller and the timestamp
attached to samples do not differ.
This commit moves Scraper handling into a separate scrapePool type.
TargetSets only manage TargetProvider lifecycles and sync the
retrieved updates to the scrapePool.
TargetProviders are now expected to send a full initial target set
within 5 seconds. The scrapePools preserve target state across reloads
and only drop targets after the initial set was synced.
We group providers by their scrape configuration. Each provider produces
target groups with a unique identifier.
On stopping a set of target providers we cancel the target providers,
stop scraping the targets and wait for the scrapers to finish.
On configuration reload all provider sets are stopped and new ones
are created. This will make targets disappear briefly on configuration
reload. Potentially scrapes are missed but due to the consistent
scrape intervals implemented recently, the impact is minor.
Double acquisition of the RLock usually doesn't blow up, but if the
write lock is requested between the two RLocks, we are deadlocked.
This deadlock does not exist in release-0.17, BTW.
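A self-contained illustration of the hazard (not the actual code); the deadlock only manifests when a writer requests the lock between the two read acquisitions:

    package main

    import (
    	"fmt"
    	"sync"
    )

    var mtx sync.RWMutex

    // inner acquires the read lock a second time while the caller still holds it.
    func inner() int {
    	mtx.RLock()
    	defer mtx.RUnlock()
    	return 42
    }

    // outer holds the read lock across the call to inner. If another goroutine
    // calls mtx.Lock() between the two RLock acquisitions, the writer blocks on
    // outer's read lock, inner's RLock blocks behind the waiting writer, and
    // nothing ever proceeds: a deadlock.
    func outer() int {
    	mtx.RLock()
    	defer mtx.RUnlock()
    	return inner()
    }

    func main() {
    	fmt.Println(outer()) // fine in isolation; the hazard needs a concurrent writer
    }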
With recent changes to a Target's internal data representation,
updating via fullLabels() assigns the additional default
instance label. This breaks target identity comparison and causes
identical targets from service discovery to be constantly swapped.
So far we were using the InstanceIdentifier to compare equality of targets.
This is not always accurate, for example for the blackbox exporter where the
actual target is in the parameter.
For historic reasons we were enforcing a timeout directly
via the TCP dialer. This has not been necessary for quite a while now.
Switching to context.Context will allow us to properly terminate
requests on shutdown as well.
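A sketch of the pattern using today's standard library API; the URL, timeout and helper name are illustrative only:

    package main

    import (
    	"context"
    	"io"
    	"net/http"
    	"time"
    )

    // scrape issues the HTTP request with a context instead of a dialer-level
    // TCP timeout, so the request can be aborted on deadline or on shutdown.
    func scrape(ctx context.Context, url string, timeout time.Duration) ([]byte, error) {
    	ctx, cancel := context.WithTimeout(ctx, timeout)
    	defer cancel()

    	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    	if err != nil {
    		return nil, err
    	}
    	resp, err := http.DefaultClient.Do(req)
    	if err != nil {
    		return nil, err
    	}
    	defer resp.Body.Close()
    	return io.ReadAll(resp.Body)
    }

    func main() {
    	// Cancelling this context (e.g. on shutdown) aborts in-flight scrapes.
    	ctx, cancel := context.WithCancel(context.Background())
    	defer cancel()
    	_, _ = scrape(ctx, "http://localhost:9100/metrics", 10*time.Second)
    }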
To evenly distribute scraping load we currently rely on random
jittering. This commit hashes over the target's identity and calculates
a consistent offset. This also ensures that scrape intervals
remain consistently spaced across config/target changes.
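A minimal sketch of the offset calculation; the FNV hash and the identity string are illustrative choices, not the actual implementation:

    package main

    import (
    	"fmt"
    	"hash/fnv"
    	"time"
    )

    // offset derives a stable position within the scrape interval from the
    // target's identity, so the same target always scrapes at the same phase
    // instead of picking a new random jitter on every reload.
    func offset(identity string, interval time.Duration) time.Duration {
    	h := fnv.New64a()
    	h.Write([]byte(identity))
    	return time.Duration(h.Sum64() % uint64(interval))
    }

    func main() {
    	fmt.Println(offset("http://10.0.0.1:9100/metrics", 15*time.Second))
    }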
This gives up on the idea to communicate through the Append() call (by
either not returning as it is now or returning an error as
suggested/explored elsewhere). Here I have added a Throttled() call,
which has the advantage that it can be called before a whole _batch_
of Append()'s. Scrapes will happen completely or not at all. Same for
rule group evaluations. That's a highly desired behavior (as discussed
elsewhere). The code is even simpler now as the whole ingestion buffer
could be removed.
Logging of throttled mode has been streamlined and will create at most
one message per minute.
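A hypothetical sketch of the idea; the interface and method names are illustrative, not necessarily the exact ones added here:

    package main

    import "fmt"

    // SampleAppender is a stand-in for the storage interface discussed above.
    type SampleAppender interface {
    	Append(sample float64)
    	Throttled() bool
    }

    type memoryStorage struct{ pressure bool }

    func (s *memoryStorage) Append(sample float64) { /* ingest */ }
    func (s *memoryStorage) Throttled() bool       { return s.pressure }

    // ingestScrape checks Throttled() once, up front, so a scrape is either
    // appended completely or skipped as a whole rather than stalling midway.
    func ingestScrape(app SampleAppender, samples []float64) bool {
    	if app.Throttled() {
    		return false // skip the whole batch
    	}
    	for _, s := range samples {
    		app.Append(s)
    	}
    	return true
    }

    func main() {
    	fmt.Println(ingestScrape(&memoryStorage{}, []float64{1, 2, 3}))
    }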
Duration parsing is actually happening in several places (and for flags, we use the
standard Go time.Duration...). This at least reduces all our
home-grown parsing to one place (in model).
nerve's registration format differs from serverset. With this commit
there is now a dedicated treecache file in util,
and two separate files for serverset and nerve.
Reference:
https://github.com/airbnb/nerve
For the SNMP and blackbox exporters, where the port tends
not to be 80/443 and indeed there may not be a port at all,
this makes the relabelling a bit simpler, as you no longer
have to figure out that this logic exists and strip off
the :80.
This is a breaking change for the example configs of
those exporters.
With the blackbox exporter, the instance label will commonly
be used for things other than hostnames so remove this restriction.
https://example.com or https://example.com/probe/me are some examples.
To prevent user error, check that URLs aren't provided as targets
when there's no relabelling that could potentially fix them.
1. Static credentials replaced with defaults.DefaultChainCredentials.
This change ensures that credentials are sourced from all providers
available in the AWS SDK, in the following order:
env variables, shared AWS config file in the user folder, EC2 instance role.
2. Added a few labels: AvailabilityZone, PublicDns, VpcId (if
available), SubnetId (if in Vpc)
The prefixed target provider changed a pointerized target group that was
reused in the wrapped target provider, causing an ever-increasing chain
of source prefixes in target groups from the Consul target provider.
We now make this bug generally impossible by switching the target group
channel from pointer to value type and thus ensuring that target groups
are copied before being passed on to other parts of the system.
I tried to not let the depointerization leak too far outside of the
channel handling (both upstream and downstream) because I tried that
initially and caused some nasty bugs, which I want to minimize.
Fixes https://github.com/prometheus/prometheus/issues/1083
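A small illustration of why a value-typed channel rules the bug out; the TargetGroup type here is a stand-in for the real config type:

    package main

    import "fmt"

    // TargetGroup is a stand-in for the real target group type.
    type TargetGroup struct {
    	Source  string
    	Targets []string
    }

    func main() {
    	ch := make(chan TargetGroup, 1) // value type: each send copies the struct

    	tg := TargetGroup{Source: "consul/node", Targets: []string{"10.0.0.1:9100"}}
    	ch <- tg

    	// Mutating the original afterwards (as a prefixing wrapper might) no
    	// longer affects the copy already handed downstream.
    	tg.Source = "dc1/" + tg.Source

    	got := <-ch
    	fmt.Println(got.Source) // "consul/node"
    }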
Bump timeouts of tests where we don't want I/O timeouts.
Adjust the full channel test to be much more reliable,
by reducing the ingestion timeout from 1ms to 0.
Move defer resp.Body.Close() up to make sure it's called even when the
HTTP request returns something other than 200 or Decoder construction
fails. This avoids leaking and eventually running out of file descriptors.
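A sketch of the pattern; the URL and the surrounding function are illustrative:

    package main

    import (
    	"fmt"
    	"net/http"
    )

    // fetch closes the body as soon as the response exists, before checking the
    // status code or building a decoder, so no code path can leak the connection
    // and its file descriptor.
    func fetch(url string) error {
    	resp, err := http.Get(url)
    	if err != nil {
    		return err
    	}
    	defer resp.Body.Close() // placed before any early return below

    	if resp.StatusCode != http.StatusOK {
    		return fmt.Errorf("unexpected status: %s", resp.Status)
    	}
    	// ... construct decoder and read the body ...
    	return nil
    }

    func main() {
    	_ = fetch("http://localhost:9100/metrics")
    }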
* Support multiple masters with retries against each master as required.
* Scrape masters' metrics.
* Add role meta label for node/service/master to make it easier for relabeling.
When the test ends, all files matching the watcher's glob are removed
via defer. At that moment, the draining goroutine may still be running
and may then detect no files matching the configured glob just before
the test exits.
This is now solved by waiting for the draining goroutine to finish
before leaving the test function and thus triggering the deferred file
removal.
This is with `golint -min_confidence=0.5`.
I left several lint warnings untouched because they were either
incorrect or I felt it was better not to change them at the moment.
merge() closes the channel that handleUpdates() reads from when there
are zero configured target providers in the configuration. In that case,
the for-select loop in handleUpdates() entered a busy loop. It should
exit when the upstream channel is closed.
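A self-contained illustration of the fix; all names are illustrative:

    package main

    import "fmt"

    // drain shows the fix: a receive from a closed channel returns immediately
    // with ok == false, so the loop must return (or nil the channel) instead of
    // spinning in the select.
    func drain(updates <-chan string, done <-chan struct{}) {
    	for {
    		select {
    		case u, ok := <-updates:
    			if !ok {
    				return // upstream closed; without this we busy-loop
    			}
    			fmt.Println("update:", u)
    		case <-done:
    			return
    		}
    	}
    }

    func main() {
    	ch := make(chan string)
    	close(ch) // zero providers: merge() closes the channel right away
    	drain(ch, make(chan struct{}))
    }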
Instead of only filling __meta_consul_dc when datacenter is set in
consul_sd_config, this change fills the label based on the data center
the agent reports as its current one if datacenter isn't set manually;
otherwise it uses whatever datacenter was set to.
ReplaceAllString only replaces the matching part of the regex,
the unmatched bits around it are left in place. This is not the
expected or desired behaviour as the replacement string should
be everything.
This may break users dependent on this behaviour, but
what they're doing is still possible.
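A small demonstration of the difference; the ExpandString-based variant is only a rough stand-in for what the relabelling code does:

    package main

    import (
    	"fmt"
    	"regexp"
    )

    func main() {
    	re := regexp.MustCompile("o+")

    	// ReplaceAllString keeps the unmatched surroundings in place:
    	fmt.Println(re.ReplaceAllString("foo-bar", "X")) // "fX-bar"

    	// The relabelling change makes the result be the expanded replacement
    	// only, roughly equivalent to expanding against the match:
    	if m := re.FindStringSubmatchIndex("foo-bar"); m != nil {
    		fmt.Println(string(re.ExpandString(nil, "X", "foo-bar", m))) // "X"
    	}
    }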
The intended use case is where a user has tags/labels coming
from metadata in Kubernetes or EC2, and wants to make
some subset of them into target labels.
Include the position of identical SD mechanisms within the same scrape configuration.
Move unique prefixing out of SD implementations and target manager into
its own interface.
Allow scrape_configs to have an optional proxy_url option which specifies
a proxy to be used for all connections to hosts in that config.
Internally this modifies the various client functions to take a *url.URL,
which currently must point to an HTTP proxy (but has been left open-ended to
allow the URL format to be extended to support others, such as SOCKS if
needed).
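A sketch of the underlying mechanism in net/http; the proxy address is illustrative:

    package main

    import (
    	"net/http"
    	"net/url"
    	"time"
    )

    // newClient shows the idea behind proxy_url: an optional *url.URL is wired
    // into the transport, and every request made through the client goes via
    // that proxy. A nil proxyURL keeps direct connections.
    func newClient(proxyURL *url.URL, timeout time.Duration) *http.Client {
    	transport := &http.Transport{}
    	if proxyURL != nil {
    		transport.Proxy = http.ProxyURL(proxyURL)
    	}
    	return &http.Client{Transport: transport, Timeout: timeout}
    }

    func main() {
    	proxy, _ := url.Parse("http://proxy.internal:3128") // illustrative address
    	_ = newClient(proxy, 10*time.Second)
    }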
In setups where the ServiceAddress is the relevant address for
scraping, users can relabel the `__address__` label to ServiceAddress
+ ":" + ServicePort.
This needs to be documented, of course. Will do once this is LGTM'd.
This takes the modulus of a hash of some labels.
Combined with a keep relabel action, this allows
for sharding of targets across multiple Prometheus
servers.
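An illustrative sketch of the computation; the FNV hash is a stand-in, the real implementation may hash differently:

    package main

    import (
    	"fmt"
    	"hash/fnv"
    )

    // shard computes the modulus of a hash of a label value; combined with a
    // keep action that matches only one residue, each Prometheus server
    // scrapes its own subset of targets.
    func shard(value string, modulus uint64) uint64 {
    	h := fnv.New64a()
    	h.Write([]byte(value))
    	return h.Sum64() % modulus
    }

    func main() {
    	for _, addr := range []string{"10.0.0.1:9100", "10.0.0.2:9100", "10.0.0.3:9100"} {
    		fmt.Printf("%s -> shard %d of 4\n", addr, shard(addr, 4))
    	}
    }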
Changes to the UI:
- "Active Since" timestamps are now human-readable.
- Alerting rules are now pretty-printed better.
- Labels are no longer just strings, but alert bubbles (like we do on
the status page for base labels).
- Alert states and target health states are now capitalized in the
presentation layer rather than at the source.
This commit adds the honor_labels and params arguments to the scrape
config. This allows specifying query parameters used by the scrapers
and handling scraped labels with precedence.
The main purpose of this is to allow for blacklisting
of expensive metrics as a tactical option.
It could also find uses for renaming and removing labels
from federation.
Figuring out what's going on with the new service discovery
and labels is difficult. Add a popover with the labels
to the target table to make things simpler, and help
discovery of potentially useful labels.
TargetProviders may flush some last changes to the target manager
before actually stopping. To properly read those from the channel
the target manager must not be locked while stopping a provider.
This calculates how much a counter increases over
a given period of time, which is the area under the curve
of its rate.
increase(x[5m]) is equivalent to rate(x[5m]) * 300.
This change is conceptually very simple, although the diff is large. It
switches logging from "github.com/golang/glog" to
"github.com/prometheus/log", while not actually changing any log
messages. V(1)-style logging has been changed to be log.Debug*().
Appending to the storage can block for a long time. Timing out
scrapes can also cause longer blocks. This commit prevents those
blocks from affecting components other than the target itself.
Also the Target interface was removed.
The target implementation and interface contain methods only serving a
specific purpose of the templates. They were moved to the template
as they operate on more fundamental target data.
Some SD configs may have many options. To be readable and consistent, make
all discovery constructors receive the full config rather than the separate
arguments.
This commit adds file-based service discovery which reads target
groups from specified files. It detects changes based on file watches
and regular refreshes.
With this commit, sending SIGHUP to the Prometheus process will reload
and apply the configuration file. The different components attempt
to handle failing changes gracefully.
This commit adds a relabelling stage on the set of base
labels from which a target is created. It allows dropping
targets and rewriting any regular or internal label.
This commit changes the configuration interface from job configs to scrape
configs. This includes allowing multiple ways of target definition at once
and moving DNS SD to its own config message. DNS SD can now contain multiple
DNS names per configured discovery.
This commit shifts responsibility for maintaining targets from providers and
pools to the target manager. Target groups have a source name that identifies
them for updates.
/api/targets was undocumented and never used and also broken.
Showing instance and job labels on the status page (next to targets)
does not make sense as those labels are set in an obvious way.
Also add a doc comment to TargetStateToClass.
The one central sample ingestion channel has caused a variety of
trouble. This commit removes it. Targets and rule evaluation call an
Append method directly now. To incorporate multiple storage backends
(like OpenTSDB), storage.Tee forks the Append into two different
appenders.
Note that the tsdb queue manager had its own queue anyway. It was a
queue after a queue... Much queue, so overhead...
Targets have their own little buffer (implemented as a channel) to
avoid stalling during an http scrape. But a new scrape will only be
started once the old one is fully ingested.
The contraption of three pipelined ingesters was removed. A Target is
an ingester itself now. Despite more logic in Target, things should be
less confusing now.
Also, remove lint and vet warnings in ast.go.
The current wording suggests that a target is not reachable at all,
although it might also get set when the target was reachable, but there
was some other error during the scrape (invalid headers or invalid
scrape content). UNHEALTHY is a more general wording that includes all
these cases.
For consistency, ALIVE is also renamed to HEALTHY.
We were using Godep incorrectly (cloning repos from the internet during
build time instead of including Godeps/_workspace in the GOPATH via
"godep go"). However, to avoid even having to fetch "godeps" from the
internet during build, this now just copies the vendored files into the
GOPATH.
Also, the protocol buffer library moved from Google Code to GitHub,
which is reflected in these updates.
This fixes https://github.com/prometheus/prometheus/issues/525
- Increase samplesQueueCapacity.
- Improve docstring for the above.
- Accept a short waiting period for the ingest channel to become
ready. This should depend on the HTTP timeout, but 100ms is probably
good enough to cushion bursts bigger than samplesQueueCapacity,
while it is unlikely that anybody will ever set an HTTP timeout
similarly short.
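A sketch of the waiting-period idea; the channel type and helper are illustrative:

    package main

    import (
    	"fmt"
    	"time"
    )

    // offer tries to hand a sample to the ingestion channel, waiting briefly
    // for it to become ready before giving up. The grace period cushions
    // bursts larger than the queue capacity without stalling indefinitely.
    func offer(ch chan<- float64, sample float64, grace time.Duration) bool {
    	select {
    	case ch <- sample:
    		return true
    	case <-time.After(grace):
    		return false // channel stayed full; report ingestion failure
    	}
    }

    func main() {
    	ch := make(chan float64, 1)
    	fmt.Println(offer(ch, 1.0, 100*time.Millisecond)) // true: buffer has room
    	fmt.Println(offer(ch, 2.0, 100*time.Millisecond)) // false: buffer now full
    }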
This is now not even trying to throttle in a benign way, but creates a
fully-fledged error. Advantage: It shows up very visible on the status
page. Disadvantage: The server does not really adjust to a lower
scraping rate. However, if your ingestion backs up, you are in a very
irregular state, and I'd say it _should_ be considered an error and not
dealt with in a more graceful way.
In different news: I'll work on optimizing ingestion so that we will
not as easily run into that situation in the first place.
The simple algorithm applied here will increase the actual interval
incrementally, whenever and as long as the scrape itself takes longer
than the configured interval. Once scrapes become faster than the
interval again, the actual interval will iteratively decrease.
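A hypothetical sketch of such an adjustment; the 10% step size is an illustrative choice, not necessarily what the code uses:

    package main

    import (
    	"fmt"
    	"time"
    )

    // nextInterval grows the actual interval a bit while a scrape takes longer
    // than the configured interval, and shrinks it back toward the configured
    // value once scrapes are fast again.
    func nextInterval(actual, configured, lastScrapeDuration time.Duration) time.Duration {
    	if lastScrapeDuration > configured {
    		return actual + configured/10 // back off incrementally
    	}
    	if actual > configured {
    		next := actual - configured/10
    		if next < configured {
    			next = configured
    		}
    		return next
    	}
    	return configured
    }

    func main() {
    	fmt.Println(nextInterval(15*time.Second, 15*time.Second, 20*time.Second)) // grows
    	fmt.Println(nextInterval(18*time.Second, 15*time.Second, 5*time.Second))  // shrinks
    }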
- Moved CONTRIBUTORS.md to the more common AUTHORS.
- Added the required NOTICE file.
- Changed "Prometheus Team" to "The Prometheus Authors".
- Reverted the erroneous changes to the Apache License.
The "Address" is actually a URL which may contain username and
password. Calling this Address is misleading so we rename it.
Change-Id: I441c7ab9dfa2ceedc67cde7a47e6843a65f60511
Essentially:
- Remove unused code.
- Make it 'go vet' clean. The only remaining warnings are in generated code.
- Make it 'golint' clean. The only remaining warnings are in generated code.
- Smoothed out some minor things.
Change-Id: I3fe5c1fbead27b0e7a9c247fee2f5a45bc2d42c6
Also, fix problems in shutdown.
Starting serving and shutdown still has to be cleaned up properly.
It's a mess.
Change-Id: I51061db12064e434066446e6fceac32741c4f84c
- Always spell out the time unit (e.g. milliseconds instead of ms).
- Remove "_total" from the names of metrics that are not counters.
- Make use of the "Namespace" and "Subsystem" fields in the options.
- Removed the "capacity" facet from all metrics about channels/queues.
These are all fixed via command line flags and will never change
during the runtime of a process. Also, they should not be part of
the same metric family. I have added separate metrics for the
capacity of queues as convenience. (They will never change and are
only set once.)
- I left "metric_disk_latency_microseconds" unchanged, although that
metric measures the latency of the storage device, even if it is not
a spinning disk. "SSD" is read by many as "solid state disk", so
it's not too far off. (It should be "solid state drive", of course,
but "metric_drive_latency_microseconds" is probably confusing.)
- Brian suggested not to mix "failure" and "success" outcomes in the
same metric family (distinguished by labels). For now, I left it as
it is. We are touching some bigger issue here, especially as other
parts in the Prometheus ecosystem are following the same
principle. We still need to come to terms here and then change
things consistently everywhere.
Change-Id: If799458b450d18f78500f05990301c12525197d3
Having metrics with variable timestamps inconsistently
spaced when things fail will make it harder to write correct rules.
Update status page, requires some refactoring to insert a function.
Change-Id: Ie1c586cca53b8f3b318af8c21c418873063738a8
Now that the subtle bug in matttproud/golang_protobuf_extensions is
fixed, we do not need to copy the bytes of a scrape into a buffer
first before starting to parse it.
Change-Id: Ib73ecae16173ddd219cda56388a8f853332f8853
So far we've been using Go's native time.Time for anything related to sample
timestamps. Since the range of time.Time is much bigger than what we need, this
has created two problems:
- there could be time.Time values which were out of the range/precision of the
time type that we persist to disk, therefore causing incorrectly ordered keys.
One bug caused by this was:
https://github.com/prometheus/prometheus/issues/367
It would be good to use a timestamp type that's more closely aligned with
what the underlying storage supports.
- sizeof(time.Time) is 192, while Prometheus should be ok with a single 64-bit
Unix timestamp (possibly even a 32-bit one). Since we store samples in large
numbers, this seriously affects memory usage. Furthermore, copying/working
with the data will be faster if it's smaller.
*MEMORY USAGE RESULTS*
Initial memory usage comparisons for a running Prometheus with 1 timeseries and
100,000 samples show roughly a 13% decrease in total (VIRT) memory usage. In my
tests, this advantage for some reason decreased a bit the more samples the
timeseries had (to 5-7% for millions of samples). This I can't fully explain,
but perhaps garbage collection issues were involved.
*WHEN TO USE THE NEW TIMESTAMP TYPE*
The new clientmodel.Timestamp type should be used whenever time
calculations are either directly or indirectly related to sample
timestamps.
For example:
- the timestamp of a sample itself
- all kinds of watermarks
- anything that may become or is compared to a sample timestamp (like the timestamp
passed into Target.Scrape()).
When to still use time.Time:
- for measuring durations/times not related to sample timestamps, like duration
telemetry exporting, timers that indicate how frequently to execute some
action, etc.
*NOTE ON OPERATOR OPTIMIZATION TESTS*
We don't use operator optimization code anymore, but it still lives in
the code as dead code. It still has tests, but I couldn't get all of them to
pass with the new timestamp format. I commented out the failing cases for now,
but we should probably remove the dead code soon. I just didn't want to do that
in the same change as this.
Change-Id: I821787414b0debe85c9fffaeb57abd453727af0f
This adds search domain support by trying to resolve a name by
appending each search domain configured in /etc/resolv.conf until
the query succeeds (NOERROR) and has at least one answer.
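A sketch of the lookup order using the github.com/miekg/dns library; ndots handling and error handling are simplified:

    package main

    import (
    	"fmt"

    	"github.com/miekg/dns"
    )

    // resolveWithSearch tries the bare name, then the name with each search
    // domain from /etc/resolv.conf appended, and stops at the first response
    // that is NOERROR and carries at least one answer.
    func resolveWithSearch(name string, qtype uint16) (*dns.Msg, error) {
    	conf, err := dns.ClientConfigFromFile("/etc/resolv.conf")
    	if err != nil {
    		return nil, err
    	}
    	if len(conf.Servers) == 0 {
    		return nil, fmt.Errorf("no nameservers configured")
    	}
    	client := &dns.Client{}
    	candidates := []string{name}
    	for _, d := range conf.Search {
    		candidates = append(candidates, name+"."+d)
    	}
    	for _, c := range candidates {
    		msg := new(dns.Msg)
    		msg.SetQuestion(dns.Fqdn(c), qtype)
    		resp, _, err := client.Exchange(msg, conf.Servers[0]+":"+conf.Port)
    		if err != nil {
    			continue
    		}
    		if resp.Rcode == dns.RcodeSuccess && len(resp.Answer) > 0 {
    			return resp, nil
    		}
    	}
    	return nil, fmt.Errorf("no search domain yielded an answer for %q", name)
    }

    func main() {
    	if resp, err := resolveWithSearch("prometheus", dns.TypeSRV); err == nil {
    		fmt.Println(resp.Answer)
    	}
    }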
Change-Id: Ibdc5138c5d8cc049e11fab90c3d5243d5a06852c
This reverts commit e3bc6fc9dc, reversing
changes made to 1cf9e5840a.
Conflicts:
retrieval/target_provider.go
Change-Id: Icb6e98fb30419e9e2fe9b686c243702ced372014