Commit graph

3827 commits

Author SHA1 Message Date
Tom Wilkie 4d9b917d11 Instrument Prometheus with OpenTracing (#2554)
* Use request.Context() instead of a global map of contexts.

* Add some basic opentracing instrumentation on the query path.

* Remove tracehandler endpoint.
2017-05-02 18:49:29 -05:00
Stephan Erb 0b9fca983b Fix reload of ZooKeeper service discovery config (#2669)
Rationale:

* When the config is reloaded and the provider context is canceled, we need to
  exit the current ZK `TargetProvider.Run` method as a new provider will be
  instantiated.
* In case `Stop` is called on the `ZookeeperTreeCache`, the update/events
  channel may not be closed as it is shared by multiple caches and would
  thus be double closed.
* Stopping all `zookeeperTreeCacheNode`s on teardown ensures all associated
  watcher goroutines will be closed eagerly rather than implicitly on
  connection close events.
2017-05-02 18:21:37 -05:00
Fabian Reinartz 86426c0566 Merge pull request #2672 from svend/kubernetes-pods-port-comment
Document what ports are scraped by default in k8s example
2017-05-02 11:12:13 +02:00
Svend Sorensen 94a3e863e4 Document what ports are scraped by default in k8s example
The Kubernetes pod SD creates a target for each declared port, as documented:

https://prometheus.io/docs/operating/configuration/#pod

> The pod role discovers all pods and exposes their containers as targets. For
> each declared port of a container, a single target is generated. If a
> container has no specified ports, a port-free target per container is created
> for manually adding a port via relabeling.

This results in the default port being the declared port, or no port if none are
declared.
2017-05-01 15:58:48 -07:00
Conor Broderick 314b81062d Updated vendoring for log level reporting issue (#2660) 2017-04-27 14:25:13 +01:00
Julius Volz fe11c5933a Fix mutation of active alert elements by notifier (#2656)
This caused the external label application in the notifier to bleed back
into the rule manager's active alerting elements.
2017-04-26 10:29:42 -05:00
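The class of bug described in fe11c5933a (a consumer modifying label sets it shares with the producer) can be illustrated with a minimal Python sketch. This is not the Prometheus code, which is written in Go. The function name `with_external_labels` is hypothetical, and the rule that external labels only fill in names the alert does not already carry is an assumption made for the example.

```python
from typing import Dict


def with_external_labels(alert_labels: Dict[str, str],
                         external: Dict[str, str]) -> Dict[str, str]:
    """Return a new label set with external labels applied.

    The input dict is copied first, so the caller's (the rule manager's)
    active alert is never mutated by the notifier-side step.
    """
    merged = dict(alert_labels)  # shallow copy; label values are immutable strings
    for name, value in external.items():
        # Assumption: labels already present on the alert take precedence.
        merged.setdefault(name, value)
    return merged


active_alert = {"alertname": "HighLatency", "severity": "page"}
outgoing = with_external_labels(active_alert, {"region": "eu-west"})
```

After the call, `active_alert` still holds only its original two labels, while `outgoing` additionally carries `region`.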
Fabian Reinartz 5248118b10 Merge pull request #2654 from dsymonds/master
Add maintainers' GitHub usernames to MAINTAINERS.md.
2017-04-25 08:43:36 +02:00
David Symonds 8bb07490a2 Add maintainers' GitHub usernames to MAINTAINERS.md.
CONTRIBUTING.md instructs people to loop them in using that mechanism,
but nothing lists the right username.
2017-04-25 16:32:23 +10:00
Fabian Reinartz 60d9138b6b Merge pull request #2653 from dsymonds/master
Preserve Alertmanager URLs as *url.URL.
2017-04-25 08:27:31 +02:00
David Symonds 04ad889751 Preserve Alertmanager URLs as *url.URL.
Render a nicer link in the web UI.
2017-04-25 16:17:46 +10:00
Conor Broderick 9eb1a5d6bf Handle invalid query in graph UI (#2652) 2017-04-24 10:50:57 +01:00
Brian Brazil 8b8ba26129 Merge pull request #2644 from prometheus/release-1.6
Merge 1.6.1 release from 1.6 branch
2017-04-19 15:22:24 +01:00
Brian Brazil 8097a3c523 Cut v1.6.1 (#2640) 2017-04-19 14:23:56 +01:00
beorn7 e499ef8cac Merge bug fixes from branch 'release-1.6' 2017-04-18 18:06:01 +02:00
Björn Rabenstein 872ed88166 Merge pull request #2638 from prometheus/beorn7/storage
storage: Don't panic if storage has no FPs even after initial wait
2017-04-18 17:02:07 +02:00
beorn7 1dd737d7c3 storage: Don't panic if storage has no FPs even after initial wait 2017-04-18 15:59:12 +02:00
Matt Layher 1faf33acac Add promlint check for histogram/summary reserved names (#2626) 2017-04-15 22:38:01 +01:00
Tobias Schmidt 09a977a782 Create sha256 checksums file during release 2017-04-15 12:26:51 -03:00
Tobias Schmidt 619cc0e0ff Merge pull request #2625 from mdlayher/promlint-cleanup
Simplify promlint problems gathering, use protobuf accessors
2017-04-14 22:47:30 +02:00
Matt Layher cc4198f421 Simplify promlint problems gathering, use protobuf accessors 2017-04-14 16:40:40 -04:00
Matt Layher 34a4813464 Initial promlint counter _total suffix check (#2624) 2017-04-14 22:09:54 +02:00
Matt Layher 254cb1ec29 Use untyped metrics for some promlint tests (#2623) 2017-04-14 19:38:57 +01:00
Björn Rabenstein 67d511784d Merge pull request #2619 from prometheus/release-1.6
Cut v1.6.0
2017-04-14 20:12:22 +02:00
beorn7 10f6453829 Cut v1.6.0 2017-04-14 19:53:58 +02:00
Jack Neely 896f951e68 Force buckets in a histogram to be monotonic for quantile estimation (#2610)
* Force buckets in a histogram to be monotonic for quantile estimation

The assumption that bucket counts increase monotonically with increasing
upperBound may be violated during:

  * Recording rule evaluation of histogram_quantile, especially when rate()
     has been applied to the underlying bucket timeseries.
  * Evaluation of histogram_quantile computed over federated bucket
     timeseries, especially when rate() has been applied

This is because scraped data is not made available to RR evaluation or
federation atomically, so some buckets are computed with data from the N
most recent scrapes, but the other buckets are missing the most recent
observations.

Monotonicity is usually guaranteed because if a bucket with upper bound
u1 has count c1, then any bucket with a higher upper bound u > u1 must
have counted all c1 observations and perhaps more, so that c >= c1.

Randomly interspersed partial sampling breaks that guarantee, and rate()
exacerbates it. Specifically, suppose bucket le=1000 has a count of 10 from
4 samples but the bucket with le=2000 has a count of 7, from 3 samples. The
monotonicity is broken. It is exacerbated by rate() because under normal
operation, cumulative counting of buckets will cause the bucket counts to
diverge such that small differences from missing samples are not a problem.
rate() removes this divergence.

bucketQuantile depends on that monotonicity to do a binary search for the
bucket with the qth percentile count, so breaking the monotonicity
guarantee causes bucketQuantile() to return undefined (nonsense) results.

As a somewhat hacky solution until the Prometheus project is ready to
accept the changes required to make scrapes atomic, we calculate the
"envelope" of the histogram buckets, essentially removing any decreases
in the count between successive buckets.

* Fix up comment docs for ensureMonotonic

* ensureMonotonic: Use switch statement

Use switch statement rather than if/else for better readability.
Process the most frequent cases first.
2017-04-14 16:21:49 +02:00
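The "envelope" described in 896f951e68 can be sketched in a few lines of Python. This is an illustration of the idea only, not the Go function `ensureMonotonic` from the commit. It assumes the buckets are already sorted by upper bound and reads "removing any decreases" as replacing each count with the running maximum seen so far. The `Bucket` alias and the sample values in the third bucket are made up for the example.

```python
from typing import List, Tuple

# (upper_bound, cumulative_count); assumed sorted by upper_bound.
Bucket = Tuple[float, float]


def ensure_monotonic(buckets: List[Bucket]) -> List[Bucket]:
    """Return a copy of buckets whose counts never decrease.

    Any count lower than a preceding bucket's count is raised to the
    running maximum, which removes decreases between successive buckets.
    """
    result: List[Bucket] = []
    running_max = float("-inf")
    for upper_bound, count in buckets:
        running_max = max(running_max, count)
        result.append((upper_bound, running_max))
    return result


# The situation from the commit message: le=1000 has count 10,
# but le=2000 has only 7 because it is missing a sample.
buckets = [(1000.0, 10.0), (2000.0, 7.0), (float("inf"), 12.0)]
fixed = ensure_monotonic(buckets)
```

With this input the second bucket's count is lifted from 7 to 10, so a quantile search over `fixed` sees non-decreasing counts again.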
Matt Layher 283756c503 Initial commit of 'promtool check-metrics', promlint package (#2605) 2017-04-13 23:53:41 +02:00
Conor Broderick ee62807b62 Added min/max to graph to accommodate constant time series (#2612)
Added min/max to graph to accommodate constant time series
2017-04-12 14:25:25 +01:00
Björn Rabenstein 1fb2190eeb Merge pull request #2607 from prometheus/beorn7/storage
Vendoring update prior to 1.6 release
2017-04-11 14:31:58 +02:00
beorn7 c53f256a09 storage: Fix use of counter (Set -> Add) 2017-04-11 12:58:24 +02:00
beorn7 1ae50b1d1b vendoring: Update client_golang/prometheus
This is mostly required to enable summaries without quantiles
2017-04-11 12:58:24 +02:00
beorn7 92d4cf7663 vendoring: Remove unused packages 2017-04-11 12:58:24 +02:00
Brian Brazil 0e0fc5a7f4 Correct example name to adapter. (#2590) 2017-04-10 17:24:53 +01:00
Björn Rabenstein acd72ae1a7 Merge pull request #2591 from prometheus/beorn7/storage
storage: Several optimizations of checkpointing
2017-04-07 20:02:14 +02:00
Goutham Veeramachaneni cffb1acf7f Test Longer Tests in Travis (#2570)
* Test Longer Tests in Travis

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>

* Make test Target Run All Tests

* Add test-short to run short tests

test is running all the tests now as we are running make tests in
CircleCI and I think the base image is shared across Prometheus Org.

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>

* Remove Empty Line

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
2017-04-07 13:46:06 +02:00
beorn7 f20b84e816 flags: Improve doc strings for checkpoint flags 2017-04-07 13:10:12 +02:00
beorn7 f338d791d2 storage: Several optimizations of checkpointing
- checkpointSeriesMapAndHeads accepts a context now to allow
  cancelling.

- If a shutdown is initiated, cancel the ongoing checkpoint. (We will
  create a final checkpoint anyway.)

- Always wait for at least as long as the last checkpoint took before
  starting the next checkpoint (to cap the time spent checkpointing
  at 50%).

- If an error has occurred during checkpointing, don't bother to sync
  the write.

- Make sure the temporary checkpoint file is deleted, even if an error
  has occurred.

- Clean up the checkpoint loop a bit. (The concurrent Timer.Reset(0)
  call might have caused a race.)
2017-04-07 13:10:12 +02:00
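The 50% cap mentioned in f338d791d2 follows from simple arithmetic: if a checkpoint takes d seconds and the following wait is at least d seconds, then the fraction of time spent checkpointing is d / (d + wait) <= 1/2. A small Python sketch of that rule, with hypothetical function names and an assumed configured interval (neither is taken from the Prometheus source):

```python
def next_checkpoint_wait(configured_interval: float, last_duration: float) -> float:
    """Seconds to wait before the next checkpoint.

    Waits the configured interval, but never less than the time the
    previous checkpoint took.
    """
    return max(configured_interval, last_duration)


def checkpointing_fraction(duration: float, wait: float) -> float:
    """Share of one checkpoint-plus-wait cycle spent checkpointing."""
    return duration / (duration + wait)


# A slow checkpoint (420 s) with a shorter configured interval (300 s):
wait = next_checkpoint_wait(configured_interval=300.0, last_duration=420.0)
fraction = checkpointing_fraction(420.0, wait)
```

Here the wait is stretched to 420 s, so the fraction is exactly 0.5. For any checkpoint faster than the configured interval the fraction is lower.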
Björn Rabenstein 934d86b936 Merge pull request #2593 from prometheus/beorn7/storage2
storage: Recover from corrupted indices for archived series
2017-04-07 12:55:35 +02:00
Goutham Veeramachaneni 0f48d07f95 Fix Map Race by Moving Locking closer to the Write (#2476) 2017-04-07 08:55:01 +02:00
Julius Volz 182d7de9cd Merge pull request #2597 from richardkiene/CMON-53
Add triton zone brand metadata
2017-04-07 01:02:02 +02:00
Björn Rabenstein 38bcba11fe Merge pull request #2594 from prometheus/beorn7/storage3
storage: Guard against a corner case of data corruption
2017-04-07 00:52:28 +02:00
Björn Rabenstein f0076aca01 Merge pull request #2595 from prometheus/beorn7/storage4
storage: Guard against appending to evicted chunk
2017-04-07 00:51:53 +02:00
Tom Wilkie e5d7bbfc3c Remote writes: retry on recoverable errors. (#2552)
* Remote writes: retry on recoverable errors.

* Add comments

* Review feedback

* Comments

* Review feedback

* Final spelling misteak (I hope).  Plus, record failed samples correctly.
2017-04-07 00:15:41 +02:00
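The idea in e5d7bbfc3c, retrying a send only when the error is recoverable, can be sketched as follows. This is a generic Python illustration, not the remote-write code from the commit. `RecoverableError`, `send_with_retry`, and `max_retries` are hypothetical names, and the sketch leaves out any delay between attempts.

```python
from typing import Callable, Sequence


class RecoverableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a server-side error)."""


def send_with_retry(send: Callable[[Sequence[float]], None],
                    samples: Sequence[float],
                    max_retries: int = 3) -> bool:
    """Try to send samples; return True on success, False if they failed.

    Recoverable errors are retried up to max_retries times. Any other
    error is treated as permanent and ends the attempt immediately.
    """
    for _ in range(max_retries + 1):
        try:
            send(samples)
            return True
        except RecoverableError:
            continue  # try again
        except Exception:
            return False  # not recoverable; caller records samples as failed
    return False  # retries exhausted; caller records samples as failed
```

A caller would count the batch as failed only when the function returns `False`, that is, after a permanent error or once the retries are used up.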
Richard Kiene ec692f6161 Add triton zone brand metadata 2017-04-06 21:35:42 +00:00
beorn7 7199a9d9d4 storage: Guard against appending to evicted chunk
Fixes #2480. For a certain definition of "fixes".

This is something that should never happen. Sadly, it does happen,
albeit extremely rarely. This could be some weird corner case we
haven't covered yet. Or it happens as a consequence of data
corruption or a crash recovery gone bad.

This is not a "real" fix as we don't know the root cause of the
incident reported in #2480. However, this makes sure the server does
not crash, but deals gracefully with the problem: The series in
question is quarantined, which even makes it available for forensics.
2017-04-06 20:02:52 +02:00
beorn7 3d12906286 storage: Guard against a corner case of data corruption
Fixes #2475.
2017-04-06 19:50:32 +02:00
beorn7 4fcc73a04c storage: Recover from corrupted indices for archived series
An unopenable archived_fingerprint_to_timerange is simply deleted and
will be rebuilt during crash recovery (which can then take quite some time).

An unopenable archived_fingerprint_to_metric is not deleted but
instructions to the user are logged. A deletion has to be done by the
user explicitly as it means losing all archived series (and a repair
with a 3rd party tool might still be possible).
2017-04-06 19:26:39 +02:00
Julius Volz 9775ad4754 Merge pull request #2588 from prometheus/read-multi
Separate out remote read responses.
2017-04-06 17:10:31 +02:00
Conor Broderick c72692fd75 Fixed issue of partially hidden y-axis values on graph (#2589) 2017-04-06 16:04:44 +01:00
Brian Brazil c813c824d4 Separate out remote read responses.
Fixes #2574
2017-04-06 15:49:47 +01:00
Björn Rabenstein 516a96d9a3 Merge pull request #2587 from prometheus/beorn7/storage2
storage: Mark storage as dirty if indexing fails
2017-04-06 16:42:06 +02:00