prometheus

mirror of https://github.com/prometheus/prometheus.git synced 2024-11-13 09:04:06 -08:00

Author	SHA1	Message	Date
Krasi Georgiev	5fec98d0a7	simplify server error handling (#4006 )	2018-03-25 10:05:59 +01:00
Corentin Chary	60dafd425c	consul: improve consul service discovery (#3814 )

* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services` allows filtering by node-meta, and returns a `map[string]string` or `service`->`tags`). Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important because on a large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale reads can be costly, even more when you are doing them to a remote datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the service endpoint. This is needed because of that kind of behavior from consul: https://github.com/hashicorp/consul/issues/3712 and because a catalog on a large cluster would basically change all the time. No need to discover targets in 1sec if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:
  - job_name: prometheus
    scrape_interval: 60s
    static_configs:
      - targets: ["127.0.0.1:9090"]

  - job_name: "observability-by-tag"
    scrape_interval: "60s"
    metrics_path: "/metrics"
    consul_sd_configs:
      - server: consul.service.par.consul.prod.crto.in:8500
        tag: marathon-user-observability  # Used in After
        refresh_interval: 30s             # Used in After+delay
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: ^(.*,)?marathon-user-observability(,.*)?$
        action: keep

  - job_name: "observability-by-name"
    scrape_interval: "60s"
    metrics_path: "/metrics"
    consul_sd_configs:
      - server: consul.service.par.consul.prod.crto.in:8500
        services:
          - observability-cerebro
          - observability-portal-web

  - job_name: "fake-fake-fake"
    scrape_interval: "15s"
    metrics_path: "/metrics"
    consul_sd_configs:
      - server: consul.service.par.consul.prod.crto.in:8500
        services:
          - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|1236|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|0.03|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|80|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses 100kiB and 23kiB/s per service per job and quite a lot of CPU. It also sends an additional 1Mbps of traffic to consul. Being a little bit smarter about this reduces the overhead quite a lot. Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation: there won't be more than one update per `refresh_interval`. It now defaults to 30s (which was also the current waitTime in the consul query).

This also makes sure we don't wait another 30s if we already waited 29s in the blocking call, by subtracting the number of elapsed seconds.

Hopefully this will do what people expect it does and will be safer for existing consul infrastructures.	2018-03-23 14:48:43 +00:00
Ben Kochie	0d9fe18f5e	Fix nil context staticcheck error.	2018-03-22 07:59:39 +00:00
Ben Kochie	0f37c02343	Update vendor golang.org/x/... Update vendor golang.org/x/sys/unix Update vendor golang.org/x/net/...	2018-03-22 07:59:39 +00:00
Ben Kochie	2b02fcb0cb	Update vendor github.com/miekg/dns@v1.0.4 Update vendor `github.com/miekg/dns` to `v1.0.4` release. * Add dependent vendor `golang.org/x/crypto/ed25519`. * Add dependent vendor `golang.org/x/crypto/ed25519/internal/edwards25519`. * Add dependent vendor `golang.org/x/net/bpf`. * Add dependent vendor `golang.org/x/net/internal/iana`. * Add dependent vendor `golang.org/x/net/internal/socket`. * Add dependent vendor `golang.org/x/net/ipv4`. * Add dependent vendor `golang.org/x/net/ipv6`.	2018-03-22 07:59:39 +00:00
Marek Siarkowicz	bb86c3f62b	Report internal runtime information on status page (#3921 ) Add information about tsdb, wal and config reload	2018-03-21 16:08:37 +00:00
Aaron Kirkbride	c47fbcb626	Fix moved fsnotify dependency (#3995 )	2018-03-21 15:46:31 +00:00
Brian Brazil	cc39021b2b	Provide custom marshalling for Point Point has a non-standard marshalling, and is also where the vast majority of CPU time is spent so it is worth optimising.	2018-03-21 15:02:01 +00:00
Brian Brazil	f35fca1c3f	Vendor github.com/json-iterator/go	2018-03-21 15:02:01 +00:00
Brian Brazil	299b78a887	Switch to json-iterator for v1 api. This makes queries ~15% faster and cuts cpu time spent on json encoding by ~40%.	2018-03-21 15:02:01 +00:00
Brian Brazil	8ede14b24c	Add unittests for Point json output	2018-03-21 15:02:01 +00:00
Brian Brazil	ecd0a9c6ba	web: Add benchmark for respond()	2018-03-21 15:02:01 +00:00
Anton Tereshchenkov	4cb8f6c260	web: remove unused MetricsPath option (#3964 )	2018-03-21 09:29:40 +00:00
ferhat elmas	ec8e4d8a7c	all: remove unnecessary type conversions (#3992 ) except promql, to avoid creating a conflict with #3966.	2018-03-21 09:25:22 +00:00
Simon Pasquier	83325c8d82	web: replace deprecated InstrumentHandler() (#3862 ) * web: replace deprecated InstrumentHandler() This change replaces the deprecated InstrumentHandler function by the equivalent functions from the promhttp package. The following metrics are removed: * http_request_duration_microseconds (Summary). * http_request_size_bytes (Summary). * http_requests_total (Counter). And the following metrics are added instead: * prometheus_http_request_duration_seconds (Histogram). * prometheus_http_response_size_bytes (Histogram). * promhttp_metric_handler_requests_in_flight (Gauge). * promhttp_metric_handler_requests_total (Counter). * Update github.com/prometheus/common/route package * web: refactor using the new prometheus/common/route package	2018-03-21 08:16:16 +00:00
James Turnbull	ba5273a0ab	Minor edits to help text (#3990 )	2018-03-20 16:54:36 +00:00
Simon Pasquier	e1fd96db25	cmd: fix help text (#3989 )	2018-03-20 15:58:19 +00:00
Warren Fernandes	d49a3df55b	Parser test cleanup (#3977 ) * parser test cleanup - Test against the exported package functions instead of the private functions. * Improves readability of TestParseSeries - Moves package function closer to parser function	2018-03-20 14:30:52 +00:00
Jeeyoung Kim	5b962c5748	Revert "Feature: Allow getting credentials via EC2 role (#3343 )" (#3985 ) This reverts commit `808f79f00a`.	2018-03-20 12:34:54 +00:00
Warren Fernandes	58e2a31db8	Cleans up test by removing unused function (#3969 )	2018-03-15 08:59:19 +00:00
zjwzte	b7a37a1604	Fix magic number.	2018-03-15 10:15:35 +08:00
Fabian Reinartz	e87c6c8b28	Merge pull request #3963 from mz-techops/fix-query-err-scope promql: propagate storage errors	2018-03-14 11:04:02 -04:00
Anton Tereshchenkov	18bbec050c	promql: propagate storage errors	2018-03-14 15:19:22 +01:00
Fabian Reinartz	bc6058c812	Merge pull request #3952 from prometheus/cut221 *: cut 2.2.1	2018-03-14 10:12:35 -04:00
Fabian Reinartz	f22e5dce1a	*: cut 2.2.1	2018-03-14 10:02:06 -04:00
Fabian Reinartz	a947750dd6	vendor: update tsdb	2018-03-14 10:01:44 -04:00
Fabian Reinartz	0847a605a7	Merge pull request #3959 from prometheus/22-pick-ring Cherrypick #3942 onto release 2.2	2018-03-14 07:02:37 -04:00
Brian Brazil	a8e3d0fc4b	Correctly handle pruning wraparound after ring expansion (#3942 ) Fixes #3939	2018-03-14 08:25:53 +00:00
Fabian Reinartz	fcb8e9ac95	Merge pull request #3951 from prometheus/tsdbup3 vendor: update prometheus/tsdb	2018-03-13 21:47:55 +01:00
Fabian Reinartz	5fb1e27b43	vendor: update prometheus/tsdb	2018-03-13 16:24:37 -04:00
Tom Wilkie	d8cfd8f108	Merge pull request #3950 from prometheus/cherrypick-3941 Cherrypick #3941 "Correctly stop the timer used in the remote write path."	2018-03-13 13:38:15 +00:00
Tom Wilkie	597c17d3e9	Fix nit.	2018-03-13 09:30:51 +00:00
Tom Wilkie	731259afd0	Test sample timeout delivery.	2018-03-13 09:30:50 +00:00
Tom Wilkie	fdb574b608	Review feedback.	2018-03-13 09:30:50 +00:00
Tom Wilkie	97a5fc8cbb	Correctly stop the timer used in the remote write path.	2018-03-13 09:30:50 +00:00
Tom Wilkie	02a154ced6	Merge pull request #3941 from prometheus/3809-correctly-stop-timer Correctly stop the timer used in the remote write path.	2018-03-13 09:05:52 +00:00
Tom Wilkie	dc860e7d0e	Fix nit.	2018-03-12 16:48:51 +00:00
Tom Wilkie	390b018c90	Test sample timeout delivery.	2018-03-12 15:35:43 +00:00
Tom Wilkie	22d820ef8e	Review feedback.	2018-03-12 14:27:48 +00:00
Brian Brazil	a8c22c85cc	Correctly handle pruning wraparound after ring expansion (#3942 ) Fixes #3939	2018-03-12 13:16:59 +00:00
Paul Gier	85a3c974b7	minor yaml indentation consistency fix in example configs (#3946 )	2018-03-11 23:06:13 +00:00
James Turnbull	4486ef013b	Make show annotations checkbox match query history checkbox (#3936 ) After removing the checkbox in #3913 the only remaining element that looked like it was the new Show Annotations checkbox on the Alerts page, which in turn didn't look like the Enable query history checkbox on the graph page. So: 1. This takes the Enable query history button as canonical. 2. Updates the show annotations button code to match it. 3. Simplifies the JS for the checkbox.	2018-03-09 14:39:28 +01:00
James Turnbull	50e6aff3fd	Make job heading on service discovery consistent (#3937 ) The new Service Discovery page uses the CSS/JS from the Targets page but used slightly differently. This makes the job header match in the Service Discovery page for a more consistent look-n-feel.	2018-03-09 14:33:53 +01:00
Tom Wilkie	f8c9d375b6	Correctly stop the timer used in the remote write path.	2018-03-09 12:00:26 +00:00
Matt Palmer	042090a6d3	[dns_sd] Send an EDNS0 query by default (#3586 ) Based on https://groups.google.com/d/topic/prometheus-users/02kezHbuea4/discussion Does not attempt to handle a situation where the server does not understand EDNS0, however that is an unlikely case, and the behaviour of such ancient systems is hard to predict in advance, so if it does come up, it will need to be handled on a case-by-case basis.	2018-03-09 10:21:58 +00:00
James Turnbull	c3f4f2204f	Refactor/redesign Unhealthy checkbox on Targets page (#3913 ) * Added only healthy to Targets This adds an "Only healthy" button to supplement the "Only unhealthy" button. The two are mutually exclusive. I've also added a red/green text color to the buttons. Arguably this could be a toggle instead if folks think this is worthwhile... Happy to modify it. * Moved functions above init * Simplified code and made prettier * Appeased Codacy * Made buttons square	2018-03-09 11:19:09 +01:00
Yecheng Fu	56ed29fbf7	Map target infos of endpoints to prometheus meta labels. (#3770 )	2018-03-09 10:07:00 +00:00
Brian Brazil	bf7d87aed2	Cleanup storage from all tests. Fixed #3299	2018-03-09 07:53:35 +00:00
Brian Brazil	c0ce35d2d3	Only show debug output on test failure	2018-03-09 07:53:35 +00:00
Brian Brazil	e6ea146c81	Make benchmark tests pass A new query object is needed for each evaluation, as the iterators would otherwise be shared across evaluations.	2018-03-09 07:53:35 +00:00


5374 commits