prometheus

mirror of https://github.com/prometheus/prometheus.git synced 2025-03-05 20:59:13 -08:00

Author	SHA1	Message	Date
Fabian Reinartz	022714b60a	Merge pull request #2341 from mattbostock/patch-1 Correct notifications_dropped description	2017-01-16 09:23:46 +01:00
Matt Bostock	4160892109	Correct notifications_dropped description The current description does not accurately describe when the metric is incremented. Aside from Alertmanager missing from the configuration, `prometheus_notifications_dropped_total` is incremented when errors occur while sending alert notifications to Alertmanager, or because the notifications queue is full, or because the number of notifications to be sent exceeds the queue capacity. I think calling these cases 'errors' in a generic sense is more useful than the current description.	2017-01-13 23:36:00 +00:00
Brian Brazil	f64c231dad	Allow checkpoints and maintenance to happen concurrently. (#2321) This is essential on larger Prometheus servers, as otherwise checkpoints prevent sufficient persisting of chunks to disk.	2017-01-13 17:24:19 +00:00
Brian Brazil	1dcb7637f5	Add various persistence related metrics (#2333 ) Add metrics around checkpointing and persistence * Add a metric to say if checkpointing is happening, and another to track total checkpoint time and count. This breaks the existing prometheus_local_storage_checkpoint_duration_seconds by renaming it to prometheus_local_storage_checkpoint_last_duration_seconds as the former name is more appropriate for a summary. * Add metric for last checkpoint size. * Add metric for series/chunks processed by checkpoints. For long checkpoints it'd be useful to see how they're progressing. * Add metric for dirty series * Add metric for number of chunks persisted per series. You can get the number of chunks from chunk_ops, but not the matching number of series. This helps determine the size of the writes being made. * Add metric for chunks queued for persistence Chunks created includes both chunks that'll need persistence and chunks read in for queries. This only includes chunks created for persistence. * Code review comments on new persistence metrics.	2017-01-11 15:11:19 +00:00
Björn Rabenstein	6ce97837ab	Merge pull request #2327 from prometheus/beorn7/vendoring vendoring: Update prometheus/common to pull in bug fixes	2017-01-09 13:28:36 +01:00
beorn7	86ec87b78f	vendoring: Update prometheus/common to pull in bug fixes In particular the one for https://github.com/prometheus/common/issues/72.	2017-01-09 12:25:17 +01:00
Fabian Reinartz	3302bb1eb1	Merge pull request #2323 from prometheus/beorn7/retrieval Retrieval: Avoid copying Target	2017-01-08 06:49:47 +01:00
Björn Rabenstein	ad40d0abbc	Merge pull request #2288 from prometheus/limit-scrape Add ability to limit scrape samples, and related metrics	2017-01-08 01:34:06 +01:00
beorn7	5dc01202d7	Retrieval: Remove some test lines that fail on Travis only These lines exercise an append in TestScrapeLoopWrapSampleAppender. Arguably, append shouldn't be tested there in the first place. Still it's weird why this fails on Travis: ``` --- FAIL: TestScrapeLoopWrapSampleAppender (0.00s) scrape_test.go:259: Expected count of 1, got 0 scrape_test.go:290: Expected count of 1, got 0 2017/01/07 22:48:26 http: TLS handshake error from 127.0.0.1:50716: read tcp 127.0.0.1:40265->127.0.0.1:50716: read: connection reset by peer FAIL FAIL github.com/prometheus/prometheus/retrieval 3.603s ``` Should anybody ever find out why, please revert this commit accordingly.	2017-01-08 00:01:46 +01:00
beorn7	3610331eeb	Retrieval: Do not buffer the samples if no sample limit configured Also, simplify and streamline the code a bit.	2017-01-07 18:18:54 +01:00
André Carvalho	c43dfaba1c	Add max concurrent and current queries engine metrics (#2326) * Add max concurrent and current queries engine metrics This commit adds two metrics to the promql/engine: the number of max concurrent queries, as configured by the flag, and the number of current queries being served+blocked in the engine.	2017-01-07 14:41:25 +00:00
beorn7	767c0709b1	Retrieval: Avoid copying Target retrieval.Target contains a mutex. It was copied in the Targets() call. This potentially can wreak a lot of havoc. It might even have caused the issues reported as #2266 and #2262.	2017-01-06 18:43:41 +01:00
Brian Brazil	f9e581907a	Make index queue bigger. (#2322) When a large Prometheus starts up fresh it can take many minutes to warmup and clear out the index queue. A larger queue means less blocking, bigger batches and cuts down startup time by ~50%.	2017-01-05 17:57:42 +00:00
Fabian Reinartz	c9f4aea8e2	Merge pull request #2305 from alicebob/favicon Add a favicon to the web GUI	2017-01-04 10:15:27 +01:00
Martin Lehmann	78fae3155f	Make relative links in README.md absolute (#2316) The relative links don't work in other pages that render the README (for example https://hub.docker.com/r/prom/prometheus/). As they are (hopefully) not due to change any time soon, I think using absolute links is better.	2017-01-03 20:07:33 +00:00
Julius Volz	90dd216646	Merge pull request #2306 from EdSchouten/sorted-alerts Use lexicographic order to sort alerts by name.	2016-12-31 13:12:30 +01:00
Mitsuhiro Tanda	7e369b9318	expose max memory chunks metrics (#2303) * expose max memory chunks metrics	2016-12-27 18:34:07 +00:00
Ed Schouten	b3a39ccd8a	Use lexicographic order to sort alerts by name. Right now the /alerts page of Prometheus sorts alerts by severity (firing, pending, inactive). Once multiple alerts have the same severity, their order seems to correlate to how they are placed in the configuration files, but not always. Looking at the code, we make use of sort.Sort(), which is documented not to provide a stable sort. The Less() function also only takes the alert state into account. This change extends the Less() function to provide a lexicographic order on both the alert state and the name. This means I can finally find the alerts I'm looking for without using my browser's search feature.	2016-12-27 14:28:44 +01:00
Harmen	135d32ea22	make assets	2016-12-27 13:59:20 +01:00
Harmen	dfa4f79bcd	add favicon	2016-12-27 13:58:51 +01:00
Brian Brazil	93b70ee4ea	Evict chunk descs of all unloaded chunks during maintenance. (#2297 ) Keeping these around has two problems: 1) Each desc takes 64 bytes, 10 of them is 640B. This is a lot of overhead on a 1024 byte chunk. 2) It can take well over a week to reach a point where this and thus Prometheus memory usage as a whole enters steady state. This makes RAM estimation very hard for users, and makes it difficult to investigate things like memory fragmentation. Instead we'll wipe them during each memory series maintenance cycle, and if a query pulls them in they'll hang around as cache until the next cycle.	2016-12-22 13:49:03 +00:00
Brian Brazil	bed4635802	Use irate consistently in console template examples. (#2296) I must have forgotten my 'g' when switching these.	2016-12-21 13:19:23 +00:00
Fabian Reinartz	d6d03a966f	Merge pull request #2295 from prometheus/fast-path-remote Don't clone the metric if there's no remote writes.	2016-12-21 12:36:41 +01:00
Brian Brazil	1b8a474612	Don't clone the metric if there's no remote writes. The metric clone can't be further optimised, and is a non-trivial memory allocation cost so fast path it if there's no remote writes configured.	2016-12-21 11:34:48 +00:00
Brian Brazil	6c07453ec1	Only clone the metric in the one place relabelling needs it. (#2292) This cuts ~17% off memory allocations related to ingesting data in a basic setup.	2016-12-21 10:00:33 +00:00
Brian Brazil	2e3b42ad6c	Correctly handle the end time being 0 in the URL. (#2290)	2016-12-18 19:30:52 +00:00
Brian Brazil	f421ce0636	Remove label from prometheus_target_skipped_scrapes_total (#2289) This avoids it not being initialised, and breaking out by interval wasn't particularly useful. Fixes #2269	2016-12-16 18:00:52 +00:00
Brian Brazil	30448286c7	Add sample_limit to scrape config. This imposes a hard limit on the number of samples ingested from the target. This is counted after metric relabelling, to allow dropping of problematic metrics. This is intended as a very blunt tool to prevent overload due to misbehaving targets that suddenly jump in sample count (e.g. adding a label containing email addresses). Add metric to track how often this happens. Fixes #2137	2016-12-16 15:10:09 +00:00
Björn Rabenstein	f3f798fbcf	Merge pull request #2283 from tcolgate/ignoredots ignore dotfiles in data directory	2016-12-15 13:32:03 +01:00
Tristan Colgate	30be8e0b8a	ignore dotfiles in data directory	2016-12-15 11:48:23 +00:00
Tristan Colgate-McFarlane	4d9134e6d8	Add labeldrop and labelkeep actions. (#2279) Introduce two new relabel actions. labeldrop, and labelkeep. These can be used to filter the set of labels by matching regex - labeldrop: drops all labels that match the regex - labelkeep: drops all labels that do not match the regex	2016-12-14 10:17:42 +00:00
Björn Rabenstein	45570e5972	Merge pull request #2277 from prometheus/beorn7/storage2 storage: Sanity-check number of loaded chunk descs	2016-12-14 02:59:10 +01:00
beorn7	253be23c00	storage: Sanity-check number of loaded chunk descs Two cases: - An unarchived metric must have at least one chunk desc loaded upon unarchival. Otherwise, the file is gone or has size 0, which is an inconsistency (because the series is still indexed in the archive index). Hence, quarantining is triggered. - If loading the chunk descs of a series with a known chunkDescsOffset (i.e. != -1), the number of chunks loaded must be equal to chunkDescsOffset. If not, there is a data corruption. An error is returned, which leads to quarantining. In any case, there is a guard added to not access the 1st element of an empty chunkDescs slice. (That's what triggered the crashes in issue 2249.) A time series with unknown chunkDescsOffset and no chunks in memory and no chunks on disk either could trigger that case. I would assume such a "null series" doesn't exist, but it's not entirely unthinkable and unreasonable to happen (perhaps in future uses of the storage). (Create a series, and then something tries to preload chunks before the first sample is added.)	2016-12-13 23:19:39 +01:00
Björn Rabenstein	5f0c0e43cf	Merge pull request #2276 from prometheus/beorn7/storage storage: Catch data corruption that leads to division by zero	2016-12-13 23:13:39 +01:00
Björn Rabenstein	a4c8292232	Merge pull request #2278 from prometheus/beorn7/style storage: Fix linter issue	2016-12-13 23:13:05 +01:00
beorn7	837c029b16	storage: Fix linter issue Go style tries to avoid indented `else` blocks.	2016-12-13 19:05:30 +01:00
Brian Brazil	c8de1484d5	Add scrape_samples_post_metric_relabeling This reports the number of samples post any keep/drop from metric relabelling.	2016-12-13 17:32:11 +00:00
Brian Brazil	06b9df65ec	Refactor and add unittests to scrape result handling.	2016-12-13 16:49:17 +00:00
Björn Rabenstein	568fd8a8cb	Merge pull request #2155 from prometheus/beorn7/vendoring2 Update vendoring for Azure	2016-12-13 17:10:59 +01:00
beorn7	4719482f5f	storage: Make tests go-vet and golint clean	2016-12-13 17:07:27 +01:00
beorn7	485ac8dff7	storage: Verify validity of byte length when unmarshalling (double)delta chunks This makes sure a division-by-zero crash cannot happen in the Len() method. Fixes #2773	2016-12-13 17:07:27 +01:00
Brian Brazil	b5ded43594	Allow buffering of scraped samples before sending them to storage.	2016-12-13 15:01:35 +00:00
beorn7	906c3a2237	Update vendoring for Azure Also, actually record the vendored version in vendor.json.	2016-12-13 14:21:16 +01:00
tattsun	e714079cf2	storage: fix error message (#2270) * storage: add error message	2016-12-09 22:36:27 +00:00
Fabian Reinartz	9ecea36ef9	Merge pull request #2259 from prometheus/federationerr web: don't return federation errors over HTTP	2016-12-06 16:18:03 +01:00
Fabian Reinartz	cef2e04aa3	web: add error counter for federation responses	2016-12-06 16:09:50 +01:00
Fabian Reinartz	0ea0a19848	Merge pull request #2240 from agaoglu/read-timeout Set read-timeout for http.Server	2016-12-06 16:01:45 +01:00
Fabian Reinartz	9d68e81b32	web: don't return federation errors over HTTP We are writing federation responses streaming. So after the first byte we wrote, the status header is fixed. We cannot return an HTTP error for intermediate error but should just abort and log instead.	2016-12-06 15:52:50 +01:00
Erdem Agaoglu	054f8ebbfb	Increase default max-connections	2016-12-06 17:45:19 +03:00
Erdem Agaoglu	2260079c12	Vendor x/net/netutil	2016-12-06 12:52:29 +03:00
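
Commit b3a39ccd8a in the log above describes extending Less() so that alerts are ordered by state first and by name second. The following is only a minimal sketch of that idea, not the code from the commit: the `alert`, `alertState`, and `byStateAndName` names and the `sortedNames` helper are hypothetical, and the state order (firing, pending, inactive) is taken from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// alertState is a hypothetical stand-in for the alert state;
// smaller values sort first (firing, then pending, then inactive).
type alertState int

const (
	stateFiring alertState = iota
	statePending
	stateInactive
)

// alert is a hypothetical, reduced alert record.
type alert struct {
	Name  string
	State alertState
}

// byStateAndName implements sort.Interface.
type byStateAndName []alert

func (a byStateAndName) Len() int      { return len(a) }
func (a byStateAndName) Swap(i, j int) { a[i], a[j] = a[j], a[i] }

// Less compares states first and falls back to the name, so the
// result is fully determined even though sort.Sort is not stable.
func (a byStateAndName) Less(i, j int) bool {
	if a[i].State != a[j].State {
		return a[i].State < a[j].State
	}
	return a[i].Name < a[j].Name
}

// sortedNames returns the alert names in display order without
// modifying the input slice.
func sortedNames(alerts []alert) []string {
	sorted := make([]alert, len(alerts))
	copy(sorted, alerts)
	sort.Sort(byStateAndName(sorted))
	names := make([]string, len(sorted))
	for i, al := range sorted {
		names[i] = al.Name
	}
	return names
}

func main() {
	alerts := []alert{
		{Name: "DiskFull", State: stateInactive},
		{Name: "HighLatency", State: stateFiring},
		{Name: "APIDown", State: stateFiring},
		{Name: "CertExpiry", State: statePending},
	}
	fmt.Println(sortedNames(alerts))
}
```

Because the comparison never reports two distinct (state, name) pairs as equal, the lack of stability in sort.Sort no longer affects the displayed order.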


3590 commits
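
Commit 4d9134e6d8 in the log above introduces the labeldrop and labelkeep relabel actions, which filter a label set by matching label names against a regex. The sketch below illustrates those semantics only; `filterLabels` is a hypothetical helper, not a function from the Prometheus code base, and anchoring the pattern so that it must match the whole label name is an assumption made here.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterLabels is a hypothetical helper illustrating the two actions:
//   - "labeldrop" removes every label whose name matches the pattern
//   - "labelkeep" removes every label whose name does not match
//
// Any other action returns an unmodified copy of the label set.
func filterLabels(labels map[string]string, action, pattern string) map[string]string {
	// Assumption: the pattern must match the entire label name.
	re := regexp.MustCompile("^(?:" + pattern + ")$")

	out := make(map[string]string, len(labels))
	for name, value := range labels {
		matched := re.MatchString(name)
		switch action {
		case "labeldrop":
			if !matched {
				out[name] = value
			}
		case "labelkeep":
			if matched {
				out[name] = value
			}
		default:
			out[name] = value
		}
	}
	return out
}

func main() {
	labels := map[string]string{
		"job":       "api",
		"instance":  "host:9090",
		"tmp_debug": "1",
	}
	fmt.Println(len(filterLabels(labels, "labeldrop", "tmp_.*")))    // labels left after dropping tmp_*
	fmt.Println(len(filterLabels(labels, "labelkeep", "job|instance"))) // labels left after keeping job and instance
}
```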