prometheus

mirror of https://github.com/prometheus/prometheus.git synced 2025-03-05 20:59:13 -08:00

Author	SHA1	Message	Date
beorn7	07a294ac15	Doc comment fixes	2016-04-26 01:05:56 +02:00
beorn7	20cba1ed8f	Initialize metric vectors in memorySeriesStorage	2016-04-25 17:08:07 +02:00
beorn7	d566808d40	Bring back logging of discarded samples But only on DEBUG level. Also, count and report the two cases of out-of-order timestamps on the one hand and same timestamp but different value on the other hand separately.	2016-04-25 16:43:52 +02:00
beorn7	db16acd7fb	Never drop a still open head chunk.	2016-04-15 19:18:40 +02:00
beorn7	a90d645378	Checkpoint fingerprint mappings only upon shutdown Before, we checkpointed after every newly detected fingerprint collision, which is not a problem as long as collisions are rare. However, with a sufficient number of metrics or particular nature of the data set, there might be a lot of collisions, all to be detected upon the first set of scrapes, and then the checkpointing after each detection will take a quite long time (it's O(n²), essentially). Since we are rebuilding the fingerprint mapping during crash recovery, the previous, very conservative approach didn't even buy us anything. We only ever read from the checkpoint file after a clean shutdown, so the only time we need to write the checkpoint file is during a clean shutdown.	2016-04-15 01:03:28 +02:00
Jonathan Boulle	38098f8c95	Add missing license headers Prometheus is Apache 2 licensed, and most source files have the appropriate copyright license header, but some were missing it without apparent reason. Correct that by adding it.	2016-04-13 16:08:22 +02:00
Fabian Reinartz	a18639dc2d	Merge pull request #1454 from prometheus/beorn7/fix-test Give TestEvictAndLoadChunkDescs more time to actually evict	2016-04-08 14:58:01 +02:00
beorn7	d09ca03e10	Work around compiler bug Benchmarks don't show any significant changes.	2016-03-29 17:05:28 +02:00
beorn7	865d16f870	Rename Gorilla into varbit	2016-03-23 16:30:41 +01:00
beorn7	4b574e8a61	Switch chunk encoding to type 2 where it was hardcoded type 1 before The chunk encoding was hardcoded there because it mostly doesn't matter what encoding is chosen in that test. Since type 1 is battle-hardened enough, I'm switching to type 2 here so that we can catch unexpected problems as a byproduct. My expectation is that the chunk encoding doesn't matter anyway, as said, but then "unexpected problems" contains the word "unexpected".	2016-03-20 23:32:20 +01:00
beorn7	c72979e3ed	Remove a redundancy from Gorilla-style chunks So far, the last sample in a chunk was saved twice. That's required for adding more samples as we need to know the last sample added to add more samples without iterating through the whole chunk. However, once the last sample was added to the chunk before it's full, there is no need to save it twice. Thus, the very last sample added to a chunk can _only_ be saved in the header fields for the last sample. The chunk has to be identifiable as closed, then. This information has been added to the flags byte.	2016-03-20 23:09:48 +01:00
beorn7	b6dbb826ae	Improve fuzz testing and fix a bug exposed This improves fuzz testing in two ways: (1) More realistic time stamps. So far, the most common case in practice was very rare in the test: Completely regular increases of the timestamp. (2) Verify samples by scanning through the whole relevant section of the series. For Gorilla-like chunks, this showed two things: (1) With more regularly increasing time stamps, BenchmarkFuzz is essentially as fast as with the traditional chunks: ``` BenchmarkFuzzChunkType0-8 2 972514684 ns/op 83426196 B/op 2500044 allocs/op BenchmarkFuzzChunkType1-8 2 971478001 ns/op 82874660 B/op 2512364 allocs/op BenchmarkFuzzChunkType2-8 2 999339453 ns/op 76670636 B/op 2366116 allocs/op ``` (2) There was a bug related to when and how the chunk footer is overwritten to make use for the last sample. This wasn't exposed by random access as the last sample of a chunk is retrieved from the values in the header in that case.	2016-03-20 17:21:28 +01:00
beorn7	9d8fbbe822	Review improvements	2016-03-17 17:31:56 +01:00
beorn7	8cdced3850	Implement Gorilla-inspired chunk encoding This is not a verbatim implementation of the Gorilla encoding. First of all, it could not, even if we wanted, because Prometheus has a different chunking model (constant size, not constant time). Second, this adds a number of changes that improve the encoding in general or at least for the specific use case of Prometheus (and are partially only possible in the context of Prometheus). See comments in the code for details.	2016-03-17 14:47:08 +01:00
beorn7	8e64e8dfca	Fix return statement.	2016-03-17 14:43:00 +01:00
Björn Rabenstein	98c8560851	Merge pull request #1477 from prometheus/beorn7/storage7 Solve the series churn problem...	2016-03-17 14:39:28 +01:00
beorn7	e7ac9c6863	Improvements based on review - Moved returns into the default section of switch statement that can only happen then. - Fix typo.	2016-03-17 14:37:24 +01:00
beorn7	199f309a39	Resurrect and rename invalid preload requests count metric. It is now also used in label matching, so the name of the metric changed from `prometheus_local_storage_invalid_preload_requests_total` to `non_existent_series_matches_total`.	2016-03-13 11:54:24 +01:00
beorn7	e8c1f30ab2	Merge the parallel logic of getSeriesForRange and metricForFingerprint	2016-03-09 21:56:15 +01:00
beorn7	9445c7053d	Add tests for range-limited label matching While doing so, improve getSeriesForRange.	2016-03-09 21:01:03 +01:00
beorn7	47e3c90f9b	Clean up error propagation Only return an error where callers are doing something with it except simply logging and ignoring. All the errors touched in this commit flag the storage as dirty anyway, and that fact is logged anyway. So most of what is being removed here is just log spam. As discussed earlier, the class of errors that flags the storage as dirty signals fundamental corruption, so even bubbling up a one-time warning to the user (e.g. about incomplete results) isn't helping much because _anything_ happening in the storage has to be doubted from that point on (and in fact retroactively into the past, too). Flagging the storage dirty, and alerting on it (plus marking the state in the web UI) is the only way I can see right now. As a byproduct, I cleaned up the setDirty method a bit and improved the logged errors.	2016-03-09 18:56:30 +01:00
beorn7	99854a84d7	Merge branch 'beorn7/storage6' into beorn7/storage7	2016-03-09 17:23:25 +01:00
beorn7	5e4fa96719	Merge branch 'beorn7/storage5' into beorn7/storage6	2016-03-09 17:21:32 +01:00
beorn7	b343e65907	Merge branch 'beorn7/storage4' into beorn7/storage5 erge is necessary,	2016-03-09 17:14:42 +01:00
beorn7	d0a4477446	Merge branch 'beorn7/storage3' into beorn7/storage4 Conflicts: storage/local/preload.go storage/local/storage.go storage/local/storage_test.go	2016-03-09 17:13:16 +01:00
beorn7	55eddab25f	Merge branch 'beorn7/storage2' into beorn7/storage3	2016-03-09 16:48:46 +01:00
beorn7	161eada3ad	Make chunkIterator even leaner.	2016-03-09 16:20:39 +01:00
beorn7	beb36df4bb	De-flag preloadChunksForRange Now there is preloadChunksForRange and preloadChunksForInstant in both the series and the storage.	2016-03-09 14:50:09 +01:00
beorn7	836f1db04c	Improve MetricsForLabelMatchers WIP: This needs more tests. It now gets a from and through value, which it may opportunistically use to optimize the retrieval. With possible future range indices, this could be used in a very efficient way. This change merely applies some easy checks, which should nevertheless solve the use case of heavy rule evaluations on servers with a lot of series churn. The idea is the following: - Only archive series that are at least as old as the headChunkTimeout (which was already extremely unlikely to happen). - Then maintain a high watermark for the last archival, i.e. no archived series has a sample more recent than that watermark. - Any query that doesn't reach to a time before that watermark doesn't have to touch the archive index at all. (A production server at Soundcloud with the aforementioned series churn and heavy rule evaluations spends 50% of its CPU time in archive index lookups. Since rule evaluations usually only touch very recent values, most of those lookups should disappear with this change.) - Federation with a very broad label matcher will profit from this, too. As a byproduct, the un-needed MetricForFingerprint method was removed from the Storage interface.	2016-03-09 00:25:59 +01:00
beorn7	167b83695c	Merge branch 'beorn7/storage5' into beorn7/storage6	2016-03-08 00:20:44 +01:00
beorn7	01795382c9	Merge branch 'beorn7/storage4' into beorn7/storage5	2016-03-08 00:20:13 +01:00
beorn7	f7fc542db6	Merge branch 'master' into beorn7/storage4 Conflicts: storage/local/persistence.go	2016-03-08 00:14:00 +01:00
beorn7	3d86130d8c	Merge branch 'master' into beorn7/storage3	2016-03-07 23:39:12 +01:00
beorn7	1f30c8de8d	Merge branch 'master' into beorn7/storage2	2016-03-07 23:38:42 +01:00
beorn7	c13b1ecfe9	Make chunk iterators more DRY This finally extracts all the common code of the two chunk iterators into one. Any future chunk encodings with fast access by index can use the same iterator by simply providing an indexAccessor. Other future chunk encodings without fast index access (like Gorilla-style) can still implement the chunkIterator interface as usual.	2016-03-07 20:23:14 +01:00
beorn7	32f280a3cd	Slim down the chunkIterator interface For one, remove unneeded methods. Then, instead of using a channel for all values, use a bufio.Scanner-like interface. This removes the need for creating a goroutine and avoids the (unnecessary) locking performed by channel sending and receiving. This will make it much easier to write new chunk implementations (like Gorilla-style encoding).	2016-03-07 19:50:13 +01:00
beorn7	b6fdb355d7	Move dump-heads into its own tool	2016-03-07 16:30:19 +01:00
beorn7	f193f2b8ef	Add a command to promtool that dumps metadata of heads.db I needed this today for debugging. It can certainly be improved, but it's already quite helpful. I refactored the reading of heads.db files out of persistence, which is an improvement, too. I made minor changes to the cli package to allow outputting via the io.Writer interface.	2016-03-07 16:21:57 +01:00
beorn7	75a6b460ef	Give TestEvictAndLoadChunkDescs more time to actually evict Obviously, it's really bad to depend on timing here. The proper fix would be to have something like WaitForIndexing for other things to wait for, too. For now, let's see if the wait time increase fixes the issue.	2016-03-03 13:29:39 +01:00
beorn7	fc7de5374a	Quarantine series upon problem writing to the series file This fixes https://github.com/prometheus/prometheus/issues/1059 , but not in the obvious way (simply not updating the persist watermark, because that's actually not that simple - we don't really know what has gone wrong exactly). As any errors relevant here are most likely caused by severe and unrecoverable problems with the series file, using the new quarantine feature is the right step. We don't really have to be worried about any inconsistent state of the series because it will be removed for good ASAP. Another plus is that we don't have to declare the whole storage dirty anymore.	2016-03-03 13:15:02 +01:00
beorn7	0ea5801e47	Handle errors caused by data corruption more gracefully This requires all the panic calls upon unexpected data to be converted into errors returned. This pollutes the function signatures quite a lot. Well, this is Go... The ideas behind this are the following: - panic only if it's a programming error. Data corruptions happen, and they are not programming errors. - If we detect a data corruption, we "quarantine" the series, essentially removing it from the database and putting its data into a separate directory for forensics. - Failure during writing to a series file is not considered corruption automatically. It will call setDirty, though, so that a crash recovery upon the next restart will commence and check for that. - Series quarantining and setDirty calls are logged and counted in metrics, but are hidden from the user of the interfaces in interface.go, with the notable exception of Append(). The reasoning is that we treat corruption by removing the corrupted series, i.e. a query for it will return no results on its next call anyway, so return no results right now. In the case of Append(), we want to tell the user that no data has been appended, though. Minor side effects: - Now consistently using filepath.* instead of path.*. - Introduced structured logging where I touched it. This makes things less consistent, but a complete change to structured logging would be out of scope for this PR.	2016-03-02 23:02:34 +01:00
beorn7	b6840997a7	Merge branch 'beorn7/storage2' into beorn7/storage3	2016-03-02 16:11:25 +01:00
beorn7	ce58fd357b	Merge branch 'beorn7/storage' into beorn7/storage2 Conflicts: storage/local/chunk.go storage/local/interface.go	2016-03-02 16:09:32 +01:00
beorn7	2581648f70	Separate iterators by offset Add test that exposes the problem.	2016-03-02 16:01:03 +01:00
beorn7	c740789ce3	Improve predict_linear Fixes https://github.com/prometheus/prometheus/issues/1401 This removes the last (and in fact bogus) use of BoundaryValues. Thus, a whole lot of unused (and arguably sub-optimal / ugly) code can be removed here, too.	2016-02-25 12:10:55 +01:00
beorn7	4b503ed9a5	Merge branch 'master' into beorn7/storage2	2016-02-24 14:03:49 +01:00
beorn7	059295332f	Merge remote-tracking branch 'origin/master' into beorn7/storage	2016-02-24 14:02:27 +01:00
beorn7	53005c3085	Merge branch 'beorn7/storage' into beorn7/storage2	2016-02-24 14:00:56 +01:00
beorn7	28e9bbc15f	Populate chunkDesc.chunkLastTime during checkpoint loading, too	2016-02-24 13:58:34 +01:00
Björn Rabenstein	a8c79f0a0c	Merge pull request #1422 from prometheus/release-0.17 Merge more commits from 0.17.	2016-02-23 23:07:44 +01:00
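
Commit 32f280a3cd above replaces a channel-based chunkIterator with a bufio.Scanner-like interface, which avoids a goroutine and channel locking. The following is a minimal sketch of that general pattern, not the code from the commit: the names `sample`, `sampleIterator`, `sliceIterator`, and `collect` are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// sample is a hypothetical timestamp/value pair (not the Prometheus type).
type sample struct {
	Timestamp int64
	Value     float64
}

// sampleIterator is a hypothetical bufio.Scanner-like iterator: Scan
// advances to the next sample, Sample returns the current one, and Err
// reports any error encountered while scanning.
type sampleIterator interface {
	Scan() bool
	Sample() sample
	Err() error
}

// sliceIterator iterates over an in-memory slice without using a
// goroutine or a channel.
type sliceIterator struct {
	samples []sample
	pos     int    // index of the next sample to hand out
	cur     sample // sample made current by the last successful Scan
}

func newSliceIterator(s []sample) *sliceIterator {
	return &sliceIterator{samples: s}
}

// Scan moves to the next sample and reports whether one was available.
func (it *sliceIterator) Scan() bool {
	if it.pos >= len(it.samples) {
		return false
	}
	it.cur = it.samples[it.pos]
	it.pos++
	return true
}

// Sample returns the sample made current by the last successful Scan.
func (it *sliceIterator) Sample() sample { return it.cur }

// Err always returns nil here because an in-memory slice cannot fail.
func (it *sliceIterator) Err() error { return nil }

// collect drains an iterator using the usual Scanner-style loop.
func collect(it sampleIterator) ([]sample, error) {
	var out []sample
	for it.Scan() {
		out = append(out, it.Sample())
	}
	return out, it.Err()
}

func main() {
	it := newSliceIterator([]sample{{1000, 0.5}, {2000, 1.5}})
	got, err := collect(it)
	fmt.Println(len(got), err)
}
```

The caller drives the iteration in a plain loop and checks `Err` once at the end, the same shape as reading lines with `bufio.Scanner`.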

274 commits