Commit graph

3054 commits

Author SHA1 Message Date
beorn7 fc6737b7fb storage: improve index lookups
tl;dr: This is not a fundamental solution to the indexing problem
(like tindex is), but it at least avoids running into the intersection
problem as far as possible.

In more detail:

Imagine the following query:

    nicely:aggregating:rule{job="foo",env="prod"}

While it uses a nicely aggregating recording rule (which might have a
very low cardinality), Prometheus still intersects the low number of
fingerprints for `{__name__="nicely:aggregating:rule"}` with the many
thousands of fingerprints matching `{job="foo"}` and with the millions
of fingerprints matching `{env="prod"}`. This totally innocuous query
is dead slow if the Prometheus server has a lot of time series with
the `{env="prod"}` label. Ironically, if you make the query more
complicated, it becomes blazingly fast:

    nicely:aggregating:rule{job=~"foo",env=~"prod"}

Why so? Because Prometheus only intersects with non-Equal matchers if
there are no Equal matchers. That's good in this case because it
retrieves the few fingerprints for
`{__name__="nicely:aggregating:rule"}` and then goes straight on to
retrieve the metrics for those FPs and check individually whether they
match the other matchers.

This change generalizes the idea of when to stop intersecting FPs
and switch to "retrieve metrics and check them individually against
remaining matchers" mode:

- First, sort all matchers by "expected cardinality". Matchers
  matching the empty string are always worst (and never used for
  intersections). Equal matchers are in general considered best, but by
  using some crude heuristics, we declare some better than others
  (instance labels or anything that looks like a recording rule).

- Then go through the matchers until we hit a threshold of remaining
  FPs in the intersection. This threshold is higher if we are already
  in the non-Equal matcher area as intersection is even more expensive
  here.

- Once the threshold has been reached (or we have run out of matchers
  that do not match the empty string), start with "retrieve metrics
  and check them individually against remaining matchers".
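
The three steps above can be sketched roughly as follows. This is a
minimal illustration, not the code from this commit: the `series` and
`matcher` types, the `cost`, `lookup`, `keep` and `selectSeries`
helpers, the cost ranks and the two limits are all hypothetical, and
the "index lookup" is simulated by a linear scan.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// Hypothetical types for illustration only; not Prometheus's storage API.
type series map[string]string // label name -> label value

type matcher struct {
	name, value string
	isRegex     bool
}

// matchesValue reports whether the matcher accepts label value v.
func (m matcher) matchesValue(v string) bool {
	if !m.isRegex {
		return v == m.value
	}
	return regexp.MustCompile("^(?:" + m.value + ")$").MatchString(v)
}

// cost is a crude "expected cardinality" rank (assumed values); lower sorts first.
func cost(m matcher) int {
	switch {
	case m.matchesValue(""):
		return 3 // matches the empty string: never used for intersections
	case m.isRegex:
		return 2
	case m.name == "instance" || (m.name == "__name__" && strings.Contains(m.value, ":")):
		return 0 // instance label, or looks like a recording rule
	default:
		return 1
	}
}

// lookup stands in for an index lookup: the set of series IDs matching m.
func lookup(all []series, m matcher) map[int]bool {
	ids := map[int]bool{}
	for id, s := range all {
		if m.matchesValue(s[m.name]) {
			ids[id] = true
		}
	}
	return ids
}

// keep returns the IDs for which ok reports true, preserving order.
func keep(ids []int, ok func(id int) bool) []int {
	var out []int
	for _, id := range ids {
		if ok(id) {
			out = append(out, id)
		}
	}
	return out
}

// selectSeries intersects index lookups until few enough candidates remain,
// then checks the remaining matchers against each candidate individually.
func selectSeries(all []series, ms []matcher, eqLimit, reLimit int) []int {
	ms = append([]matcher(nil), ms...) // do not reorder the caller's slice
	sort.SliceStable(ms, func(i, j int) bool { return cost(ms[i]) < cost(ms[j]) })
	cand := make([]int, len(all))
	for id := range all {
		cand[id] = id
	}
	used := 0
	for ; used < len(ms) && !ms[used].matchesValue(""); used++ {
		limit := eqLimit
		if ms[used].isRegex {
			limit = reLimit // stop earlier: intersecting here is more expensive
		}
		if used > 0 && len(cand) <= limit {
			break
		}
		hits := lookup(all, ms[used])
		cand = keep(cand, func(id int) bool { return hits[id] })
	}
	for _, m := range ms[used:] {
		cand = keep(cand, func(id int) bool { return m.matchesValue(all[id][m.name]) })
	}
	return cand
}

func main() {
	all := []series{
		{"__name__": "nicely:aggregating:rule", "job": "foo", "env": "prod"},
		{"__name__": "nicely:aggregating:rule", "job": "bar", "env": "prod"},
		{"__name__": "http_requests_total", "job": "foo", "env": "prod"},
	}
	query := []matcher{{name: "env", value: "prod"}, {name: "job", value: "foo"},
		{name: "__name__", value: "nicely:aggregating:rule"}}
	fmt.Println(selectSeries(all, query, 10, 20)) // prints "[0]"
}
```

With limits of 10 and 20, only the `__name__` matcher touches the
simulated index here; `job` and `env` are checked per candidate.
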

A beefy server at SoundCloud was spending 67% of its CPU time in index
lookups (fingerprintsForLabelPairs), serving mostly a dashboard that
is exclusively built with recording rules. With this change, it spends
only 35% in fingerprintsForLabelPairs. The CPU usage dropped from 26
cores to 18 cores. The median latency for query_range dropped from 14s
to 50ms(!). As expected, higher percentile latency didn't improve that
much because the new approach is _occasionally_ running into the worst
case while the old one was _systematically_ doing so. The 99th
percentile latency is now about as high as the median before (14s)
while it was almost twice as high before (26s).
2016-07-20 17:35:53 +02:00
Brian Brazil 40f8da699e Merge pull request #1815 from prometheus/stddev
Add stddev_over_time and stdvar_over_time.
2016-07-19 15:48:32 +01:00
Brian Brazil 9e58070c04 Merge pull request #1820 from prometheus/console-api
Update example console templates to new HTTP API.
2016-07-18 21:59:21 +01:00
Brian Brazil d458ecd4b9 Update example console templates to new HTTP API.
Fixes #1819
2016-07-18 20:36:47 +01:00
Fabian Reinartz 42a3cb6172 Merge branch 'release-1.0' 2016-07-19 00:51:32 +09:00
Fabian Reinartz e2bb136f4e Merge pull request #1818 from prometheus/fabxc-1.0.0
*: cut 1.0.0
2016-07-18 23:19:29 +09:00
Fabian Reinartz e867944172 *: cut 1.0.0 2016-07-18 22:38:51 +09:00
Brian Brazil 6eb1d5e63c Merge pull request #1816 from prometheus/fabxc-k8sfix
config: validate Kubernetes role correctly.
2016-07-18 14:29:10 +01:00
Fabian Reinartz 7a0b3af0b7 config: validate Kubernetes role correctly. 2016-07-18 22:24:41 +09:00
Brian Brazil 1edd6875f5 Add stddev_over_time and stdvar_over_time. 2016-07-16 00:34:44 +01:00
Fabian Reinartz 0938661db9 Merge pull request #1804 from pydima/master
web: return status code and error message for config resource
2016-07-15 18:26:19 +09:00
Dmitry Vorobev 273e457da4 web: return status code and error message for config resource 2016-07-15 10:15:24 +02:00
Fabian Reinartz 4d0c697548 circle: add tag v-prefix 2016-07-14 11:46:48 +09:00
Fabian Reinartz a6c81f32bc Merge branch 'release-1.0' of github.com:prometheus/prometheus into release-1.0 2016-07-14 10:44:02 +09:00
Fabian Reinartz 675b0184af Merge pull request #1812 from prometheus/fabxc-1.0.0-rc.0
Release 1.0.0-rc.0
2016-07-14 10:43:41 +09:00
Fabian Reinartz 1c4b3ab0e2 *: update changelog for version 1.0.0-rc.0 2016-07-14 10:04:40 +09:00
Fabian Reinartz e3f4df75a8 Merge pull request #1807 from prometheus/am-label
Expand alert templates at eval time.
2016-07-14 10:04:09 +09:00
Fabian Reinartz ca7ab62f40 *: bump version to 1.0.0-rc.0 2016-07-14 09:55:00 +09:00
Fabian Reinartz 919558f601 config: remove deprecated target_groups configuration 2016-07-14 09:55:00 +09:00
Fabian Reinartz 9c3129746c Merge pull request #1807 from prometheus/am-label
Expand alert templates at eval time.
2016-07-13 17:01:42 +02:00
Björn Rabenstein 0622304244 Merge pull request #1798 from prometheus/beorn7/storage2
Crash recovery: Fix an edge case.
2016-07-13 16:53:18 +02:00
Brian Brazil 0509b0f2db Expand alert templates at eval time.
Fixes #1678 #1677
2016-07-12 17:13:55 +01:00
Fabian Reinartz e87d604f94 Merge pull request #1791 from prometheus/fabxc-routepref
web: add -web.route-prefix flag
2016-07-10 12:05:39 +02:00
Fabian Reinartz f8bb0ee91f Merge pull request #1793 from prometheus/count_values
Add count_values() aggregator.
2016-07-08 11:50:42 +02:00
Fabian Reinartz b4660a550c Merge pull request #1797 from prometheus/beorn7/storage
Consistently use the `Seconds()` method for conversion of durations
2016-07-07 17:23:06 +02:00
beorn7 2a75b15328 Crash recovery: Fix an edge case.
If the chunks of a series in the checkpoint are all older than the
latest chunk on disk, the head chunk is persisted and therefore has to
be declared closed.

It would be great to have a test for this, but that would require more
plumbing, subject of #447.
2016-07-07 16:17:38 +02:00
beorn7 064b57858e Consistently use the Seconds() method for conversion of durations
This also fixes one remaining case of recording integral numbers
of seconds only for a metric, i.e. this will probably fix #1796.
2016-07-07 15:24:35 +02:00
Fabian Reinartz 59d26e8536 web: add -web.route-prefix flag
Fixes #1191
2016-07-07 11:49:16 +02:00
Fabian Reinartz b16f49bb44 Merge pull request #1795 from prometheus/keeping_extra
Clean out old keywords
2016-07-07 09:08:37 +02:00
Brian Brazil 875818d060 Clean out old keywords 2016-07-07 05:30:48 +01:00
Brian Brazil 16690736ab Add count_values() aggregator.
This is useful for counting how many instances
of a job are running a particular version/build.

Fixes #622
2016-07-05 17:14:01 +01:00
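
As a rough sketch of the use case described for `count_values()`, the
following groups samples by their value and counts each group. It only
mimics the idea; the `countValues` helper, the instance names and the
way values are formatted into strings are assumptions, not the PromQL
implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// countValues counts how many samples share each distinct value.
// The value is turned into a string key, similar in spirit to a label.
func countValues(samples map[string]float64) map[string]int {
	counts := make(map[string]int)
	for _, v := range samples {
		counts[strconv.FormatFloat(v, 'f', -1, 64)]++
	}
	return counts
}

func main() {
	// Hypothetical build numbers exposed by three instances of one job.
	builds := map[string]float64{"a:9090": 42, "b:9090": 42, "c:9090": 43}
	fmt.Println(countValues(builds))
}
```
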
Fabian Reinartz 6f19e418e1 Merge pull request #1781 from prometheus/fabxc-k8s-sd
Select Kubernetes SD type in configuration
2016-07-05 14:29:46 +02:00
Fabian Reinartz 4591a2623b discovery/kubernetes: filter pod/container, service/endpoint
This change distinguishes and filters by pod/container and
service/endpoint in the respective sub-SDs.
2016-07-05 14:24:17 +02:00
Fabian Reinartz 0ff354341b discovery/kubernetes: remove unused channel 2016-07-05 14:22:12 +02:00
Fabian Reinartz 7221228843 discovery/kubernetes: select between discovery role
This adds a `role` field to the Kubernetes SD config, which indicates
which type of Kubernetes SD should be run.
As a consequence, it is no longer possible, for example, to discover
pods and nodes with the same SD configuration.
2016-07-05 14:22:12 +02:00
Fabian Reinartz abdf3536e4 Merge pull request #1788 from prometheus/topk
Make topk/bottomk aggregators.
2016-07-05 11:32:17 +02:00
Fabian Reinartz e0f8caacd7 discovery/kubernetes: extract service endpoint discovery
This extracts discovery of services and their endpoints into its own
type.
2016-07-05 10:26:23 +02:00
Brian Brazil 7f23a4a099 Add type check on topk/bottomk parameter. 2016-07-04 18:03:05 +01:00
Brian Brazil fa9cc15573 Add topk/bottomk tests for multiple buckets. 2016-07-04 13:18:28 +01:00
Brian Brazil 3b0c182eee Move topk/bottomk unittests over to aggregators. 2016-07-04 13:18:28 +01:00
Brian Brazil 3e5136e36d Make topk/bottomk aggregators. 2016-07-04 13:18:19 +01:00
Fabian Reinartz 3c1e15087d Merge pull request #1785 from prometheus/fabxc-vendor
Update vendoring
2016-07-04 13:21:50 +02:00
Fabian Reinartz f26823afa7 Merge pull request #1787 from prometheus/fabxc-gitignore
gitignore: clean up
2016-07-04 11:47:44 +02:00
Fabian Reinartz 746d330a23 gitignore: clean up
This removes several outdated or unnecessary ignore patterns.
Especially those that match random words such as 'local' or 'core',
which repeatedly caused weird behavior that's hard to debug, e.g.
invisible vendored files.
2016-07-04 11:34:33 +02:00
Fabian Reinartz 7d441abd7b vendor: update prometheus org dependencies 2016-07-04 11:09:06 +02:00
Fabian Reinartz 7700cff1ff vendor: update golang.org/x/sys 2016-07-04 11:07:02 +02:00
Fabian Reinartz e4e8479716 vendor: add missing license/patent notices 2016-07-04 11:06:26 +02:00
Fabian Reinartz bc506ce959 vendor: update goleveldb dependencies 2016-07-04 10:08:49 +02:00
Fabian Reinartz f4398d5bdf Merge pull request #1782 from prometheus/fabxc-testflags
cmd/prometheus: use own flag set
2016-07-04 09:27:10 +02:00
Fabian Reinartz 8c24dfdb86 cmd/prometheus: use own flag set
Fixes #1743
2016-07-03 14:23:31 +02:00