2815 commits

Author SHA1 Message Date
Fabian Reinartz 1a3253e8ed Make scrape time unambiguous.
This commit changes the scraper interface to accept a timestamp
so the timestamp reported by the caller and the timestamp
attached to samples do not differ.
2016-03-01 13:48:36 +01:00
Fabian Reinartz 2bb8ef99d1 Test scrape loop behavior. 2016-03-01 13:48:36 +01:00
Fabian Reinartz c7bbe95597 Remove outdated target tests 2016-03-01 13:48:36 +01:00
Fabian Reinartz 05de8b7f8d Extract target scraping into scrape loop.
This commit factors out the scrape loop handling into
its own data structure.
For the transition it will be directly attached to the
target.
2016-03-01 13:48:36 +01:00
Fabian Reinartz cebba3efbb Simplify and fix TargetManager reloading 2016-03-01 13:48:36 +01:00
Fabian Reinartz da99366f85 Consolidate Target.Update into constructor.
The Target.Update method is no longer needed.
2016-03-01 13:48:36 +01:00
Fabian Reinartz d15adfc917 Preserve target state across reloads.
This commit moves Scraper handling into a separate scrapePool type.
TargetSets only manage TargetProvider lifecycles and sync the
retrieved updates to the scrapePool.

TargetProviders are now expected to send a full initial target set
within 5 seconds. The scrapePools preserve target state across reloads
and only drop targets after the initial set was synced.
2016-03-01 13:48:36 +01:00
Fabian Reinartz 5b30bdb610 Change TargetProvider interface.
This commit changes the TargetProvider interface to use a
context.Context and send lists of TargetGroups, rather than
single ones.
2016-03-01 13:48:36 +01:00
Fabian Reinartz bb6dc3ff78 Remove old tests 2016-03-01 13:48:36 +01:00
Fabian Reinartz 5bfa4cdd46 Simplify target update handling.
We group providers by their scrape configuration. Each provider produces
target groups with a unique identifier.

On stopping a set of target providers we cancel the target providers,
stop scraping the targets and wait for the scrapers to finish.

On configuration reload all provider sets are stopped and new ones
are created. This will make targets disappear briefly on configuration
reload. Potentially scrapes are missed but due to the consistent
scrape intervals implemented recently, the impact is minor.
2016-03-01 13:48:36 +01:00
Brian Brazil 671cc59de7 Merge pull request #1440 from fabric8io/kubernetes-discovery
Kubernetes SD: Fix node IP discovery
2016-03-01 12:27:48 +00:00
Jimmi Dyson e59b7c15a3 Kubernetes SD: Fix node IP discovery 2016-03-01 12:24:52 +00:00
Fabian Reinartz bfa8aaa017 Rename notification to notifier 2016-03-01 12:39:08 +01:00
Fabian Reinartz 42a64a7d0b Merge pull request #1434 from igncp/master
Fix function names in comments
2016-03-01 12:32:15 +01:00
Ignacio Carbajo 1b3ea0ea1b Fix function names in comments 2016-02-29 21:58:32 +00:00
Björn Rabenstein e4d0ae9b4e Merge pull request #1432 from prometheus/beorn7/fix-deadlock
Fix a deadlock
2016-02-29 16:46:33 +01:00
beorn7 536e0bc86b Merge branch 'beorn7/fix-deadlock' into beorn7/storage3 2016-02-29 16:37:16 +01:00
beorn7 33a50e69f7 Fix a deadlock
Double acquisition of the RLock usually doesn't blow up, but if the
write lock is called for between the two RLock calls, we are deadlocked.

This deadlock does not exist in release-0.17, BTW.
2016-02-29 16:34:29 +01:00
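One common way to avoid the double read-lock acquisition described in 33a50e69f7 is to take the lock once in the exported path and delegate to helpers that assume it is already held. The sketch below illustrates that pattern with a hypothetical `store` type; it is not the code changed by the commit.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical type guarded by a read-write mutex.
type store struct {
	mtx  sync.RWMutex
	data map[string]int
}

// get acquires the read lock exactly once and delegates to the helper.
func (s *store) get(key string) (int, bool) {
	s.mtx.RLock()
	defer s.mtx.RUnlock()
	return s.getLocked(key)
}

// getLocked assumes the caller already holds s.mtx. It must not call RLock
// again: if a writer calls Lock between two RLock calls of the same
// goroutine, the second RLock waits for the writer while the writer waits
// for the first RLock to be released.
func (s *store) getLocked(key string) (int, bool) {
	v, ok := s.data[key]
	return v, ok
}

// sum takes the read lock once and reuses the lock-free helper.
func (s *store) sum(keys ...string) int {
	s.mtx.RLock()
	defer s.mtx.RUnlock()
	total := 0
	for _, k := range keys {
		if v, ok := s.getLocked(k); ok {
			total += v
		}
	}
	return total
}

// set takes the write lock.
func (s *store) set(key string, v int) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.data[key] = v
}

func main() {
	s := &store{data: map[string]int{}}
	s.set("a", 1)
	s.set("b", 2)
	fmt.Println(s.sum("a", "b", "missing"))
}
```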
beorn7 c740789ce3 Improve predict_linear
Fixes https://github.com/prometheus/prometheus/issues/1401

This removes the last (and in fact bogus) use of BoundaryValues.

Thus, a whole lot of unused (and arguably sub-optimal / ugly) code can
be removed here, too.
2016-02-25 12:10:55 +01:00
Julius Volz 73399f826a Merge pull request #1427 from prometheus/fix-scrape-timeout
Remove invalid scrape timeout from example config.
2016-02-24 22:27:55 +01:00
Julius Volz 657d65d6d6 Remove invalid scrape timeout from example config.
It can't be greater than the scrape interval. Let's just remove it.
2016-02-24 21:06:36 +01:00
beorn7 4b503ed9a5 Merge branch 'master' into beorn7/storage2 2016-02-24 14:03:49 +01:00
beorn7 059295332f Merge remote-tracking branch 'origin/master' into beorn7/storage 2016-02-24 14:02:27 +01:00
beorn7 53005c3085 Merge branch 'beorn7/storage' into beorn7/storage2 2016-02-24 14:00:56 +01:00
beorn7 28e9bbc15f Populate chunkDesc.chunkLastTime during checkpoint loading, too 2016-02-24 13:58:34 +01:00
Björn Rabenstein a8c79f0a0c Merge pull request #1422 from prometheus/release-0.17
Merge more commits from 0.17.
2016-02-23 23:07:44 +01:00
Björn Rabenstein 5eff37ccbe Merge pull request #1421 from prometheus/beorn7/fix
Fix a very special case of handling the checkpoint timer
2016-02-23 22:25:27 +01:00
beorn7 8fa1560e48 Fix a very special case of handling the checkpoint timer 2016-02-23 16:48:35 +01:00
Björn Rabenstein 17bfe798eb Merge pull request #1419 from prometheus/release-note-fixes
Improve 0.17.0 changelog
2016-02-23 11:21:35 +01:00
Tobias Schmidt b7e6651e06 Improve 0.17.0 changelog
* remove wrong release date until 0.17.0 actually gets released
* fix wrong alertmanager version number
* add example for regex anchor change
2016-02-22 19:49:33 -05:00
Brian Brazil e4e00b6f24 Merge pull request #1418 from igncp/patch-1
Fix minor typo
2016-02-22 23:44:46 +00:00
Ignacio Carbajo 0c537d6af6 Fix minor typo 2016-02-22 23:25:17 +00:00
beorn7 41e44f6ab9 Merge branch 'master' into beorn7/storage2 2016-02-22 16:54:33 +01:00
Björn Rabenstein 888c77cb06 Merge pull request #1416 from prometheus/beorn7/fix-test
Fix a targetmanager test
2016-02-22 16:53:59 +01:00
beorn7 fd5108b038 Fix a targetmanager test 2016-02-22 16:43:48 +01:00
Björn Rabenstein d9eb624322 Merge pull request #1415 from prometheus/release-0.17
Forward-merge release-0.17 into master
2016-02-22 16:39:48 +01:00
Björn Rabenstein 51aad630b6 Merge pull request #1414 from prometheus/beorn7/rushed-race
Fix a race condition in calculatePersistenceUrgencyScore
2016-02-22 16:09:19 +01:00
beorn7 4d1f7b49b6 Fix a race condition in calculatePersistenceUrgencyScore 2016-02-22 15:48:39 +01:00
Brian Brazil 04946afd0a Merge pull request #1412 from prometheus/fingerprintfix
Remove fullLabels method and fix target updating
2016-02-22 12:11:08 +00:00
Fabian Reinartz 6df1f49c13 Remove fullLabels method and fix target updating
With recent changes to a Target's internal data representation
updating by fullLabels() assigns the additional default
instance label. This breaks target identity comparison and causes
identical targets from service discovery to be constantly swapped.
2016-02-22 13:06:30 +01:00
beorn7 454ecf3f52 Rework the way ranges and instants are handled
In a way, our instants were also ranges, just with the staleness delta
as range length. They are now treated equally, just that in one case,
the range length is set as range, in the other the staleness
delta. However, there are "real" instants where start and end time of
a query are the same. In those cases, we only want to return a single
value (the one closest before or at the equal start and end time). If
that value is the last sample in the series, odds are we have it
already in the series object. In that case, there is no need to pin or
load any chunks. A special singleSampleSeriesIterator is created for
that. This should greatly speed up instant queries as they happen
frequently for rule evaluations.
2016-02-22 01:47:18 +01:00
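The fast path described in 454ecf3f52 can be sketched as follows. The types and the method `valueAtOrBefore` are hypothetical, the staleness delta is deliberately ignored, and a sorted slice stands in for chunk data that would otherwise have to be pinned or loaded.

```go
package main

import (
	"fmt"
	"sort"
)

// samplePair is a hypothetical timestamp/value pair.
type samplePair struct {
	ts    int64
	value float64
}

// series is hypothetical: samples stands in for chunk data that is expensive
// to pin or load, last caches the most recently appended sample.
type series struct {
	samples []samplePair // sorted by ts
	last    samplePair
	hasLast bool
}

func (s *series) append(p samplePair) {
	s.samples = append(s.samples, p)
	s.last, s.hasLast = p, true
}

// valueAtOrBefore returns the sample closest before or at t. The second
// result reports whether a sample was found, the third whether the cached
// last sample answered the query without touching s.samples.
func (s *series) valueAtOrBefore(t int64) (samplePair, bool, bool) {
	if s.hasLast && s.last.ts <= t {
		return s.last, true, true
	}
	// Slow path: find the index of the first sample strictly after t.
	i := sort.Search(len(s.samples), func(i int) bool { return s.samples[i].ts > t })
	if i == 0 {
		return samplePair{}, false, false
	}
	return s.samples[i-1], true, false
}

func main() {
	s := &series{}
	for _, ts := range []int64{10, 20, 30} {
		s.append(samplePair{ts: ts, value: float64(ts)})
	}
	for _, t := range []int64{35, 25, 5} {
		p, found, fast := s.valueAtOrBefore(t)
		fmt.Println(t, p.value, found, fast)
	}
}
```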
Fabian Reinartz 209c4ad64f Merge pull request #1410 from bluecmd/patch-1
Allow custom ldflags for go build
2016-02-21 10:35:00 +01:00
Christian Svensson 69ebf45649 Allow custom ldflags for go build
This allows users to use CGO and external linker when building Prometheus.
2016-02-20 17:34:36 +01:00
beorn7 b876f8e6a5 Move lastSamplePair method up to memorySeries
This implies a slight change of behavior as only samples added to the
respective instance of a memorySeries are returned. However, this is
most likely what we want anyway.

Following cases:

- Server has been restarted: Given the time it takes to cleanly
  shut down and start up a server, the series are now stale anyway. An
  improved staleness handling (still to be implemented) will be based
  on tracking if a given target is continuing to expose samples for a
  given time series. In that case, we need a full scrape cycle to
  decide about staleness. So again, it makes sense to consider
  everything stale directly after a server restart.

- Series unarchived due to a read request: The series is definitely
  stale so we don't want to return anything anyway.

- Freshly created time series or series unarchived because of a sample
  append: That happens because appending a sample is imminent. Before
  the fingerprint lock is released, the series will have received a
  sample, and lastSamplePair will always return the expected value.
2016-02-19 18:16:41 +01:00
beorn7 1e13f89039 Return SamplePair instead of *SamplePair consistently
Formalize ZeroSamplePair as return value for non-existing samples.

Change LastSamplePairForFingerprint to return a SamplePair (and not a
pointer to it), which saves allocations in a potentially extremely
frequent call.
2016-02-19 17:00:40 +01:00
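A sketch of the value-return pattern from 1e13f89039: a sentinel value marks "no sample", so callers compare against it instead of checking a pointer for nil. The lowercase names and the choice of sentinel timestamp are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// samplePair is a hypothetical timestamp/value pair, small enough to be
// returned by value cheaply.
type samplePair struct {
	ts    int64
	value float64
}

// zeroSamplePair is a hypothetical sentinel for "no sample"; the earliest
// representable timestamp is assumed never to occur in real data.
var zeroSamplePair = samplePair{ts: math.MinInt64}

// lastSamplePair returns the most recent sample by value, or the sentinel
// if there is none.
func lastSamplePair(samples []samplePair) samplePair {
	if len(samples) == 0 {
		return zeroSamplePair
	}
	return samples[len(samples)-1]
}

func main() {
	fmt.Println(lastSamplePair(nil) == zeroSamplePair)
	fmt.Println(lastSamplePair([]samplePair{{ts: 1, value: 2}, {ts: 3, value: 4}}))
}
```

Because the struct contains only comparable fields, the sentinel check is a plain `==` comparison.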
beorn7 d290340367 Fix and improve chunkDesc locking 2016-02-19 16:24:38 +01:00
beorn7 0e202dacb4 Streamline series iterator creation
This will fix issue #1035 and will also help to make issue #1264 less
bad.

The fundamental problem in the current code:

In the preload phase, we quite accurately determine which chunks will
be used for the query being executed. However, in the subsequent step
of creating series iterators, the created iterators are referencing
_all_ in-memory chunks in their series, even the un-pinned ones. In
iterator creation, we copy a pointer to each in-memory chunk of a
series into the iterator. While this creates a certain amount of
allocation churn, the worst thing about it is that copying the chunk
pointer out of the chunkDesc requires a mutex acquisition. (Remember
that the iterator will also reference un-pinned chunks, so we need to
acquire the mutex to protect against concurrent eviction.) The worst
case happens if a series doesn't even contain any relevant samples for
the query time range. We notice that during preloading but then we
will still create a series iterator for it. But even for series that
do contain relevant samples, the overhead is quite bad for instant
queries that retrieve a single sample from each series, but still go
through all the effort of series iterator creation. All of that is
particularly bad if a series has many in-memory chunks.

This commit addresses the problem from two sides:

First, it merges preloading and iterator creation into one step,
i.e. the preload call returns an iterator for exactly the preloaded
chunks.

Second, the required mutex acquisition in chunkDesc has been greatly
reduced. That was enabled by a side effect of the first step, which is
that the iterator is only referencing pinned chunks, so there is no
risk of concurrent eviction anymore, and chunks can be accessed
without mutex acquisition.

To simplify the code changes for the above, the long-planned change of
ValueAtTime to ValueAtOrBeforeTime was performed at the same
time. (It should have been done first, but it kind of accidentally
happened while I was in the middle of writing the series iterator
changes. Sorry for that.) So far, we actively filtered the up to two
values that were returned by ValueAtTime, i.e. we invested work to
retrieve up to two values, and then we invested more work to throw one
of them away.

The SeriesIterator.BoundaryValues method can be removed once #1401 is
fixed. But I really didn't want to load even more changes into this
PR.

Benchmarks:

The BenchmarkFuzz.* benchmarks take 83% less time (i.e. run about six
times faster) and allocate 95% fewer bytes. The reason for that is that the
benchmark reads one sample after another from the time series and
creates a new series iterator for each sample read.

To find out how much these improvements matter in practice, I have
mirrored a beefy Prometheus server at SoundCloud that suffers from
both issues #1035 and #1264. To reach steady state that would be
comparable, the server needs to run for 15d. So far, it has run for
1d. The test server currently has only half as many memory time series
and 60% of the memory chunks the main server has. The 90th percentile
rule evaluation cycle time is ~11s on the main server and only ~3s on
the test server. However, these numbers might get much closer over
time.

In addition to performance improvements, this commit removes about 150
LOC.
2016-02-19 16:24:38 +01:00
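The first change described in 0e202dacb4, merging preloading and iterator creation, might look roughly like the sketch below. All types and names (`chunk`, `iterator`, `preload`, a plain pin counter) are hypothetical simplifications, not the storage code itself.

```go
package main

import "fmt"

// chunk is a hypothetical, simplified chunk covering [first, last].
type chunk struct {
	first, last int64
	pins        int
}

// iterator references only the chunks that were pinned for it.
type iterator struct {
	chunks []*chunk
}

// close releases the pins taken by preload.
func (it *iterator) close() {
	for _, c := range it.chunks {
		c.pins--
	}
	it.chunks = nil
}

// preload pins every chunk overlapping [from, through] and returns an
// iterator over exactly those chunks, so chunk selection and iterator
// creation happen in a single step.
func preload(all []*chunk, from, through int64) *iterator {
	it := &iterator{}
	for _, c := range all {
		if c.last < from || c.first > through {
			continue // not relevant for the query range, stays unpinned
		}
		c.pins++
		it.chunks = append(it.chunks, c)
	}
	return it
}

func main() {
	all := []*chunk{{first: 0, last: 9}, {first: 10, last: 19}, {first: 20, last: 29}}
	it := preload(all, 12, 22)
	fmt.Println(len(it.chunks), all[0].pins, all[1].pins, all[2].pins)
	it.close()
	fmt.Println(all[1].pins)
}
```

In this sketch the iterator never sees an un-pinned chunk, which is the property the commit message relies on to drop most of the mutex acquisitions.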
Fabian Reinartz fce17b41c5 Merge pull request #1408 from prometheus/hostname
Log argument parse errors
2016-02-19 12:22:12 +01:00
Fabian Reinartz e62677d7ba Log argument parse errors
Fixes #1407
2016-02-19 12:20:10 +01:00
Brian Brazil cd85352fe1 Merge pull request #1403 from igncp/master
Fix minor typo
2016-02-17 22:58:05 +00:00