Commit graph

3931 commits

Author SHA1 Message Date
Fabian Reinartz 5e57fa85c0 Merge pull request #2016 from mattbostock/fix_race_between_reload_and_stop
rules/manager.go: Fix race between reload and stop
2016-09-22 19:30:56 +02:00
Fabian Reinartz 71b332278d Merge pull request #2021 from prometheus/beorn7/storage
Avoid `defer` in seriesMap.get
2016-09-22 18:12:21 +02:00
beorn7 ca98382943 Avoid defer in seriesMap.get
This is related to https://github.com/golang/go/issues/14939 .
It's probably the only occurrence where it matters.
2016-09-22 17:50:58 +02:00
Brian Brazil 2ee00b4461 Merge pull request #2019 from dominikschulz/ec2state
Expose ec2_instance_state
2016-09-22 14:57:19 +01:00
Dominik Schulz f6fbcf9aa2 Expose ec2_instance_state 2016-09-22 15:01:23 +02:00
Fabian Reinartz e5c633ed14 Merge pull request #2015 from mattbostock/fix_typo
cmd/prometheus/main.go: Fix typo in comment
2016-09-21 23:41:06 +02:00
Matt Bostock 926a5ab3dd rules/manager.go: Fix race between reload and stop
On one relatively large Prometheus instance (1.7M series), I noticed
that upgrades were frequently resulting in Prometheus undergoing crash
recovery on start-up.

On closer examination, I found that Prometheus was panicking on
shutdown.

It seems that our configuration management (or misconfiguration thereof)
is reloading Prometheus then immediately restarting it, which I suspect
is causing this race:

    Sep 21 15:12:42 host systemd[1]: Reloading prometheus monitoring system.
    Sep 21 15:12:42 host prometheus[18734]: time="2016-09-21T15:12:42Z" level=info msg="Loading configuration file /etc/prometheus/config.yaml" source="main.go:221"
    Sep 21 15:12:42 host systemd[1]: Reloaded prometheus monitoring system.
    Sep 21 15:12:44 host systemd[1]: Stopping prometheus monitoring system...
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:203"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="See you next time!" source="main.go:210"
    Sep 21 15:12:44 host prometheus[18734]: time="2016-09-21T15:12:44Z" level=info msg="Stopping target manager..." source="targetmanager.go:90"
    Sep 21 15:12:52 host prometheus[18734]: time="2016-09-21T15:12:52Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:548"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=1 source="scrape.go:467"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:12:56 host prometheus[18734]: time="2016-09-21T15:12:56Z" level=error msg="Error adding file watch for \"/etc/prometheus/targets\": no such file or directory" source="file.go:84"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping rule manager..." source="manager.go:366"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Rule manager stopped." source="manager.go:372"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping notification handler..." source="notifier.go:325"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping local storage..." source="storage.go:381"
    Sep 21 15:13:01 host prometheus[18734]: time="2016-09-21T15:13:01Z" level=info msg="Stopping maintenance loop..." source="storage.go:383"
    Sep 21 15:13:01 host prometheus[18734]: panic: close of closed channel
    Sep 21 15:13:01 host prometheus[18734]: goroutine 7686074 [running]:
    Sep 21 15:13:01 host prometheus[18734]: panic(0xba57a0, 0xc60c42b500)
    Sep 21 15:13:01 host prometheus[18734]: /usr/local/go/src/runtime/panic.go:500 +0x1a1
    Sep 21 15:13:01 host prometheus[18734]: github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig.func1(0xc6645a9901, 0xc420271ef0, 0xc420338ed0, 0xc60c42b4f0, 0xc6645a9900)
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:412 +0x3c
    Sep 21 15:13:01 host prometheus[18734]: created by github.com/prometheus/prometheus/rules.(*Manager).ApplyConfig
    Sep 21 15:13:01 host prometheus[18734]: /home/build/packages/prometheus/tmp/build/gopath/src/github.com/prometheus/prometheus/rules/manager.go:423 +0x56b
    Sep 21 15:13:03 host systemd[1]: prometheus.service: main process exited, code=exited, status=2/INVALIDARGUMENT
2016-09-21 22:03:02 +01:00
Matt Bostock dd98766b32 cmd/prometheus/main.go: Fix typo in comment 2016-09-21 21:59:25 +01:00
Matthew Campbell 67d76e3a5d timeseries: store varbit encoded data into cassandra 2016-09-21 17:56:55 +02:00
Julius Volz f92532f254 api: Consolidate web API contexts
This is based on the common/route changes in
https://github.com/prometheus/common/pull/61.
2016-09-21 03:22:20 +02:00
Julius Volz 766074568e Update vendoring of github.com/prometheus/common/route 2016-09-21 03:22:14 +02:00
Tom Wilkie 4520e12440 Add HTTP Basic Auth & TLS support to the generic write path. (#1957)
* Add config, HTTP Basic Auth and TLS support to the generic write path.

- Move generic write path configuration to the config file
- Factor out config.TLSConfig -> tlf.Config translation
- Support TLSConfig for generic remote storage
- Rename Run to Start, and make it non-blocking.
- Dedupe code in httputil for TLS config.
- Make remote queue metrics global.
2016-09-19 22:47:51 +02:00
Julius Volz 6dda28dbd4 Merge pull request #2000 from prometheus/contextify-storage
Contextify storage and PromQL interfaces.
2016-09-19 16:31:12 +02:00
Julius Volz c187308366 storage: Contextify storage interfaces.
This is based on https://github.com/prometheus/prometheus/pull/1997.

This adds contexts to the relevant Storage methods and already passes
PromQL's new per-query context into the storage's query methods.
The immediate motivation supporting multi-tenancy in Frankenstein, but
this could also be used by Prometheus's normal local storage to support
cancellations and timeouts at some point.
2016-09-19 16:29:07 +02:00
Julius Volz ed5a0f0abe promql: Allow per-query contexts.
For Weaveworks' Frankenstein, we need to support multitenancy. In
Frankenstein, we initially solved this without modifying the promql
package at all: we constructed a new promql.Engine for every
query and injected a storage implementation into that engine which would
be primed to only collect data for a given user.

This is problematic to upstream, however. Prometheus assumes that there
is only one engine: the query concurrency gate is part of the engine,
and the engine contains one central cancellable context to shut down all
queries. Also, creating a new engine for every query seems like overkill.

Thus, we want to be able to pass per-query contexts into a single engine.

This change gets rid of the promql.Engine's built-in base context and
allows passing in a per-query context instead. Central cancellation of
all queries is still possible by deriving all passed-in contexts from
one central one, but this is now the responsibility of the caller. The
central query context is now created in main() and passed into the
relevant components (web handler / API, rule manager).

In a next step, the per-query context would have to be passed to the
storage implementation, so that the storage can implement multi-tenancy
or other features based on the contextual information.
2016-09-19 15:38:17 +02:00
Julius Volz c9c2663a54 Merge pull request #2004 from redbaron/no-false-sharing
Avoid having contended mutexes on same cacheline
2016-09-19 00:40:03 +02:00
Maxim Ivanov bdc53098fc Avoid having contended mutexes on same cacheline
CPUs have to serialise write access to a single cache line
effectively reducing level of possible parallelism. Placing
mutexes on different cache lines avoids this problem.

Most gains will be seen on NUMA servers where CPU interconnect
traffic is especially expensive

Before:
go test . -run none -bench BenchmarkFingerprintLocker
BenchmarkFingerprintLockerParallel-4   	 2000000	       932 ns/op
BenchmarkFingerprintLockerSerial-4     	30000000	        49.6 ns/op

After:
go test . -run none -bench BenchmarkFingerprintLocker
BenchmarkFingerprintLockerParallel-4   	 3000000	       569 ns/op
BenchmarkFingerprintLockerSerial-4     	30000000	        51.0 ns/op
2016-09-18 23:32:55 +01:00
Julius Volz 5f5a78e807 Merge pull request #1974 from prometheus/disable-local-storage
Allow disabling local storage.
2016-09-17 18:40:01 +02:00
Julius Volz 06199268b5 Merge pull request #2003 from mattbostock/remove_json_from_accept
Scrape: Remove JSON from Accept request header
2016-09-17 14:41:27 +02:00
Matt Bostock 4fc619b605 Scrape: Remove JSON from Accept request header
JSON is no longer supported as an exposition format [1] [2] [3]. Remove
it from the `Accept` header added to requests when scraping targets.

[1]: https://github.com/prometheus/prometheus/blob/master/CHANGELOG.md#100--2016-07-18
[2]: https://prometheus.io/docs/instrumenting/exposition_formats/#historical-versions
[3]: https://docs.google.com/document/d/1ZjyKiKxZV83VI9ZKAXRGKaUKK2BIWCT7oiGBKDBpjEY/edit?usp=sharing
2016-09-17 10:28:03 +01:00
Julius Volz a9b96be3fd Merge pull request #2002 from grandbora/graphPageUrlClientSideMigration
Graph page url client side migration
2016-09-17 00:26:09 +02:00
Bora Tunca 2e9de70267 generate assets 2016-09-16 18:20:12 -04:00
Bora Tunca 44377dc458 Add backward compatibility to old query format 2016-09-16 18:20:00 -04:00
Brian Brazil 5ac89fbc1f Merge pull request #1985 from tommyulfsparre/resync
Resync state after ZooKeeper failure
2016-09-16 20:48:04 +01:00
Tobias Schmidt 874cb44bb6 Merge pull request #1996 from ton31337/Fix/allow_numbers_as_first_letter
Allow number to be the first letter as well for `job_name`
2016-09-16 11:08:52 -04:00
beorn7 1f2785ebb7 Merge branch 'release-1.1' 2016-09-16 16:33:28 +02:00
Björn Rabenstein ac374aa674 Merge pull request #2001 from prometheus/beorn7/release
Cut v1.1.3
2016-09-16 13:33:59 +02:00
beorn7 f4656acd1f Cut v1.1.3 2016-09-16 13:08:16 +02:00
Donatas Abraitis 1aa8898b66 Allow number to be the first letter as well for job_name 2016-09-16 14:06:47 +03:00
Tobias Schmidt 304f971e3e Merge pull request #1769 from gottwald/gce-discovery
GCE discovery
2016-09-16 03:11:51 -04:00
Ingo Gottwald 3b546d061f Add support for GCE discovery 2016-09-16 08:55:33 +02:00
Ingo Gottwald fefcd6eef2 Add deps for google cloud support 2016-09-16 08:51:58 +02:00
Tobias Schmidt 4b850970a2 Merge pull request #1999 from prometheus/print-promu
Add promu installation logging to Makefile
2016-09-15 19:12:15 -04:00
Julius Volz 099adab253 Merge pull request #1987 from tomwilkie/1982-die-grpc-die
Switch back to protos over HTTP, instead of grpc.
2016-09-16 01:01:47 +02:00
Julius Volz 92d60ba4c0 Add promu installation logging to Makefile
Due to bad GitHub connectivity, "make" frequently got stuck at the promu
step for me, and I was thinking that "format" was taking a long time
because the promu step wasn't logged. All other Makefile targets have
log statements...
2016-09-16 00:59:56 +02:00
Tom Wilkie d83879210c Switch back to protos over HTTP, instead of GRPC.
My aim is to support the new grpc generic write path in Frankenstein.  On the surface this seems easy - however I've hit a number of problems that make me think it might be better to not use grpc just yet.

The explanation of the problems requires a little background.  At weave, traffic to frankenstein need to go through a couple of services first, for SSL and to be authenticated.  So traffic goes:

    internet -> frontend -> authfe -> frankenstein

- The frontend is Nginx, and adds/removes SSL.  Its done this way for legacy reasons, so the certs can be managed in one place, although eventually we imagine we'll merge it with authfe.  All traffic from frontend is sent to authfe.
- Authfe checks the auth tokens / cookie etc and then picks the service to forward the RPC to.
- Frankenstein accepts the reads and does the right thing with them.

First problem I hit was Nginx won't proxy http2 requests - it can accept them, but all calls downstream are http1 (see https://trac.nginx.org/nginx/ticket/923).  This wasn't such a big deal, so it now looks like:

    internet --(grpc/http2)--> frontend --(grpc/http1)--> authfe --(grpc/http1)--> frankenstein

Next problem was golang grpc server won't accept http1 requests (see https://groups.google.com/forum/#!topic/grpc-io/JnjCYGPMUms).  It is possible to link a grpc server in with a normal go http mux, as long as the mux server is serving over SSL, as the golang http client & server won't do http2 over anything other than an SSL connection.  This would require making all our service to service comms SSL.  So I had a go a writing a grpc http1 server, and got pretty far.  But is was a bit of a mess.

So finally I thought I'd make a separate grpc frontend for this, running in parallel with the frontend/authfe combo on a different port - and first up I'd need a grpc reverse proxy.  Ideally we'd have some nice, generic reverse proxy that only knew about a map from service names -> downstream service, and didn't need to decode & re-encode every request as it went through.  It seems like this can't be done with golang's grpc library - see https://github.com/mwitkow/grpc-proxy/issues/1.

And then I was surprised to find you can't do grpc from browsers! See http://www.grpc.io/faq/ - not important to us, but I'm starting to question why we decided to use grpc in the first place?

It would seem we could have most of the benefits of grpc with protos over HTTP, and this wouldn't preclude moving to grpc when its a bit more mature?  In fact, the grcp FAQ even admits as much:

> Why is gRPC better than any binary blob over HTTP/2?
> This is largely what gRPC is on the wire.
2016-09-15 23:21:54 +01:00
Tom Wilkie e0989fde89 Remove grpc vendoring. 2016-09-15 23:15:56 +01:00
Tom Wilkie bcd43e82c6 Add go_import_path to travis so it works on a fork. (#1995) 2016-09-15 17:05:56 -04:00
Tobias Schmidt 3eef95962b Merge pull request #1965 from prometheus/beorn7/federation
federation: Collapse time series of the same name
2016-09-15 17:03:13 -04:00
beorn7 717dd8adac web: add more federation test scenarios 2016-09-15 15:23:55 +02:00
beorn7 784a8ad7c5 web: Inline httptest.NewRequest because it only exists in Go1.7+ 2016-09-15 15:06:36 +02:00
Fabian Reinartz 737ae60cea Merge pull request #1993 from prometheus/grobie/include-go-report
Link to goreport from README
2016-09-15 08:00:14 +02:00
Fabian Reinartz 4f7e6e8bf0 Merge pull request #1994 from prometheus/grobie/fix-small-issues
Fix low hanging code issues
2016-09-15 07:57:55 +02:00
Tobias Schmidt 29ced0090f Fix common english misspellings 2016-09-14 23:23:28 -04:00
Tobias Schmidt e2c12dcdb5 Add missing error check in persistence test 2016-09-14 23:16:47 -04:00
Tobias Schmidt 27074863b4 Print url.URLs correctly in tests 2016-09-14 23:15:18 -04:00
Tobias Schmidt 8f3b62bfe4 Simplify struct initialization 2016-09-14 23:13:27 -04:00
Tobias Schmidt b41a240c36 Link to goreport from README 2016-09-14 23:09:26 -04:00
beorn7 39c4915401 federation: Collapse time series of the same name
This will avoid duplicate MetricFamilies, thereby shrinking the size
of the federation payload and also creating legal text format.

Also, add unit tests for federation. They were also needed for the
previous state of the code, but were missing.
2016-09-14 19:35:20 +02:00
Julius Volz b24e5d63bc Add noop local storage engine.
This adds a flag -storage.local.engine which allows turning off local
storage in Prometheus. Instead of adding if-conditions and nil checks to
all parts of Prometheus that deal with Prometheus's local storage
(including the web interface), disabling local storage simply means
replacing the normal local storage with a noop version that throws
samples away and returns empty query results. We also don't add the noop
storage to the fanout appender to decrease internal overhead.

Instead of returning empty results, an alternate behavior could be to
return errors on any query that point out that the local storage is
disabled. Not sure which one is more preferable, so I went with the
empty result option for now.
2016-09-14 13:18:05 +02:00