title: Alerting rules
sort_rank: 3
# Alerting rules
Alerting rules allow you to define alert conditions based on Prometheus
expression language expressions and to send notifications about firing alerts
to an external service. Whenever the alert expression results in one or more
vector elements at a given point in time, the alert counts as active for these
elements' label sets.
Alerting rules are configured in Prometheus in the same way as recording rules.
### Defining alerting rules
Alerting rules are defined in the following syntax:
ALERT <alert name>
IF <expression>
[ FOR <duration> ]
[ LABELS <label set> ]
[ ANNOTATIONS <label set> ]
The alert name must be a valid metric name.
The optional `FOR` clause causes Prometheus to wait for a certain duration
between first encountering a new expression output vector element (like an
instance with a high HTTP error rate) and counting an alert as firing for this
element. Elements that are active, but not firing yet, are in pending state.
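The pending-to-firing transition can be sketched in a few lines of Python. This is only an illustration, not Prometheus's implementation: the `AlertState` class and its `update` method are hypothetical names, and the sketch assumes an element starts firing once it has been present for at least the `FOR` duration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one label set of a hypothetical alert with a FOR duration."""

    for_seconds: float
    active_since: Optional[float] = None  # when the element first appeared

    def update(self, now: float, present: bool) -> str:
        """Return 'inactive', 'pending' or 'firing' for this evaluation."""
        if not present:
            # The element vanished from the expression output: reset.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Evaluate an alert with FOR 5m at a few points in time (seconds).
state = AlertState(for_seconds=300)
timeline = [(0, True), (150, True), (300, True), (315, False), (330, True)]
states = [state.update(t, present) for t, present in timeline]
print(states)
```

Note that an element that disappears and comes back starts over in the pending state, which is why the last evaluation is not firing.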
The `LABELS` clause allows specifying a set of additional labels to be attached
to the alert. Any existing conflicting labels will be overwritten. The label
values can be templated.
The `ANNOTATIONS` clause specifies another set of labels that are not
identifying for an alert instance. They are used to store longer additional
information such as alert descriptions or runbook links. The annotation values
can be templated.
#### Templating
Label and annotation values can be templated using [console templates](https://prometheus.io/docs/visualization/consoles).
The `$labels` variable holds the label key/value pairs of an alert instance
and `$value` holds the evaluated value of an alert instance.
# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
# Alert for any instance that is unreachable for >5 minutes.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }
# Alert for any instance that has a median request latency >1s.
ALERT APIHighRequestLatency
  IF api_http_request_latencies_second{quantile="0.5"} > 1
  FOR 1m
  ANNOTATIONS {
    summary = "High request latency on {{ $labels.instance }}",
    description = "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)",
  }
### Inspecting alerts during runtime
To manually inspect which alerts are active (pending or firing), navigate to
the "Alerts" tab of your Prometheus instance. This will show you the exact
label sets for which each defined alert is currently active.
For pending and firing alerts, Prometheus also stores synthetic time series of
the form `ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}`.
The sample value is set to `1` as long as the alert is in the indicated active
(pending or firing) state, and a single `0` value gets written out when an alert
transitions from active to inactive state. Once inactive, the time series does
not get further updates.
### Sending alert notifications
Prometheus's alerting rules are good at figuring out what is broken *right now*, but
they are not a fully-fledged notification solution. Another layer is needed to
add summarization, notification rate limiting, silencing and alert dependencies
on top of the simple alert definitions. In Prometheus's ecosystem, the
[Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) takes on this
role. Thus, Prometheus may be configured to periodically send information about
alert states to an Alertmanager instance, which then takes care of dispatching
the right notifications. The Alertmanager instance may be configured via the
`-alertmanager.url` command line flag.
title: Configuration
sort_rank: 3
title: Recording rules
sort_rank: 2
# Defining recording rules
## Configuring rules
Prometheus supports two types of rules which may be configured and then
evaluated at regular intervals: recording rules and [alerting
rules](alerting_rules.md). To include rules in Prometheus, create a file
containing the necessary rule statements and have Prometheus load the file via
the `rule_files` field in the [Prometheus configuration](configuration.md).
The rule files can be reloaded at runtime by sending `SIGHUP` to the Prometheus
process. The changes are only applied if all rule files are well-formatted.
## Syntax-checking rules
To quickly check whether a rule file is syntactically correct without starting
a Prometheus server, install and run Prometheus's `promtool` command-line
utility:
go get github.com/prometheus/prometheus/cmd/promtool
promtool check-rules /path/to/example.rules
When the file is syntactically valid, the checker prints a textual
representation of the parsed rules to standard output and then exits with
a `0` return status.
If there are any syntax errors, it prints an error message to standard error
and exits with a `1` return status. On invalid input arguments the exit status
is `2`.
## Recording rules
Recording rules allow you to precompute frequently needed or computationally
expensive expressions and save their result as a new set of time series.
Querying the precomputed result will then often be much faster than executing
the original expression every time it is needed. This is especially useful for
dashboards, which need to query the same expression repeatedly every time they
refresh.
To add a new recording rule, add a line of the following syntax to your rule
file:
<new time series name>[{<label overrides>}] = <expression to record>
Some examples:
# Saving the per-job HTTP in-progress request count as a new set of time series:
job:http_inprogress_requests:sum = sum(http_inprogress_requests) by (job)
# Drop or rewrite labels in the result time series:
new_time_series{label_to_change="new_value",label_to_drop=""} = old_time_series
Recording rules are evaluated at the interval specified by the
`evaluation_interval` field in the Prometheus configuration. During each
evaluation cycle, the right-hand-side expression of the rule statement is
evaluated at the current instant in time and the resulting sample vector is
stored as a new set of time series with the current timestamp and a new metric
name (and perhaps an overridden set of labels).
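One evaluation cycle of the `job:http_inprogress_requests:sum` example can be pictured with the following Python sketch. It is an illustration under stated assumptions, not how Prometheus stores data: the in-memory `store` dictionary and the `sum_by_job` and `record` helpers are hypothetical, and the input samples are made up.

```python
from collections import defaultdict


def sum_by_job(samples):
    """Aggregate sample values per `job` label, like sum(...) by (job)."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get("job", "")] += value
    return dict(totals)


def record(store, name, samples, timestamp):
    """Append the aggregated result to `store` under the new metric name."""
    for job, value in sum_by_job(samples).items():
        series = (name, ("job", job))
        store.setdefault(series, []).append((timestamp, value))


# Made-up instant vector for http_inprogress_requests.
samples = [
    ({"job": "api", "instance": "a:80"}, 3.0),
    ({"job": "api", "instance": "b:80"}, 4.0),
    ({"job": "db", "instance": "c:5432"}, 1.0),
]

store = {}
record(store, "job:http_inprogress_requests:sum", samples, timestamp=1000)
print(store)
```

Each later cycle would append one more `(timestamp, value)` pair per series, which is what makes querying the recorded name cheaper than re-aggregating the raw series.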
title: Template examples
sort_rank: 4
# Template examples
Prometheus supports templating in the summary and description fields of
alerts, as well as in served console pages. Templates have the ability to run
queries against the local database, iterate over data, use conditionals, format
data, etc. The Prometheus templating language is based on the
[Go templating](http://golang.org/pkg/text/template/) system.
## Simple alert field templates
ALERT InstanceDown
  IF up == 0
  FOR 5m
  ANNOTATIONS {
    summary = "Instance {{$labels.instance}} down",
    description = "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes.",
  }
Alert field templates will be executed during every rule iteration for each
alert that fires, so keep any queries and templates lightweight. If you have a
need for more complicated templates for alerts, it is recommended to link to a
console instead.
## Simple iteration
This displays a list of instances, and whether they are up:
{{ range query "up" }}
{{ .Labels.instance }} {{ .Value }}
{{ end }}
The special `.` variable contains the value of the current sample for each loop iteration.
## Display one value
{{ with query "some_metric{instance='someinstance'}" }}
{{ . | first | value | humanize }}
{{ end }}
Go and Go's templating language are both strongly typed, so one must check that
samples were returned to avoid an execution error. For example this could
happen if a scrape or rule evaluation has not run yet, or a host was down.
The included `prom_query_drilldown` template handles this, allows for
formatting of results, and linking to the [expression browser](https://prometheus.io/docs/visualization/browser/).
## Using console URL parameters
{{ with printf "node_memory_MemTotal{job='node',instance='%s'}" .Params.instance | query }}
{{ . | first | value | humanize1024}}B
{{ end }}
If accessed as `console.html?instance=hostname`, `.Params.instance` will evaluate to `hostname`.
## Advanced iteration
{{ range printf "node_network_receive_bytes{job='node',instance='%s',device!='lo'}" .Params.instance | query | sortByLabel "device"}}
<tr><th colspan=2>{{ .Labels.device }}</th></tr>
<tr>
<td>{{ with printf "rate(node_network_receive_bytes{job='node',instance='%s',device='%s'}[5m])" .Labels.instance .Labels.device | query }}{{ . | first | value | humanize }}B/s{{end}}</td>
<td>{{ with printf "rate(node_network_transmit_bytes{job='node',instance='%s',device='%s'}[5m])" .Labels.instance .Labels.device | query }}{{ . | first | value | humanize }}B/s{{end}}</td>
</tr>{{ end }}
Here we iterate over all network devices and display the network traffic for each.
As the `range` action does not specify a variable, `.Params.instance` is not
available inside the loop as `.` is now the loop variable.
## Defining reusable templates
Prometheus supports defining templates that can be reused. This is particularly
powerful when combined with
[console library](template_reference.md#console-templates) support, allowing
sharing of templates across consoles.
{{/* Define the template */}}
{{define "myTemplate"}}
  do something
{{end}}
{{/* Use the template */}}
{{template "myTemplate"}}
Templates are limited to one argument. The `args` function can be used to wrap multiple arguments.
{{define "myMultiArgTemplate"}}
  First argument: {{.arg0}}
  Second argument: {{.arg1}}
{{end}}
{{template "myMultiArgTemplate" (args 1 2)}}
title: Template reference
sort_rank: 5
# Template reference
Prometheus supports templating in the summary and description fields of
alerts, as well as in served console pages. Templates have the ability to run
queries against the local database, iterate over data, use conditionals, format
data, etc. The Prometheus templating language is based on the
[Go templating](http://golang.org/pkg/text/template/) system.
## Data Structures
The primary data structure for dealing with time series data is the sample, defined as:
type sample struct {
        Labels map[string]string
        Value  float64
}
The metric name of the sample is encoded in a special `__name__` label in the `Labels` map.
`[]sample` means a list of samples.
`interface{}` in Go is similar to a void pointer in C.
## Functions
In addition to the [default
functions](http://golang.org/pkg/text/template/#hdr-Functions) provided by Go
templating, Prometheus provides functions for easier processing of query
results in templates.
If functions are used in a pipeline, the pipeline value is passed as the last argument.
### Queries
| Name | Arguments | Returns | Notes |
| ------------- | ------------- | -------- | -------- |
| query | query string | []sample | Queries the database, does not support returning range vectors. |
| first | []sample | sample | Equivalent to `index a 0` |
| label | label, sample | string | Equivalent to `index sample.Labels label` |
| value | sample | float64 | Equivalent to `sample.Value` |
| sortByLabel | label, []sample | []sample | Sorts the samples by the given label. Is stable. |
`first`, `label` and `value` are intended to make query results easily usable in pipelines.
### Numbers
| Name | Arguments | Returns | Notes |
| ------------- | --------------| --------| --------- |
| humanize | number | string | Converts a number to a more readable format, using [metric prefixes](http://en.wikipedia.org/wiki/Metric_prefix). |
| humanize1024 | number | string | Like `humanize`, but uses 1024 as the base rather than 1000. |
| humanizeDuration | number | string | Converts a duration in seconds to a more readable format. |
| humanizeTimestamp | number | string | Converts a Unix timestamp in seconds to a more readable format. |
Humanizing functions are intended to produce reasonable output for consumption
by humans, and are not guaranteed to return the same results between Prometheus
versions.
### Strings
| Name | Arguments | Returns | Notes |
| ------------- | ------------- | ------- | ----------- |
| title | string | string | [strings.Title](http://golang.org/pkg/strings/#Title), capitalises first character of each word.|
| toUpper | string | string | [strings.ToUpper](http://golang.org/pkg/strings/#ToUpper), converts all characters to upper case.|
| toLower | string | string | [strings.ToLower](http://golang.org/pkg/strings/#ToLower), converts all characters to lower case.|
| match | pattern, text | boolean | [regexp.MatchString](http://golang.org/pkg/regexp/#MatchString) Tests for an unanchored regexp match. |
| reReplaceAll | pattern, replacement, text | string | [Regexp.ReplaceAllString](http://golang.org/pkg/regexp/#Regexp.ReplaceAllString) Regexp substitution, unanchored. |
| graphLink | expr | string | Returns path to graph view in the [expression browser](https://prometheus.io/docs/visualization/browser/) for the expression. |
| tableLink | expr | string | Returns path to tabular ("Console") view in the [expression browser](https://prometheus.io/docs/visualization/browser/) for the expression. |
### Others
| Name | Arguments | Returns | Notes |
| ------------- | ------------- | ------- | ----------- |
| args | []interface{} | map[string]interface{} | This converts a list of objects to a map with keys arg0, arg1 etc. This is intended to allow multiple arguments to be passed to templates. |
| tmpl | string, []interface{} | nothing | Like the built-in `template`, but allows non-literals as the template name. Note that the result is assumed to be safe, and will not be auto-escaped. Only available in consoles. |
| safeHtml | string | string | Marks string as HTML not requiring auto-escaping. |
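To make the behaviour of `args` concrete, here is a rough Python equivalent of what the table describes. It is only a sketch: the real function is implemented in Go inside Prometheus, and this `args` helper is a hypothetical stand-in that mirrors the `arg0`, `arg1`, ... key naming.

```python
def args(*values):
    """Map positional values to keys arg0, arg1, ... like the `args` function."""
    return {f"arg{i}": value for i, value in enumerate(values)}


# Mirrors {{template "myMultiArgTemplate" (args 1 2)}}: the template
# receives one map and reads .arg0 and .arg1 from it.
result = args(1, 2)
print(result)
```

Packing the values into a single map is what lets a template that accepts only one argument still receive several values.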
## Template type differences
Each type of template provides different information that can be used to
parameterize templates, and has a few other differences.
### Alert field templates
`.Value` and `.Labels` contain the alert value and labels. They are also exposed
as the `$value` and `$labels` variables for convenience.
### Console templates
Consoles are exposed on `/consoles/`, and sourced from the directory pointed to
by the `-web.console.templates` flag.
Console templates are rendered with
[html/template](http://golang.org/pkg/html/template/), which provides
auto-escaping. To bypass the auto-escaping use the `safe*` functions.
URL parameters are available as a map in `.Params`. To access multiple URL
parameters by the same name, `.RawParams` is a map of the list values for each
parameter. The URL path is available in `.Path`, excluding the `/consoles/`
prefix.
Consoles also have access to all the templates defined with `{{define
"templateName"}}...{{end}}` found in `*.lib` files in the directory pointed to
by the `-web.console.libraries` flag. As this is a shared namespace, take care
to avoid clashes with other users. Template names beginning with `prom`,
`_prom`, and `__` are reserved for use by Prometheus, as are the functions
listed above.
title: Federation
sort_rank: 6
# Federation
Federation allows a Prometheus server to scrape selected time series from
another Prometheus server.
## Use cases
There are different use cases for federation. Commonly, it is used to either
achieve scalable Prometheus monitoring setups or to pull related metrics from
one service's Prometheus into another.
### Hierarchical federation
Hierarchical federation allows Prometheus to scale to environments with tens of
data centers and millions of nodes. In this use case, the federation topology
resembles a tree, with higher-level Prometheus servers collecting aggregated
time series data from a larger number of subordinated servers.
For example, a setup might consist of many per-datacenter Prometheus servers
that collect data in high detail (instance-level drill-down), and a set of
global Prometheus servers which collect and store only aggregated data
(job-level drill-down) from those local servers. This provides an aggregate
global view and detailed local views.
### Cross-service federation
In cross-service federation, a Prometheus server of one service is configured
to scrape selected data from another service's Prometheus server to enable
alerting and queries against both datasets within a single server.
For example, a cluster scheduler running multiple services might expose
resource usage information (like memory and CPU usage) about service instances
running on the cluster. On the other hand, a service running on that cluster
will only expose application-specific service metrics. Often, these two sets of
metrics are scraped by separate Prometheus servers. Using federation, the
Prometheus server containing service-level metrics may pull in the cluster
resource usage metrics about its specific service from the cluster Prometheus,
so that both sets of metrics can be used within that server.
## Configuring federation
On any given Prometheus server, the `/federate` endpoint allows retrieving the
current value for a selected set of time series in that server. At least one
`match[]` URL parameter must be specified to select the series to expose. Each
`match[]` argument needs to specify an
[instant vector selector](querying/basics.md#instant-vector-selectors) like
`up` or `{job="api-server"}`. If multiple `match[]` parameters are provided,
the union of all matched series is selected.
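When querying `/federate` by hand, each `match[]` selector has to be URL-encoded and the parameter repeated once per selector. The following Python sketch shows one way to build such a URL; the `federate_url` helper and the host name are illustrative assumptions, not part of Prometheus.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    if not selectors:
        raise ValueError("at least one match[] selector is required")
    # A list of pairs lets the same parameter name appear several times.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base}/federate?{query}"


selectors = ['{job="prometheus"}', '{__name__=~"job:.*"}']
url = federate_url("http://source-prometheus-1:9090", selectors)
print(url)

# Decoding the query string gives back the original selectors.
decoded = parse_qs(urlsplit(url).query)["match[]"]
print(decoded)
```

The resulting URL can be fetched with any HTTP client to preview which series a federation scrape would pull in.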
To federate metrics from one server to another, configure your destination
Prometheus server to scrape from the `/federate` endpoint of a source server,
while also enabling the `honor_labels` scrape option (to not overwrite any
labels exposed by the source server) and passing in the desired `match[]`
parameters. For example, the following `scrape_config` federates any series
with the label `job="prometheus"` or a metric name starting with `job:` from
the Prometheus servers at `source-prometheus-{1,2,3}:9090` into the scraping
Prometheus:
- job_name: 'federate'
  scrape_interval: 15s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="prometheus"}'
      - '{__name__=~"job:.*"}'
  static_configs:
    - targets:
      - 'source-prometheus-1:9090'
      - 'source-prometheus-2:9090'
      - 'source-prometheus-3:9090'
title: Getting started
sort_rank: 1
# Getting started
This guide is a "Hello World"-style tutorial which shows how to install,
configure, and use Prometheus in a simple example setup. You will download and run
Prometheus locally, configure it to scrape itself and an example application,
and then work with queries, rules, and graphs to make use of the collected time
series data.
## Downloading and running Prometheus
[Download the latest release](https://prometheus.io/download) of Prometheus for
your platform, then extract and run it:
tar xvfz prometheus-*.tar.gz
cd prometheus-*
Before starting Prometheus, let's configure it.
## Configuring Prometheus to monitor itself
Prometheus collects metrics from monitored targets by scraping metrics HTTP
endpoints on these targets. Since Prometheus also exposes data in the same
manner about itself, it can also scrape and monitor its own health.
While a Prometheus server that collects only data about itself is not very
useful in practice, it is a good starting example. Save the following basic
Prometheus configuration as a file named `prometheus.yml`:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
For a complete specification of configuration options, see the
[configuration documentation](configuration/configuration.md).
## Starting Prometheus
To start Prometheus with your newly created configuration file, change to the
directory containing the Prometheus binary and run:
# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag -storage.local.path).
./prometheus -config.file=prometheus.yml
Prometheus should start up. You should also be able to browse to a status page
about itself at [localhost:9090](http://localhost:9090). Give it a couple of
seconds to collect data about itself from its own HTTP metrics endpoint.
You can also verify that Prometheus is serving metrics about itself by
navigating to its metrics endpoint: http://localhost:9090/metrics
The number of OS threads executed by Prometheus is controlled by the
`GOMAXPROCS` environment variable. As of Go 1.5 the default value is
the number of cores available.
Blindly setting `GOMAXPROCS` to a high value can be counterproductive. See the
relevant [Go FAQs](http://golang.org/doc/faq#Why_no_multi_CPU).
Prometheus by default uses around 3GB in memory. If you have a
smaller machine, you can tune Prometheus to use less memory. For details,
see the [memory usage documentation](storage.md#memory-usage).
## Using the expression browser
Let us try looking at some data that Prometheus has collected about itself. To
use Prometheus's built-in expression browser, navigate to
http://localhost:9090/graph and choose the "Console" view within the "Graph"
tab.
As you can gather from [localhost:9090/metrics](http://localhost:9090/metrics),
one metric that Prometheus exports about itself is called
`prometheus_target_interval_length_seconds` (the actual amount of time between
target scrapes). Go ahead and enter this into the expression console:
prometheus_target_interval_length_seconds
This should return a number of different time series (along with the latest value
recorded for each), all with the metric name
`prometheus_target_interval_length_seconds`, but with different labels. These
labels designate different latency percentiles and target group intervals.
If we were only interested in the 99th percentile latencies, we could use this
query to retrieve that information:
prometheus_target_interval_length_seconds{quantile="0.99"}
To count the number of returned time series, you could write:
count(prometheus_target_interval_length_seconds)
For more about the expression language, see the
[expression language documentation](querying/basics.md).
## Using the graphing interface
To graph expressions, navigate to http://localhost:9090/graph and use the "Graph"
tab.
For example, enter the following expression to graph the per-second rate of all
storage chunk operations happening in the self-scraped Prometheus:
Experiment with the graph range parameters and other settings.
## Starting up some sample targets
Let us make this more interesting and start some example targets for Prometheus
to scrape.
The Go client library includes an example which exports fictional RPC latencies
for three services with different latency distributions.
Ensure you have the [Go compiler installed](https://golang.org/doc/install) and
have a [working Go build environment](https://golang.org/doc/code.html) (with
correct `GOPATH`) set up.
Download the Go client library for Prometheus and run three of these example
processes:
# Fetch the client library code and compile example.
git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
go get -d
go build
# Start 3 example targets in separate terminals:
./random -listen-address=:8080
./random -listen-address=:8081
./random -listen-address=:8082
You should now have example targets listening on http://localhost:8080/metrics,
http://localhost:8081/metrics, and http://localhost:8082/metrics.
## Configuring Prometheus to monitor the sample targets
Now we will configure Prometheus to scrape these new targets. Let's group all
three endpoints into one job called `example-random`. However, imagine that the
first two endpoints are production targets, while the third one represents a
canary instance. To model this in Prometheus, we can add several groups of
endpoints to a single job, adding extra labels to each group of targets. In
this example, we will add the `group="production"` label to the first group of
targets, while adding `group="canary"` to the second.
To achieve this, add the following job definition to the `scrape_configs`
section in your `prometheus.yml` and restart your Prometheus instance:
- job_name: 'example-random'
  # Override the global default and scrape targets from this job every 5 seconds.
  scrape_interval: 5s
  static_configs:
    - targets: ['localhost:8080', 'localhost:8081']
      labels:
        group: 'production'
    - targets: ['localhost:8082']
      labels:
        group: 'canary'
Go to the expression browser and verify that Prometheus now has information
about time series that these example endpoints expose, such as the
`rpc_durations_seconds` metric.
## Configure rules for aggregating scraped data into new time series
Though not a problem in our example, queries that aggregate over thousands of
time series can get slow when computed ad-hoc. To make this more efficient,
Prometheus allows you to prerecord expressions into completely new persisted
time series via configured recording rules. Let's say we are interested in
recording the per-second rate of example RPCs
(`rpc_durations_seconds_count`) averaged over all instances (but
preserving the `job` and `service` dimensions) as measured over a window of 5
minutes. We could write this as:
avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
Try graphing this expression.
To record the time series resulting from this expression into a new metric
called `job_service:rpc_durations_seconds_count:avg_rate5m`, create a file
with the following recording rule and save it as `prometheus.rules`:
job_service:rpc_durations_seconds_count:avg_rate5m = avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
To make Prometheus pick up this new rule, add a top-level `rule_files` statement
to your `prometheus.yml`. The config should now
look like this:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'
rule_files:
  - 'prometheus.rules'
scrape_configs:
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'example-random'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'
      - targets: ['localhost:8082']
        labels:
          group: 'canary'
Restart Prometheus with the new configuration and verify that a new time series
with the metric name `job_service:rpc_durations_seconds_count:avg_rate5m`
is now available by querying it through the expression browser or graphing it.
# todo: internal
# Prometheus 1.8
Welcome to the documentation of the Prometheus server.
The documentation is available alongside all the project documentation at
## Content
- [Installing](install.md)
- [Getting started](getting_started.md)
- [Configuration](configuration/configuration.md)
- [Querying](querying/basics.md)
- [Storage](storage.md)
- [Federation](federation.md)
title: Installation
sort_rank: 2
# Installation
## Using pre-compiled binaries
We provide precompiled binaries for most official Prometheus components. Check
out the [download section](https://prometheus.io/download) for a list of all
available versions.
## From source
For building Prometheus components from source, see the `Makefile` targets in
the respective repository.
NOTE: **Note:** The documentation on this website refers to the latest stable
release (excluding pre-releases). The branch
[next-release](https://github.com/prometheus/docs/compare/next-release) refers
to unreleased changes that are in master branches of source repos.
## Using Docker
All Prometheus services are available as Docker images under the
[prom](https://hub.docker.com/u/prom/) organization.
Running Prometheus on Docker is as simple as `docker run -p 9090:9090
prom/prometheus`. This starts Prometheus with a sample configuration and
exposes it on port 9090.
The Prometheus image uses a volume to store the actual metrics. For
production deployments it is highly recommended to use the
[Data Volume Container](https://docs.docker.com/engine/admin/volumes/volumes/)
pattern to ease managing the data on Prometheus upgrades.
To provide your own configuration, there are several options. Here are
two examples.
### Volumes & bind-mount
Bind-mount your `prometheus.yml` from the host by running:
docker run -p 9090:9090 -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
       prom/prometheus
Or use an additional volume for the config:
docker run -p 9090:9090 -v /prometheus-data \
prom/prometheus -config.file=/prometheus-data/prometheus.yml
### Custom image
To avoid managing a file on the host and bind-mounting it, the
configuration can be baked into the image. This works well if the
configuration itself is rather static and the same across all
environments.
For this, create a new directory with a Prometheus configuration and a
`Dockerfile` like this:
FROM prom/prometheus
ADD prometheus.yml /etc/prometheus/
Now build and run it:
docker build -t my-prometheus .
docker run -p 9090:9090 my-prometheus
A more advanced option is to render the configuration dynamically on start
with some tooling or even have a daemon update it periodically.
## Using configuration management systems
If you prefer using configuration management systems you might be interested in
the following third-party contributions:
### Ansible
* [griggheo/ansible-prometheus](https://github.com/griggheo/ansible-prometheus)
* [William-Yeh/ansible-prometheus](https://github.com/William-Yeh/ansible-prometheus)
### Chef
* [rayrod2030/chef-prometheus](https://github.com/rayrod2030/chef-prometheus)
### Puppet
* [puppet/prometheus](https://forge.puppet.com/puppet/prometheus)
### SaltStack
* [bechtoldt/saltstack-prometheus-formula](https://github.com/bechtoldt/saltstack-prometheus-formula)
title: HTTP API
sort_rank: 7
The current stable HTTP API is reachable under `/api/v1` on a Prometheus
server. Any non-breaking additions will be added under that endpoint.
## Format overview
The API response format is JSON. Every successful API request returns a `2xx`
status code.
Invalid requests that reach the API handlers return a JSON error object
and one of the following HTTP response codes:
- `400 Bad Request` when parameters are missing or incorrect.
- `422 Unprocessable Entity` when an expression can't be executed.
- `503 Service Unavailable` when queries time out or abort.
Other non-`2xx` codes may be returned for errors occurring before the API
endpoint is reached.
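A client typically wants to turn these responses into either the payload or an exception. The Python sketch below shows one possible approach, assuming the JSON error object carries the `status`, `errorType` and `error` fields of the response envelope described in this section. The `unwrap` function, the `PrometheusAPIError` class and the sample bodies (including the `bad_data` error type) are illustrative assumptions.

```python
import json


class PrometheusAPIError(Exception):
    """Raised when a response reports an error (hypothetical helper)."""

    def __init__(self, http_status, error_type, message):
        super().__init__(f"{http_status} {error_type}: {message}")
        self.http_status = http_status
        self.error_type = error_type


def unwrap(http_status, body):
    """Return the `data` field on success, raise PrometheusAPIError otherwise."""
    envelope = json.loads(body)
    if 200 <= http_status < 300 and envelope.get("status") == "success":
        return envelope.get("data")
    raise PrometheusAPIError(
        http_status, envelope.get("errorType"), envelope.get("error")
    )


# Made-up response bodies for illustration.
ok = unwrap(200, '{"status": "success", "data": {"resultType": "scalar", "result": [0, "1"]}}')
print(ok["resultType"])

try:
    unwrap(400, '{"status": "error", "errorType": "bad_data", "error": "missing query"}')
except PrometheusAPIError as exc:
    failure = exc
    print(failure)
```

Errors raised before the API handler is reached may not have a JSON body at all, so a real client would also need to handle `json.loads` failing.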
The JSON response envelope format is as follows:
{
  "status": "success" | "error",
  "data": <data>,
  // Only set if status is "error". The data field may still hold
  // additional data.
  "errorType": "<string>",
  "error": "<string>"
}
Input timestamps may be provided either in
[RFC3339](https://www.ietf.org/rfc/rfc3339.txt) format or as a Unix timestamp
in seconds, with optional decimal places for sub-second precision. Output
timestamps are always represented as Unix timestamps in seconds.
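Since output timestamps are always Unix seconds, it can be handy to normalize input timestamps to the same representation on the client side. The sketch below is one way to do that in Python; `to_unix_seconds` is a hypothetical helper, and `datetime.fromisoformat` accepts only a subset of RFC3339, so treat it as an approximation rather than a full parser.

```python
from datetime import datetime, timezone


def to_unix_seconds(value: str) -> float:
    """Convert an RFC3339-style string or a Unix timestamp string to seconds."""
    try:
        # Plain Unix timestamps, optionally with decimal places.
        return float(value)
    except ValueError:
        pass
    # Older Python versions do not accept a trailing "Z", so spell out UTC.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption of this sketch: treat naive timestamps as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


from_rfc3339 = to_unix_seconds("2015-07-01T20:10:51.781Z")
from_unix = to_unix_seconds("1435781451.781")
print(from_rfc3339, from_unix)
```

Both calls describe the same instant, so a client can compare or sort timestamps without caring which input form was used.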
Names of query parameters that may be repeated end with `[]`.
`<series_selector>` placeholders refer to Prometheus [time series
selectors](basics.md#time-series-selectors) like `http_requests_total` or
`http_requests_total{method=~"^GET|POST$"}` and need to be URL-encoded.
`<duration>` placeholders refer to Prometheus duration strings of the form
`[0-9]+[smhdwy]`. For example, `5m` refers to a duration of 5 minutes.
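A minimal sketch of parsing such duration strings into seconds is shown below. It assumes a week is 7 days and a year is 365 days, and it accepts only the single-unit form given above; the function name is hypothetical.

```python
import re

# Seconds per unit; week = 7 days and year = 365 days are assumptions of this sketch.
_UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}


def duration_seconds(text: str) -> int:
    """Parse a duration of the form [0-9]+[smhdwy] into seconds."""
    match = re.fullmatch(r"([0-9]+)([smhdwy])", text)
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```

For example, `duration_seconds("5m")` evaluates to 300, and a compound string such as `1h30m` is rejected because it does not fit the documented form.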
## Expression queries
Query language expressions may be evaluated at a single instant or over a range
of time. The sections below describe the API endpoints for each type of
expression query.
### Instant queries
The following endpoint evaluates an instant query at a single point in time:
GET /api/v1/query
URL query parameters:
- `query=<string>`: Prometheus expression query string.
- `time=<rfc3339 | unix_timestamp>`: Evaluation timestamp. Optional.
- `timeout=<duration>`: Evaluation timeout. Optional. Defaults to and
is capped by the value of the `-query.timeout` flag.
The current server time is used if the `time` parameter is omitted.
The `data` section of the query result has the following format:
{
  "resultType": "matrix" | "vector" | "scalar" | "string",
  "result": <value>
}
`<value>` refers to the query result data, which has varying formats
depending on the `resultType`. See the [expression query result
The following example evaluates the expression `up` at the time
$ curl 'http://localhost:9090/api/v1/query?query=up&time=2015-07-01T20:10:51.781Z'
{
  "status" : "success",
  "data" : {
    "resultType" : "vector",
    "result" : [
      {
        "metric" : {
          "__name__" : "up",
          "job" : "prometheus",
          "instance" : "localhost:9090"
        },
        "value": [ 1435781451.781, "1" ]
      },
      {
        "metric" : {
          "__name__" : "up",
          "job" : "node",
          "instance" : "localhost:9100"
        },
        "value" : [ 1435781451.781, "0" ]
      }
    ]
  }
}
### Range queries
The following endpoint evaluates an expression query over a range of time:
GET /api/v1/query_range
URL query parameters:
- `query=<string>`: Prometheus expression query string.
- `start=<rfc3339 | unix_timestamp>`: Start timestamp.
- `end=<rfc3339 | unix_timestamp>`: End timestamp.
- `step=<duration>`: Query resolution step width.
- `timeout=<duration>`: Evaluation timeout. Optional. Defaults to and
is capped by the value of the `-query.timeout` flag.
The `data` section of the query result has the following format:
{
  "resultType": "matrix",
  "result": <value>
}
For the format of the `<value>` placeholder, see the [range-vector result
The following example evaluates the expression `up` over a 30-second range with
a query resolution of 15 seconds.
$ curl 'http://localhost:9090/api/v1/query_range?query=up&start=2015-07-01T20:10:30.781Z&end=2015-07-01T20:11:00.781Z&step=15s'
{
  "status" : "success",
  "data" : {
    "resultType" : "matrix",
    "result" : [
      {
        "metric" : {
          "__name__" : "up",
          "job" : "prometheus",
          "instance" : "localhost:9090"
        },
        "values" : [
          [ 1435781430.781, "1" ],
          [ 1435781445.781, "1" ],
          [ 1435781460.781, "1" ]
        ]
      },
      {
        "metric" : {
          "__name__" : "up",
          "job" : "node",
          "instance" : "localhost:9091"
        },
        "values" : [
          [ 1435781430.781, "0" ],
          [ 1435781445.781, "0" ],
          [ 1435781460.781, "1" ]
        ]
      }
    ]
  }
}
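The three timestamps in the example response follow from `start`, `end`, and `step`: the expression is evaluated at `start` and then every `step` seconds up to and including `end`. The sketch below reproduces that arithmetic under this assumption; the function name is hypothetical, and it works in integer milliseconds to avoid floating-point drift.

```python
def evaluation_timestamps(start: float, end: float, step: float) -> list:
    """Return the Unix timestamps at which a range query is evaluated."""
    start_ms = round(start * 1000)
    end_ms = round(end * 1000)
    step_ms = round(step * 1000)
    if step_ms <= 0:
        raise ValueError("step must be positive")
    # Step from start to end (inclusive) in whole milliseconds.
    return [ms / 1000 for ms in range(start_ms, end_ms + 1, step_ms)]


timestamps = evaluation_timestamps(1435781430.781, 1435781460.781, 15)
```

A 30-second range with a 15-second step therefore yields three samples per series, which matches the `values` arrays above.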
## Querying metadata
### Finding series by label matchers
The following endpoint returns the list of time series that match a certain label set.
GET /api/v1/series
URL query parameters:
- `match[]=<series_selector>`: Repeated series selector argument that selects the
series to return. At least one `match[]` argument must be provided.
- `start=<rfc3339 | unix_timestamp>`: Start timestamp.
- `end=<rfc3339 | unix_timestamp>`: End timestamp.
The `data` section of the query result consists of a list of objects that
contain the label name/value pairs which identify each series.
The following example returns all series that match either of the selectors
`up` or `process_start_time_seconds{job="prometheus"}`:
$ curl -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'
{
  "status" : "success",
  "data" : [
    {
      "__name__" : "up",
      "job" : "prometheus",
      "instance" : "localhost:9090"
    },
    {
      "__name__" : "up",
      "job" : "node",
      "instance" : "localhost:9091"
    },
    {
      "__name__" : "process_start_time_seconds",
      "job" : "prometheus",
      "instance" : "localhost:9090"
    }
  ]
}
### Querying label values
The following endpoint returns a list of label values for a provided label name:
GET /api/v1/label/<label_name>/values
The `data` section of the JSON response is a list of string label values.
This example queries for all label values for the `job` label:
$ curl http://localhost:9090/api/v1/label/job/values
"status" : "success",
"data" : [
## Deleting series
The following endpoint deletes matched series entirely from a Prometheus server:
DELETE /api/v1/series
URL query parameters:
- `match[]=<series_selector>`: Repeated label matcher argument that selects the
series to delete. At least one `match[]` argument must be provided.
The `data` section of the JSON response has the following format:
{
  "numDeleted": <number of deleted series>
}
The following example deletes all series that match either of the selectors
`up` or `process_start_time_seconds{job="prometheus"}`:
$ curl -XDELETE -g 'http://localhost:9090/api/v1/series?match[]=up&match[]=process_start_time_seconds{job="prometheus"}'
{
  "status" : "success",
  "data" : {
    "numDeleted" : 3
  }
}
## Expression query result formats
Expression queries may return the following response values in the `result`
property of the `data` section. `<sample_value>` placeholders are numeric
sample values. JSON does not support special float values such as `NaN`, `Inf`,
and `-Inf`, so sample values are transferred as quoted JSON strings rather than
raw numbers.
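On the client side this means every sample value has to be converted from a string. The sketch below assumes the special values are spelled `NaN`, `+Inf`, and `-Inf` in the response, which Python's `float()` parses directly; the function name is hypothetical.

```python
import math


def parse_sample(pair):
    """Convert a [<unix_time>, "<sample_value>"] pair into (timestamp, value) floats."""
    timestamp, raw_value = pair
    # float() accepts regular decimals as well as "NaN", "+Inf" and "-Inf".
    return float(timestamp), float(raw_value)


samples = [parse_sample(p) for p in [[1435781451.781, "1"], [1435781451.781, "+Inf"]]]
```

Comparisons against `NaN` need `math.isnan()`, since `NaN` never compares equal to itself.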
### Range vectors
Range vectors are returned as result type `matrix`. The corresponding
`result` property has the following format:
[
  {
    "metric": { "<label_name>": "<label_value>", ... },
    "values": [ [ <unix_time>, "<sample_value>" ], ... ]
  },
  ...
]
### Instant vectors
Instant vectors are returned as result type `vector`. The corresponding
`result` property has the following format:
[
  {
    "metric": { "<label_name>": "<label_value>", ... },
    "value": [ <unix_time>, "<sample_value>" ]
  },
  ...
]
### Scalars
Scalar results are returned as result type `scalar`. The corresponding
`result` property has the following format:
[ <unix_time>, "<scalar_value>" ]
### Strings
String results are returned as result type `string`. The corresponding
`result` property has the following format:
[ <unix_time>, "<string_value>" ]
## Targets
> This API is experimental as it is intended to be extended with targets
> dropped due to relabelling in the future.
The following endpoint returns an overview of the current state of the
Prometheus target discovery:
GET /api/v1/targets
Currently only the active targets are part of the response.
$ curl http://localhost:9090/api/v1/targets
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "",
          "__metrics_path__": "/metrics",
          "__scheme__": "http",
          "job": "prometheus"
        },
        "labels": {
          "instance": "",
          "job": "prometheus"
        },
        "scrapeUrl": "",
        "lastError": "",
        "lastScrape": "2017-01-17T15:07:44.723715405+01:00",
        "health": "up"
      }
    ]
  }
}
## Alertmanagers
> This API is experimental as it is intended to be extended with Alertmanagers
> dropped due to relabelling in the future.
The following endpoint returns an overview of the current state of the
Prometheus Alertmanager discovery:
GET /api/v1/alertmanagers
Currently only the active Alertmanagers are part of the response.
$ curl http://localhost:9090/api/v1/alertmanagers
{
  "status": "success",
  "data": {
    "activeAlertmanagers": [
      {
        "url": ""
      }
    ]
  }
}
title: Querying basics
nav_title: Basics
sort_rank: 1
# Querying Prometheus
Prometheus provides a functional expression language that lets the user select
and aggregate time series data in real time. The result of an expression can
either be shown as a graph, viewed as tabular data in Prometheus's expression
browser, or consumed by external systems via the [HTTP API](api.md).
## Examples
This document is meant as a reference. For learning, it might be easier to
start with a couple of [examples](examples.md).
## Expression language data types
In Prometheus's expression language, an expression or sub-expression can
evaluate to one of four types:
* **Instant vector** - a set of time series containing a single sample for each time series, all sharing the same timestamp
* **Range vector** - a set of time series containing a range of data points over time for each time series
* **Scalar** - a simple numeric floating point value
* **String** - a simple string value; currently unused
Depending on the use-case (e.g. when graphing vs. displaying the output of an
expression), only some of these types are legal as the result from a
user-specified expression. For example, an expression that returns an instant
vector is the only type that can be directly graphed.
## Literals
### String literals
Strings may be specified as literals in single quotes, double quotes or backticks.
PromQL follows the same [escaping rules as
Go](https://golang.org/ref/spec#String_literals). In single or double quotes a
backslash begins an escape sequence, which may be followed by `a`, `b`, `f`,
`n`, `r`, `t`, `v` or `\`. Specific characters can be provided using octal
(`\nnn`) or hexadecimal (`\xnn`, `\unnnn` and `\Unnnnnnnn`).
No escaping is processed inside backticks. Unlike Go, Prometheus does not discard newlines inside backticks.
"this is a string"
'these are unescaped: \n \\ \t'
`these are not unescaped: \n ' " \t`
### Float literals
Scalar float values can be literally written as numbers of the form
## Time series Selectors
### Instant vector selectors
Instant vector selectors allow the selection of a set of time series and a
single sample value for each at a given timestamp (instant): in the simplest
form, only a metric name is specified. This results in an instant vector
containing elements for all time series that have this metric name.
This example selects all time series that have the `http_requests_total` metric
It is possible to filter these time series further by appending a set of labels
to match in curly braces (`{}`).
This example selects only those time series with the `http_requests_total`
metric name that also have the `job` label set to `prometheus` and their
`group` label set to `canary`:
It is also possible to negatively match a label value, or to match label values
against regular expressions. The following label matching operators exist:
* `=`: Select labels that are exactly equal to the provided string.
* `!=`: Select labels that are not equal to the provided string.
* `=~`: Select labels that regex-match the provided string.
* `!~`: Select labels that do not regex-match the provided string.
For example, this selects all `http_requests_total` time series for `staging`,
`testing`, and `development` environments and HTTP methods other than `GET`.
Label matchers that match empty label values also select all time series that do
not have the specific label set at all. Regex-matches are fully anchored.
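The matcher semantics described above can be modelled in a few lines. This is only an approximation: Prometheus uses Go's regular expression syntax, whereas the sketch uses Python's `re` module, and the function name `matches` is hypothetical. It treats a missing label as the empty string and anchors regex matches at both ends.

```python
import re


def matches(labels: dict, name: str, op: str, value: str) -> bool:
    """Evaluate one label matcher against the label set of a series."""
    actual = labels.get(name, "")  # A label that is not set behaves like an empty value.
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # Fully anchored match.
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


series = {"__name__": "http_requests_total", "job": "apiserver"}
```

With this model, `job=~"server"` does not select the series because the pattern has to cover the whole value, while `job=~".*server"` does. A matcher such as `env=~".*"` also selects it even though no `env` label is set.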
Vector selectors must either specify a name or at least one label matcher
that does not match the empty string. The following expression is illegal:
{job=~".*"} # Bad!
In contrast, these expressions are valid as they both have a selector that does not
match empty label values.
{job=~".+"} # Good!
{job=~".*",method="get"} # Good!
Label matchers can also be applied to metric names by matching against the internal
`__name__` label. For example, the expression `http_requests_total` is equivalent to
`{__name__="http_requests_total"}`. Matchers other than `=` (`!=`, `=~`, `!~`) may also be used.
The following expression selects all metrics that have a name starting with `job:`:
### Range Vector Selectors
Range vector literals work like instant vector literals, except that they
select a range of samples back from the current instant. Syntactically, a range
duration is appended in square brackets (`[]`) at the end of a vector selector
to specify how far back in time values should be fetched for each resulting
range vector element.
Time durations are specified as a number, followed immediately by one of the
following units:
* `s` - seconds
* `m` - minutes
* `h` - hours
* `d` - days
* `w` - weeks
* `y` - years
In this example, we select all the values we have recorded within the last 5
minutes for all time series that have the metric name `http_requests_total` and
a `job` label set to `prometheus`:
### Offset modifier
The `offset` modifier allows changing the time offset for individual
instant and range vectors in a query.
For example, the following expression returns the value of
`http_requests_total` 5 minutes in the past relative to the current
query evaluation time:
http_requests_total offset 5m
Note that the `offset` modifier always needs to follow the selector
immediately, i.e. the following would be correct:
sum(http_requests_total{method="GET"} offset 5m) // GOOD.
While the following would be *incorrect*:
sum(http_requests_total{method="GET"}) offset 5m // INVALID.
The same works for range vectors. This returns the 5-minute rate that
`http_requests_total` had a week ago:
rate(http_requests_total[5m] offset 1w)
## Operators
Prometheus supports many binary and aggregation operators. These are described
in detail in the [expression language operators](operators.md) page.
## Functions
Prometheus supports several functions to operate on data. These are described
in detail in the [expression language functions](functions.md) page.
## Gotchas
### Interpolation and staleness
When queries are run, timestamps at which to sample data are selected
independently of the actual present time series data. This is mainly to support
cases like aggregation (`sum`, `avg`, and so on), where multiple aggregated
time series do not exactly align in time. Because of their independence,
Prometheus needs to assign a value at those timestamps for each relevant time
series. It does so by simply taking the newest sample before this timestamp.
If no stored sample is found within (by default) 5 minutes before a sampling timestamp,
no value is assigned for this time series at this point in time. This
effectively means that time series "disappear" from graphs at times where their
latest collected sample is older than 5 minutes.
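The lookup rule can be sketched as follows. The sketch assumes samples are sorted by timestamp, that a sample exactly at the sampling timestamp counts as the newest one, and that a sample exactly 5 minutes old is still used; these boundary choices and the names are assumptions for illustration.

```python
from bisect import bisect_right

LOOKBACK_SECONDS = 5 * 60  # Default staleness window described above.


def value_at(samples, t, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value at or before t, or None if it is too old.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    index = bisect_right(timestamps, t)  # Number of samples with timestamp <= t.
    if index == 0:
        return None  # No sample before the sampling timestamp.
    ts, value = samples[index - 1]
    if t - ts > lookback:
        return None  # The series has "disappeared" at this point in time.
    return value


scrapes = [(0, 1.0), (15, 2.0), (30, 3.0)]
```

Sampling `scrapes` at `t=20` returns the value scraped at 15 seconds, whereas sampling at `t=400` returns `None` because the latest sample is older than the lookback window.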
NOTE: Staleness and interpolation handling might change. See
https://github.com/prometheus/prometheus/issues/398 and
### Avoiding slow queries and overloads
If a query needs to operate on a very large amount of data, graphing it might
time out or overload the server or browser. Thus, when constructing queries
over unknown data, always start building the query in the tabular view of
Prometheus's expression browser until the result set seems reasonable
(hundreds, not thousands, of time series at most). Only when you have filtered
or aggregated your data sufficiently, switch to graph mode. If the expression
still takes too long to graph ad-hoc, pre-record it via a [recording
This is especially relevant for Prometheus's query language, where a bare
metric name selector like `api_http_requests_total` could expand to thousands
of time series with different labels. Also keep in mind that expressions which
aggregate over many time series will generate load on the server even if the
output is only a small number of time series. This is similar to how it would
be slow to sum all values of a column in a relational database, even if the
output value is only a single number.
title: Querying examples
nav_title: Examples
sort_rank: 4
# Query examples
## Simple time series selection
Return all time series with the metric `http_requests_total`:
Return all time series with the metric `http_requests_total` and the given
`job` and `handler` labels:
http_requests_total{job="apiserver", handler="/api/comments"}
Return a whole range of time (in this case 5 minutes) for the same vector,
making it a range vector:
http_requests_total{job="apiserver", handler="/api/comments"}[5m]
Note that an expression resulting in a range vector cannot be graphed directly,
but can be viewed in the tabular ("Console") view of the expression browser.
Using regular expressions, you could select time series only for jobs whose
names match a certain pattern, in this case, all jobs that end with `server`.
Note that regex matches are fully anchored, so the pattern has to match the whole label value:
To select all HTTP status codes except 4xx ones, you could run:
## Using functions, operators, etc.
Return the per-second rate for all time series with the `http_requests_total`
metric name, as measured over the last 5 minutes:
Assuming that the `http_requests_total` time series all have the labels `job`
(fanout by job name) and `instance` (fanout by instance of the job), we might
want to sum over the rate of all instances, so we get fewer output time series,
but still preserve the `job` dimension:
sum(rate(http_requests_total[5m])) by (job)
If we have two different metrics with the same dimensional labels, we can apply
binary operators to them and elements on both sides with the same label set
will get matched and propagated to the output. For example, this expression
returns the unused memory in MiB for every instance (on a fictional cluster
scheduler exposing these metrics about the instances it runs):
(instance_memory_limit_bytes - instance_memory_usage_bytes) / 1024 / 1024
The same expression, but summed by application, could be written like this:
sum(
  instance_memory_limit_bytes - instance_memory_usage_bytes
) by (app, proc) / 1024 / 1024
If the same fictional cluster scheduler exposed CPU usage metrics like the
following for every instance:
instance_cpu_time_ns{app="lion", proc="web", rev="34d0f99", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="elephant", proc="worker", rev="34d0f99", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="turtle", proc="api", rev="4d3a513", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="fox", proc="widget", rev="4d3a513", env="prod", job="cluster-manager"}
...we could get the top 3 CPU users grouped by application (`app`) and process
type (`proc`) like this:
topk(3, sum(rate(instance_cpu_time_ns[5m])) by (app, proc))
Assuming this metric contains one time series per running instance, you could
count the number of running instances per application like this:
count(instance_cpu_time_ns) by (app)
title: Query functions
nav_title: Functions
sort_rank: 3
# Functions
Some functions have default arguments, e.g. `year(v=vector(time())
instant-vector)`. This means that there is one argument `v` which is an instant
vector and which, if not provided, defaults to the value of the expression
`vector(time())`.
## `abs()`
`abs(v instant-vector)` returns the input vector with all sample values converted to
their absolute value.
## `absent()`
`absent(v instant-vector)` returns an empty vector if the vector passed to it
has any elements and a 1-element vector with the value 1 if the vector passed to
it has no elements.
This is useful for alerting on when no time series exist for a given metric name
and label combination.
# => {job="myjob"}
# => {job="myjob"}
# => {}
In the second example, `absent()` tries to be smart about deriving labels of the
1-element output vector from the input vector.
## `ceil()`
`ceil(v instant-vector)` rounds the sample values of all elements in `v` up to
the nearest integer.
## `changes()`
For each input time series, `changes(v range-vector)` returns the number of
times its value has changed within the provided time range as an instant
vector.
## `clamp_max()`
`clamp_max(v instant-vector, max scalar)` clamps the sample values of all
elements in `v` to have an upper limit of `max`.
## `clamp_min()`
`clamp_min(v instant-vector, min scalar)` clamps the sample values of all
elements in `v` to have a lower limit of `min`.
## `count_scalar()`
`count_scalar(v instant-vector)` returns the number of elements in a time series
vector as a scalar. This is in contrast to the `count()`
[aggregation operator](operators.md#aggregation-operators), which
always returns a vector (an empty one if the input vector is empty) and allows
grouping by labels via a `by` clause.
## `day_of_month()`
`day_of_month(v=vector(time()) instant-vector)` returns the day of the month
for each of the given times in UTC. Returned values are from 1 to 31.
## `day_of_week()`
`day_of_week(v=vector(time()) instant-vector)` returns the day of the week for
each of the given times in UTC. Returned values are from 0 to 6, where 0 means
Sunday etc.
## `days_in_month()`
`days_in_month(v=vector(time()) instant-vector)` returns the number of days in the
month for each of the given times in UTC. Returned values are from 28 to 31.
## `delta()`
`delta(v range-vector)` calculates the difference between the
first and last value of each time series element in a range vector `v`,
returning an instant vector with the given deltas and equivalent labels.
The delta is extrapolated to cover the full time range as specified in
the range vector selector, so that it is possible to get a non-integer
result even if the sample values are all integers.
The following example expression returns the difference in CPU temperature
between now and 2 hours ago:
`delta` should only be used with gauges.
## `deriv()`
`deriv(v range-vector)` calculates the per-second derivative of the time series in a range
vector `v`, using [simple linear regression](http://en.wikipedia.org/wiki/Simple_linear_regression).
`deriv` should only be used with gauges.
## `drop_common_labels()`
`drop_common_labels(instant-vector)` drops all labels that have the same name
and value across all series in the input vector.
## `exp()`
`exp(v instant-vector)` calculates the exponential function for all elements in `v`.
Special cases are:
* `Exp(+Inf) = +Inf`
* `Exp(NaN) = NaN`
## `floor()`
`floor(v instant-vector)` rounds the sample values of all elements in `v` down
to the nearest integer.
## `histogram_quantile()`
`histogram_quantile(φ float, b instant-vector)` calculates the φ-quantile (0 ≤ φ
≤ 1) from the buckets `b` of a
[histogram](https://prometheus.io/docs/concepts/metric_types/#histogram). (See
[histograms and summaries](https://prometheus.io/docs/practices/histograms) for
a detailed explanation of φ-quantiles and the usage of the histogram metric type
in general.) The samples in `b` are the counts of observations in each bucket.
Each sample must have a label `le` where the label value denotes the inclusive
upper bound of the bucket. (Samples without such a label are silently ignored.)
The [histogram metric type](https://prometheus.io/docs/concepts/metric_types/#histogram)
automatically provides time series with the `_bucket` suffix and the appropriate
labels.
Use the `rate()` function to specify the time window for the quantile
calculation.
Example: A histogram metric is called `http_request_duration_seconds`. To
calculate the 90th percentile of request durations over the last 10m, use the
following expression:
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
The quantile is calculated for each label combination in
`http_request_duration_seconds`. To aggregate, use the `sum()` aggregator
around the `rate()` function. Since the `le` label is required by
`histogram_quantile()`, it has to be included in the `by` clause. The following
expression aggregates the 90th percentile by `job`:
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (job, le))
To aggregate everything, specify only the `le` label:
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))
The `histogram_quantile()` function interpolates quantile values by
assuming a linear distribution within a bucket. The highest bucket
must have an upper bound of `+Inf`. (Otherwise, `NaN` is returned.) If
a quantile is located in the highest bucket, the upper bound of the
second highest bucket is returned. A lower limit of the lowest bucket
is assumed to be 0 if the upper bound of that bucket is greater than
0. In that case, the usual linear interpolation is applied within that
bucket. Otherwise, the upper bound of the lowest bucket is returned
for quantiles located in the lowest bucket.
If `b` contains fewer than two buckets, `NaN` is returned. For φ < 0, `-Inf` is
returned. For φ > 1, `+Inf` is returned.
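The interpolation rules above can be sketched as a small function. This is an illustrative reading of the description, not the server's code: the name `bucket_quantile` is hypothetical, and the sketch assumes each bucket is given as an `(upper_bound, count)` pair whose count is cumulative, i.e. it includes all observations less than or equal to the bound.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    if q < 0:
        return -math.inf
    if q > 1:
        return math.inf
    if len(buckets) < 2:
        return math.nan
    buckets = sorted(buckets)
    if buckets[-1][0] != math.inf:
        return math.nan  # The highest bucket must have an upper bound of +Inf.
    rank = q * buckets[-1][1]
    # First bucket (ignoring the +Inf one) whose cumulative count reaches the rank.
    b = next((i for i, (_, c) in enumerate(buckets[:-1]) if c >= rank), len(buckets) - 1)
    if b == len(buckets) - 1:
        return buckets[-2][0]  # Quantile falls into the highest bucket.
    if b == 0 and buckets[0][0] <= 0:
        return buckets[0][0]  # Lowest bucket without an assumed lower limit of 0.
    start, previous = buckets[b - 1] if b > 0 else (0.0, 0.0)
    end, count = buckets[b]
    in_bucket = count - previous
    if in_bucket == 0:
        return math.nan
    # Linear interpolation within the bucket.
    return start + (end - start) * (rank - previous) / in_bucket


example = [(0.1, 10.0), (0.5, 60.0), (1.0, 90.0), (math.inf, 100.0)]
```

For `example`, the 0.5-quantile lands in the bucket between 0.1 and 0.5 and is interpolated to roughly 0.42, while the 0.99-quantile lies in the `+Inf` bucket, so the upper bound of the second highest bucket (1.0) is returned.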
## `holt_winters()`
`holt_winters(v range-vector, sf scalar, tf scalar)` produces a smoothed value
for time series based on the range in `v`. The lower the smoothing factor `sf`,
the more importance is given to old data. The higher the trend factor `tf`, the
more trends in the data are considered. Both `sf` and `tf` must be between 0 and 1.
`holt_winters` should only be used with gauges.
## `hour()`
`hour(v=vector(time()) instant-vector)` returns the hour of the day
for each of the given times in UTC. Returned values are from 0 to 23.
## `idelta()`
`idelta(v range-vector)` calculates the difference between the last two samples
in the range vector `v`, returning an instant vector with the given deltas and
equivalent labels.
`idelta` should only be used with gauges.
## `increase()`
`increase(v range-vector)` calculates the increase in the
time series in the range vector. Breaks in monotonicity (such as counter
resets due to target restarts) are automatically adjusted for. The
increase is extrapolated to cover the full time range as specified
in the range vector selector, so that it is possible to get a
non-integer result even if a counter increases only by integer increments.
The following example expression returns the number of HTTP requests as measured
over the last 5 minutes, per time series in the range vector:
`increase` should only be used with counters. It is syntactic sugar
for `rate(v)` multiplied by the number of seconds under the specified
time range window, and should be used primarily for human readability.
Use `rate` in recording rules so that increases are tracked consistently
on a per-second basis.
## `irate()`
`irate(v range-vector)` calculates the per-second instant rate of increase of
the time series in the range vector. This is based on the last two data points.
Breaks in monotonicity (such as counter resets due to target restarts) are
automatically adjusted for.
The following example expression returns the per-second rate of HTTP requests
looking up to 5 minutes back for the two most recent data points, per time
series in the range vector:
`irate` should only be used when graphing volatile, fast-moving counters.
Use `rate` for alerts and slow-moving counters, as brief changes
in the rate can reset the `FOR` clause and graphs consisting entirely of rare
spikes are hard to read.
Note that when combining `irate()` with an
[aggregation operator](operators.md#aggregation-operators) (e.g. `sum()`)
or a function aggregating over time (any function ending in `_over_time`),
always take an `irate()` first, then aggregate. Otherwise `irate()` cannot detect
counter resets when your target restarts.
## `label_join()`
For each timeseries in `v`, `label_join(v instant-vector, dst_label string, separator string, src_label_1 string, src_label_2 string, ...)` joins all the values of all the `src_labels`
using `separator` and returns the timeseries with the label `dst_label` containing the joined value.
There can be any number of `src_labels` in this function.
This example will return a vector with each time series having a `foo` label with the value `a,b,c` added to it:
label_join(up{job="api-server",src1="a",src2="b",src3="c"}, "foo", ",", "src1", "src2", "src3")
## `label_replace()`
For each timeseries in `v`, `label_replace(v instant-vector, dst_label string,
replacement string, src_label string, regex string)` matches the regular
expression `regex` against the label `src_label`. If it matches, then the
timeseries is returned with the label `dst_label` replaced by the expansion of
`replacement`. `$1` is replaced with the first matching subgroup, `$2` with the
second etc. If the regular expression doesn't match then the timeseries is
returned unchanged.
This example will return a vector with each time series having a `foo`
label with the value `a` added to it:
label_replace(up{job="api-server",service="a:c"}, "foo", "$1", "service", "(.*):.*")
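The behaviour in this example can be approximated with Python's `re` module. The sketch below is an assumption-laden model rather than the actual implementation: it anchors the regex at both ends, treats a missing source label as empty, supports only numeric `$1`-style references, and uses Python rather than Go regular expression syntax.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Return a copy of `labels` with `dst_label` set if `regex` matches `src_label`."""
    match = re.fullmatch(regex, labels.get(src_label, ""))
    if match is None:
        return dict(labels)  # No match: the series is returned unchanged.
    # Translate $1, $2, ... into Python's \g<1>, \g<2>, ... group references.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[dst_label] = match.expand(template)
    return result


series = {"__name__": "up", "job": "api-server", "service": "a:c"}
replaced = label_replace(series, "foo", "$1", "service", "(.*):.*")
```

Here `replaced` carries the additional label `foo="a"`, because the first subgroup captures everything before the colon in `a:c`.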
## `ln()`
`ln(v instant-vector)` calculates the natural logarithm for all elements in `v`.
Special cases are:
* `ln(+Inf) = +Inf`
* `ln(0) = -Inf`
* `ln(x < 0) = NaN`
* `ln(NaN) = NaN`
## `log2()`
`log2(v instant-vector)` calculates the binary logarithm for all elements in `v`.
The special cases are equivalent to those in `ln`.
## `log10()`
`log10(v instant-vector)` calculates the decimal logarithm for all elements in `v`.
The special cases are equivalent to those in `ln`.
## `minute()`
`minute(v=vector(time()) instant-vector)` returns the minute of the hour for each
of the given times in UTC. Returned values are from 0 to 59.
## `month()`
`month(v=vector(time()) instant-vector)` returns the month of the year for each
of the given times in UTC. Returned values are from 1 to 12, where 1 means
January etc.
## `predict_linear()`
`predict_linear(v range-vector, t scalar)` predicts the value of time series
`t` seconds from now, based on the range vector `v`, using [simple linear
`predict_linear` should only be used with gauges.
## `rate()`
`rate(v range-vector)` calculates the per-second average rate of increase of the
time series in the range vector. Breaks in monotonicity (such as counter
resets due to target restarts) are automatically adjusted for. Also, the
calculation extrapolates to the ends of the time range, allowing for missed
scrapes or imperfect alignment of scrape cycles with the range's time period.
The following example expression returns the per-second rate of HTTP requests as measured
over the last 5 minutes, per time series in the range vector:
`rate` should only be used with counters. It is best suited for alerting,
and for graphing of slow-moving counters.
Note that when combining `rate()` with an aggregation operator (e.g. `sum()`)
or a function aggregating over time (any function ending in `_over_time`),
always take a `rate()` first, then aggregate. Otherwise `rate()` cannot detect
counter resets when your target restarts.
## `resets()`
For each input time series, `resets(v range-vector)` returns the number of
counter resets within the provided time range as an instant vector. Any
decrease in the value between two consecutive samples is interpreted as a
counter reset.
`resets` should only be used with counters.
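The reset rule, and the way counter functions adjust for it, can be illustrated on a plain list of sample values. This is a simplified sketch with hypothetical function names: it ignores timestamps and does not perform the extrapolation that `rate()` and `increase()` apply.

```python
def resets(values):
    """Count decreases between consecutive samples, each interpreted as a counter reset."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)


def raw_increase(values):
    """Total increase across the samples, treating each decrease as a restart from zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        # After a reset the counter restarted at zero, so the whole new value counts.
        total += cur if cur < prev else cur - prev
    return total


counter = [1, 5, 2, 4]
```

For `counter`, one reset is detected (5 to 2) and the reset-adjusted increase is 8: four before the reset, two from the restart to the next sample, and two afterwards.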
## `round()`
`round(v instant-vector, to_nearest=1 scalar)` rounds the sample values of all
elements in `v` to the nearest integer. Ties are resolved by rounding up. The
optional `to_nearest` argument allows specifying the nearest multiple to which
the sample values should be rounded. This multiple may also be a fraction.
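One way to express this rounding rule is shown below. It is a sketch of the described behaviour (ties round up, optional multiple), and the function name `prom_round` is hypothetical. Note that it differs from Python's built-in `round()`, which rounds ties to the nearest even integer.

```python
import math


def prom_round(value, to_nearest=1.0):
    """Round `value` to the nearest multiple of `to_nearest`, resolving ties upwards."""
    return math.floor(value / to_nearest + 0.5) * to_nearest
```

For example, `prom_round(2.5)` gives 3.0 and `prom_round(-2.5)` gives -2.0, while `prom_round(7, 5)` rounds down to 5 and `prom_round(7.5, 5)` rounds up to 10.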
## `scalar()`
Given a single-element input vector, `scalar(v instant-vector)` returns the
sample value of that single element as a scalar. If the input vector does not
have exactly one element, `scalar` will return `NaN`.
## `sort()`
`sort(v instant-vector)` returns vector elements sorted by their sample values,
in ascending order.
## `sort_desc()`
Same as `sort`, but sorts in descending order.
## `sqrt()`
`sqrt(v instant-vector)` calculates the square root of all elements in `v`.
## `time()`
`time()` returns the number of seconds since January 1, 1970 UTC. Note that
this does not actually return the current time, but the time at which the
expression is to be evaluated.
## `vector()`
`vector(s scalar)` returns the scalar `s` as a vector with no labels.
## `year()`
`year(v=vector(time()) instant-vector)` returns the year
for each of the given times in UTC.
## `<aggregation>_over_time()`
The following functions allow aggregating each series of a given range vector
over time and return an instant vector with per-series aggregation results:
* `avg_over_time(range-vector)`: the average value of all points in the specified interval.
* `min_over_time(range-vector)`: the minimum value of all points in the specified interval.
* `max_over_time(range-vector)`: the maximum value of all points in the specified interval.
* `sum_over_time(range-vector)`: the sum of all values in the specified interval.
* `count_over_time(range-vector)`: the count of all values in the specified interval.
* `quantile_over_time(scalar, range-vector)`: the φ-quantile (0 ≤ φ ≤ 1) of the values in the specified interval.
* `stddev_over_time(range-vector)`: the population standard deviation of the values in the specified interval.
* `stdvar_over_time(range-vector)`: the population standard variance of the values in the specified interval.
Note that all values in the specified interval have the same weight in the
aggregation even if the values are not equally spaced throughout the interval.
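To illustrate the equal weighting, here is a small Python sketch. It assumes a range vector for one series is a list of `(timestamp, value)` pairs; the function names mirror the PromQL functions but are otherwise hypothetical.

```python
import statistics

def avg_over_time(samples):
    """Plain mean of the values; timestamps do not influence the weights."""
    values = [value for _, value in samples]
    return sum(values) / len(values)

def stddev_over_time(samples):
    """Population standard deviation of the values in the range."""
    values = [value for _, value in samples]
    return statistics.pstdev(values)

# Three samples bunched at the start of the range, one near the end.
samples = [(0, 10.0), (1, 10.0), (2, 10.0), (60, 40.0)]
print(avg_over_time(samples))
print(stddev_over_time(samples))
```

A time-weighted average would give the last value far more influence here; the unweighted mean treats all four samples alike.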
@ -0,0 +1,4 @@
title: Querying
sort_rank: 4
@ -0,0 +1,250 @@
title: Operators
sort_rank: 2
# Operators
## Binary operators
Prometheus's query language supports basic logical and arithmetic operators.
For operations between two instant vectors, the [matching behavior](#vector-matching)
can be modified.
### Arithmetic binary operators
The following binary arithmetic operators exist in Prometheus:
* `+` (addition)
* `-` (subtraction)
* `*` (multiplication)
* `/` (division)
* `%` (modulo)
* `^` (power/exponentiation)
Binary arithmetic operators are defined between scalar/scalar, vector/scalar,
and vector/vector value pairs.
**Between two scalars**, the behavior is obvious: they evaluate to another
scalar that is the result of the operator applied to both scalar operands.
**Between an instant vector and a scalar**, the operator is applied to the
value of every data sample in the vector. E.g. if a time series instant vector
is multiplied by 2, the result is another vector in which every sample value of
the original vector is multiplied by 2.
**Between two instant vectors**, a binary arithmetic operator is applied to
each entry in the left-hand-side vector and its [matching element](#vector-matching)
in the right-hand-side vector. The result is propagated into the result vector and the metric
name is dropped. Entries for which no matching entry in the right-hand-side vector can be
found are not part of the result.
### Comparison binary operators
The following binary comparison operators exist in Prometheus:
* `==` (equal)
* `!=` (not-equal)
* `>` (greater-than)
* `<` (less-than)
* `>=` (greater-or-equal)
* `<=` (less-or-equal)
Comparison operators are defined between scalar/scalar, vector/scalar,
and vector/vector value pairs. By default they filter. Their behavior can be
modified by providing `bool` after the operator, which will return `0` or `1`
for the value rather than filtering.
**Between two scalars**, the `bool` modifier must be provided and these
operators result in another scalar that is either `0` (`false`) or `1`
(`true`), depending on the comparison result.
**Between an instant vector and a scalar**, these operators are applied to the
value of every data sample in the vector, and vector elements between which the
comparison result is `false` get dropped from the result vector. If the `bool`
modifier is provided, vector elements that would be dropped instead have the value
`0` and vector elements that would be kept have the value `1`.
**Between two instant vectors**, these operators behave as a filter by default,
applied to matching entries. Vector elements for which the expression is not
true or which do not find a match on the other side of the expression get
dropped from the result, while the others are propagated into a result vector
with their original (left-hand-side) metric names and label values.
If the `bool` modifier is provided, vector elements that would have been
dropped instead have the value `0` and vector elements that would be kept have
the value `1` with the left-hand-side metric names and label values.
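The difference between filtering and the `bool` modifier can be sketched for the vector/scalar case as follows. The dictionary model of an instant vector and the helper names are assumptions for illustration.

```python
def labelset(**labels):
    """Hashable stand-in for a label set (hypothetical helper)."""
    return frozenset(labels.items())

def greater_than(vector, threshold, use_bool=False):
    """Model `vector > threshold` and `vector > bool threshold`."""
    result = {}
    for labels, value in vector.items():
        matched = value > threshold
        if use_bool:
            # With `bool`, every element survives and carries 0 or 1.
            result[labels] = 1.0 if matched else 0.0
        elif matched:
            # Without `bool`, only matching elements survive, value unchanged.
            result[labels] = value
    return result

load = {labelset(instance="a"): 0.5, labelset(instance="b"): 2.0}
print(greater_than(load, 1))
print(greater_than(load, 1, use_bool=True))
```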
### Logical/set binary operators
These logical/set binary operators are only defined between instant vectors:
* `and` (intersection)
* `or` (union)
* `unless` (complement)
`vector1 and vector2` results in a vector consisting of the elements of
`vector1` for which there are elements in `vector2` with exactly matching
label sets. Other elements are dropped. The metric name and values are carried
over from the left-hand-side vector.
`vector1 or vector2` results in a vector that contains all original elements
(label sets + values) of `vector1` and additionally all elements of `vector2`
which do not have matching label sets in `vector1`.
`vector1 unless vector2` results in a vector consisting of the elements of
`vector1` for which there are no elements in `vector2` with exactly matching
label sets. All matching elements in both vectors are dropped.
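The three set operators can be modeled in a few lines of Python. This sketch assumes vectors are dictionaries keyed by label sets with the metric name left out of the key; all names are hypothetical.

```python
def labelset(**labels):
    """Hashable stand-in for a label set (hypothetical helper)."""
    return frozenset(labels.items())

def vector_and(v1, v2):
    """Elements of v1 whose label sets also occur in v2."""
    return {labels: value for labels, value in v1.items() if labels in v2}

def vector_or(v1, v2):
    """All of v1, plus elements of v2 whose label sets are absent from v1."""
    result = dict(v1)
    for labels, value in v2.items():
        result.setdefault(labels, value)
    return result

def vector_unless(v1, v2):
    """Elements of v1 whose label sets do not occur in v2."""
    return {labels: value for labels, value in v1.items() if labels not in v2}

v1 = {labelset(job="a"): 1, labelset(job="b"): 2}
v2 = {labelset(job="b"): 20, labelset(job="c"): 30}
print(vector_and(v1, v2))
print(vector_or(v1, v2))
print(vector_unless(v1, v2))
```

In every case the values come from the left-hand side whenever both sides contain the same label set.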
## Vector matching
Operations between vectors attempt to find a matching element in the right-hand-side
vector for each entry in the left-hand side. There are two basic types of
matching behavior:
**One-to-one** finds a unique pair of entries from each side of the operation.
In the default case, that is an operation following the format `vector1 <operator> vector2`.
Two entries match if they have the exact same set of labels and corresponding values.
The `ignoring` keyword allows ignoring certain labels when matching, while the
`on` keyword allows reducing the set of considered labels to a provided list:
<vector expr> <bin-op> ignoring(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) <vector expr>
Example input:
method_code:http_errors:rate5m{method="get", code="500"} 24
method_code:http_errors:rate5m{method="get", code="404"} 30
method_code:http_errors:rate5m{method="put", code="501"} 3
method_code:http_errors:rate5m{method="post", code="500"} 6
method_code:http_errors:rate5m{method="post", code="404"} 21
method:http_requests:rate5m{method="get"} 600
method:http_requests:rate5m{method="del"} 34
method:http_requests:rate5m{method="post"} 120
Example query:
method_code:http_errors:rate5m{code="500"} / ignoring(code) method:http_requests:rate5m
This returns a result vector containing the fraction of HTTP requests with status code
of 500 for each method, as measured over the last 5 minutes. Without `ignoring(code)` there
would have been no match as the metrics do not share the same set of labels.
The entries with methods `put` and `del` have no match and will not show up in the result:
{method="get"} 0.04 // 24 / 600
{method="post"} 0.05 // 6 / 120
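The one-to-one example above can be reproduced with a short Python model. It is a sketch under simplifying assumptions: vectors are dictionaries keyed by label sets, the metric name is omitted, and no error is raised for ambiguous matches. The names `labelset` and `divide_ignoring` are hypothetical.

```python
def labelset(**labels):
    """Hashable stand-in for a label set (hypothetical helper)."""
    return frozenset(labels.items())

def divide_ignoring(lhs, rhs, ignored):
    """Model one-to-one `lhs / ignoring(ignored) rhs`."""
    def match_key(labels):
        return frozenset((k, v) for k, v in labels if k not in ignored)

    rhs_by_key = {match_key(labels): value for labels, value in rhs.items()}
    result = {}
    for labels, value in lhs.items():
        key = match_key(labels)
        if key in rhs_by_key:  # entries without a partner are dropped
            result[key] = value / rhs_by_key[key]
    return result

# The left side is already restricted to code="500", as in the query.
errors_500 = {
    labelset(method="get", code="500"): 24,
    labelset(method="post", code="500"): 6,
}
requests = {
    labelset(method="get"): 600,
    labelset(method="del"): 34,
    labelset(method="post"): 120,
}
result = divide_ignoring(errors_500, requests, {"code"})
for labels, value in sorted(result.items(), key=lambda item: sorted(item[0])):
    print(dict(labels), value)
```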
**Many-to-one** and **one-to-many** matchings refer to the case where each vector element on
the "one"-side can match with multiple elements on the "many"-side. This has to
be explicitly requested using the `group_left` or `group_right` modifier, where
left/right determines which vector has the higher cardinality.
<vector expr> <bin-op> ignoring(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> ignoring(<label list>) group_right(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_left(<label list>) <vector expr>
<vector expr> <bin-op> on(<label list>) group_right(<label list>) <vector expr>
The label list provided with the group modifier contains additional labels from
the "one"-side to be included in the result metrics. For `on` a label can only
appear in one of the lists. Every time series of the result vector must be
uniquely identifiable.
_Grouping modifiers can only be used for
[comparison](#comparison-binary-operators) and
[arithmetic](#arithmetic-binary-operators) operators. Operations such as `and`, `unless`, and
`or` match with all possible entries in the right vector by default._
Example query:
method_code:http_errors:rate5m / ignoring(code) group_left method:http_requests:rate5m
In this case the left vector contains more than one entry per `method` label
value. Thus, we indicate this using `group_left`. The elements from the right
side are now matched with multiple elements with the same `method` label on the
left:
{method="get", code="500"} 0.04 // 24 / 600
{method="get", code="404"} 0.05 // 30 / 600
{method="post", code="500"} 0.05 // 6 / 120
{method="post", code="404"} 0.175 // 21 / 120
_Many-to-one and one-to-many matching are advanced use cases that should be carefully considered.
Often a proper use of `ignoring(<labels>)` provides the desired outcome._
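The `group_left` example can be modeled the same way. In this sketch (dictionary vectors, metric name omitted, hypothetical helper names) each "many"-side element keeps its own labels and is divided by the single matching "one"-side element.

```python
def labelset(**labels):
    """Hashable stand-in for a label set (hypothetical helper)."""
    return frozenset(labels.items())

def divide_group_left(lhs, rhs, ignored):
    """Model many-to-one `lhs / ignoring(ignored) group_left rhs`."""
    def match_key(labels):
        return frozenset((k, v) for k, v in labels if k not in ignored)

    rhs_by_key = {match_key(labels): value for labels, value in rhs.items()}
    result = {}
    for labels, value in lhs.items():
        key = match_key(labels)
        if key in rhs_by_key:
            # Keep the full left-hand label set so results stay distinguishable.
            result[labels] = value / rhs_by_key[key]
    return result

errors = {
    labelset(method="get", code="500"): 24,
    labelset(method="get", code="404"): 30,
    labelset(method="put", code="501"): 3,
    labelset(method="post", code="500"): 6,
    labelset(method="post", code="404"): 21,
}
requests = {
    labelset(method="get"): 600,
    labelset(method="del"): 34,
    labelset(method="post"): 120,
}
result = divide_group_left(errors, requests, {"code"})
for labels, value in sorted(result.items(), key=lambda item: sorted(item[0])):
    print(dict(labels), value)
```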
## Aggregation operators
Prometheus supports the following built-in aggregation operators that can be
used to aggregate the elements of a single instant vector, resulting in a new
vector of fewer elements with aggregated values:
* `sum` (calculate sum over dimensions)
* `min` (select minimum over dimensions)
* `max` (select maximum over dimensions)
* `avg` (calculate the average over dimensions)
* `stddev` (calculate population standard deviation over dimensions)
* `stdvar` (calculate population standard variance over dimensions)
* `count` (count number of elements in the vector)
* `count_values` (count number of elements with the same value)
* `bottomk` (smallest k elements by sample value)
* `topk` (largest k elements by sample value)
* `quantile` (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)
These operators can either be used to aggregate over **all** label dimensions
or preserve distinct dimensions by including a `without` or `by` clause.
<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)] [keep_common]
`parameter` is only required for `count_values`, `quantile`, `topk` and
`bottomk`. `without` removes the listed labels from the result vector, while
all other labels are preserved in the output. `by` does the opposite and drops
labels that are not listed in the `by` clause, even if their label values are
identical between all elements of the vector. The `keep_common` clause allows
keeping those extra labels (labels that are identical between elements, but not
in the `by` clause).
`count_values` outputs one time series per unique sample value. Each series has
an additional label. The name of that label is given by the aggregation
parameter, and the label value is the unique sample value. The value of each
time series is the number of times that sample value was present.
`topk` and `bottomk` are different from other aggregators in that a subset of
the input samples, including the original labels, are returned in the result
vector. `by` and `without` are only used to bucket the input vector.
If the metric `http_requests_total` had time series that fan out by
`application`, `instance`, and `group` labels, we could calculate the total
number of seen HTTP requests per application and group over all instances via:
sum(http_requests_total) without (instance)
If we are just interested in the total of HTTP requests we have seen in **all**
applications, we could simply write:
sum(http_requests_total)
To count the number of binaries running each build version we could write:
count_values("version", build_version)
To get the 5 largest HTTP requests counts across all instances we could write:
topk(5, http_requests_total)
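The following Python sketch models three of the aggregations used above. The dictionary representation, the sample data, and the way `count_values` formats the value into a label string are assumptions for illustration, not Prometheus's actual behavior in every detail.

```python
import heapq
from collections import Counter, defaultdict

def labelset(**labels):
    """Hashable stand-in for a label set (hypothetical helper)."""
    return frozenset(labels.items())

def sum_without(vector, dropped):
    """Model `sum(vector) without (dropped)`."""
    totals = defaultdict(float)
    for labels, value in vector.items():
        key = frozenset((k, v) for k, v in labels if k not in dropped)
        totals[key] += value
    return dict(totals)

def count_values(label_name, vector):
    """One output series per distinct sample value, labeled with that value."""
    counts = Counter(format(value, "g") for value in vector.values())
    return {frozenset({(label_name, text)}): n for text, n in counts.items()}

def topk(k, vector):
    """The k elements with the largest values, original labels preserved."""
    return dict(heapq.nlargest(k, vector.items(), key=lambda item: item[1]))

# Hypothetical sample data.
http_requests = {
    labelset(application="api", group="canary", instance="0"): 10,
    labelset(application="api", group="canary", instance="1"): 20,
    labelset(application="web", group="production", instance="0"): 5,
}
print(sum_without(http_requests, {"instance"}))
print(topk(1, http_requests))
```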
## Binary operator precedence
The following list shows the precedence of binary operators in Prometheus, from
highest to lowest.
1. `^`
2. `*`, `/`, `%`
3. `+`, `-`
4. `==`, `!=`, `<=`, `<`, `>=`, `>`
5. `and`, `unless`
6. `or`
Operators on the same precedence level are left-associative. For example,
`2 * 3 % 2` is equivalent to `(2 * 3) % 2`. However `^` is right associative,
so `2 ^ 3 ^ 2` is equivalent to `2 ^ (3 ^ 2)`.
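As an analogy only: Python's `*` and `%` are also left-associative on a shared precedence level, and its `**` operator is right-associative like PromQL's `^`. The two examples can therefore be checked with plain Python arithmetic.

```python
# Left-associative: evaluated as (2 * 3) % 2, not 2 * (3 % 2).
left = 2 * 3 % 2
print(left, (2 * 3) % 2, 2 * (3 % 2))

# Right-associative: evaluated as 2 ** (3 ** 2), not (2 ** 3) ** 2.
right = 2 ** 3 ** 2
print(right, 2 ** (3 ** 2), (2 ** 3) ** 2)
```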
@ -0,0 +1,357 @@
title: Storage
sort_rank: 5
# Storage
Prometheus has a sophisticated local storage subsystem. For indexes,
it uses [LevelDB](https://github.com/google/leveldb). For the bulk
sample data, it has its own custom storage layer, which organizes
sample data in chunks of constant size (1024 bytes payload). These
chunks are then stored on disk in one file per time series.
This section deals with the various configuration settings and issues you
might run into. To dive deeper into the topic, check out the following talks:
* [The Prometheus Time Series Database](https://www.youtube.com/watch?v=HbnGSNEjhUc).
* [Configuring Prometheus for High Performance](https://www.youtube.com/watch?v=hPC60ldCGm8).
## Memory usage
Prometheus keeps all the currently used chunks in memory. In addition, it keeps
as many most recently used chunks in memory as possible. You have to tell
Prometheus how much memory it may use for this caching. The flag
`storage.local.target-heap-size` allows you to set the heap size (in bytes)
Prometheus aims not to exceed. Note that the amount of physical memory the
Prometheus server will use is the result of complex interactions of the Go
runtime and the operating system and very hard to predict precisely. As a rule
of thumb, you should have at least 50% headroom in physical memory over the
configured heap size. (Or, in other words, set `storage.local.target-heap-size`
to a value of two thirds of the physical memory limit Prometheus should not
exceed.)
The default value of `storage.local.target-heap-size` is 2GiB and thus tailored
to 3GiB of physical memory usage. If you have less physical memory available,
you have to lower the flag value. If you have more memory available, you should
raise the value accordingly. Otherwise, Prometheus will not make use of the
memory and thus will perform much worse than it could.
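The rule of thumb (heap target of two thirds of the physical memory limit) amounts to simple arithmetic. The helper below is a hypothetical sketch that prints a flag value for a given memory budget.

```python
GIB = 1024 ** 3

def target_heap_size(physical_memory_limit_bytes):
    """Two thirds of the memory limit, leaving 50% headroom over the heap."""
    return physical_memory_limit_bytes * 2 // 3

limit = 3 * GIB
print(f"-storage.local.target-heap-size={target_heap_size(limit)}")
```

For a 3GiB limit this yields 2GiB, which matches the default described above.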
Because Prometheus uses most of its heap for long-lived allocations of memory
chunks, the
[garbage collection target percentage](https://golang.org/pkg/runtime/debug/#SetGCPercent)
is set to 40 by default. You can still override this setting via the `GOGC`
environment variable as usual. If you need to conserve CPU capacity and can
accept running with fewer memory chunks, try higher values.
For high-performance set-ups, you might need to adjust more flags. Please read
through the sections below for details.
NOTE: Prior to v1.6, there was no flag `storage.local.target-heap-size`.
Instead, the number of chunks kept in memory had to be configured using the
flags `storage.local.memory-chunks` and `storage.local.max-chunks-to-persist`.
These flags still exist for compatibility reasons. However,
`storage.local.max-chunks-to-persist` has no effect anymore, and if
`storage.local.memory-chunks` is set to a non-zero value _x_, it is used to
override the value for `storage.local.target-heap-size` to 3072*_x_.
## Disk usage
Prometheus stores its on-disk time series data under the directory specified by
the flag `storage.local.path`. The default path is `./data` (relative to the
working directory), which is good to try something out quickly but most likely
not what you want for actual operations. The flag `storage.local.retention`
allows you to configure the retention time for samples. Adjust it to your needs
and your available disk space.
## Chunk encoding
Prometheus currently offers three different types of chunk encodings. The chunk
encoding for newly created chunks is determined by the
`-storage.local.chunk-encoding-version` flag. The valid values are 0, 1,
or 2.
Type 0 is the simple delta encoding implemented for Prometheus's first chunked
storage layer. Type 1 is the current default encoding, a double-delta encoding
with much better compression behavior than type 0. Both encodings feature a
fixed byte width per sample over the whole chunk, which allows fast random
access. While type 0 is the fastest encoding, the difference in encoding cost
compared to encoding 1 is tiny. Due to the better compression behavior of type
1, there is really no reason to select type 0 except compatibility with very
old Prometheus versions.
Type 2 is a variable bit-width encoding, i.e. each sample in the chunk can use
a different number of bits. Timestamps are double-delta encoded, too, but with
a slightly different algorithm. A number of different encoding schemes are
available for sample values. The choice is made per chunk based on the nature
of the sample values (constant, integer, regularly increasing, random…). Major
parts of the type 2 encoding are inspired by a paper published by Facebook
[_Gorilla: A Fast, Scalable, In-Memory Time Series Database_](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf).
With type 2, access within a chunk has to happen sequentially, and the encoding
and decoding cost is a bit higher. Overall, type 2 will cause more CPU usage
and increased query latency compared to type 1 but offers a much improved
compression ratio. The exact numbers depend heavily on the data set and the
kind of queries. Below are results from a typical production server with a
fairly expensive set of recording rules.
Chunk type | bytes per sample | cores | rule evaluation duration
:------:|:-----:|:----:|:----:
1 | 3.3 | 1.6 | 2.9s
2 | 1.3 | 2.4 | 4.9s
You can change the chunk encoding each time you start the server, so
experimenting with your own use case is encouraged. Take into account, however,
that only newly created chunks will use the newly selected chunk encoding, so
it will take a while until you see the effects.
For more details about the trade-off between the chunk encodings, see
[this blog post](https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chunks/).
## Settings for high numbers of time series
Prometheus can handle millions of time series. However, with the above
mentioned default setting for `storage.local.target-heap-size`, you will be
limited to about 200,000 time series simultaneously present in memory. For more
series, you need more memory, and you need to configure Prometheus to make use
of it as described above.
Each of the aforementioned chunks contains samples of a single time series. A
time series is thus represented as a series of chunks, which ultimately end up
in a time series file (one file per time series) on disk.
A series that has recently received new samples will have an open incomplete
_head chunk_. Once that chunk is completely filled, or the series hasn't
received samples in a while, the head chunk is closed and becomes a chunk
waiting to be appended to its corresponding series file, i.e. it is _waiting
for persistence_. After the chunk has been persisted to disk, it becomes
_evictable_, provided it is not currently used by a query. Prometheus will
evict evictable chunks from memory to satisfy the configured target heap
size. A series with an open head chunk is called an _active series_. This is
different from a _memory series_, which also includes series without an open
head chunk but still other chunks in memory (whether waiting for persistence,
used in a query, or evictable). A series without any chunks in memory may be
_archived_, upon which it ceases to have any mandatory memory footprint.
The amount of chunks Prometheus can keep in memory depends on the flag value
for `storage.local.target-heap-size` and on the amount of memory used by
everything else. If there are not enough chunks evictable to satisfy the target
heap size, Prometheus will throttle ingestion of more samples (by skipping
scrapes and rule evaluations) until the heap has shrunk enough. _Throttled
ingestion is really bad for various reasons. You really do not want to be in
that situation._
Open head chunks, chunks still waiting for persistence, and chunks being used
in a query are not evictable. Thus, the reasons for the inability to evict
enough chunks include the following:
1. Queries that use too many chunks.
2. Chunks are piling up waiting for persistence because the storage layer
cannot keep up writing chunks.
3. There are too many active time series, which results in too many open head
chunks.
Currently, Prometheus has no defence against case (1). Abusive queries will
essentially OOM the server.
To defend against case (2), there is a concept of persistence urgency explained
in the next section.
Case (3) depends on the targets you monitor. To mitigate an unplanned explosion
of the number of series, you can limit the number of samples per individual
scrape (see `sample_limit` in the [scrape config](configuration/configuration.md#scrape_config)).
If the number of active time series exceeds the number of memory chunks the
Prometheus server can afford, the server will quickly throttle ingestion as
described above. The only way out of this is to give Prometheus more RAM or
reduce the number of time series to ingest.
In fact, you want many more memory chunks than you have series in
memory. Prometheus tries to batch up disk writes as much as possible as it
helps for both HDD (write as much as possible after each seek) and SSD (tiny
writes create write amplification, which limits the effective throughput and
burns much more quickly through the lifetime of the device). The more
Prometheus can batch up writes, the more efficient is the process of persisting
chunks to disk, which helps case (2).
In conclusion, to keep the Prometheus server healthy, make sure it has plenty
of headroom of memory chunks available for the number of memory series. A
factor of three is a good starting point. Refer to the
[section about helpful metrics](#helpful-metrics) to find out what to look
for. A very broad rule of thumb for an upper limit of memory series is the
total available physical memory in bytes divided by 10,000, e.g. about 6M
memory series on a 64GiB server.
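Both rules of thumb from this paragraph can be written down as arithmetic. The function names are hypothetical, and the results are rough planning figures rather than limits enforced by Prometheus.

```python
GIB = 1024 ** 3

def max_memory_series(physical_memory_bytes):
    """Very broad upper limit: physical memory in bytes divided by 10,000."""
    return physical_memory_bytes // 10_000

def recommended_memory_chunks(memory_series, headroom_factor=3):
    """Chunk headroom: a factor of three is the suggested starting point."""
    return memory_series * headroom_factor

series = max_memory_series(64 * GIB)
print(series)
print(recommended_memory_chunks(series))
```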
If you combine a high number of time series with very fast and/or large
scrapes, the number of pre-allocated mutexes for series locking might not be
sufficient. If you see scrape hiccups while Prometheus is writing a checkpoint
or processing expensive queries, try increasing the value of the
`storage.local.num-fingerprint-mutexes` flag. Sometimes tens of thousands or
even more are required.
PromQL queries that involve a high number of time series will make heavy use of
the LevelDB-backed indexes. If you need to run queries of that kind, tweaking
the index cache sizes might be required. The following flags are relevant:
* `-storage.local.index-cache-size.label-name-to-label-values`: For regular
expression matching.
* `-storage.local.index-cache-size.label-pair-to-fingerprints`: Increase the
size if a large number of time series share the same label pair or name.
* `-storage.local.index-cache-size.fingerprint-to-metric` and
`-storage.local.index-cache-size.fingerprint-to-timerange`: Increase the size
if you have a large number of archived time series, i.e. series that have not
received samples in a while but are still not old enough to be purged
You have to experiment with the flag values to find out what helps. If a query
touches 100,000+ time series, hundreds of MiB might be reasonable. If you have
plenty of memory available, using more of it for LevelDB cannot harm. More
memory for LevelDB will effectively reduce the number of memory chunks
Prometheus can afford.
## Persistence urgency and “rushed mode”
Naively, Prometheus would always try to persist completed chunks to disk
as soon as possible. Such a strategy would lead to many tiny write operations,
using up most of the I/O bandwidth and keeping the server quite busy. Spinning
disks will appear to be very slow because of the many slow seeks required, and
SSDs will suffer from write amplification. Prometheus tries instead to batch up
write operations as much as possible, which works better if it is allowed to
use more memory.
Prometheus will also sync series files after each write (with
`storage.local.series-sync-strategy=adaptive`, which is the default) and use
the disk bandwidth for more frequent checkpoints (based on the count of “dirty
series”, see [below](#crash-recovery)), both attempting to minimize data loss
in case of a crash.
But what to do if the number of chunks waiting for persistence grows too much?
Prometheus calculates a score for urgency to persist chunks. The score is
between 0 and 1, where 1 corresponds to the highest urgency. Depending on the
score, Prometheus will write to disk more frequently. Should the score ever
pass the threshold of 0.8, Prometheus enters “rushed mode” (which you can see
in the logs). In rushed mode, the following strategies are applied to speed up
persisting chunks:
* Series files are not synced after write operations anymore (making better use
of the OS's page cache at the price of an increased risk of losing data in
case of a server crash – this behavior can be overridden with the flag
`storage.local.series-sync-strategy`).
* Checkpoints are only created as often as configured via the
`storage.local.checkpoint-interval` flag (freeing more disk bandwidth for
persisting chunks at the price of more data loss in case of a crash and an
increased time to run the subsequent crash recovery).
* Write operations to persist chunks are not throttled anymore and performed as
fast as possible.
Prometheus leaves rushed mode once the score has dropped below 0.7.
Throttling of ingestion happens if the urgency score reaches 1. Thus, the
rushed mode is not _per se_ something to be avoided. It is, on the contrary, a
measure the Prometheus server takes to avoid the really bad situation of
throttled ingestion. Occasionally entering rushed mode is OK, if it helps and
ultimately leads to leaving rushed mode again. _If rushed mode is entered but
the urgency score still goes up, the server has a real problem._
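The thresholds above form a hysteresis loop: rushed mode starts above 0.8, ends below 0.7, and ingestion is throttled at a score of 1. A minimal Python sketch of that state machine follows; the class name and structure are hypothetical and do not reflect Prometheus's internals.

```python
class PersistenceUrgency:
    ENTER_RUSHED = 0.8   # rushed mode starts once the score passes this
    LEAVE_RUSHED = 0.7   # rushed mode ends once the score drops below this

    def __init__(self):
        self.rushed = False

    def update(self, score):
        """Return (rushed_mode, ingestion_throttled) for a new urgency score."""
        if score > self.ENTER_RUSHED:
            self.rushed = True
        elif score < self.LEAVE_RUSHED:
            self.rushed = False
        # Between the two thresholds the previous state is kept.
        throttled = score >= 1.0
        return self.rushed, throttled

tracker = PersistenceUrgency()
for score in (0.5, 0.85, 0.75, 0.65, 1.0):
    print(score, tracker.update(score))
```

The gap between the two thresholds keeps the server from flapping in and out of rushed mode when the score hovers around a single value.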
## Settings for very long retention time
If you have set a very long retention time via the `storage.local.retention`
flag (more than a month), you might want to increase the flag value
`storage.local.series-file-shrink-ratio`.
Whenever Prometheus needs to cut off some chunks from the beginning of a series
file, it will simply rewrite the whole file. (Some file systems support “head
truncation”, which Prometheus currently does not use for several reasons.) To
not rewrite a very large series file to get rid of very few chunks, the rewrite
only happens if at least 10% of the chunks in the series file are removed. This
value can be changed via the mentioned `storage.local.series-file-shrink-ratio`
flag. If you have a lot of disk space but want to minimize rewrites (at the
cost of wasted disk space), increase the flag value to higher values, e.g. 0.3
for 30% of required chunk removal.
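The rewrite decision described above reduces to a ratio check. This is a sketch with a hypothetical function name; the default of 0.1 corresponds to the 10% mentioned in the text.

```python
def should_rewrite_series_file(total_chunks, chunks_to_drop, shrink_ratio=0.1):
    """Rewrite only if at least shrink_ratio of the file's chunks would be removed."""
    if total_chunks == 0:
        return False
    return chunks_to_drop / total_chunks >= shrink_ratio

print(should_rewrite_series_file(1000, 50))         # 5% removed
print(should_rewrite_series_file(1000, 100))        # 10% removed
print(should_rewrite_series_file(1000, 200, 0.3))   # 20% removed, ratio raised to 0.3
```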
## Crash recovery
Prometheus saves chunks to disk as soon as possible after they are
complete. Incomplete chunks are saved to disk during regular
checkpoints. You can configure the checkpoint interval with the flag
`storage.local.checkpoint-interval`. Prometheus creates checkpoints
more frequently than that if too many time series are in a “dirty”
state, i.e. their current incomplete head chunk is not the one that is
contained in the most recent checkpoint. This limit is configurable
via the `storage.local.checkpoint-dirty-series-limit` flag.
More active time series to cycle through lead in general to more chunks waiting
for persistence, which in turn leads to larger checkpoints and ultimately more
time needed for checkpointing. There is a clear trade-off between limiting the
loss of data in case of a crash and the ability to scale to a high number of
active time series. To not spend the majority of the disk throughput for
checkpointing, you have to increase the checkpoint interval. Prometheus itself
limits the time spent in checkpointing to 50% by waiting after each
checkpoint's completion for at least as long as the previous checkpoint took.
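The 50% cap follows from the waiting rule. The sketch below assumes the wait after a checkpoint is the larger of the configured interval and the duration of the checkpoint that just finished; that exact combination is an assumption, and the function names are hypothetical.

```python
def wait_after_checkpoint(checkpoint_interval_s, last_duration_s):
    """Wait at least as long as the checkpoint that just completed took."""
    return max(checkpoint_interval_s, last_duration_s)

def checkpointing_time_fraction(checkpoint_interval_s, last_duration_s):
    """Share of wall-clock time spent checkpointing in steady state."""
    wait = wait_after_checkpoint(checkpoint_interval_s, last_duration_s)
    return last_duration_s / (last_duration_s + wait)

print(checkpointing_time_fraction(300, 60))    # short checkpoints
print(checkpointing_time_fraction(300, 400))   # checkpoints longer than the interval
```

Since the wait is never shorter than the checkpoint itself, the fraction cannot exceed one half.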
Nevertheless, should your server crash, you might still lose data, and
your storage might be left in an inconsistent state. Therefore,
Prometheus performs a crash recovery after an unclean shutdown,
similar to an `fsck` run for a file system. Details about the crash
recovery are logged, so you can use it for forensics if required. Data
that cannot be recovered is moved to a directory called `orphaned`
(located under `storage.local.path`). Remember to delete that data if
you do not need it anymore.
The crash recovery usually takes less than a minute. Should it take much
longer, consult the log to find out what is going on. With an increasing number of
time series in the storage (archived or not), the re-indexing tends to dominate
the recovery time and can take tens of minutes in extreme cases.
## Data corruption
If you suspect problems caused by corruption in the database, you can
enforce a crash recovery by starting the server with the flag
If that does not help, or if you simply want to erase the existing
database, you can easily start fresh by deleting the contents of the
storage directory:
1. Stop Prometheus.
1. `rm -r <storage path>/*`
1. Start Prometheus.
## Helpful metrics
Out of the metrics that Prometheus exposes about itself, the following are
particularly useful to tweak flags and find out about the required
resources. They also help to create alerts to find out in time if a Prometheus
server has problems or is out of capacity.
* `prometheus_local_storage_memory_series`: The current number of series held
in memory.
* `prometheus_local_storage_open_head_chunks`: The number of open head chunks.
* `prometheus_local_storage_chunks_to_persist`: The number of memory chunks
that still need to be persisted to disk.
* `prometheus_local_storage_memory_chunks`: The current number of chunks held
in memory. If you subtract the previous two, you get the number of persisted
chunks (which are evictable if not currently in use by a query).
* `prometheus_local_storage_series_chunks_persisted`: A histogram of the number
of chunks persisted per batch.
* `prometheus_local_storage_persistence_urgency_score`: The urgency score as
discussed [above](#persistence-urgency-and-rushed-mode).
* `prometheus_local_storage_rushed_mode` is 1 if Prometheus is in “rushed
mode”, 0 otherwise. Can be used to calculate the percentage of time
Prometheus is in rushed mode.
* `prometheus_local_storage_checkpoint_last_duration_seconds`: How long the
last checkpoint took.
* `prometheus_local_storage_checkpoint_last_size_bytes`: Size of the last
checkpoint in bytes.
* `prometheus_local_storage_checkpointing` is 1 while Prometheus is
checkpointing, 0 otherwise. Can be used to calculate the percentage of time
Prometheus is checkpointing.
* `prometheus_local_storage_inconsistencies_total`: Counter for storage
inconsistencies found. If this is greater than 0, restart the server for
recovery.
* `prometheus_local_storage_persist_errors_total`: Counter for persist errors.
* `prometheus_local_storage_memory_dirty_series`: Current number of dirty series.
* `process_resident_memory_bytes`: Broadly speaking the physical memory
occupied by the Prometheus process.
* `go_memstats_alloc_bytes`: Go heap size (allocated objects in use plus allocated
objects not in use anymore but not yet garbage-collected).
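As a sketch of how a few of these metrics combine, the following Python computes the number of persisted chunks and the chunk headroom per memory series. The readings are hypothetical values, and the parameter names simply echo the metric names above.

```python
def persisted_chunks(memory_chunks, open_head_chunks, chunks_to_persist):
    """Chunks in memory that are already on disk (evictable unless in use by a query)."""
    return memory_chunks - open_head_chunks - chunks_to_persist

def chunks_per_series(memory_chunks, memory_series):
    """Headroom of memory chunks per memory series."""
    return memory_chunks / memory_series

# Hypothetical readings scraped from a Prometheus server.
memory_chunks = 600_000
open_head_chunks = 150_000
chunks_to_persist = 50_000
memory_series = 200_000

print(persisted_chunks(memory_chunks, open_head_chunks, chunks_to_persist))
print(chunks_per_series(memory_chunks, memory_series))
```

A ratio well below the factor of three suggested earlier hints that the server is short on memory chunks for its number of series.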