Various doc tweaks (#8111)

Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>
Anthony D'Atri 2020-10-27 02:50:37 -07:00 committed by GitHub
parent 3f8e51738c
commit 2cbc0f9bfe
5 changed files with 71 additions and 52 deletions


@@ -13,19 +13,18 @@ examples and guides.
Prometheus, a [Cloud Native Computing Foundation](https://cncf.io/) project, is a systems and service monitoring system. It collects metrics
from configured targets at given intervals, evaluates rule expressions,
displays the results, and can trigger alerts when specified conditions are observed.
The features that distinguish Prometheus from other metrics and monitoring systems are:
- A **multi-dimensional** data model (time series defined by metric name and set of key/value dimensions; see the example query after this list)
- PromQL, a **powerful and flexible query language** to leverage this dimensionality
- No dependency on distributed storage; **single server nodes are autonomous**
- An HTTP **pull model** for time series collection
- **Pushing time series** is supported via an intermediary gateway for batch jobs
- Targets are discovered via **service discovery** or **static configuration**
- Multiple modes of **graphing and dashboarding support**
- Support for hierarchical and horizontal **federation**
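As a small illustration of the first two bullets, a PromQL selector filters a metric by its label dimensions. The metric and label values below are hypothetical:

```
http_requests_total{job="api-server", status=~"5.."}
```

This returns every `http_requests_total` time series from the `api-server` job whose `status` label matches a 5xx code.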
## Architecture overview
@@ -56,9 +55,9 @@ Prometheus will now be reachable at http://localhost:9090/.
### Building from source
To build Prometheus from source code, first ensure that you have a working
Go environment with [version 1.13 or greater installed](https://golang.org/doc/install).
You also need [Node.js](https://nodejs.org/) and [Yarn](https://yarnpkg.com/)
installed in order to build the frontend assets.
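A typical from-source build might then look like this (a sketch that assumes a local `git` checkout and the repository's `make build` target):

```
git clone https://github.com/prometheus/prometheus.git
cd prometheus
make build                                  # builds the prometheus and promtool binaries
./prometheus --config.file=your_config.yml  # start the freshly built server
```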
You can directly use the `go` tool to download and install the `prometheus`


@@ -6,9 +6,9 @@ sort_rank: 1
# Getting started
This guide is a "Hello World"-style tutorial which shows how to install,
configure, and use a simple Prometheus instance. You will download and run
Prometheus locally, configure it to scrape itself and an example application,
then work with queries, rules, and graphs to use collected time
series data.
## Downloading and running Prometheus
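For example, after downloading a release archive from https://prometheus.io/download/ for your platform, a typical sequence is (file names are illustrative and vary by version):

```
tar xvfz prometheus-*.tar.gz
cd prometheus-*
```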
@@ -25,12 +25,12 @@ Before starting Prometheus, let's configure it.
## Configuring Prometheus to monitor itself
Prometheus collects metrics from _targets_ by scraping metrics HTTP
endpoints. Since Prometheus exposes data in the same
manner about itself, it can also scrape and monitor its own health.
While a Prometheus server that collects only data about itself is not very
useful, it is a good starting example. Save the following basic
Prometheus configuration as a file named `prometheus.yml`:
```yaml
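# A minimal configuration in the spirit of the getting-started guide;
# the full example in the guide may differ slightly.
global:
  scrape_interval: 15s      # How frequently to scrape targets by default.
  evaluation_interval: 15s  # How frequently to evaluate rules.

scrape_configs:
  # Prometheus scraping its own metrics endpoint.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
```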
@@ -79,26 +79,26 @@ navigating to its metrics endpoint:
## Using the expression browser
Let us explore data that Prometheus has collected about itself. To
use Prometheus's built-in expression browser, navigate to
http://localhost:9090/graph and choose the "Console" view within the "Graph" tab.
As you can gather from [localhost:9090/metrics](http://localhost:9090/metrics),
one metric that Prometheus exports about itself is named
`prometheus_target_interval_length_seconds` (the actual amount of time between
target scrapes). Enter the following into the expression console and then click "Execute":
```
prometheus_target_interval_length_seconds
```
This should return a number of different time series (along with the latest value
recorded for each), each with the metric name
`prometheus_target_interval_length_seconds`, but with different labels. These
labels designate different latency percentiles and target group intervals.
If we are interested only in 99th percentile latencies, we could use this
query:
```
prometheus_target_interval_length_seconds{quantile="0.99"}
```
@@ -129,8 +129,7 @@ Experiment with the graph range parameters and other settings.
## Starting up some sample targets
Let's add additional targets for Prometheus to scrape.
The Node Exporter is used as an example target; for more information on using it,
[see these instructions](https://prometheus.io/docs/guides/node-exporter/).
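As a sketch (archive names vary by release), the three sample targets might be started like this:

```
# Download a node_exporter release from https://prometheus.io/download/ first.
tar -xzvf node_exporter-*.tar.gz
cd node_exporter-*

# Start three instances on separate ports (run each in its own terminal, or background them).
./node_exporter --web.listen-address 127.0.0.1:8080
./node_exporter --web.listen-address 127.0.0.1:8081
./node_exporter --web.listen-address 127.0.0.1:8082
```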
@@ -151,7 +150,7 @@ http://localhost:8081/metrics, and http://localhost:8082/metrics.
## Configure Prometheus to monitor the sample targets
Now we will configure Prometheus to scrape these new targets. Let's group all
three endpoints into one job called `node`. We will imagine that the
first two endpoints are production targets, while the third one represents a
canary instance. To model this in Prometheus, we can add several groups of
endpoints to a single job, adding extra labels to each group of targets. In
@@ -185,8 +184,8 @@ about time series that these example endpoints expose, such as `node_cpu_seconds
Though not a problem in our example, queries that aggregate over thousands of
time series can get slow when computed ad-hoc. To make this more efficient,
Prometheus can prerecord expressions into new persisted
time series via configured _recording rules_. Let's say we are interested in
recording the per-second rate of cpu time (`node_cpu_seconds_total`) averaged
over all cpus per instance (but preserving the `job`, `instance` and `mode`
dimensions) as measured over a window of 5 minutes. We could write this as:
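One PromQL expression matching that description (a sketch, not necessarily verbatim from the guide) is:

```
avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
```

In a recording rules file, such an expression is paired with a `record:` name; a name like `job_instance_mode:node_cpu_seconds:avg_rate5m`, following the level:metric:operations convention, is one illustrative choice.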


@@ -5,7 +5,7 @@ sort_rank: 7
# Management API
Prometheus provides a set of management APIs to facilitate automation and integration.
### Health check
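For instance, the health and readiness endpoints answer plain HTTP GETs; assuming a server listening on localhost:9090:

```
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/healthy   # 200 when healthy
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/ready     # 200 when ready to serve traffic
```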


@@ -11,10 +11,10 @@ This document offers guidance on migrating from Prometheus 1.8 to Prometheus 2.0
## Flags
The format of Prometheus command line flags has changed. Instead of a
single dash, all flags now use a double dash. Common flags (`--config.file`,
`--web.listen-address` and `--web.external-url`) remain, but
almost all storage-related flags have been removed.
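For example, a 1.8-era invocation and its 2.0 equivalent (flag values are illustrative):

```
# Prometheus 1.8 (single dash)
./prometheus -config.file=prometheus.yml -web.listen-address=:9090

# Prometheus 2.0 (double dash)
./prometheus --config.file=prometheus.yml --web.listen-address=:9090
```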
Some notable flags which have been removed:
@@ -27,11 +27,11 @@ Some notable flags which have been removed:
- `-query.staleness-delta` has been renamed to `--query.lookback-delta`; Prometheus
2.0 introduces a new mechanism for handling staleness, see [staleness](querying/basics.md#staleness).
- `-storage.local.*` Prometheus 2.0 introduces a new storage engine; as such all
flags relating to the old engine have been removed. For information on the
new engine, see [Storage](#storage).
- `-storage.remote.*` Prometheus 2.0 has removed the deprecated remote
storage flags, and will fail to start if they are supplied. To write to
InfluxDB, Graphite, or OpenTSDB, use the relevant storage adapter.


@@ -9,15 +9,22 @@ Prometheus includes a local on-disk time series database, but also optionally in
## Local storage
Prometheus's local time series database stores data in a custom, highly efficient format on local storage.
### On-disk layout
Ingested samples are grouped into blocks of two hours. Each two-hour block consists of a directory containing one or more chunk files that contain all time series samples for that window of time, as well as a metadata file and index file (which indexes metric names and labels to time series in the chunk files). When series are deleted via the API, deletion records are stored in separate tombstone files (instead of deleting the data immediately from the chunk files).
The current block for incoming samples is kept in memory and is not fully
persisted. It is secured against crashes by a write-ahead log (WAL) that can be
replayed when the Prometheus server restarts. Write-ahead log files are stored
in the `wal` directory in 128MB segments. These files contain raw data that
has not yet been compacted; thus they are significantly larger than regular block
files. Prometheus will retain a minimum of three write-ahead log files.
High-traffic servers may retain more than three WAL files in order to keep at
least two hours of raw data.
A Prometheus server's data directory looks something like this:
```
./data
```

@@ -46,7 +53,12 @@
Note that a limitation of local storage is that it is not clustered or
replicated. Thus, it is not arbitrarily scalable or durable in the face of
drive or node outages and should be managed like any other single node
database. The use of RAID is suggested for storage availability, and
[snapshots](querying/api.md#snapshot) are recommended for backups. With proper
architecture, it is possible to retain years of data in local storage.
Alternatively, external storage may be used via the [remote read/write APIs](https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage). Careful evaluation is required for these systems as they vary greatly in durability, performance, and efficiency.
@@ -56,37 +68,46 @@ For further details on file format, see [TSDB format](/tsdb/docs/format/README.m
The initial two-hour blocks are eventually compacted into longer blocks in the background.
Compaction will create larger blocks containing data spanning up to 10% of the retention time, or 31 days, whichever is smaller.
## Operational aspects
Prometheus has several flags that configure local storage. The most important are listed below; an example invocation follows the list:
* `--storage.tsdb.path`: Where Prometheus writes its database. Defaults to `data/`.
* `--storage.tsdb.retention.time`: When to remove old data. Defaults to `15d`. Overrides `storage.tsdb.retention` if this flag is set to anything other than default.
* `--storage.tsdb.retention.size`: [EXPERIMENTAL] The maximum number of bytes of storage blocks to retain. The oldest data will be removed first. Defaults to `0` or disabled. This flag is experimental and may change in future releases. Units supported: B, KB, MB, GB, TB, PB, EB. Ex: "512MB"
* `--storage.tsdb.retention`: Deprecated in favor of `storage.tsdb.retention.time`.
* `--storage.tsdb.wal-compression`: Enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to be halved with little extra cpu load. This flag was introduced in 2.11.0 and enabled by default in 2.20.0. Note that once enabled, downgrading Prometheus to a version below 2.11.0 will require deleting the WAL.
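For example, these flags might be combined as follows (paths and values are illustrative):

```
./prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.wal-compression
```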
Prometheus stores an average of only 1-2 bytes per sample. Thus, to plan the capacity of a Prometheus server, you can use the rough formula:
```
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
```
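As a rough worked example (numbers chosen purely for illustration):

```
# 15 days retention, 100,000 samples ingested per second, ~2 bytes per sample:
15 * 24 * 3600 * 100000 * 2 = 259,200,000,000 bytes ≈ 260 GB of disk space
```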
To lower the rate of ingested samples, you can either reduce the number of time series you scrape (fewer targets or fewer series per target), or you can increase the scrape interval. However, reducing the number of series is likely more effective, due to compression of samples within a series.
If your local storage becomes corrupted for whatever reason, the best
strategy to address the problem is to shut down Prometheus, then remove the
entire storage directory. You can also try removing individual block directories,
or the WAL directory, to resolve the problem. Note that this means losing
approximately two hours of data per block directory. Again, Prometheus's local
storage is not intended to be durable long-term storage; external solutions
offer extended retention and data durability.
CAUTION: Non-POSIX compliant filesystems are not supported for Prometheus' local storage as unrecoverable corruptions may happen. NFS filesystems (including AWS's EFS) are not supported. NFS could be POSIX-compliant, but most implementations are not. It is strongly recommended to use a local filesystem for reliability.
If both time and size retention policies are specified, whichever triggers first
will be used.
Expired block cleanup happens in the background. It may take up to two hours to remove expired blocks. Blocks must be fully expired before they are removed.
## Remote storage integrations
Prometheus's local storage is limited to a single node's scalability and durability.
Instead of trying to solve clustered storage in Prometheus itself, Prometheus offers
a set of interfaces that allow integrating with remote storage systems.
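For example, a remote endpoint is typically wired in through `remote_write` and `remote_read` blocks in `prometheus.yml` (the URL below is a placeholder):

```yaml
remote_write:
  - url: "https://remote-storage.example.org/api/v1/write"

remote_read:
  - url: "https://remote-storage.example.org/api/v1/read"
```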
### Overview