Metrics Package
The metrics package is a home-grown package created during the development of Rend. There are a few reasons behind the decision to make a new package, but the primary driving force going forward is to have the freedom to maintain compatibility with the Atlas metrics system in use at Netflix. The histograms (described later) are an example of this.
In general, the package aims to stay out of the way as much as possible. It relies on atomic operations wherever it can to avoid locks, though locks are used in at least one place. The package gathers a variety of different metrics through package-level calls (no counter structs exist) and reports all of the metrics created by the application on the HTTP endpoint /metrics. It is expected that the program has an HTTP server set up, which is the case for anyone exposing pprof endpoints in production. If you require some namespacing, the main function of the application should also call metrics.SetPrefix() to set the prefix for all metrics. The data returned on the endpoint includes all system information (GC info and percentiles, memory usage, allocation statistics, etc.) as well as all metrics known to the package.
A metric is declared by using the Add* functions on the package, e.g. metrics.AddCounter("foo", nil). The string passed in is recorded and a token is returned that gives you access to that same counter. The string is also the name of the metric reported from the endpoint (with the prefix attached), so it should make sense externally. The token must be used: passing the same string to another call to AddCounter does not get you the same token back. The second parameter is of type metrics.Tags, which is just a map[string]string of custom tags to append to the output of that metric. The output format of the endpoint is covered under the endpoint section.
In general, packages should just declare metrics at the file level in a big var block:
var (
MetricBinaryRequestHeadersParsed = metrics.AddCounter("binary_request_headers_parsed", nil)
MetricBinaryRequestHeadersBadMagic = metrics.AddCounter("binary_request_headers_bad_magic", nil)
MetricBinaryResponseHeadersParsed = metrics.AddCounter("binary_response_headers_parsed", nil)
MetricBinaryResponseHeadersBadMagic = metrics.AddCounter("binary_response_headers_bad_magic", nil)
)
The resulting tokens can then be used later in the code:
metrics.IncCounter(MetricBinaryRequestHeadersParsed)
There are several types of metrics:
- Counters (uint64 only)
- Gauges (uint64 and float64)
- Histograms (uint64 only)
Both types of gauges also have a callback version where the endpoint, while gathering metrics, calls a given function to retrieve the metric value rather than keeping the value internally. This supports things like the internal usage at Netflix, where we need to retrieve RocksDB metrics and cannot directly update them when they change. There is also a bulk callback where the user can return slices of integer and float metrics, but the usage is more complicated because the user must properly set up the tags.
The endpoint returns all of the tracked metrics every time it is called. The assumption is that an external process will poll the endpoint once a minute to retrieve the metrics. For this reason there is no sense of a time interval within the metrics package. At Netflix, a sidecar process reads from the endpoint once a minute to retrieve the latest data.
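Because the endpoint reports cumulative values with no time interval, whatever polls it has to difference successive readings itself. The following is a minimal sketch of that step, not the Netflix sidecar's actual logic. The delta function and the choice to treat a decreasing value as a process restart are assumptions made for illustration.

```go
package main

import "fmt"

// delta returns the per-interval increase of each counter between two polls.
// prev and cur map a metric name to its cumulative value at each poll.
func delta(prev, cur map[string]uint64) map[string]uint64 {
	out := make(map[string]uint64, len(cur))
	for name, c := range cur {
		p, seen := prev[name]
		switch {
		case !seen, c < p:
			// Either a new counter, or the process restarted and the
			// counter started again from zero (assumed behavior).
			out[name] = c
		default:
			out[name] = c - p
		}
	}
	return out
}

func main() {
	prev := map[string]uint64{"foo": 10}
	cur := map[string]uint64{"foo": 15}
	fmt.Println(delta(prev, cur)["foo"]) // prints 5
}
```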
The endpoint will return all of the tracked metrics as well as all of the memory and GC statistics returned from a call to runtime.ReadMemStats. The metrics are returned one per line, starting with the name, then the tags, then the value. The name and tags are joined by vertical bars (|). The name is always first but the tags have no ordering after that. Each tag name is separated from the tag value by an asterisk (*) character. The value of the metric is separated from the name and tags by a space. All metrics include the type and dataType tags, but may include more. The valid values for the type tag are counter and gauge. The valid values for the dataType tag are uint64 and float64.
There are a few places where extra tags are added based on context, such as percentiles for GC pause times and a special percentile tag for integration with Atlas. The output can be rather large, sometimes over 12k lines, so the system that reads it should probably filter out things like zero-valued counters and other unchanging metrics.
Here's some example output:
rend_alloc_mallocs|type*counter|dataType*uint64|size*32 3959
- Name: rend_alloc_mallocs
- Tags: type -> counter, dataType -> uint64, size -> 32
- Value: 3959
- Meaning: The number of allocations since the start of the program in the 32 byte allocation span class.
rend_mem_heap_alloc|type*gauge|dataType*uint64 20117504
- Name: rend_mem_heap_alloc
- Tags: type -> gauge, dataType -> uint64
- Value: 20117504
- Meaning: The total RAM allocated on the heap.
rend_gc_pause|type*gauge|dataType*float64|statistic*percentile90 454301
- Name: rend_gc_pause
- Tags: type -> gauge, dataType -> float64, statistic -> percentile90
- Value: 454301
- Meaning: The 90th percentile GC pause time (for recent GCs, see docs for detail).
rend_gc_gc_cpu_frac|type*gauge|dataType*float64 0.168597
- Name: rend_gc_gc_cpu_frac
- Tags: type -> gauge, dataType -> float64
- Value: 0.168597
- Meaning: The fraction of time spent in GC.
rend_bhist_get|type*counter|dataType*uint64|percentile*T002E 0
- Name: rend_bhist_get
- Tags: type -> counter, dataType -> uint64, percentile -> T002E
- Value: 0
- Meaning: A bucket in a bucketized histogram (a.k.a. a PercentileTimer from the Spectator library).
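The line format described above can be taken apart with a few string splits. This is a minimal sketch of a reader for that format, assuming the layout is exactly name, then tags joined by vertical bars, then a space and the value. The Metric type and ParseLine function are hypothetical names, not part of the metrics package. The value is kept as a string so the caller can parse it according to the dataType tag.

```go
package main

import (
	"fmt"
	"strings"
)

// Metric is a hypothetical container for one parsed line of endpoint output.
type Metric struct {
	Name  string
	Tags  map[string]string
	Value string // raw text; parse as uint64 or float64 based on Tags["dataType"]
}

// ParseLine splits a line of the form "name|key*value|key*value 123".
func ParseLine(line string) (Metric, error) {
	// The value is separated from the name and tags by a space.
	idx := strings.LastIndex(line, " ")
	if idx < 0 {
		return Metric{}, fmt.Errorf("no value in line %q", line)
	}

	// The name is always first; the tags follow in no particular order.
	parts := strings.Split(line[:idx], "|")
	m := Metric{Name: parts[0], Tags: make(map[string]string), Value: line[idx+1:]}
	for _, p := range parts[1:] {
		// Each tag name is separated from its value by an asterisk.
		kv := strings.SplitN(p, "*", 2)
		if len(kv) != 2 {
			return Metric{}, fmt.Errorf("bad tag %q in line %q", p, line)
		}
		m.Tags[kv[0]] = kv[1]
	}
	return m, nil
}

func main() {
	m, err := ParseLine("rend_alloc_mallocs|type*counter|dataType*uint64|size*32 3959")
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Name, m.Tags["type"], m.Tags["size"], m.Value)
}
```

A reader built this way can also drop zero-valued counters before forwarding data by checking Tags["type"] and Value.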
Counters are the basic workhorse of metrics. Each one is a simple atomic counter that increases monotonically. There is no API support for decrementing counters; a gauge is a better metric type if the value constantly goes up and down. Counter values are not time segmented in the output of the endpoint, so the aggregation system needs to do the time aggregation for you. This is a result of the way metrics are ingested within Netflix.
Counters can be registered with the metrics package by using the AddCounter method:
var MetricFoo = metrics.AddCounter("foo", nil)
And incremented using either the IncCounter or IncCounterBy methods:
metrics.IncCounter(MetricFoo)
metrics.IncCounterBy(MetricFoo, 4)
In the endpoint output they have the type tag set to counter and the dataType tag set to uint64. In the above example, the endpoint would return:
foo|type*counter|dataType*uint64 5
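To make the token model concrete, here is a minimal sketch of how token-based atomic counters could be built. It is an illustration under stated assumptions, not Rend's implementation: the fixed-size arrays, the maxCounters limit, and the lowercase function names are all hypothetical. It does show the two behaviors described above: updates are lock free, and registering the same name twice yields two different tokens.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const maxCounters = 1024 // assumed fixed capacity for this sketch

var (
	counterNames [maxCounters]string
	counterVals  [maxCounters]uint64
	numCounters  uint32
)

// addCounter records the name and returns a fresh token (an array index).
// It never looks up existing names, so the same name yields a new token.
func addCounter(name string) uint32 {
	id := atomic.AddUint32(&numCounters, 1) - 1
	if id >= maxCounters {
		panic("too many counters")
	}
	counterNames[id] = name
	return id
}

// incCounterBy atomically adds n to the counter behind the token.
func incCounterBy(id uint32, n uint64) { atomic.AddUint64(&counterVals[id], n) }

// getCounter atomically reads the current value, as the endpoint would.
func getCounter(id uint32) uint64 { return atomic.LoadUint64(&counterVals[id]) }

func main() {
	foo := addCounter("foo")
	incCounterBy(foo, 1)
	incCounterBy(foo, 4)
	fmt.Printf("%s|type*counter|dataType*uint64 %d\n", counterNames[foo], getCounter(foo))
	// prints: foo|type*counter|dataType*uint64 5
}
```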
Gauges are meant to track a resource over time that may vary up and down. The downside is that they are fairly low fidelity: you only get one sample per period, the last one. The metrics package provides both integer (uint64) and float (float64) gauges. Like counters, gauges have no concept of time segmentation; the latest value is always returned. Since there are two data types, it's important to note that they live in two separate spaces as far as the metrics package is concerned. Registering a metric "foo" for both float and int gauges will provide two separate identifiers that are not interchangeable.
Gauges can be registered with the metrics package by using the AddIntGauge and AddFloatGauge methods:
var (
MetricFooInt = metrics.AddIntGauge("foo_int", nil)
MetricFooFloat = metrics.AddFloatGauge("foo_float", nil)
)
And set using the SetIntGauge or SetFloatGauge methods as appropriate (make sure to use the right one or you can corrupt your metrics):
metrics.SetIntGauge(MetricFooInt, 4)
metrics.SetFloatGauge(MetricFooFloat, 0.4)
In the endpoint output they have the type tag set to gauge and the dataType tag set to uint64 for int gauges and float64 for float gauges. In the above example, the endpoint would return:
foo_int|type*gauge|dataType*uint64 4
foo_float|type*gauge|dataType*float64 0.4
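Go's sync/atomic package has no float64 operations, so one common way to keep a float gauge lock free is to store its IEEE 754 bit pattern in a uint64. The sketch below shows that technique only. Whether Rend stores float gauges this way is an assumption, and the variable and function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// floatGauge holds the IEEE 754 bits of the latest float64 value.
var floatGauge uint64

// setFloatGauge atomically replaces the gauge value.
func setFloatGauge(v float64) {
	atomic.StoreUint64(&floatGauge, math.Float64bits(v))
}

// getFloatGauge atomically reads the gauge value back as a float64.
func getFloatGauge() float64 {
	return math.Float64frombits(atomic.LoadUint64(&floatGauge))
}

func main() {
	setFloatGauge(0.4)
	fmt.Println(getFloatGauge()) // prints 0.4
}
```

With this kind of storage, reading the raw bits as an integer gives a meaningless number, which is one reason to keep integer and float setters strictly separate.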
Callback metrics report the same kinds of values as the counters and gauges above, so those descriptions still apply. A callback metric is a registered function that returns the current value to the metrics package when requested. Gauges are the only type directly supported with named functions, but other types of metrics can be added through the BulkCallback API.
Callback gauges can be registered with the RegisterIntGaugeCallback and RegisterFloatGaugeCallback functions:
metrics.RegisterIntGaugeCallback("foo_int_cb", nil, func() uint64 { return 4 })
metrics.RegisterFloatGaugeCallback("foo_float_cb", nil, func() float64 { return 0.4 })
The functions passed will be called every time the endpoint is called. The functions should close over whatever they need to access the proper information. There are no methods for setting these gauges separately from the callbacks being called.
In the endpoint output they have the type tag set to gauge and the dataType tag set to uint64 for int gauges and float64 for float gauges. In the above example, the endpoint would return:
foo_int_cb|type*gauge|dataType*uint64 4
foo_float_cb|type*gauge|dataType*float64 0.4
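The point about closing over state can be illustrated with a self-contained sketch. The registry below is a stand-in written for this example, not the metrics package itself, and connPool is a hypothetical resource. The local registerIntGaugeCallback mimics the shape of RegisterIntGaugeCallback without the tags parameter.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// callbacks is a stand-in for the package's internal registry (hypothetical).
var callbacks = map[string]func() uint64{}

// registerIntGaugeCallback stores a function to be called at collection time.
func registerIntGaugeCallback(name string, cb func() uint64) {
	callbacks[name] = cb
}

// collect plays the role of the endpoint: it invokes every callback once.
func collect() map[string]uint64 {
	out := make(map[string]uint64, len(callbacks))
	for name, cb := range callbacks {
		out[name] = cb()
	}
	return out
}

// connPool is a hypothetical resource whose state lives outside the metrics code.
type connPool struct{ open uint64 }

func (p *connPool) acquire()          { atomic.AddUint64(&p.open, 1) }
func (p *connPool) release()          { atomic.AddUint64(&p.open, ^uint64(0)) } // adds -1
func (p *connPool) openCount() uint64 { return atomic.LoadUint64(&p.open) }

func main() {
	pool := &connPool{}
	// The callback closes over pool, so it reports the value at collection time.
	registerIntGaugeCallback("open_conns", func() uint64 { return pool.openCount() })

	pool.acquire()
	pool.acquire()
	pool.release()
	fmt.Println(collect()["open_conns"]) // prints 1
}
```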
The BulkCallback API is used when there are many metrics to add all at once, which would be either too burdensome to do as separate functions or simply more efficient to batch. The callback function type returns slices of IntMetric and FloatMetric, which are composite types that already contain the tags. This means that the user of this API is expected to properly tag things as counters or gauges and with the proper data type. The tag keys for type and data type are available as metrics.TagMetricType and metrics.TagDataType, respectively. There are also constants for the counter and gauge types as well as the uint64 and float64 data types: metrics.MetricTypeCounter, metrics.MetricTypeGauge, metrics.DataTypeUint64, and metrics.DataTypeFloat64, respectively.
The bulk API is more complex to use and has less hand holding, so it is recommended to use the regular counters and gauges or callback gauges as needed.
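As a rough picture of what a bulk callback has to produce, here is a self-contained sketch. The types and constants are redeclared locally so the example compiles on its own. The constant names come from the text above and their values from the endpoint output, but the struct field names (Name, Val, Tgs), the dbStats function, and the metric names are assumptions for illustration and should be checked against the package source.

```go
package main

import "fmt"

// Tags mirrors metrics.Tags, which is a map[string]string.
type Tags map[string]string

// Local copies of the tag keys and values seen in the endpoint output.
const (
	TagMetricType     = "type"
	TagDataType       = "dataType"
	MetricTypeCounter = "counter"
	MetricTypeGauge   = "gauge"
	DataTypeUint64    = "uint64"
	DataTypeFloat64   = "float64"
)

// IntMetric and FloatMetric carry their own tags (field names assumed).
type IntMetric struct {
	Name string
	Val  uint64
	Tgs  Tags
}

type FloatMetric struct {
	Name string
	Val  float64
	Tgs  Tags
}

// dbStats is a hypothetical bulk callback. The caller is responsible for
// tagging every metric with the correct type and data type.
func dbStats() ([]IntMetric, []FloatMetric) {
	ints := []IntMetric{{
		Name: "db_keys_written",
		Val:  1200,
		Tgs:  Tags{TagMetricType: MetricTypeCounter, TagDataType: DataTypeUint64},
	}}
	floats := []FloatMetric{{
		Name: "db_cache_hit_ratio",
		Val:  0.93,
		Tgs:  Tags{TagMetricType: MetricTypeGauge, TagDataType: DataTypeFloat64},
	}}
	return ints, floats
}

func main() {
	ints, floats := dbStats()
	for _, m := range ints {
		fmt.Println(m.Name, m.Tgs[TagMetricType], m.Tgs[TagDataType], m.Val)
	}
	for _, m := range floats {
		fmt.Println(m.Name, m.Tgs[TagMetricType], m.Tgs[TagDataType], m.Val)
	}
}
```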
The histograms are primarily meant for recording timing information, though they can be used for anything else that exhibits a distribution. Timing information in Rend is gathered by using the timer package, which predates the changes in Go 1.8 that enabled accurate timing. The histograms in Rend capture the timing distribution for all of the different kinds of requests, including ones to the backend L1 and L2 caches.
There are two types of histograms that are simultaneously maintained behind the same API: a locally significant histogram that stores the raw samples to get exact percentiles for the process, and a collectively significant bucketized histogram that matches the Spectator PercentileTimer's buckets. The latter allows aggregate percentiles across clusters with a small error. The Atlas system does this for us internally (see: https://github.com/Netflix/atlas/wiki/math-percentiles).
Internally, the histograms contain two buffers that flip every time the histograms are queried by the endpoint. This allows the endpoint to do its processing while blocking observations from being made for the minimal amount of time.
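The buffer flip can be sketched as follows. This is a simplified illustration of the general double-buffering idea, assuming a mutex guards the swap. The flipHist type and its methods are hypothetical, and Rend's real histograms may synchronize differently.

```go
package main

import (
	"fmt"
	"sync"
)

// flipHist keeps two sample buffers and swaps them when queried.
type flipHist struct {
	mu     sync.Mutex
	active []uint64 // receives new observations
	spare  []uint64 // the buffer handed out by the previous flip
}

// observe records one sample in the active buffer.
func (h *flipHist) observe(v uint64) {
	h.mu.Lock()
	h.active = append(h.active, v)
	h.mu.Unlock()
}

// flip swaps the buffers and returns the filled one. The lock is held only
// for the swap, so observers are blocked for a very short time. The caller
// must finish with the returned slice before the next flip, because its
// backing array is reused after that.
func (h *flipHist) flip() []uint64 {
	h.mu.Lock()
	full := h.active
	h.active = h.spare[:0]
	h.spare = full
	h.mu.Unlock()
	return full
}

func main() {
	h := &flipHist{}
	h.observe(40)
	h.observe(42)
	samples := h.flip() // the endpoint processes samples without holding the lock
	fmt.Println(len(samples), samples[0], samples[1]) // prints 2 40 42
}
```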
Histograms can be registered using the AddHistogram method:
var HistFoo = metrics.AddHistogram("foo", false, nil)
The first parameter is the name of the histogram, the second is whether or not the histogram is sampled (in which case one in four observations is kept), and the last is any custom tags to be added.
Observations can be added to the histogram using the ObserveHist method:
metrics.ObserveHist(HistFoo, 40)
The histogram output contains groups of integer counters for the different supported percentiles, which are every 5% from 0 to 100% as well as 99th and 99.9th percentiles. It also includes a count of the observations submitted to the histogram in the period as well as the count of the observations kept.
Regular histogram example output:
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile0 0
rend_hist_get_l1|type*counter|statistic*percentile5|dataType*uint64 758
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile10 768
rend_hist_get_l1|statistic*percentile15|dataType*uint64|type*counter 778
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile20 793
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile25 820
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile30 835
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile35 846
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile40 856
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile45 868
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile50 883
rend_hist_get_l1|type*counter|dataType*uint64|statistic*percentile55 914
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile60 984
rend_hist_get_l1|type*counter|statistic*percentile65|dataType*uint64 1000
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile70 1203
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile75 1326
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile80 1397
rend_hist_get_l1|statistic*percentile85|dataType*uint64|type*counter 1458
rend_hist_get_l1|type*counter|dataType*uint64|statistic*percentile90 1534
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile95 1744
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile100 56389
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile99 3114
rend_hist_get_l1|dataType*uint64|type*counter|statistic*percentile99.9 4721
rend_hist_get_l1|type*counter|dataType*uint64|statistic*count 18650819
rend_hist_get_l1|dataType*uint64|type*counter|statistic*kept 18650819
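For intuition about where these percentile values come from, the sketch below computes the same set of statistics from raw samples using the nearest-rank method. The method choice and the function names are assumptions made for illustration, and Rend may compute its percentiles differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile p (0 to 100) of sorted samples.
func percentile(sorted []uint64, p float64) uint64 {
	if len(sorted) == 0 {
		return 0
	}
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1 // percentile 0 maps to the smallest sample
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// summarize produces one entry per supported statistic tag value.
func summarize(samples []uint64) map[string]uint64 {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]uint64(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	out := make(map[string]uint64)
	for p := 0; p <= 100; p += 5 {
		out[fmt.Sprintf("percentile%d", p)] = percentile(sorted, float64(p))
	}
	out["percentile99"] = percentile(sorted, 99)
	out["percentile99.9"] = percentile(sorted, 99.9)
	return out
}

func main() {
	stats := summarize([]uint64{5, 1, 3})
	fmt.Println(stats["percentile0"], stats["percentile50"], stats["percentile100"])
	// prints 1 3 5
}
```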
The bucketized histogram output is a set of 276 counters that represent the predetermined buckets of the PercentileTimer. The percentile tag is the tag required by the Spectator library to match against the buckets.
Bucketized hist example output:
rend_bhist_set_l1|dataType*uint64|type*counter|percentile*T0000 0
rend_bhist_set_l1|type*counter|dataType*uint64|percentile*T0001 0
rend_bhist_set_l1|type*counter|dataType*uint64|percentile*T0002 0
rend_bhist_set_l1|type*counter|dataType*uint64|percentile*T0003 0
...
rend_bhist_set_l1|type*counter|dataType*uint64|percentile*T0111 0
rend_bhist_set_l1|type*counter|dataType*uint64|percentile*T0112 0
rend_bhist_set_l1|type*counter|dataType*uint64|percentile*T0113 0