
Time-series Histograms with Rothko - Metrics Collection for Large Deployments

Metrics?

We’re excited to announce our new metrics collection system, Rothko!

Metrics, logs, traces, telemetry, observability, monitoring, oh my! If you’ve been paying attention to how engineering organizations have increasingly been instrumenting their software to understand what is working and what isn’t at scale, you’ll have heard some of these terms. At the forefront is the idea that you can’t improve what you don’t measure, and measuring how your software is doing in the field is vital to improving it.

Many engineering organizations start with log aggregation, which collects logs from your systems and puts them into some sort of log search tool to try and make sense of them. But an increasing number of organizations have moved to metrics. Metrics are generally counters and stats about software that are easier than logs to aggregate, to take quantiles and medians of, to check for outliers, and so on.

Systems report metrics, and operators store these metrics over time in time-series graphs.

[Figure: An example time-series graph of whole-system data transfer rates over time]

If you’ve heard of Graphite, InfluxDB, Prometheus, Atlas, etc., these are common tools for displaying and querying metrics over time (the graph above was generated with the Graphite frontend Grafana). These are all great systems, but a key rule for scaling any metric reporting system is sampling: at a certain point, it is impossible to save everything, and so you must choose what to save and what to throw away in a hopefully statistically bias-free way.

If you have millions of devices reporting metrics, with millions of metrics coming in every second, it gets expensive really quickly to keep track of all of those metrics separately. Graphite, for example, lets you choose the time window for sampling, but the fundamental data model assumes you want to store data for every reporting service. Getting Graphite to scale to millions of reporting services means selecting some (maybe a lot of) metrics to simply ignore. If you’re coming from a cloud-only environment, imagine your metrics system was collecting time-series data from millions of servers!

Here at Vivint, we’re on the front lines of putting devices into people’s homes. We have a massive number of devices, and at the scale we operate, measuring how our devices are behaving comes with new challenges. We weren’t satisfied with the trade-off of deciding to ignore some devices and sought another way. What we want are time-series graphs of the overall distribution of each metric across reporting devices.

To understand Rothko, first we need to take a quick detour to explain quantiles. If you’ve heard of quartiles, percentiles, etc., these are all quantiles. A quantile is a way of describing what fraction of nodes have a certain value range.

Let’s say you’re interested in 10 service requests and you kept track of how long each request took. Perhaps you want some statistic to try and make sense of these values, so you start by averaging them.
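To make the quantile idea concrete, here is a minimal Python sketch comparing the average of 10 request durations with a few quantiles of the same data. The durations are made up for illustration, and the `percentile` helper is a hypothetical nearest-rank implementation, not anything taken from Rothko itself.

```python
import statistics

# Hypothetical durations (in milliseconds) for 10 service requests.
# Nine are fast and one is a slow outlier.
durations_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]

# The average is pulled far upward by the single slow request.
mean_ms = statistics.mean(durations_ms)


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the values are less than or equal to it.

    p is an integer from 1 to 100. Integer arithmetic avoids
    floating-point rounding when computing the rank.
    """
    ordered = sorted(values)
    # Ceiling of p * n / 100, written with floor division.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


print("mean:", mean_ms)
print("p50 (median):", percentile(durations_ms, 50))
print("p90:", percentile(durations_ms, 90))
print("p100 (max):", percentile(durations_ms, 100))
```

With this data, the mean is 37 ms even though nine of the ten requests finished in 16 ms or less. The median and 90th percentile describe the typical request, while the maximum exposes the outlier, which is the kind of information an average hides.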
