Moving a large-scale metrics pipeline from StatsD to OpenTelemetry / Prometheus
Posted by jmarbach 1 day ago
Full disclosure - I formerly worked for Grafana Labs.
The size of this Grafana Mimir deployment would rank it in the top echelon of customers. The irony is that this may be a $0 revenue user for Grafana Labs.
Comments
Comment by dig1 1 day ago
Why is that ironic? Since Mimir is open-source, $0 revenue users are expected. AFAIK, Grafana Labs relies heavily on Go, TypeScript, and Linux, without necessarily being their top financial contributor. They could have kept Mimir proprietary like Splunk, but whether that would have attracted the same level of adoption or community contribution is another matter.
Comment by camel_gopher 1 day ago
Comment by skrtskrt 21 hours ago
It’s to retain customers that grew big enough on Grafana Cloud to justify having their own in-house team run the tools instead. So Grafana offers them pricing under which Grafana engineers operate the platform within the customer’s cloud account. Very large customers still don’t have to operate the platform or build/hire for the expertise, and they save some money.
Sure, some companies are big enough to make it worth it and still want to run their own OSS observability stack, but it’s generally not going to be popular with executive decision-makers, so it will likely remain rare. And if they do run it, Grafana still benefits from their contributions to AGPL code.
On the low-spending end, OSS users not buying cloud would not really be a serious revenue concern. They just don’t spend enough. You use cloud if you have super broad product usage, so you don’t have to run and maintain Grafana, Mimir, Loki, Tempo, Pyroscope, k6, etc. all yourself. If you don’t want or need all that, you run Loki+Grafana yourself and enjoy.
Comment by codeduck 1 day ago
I have used Prometheus a lot. Reliable is not a word I would associate with it.
Comment by pahae 1 day ago
Both Prom and VM are exceptionally stable in my opinion, even at _very_ large scale. There were times when I had a single, not overly large instance (Prom, later VM) scraping 2 million samples/s without any issues, on top of fairly spiky query loads.
However, if something does go wrong, the single most impactful difference between VM and Prom is simply the difference in startup time. Prometheus with 2TB of metrics takes _forever_ to start up. We're talking up to 2 hours on SSD while VM just... starts.
Comment by porridgeraisin 1 day ago
Comment by Serhii-Set 1 day ago
Comment by hagen1778 1 day ago
Comment by codeduck 1 day ago
Comment by Serhii-Set 1 day ago
Comment by blueybingo 1 day ago
Comment by valyala 21 hours ago
Comment by hagen1778 21 hours ago
Comment by jameson 1 day ago
Comment by valyala 21 hours ago
Comment by awoimbee 1 day ago
Comment by igor47 1 day ago
Comment by zbentley 1 day ago
That's a very professional way of saying "Wait, everyone just lives with this? What the fuck?!"
Many such cases in the Prometheus ecosystem.
Comment by fgfhf 1 day ago