SIOS-metrics being updated soon with our process table sampler
By joe
I needed to look at the processes on a machine I’d been spending time debugging: what was running, what state each process was in, the allocations, the IO, etc. Something was causing a hard panic, and it seemed correlated with an application issue. I didn’t have a process space sampler, so I wrote one. It takes one sample per second right now (configurable) across the whole process space, and normally uses about 1% CPU. I filter out a number of things I don’t care about (kernel threads and related worker bits).

Using this, I was able to get the aggregate memory for each type of application, along with a once-a-second play-by-play of attempted allocations, VmPeak/VmSize numbers, IO by each process, etc. This is very illuminating.

And, as with anything that generates quite a bit of output, I caught some interesting bugs in SIOS-metrics itself, and fixed them (well, mostly). One of the major ones was the way I grabbed the output of persistent metric code. I had created a simple sync frame boundary that made detecting the last output in a stream very easy. But … with the amount of data I was generating, my sampling rate was such that I might copy over only a partial buffer (not a problem for smaller IO output). I noticed this after some of the metric lines were truncated, only to have the rest of the line show up under the next time stamp. So I put an end marker in, and between the sync and end markers I have a data frame of arbitrary size. I guess I could simplify my parsing code even more by computing the size and putting it in the sync and end markers. But this would complicate the metrics a bit, and I want the metric side to be as lightweight as possible.

Once I fixed that, I was able to fix a few other bugs as well. Expect a commit on that later tomorrow.

This said, I caught an interesting er … feature … in influxdb on aggregation queries as you downsample.
If you use a sum query over a larger range, it will sum everything in the smallest interval and present that back as the result. So it won’t be a sum over the time range, or an average of the sums over the smallest interval. That makes graphs based on it problematic at best.

I am looking at using kdb+ for this (32 bit version), with splayed tables for the storage (inbound data upserted to permanent storage), and a separate query engine using those files to drive the graphs in grafana. Grafana 3 is coming out soon with nice support for adding additional data sources, so my plan is to get a simple feed going from SIOS-metrics directly into kdb+ (this part shouldn’t be too painful), and then work out the grafana logic to talk to kdb+. It seems the grafana folk are quite interested in this themselves, so hopefully I can get something going quickly in my “free” time.
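
For the curious, the sync/end marker framing described above can be sketched roughly like this. It is a minimal Python sketch and not the actual SIOS-metrics code: the marker strings `#### sync` and `#### end` and the `FrameAssembler` name are made up for illustration, and it assumes the markers never appear inside the metric data itself. The idea is just that a partial buffer stays in memory until its end marker shows up, so a truncated line never leaks into the wrong time stamp.

```python
SYNC = "#### sync"  # hypothetical start-of-frame marker
END = "#### end"    # hypothetical end-of-frame marker


class FrameAssembler:
    """Buffers chunks of metric output and hands back only complete frames."""

    def __init__(self, sync=SYNC, end=END):
        self.sync = sync
        self.end = end
        self.buffer = ""

    def feed(self, chunk):
        """Add a chunk (possibly a partial buffer) and return the complete
        frames seen so far, each as a list of the lines between the markers."""
        self.buffer += chunk
        frames = []
        while True:
            start = self.buffer.find(self.sync)
            if start == -1:
                # No sync marker yet. Keep only a short tail in case the
                # marker itself was split across two chunks.
                keep = len(self.sync) - 1
                self.buffer = self.buffer[-keep:] if keep else ""
                break
            stop = self.buffer.find(self.end, start + len(self.sync))
            if stop == -1:
                # Frame started but not finished: drop anything before the
                # sync marker and wait for the rest to arrive.
                self.buffer = self.buffer[start:]
                break
            body = self.buffer[start + len(self.sync):stop]
            frames.append([line for line in body.splitlines() if line.strip()])
            self.buffer = self.buffer[stop + len(self.end):]
        return frames


# A metric line split across two reads is only reported once it is whole.
asm = FrameAssembler()
first = asm.feed("#### sync\ncpu 1\nmem 2")   # frame incomplete, nothing yet
second = asm.feed("048\n#### end\n")          # frame completed by this chunk
print(first)    # []
print(second)   # [['cpu 1', 'mem 2048']]
```

Putting a length into the markers would let the reader skip the search for the end marker, but as noted above that pushes more work onto the metric side.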