Wednesday, October 30, 2013

Collectd + Bucky, lessons learned

I recently set out to have collectd capture metrics and pass them off to a centralized bucky instance. Since I use CentOS 6.*, I installed collectd 4.10.9-1 from the EPEL repo onto a test machine for sending metrics. This was the last stable release in the 4 series. I then installed python-bucky 0.2.6-3 also from the EPEL repo, and collectd 4.10.9-1 onto a test machine for storing metrics. I then connected the two machines together using the network plugin. Together things were harmonious. Metrics were collected and stored just the way I wanted them.

The trouble began when I wanted to use a feature of collectd found only in the 5 series. I built collectd 5.4 from source, packaged it nicely into RPMs, put the RPMs in my local yum repo, and set off upgrading collectd on my test machine for sending metrics. I then modified the collectd *.conf files to account for the new version. Collectd started up smoothly and we were off to the races.

Watching bucky turned up some problems. It seemed that collectd wasn't sending metrics for certain plugins. I had configured collectd to send metrics for 7 plugins, yet I was only receiving about 4 of them. It also seemed like the metrics bucky did receive were arriving at seemingly random intervals. This sent me into a tizzy.

I was stumped. Did I mess up the compile? Did I miss something in the RPM build? I went over those steps again to see if I had forgotten anything, but I couldn't find a thing. I popped onto IRC and asked if anyone had had troubles like this before. Not much help there, but a key point came out that set me on the right path. Someone suggested using the CSV plugin rather than the network plugin to see if the collectd collectors were indeed working correctly. This proved that the collectors were in fact working and collecting metrics properly. So what could it be?

I fired up tcpdump on both machines to see if the packets were being sent and received. They were. I then changed my tcpdump strategy so that it would decode the UDP packets. I could see all the metrics being sent. I noticed that the order of the metrics in the packets changed with each packet sent. Bucky would read the packet, process metrics until it came across one of the "bad metrics", and then cease processing the payload. When the packet payload started with a "bad metric", nothing would be processed and it would look like bucky never received anything.
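For anyone who wants to look inside those UDP payloads without squinting at tcpdump output, here is a minimal Python sketch of walking the parts of a collectd binary protocol packet. It assumes the documented layout of a 2-byte part type followed by a 2-byte length (which includes the 4-byte header), both big-endian. The names `iter_parts`, `summarize` and `string_part` are my own for illustration. This is not bucky's parser, and it only labels parts rather than decoding the values.

```python
import struct

# Part type codes from the collectd binary protocol documentation.
PART_NAMES = {
    0x0000: "host",
    0x0001: "time",
    0x0002: "plugin",
    0x0003: "plugin_instance",
    0x0004: "type",
    0x0005: "type_instance",
    0x0006: "values",
    0x0007: "interval",
    0x0008: "time_hr",
    0x0009: "interval_hr",
}
# Parts whose body is a NUL-terminated string.
STRING_PARTS = {0x0000, 0x0002, 0x0003, 0x0004, 0x0005}


def iter_parts(payload):
    """Yield (part_type, body_bytes) for each part in one UDP payload."""
    offset = 0
    while offset + 4 <= len(payload):
        ptype, plen = struct.unpack_from("!HH", payload, offset)
        # The length field counts the 4 header bytes as well.
        if plen < 4 or offset + plen > len(payload):
            raise ValueError("malformed part at offset %d" % offset)
        yield ptype, payload[offset + 4:offset + plen]
        offset += plen


def summarize(payload):
    """Return (part_name, text) pairs; non-string parts show only their size."""
    out = []
    for ptype, body in iter_parts(payload):
        name = PART_NAMES.get(ptype, "unknown_0x%04x" % ptype)
        if ptype in STRING_PARTS:
            text = body.rstrip(b"\x00").decode("ascii", "replace")
        else:
            text = "%d bytes" % len(body)
        out.append((name, text))
    return out


def string_part(ptype, text):
    """Build one string part (hypothetical helper, used only for the demo)."""
    data = text.encode("ascii") + b"\x00"
    return struct.pack("!HH", ptype, 4 + len(data)) + data


# A hand-built payload standing in for a captured packet.
sample = (
    string_part(0x0000, "web01")
    + string_part(0x0002, "cpu")
    + string_part(0x0004, "cpu")
)
for name, text in summarize(sample):
    print(name, text)
```

Because each "type" part names an entry that the receiver has to look up before it can interpret the "values" part that follows, a listing like this makes it easy to see which type names show up right before processing stops.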

I started poring over bucky's code, thinking that perhaps the collectd network binary protocol had changed and bucky simply hadn't been updated to handle it. After some time, I came to the conclusion that, no, bucky was fine. Its protocol decoding was good. What the heck else could it be?

Then it hit me. While looking at the bucky source code, I saw a reference to the collectd types.db, which comes with collectd. Types.db is a specification for the metrics collected by collectd. It basically describes each metric and how its data should be handled.
Well, bucky was installed on a machine that had collectd 4.10.9-1 installed, not the newer 5.4. I opened up the types.db for both 4.10.9-1 and for 5.4 and saw that the specification for the metrics I was losing, the "bad metrics", had changed. Doh!
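
I compared the two files by eye, but the same check is easy to script. Below is a rough Python sketch that parses two types.db texts and reports which type definitions differ. It assumes the usual types.db line layout of a type name followed by one or more `ds_name:ds_type:min:max` fields. The function names and the sample entries are hypothetical, made up for illustration, and are not copied from either collectd release.

```python
def parse_types_db(text):
    """Map each type name to its list of (ds_name, ds_type, min, max) tuples."""
    types = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a name with no data sources; ignore it
        name, spec = parts
        sources = []
        # Data sources are separated by commas and/or whitespace.
        for field in spec.replace(",", " ").split():
            pieces = field.split(":")
            if len(pieces) == 4:
                sources.append(tuple(pieces))
        types[name] = sources
    return types


def diff_types(old, new):
    """Return (changed, only_in_old, only_in_new) as sorted lists of names."""
    changed = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    only_old = sorted(old.keys() - new.keys())
    only_new = sorted(new.keys() - old.keys())
    return changed, only_old, only_new


# Hypothetical entries standing in for the two real files.
OLD = """
# older install
example_counter  value:COUNTER:0:4294967295
example_gauge    value:GAUGE:0:U
example_old      rx:COUNTER:0:U, tx:COUNTER:0:U
"""
NEW = """
# newer install
example_counter  value:DERIVE:0:U
example_gauge    value:GAUGE:0:U
example_new      value:GAUGE:0:100
"""

changed, only_old, only_new = diff_types(parse_types_db(OLD), parse_types_db(NEW))
print("changed:", changed)
print("only in old:", only_old)
print("only in new:", only_new)
```

To run it against real installs, read each file with `open(path).read()` and pass the text to `parse_types_db`; the path depends on where each collectd package puts its types.db. Any name in the "changed" list is a candidate "bad metric".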

I upgraded collectd on the test machine for storing metrics, which upgraded the types.db file, and then restarted bucky so that it could read the new types.db. Voilà! Metrics started to pour in again!

Lesson learned. Bucky is highly dependent on the version of collectd, both the one installed on the machines sending metrics and the local install whose types.db it reads. They should (really, must) match; otherwise some metrics could be misinterpreted or, worse, skipped entirely.

Hopefully this post will help someone having the same problem!