Wednesday, October 30, 2013

Collectd + Bucky, lessons learned

I recently set out to have collectd capture metrics and pass them off to a centralized bucky instance. Since I use CentOS 6.*, I installed collectd 4.10.9-1 from the EPEL repo onto a test machine for sending metrics; this was the last stable release in the 4 series. I then installed python-bucky 0.2.6-3, also from the EPEL repo, along with collectd 4.10.9-1, onto a test machine for storing metrics, and connected the two machines together using the network plugin. Things were harmonious: metrics were collected and stored just the way I wanted them.
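For reference, the wiring on the sending side looked something like the snippet below (the hostname is a placeholder; 25826 is collectd's default network port, and it's also what bucky listens on out of the box):

# collectd.conf on the sending machine
LoadPlugin network
<Plugin network>
  Server "metrics.example.com" "25826"
</Plugin>

On the storing machine, bucky itself is the listener, so collectd there only needs its local collection plugins configured.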

The trouble began when I wanted to use a feature of collectd found only in the 5 series. I built collectd 5.4 from source, packaged it nicely into RPMs, put the RPMs in my local yum repo, and set off upgrading collectd on my test machine for sending metrics. I then modified the collectd *.conf files to account for the new version. Collectd started up smoothly, and we were off to the races.

Watching bucky turned up some problems. It seemed that collectd wasn't sending metrics for certain plugins: I had configured collectd to send metrics for 7 plugins, yet I was only receiving about 4 of them. It also seemed like the metrics bucky did receive arrived at seemingly random intervals. This sent me into a tizzy.

I was stumped. Did I mess up the compile? Did I miss something in the RPM build? I went over those steps again to see if I had forgotten anything, but couldn't find a thing. I popped onto IRC and asked if anyone had run into troubles like this before. Not much help there, but a key point came out that set me on the right path: someone suggested using the CSV plugin rather than the network plugin to see if the collectd collectors were indeed working correctly. This proved that the collectors were in fact working and collecting metrics properly. So what could it be?
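If you want to try the same trick, a minimal CSV plugin stanza looks like this (the DataDir path is just an example):

LoadPlugin csv
<Plugin csv>
  DataDir "/var/lib/collectd/csv"
</Plugin>

Each collector then writes its values to plain CSV files under that directory, where you can eyeball them directly.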

I fired up tcpdump on both machines to see if the packets were being sent and received. They were. I then changed my tcpdump strategy so that it would decode the UDP packets, and I could see all the metrics being sent. I noticed that the order of the metrics in the packets changed with each packet sent. Bucky would read a packet, find metrics until it came across one of the "bad metrics", and then cease processing the payload. When the packet payload started with a "bad metric", nothing would be processed at all, and it would look like bucky had never received anything.
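For the curious, the progression was roughly this (the interface name is an assumption; 25826 is collectd's default port):

tcpdump -i eth0 udp port 25826        # are packets flowing at all?
tcpdump -i eth0 -X udp port 25826     # dump payloads in hex/ASCII so the metric names are visible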

I started poring over bucky's code, thinking that perhaps the collectd network binary protocol had changed and bucky simply hadn't been updated to handle it. After some time, I came to the conclusion that, no, bucky was fine. Its protocol decoding was good. What the heck else could it be?

Then it hit me. While looking at the bucky source code, I saw a reference to the types.db file which ships with collectd. types.db is a specification for the metrics collected by collectd: it describes each metric and how its data should be handled.
Well, bucky was installed on a machine that had collectd 4.10.9-1 installed, not the newer 5.4. I opened up the types.db for both 4.10.9-1 and 5.4 and saw that the specification for the metrics I was losing, the "bad metrics", had changed. Doh!
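For context, each line in types.db maps a type name to one or more data sources of the form name:TYPE:min:max. The entry below is illustrative only; check the files shipped with your versions for the real definitions:

load    shortterm:GAUGE:0:5000, midterm:GAUGE:0:5000, longterm:GAUGE:0:5000

When the sender and receiver disagree about a type's definition, a receiver validating packets against the older types.db can choke on the types it doesn't recognize, which matches the behavior I was seeing.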

I upgraded collectd on the test machine for storing metrics, which upgraded the types.db file, then restarted bucky so that it could read the new types.db. Voilà! Metrics started to pour in again!
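In shell terms it boiled down to something like this (the bucky service name is an assumption; adjust for however you run it):

yum upgrade collectd
service bucky restart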

Lesson learned: bucky is highly dependent on the version of collectd at both ends, the one sending metrics from elsewhere and the one installed locally, whose types.db it reads. The versions must match; otherwise some metrics can be misinterpreted or, worse, skipped entirely.

Hopefully this post will help someone having the same problem!

Graphite Logins, part of the story

There seems to be a lack of information, or rather, a disconnect in the information available regarding graphite and graphite logins.

The problem, it seems to me, is that graphite indicates it has its own logins but doesn't seem to give you the ability to create them. During installation you do indeed create one superuser, and you can indeed log in with it, but how do you create logins for anyone other than the admin?

Enter Django. I knew graphite was built on top of Django; however, having no prior experience with Django, I had no idea how they were coupled and what Django brought to the table. Well, it turns out that graphite, being a Django app, has an admin panel. You can find the panel located at:
http://www.example.com/graphite/admin/
or some variation on the above URL; the key part is the /admin/ suffix. To get into this panel for the first time, you can use the superuser you created at graphite installation time. From there it's just a few simple clicks to add groups and users. It also gives you handy access to saved dashboards in case you want to hand-modify them.

If you've forgotten the master user/pass you created at installation time, no worries, we can create a new one. For this, though, we'll need to reach for the command line (if there is another way of doing this, please let me know). First we need to locate the Django management script called manage.py. I'm not going to presume to know how you installed graphite; you could have installed it from source, or, like me, from a repo. What I did was run the command:
locate manage.py
and picked out the manage.py that pertained to graphite. In my case:
/usr/lib/python2.6/site-packages/graphite/manage.py
If you followed the default installation guide from the graphite docs, it might be located at:
/opt/graphite/webapp/graphite/manage.py
Now that we've found the correct manage.py, we can go ahead and create a superuser via the command:
python manage.py createsuperuser
This will prompt you for username, email, and password. Once you are done, you'll have a new superuser that you can use to log into the graphite admin panel outlined above.
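A session looks roughly like this (the username and email are placeholders, and the exact prompts vary a little between Django versions):

$ python /usr/lib/python2.6/site-packages/graphite/manage.py createsuperuser
Username: admin2
E-mail address: admin2@example.com
Password:
Password (again):
Superuser created successfully.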


Further Reading:
The Django project has loads of good documentation; the most relevant to this post is the Django Authentication System documentation, where you can find out about creating users via the Python API, and a whole lot more.
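As a taste of that Python API, here is a minimal sketch of creating a regular user from the Django shell (the username, email, and password are placeholders; note the staff flag, which Django requires for admin-panel access):

python manage.py shell
>>> from django.contrib.auth.models import User
>>> user = User.objects.create_user('jane', 'jane@example.com', 's3cret')  # creates and saves the user
>>> user.is_staff = True   # required to log in to the /admin/ panel
>>> user.save()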