Chris's Blog

Devops Shokunin

Using Open Source to Provide Infrastructure Services


Operations teams need to provide eight critical services to the developers and users of their environment. At my current employer, I use open source software to provide these services, allowing our developers to be more productive and our customers to experience stable, responsive service.


Source Code Management

Keep all of our bespoke software, configurations and notes under strict version control.

Software: Git
Pros: Fast
Stable
Many developers familiar with it due to GitHub’s popularity
Cons: Steep learning curve
Somewhat cryptic commands
Option: Subversion

Continuous Integration

Build, test, version and package our software so that it may be quickly and safely deployed to our staging environment

Software: Jenkins
Pros: Easy integration with Git
Nice GUI
Flexible enough to meet our needs
Cons: Configuration limited to GUI
Written in Java*
Option: Cruise Control


Provisioning

Spin up nodes to become part of the processing farm and decommission nodes no longer required (a minimal Fog sketch follows the list below)

Software: Custom scripts using Fog
Pros: Simple scripts
Easy to customize
Support multiple cloud providers
Cons: Custom tool
Options: Cobbler, RightAWS
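
As a rough illustration of what those custom scripts look like, here is a minimal Fog sketch that boots a worker node on AWS; the AMI, keypair and tag are placeholders, not our real values:

    #!/usr/bin/env ruby
    require 'fog'

    # Connect to the cloud provider (credentials come from the environment)
    compute = Fog::Compute.new(
      :provider              => 'AWS',
      :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
    )

    # Boot a new node for the processing farm
    server = compute.servers.create(
      :image_id  => 'ami-00000000',          # placeholder AMI
      :flavor_id => 'm1.small',
      :key_name  => 'ops-key',               # placeholder keypair
      :tags      => { 'role' => 'worker' }
    )

    server.wait_for { ready? }
    puts "Node #{server.id} is up at #{server.public_ip_address}"

    # Decommissioning is the reverse:
    # compute.servers.get(server.id).destroy

Because Fog abstracts the provider behind one interface, pointing the same script at a different supported cloud is mostly a matter of changing the :provider and credential options.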


Configuration Management

Ensure that all nodes are automatically and correctly configured and remain in a known configured state

Software: Puppet
Pros: Easy configuration language
Well supported
Active community
Cons: Have to learn said configuration language
Requires serious investment of time
Option: Chef


Monitoring

Check on services and nodes to ensure that things are behaving as expected before the customer notices


Software: Icinga
Pros: Can be easily auto-configured by Puppet
Well-understood Nagios syntax
Works well with Nagios checks and plugins
Cons: Requires serious investment of time and constant care
Option: Nagios, Zenoss

 

Capacity/Performance Management

Collect system metrics for assessing performance and for capacity planning (a small collectd sketch follows the list below). Some organizations have monitoring perform this role, but I feel strongly that it should be kept separate.


Software: Collectd/Visage
Pros: Light, fast daemon on each box
Flexible server
Many plugins available
Cons: Separate process to run
Requires a lot of disk space and disk I/O
Option: Ganglia
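
One reason the daemon stays light is that custom metrics can be fed in through collectd's exec plugin, which simply reads PUTVAL lines from a long-running script. A minimal Ruby sketch (the metric and hostname handling are illustrative):

    #!/usr/bin/env ruby
    # Run by collectd's exec plugin; collectd reads PUTVAL lines from stdout.
    # COLLECTD_HOSTNAME and COLLECTD_INTERVAL are set by collectd itself.
    HOSTNAME = ENV['COLLECTD_HOSTNAME'] || `hostname -f`.strip
    INTERVAL = (ENV['COLLECTD_INTERVAL'] || 10).to_i

    $stdout.sync = true
    loop do
      # Illustrative metric: established TCP connections (minus header line)
      count = `ss -t state established`.lines.count - 1
      puts "PUTVAL #{HOSTNAME}/tcp/gauge-established interval=#{INTERVAL} N:#{count}"
      sleep INTERVAL
    end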


Log Collection

Centrally collect, store and monitor system and application logs (a small Graylog2 example follows the list below)

Software: Rsyslog/Graylog2
Pros: Rsyslog provides flexible configs
MongoDB backed server performs well
Easy front end for log viewing
Cons: Takes a while to learn Mongo
Harder to pull/back up than text logfiles
Options: Syslog-ng, Logstash
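
Rsyslog forwards the system logs, but applications can also ship structured messages straight to Graylog2 over GELF. A minimal sketch using the gelf gem (the server name, port and extra fields are examples):

    require 'gelf'

    # 12201 is the conventional GELF UDP port
    notifier = GELF::Notifier.new('graylog.example.com', 12201)

    # Underscore-prefixed keys become searchable additional fields
    notifier.notify!(
      :short_message => 'deploy finished',
      :host          => 'app01',
      :_app          => 'billing',
      :_version      => '1.4.2'
    )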


Deployment Management

Allow developers and technical staff to deploy and monitor application activity. Since each infrastructure is unique, it makes sense to build a custom solution to this problem (a rough sketch follows the list below).


Software: Mcollective/Sinatra/ActiveMQ
Pros: Sinatra makes it easy to write simple web applications
Mcollective is extremely fast
ActiveMQ is very flexible and resilient
Cons: Sinatra is not as full featured as Rails
Mcollective requires a change of thinking about command/control
ActiveMQ is Java*
Options: Control Tier
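
A rough sketch of how the pieces fit together: a small Sinatra app exposes a deploy endpoint and lets Mcollective fan the work out over ActiveMQ. The 'deploy' agent and its 'release' action here are hypothetical in-house conventions, not stock Mcollective:

    require 'sinatra'
    require 'mcollective'

    include MCollective::RPC

    # POST /deploy/billing/1.4.2 rolls that version out to every
    # node whose 'app' fact matches.
    post '/deploy/:app/:version' do
      mc = rpcclient('deploy')              # hypothetical in-house agent
      mc.fact_filter 'app', params[:app]    # only touch matching nodes
      results = mc.release(:version => params[:version])
      mc.disconnect
      results.map { |r| "#{r[:sender]}: #{r[:statusmsg]}" }.join("\n")
    end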


* I list Java as a con because we do not have extensive in-house Java expertise, and it requires us to install something we otherwise would not.

How Mcollective and Puppet play nice


At work, I have invested a lot of time in two tools that have made configuration and deployment as close to a painless process as I think is possible.

Puppet (available from Puppet Labs) is an amazing configuration tool that I have been working with for over a year. Since my place of work is cloud based, I need to spin up dozens of virtual machines that must be configured identically, and automatically. Puppet allows you to achieve consistency over time, as each run converges machines into a known configured state.

Now I have more than 100 nodes, and I want to collect data or run a task on each of them in real time. SSH loops are fine for a couple of machines with a static list, but I have many nodes spread out in different locations and I am not a patient individual. Mcollective makes it possible to run massively parallel jobs across my infrastructure in seconds as opposed to minutes.

The use case that got me started was a co-worker saying “We just got a call from customer XYZ and they say there is a problem. Quick – do something”. Because all of the nodes are in Puppet and my configurations are in source control, I can immediately be sure of the state of my system configurations. I could log into monitoring, check each host and wait for some information, but instead I just run my Mcollective check (sketched below), which goes out to each box and performs all monitoring checks in real time to see if there is some failure*. Within 30 seconds, I am confident that I can rule out the two main causes of trouble – configuration drift and network/host level issues – and concentrate on the application itself. In the past it might have taken tens of minutes just to ascertain system state, which was most often the culprit in an outage.
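
The check itself is only a few lines of Mcollective client code. A sketch assuming the NRPE agent plugin is installed on every node (the check name is an example):

    require 'mcollective'
    include MCollective::RPC

    # Fan a monitoring check out to every node at once
    mc = rpcclient('nrpe')
    mc.progress = false

    mc.runcommand(:command => 'check_load').each do |result|
      status = result[:data][:exitcode] == 0 ? 'OK' : 'FAIL'
      puts "#{result[:sender]}: #{status} #{result[:data][:output]}"
    end

    printrpcstats
    mc.disconnect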

When I’m asked why we need both Puppet and Mcollective, I use the following shopping analogy to explain the relationship:

Puppet is the weekly shopping trip where you buy necessities and follow a list, ensuring a well-stocked pantry of basic ingredients plus what you need for dinner.

Mcollective is the quick run to the store to pick up a wine to complement dinner.

The food is great, but the wine puts it over the top; the wine, while certainly nice by itself, lacks the foundation of a good meal.

Mcollective now handles software deploys, monitoring checks, audits and many other functions on my company’s infrastructure when immediate action is required, and is itself installed and configured by Puppet (a stripped-down agent sketch follows below). It does require a significant upfront investment of time and a change in the way you think about processing requests, but it is, in my opinion, necessary to grow your infrastructure and stay responsive to business needs.
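
For the curious, here is a stripped-down sketch of what one of those agents can look like; the action name and deploy command are illustrative, not our production code:

    module MCollective
      module Agent
        # A minimal SimpleRPC agent, distributed to every node by Puppet
        class Deploy < RPC::Agent
          action "release" do
            validate :version, /\A[\w\.\-]+\z/
            # run() returns the exit code and copies stdout into reply[:output]
            reply[:status] = run("/usr/local/bin/deploy #{request[:version]}",
                                 :stdout => :output, :chomp => true)
          end
        end
      end
    end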

* For a speed reference: on part of my company’s infrastructure, I can run approximately 1736 monitoring checks across 129 hosts in the following time:

Finished processing 129 / 129 hosts in 3411.43 ms