Chris's Blog

Devops Shokunin

Getting Puppet Stats into Graphite


Graphs are awesome.

At work I provide all kinds of graphs to the front end/support teams, and Graphite is rapidly becoming my tool of choice.  In the past I relied heavily on RRD, but Graphite's easy-to-use front end, scalability, and ease of data injection are unparalleled.

Since Puppet is such a large part of my infrastructure, I want lots of graphs to glance over whenever I think there is a problem.  Puppet outputs its stats to syslog, and rather than change that, I decided to pull the syslog data into Graphite.

My company has allowed me to make this code publicly available on our GitHub site, with a short explanation of how to make it work.  It uses Ruby and EventMachine to handle all of the requests, and includes an example of some simple calculations that can be done to aggregate data.
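For a rough idea of the approach, here is a minimal sketch, not the actual GitHub code: listen for forwarded syslog packets, pick out Puppet's catalog-compile timing, and write it to Graphite's plaintext listener on port 2003. The regex, metric name, Graphite host and UDP port are all assumptions for illustration.

require 'eventmachine'
require 'socket'

GRAPHITE_HOST = 'graphite.example.com'   # assumption: your Graphite server
GRAPHITE_PORT = 2003                     # Graphite's plaintext listener

# Matches Puppet 2.x syslog lines such as:
#   puppet-master[999]: Compiled catalog for web01.example.com in 1.42 seconds
COMPILE_RE = /Compiled catalog for (\S+) in ([\d.]+) seconds/

module SyslogHandler
  def receive_data(packet)
    return unless m = COMPILE_RE.match(packet)
    metric = "puppet.compile_time.#{m[1].tr('.', '_')}"
    # Graphite plaintext protocol: "<metric> <value> <epoch>\n"
    TCPSocket.open(GRAPHITE_HOST, GRAPHITE_PORT) do |sock|
      sock.puts "#{metric} #{m[2]} #{Time.now.to_i}"
    end
  end
end

EM.run do
  # assumption: rsyslog forwards Puppet's logs to UDP port 5140 on this box
  EM.open_datagram_socket('0.0.0.0', 5140, SyslogHandler)
end

Opening a TCP connection per sample is fine at Puppet-report volumes; at higher rates you would keep a persistent connection.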

The Priority 0 Rule


Many years ago, when I worked for a Japanese shipping company, Taga-san bought a little red rubber stamp that said “urgent”.  She would stamp documents with it to try and garner more attention.  Soon almost every document carried the “urgent” stamp, so the diligent Taga-san went out and bought a “very urgent” stamp.  This too began to appear on documents more and more frequently.

It was only when I joked with her that I had visited the Kinokuniya stationery store and noticed they were “having a sale on super duper special urgent stamps”, and she was confronted with the tears of laughter streaming down her co-workers' faces, that Taga-san backed down on the stamps.

 

I apologized for the cruel joke with some very expensive cake from a local bakery, and she told me that she was often frustrated trying to communicate her priorities to other people.  So for about $5 in rubber stamps and $25 in cake, we learned a valuable lesson: setting priorities is difficult, and communicating them is even more difficult.

 

In my current position as the “Operations Team”, I bounce around many different things every day.  Communicating my priorities is still extremely difficult.  Inspired by the Ironport Rule 0 (“Don’t do anything stupid”), I have my own Priority 0:

 

Priority 0: Production Works

 

Simple, but not easy.

Priority 0 is only one thing, and it never changes: our production environment is running and earning revenue for the company.

  • All requests get dropped at a moment’s notice for any Priority 0 issue.
  • All projects get delayed for Priority 0 issues.
  • All issues are resolved after Priority 0 issues.

 

The whiz-bang new feature, the flavor-of-the-month data store test server, and staging issues all have to take a back seat to Priority 0.

 

Priority 0 costs vary by company.  I’ve worked at places where Priority 0 cost 90% of my time, and places where it’s been as low as 20%.  It’s never free and rarely taken into consideration, but setting it and making sure it’s understood by others is critical.

 

Taga-san, moshiwake gozaimasen (I am truly sorry).

 

 

Load Balancing Puppet with Nginx


Due to the holidays, I’ve had to add a large number of new nodes to our infrastructure. This started putting too much CPU and memory load on the puppet master. Instead of moving to a larger instance, I looked to spread the load across multiple boxes.

This presented the problems of how the ops team could run tests against their own environments, how to handle the issuance and revocation of certs, and how to keep the manifests on the backends in sync.

Using nginx as a software load balancer solved all of these issues.

After talking with an ex-colleague (I owe you some ramen, eric0) I took a closer look at the URL paths being requested by the puppet clients.

Certificate requests start with /production/certificate, so they get routed to the puppet instance that only serves up certificates:

10.10.0.235 - - [14/Nov/2011:20:02:03 +0000]
  "GET /production/certificate/machine123.example.com HTTP/1.1" 404 60 "-" "-"

Each ops team member has their own environment for testing, and those URLs start with the environment name:

10.170.25.2 - - [14/Nov/2011:17:24:02 +0000]
 "GET /chris/file_metadata/modules/unixbase/fixes/file.conf HTTP/1.1" 200 330 "-" "-"

Everything else gets routed to a group of puppet backend servers.

The full nginx.conf file is available from GitHub.
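For illustration, a trimmed sketch of that routing logic is below. The upstream names and addresses are made up, and the SSL and client-certificate directives that a real puppet master front end needs are omitted:

upstream puppet_ca       { server 10.0.0.10:8140; }
upstream puppet_backends {
    server 10.0.0.21:8140;
    server 10.0.0.22:8140;
}

server {
    listen 8140;

    # cert issuance and revocation go to the single CA instance
    location ~ ^/production/certificate {
        proxy_pass http://puppet_ca;
    }

    # per-engineer test environments, e.g. /chris/...
    location ~ ^/chris/ {
        proxy_pass http://10.0.0.30:8140;
    }

    # everything else is spread across the backend pool
    location / {
        proxy_pass http://puppet_backends;
    }
}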

Configurations are tested on the ops dev server, then checked into a git repo that is pulled by all of the puppet backend servers.

Mcollective Use Case – Operational Dashboard


I’ve been asked a few times about use cases for mcollective.

 

One of the biggest wins at my company has been using mcollective to build “Oppy”, an operational dashboard. Oppy allows developers and support staff to perform deploys on staging servers, as well as to audit and monitor client environments in real time.  Developers and support staff do not have access to production or staging environments for a variety of reasons, so it was necessary to provide a tool that could quickly and efficiently provide all of the information and access that these teams require.

[Screenshot: Oppy displaying the results of an nrpe agent run]

Capabilities

  1. Deployment – Developers can deploy code to staging environments by connecting to an mcollective agent that installs the latest gem packaged version of our software, cleans out the old versions and restarts the application.
  2. Auditing – Support and Developers can run auditing scripts on all nodes in a certain class and check to make sure that software versions, monitoring settings and software settings are as expected.
  3. Corrective Actions – Support and Developers can flush the varnish caches by triggering a run of a varnish agent or reset the application on demand.
  4. Debugging – Developers can run an agent to turn on debug logging for the application and collect the logs by pulling them from the centralized logging server.

 

 

Stack

Oppy was built on top of Sinatra, which allows for extremely rapid development of basic web applications.   A simple example of using Sinatra to run mcollective agents is on my Github account.

The screen shot above shows the results of running the nrpe agent, which runs every single monitoring check on a group of hosts and reports back.

All actions are logged, so that there is an audit trail.
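As a rough illustration of the pattern, not Oppy itself, a Sinatra route can drive an mcollective RPC client directly. Here the stock rpcutil agent’s ping action stands in for a real agent such as nrpe:

require 'sinatra'
require 'mcollective'

include MCollective::RPC

# GET /ping fans out to every node and reports back who answered
get '/ping' do
  mc = rpcclient('rpcutil')   # rpcutil ships with mcollective
  mc.progress = false         # no terminal progress bar inside a web app
  results = mc.ping
  mc.disconnect
  content_type 'text/plain'
  results.map { |r| "#{r[:sender]} responded" }.join("\n")
end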

 

Plotting Time Series Data with Gnuplot


When dealing with external customers and non-technical people I find it beneficial to provide some sort of visual representation.

Dumping a ton of data on people rarely conveys the message effectively.

My go-to tool for generating graphs, especially of time-series data, is gnuplot.  It’s free, flexible and runs everywhere.

Data files httpa.reqs and httpb.reqs are comma-separated, with the first column the time in epoch seconds and the second the captured value:

1317940390,2.41390134529148
1317942609,1.40112107623318
1319116460,4.9790134529148
1319118679,3.86807174887892
1319120898,3.3390134529148

Create the following file and save it as http.gnuplot:

set datafile separator ","
set terminal png size 900,400
set title "SITE.com Web Traffic"
set ylabel "Requests per second"
set xlabel "Date"
set xdata time
set timefmt "%s"
set format x "%m/%d"
set key left top
set grid
plot "httpa.reqs" using 1:2 with lines lw 2 lt 3 title 'hosta', \
     "httpb.reqs" using 1:2 with lines lw 2 lt 1 title 'hostb'

Generate the graph and save it:

gnuplot < http.gnuplot  > requests.png

The breakdown of the lines is as follows:

set datafile separator ","

Set the field delimiter to a comma; the default is a space. Simply omit this line for space-separated data.

set terminal png size 900,400

Have gnuplot output a PNG file at the specified size. You can also run gnuplot in interactive mode, where you do not need this line.

set title "SITE.com Web Traffic"

Set the title of the graph at the top.

set ylabel "Requests per second"
set xlabel "Date"

Always label your graph so that when it gets passed around to people unfamiliar with the history of the request, it is readily apparent what is going on. This will also keep my high school math teacher quiet.

set xdata time

Tell gnuplot that the x axis will be time data. This allows for more flexible time series manipulations.

set timefmt "%s"

Let gnuplot know what format the time string will be in. “%s” is epoch seconds, while “%m/%d/%Y:%H:%M:%S” will read 01/28/2011:00:01:14 in as a time value. Not having to first convert date formats with some script is one of the big wins of using gnuplot.

set format x "%m/%d"

Set the date output format for the x axis.

set key left top

Set the legend to the top left corner.

set grid

Turn on grid lines, they are off by default.

plot "httpa.reqs" using 1:2 with lines lw 2 lt 3 title 'hosta', 
"httpb.reqs" using 1:2 with lines lw 2 lt 1 title 'hostb'

1:2 gives the field numbers: the first is the x-axis value and the second the y-axis value.

“with lines” uses lines instead of simply data points.

lw is the line width, with 1 being the default.

lt is the line type or color, gnuplot will pick colors for you automatically if you do not specify them.

title is for the legend.

You can plot multiple data sources on the same graph.

The gnuplot documentation is available here.

An excellent gallery of the amazing capabilities of gnuplot is here.

Packaging – Deploying Ruby Applications Like an Adult – Part 2


Continuing from Part 1

Build gems!

 

It’s not that hard and your efforts will be rewarded.

Here are my arguments for learning packaging:

 

What’s running on your system now?

When you’re running hundreds of servers, you need a programmatic way of auditing what is running on your systems. Compiling from source will not give you this for every single package on the system. Git is wonderful as a source code management system, but did a4f85e72894895a8269d65cb3fa2ab012804d3ef come before or after aa7c72e6a15ae37db7beb6450f4db3d30069a7dd, and what developer or product manager could give you a git hash for the version they want running in production? Even with tags, going back and forth is hard.
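Packaging makes that audit a one-liner. With the application shipped as a gem (hypothetical name and version here), every node can answer the question directly:

gem list myapp

*** LOCAL GEMS ***

myapp (1.0.3)

and that gem version maps straight back to a tagged build in source control.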

Are all of your dependencies met and consistent?

What if some dependency of a dependency is updated, causing a bug? Deploying from source code and running bundler to handle dependencies means you might have different gem versions running between the time you brought up the original server and the time you added a new node to the cluster. It happens, and it is very time-consuming to troubleshoot.

How long does it take you to deploy an application?

It takes me 20 seconds to release across a 100-node cluster. It used to take up to 10 minutes to download and install all of the dependencies on my old system, and there were plenty of failures due to network issues or rate-limiting from the upstream server. Internal and external customers don’t do delayed gratification.

Can I give a gem version to a developer and be sure they’re running what’s on production so they can troubleshoot?

Yes

 


I’m still waiting for a good argument against packaging.

There are excellent gem tutorials available.

Example Gemspec file for building
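For a rough idea of the shape, a minimal hypothetical gemspec looks like this (names, versions and paths are made up, not the linked example):

# myapp.gemspec -- hypothetical example
Gem::Specification.new do |s|
  s.name        = 'myapp'
  s.version     = '1.0.3'
  s.summary     = 'Internal application packaged as a gem'
  s.authors     = ['Ops Team']
  # ship the vendored dependencies along with the code
  s.files       = Dir['lib/**/*'] + Dir['bin/*'] + Dir['vendor/**/*']
  s.executables = ['myapp']
end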

Before building the gem, I take another step and use

bundle install --deployment

This downloads all of the gems and compiles all of the extensions necessary to run the gem into the vendor directory. Now when you start your application with

bundle exec START_COMMAND

it will use only those gems in the vendor folder. You can view the full Rakefile here.
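A minimal sketch of a Rakefile wiring those steps together might look like the following (hypothetical, not the linked Rakefile):

# Rakefile -- sketch of the build flow described above
require 'rubygems'
require 'rubygems/package_task'

spec = Gem::Specification.load('myapp.gemspec')

# defines the standard 'gem' task, which builds pkg/myapp-1.0.3.gem
Gem::PackageTask.new(spec) { |pkg| }

desc 'Vendor all dependencies, then build the gem'
task :release_build do
  sh 'bundle install --deployment'
  Rake::Task['gem'].invoke
end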

Deploying Ruby Applications Like an Adult


“Push button deploy” is something I often hear people requesting or mentioning as something they would like to have. What’s more important, in my opinion, is to provide a reliable and scalable system for both internal and external developers to deploy code to the staging environment for clients to QA. Staging deployments should be as simple as possible. Production releases are slightly more complicated, as Operations needs to carefully monitor and coordinate with external parties, but they should still use the same base system.


Requirements for a deployment system

Scalable

  • Deploying a package to 1 server or 100 servers should take the same amount of time and effort.

Sane

  • Deploy only sanity checked code.
  • Break loudly.
  • Fit with developer culture.

Fast

  • Everyone likes things to happen quickly.
  • Clients don’t do delayed gratification.

Audit-able

  • What is running right now?
  • How can I trace back to a change in the source code?
  • Is what I think really running?
  • Logs, logs, logs

Reliable

  • It’s Ops, so no one else will be available at 3am to fix it.
  • Have to be able to troubleshoot quickly.

Flexible

  • Requirements will change over time.
  • Owned by operations, so changes can be separate from production releases.

 

Here is what I came up with:

[Diagram: the deployment system]

The criteria for the components chosen are described here.
The next posts will go into more detail on individual components.

DNSRR – rewriting DNS for testing web applications


When testing web applications, it is often necessary to rewrite DNS entries to avoid XSS JavaScript warnings.

Building on RubyDNS, my company has open-sourced a quick Ruby script to easily rewrite DNS queries for web testing.

Available on Github
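The core of such a script looks roughly like this, a minimal sketch against the RubyDNS API of that era; the domain, addresses and port are made up, and this is not the released script:

require 'rubydns'

# upstream resolver for everything we do not rewrite
R = Resolv::DNS.new(:nameserver => ['8.8.8.8'])

RubyDNS::run_server(:listen => [[:udp, '0.0.0.0', 5300]]) do
  # answer A queries for the site under test with a local address
  match(/app\.example\.com/, Resolv::DNS::Resource::IN::A) do |transaction|
    transaction.respond!('10.0.0.80')
  end

  # hand everything else to the real resolver
  otherwise do |transaction|
    transaction.passthrough!(R)
  end
end

Point the test browser’s resolver at this process, and app.example.com resolves locally while all other lookups behave normally.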

Dashboard Example with Sinatra and Mcollective


Having a dashboard to provide real time data to users helps minimize interruptions at work.

The combination of Sinatra handling the incoming HTTP requests and Mcollective pulling real time data from the infrastructure provides the responsiveness and self-service that saves everyone time and effort.

The example code is available on Github

Here are some screen shots of it running on my internal network.

Welcome Screen

Filtering Form

Results from Monitoring Agent

Results from Puppetd Agent

Using Open Source to Provide Infrastructure Services


Operations Teams need to provide eight critical services to the developers and users of their environment.  At my current employer, I use open source software to provide these services that allow our developers to be more productive and our customers to experience stable, responsive service.


Source Code Management

Keep all of our bespoke software, configurations and notes under strict version control.

Software: Git
Pros:
  • Fast
  • Stable
  • Many developers are familiar with it due to GitHub’s popularity
Cons:
  • Steep learning curve
  • Somewhat cryptic commands
Option: Subversion

Continuous Integration

Build, test, version and package our software so that it may be quickly and safely deployed to our staging environment.

Software: Jenkins
Pros:
  • Easy integration with Git
  • Nice GUI
  • Flexible enough to meet our needs
Cons:
  • Configuration limited to the GUI
  • Written in Java*
Option: Cruise Control

 

Provisioning

Spin up nodes to become part of the processing farm and decommission nodes no longer required.

Software: Custom scripts using Fog
Pros:
  • Simple scripts
  • Easy to customize
  • Support for multiple cloud providers
Cons:
  • Custom tool
Options: Cobbler, RightAWS

 

Configuration Management

Ensure that all nodes are automatically and correctly configured and remain in a known configured state.

Software: Puppet
Pros:
  • Easy configuration language
  • Well supported
  • Active community
Cons:
  • Have to learn said configuration language
  • Requires a serious investment of time
Option: Chef

 

Monitoring

Check on services and nodes to ensure that things are behaving as expected before the customer notices.

 

Software: Icinga
Pros:
  • Can be easily auto-configured by Puppet
  • Well-understood Nagios syntax
  • Works well with Nagios checks and plugins
Cons:
  • Requires a serious investment of time and constant care
Options: Nagios, Zenoss

 

Capacity/Performance Management

Collect system metrics for assessing performance and capacity planning.  Some organizations have monitoring perform this role, but I have very strong opinions on this being kept separate.

 

Software: Collectd/Visage
Pros:
  • Light, fast daemon on each box
  • Flexible server
  • Many plugins available
Cons:
  • Separate process to run
  • Requires a lot of disk space and disk I/O
Option: Ganglia

 

Log Collection

Centrally collect, store and monitor system and application logs.

Software: Rsyslog/Graylog2
Pros:
  • Rsyslog provides flexible configs
  • MongoDB-backed server performs well
  • Easy front end for log viewing
Cons:
  • Takes a while to learn Mongo
  • Harder to pull/back up than text logfiles
Options: Syslog-ng, Logstash

 

Deployment Management

Allow developers and technical staff to deploy and monitor application activity.  Since each infrastructure is unique, it makes sense to build a custom solution to this problem.

 

Software: Mcollective/Sinatra/ActiveMQ
Pros:
  • Sinatra makes it easy to write simple web applications
  • Mcollective is extremely fast
  • ActiveMQ is very flexible and resilient
Cons:
  • Sinatra is not as full-featured as Rails
  • Mcollective requires a change of thinking about command/control
  • ActiveMQ is Java*
Option: Control Tier

 

* I list Java as a con because we do not have extensive in-house Java expertise, and it requires us to install something we would not normally have.