Chris's Blog

Devops Shokunin

Opensource Infrastructure Revisited


In a previous article, I detailed the open source projects that I used to implement a PaaS infrastructure.

Since then, the number of instances in the infrastructure has grown 2.5x, and several of the components needed to be rethought.

Capacity/Performance Management

Previous: Collectd/Visage
Replacement: Collectd/Graphite
Reasons: The collectd backend was too slow and I/O-heavy.
Graphite graphs are easy to embed in dashboard applications.
Metrics are easy to transform, such as averaging CPU across a cluster of servers (see the sketch below).
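
For example, Graphite's render API can do that aggregation on the fly. A quick sketch, where the hostname and metric names are hypothetical:

# Average CPU idle across all web nodes, rendered as a PNG for a dashboard
curl -o cpu.png \
 'http://graphite.example.com/render?target=averageSeries(collectd.web*.cpu.idle)&from=-2hours&format=png'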

Continuous Integration

Previous: Selenium
Replacement: Custom Tests
Reasons: Selenium tests failed too often for indiscernible reasons.
False positives slowed development too often.

Log Collection

Previous: Rsyslog/Graylog2
Replacement: Logstash/ElasticSearch/Kibana
Reasons: MongoDB was too slow in EC2 for storing and searching.
Logstash offers better parsing and indexing of logs, with powerful filters.
ElasticSearch is super fast and scales horizontally on EC2.
Kibana is simple to use and allows developers to quickly find the relevant information.
All of these components are easily integrated into our dashboard application.
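
Because ElasticSearch exposes everything over HTTP, ad-hoc log searches are trivial. A hedged sketch, where the hostname, index name, and field are assumptions on my part:

# Find the five most recent 500 errors in one day's logstash index
curl 'http://es01.example.com:9200/logstash-2012.11.14/_search?q=status:500&size=5&pretty=true'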

These changes not only allow the infrastructure to scale, but also provide APIs for easy integration with custom dashboards.

Upgrading to Puppet 3.0


After spending a few days at PuppetConf and talking with Eric S. and Jeff McC. of Puppet Labs, I felt compelled to upgrade our architecture from Puppet 2.7 to 3.0, mainly to fix plugin-sync issues and for the increased performance.

Here is my list of gotchas:

0) You cannot run both an ENC and stored configs at the same time. I filed a bug (#16698) on this.

1) On my puppet master, I run as the user puppet under unicorn+nginx. When you don’t run as root, puppet looks only in ~/.puppet for the configuration file (Bug #16637); there is also a note about this in the release notes. I temporarily got around this by running:

mv ~/.puppet ~/.puppet.old
ln -s /etc/puppet ~/.puppet
#also grab the new rack config
cp /puppet-3.0.0-rc8/ext/rack/files/config.ru .

and restarting my unicorn processes. I am much happier than I was on Passenger+Apache.

2) Ignoring deprecation warnings will catch up with you. The following will fail:

  file { '/etc/nginx/nginx.conf':
    ensure  => present,
    source  => [ "puppet:///nginx/${::fqdn}.nginx.conf",
                "puppet:///nginx/${::role}.nginx.conf",
                'puppet:///nginx/default.nginx.conf'],
  }

with this message:

err: /Stage[pre]/Nginx::Config/File[/etc/nginx/nginx.conf]: 
Could not evaluate: Error 400 on SERVER: Not authorized to call find on
/file_metadata/nginx/foo.bar.com.nginx.conf Could not retrieve file metadata for
puppet:///file_metadata/nginx/foo.bar.com.nginx.conf: Error 400 on SERVER: 
Not authorized to call find on /file_metadata/nginx/foo.bar.com.nginx.conf at 
/etc/puppet/modules/nginx/config.pp:86

You need to be sure to include the modules path in the source URL:

  file { '/etc/nginx/nginx.conf':
    ensure  => present,
    source  => [ "puppet:///modules/nginx/${::fqdn}.nginx.conf",
                "puppet:///modules/nginx/${::role}.nginx.conf",
                'puppet:///modules/nginx/default.nginx.conf'],
  }

Jeff even agreed that the error message was somewhat cryptic and filed Bug #16667.

3) When writing the facts.yml file for mcollective, I needed to remove the map call so that it would not error after moving to Ruby 1.9:

--- a/modules/mcollective/manifests/config.pp
+++ b/modules/mcollective/manifests/config.pp
@@ -34,7 +34,7 @@ class mcollective::config inherits mcollective {
     group    => root,
     mode     => '0400',
     loglevel => debug,
-    content  => inline_template('<%= Hash[scope.to_hash.reject 
- { |k,v| k.to_s =~ /
- (uptime_seconds|uptime|timestamp|free|path|ec2_metrics_vhostmd|
- servername|ec2_public_keys_0_openssh_key|sshrsakey|sshdsakey|serverip)/ 
- }.sort.map].to_yaml - %>'),
+    content  => inline_template('<%= Hash[scope.to_hash.reject 
+ { |k,v| k.to_s =~ /
+ (uptime_seconds|uptime|timestamp|free|path|ec2_metrics_vhostmd|
+ servername|ec2_public_keys_0_openssh_key|sshrsakey|sshdsakey|serverip)/ 
+ }.sort].to_yaml %>'),
     require  => Class['mcollective::package'],
   }

Thanks to Eric and Jeff for helping me roll this out.
So far, I’m super happy with the significantly faster catalog compile times and the awesome support that the PuppetLabs Community team has provided.

Sync Puppet Certs between EC2 regions


In the past I have used nginx to route all cert requests to a single cert server. This worked fine when I had limited my puppet infrastructure to a single EC2 region. However, I recently decided to have puppet masters on separate coasts.

Keeping the certs in sync requires a two-way sync, so I ruled out just rsyncing files around. I tried deploying various drbd+clustered_file_system solutions, and while the tests worked within a region, I could not get them working well through the NAT of the two regions.

A helpful IRC regular (semiosis, thanks!) suggested unison. The concern was that a cron job might be too slow and that I might run into issues performing the unison sync. There’s a very useful program called incron that monitors file systems and performs actions based on inode-level changes. So the obvious solution was to monitor the filesystem for changes and then force a unison sync.

The final solution looks like this:

Install Unison

apt-get install unison

Make sure ssh works:

unison -batch -auto /etc/puppet/ssl/ca/signed \
ssh://puppet@OTHERPUPPETHOST//etc/puppet/ssl/ca/signed

Write a simple script (saved as /bin/puppet_cert_sync) on each host:

#!/bin/bash
 
/usr/bin/unison -batch -auto /etc/puppet/ssl/ca/signed \
ssh://puppet@OTHERPUPPETHOST//etc/puppet/ssl/ca/signed > /tmp/sync.log

Set the right mode

 chmod +x /bin/puppet_cert_sync

Add a crontab entry to make sure it stays kosher:

31 * * * * /bin/puppet_cert_sync

Install incron:

sudo apt-get install incron

Configure it to allow the user puppet:

echo "puppet" >> /etc/incron.allow

Add the incrontab entry:

export EDITOR=vi
incrontab -e
 
/etc/puppet/ssl/ca/signed IN_CLOSE_WRITE,IN_CREATE,IN_DELETE /bin/puppet_cert_sync

Then test on one host by running:

watch -n 1 ls /etc/puppet/ssl/ca/signed/testhost.pem

And on the other host run:

sudo puppetca --clean testhost

Caveats: This may not work in an environment where many new certs are created very close to each other in both regions. It is also not as performant as a clustered file system, but it works well in my use case. In addition, the default puppet SSL directory varies, so adjust the paths as necessary.
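
As a quick sanity check, puppet will tell you which SSL directory it is actually using:

# Print the ssldir the master is configured with
puppet master --configprint ssldir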

Puppet ENCs and Automating Monitoring Setups with Puppet


Patrick Buckley came and spoke about how he uses Puppet Node Classifiers – Slideshare

After him, I spoke about using Puppet to configure monitoring. – Slideshare

Getting Puppet Stats into Graphite


Graphs are awesome.

At work I provide all kinds of graphs to the front end/support teams, and Graphite is rapidly becoming my tool of choice.  In the past, I relied heavily on RRD.  However, Graphite’s easy-to-use front end, scalability, and ease of data injection are unparalleled.

Since puppet is such a large part of my infrastructure, I want lots of graphs to glance over whenever I think there is a problem.  Puppet outputs stats to syslog, so instead of changing that, I decided to pull the stats from syslog into Graphite.
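
Getting data into Graphite really is that easy: its plaintext listener takes "metric value timestamp" lines on TCP port 2003. A minimal sketch, with the metric path and Graphite hostname as my own examples:

# Send a single catalog run time data point to Graphite by hand
echo "puppet.$(hostname -s).catalog_run_time 42.7 $(date +%s)" | \
 nc graphite.example.com 2003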

My company has allowed me to make this code publicly available on our GitHub site, with a short explanation of how to make it work.  It uses Ruby and EventMachine to handle all of the requests and includes an example of some simple calculations that can be done to aggregate data.

Load Balancing Puppet with Nginx


Due to the holidays, I’ve had to add a large number of new nodes to our infrastructure. This started putting too much CPU and memory load on the puppet master. Instead of moving to a larger instance, I looked to spread the load across multiple boxes.

This presented the problem of how the ops team could run tests against their own environments, how to handle the issuance and revocation of certs, and how to keep the manifests on the backends in sync.

Using nginx as a software load balancer solved all of these issues.

After talking with an ex-colleague (I owe you some ramen, eric0), I took a closer look at the URL paths being requested by the puppet clients.

Certificate requests start with /production/certificate, so they get routed to the puppet instance that only serves up certificates.

10.10.0.235 - - [14/Nov/2011:20:02:03 +0000]
  "GET /production/certificate/machine123.example.com HTTP/1.1" 404 60 "-" "-"

Each ops team member has their own environment for testing and the URLs start with the environment name

10.170.25.2 - - [14/Nov/2011:17:24:02 +0000]
 "GET /chris/file_metadata/modules/unixbase/fixes/file.conf HTTP/1.1" 200 330 "-" "-"

Everything else gets routed to a group of puppet backend servers.
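
As a rough sketch of the idea (the upstream names and addresses here are mine, and the SSL client-certificate handling that the real config needs is omitted), the routing boils down to three location blocks:

# Simplified sketch only -- the full nginx.conf linked below also
# terminates SSL and passes client certificate headers to the backends
cat > /etc/nginx/conf.d/puppet_lb.conf << 'EOF'
upstream puppet_ca       { server 10.10.0.10:8140; }  # cert-only master
upstream puppet_backends { server 10.10.0.11:8140;
                           server 10.10.0.12:8140; }  # manifest servers

server {
    listen 8140;

    # Certificate requests go to the single CA master
    location ~ ^/production/certificate { proxy_pass http://puppet_ca; }

    # Per-engineer test environments go to the ops dev server
    location ~ ^/chris/ { proxy_pass http://10.10.0.20:8140; }

    # Everything else is balanced across the backend pool
    location / { proxy_pass http://puppet_backends; }
}
EOF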

The full nginx.conf file is available from GitHub.

Configurations are tested on the ops dev server then checked into a git repo that is pulled by all of the puppet backend servers.

Deploying Ruby Applications Like an Adult


“Push button deploy” is something I often hear people requesting or mentioning as something they would like to have. What’s more important, in my opinion, is to provide a reliable and scalable system for both internal and external developers to deploy code to the staging environment for clients to QA. Staging deployments should be as simple as possible. Production releases are slightly more complicated, as Operations needs to carefully monitor and coordinate with external parties, but they should still use the same base system.


Requirements for a deployment system

Scalable

  • deploying a package to 1 server or 100 servers should take the same amount of time and effort

Sane

  • Deploy only sanity checked code.
  • Break loudly.
  • Fit with developer culture.

Fast

  • Everyone likes things to happen quickly.
  • Clients don’t do delayed gratification.

Audit-able

  • What is running right now?
  • How can I trace back to a change in the source code?
  • Is what I think really running?
  • Logs, logs, logs

Reliable

  • It’s Ops, so no one else will be available at 3am to fix it
  • Have to be able to quickly troubleshoot

Flexible

  • Requirements will change over time.
  • Owned by operations, so changes can be separate from production releases.


Here is what I came up with:

[Diagram of the deployment system]

The criteria for the components chosen are described here.
The next posts will go into more detail on individual components.

How mcollective and puppet play nice


At work, I have invested a lot of time in two tools that have made configuration and deployment as close to a painless process as I think is possible.

Puppet (available from Puppet Labs) is an amazing configuration tool that I have been working with for over a year.  Since my place of work is cloud-based, I need to spin up dozens of virtual machines that must be configured identically and automatically.  Puppet lets you achieve consistency over time, as successive runs converge machines into a known state.

Now I have more than 100 nodes, and I want to collect data from or perform some action on each node in real time.  SSH loops are fine for a couple of machines with a static list, but I have many nodes spread out in different locations and I am not a patient individual.  Mcollective makes it possible to run massively parallel jobs across my infrastructure in seconds as opposed to minutes.

The use case that got me started: a co-worker says, “We just got a call from customer XYZ and they say there is a problem.  Quick – do something.”  Because all of the nodes are in puppet and my configurations are in source control, I can immediately be sure of the state of my system configurations.  I could log into monitoring, check each host, and wait for some information, but instead I just run my mcollective check, which goes out to each box and performs all monitoring checks in real time to see if there is some failure*.  Within 30 seconds, I am confident that I can rule out the two main causes of trouble – configuration drift and network/host-level issues – and concentrate on the application itself.  In the past, it might have taken tens of minutes just to ascertain the system state, which was most likely the culprit in the current outage.

When I’m asked why you need both Puppet and Mcollective, I use the following shopping analogy to explain the relationship:

Puppet is the weekly shopping trip where you buy necessities and follow a list to ensure you have a well-stocked pantry of basic ingredients plus everything you need for dinner.

Mcollective is the quick run to the store to pick up a wine to complement dinner.

The food is great, but the wine puts it over the top; the wine, while certainly nice by itself, lacks the foundation of a good meal.

Mcollective now handles software deploys, monitoring checks, audits, and many other functions on my company’s infrastructure when immediate action is required, and it is itself installed and configured by Puppet.  It does require a significant upfront investment of time and a change in the way you think about processing requests, but it is, in my opinion, necessary to grow your infrastructure and be responsive to business needs.

* For a speed reference: on part of my company’s infrastructure, I can run approximately 1736 monitoring checks over 129 hosts in the following time:

Finished processing 129 / 129 hosts in 3411.43 ms
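
For the curious, a check like that is just an MCollective agent call under the hood. A minimal illustration, assuming the community NRPE agent plugin is installed (mco ping ships with MCollective itself):

# Verify that every node responds over the message bus
mco ping

# Run a single NRPE check in parallel everywhere, filtered by a fact
mco rpc nrpe runcommand command=check_load -W "environment=production"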



Puppet – so now what?? (Part 1 – Git It)


Keep your puppet manifests under some sort of source code management.
There, I said it.
Rolling back will save your bacon at least once, probably more than that.

Here’s how you set up puppet under git on a remote host.

Install gitosis (a tool for easily managing git repos). The installation creates the user gitosis with homedir /srv/gitosis:

sudo apt-get install gitosis
cp ~/.ssh/id_rsa.pub /tmp/.a
chmod 644 /tmp/.a
sudo su - gitosis -c "gitosis-init  < /tmp/.a"
rm /tmp/.a

Generate an ssh key as the user puppet on your puppet master

sudo su - puppet
ssh-keygen

Gitosis is configured as a git repository, so check out the admin repo and add the puppet user and the puppet project.

cd SOMEDIR
git clone gitosis@GITSERVER:gitosis-admin.git
cd gitosis-admin/
cat << EOF  >> gitosis.conf

[group puppetmasters]
members = chris@chimp puppet@PUPPET_MASTER
writable = puppet
EOF

Copy the puppet key /home/puppet/.ssh/id_rsa.pub into the keydir and push the changes to git:

scp puppet@puppet:/home/puppet/.ssh/id_rsa.pub keydir/puppet@PUPPET_MASTER.pub
git commit -a -m "puppet added to gitosis"
git push

#make sure /etc/puppet is owned by the user puppet on the puppet master

sudo chown -R puppet:puppet /etc/puppet
cd /etc/puppet
git init
git remote add origin gitosis@GITSERVER:puppet.git
git add *
git commit -m "initial add"
git push origin master:refs/heads/master

Finally, add the following to the crontab on your puppet master to make sure changes get checked out every two minutes:

MAILTO=chris@EXAMPLE.com
*/2 * * * * cd /etc/puppet && /usr/bin/git pull origin master:refs/heads/master > /tmp/puppetmaster.log 2>&1

This might seem like a lot of trouble to go to, but it will prove useful in the future, especially when I get to the subjects of environments and testing.

Git will also prove useful when managing large file trees and can also be used as a puppet provider.

Puppet – so now what?? (Introduction)


After installing puppet, a lot of people ask, “Now what?”

My plan is to write a few posts on things that I have found useful in puppet other than the actual configuration of your nodes.

These are things that have saved me trouble and/or made my infrastructure run more smoothly.