Chris's Blog

Devops Shokunin

Opensource Infrastructure Revisited

Comments Off on Opensource Infrastructure Revisited

In a previous article, I detailed the open source projects that I used to implement a PaaS infrastructure.

Since that time the number of instances in the infrastructure has grown by 2.5X and several of the components needed to be rethought.

Capacity/Performance Management

Previous: Collectd/Visage
Replacement: Collectd/Graphite
Reasons: The collectd backend was too slow and I/O heavy
Graphite graphs are easily to embed in dashboard applications
Ability to easily transform metrics, such as average CPU across a cluster of servers

Continuous Integration

Previous: Selenium
Replacement: Custom Tests
Reasons: Selenium tests failed too often for undiscernable reasons
False positives slowed development too often

Log Collection

Previous: Rsyslog/Graylog2
Replacement: Logstash/ElasticSearch/Kibana
Reasons: Mongodb too slow in EC2 for storing and searching

Logstash offers better parsing and indexing of logs with powerful filtersElasticSearch is super fast and scales horizontally on EC2

Kibana is simple to use and allows Developers to quickly find the relevant information

All of these components are easily integrated into our dashboard application

These changes not only allow the infrastructure to scale, but provide APIs that allow easy integration with custom dashboards.

Wrap a unix command for mcollective code example

Comments Off on Wrap a unix command for mcollective code example

Writing plugins to wrap a command in mcollective is very easy.

What is difficult is figuring out all of the conventions that you need to follow.

I’m going to walk you through an example with some tips.

Conventions you need to follow

1) Agents need to go in the agent directory.  You can find it with the following command

sudo find / -name rpcutil.ddl -exec dirname {} \;

2) The file name should be all lowercase with a file mode of 644

test@mcollectivemaster:/usr/share/mcollective/plugins/mcollective/agent$ ls -l wrapit.rb
-rw-r--r-- 1 root root 1127 2012-04-13 19:05 wrapit.rb

3)  The class name needs to match the file name with at least the first letter capitalized.

4) To more easily debug make sure the following two lines are present in your /etc/mcollective/server.cfg so you can find errors in syslog

logger_type = syslog
loglevel = debug

5) Restart mcollective between changes to the agent file

An example agent

module MCollective
  module Agent
    class Wrapit < RPC::Agent
###################################################################
      metadata :name        => "My Agent",
               :description => "Example of how to wrap a command",
               :author      => "Me <me@example.com>",
               :license     => "DWYWI",
               :version     => "0.1",
               :url         => "http://blog.mague.com",
               :timeout     => 10
###################################################################
#functions
 
      def run_command(cmd)
	cmdrun = IO.popen(cmd)
	output = cmdrun.read
	cmdrun.close
	if $?.to_i > 0
	  logger.error "WRAPIT: #{cmd}: #{output}"
	  reply.fail! "ERROR: failed with #{output}"
	end
	output
      end
###################################################################
#actions
 
      action "ping" do
	reply[:info] = run_command("/bin/ping -c 5 #{request[:target]}")
      end
 
###################################################################
    end #class
  end   #agent
end     #module

Mcollective uses the request data structure to receive information from the client and reply to send information back to the client. The actions are methods that the client calls to run.

When running commands be sure to wrap with IO.popen so that you can capture both the output and pass the error code. Do not forget to close the popen or you will leak file handles and cause trouble.

An Example Client

 
#!/usr/bin/env ruby
require 'rubygems' if RUBY_VERSION < "1.9"
require 'mcollective'
 
include MCollective::RPC
 
args = {}
action=""
 
options = rpcoptions do |parser, options|
 
  parser.on('-K', '--targ TARGET', 'Target to test against') do |v|
    args[:target] = v
    end 
 
  parser.on('-S', '--action ACTION', 'action') do |v|
    action = v
  end   
 
end
 
mc = rpcclient("wrapit", :options => options)
#mc.fact_filter "servertype", "/myservertype/"
mc.timeout = 10
mc.progress = false
 
mc.send(action, args).each do |resp|
  if resp[:statusmsg] == "OK"
    puts resp[:sender]
    puts resp[:data][:info]
  else
    puts "ERROR:"
    p resp
  end
end

The RPC includes all kinds of switches so be sure to run your script with -h to make sure none of your arguments overlap as they will be overwritten by the default switches.

You can add your own fact_filters and turn off the progress bar which is useful with web interfaces.

Be sure to check the statusmsg sent back from the agent, so you can catch any errors.

Since developer access is limited in my work environment, I wrote lots of customer agents to allow troubleshooting and deployment to be coordinated. Putting them behind a Sinatra web front end has made life easier for all.

Update:It was kindly pointed out to me by R.I. Pienaar that instead of using popen, there is a built-in run() function that is even more flexible.

cmd_output = []
cmd_error = ""
run ( "/bin/ping -c #{request[:target}",
      :environment => {"MYSETTING" => "ABC123"}
      :cwd         => "/home/testguy",
      :stdout      => cmd_output,
      :sdterr      => cmd_error,
)

How-To: Mcollective/RabbitMQ on Ubuntu

Comments Off on How-To: Mcollective/RabbitMQ on Ubuntu

1 ) Install RabbitMQ Prerequisites

apt-get install -y erlang-base erlang-nox

2 ) Install RabbitMQ from the Download Site

wget http://www.rabbitmq.com/releases/rabbitmq-server/v2.8.1/rabbitmq-server_2.8.1-1_all.deb
dpkg -i rabbitmq-server_2.8.1-1_all.deb

3 ) Enable the stomp and AMQ plugins

rabbitmq-plugins enable amqp_client
rabbitmq-plugins enable rabbitmq_stomp

4 ) Create the rabbitmq config file in /etc/rabbitmq/rabbitmq.config

%% Single broken configuration

[
  {rabbitmq_stomp, [{tcp_listeners, [{"0.0.0.0", 6163},
                                     {"::1",       6163}]}]}
].

5 ) Create the MCollective Users

rabbitmqctl add_user mcollective PASSWORD
rabbitmqctl set_user_tags mcollective administrator
rabbitmqctl set_permissions -p / mcollective ".*" ".*" ".*"

6 ) Restart RabbitMQ

/etc/init.d/rabbitmq-server restart

7 ) Edit /etc/mcollective/client.conf to point to the local host

#####
# Example config

topicprefix = /topic/
main_collective = mcollective
collectives = mcollective
libdir = /usr/share/mcollective/plugins
logfile = /dev/null
loglevel = info

# Plugins
securityprovider = psk
plugin.psk = unset

connector = stomp
plugin.stomp.host = localhost
plugin.stomp.port = 6163
plugin.stomp.user = mcollective
plugin.stomp.password = PASSWORD

# Facts
factsource = yaml
plugin.yaml = /etc/mcollective/facts.yaml

8 ) Edit /etc/mcollective/server.cfg to point to the localhost

####
# Example server.cfg
topicprefix = /topic/
main_collective = mcollective
collectives = mcollective
libdir = /usr/share/mcollective/plugins
logger_type = syslog
loglevel = info
daemonize = 1
registerinterval = 30
classesfile = /var/lib/puppet/state/classes.txt

# Plugins
securityprovider = psk
plugin.psk = unset

connector = stomp
plugin.stomp.host = 127.0.0.1
plugin.stomp.port = 6163
plugin.stomp.user = mcollective
plugin.stomp.password = PASSWORD

# Facts
factsource = yaml
plugin.yaml = /etc/mcollective/facts.yaml

Mcollective Use Case – Operational Dashboard

Comments Off on Mcollective Use Case – Operational Dashboard

I’ve been asked a few times about use cases for mcollective.

 

One of the biggest wins at my company has been using mcollective to build “Oppy” an operational dashboard. Oppy allows developers and support staff to perform deploys on staging servers as well as to audit and monitor client environments in real time.  Developers and Support staff do not have access to production or staging environments for a variety of reasons, so it was necessary to provide a tool that could quickly and efficiently provide all of the information and access that these teams require.

 

 

 

Capabilites

  1. Deployment – Developers can deploy code to staging environments by connecting to an mcollective agent that installs the latest gem packaged version of our software, cleans out the old versions and restarts the application.
  2. Auditing – Support and Developers can run auditing scripts on all nodes in a certain class and check to make sure that software versions, monitoring settings and software settings are as expected.
  3. Corrective Actions – Support and Developers can flush the varnish caches by triggering a run of a varnish agent or reset the application on demand.
  4. Debugging – Developers can run an agent to turn on debug logging for the application and collect the logs by pulling them from the centralized logging server.

 

 

Stack

Oppy was built on top of Sinatra, which allows for extremely rapid development of basic web applications.   A simple example of using Sinatra to run mcollective agents is on my Github account.

The above screen shot is the results of running the nrpe agent, which runs every single monitoring check on a group of hosts and reports back.

All actions are logged, so that there is an audit trail

 

Deploying Ruby Applications Like an Adult

Comments Off on Deploying Ruby Applications Like an Adult

“Push button deploy” is something that is often hear people requesting or mentioning as something they would like to have. What’s more important, in my opinion, is to provide a reliable and scalable system for both internal and external developers to deploy code to the staging environment for clients to QA. Staging deployments should be as simple as possible. Production releases are slightly more complicated as Operations needs to carefully monitor and coordinate with external parties, but should still use the same base system.

Larger Image

Requirements for a deployment system

Scalable

  • deploying a package to 1 server or 100 servers should take the same amount of time and effort

Sane

  • Deploy only sanity checked code.
  • Break loudly.
  • Fit with developer culture.

Fast

  • Everyone likes thing to happen quickly.
  • Clients don’t do delayed gratification.

Audit-able

  • What is running right now?
  • How can I trace back to a change in the source code?
  • Is what I think really running?
  • Logs, logs, logs

Reliable

  • It’s Ops, so no one else will be available at 3am to fix it
  • Have to be able to quickly troubleshoot

Flexible

  • Requirements will change over time.
  • Owned by operations, so changes can be separate from production releases.

 

Here is what I came up with:

Larger Image

The criteria for the components chosen is described here
The next posts will go into more detail on individual components.

Dashboard Example with Sinatra and Mcollective

Comments Off on Dashboard Example with Sinatra and Mcollective

Having a dashboard to provide real time data to users helps minimize interruptions at work.

The combination of Sinatra handling the incoming HTTP requests and Mcollective pulling real time data from the infrastructure provides the responsiveness and self-service that saves everyone time and effort.

The example code is available on Github

Here are the screen shots running on my internal network.

Welcome Screen

Filtering Form

Results from Monitoring Agent

Results from Puppetd Agent

How mcollective and puppet play nice

1 Comment »

At work, I have invested a lot of time in two tools that have made configuration and deployment as close to a painless process as I think is possible.

Puppet (available from Puppet Labs) is an amazing configuration tool that I have been working with for over a year.  Since my place of work is cloud based, I need to spin up dozens of virtual machines that need to be identically configured automatically.  Puppet allows you to achieve consistency over time as machines are configured by runs into a known state.

Now I have greater than 100 nodes and I want to perform some action on them to collect data or perform some action on each node in real time.  SSH loops are fine for a couple of machine with a static list, but I have many nodes spread out in different locations and I am not a patient individual.  Mcollective makes it possible to run massively parallel jobs across my infrastructure in seconds as opposed to minutes.

The use case that got me started was a co-worker says “We just got a call from customer XYZ and they say there is a problem.  Quick – do something”.  Because all of the nodes are in puppet and my configurations are in source control, I can immediately be sure of the state of my system configurations.  I could log into monitoring and check each host and wait for some information, but instead I just run my mcollective check that goes out to each box and performs all monitoring checks in real time to see if there is some failure*.  Within 30 seconds, I am confident that I can rule out the main two causes of trouble – configuration drift and network/host level issues and concentrate on the application itself.  In the past this might have taken 10s of minutes to ascertain system state and it was most likely the culprit as to the current outage.

When I’m asked why you need both Puppet and Mcollective, I use the following shopping analogy to explain the relationship:

Puppet is the weekly shopping trip where you buy necessities and follow a list to ensure you have everything you need for a well stocked pantry of basic ingredients and what you need for dinner.

Mcollective is the quick run to the store to pick up a wine to compliment dinner.

The food is great, but the wine puts it over the top and the wine, while certainly nice by itself, lacks the foundation of a good meal.

Mcollective now handles deploys of software, monitoring checks, audits and many other functions on my company’s infrastructure when immediate action is required and is itself installed and configured by Puppet.  It does require a significant upfront investment in time and a change in the way you think about processing requests, but is, in my opinion, necessary to grow your infrastructure and be responsive to business needs.

* for speed reference on part of my company’s infrastructure I can run approximately 1736 monitoring checks over 129 hosts in the following time

Finished processing 129 / 129 hosts in 3411.43 ms