“I need to know what this will do to my production systems before I run it.” – Ask a Systems Administrator why they want dry-run mode in a management tool, and this is the answer you’ll get almost every single time.
Historically, we have been able to use dry-run as a risk mitigation strategy before applying changes to machines. Dry-run is supposed to report what a tool would do, so that the administrator can determine if it is safe to run. Unfortunately, this only works if the reporting can be trusted as accurate.
In this post, I’ll show why modern configuration management tools behave differently than the classical tool set, and why their dry-run reporting is untrustworthy. While useful for development, it should never be used in place of proper testing.
Many tools in a sysadmin’s belt have a dry-run mode. Common utilities like make, rsync, rpm, and apt all have it. Many databases will let you simulate updates, and most disk utilities can show you changes before making them.
The make utility is the earliest example I can find of an automation tool with a dry-run option. Dry-run in make (the -n flag) works by building a list of commands, then printing them instead of executing them. This is useful because make can be trusted to run the exact same list in real-run mode. Rsync and the others behave the same way.
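For instance, a minimal demonstration (the Makefile and paths are invented for this example):

```shell
# Create a scratch directory with a one-target Makefile
# (recipe lines must begin with a tab, hence printf).
mkdir -p /tmp/make-dryrun-demo
printf 'hello.txt:\n\techo hello > hello.txt\n' > /tmp/make-dryrun-demo/Makefile

# -n builds the command list and prints it instead of executing it.
make -C /tmp/make-dryrun-demo -n

# The target was never actually built:
test ! -f /tmp/make-dryrun-demo/hello.txt && echo "nothing was executed"
```

The printed list is exactly what a real run would execute, which is why make's report can be trusted.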
Convergence-based tools, however, don’t build lists of commands. They build sets of convergent operators instead.
Convergent operators ensure state. Each has a subject and two sets of instructions: the first set is tests that determine whether the subject is in the desired state, and the second takes corrective action if needed. Types are made by grouping common tests and actions, which allows us to talk about things like users, groups, files, and services abstractly.
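As a sketch (the class and method names here are mine, not any tool’s API), a convergent operator boils down to a guarded repair:

```ruby
require 'fileutils'

# A toy convergent operator: a subject, a test for the desired state,
# and a repair action that runs only when the test fails.
class ConvergentOperator
  def initialize(subject, test:, repair:)
    @subject, @test, @repair = subject, test, repair
  end

  # Check first; act only if needed. Safe to run any number of times.
  def converge!
    return :already_converged if @test.call(@subject)
    @repair.call(@subject)
    :repaired
  end
end

# A "directory" type is just a common test/repair pairing.
ensure_dir = ConvergentOperator.new(
  '/tmp/demo-dir',
  test:   ->(path) { File.directory?(path) },
  repair: ->(path) { FileUtils.mkdir_p(path) }
)

ensure_dir.converge!  # creates the directory only if it is missing
```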
CFEngine promise bundles, Puppet manifests, and Chef recipes are all sets of these data structures. Putting them into a feedback loop lets them cooperate over multiple runs, and enables the self-healing behavior that is essential when dealing with large amounts of complexity.
During each run, ordered sets of convergent operators are applied against the system. How the order is determined varies from tool to tool, but it is ordered nonetheless.
CFEngine models Promise Theory as a way of doing systems management. While Puppet and Chef do not model promise theory explicitly, it is still useful to borrow its vocabulary and metaphors and think about individual, autonomous agents that promise to fix the things they’re concerned with.
When writing policy, imagine every resource statement as a simple little robot. When the client runs, a swarm of these robots run tests, interrogate package managers, inspect files, and examine process tables. Corrective action is taken only when necessary.
When dealing with these agents, it can sometimes seem like they’re lying to you. This raises a few questions. Why would they lie? Under what circumstances are they likely to lie? What exactly is a lie anyway?
A formal examination of promises does indeed include the notion of lies. Lies can be outright deceptions, which are the lies of the rarely-encountered Evil Robots. Lies can also be “non-deceptions”, which are the lies of occasionally-encountered Broken Robots. Most often though, we experience lies from the often-encountered Merely Mis-informed Robots.
The best you can possibly hope to do in a dry-run mode is to build the operator sequences, then interrogate each one about what it would do to repair the system at that exact moment. The problem is that, in real-run mode, the system changes between the tests. Quite often, the result of any given test will be affected by a preceding action.
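A small simulation shows the gap (the operators and the state flag are invented; real tools are far richer):

```ruby
# Each operator is a name, a "would I act?" test, and an action.
# The state hash stands in for the machine.
state = { binary_present: false }

ops = [
  ['report (binary absent)',  ->(s) { !s[:binary_present] }, ->(s) {}],
  ['install the package',     ->(s) { !s[:binary_present] }, ->(s) { s[:binary_present] = true }],
  ['report (binary present)', ->(s) { s[:binary_present] },  ->(s) {}],
]

# Dry-run: every operator is interrogated against the SAME frozen state.
dry_report = ops.select { |_, would_act, _| would_act.call(state) }
                .map { |name, _, _| name }

# Real-run: each action can change what the next test sees.
real_report = []
ops.each do |name, would_act, act|
  next unless would_act.call(state)
  real_report << name
  act.call(state)
end

dry_report.size   # => 2 (the third operator said it had nothing to do)
real_report.size  # => 3 (the install changed the outcome of the last test)
```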
Configuration operations can have rather large side effects. Sending signals to processes can change files on disk. Mounting a disk will change an entire branch of a directory tree. Packages can drop off one or a million different files, and will often execute pre or post-installation scripts. Installing the Postfix package on an Ubuntu system will not only write the package contents to disk, but also create users and disable Exim before automatically starting the service.
Throw in some notifications and boolean checks and things can get really interesting.
To experiment with dry-run mode, I wrote a Chef cookbook that configures a machine with initial conditions, then drops off CFEngine and Puppet policies for dry-running.
Three configuration management systems, each with conflicting policies, wreaking havoc on a single machine sounds like a fun way to spend the evening. Let’s get weird.
If you already have a Ruby and Vagrant environment set up on your workstation and would like to follow along, feel free. Otherwise, you can just read the code examples by clicking on the provided links as we go.
Clone out the dry-run-lies cookbook from GitHub, then bring up a Vagrant box with Chef.
When Chef is done configuring the machine, log into it and switch to root. We can test the /tmp/lies-1.cf policy file by running cf-agent with the -n flag.
Dry-run mode reports that it would run an echo command in bundle_one.
Let’s remove -n and see what happens.
Wait a sec… What’s all this bundle_three business? Did dry-run just lie to me?
Examine the lies-1.cf file here.
The policy said three things. First, “echo hello from bundle one if /usr/bin/puppet does NOT exist”. Second, “make sure the puppet package is installed”. Third, “echo hello from bundle three if /usr/bin/puppet exists.”
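In CFEngine terms, that description corresponds to something like the following sketch (bundle and class names are my guesses, and the packages section assumes the standard library; the real lies-1.cf is linked above):

```
body common control
{
  bundlesequence => { "bundle_one", "bundle_two", "bundle_three" };
}

bundle agent bundle_one
{
  classes:
    "have_puppet" expression => fileexists("/usr/bin/puppet");

  commands:
    !have_puppet::
      "/bin/echo hello from bundle one";
}

bundle agent bundle_two
{
  packages:
    "puppet"
      package_policy => "add",
      package_method => generic;
}

bundle agent bundle_three
{
  classes:
    "have_puppet" expression => fileexists("/usr/bin/puppet");

  commands:
    have_puppet::
      "/bin/echo hello from bundle three";
}
```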
In dry-run mode, each agent was interrogated individually. This resulted in a report leading us to believe that only one “echo hello” would be made, when in reality, there were two.
Let’s give Puppet a spin. We can test the policy at /tmp/lies-1.pp with the --noop flag to see what Puppet thinks it will do.
Dry-run reports that there is one resource to fix. Excellent. Let’s remove the --noop flag and see what happens.
Like the CFEngine example, we have the real-run doing things that were not listed in the dry-run report.
The Chef policy that set up the initial machine state mounted an NFS directory at /mnt/nfssrv. When interrogated during dry-run, the tests in the file resources saw that the files were present, so they did not report that they needed to be fixed. During the real-run, Puppet unmounts the directory, changing the view of the filesystem and the outcome of the tests.
Check out the policy here.
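A hedged sketch of what lies-1.pp plausibly contains, based on the behavior described above (resource titles and file paths are guesses; the real policy is linked):

```
# Unmounting changes what the file resources' tests see.
mount { '/mnt/nfssrv':
  ensure => 'absent',
}

file { '/mnt/nfssrv/testfile':
  ensure  => 'file',
  content => "hello\n",
  require => Mount['/mnt/nfssrv'],
}

# Mentioned later in the post: this policy also keeps nmap off the box.
package { 'nmap':
  ensure => 'absent',
}
```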
It should be noted that Puppet’s resource graph model does nothing to enable noop functionality, nor can it affect its accuracy. It is used only for the purposes of ordering and ensuring non-conflicting node names within its model.
Finally, we’ll run the original Chef policy with the -W flag to see if it lies like the others.
Seems legit. Let’s remove the --why-run flag and do it for real.
Right. “HACKING THE PLANET” was definitely not in the dry-run output. Let’s go figure out what happened. See the Chef policy here.
Previously, our CFEngine policy had installed Puppet on the machine. Our Puppet policy ensured nmap was absent. Chef will install nmap, but only if the Puppet binary is present in /usr/bin.
Running Chef in --why-run mode, the test for the 'package[nmap]' resource succeeds because of the pre-conditions set up by the CFEngine policy. Had we not applied that policy, the 'execute[hack the planet]' resource would still not have fired, because nothing else installs nmap along the way. In real-run mode it succeeds because Chef changes the machine state between tests, but it would have failed if we had never run the Puppet policy.
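Pieced together from that description, the relevant resources plausibly look like this sketch (the guard commands and the echo are inferred, not copied from the real cookbook):

```ruby
# Install nmap only when the Puppet binary is already on the box.
package 'nmap' do
  only_if { ::File.exist?('/usr/bin/puppet') }
end

# Fires only once nmap is available -- which, in real-run mode,
# the resource above just made true.
execute 'hack the planet' do
  command 'echo HACKING THE PLANET'
  only_if 'which nmap'
end
```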
Yikes.
The robots were not trying to be deceptive. Each autonomous agent told us what it honestly thought it should do in order to fix the system. As far as they could see, everything was fine when we asked them.
As we automate the world around us, it is important to know how the systems we build fail. We are going to need to fix them, after all. It is even more important to know how our machines lie to us. The last thing we need is an army of lying robots wandering around.
Luckily, there are a number of techniques for testing and introducing change that can be used to help ensure nothing bad happens.
Testing needs to be about observation, not interrogation. In each case, the system converged to the policy, regardless of whether dry-run got confused or not. If we can set up test machines that reproduce a system’s state, we can real-run the policy and observe the behavior. Integration tests can then be written to ensure that the policy achieves what it is supposed to.
Ideally, machines are modeled with policy from the ground up, starting with Just Enough Operating System to allow them to run Chef. This ensures all the details of a system have been captured and are reproducible.
Other ways of reproducing state work, but come with the burden of having to drag that knowledge around with you. Snapshots, kickstart or bootstrap scripts, and even manual configuration will all work as long as you can promise they’re accurate.
There are some situations where reproducing a test system is impossible, or modeling it from the ground up is not an option. In this case, a slow, careful, incremental application of policy, aided by dry-run mode and human intuition, is the safest way to start. Chef’s why-run mode can aid intuition by publishing assumptions about what’s going on. “I would start the service, assuming the software had been previously installed” helps quite a bit during development.
Finally, increasing the resolution of our policies will help the most in the long term. The more robots the better. Ensuring the contents of your configuration files is good. Making sure that they are the only ones present in a conf.d directory is better. As a community, we need to produce as much high-quality, trusted, tested, and reusable policy as possible.
Good luck, and be careful out there.
-s
Since they’re both written in Ruby, people tend to compare Puppet and Chef. This is natural, since they have a lot in common. Both are convergence-based configuration management tools inspired by CFEngine. Both have standalone discovery agents (facter and ohai, respectively), as well as RESTful APIs for gleaning node information from the server. It turns out, however, that Chef actually has a lot more in common with CFEngine.
Like CFEngine, Chef copies policy from the server and evaluates it on the edges. This allows for high scalability, since the server isn’t doing very much. Think of a web application that does most of its work in the browser instead of on the server.
A Chef recipe is a collection of convergent resource statements, and serves as the basic unit of intent. This is analogous to a CFEngine promise bundle. The Chef run list is how recipe ordering is defined, and is directly comparable to CFEngine’s bundlesequence. Using this approach makes it easy to reason about what’s going on when writing infrastructure as code.
While it’s true that Chef is just “pure ruby” and therefore imperative, to say that Chef is imperative without considering the declarative interface to resources is disingenuous at best. Using nothing but Chef resources, recipes look very much like their CFEngine and Puppet counterparts. The non-optimally ordered Chef version of NTP converges in the same number of runs as the CFEngine example from the first installment. This is because the underlying science of convergent operators is the same.
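A hedged sketch of such a non-optimally ordered NTP recipe (the package, service, and file names are assumptions for a Debian-family system):

```ruby
# Deliberately non-optimal order: service first, then config, then package.
service 'ntp' do
  action [:enable, :start]
end

cookbook_file '/etc/ntp.conf' do
  source 'ntp.conf'
  notifies :restart, 'service[ntp]'
end

package 'ntp' do
  action :install
end
```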
When and where order matters, imperative ordering isolated within a recipe is the most intuitive way for sysadmins to accomplish tasks within the convergent model. “Install a package, edit a config file, and start the service” is how most people think about the task. Imperative ordering of declarative statements gives the best of both worlds. When order does NOT matter, it’s safe to rearrange recipe ordering in the Chef run list.
The real trick to effective Chef cookbook development is to understand the Anatomy of a Chef Run. When a Chef recipe is evaluated in the compilation phase, encountered resources are added to the Resource Collection, which is an array of evaluated resources with deferred execution.
The compile phase of this recipe would add 99 uniquely named, 12 oz, convergent beer_bottles to the collection, and the configure phase would take them down and pass them around. Subsequent runs would do nothing.
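A hedged sketch of the recipe being described (beer_bottle is a hypothetical resource type; the attribute and action names are invented):

```ruby
# Compile phase: declare 99 uniquely named, 12 oz convergent bottles.
# Configure phase: take them down and pass them around.
99.downto(1) do |i|
  beer_bottle "bottle_#{i}" do
    oz 12
    action [:take_down, :pass_around]
  end
end
```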
The idea is that you can take advantage of the full power of Ruby to make decisions about what to declare about your resources. Most people just use the built-in Chef APIs to consult the chef-server for topology information about their infrastructure. However, there’s nothing stopping you from importing arbitrary Ruby modules and accessing existing SQL databases instead.
Want to name your name servers after your Facebook friends? Go for it. Want your MOTD to list all James Brown albums released between 1980 and 1990? Not a problem. The important part is that things are ultimately managed with a declarative, idempotent, and convergent resource interface.
Let’s take a look at the recipe that gave us our original CFEngine server.
When a node is bootstrapped with Chef, a run list of roles or recipes is requested by the node itself. After that, the host is found by recipes running elsewhere in the infrastructure by searching for roles or attributes. This contrasts with the CFEngine and Puppet techniques of matching classes based on a hostname, FQDN, IP, or other discovered information.
This approach has the effect of decoupling a node’s name from its functionality. Line 10 in cfengine.rb above searches out node objects, which are later passed to the promises-server.cf.erb template for authorization.
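The search-then-template pattern described here can be sketched as follows (the role query and output path are assumptions):

```ruby
# Find client nodes registered with the Chef server...
clients = search(:node, 'role:cfengine-client')

# ...and hand them to the template that authorizes policy distribution.
template '/var/cfengine/masterfiles/promises-server.cf' do
  source 'promises-server.cf.erb'
  variables(clients: clients)
end
```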
So there you have it, folks. Chef making CFEngine making Puppet making Chef. These tools can be used to automate literally anything, and they’re pretty easy to use once you figure out how they work. I was going to throw some Bcfg2 and LCFG in there just for fun, but I only had so much free time =)
Configuration management is like a portal.
-s
Puppet at its core works like CFEngine. Statements in Puppet are convergent operators, in that they are declarative (and therefore idempotent), and convergent in that they check a resource’s state before taking any action. Like the NTP example from the CFEngine installment, non-optimally ordered execution will usually work itself out after repeated Puppet runs.
Unlike CFEngine, where policy is copied and evaluated on the edges, Puppet clients connect to the Puppet server, where configuration is determined based on a certificate CN. A catalog of serialized configuration data is shipped back to the client for execution. This catalog is computed from the manifests stored on the server, as well as a set of facts gathered from the clients. Puppet facts, like CFEngine hard classes, are discoverable things about a node, such as OS version, hostname, kernel version, and network information.
Puppet works a bit like the food replicators in Star Trek. Resources make up the basic atoms of a system, and the precise configuration of each must be defined. If a resource is defined twice in a manifest with conflicting states, Puppet refuses to run.
Ordering can be specified through require statements that set up relations between resources. These are used to build a directed graph, which Puppet sorts topologically to determine the final ordering. If a resource in a chain fails for some reason, dependent resources down the graph will be skipped.
This allows for isolation of unrelated resource collections. For example, if a package repository fails to deliver the ‘httpd’ package, its dependent configuration file and service resources will be skipped. This has nothing to do with an SSH resource collection, so the resources concerning that service will be executed even though the httpd collection had previously failed.
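For example (resource names assumed), the httpd chain below forms one dependency subgraph and sshd another, so a failed package leaves sshd unaffected:

```
package { 'httpd':
  ensure => 'installed',
}

file { '/etc/httpd/conf/httpd.conf':
  ensure  => 'file',
  require => Package['httpd'],
}

service { 'httpd':
  ensure  => 'running',
  require => File['/etc/httpd/conf/httpd.conf'],
}

# No edges into the httpd chain, so this always gets a chance to run.
service { 'sshd':
  ensure => 'running',
}
```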
Just be careful not to create the coffee without the cup.
Let’s examine a Puppet manifest that creates a Chef server on CentOS 6.
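A hedged reconstruction of such a manifest, based on the line-by-line walkthrough below (the package names, repository URL, and exact line positions are guesses and will not match the original numbering):

```
class chef_server {

  exec { 'rbel6-release':
    command => '/bin/rpm -Uvh http://rbel.frameos.org/rbel6-release.rpm',
    unless  => '/bin/rpm -q rbel6-release',
  }

  $chef_packages = [ 'rubygem-chef-server', 'rubygem-chef' ]

  package { $chef_packages:
    ensure  => 'installed',
    require => Exec['rbel6-release'],
  }

  service { 'chef-server':
    ensure  => 'running',
    require => Package[$chef_packages],
  }
}
```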
Line 1 is a Puppet class definition. This groups the resource statements together, allowing us to assign chef-server to a node based on its hostname. That can be accomplished with an explicit nodes.pp definition, or with an external node classifier.
Line 3 is an exec resource, which we can later refer to by its name: rbel6-release. When using exec resources, it’s up to you to specify a convergence check. In this case, we used the unless keyword to check the return status of an rpm command. The same goes for commands promise types in CFEngine, or execute resources in Chef.
Line 9 is an example of an array variable, which is iterated over in line 21, much like a CFEngine slist.
Everything else is a standard Puppet resource declaration, each of which has a type, a name, and an argument list. Like CFEngine promises, each type has various intentions available under the hood. Packages can be installed, services can be running or stopped, and files can be present with certain contents and permissions.
Refer to the Puppet documentation for more details.
Over the past few years, the topic of Infrastructure Automation has received a huge amount of attention. The three most commonly used tools for doing this (in order of appearance) are CFEngine, Puppet, and Chef. This article explores each of them by using one to set up another. If you have a chef-server or Hosted Chef account, you can follow along by following the instructions in the setup section. (Full disclosure: I work for Opscode, creators of Chef.)
“Infrastructure” turns out to be the hardest thing to explain when discussing automation, yet is the most critical to understand. In this context, Infrastructure isn’t anything physical (or virtualized) like servers or networks. Instead, what we’re talking about is all the “stuff” that is configured across machines to enable an application or service.
In practice, “stuff” translates to operating system baselines, kernel settings, disk mounts, OS user accounts, directories, symlinks, software installations, configuration files, running processes, etc. People of the ITIL persuasion may think of these as Configuration Items. Units of management are composed into larger constructs, and complexity arises as these arrangements become more intricate.
Services running in an Infrastructure need to communicate with each other, and do so via networks. Even when running on a single node, things still communicate over a loopback address or a Unix domain socket. This means that Infrastructure has a topology, which is in itself yet another thing to manage.
Here is a picture of a duck.
This duck happens to be an automaton. An automaton is a self-operating machine. This one pretends to digest grain. It interacts with its environment by taking input and producing output. To continue operating, the duck requires maintenance. It needs to be wound, cleaned, and repaired. Automated services running on a computer are no different.
Once turned on, an automated service takes input, does something useful, then leaves logs and other data in its wake. Its machinery is the arrangement of software installation, configuration, and the running state of a process. Maintenance is performed in a control loop, where an agent comes around at regular intervals inspecting its parts and fixing anything that’s broken.
In automated configuration management, the name of the game is hosting policy. The agents that build and maintain systems pull down blueprints and set to work building our automatons. When systems come back up from maintenance or new ones spring into existence, they configure themselves by downloading policy from the server.
If you’d like to follow along by configuring your own machines with knife, follow the setup instructions here. The setup will get your Chef workstation configured, code checked out from my blog git repo, and uploaded to chef-server for use. Otherwise, you can just browse the source here
CFEngine is a system based on promise theory. Promises are the basic atoms of the CFEngine universe. They have names, types, and intentions (among other things), and each acts as a convergent operator to move its subject toward an intended state. Like the parts in our duck, promises are assembled to create a larger whole.
Promises of various types are capable of different things. Promises of type “package” can interact with a package manager to make sure something is installed or removed, while a promise of type “file” can copy, edit, and set permissions. Processes can be started or stopped, and commands can be run if needed. Read all about them in the CFEngine reference manual.
Promises provide a declarative interface to resources under management, which has the remarkably handy attribute of being idempotent. An idempotent function gives the same result when applied multiple times. This allows our duck-repairing maintenance loop (in the form of cf-agent on a cron) to come around and safely execute instructions without having to worry about side effects. Consider “the line ‘foo’ should exist in the file” vs. “append ‘foo’ to the end of the file”; the non-declarative ‘append’ would not be safe to repeat.
Convergent maintenance refers to the continuous repair of a system towards a desired state. At the individual promise level, convergence happens in a single run of the maintenance loop. If a package is supposed to be installed but isn’t, action will be taken to fix it. If a process is not running but should be, action will be taken again. Convergence in a larger system of promises can take multiple runs if things are processed in a non-optimal order. Consider the following:
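A hedged sketch of such a policy (simplified: three bundles force the non-optimal order, and the file and package bodies assume CFEngine’s standard library):

```
body common control
{
  bundlesequence => { "start_service", "write_config", "install_package" };
}

bundle agent start_service
{
  commands:
    "/usr/sbin/service ntp start";
}

bundle agent write_config
{
  files:
    "/etc/ntp.conf"
      copy_from => local_cp("/var/cfengine/masterfiles/ntp.conf"),
      classes   => if_repaired("restart_ntp");

  commands:
    restart_ntp::
      "/usr/sbin/service ntp restart";
}

bundle agent install_package
{
  packages:
    "ntp"
      package_policy => "add",
      package_method => generic;
}
```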
Assuming a system with a base install, the first promise would fail to be kept. The NTP binary is not available, since we haven’t installed its package yet. The second promise would write the configuration file, but fail to restart the service. The third promise would succeed, assuming an appropriate package repo was available and functioning properly. After the first run is complete, the system has converged closer to where we want it to be, but isn’t quite there yet. Applying the functions again gets us closer to our goal.
The second run of the loop would succeed in starting the service, but would be using the wrong configuration file. The package install from the previous loop clobbered the one written previously. Promise number two would fix the config and restart the service, and the third would do nothing because the package is already installed. Finally, we’ve converged to our desired system state. A third loop would take no actions at all.
To set up a CFEngine server, invoke the following Chef command:
When Chef is done doing its thing, you’ll end up with a functioning CFEngine policy host, happily promising to serve policy. Log into the freshly configured machine and check it out. Three things have happened. First, the cfengine package itself has been installed. Second, two directories have been created and populated: /var/cfengine/inputs and /var/cfengine/masterfiles.
The inputs directory contains configuration for CFEngine itself, which includes a promise to make the contents of masterfiles available for distribution. When a CFEngine client comes up, it will copy the contents of /var/cfengine/masterfiles from the server into its own inputs directory.
CFEngine’s main configuration file is promises.cf, from which everything else flows. Here’s a short snippet:
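A hedged sketch of what that snippet plausibly showed (only puppet.cf is named in the text; the other input and bundle names are guesses):

```
body common control
{
  bundlesequence => { "update", "cfengine", "puppet_server" };

  inputs => {
    "update.cf",
    "cfengine_stdlib.cf",
    "cfengine.cf",
    "puppet.cf",
  };
}
```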
The bundlesequence section tells cf-agent which promise bundles to execute, and in what order. The one we’re examining today is named puppet_server, found in puppet.cf.
A promise bundle is CFEngine’s basic unit of intent. It’s a place to logically group related promises. Within a bundle, CFEngine processes things with normal ordering. That is, variables are converged first, then classes, then files, then packages, and so on. I wrote the bundle sections in normal order to make them easier to read, but they could be rearranged and still have the same effect. Without going into too much detail about the language, I’ll give a couple of hints to help with grokking the example.
First, in CFEngine, the word ‘class’ does not mean what it normally does in other programming languages. Instead, classes are boolean flags that describe context. Classes can be ‘hard classes’, which are discovered attributes of the environment (hostname, operating system, time, etc.), or ‘soft classes’, which are defined by the programmer. In the above example, puppetmaster_enabled and iptables_enabled are soft classes set based on the return status of a command. In place of if or case statements, boolean checks on classes are used.
Second, there are no control statements like for or while. Instead, when lists are encountered, they are automatically iterated. Check out the packages section for examples of both class decisions and list iteration. Given those two things, you should be able to work your way through the example. However, there’s really no getting around reading the reference manual if you want to learn CFEngine.
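A hedged sketch of the overall shape of the puppet_server bundle (the class names come from the text; the package list, commands, and chkconfig checks are assumptions):

```
bundle agent puppet_server
{
  vars:
    "packages" slist => { "puppet-server", "puppet" };

  classes:
    "puppetmaster_enabled" expression =>
      returnszero("/sbin/chkconfig puppetmaster", "noshell");
    "iptables_enabled" expression =>
      returnszero("/sbin/chkconfig iptables", "noshell");

  packages:
    "$(packages)"
      package_policy => "add",
      package_method => yum;

  commands:
    !puppetmaster_enabled::
      "/sbin/chkconfig puppetmaster on";
    iptables_enabled::
      "/sbin/service iptables stop";
}
```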
Finally, let’s go ahead and use Chef to bring up a CFEngine client, which will be turned into a Puppet server.
The first run will fail, since the host’s IP isn’t yet in the CFEngine server’s allowed hosts list. Complete the convergence by running these commands:
And voilà! A working Puppet server, serving policy.
Stephen Nelson Smith, I salute you, sir.
I’m quite firmly in the “Let your CM tool handle your config files” camp. To explain why, I think it’s worth briefly examining the evolution of configuration management strategies.
In order to keep this post as vague and heady as possible, no distinction between “system” and “application” configurations shall be made.
Configuration files are text files that control the behavior of programs on a machine. That’s it. They are usually read once, when a program is started from a prompt or init script. A process restart or HUP is typically required for changes to take effect.
When thinking about configuration management, especially across multiple machines, it is easy to equate the task to file management. Configs do live in files, after all. Packages are remarkably good at file management, so it’s natural to want to use them.
However, the task goes well beyond that.
An important attribute of an effective management strategy, config or otherwise, is that it reduces the amount of complexity (aka work) that humans need to deal with. But what is the work that we’re trying to avoid?
Two tasks that systems administrators concern themselves with are dependency analysis and runtime configuration.
Within the context of a single machine, dependency analysis usually concerns software installation. Binaries depend on libraries and scripts depend on binaries. When building things from source, headers and compilers are needed. Keeping the details of all this straight is no small task. Packages capture these relationships in their metadata, the construction of which is painstaking and manual. Modern Linux distributions can be described as collections of packages and the metadata that binds them. Go out and hug a package maintainer today.
Within the context of infrastructure architecture, dependency analysis involves stringing together layers of services and making individual software components act in concert. A typical web application might depend on database, caching, and email relay services being available on a network. A VPN or WiFi service might rely on PKI, Radius, LDAP and Kerberos services.
Runtime configuration is the process of taking all the details gathered from dependency analysis and encoding them into the system. Appropriate software needs to be installed, configuration files need to be populated, and kernels need to be tuned. Processes need to be started, and of course, it should all still work after a reboot.
Once upon a time, all systems were configured manually. This strategy is the easiest to understand, but the hardest to execute. It typically happens in development and small production environments, where configuration details are small enough to fit into a wiki or spreadsheet. As a network’s size and scope increase, management efforts become massive, time consuming, and prone to human error. Details end up in the heads of a few key people, and reproducibility is abysmal. This is obviously unsustainable.
The natural progression away from manual configuration was custom scripting. Scripting reduced management complexity by automating things using languages like Bash and Perl. Tutorial and documentation instructions like “add the following line to your /etc/sshd_config” were turned into automated scripts that grepped, sed’ed, appended, and clobbered. These scripts were typically very brittle, and would only produce the desired outcome on their first run.
File distribution was the next logical tactic. In this scheme, master copies of important configuration files are kept in a centralized location and distributed to machines. Distribution is handled in various ways. RDIST, NFS mounts, scp-on-a-for-loop, and rsync pulls are all popular methods.
This is nice for a lot of reasons. Centralization enables version control and reduces the time it takes to make changes across large groups of hosts. Like scripting, file distribution lowers the chance of human error by automating repetitive tasks.
However, these methods have their drawbacks. NFS mounts introduce single points of failure and brittleness. Push based methods miss hosts that happen to be down for maintenance. Pulling via rsync on a cron is better, but lacks the ability to notify services when files change.
Managing configs with packages falls into this category, and is attractive for a number of reasons. Packages can be written to take actions in their post-install sections, creating a way to restart services. It’s also pretty handy to be able to query package managers to see installed versions. However, you still need a way to manage config content, as well as initiate their installation in the first place.
Under convergence-based management, autonomous agents run on the hosts under management. The word autonomous is important, because it stresses that the machines manage themselves by interpreting policy set remotely by administrators. The policy could state any number of things about installed software and configuration files.
Policy written as code is run through an agent, letting the manipulation of packages, configuration files, and services all be handled by the same process. Brittle scripts behaving badly are eliminated by exploiting the idempotent nature of a declarative interface.
When first encountered, this is often perceived as overly complex and confusing by some administrators. I believe this is because they have equated the task of configuration management to file management for such a long time. After the initial learning curve and picking up some tools, management is dramatically simplified by allowing administrators to spend time focusing on policy definition rather than implementation.
This is where things get interesting. We have programs under our command running on every node in an infrastructure, so what should we make them to do concerning configuration files?
“Copy this file from its distribution point” is very common, since it allows for versioning of configuration files. Packaging configs also accomplishes this, and lets you make declarations about dependency. But how are the contents of the files determined?
It’s actually possible to do this by hand. Information can be gathered from wikis, spreadsheets, grey matter, and sticky notes. Configuration files can then be assembled by engineers, distributed, and manually modified as an infrastructure changes.
File generation is a much better idea. Information about the nodes in an infrastructure can be encoded into a database, then fed into templates by small utility programs that handle various aspects of dependency analysis. When a change is made, such as adding or removing a node from a cluster, configurations concerning themselves with that cluster can be updated with ease.
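As a sketch of the idea, here is a tiny “node database” fed through an ERB template in plain Ruby. The field names, cluster layout, and upstream format are all hypothetical; the point is that the config is derived from data, not edited by hand.

```ruby
require 'erb'

# Hypothetical node database; in practice this might live in CouchDB,
# LDAP, or a flat JSON file.
nodes = [
  { "fqdn" => "web1.example.com", "ipaddress" => "10.0.0.11", "cluster" => "web" },
  { "fqdn" => "web2.example.com", "ipaddress" => "10.0.0.12", "cluster" => "web" },
  { "fqdn" => "db1.example.com",  "ipaddress" => "10.0.0.21", "cluster" => "db"  },
]

# Select the members of the web cluster and render a load balancer stanza.
members = nodes.select { |n| n["cluster"] == "web" }

template = <<~ERB
  upstream web_cluster {
  <% members.each do |m| -%>
    server <%= m["ipaddress"] %>; # <%= m["fqdn"] %>
  <% end -%>
  }
ERB

config = ERB.new(template, trim_mode: "-").result(binding)
puts config
```

Adding or removing a node from the database regenerates the stanza correctly on the next run; nobody touches the rendered file.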
The logic that generates configuration files has to be executed somewhere. This is often done on the machine responsible for hosting the file distribution. A better place is directly on the nodes that need the configurations. This eliminates the need for distribution entirely.
Modifications to the node database now end up in all the correct places during the next agent run. Packaging the configs is completely unnecessary, since they don’t need to be moved from anywhere. Management complexity is reduced by eliminating the task entirely. Instead of worrying about file versioning, all that needs to be ensured is code correctness and the accuracy of the database.
Don’t edit config files. Instead, edit the truth.
-s
One thing coming down the pipe is full stack installers for various distributions. A full stack installer provides everything you need to run Chef, above libc. Telling people about this has generated a lot of excitement and interest, so I went ahead and built them for you, live from the floor of Velocity.
Below are instructions for manual installation on EL5 and EL6, clones and derivatives. I’ll leave the creation of a custom knife bootstrap script as an exercise for the reader.
Enjoy!
-s
-s
Search is Chef’s killer feature for sure. Searching for the IPs or FQDNs of nodes with particular roles or attributes lets you dynamically string together machines within your infrastructure. This eliminates the need for centralized planning of IP addresses among Chef managed resources. This is especially useful on the clouds or in DHCP environments where you are assigned random IPs.
Munin is one of the first cookbooks that I read after finding out about Chef, and is pretty much responsible for selling me on it. Below are the recipes from a simplified version of the munin cookbook.
Munin is a system metrics collection tool that gives you a ton of information out of the box with very little configuration. It’s really great for smaller installations and a great way to get some metrics now if you’re in a hurry. The Opscode apache2 cookbook is included without modification to provide a web console for viewing graphs.
You can view the complete cookbook here.
The cookbook is broken into two recipes, server.rb and client.rb
The server searches for clients to poll, and the client searches for servers to accept poll connections from. We start out by setting a node attribute in each recipe so the other half has something to search for.
The search syntax comes from Solr, so a node attribute set with node[:foo][:bar][:baz] = "buzz" is searched for with: search(:node, "foo_bar_baz:buzz")
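The flattening behind those underscore-joined keys can be sketched in plain Ruby. This is an illustration of the idea, not Chef’s actual indexer code.

```ruby
# Flatten nested attributes into underscore-joined keys, which is why
# node[:foo][:bar][:baz] = "buzz" is found with "foo_bar_baz:buzz".
def flatten_attrs(hash, prefix = [])
  hash.flat_map do |key, value|
    path = prefix + [key.to_s]
    if value.is_a?(Hash)
      flatten_attrs(value, path).to_a  # recurse into nested hashes
    else
      [[path.join("_"), value]]        # leaf: emit one flattened pair
    end
  end.to_h
end

attrs = { foo: { bar: { baz: "buzz" } }, munin: { server: true } }
puts flatten_attrs(attrs).inspect
```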
Searches return arrays of node objects (JSON blobs), which are then passed into templates where IP information is dug out and rendered into a config file.
[recipe listings: server.rb and client.rb]
A number of people have asked me why I used attributes rather than roles. The answer is to avoid baking convention into the recipe code, which I steer clear of whenever I can help it.
Consider the following scenario:
You have a role[monitoring] that includes recipe[nagios::server] and recipe[munin::server]. In the Nagios and Munin client cookbooks, you’ve searched for role[monitoring] and are happily populating your configuration files. A few months pass, and you’ve added more machines to your infrastructure.
One day your monitoring server starts crawling, since it has slow disks and can’t keep up with the IO intensive graph generation. You decide that “monitoring” is an overloaded term, and set off to split your metrics and alerting onto different machines. You edit your role structure and change your node object’s runlist assignment, and bring up some new machines. However, you still have more work to do. Now you have to go into the recipe code and change them to search for their new roles.
Using attributes as above frees you from having to modify the recipe code when editing role definitions. Don’t get me wrong, there are plenty of scenarios where roles are preferable to attributes, but for things like this, I like to avoid them.
-s
Update: a reader has pointed out that instead of using attributes, I could have used the search syntax search(:node, 'recipes:"munin::server"'). Good to know!
Chef is a configuration management platform written in Ruby. Configuration management is a large topic that most systems administrators and IT management are just now starting to gain experience with. Historically, infrastructures have been maintained either by hand, with structured scripting, by image cloning, or a combination of those. Chef’s usage model rejects the idea of cloning and maintaining “golden images”. Instead, the idea is to start with an embryonic image and grow it into its desired state. This works much better as infrastructure complexity increases, and eliminates the problem of image sprawl. The convergent nature of the tool allows you to change the infrastructure over time without much fuss. Chef allows you to express your infrastructure as code, which lets you store it in version control.
“A Can of Condensed Chef Documentation” is available here
Actually you can use any SCM, but git is the most widely adopted in the Chef community. All Chef Git repos begin their lives as clones of the Opscode chef-repo, found here: https://github.com/opscode/chef-repo There is a nice overview of the repo structure (cookbooks, databags, roles, etc) in the README.
This is easily installed from packages by following the instructions on the opscode wiki. The process amounts to “add a package repository, install the packages, and turn it on.” Alternatively, you could use the Opscode Platform and go dancing with space robots.
A “client” in Chef parlance is an SSL certificate used to access the chef-server API. If the client’s CN is marked “admin” in chef-server, the client can perform restricted operations such as creating and deleting nodes. This is the kind of client needed by knife to manipulate the infrastructure, and it normally corresponds to an actual human being, though it by no means has to. Nodes have non-admin client certificates, and can only manipulate their own node objects. To create a client certificate, you’ll need to log into the chef-server webui, click on “clients”, think of a name for it (I use someara), and paste the displayed private key into a local file.
Copy the validation key
The validation key is a special key that is shared by all freshly bootstrapped nodes. It has the ability to create new client certificates and node objects through the API.
For more details on this section, please visit http://wiki.opscode.com/display/chef/Chef+Configuration+Settings
.chef/client.rb - This file is copied onto the nodes that are bootstrapped with knife, and needs to be configured to point to the IP or FQDN of your chef server
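For illustration, a client.rb of this era might have looked like the following sketch. The hostname is a placeholder for your own chef-server; port 4000 was the default API port at the time.

```ruby
# .chef/client.rb -- copied onto nodes during bootstrap.
log_level              :info
log_location           STDOUT
chef_server_url        "http://chef.example.com:4000"
validation_client_name "chef-validator"
```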
.chef/knife.rb - This file also needs to be configured to point to your chef-server, and also to the client private key that was created earlier.
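A representative knife.rb sketch follows; the paths and hostname are placeholders for your own environment, and the client name matches the certificate created earlier.

```ruby
# .chef/knife.rb -- CLI configuration; values here are illustrative.
log_level              :info
log_location           STDOUT
node_name              "someara"
client_key             "#{ENV['HOME']}/chef-repo/.chef/someara.pem"
validation_client_name "chef-validator"
validation_key         "#{ENV['HOME']}/chef-repo/.chef/validation.pem"
chef_server_url        "http://chef.example.com:4000"
cookbook_path          ["#{ENV['HOME']}/chef-repo/cookbooks"]
```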
As mentioned earlier, run lists are made up from role trees. Here is an example of how you would create a demo server with a correct clock, managed users, and metrics and monitoring capabilities. In this example, six recipes are executed per run, and an unknown number of resources are managed. (To figure that out, you’d have to read the recipes)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
That’s quite a bit of cooking for a beginner tutorial, so we’re just going to focus on a single node running an NTP client for now. Roles can be written either as .rb files or .json files. I prefer the .rb format because it’s easier to read and write. Some people prefer to deal with the JSON formatted version directly, since that’s the way they’re dumped with knife. At the end of the day, it doesn’t really matter, so do whichever makes you happy.
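A role in the .rb format might look like this sketch; the attribute is hypothetical and just shows the shape of the DSL.

```ruby
# roles/ntp.rb -- the Ruby DSL role format.
name "ntp"
description "NTP client"
run_list "recipe[ntp]"
# Attributes set here apply to every node holding the role.
default_attributes(
  "ntp" => { "servers" => ["us.pool.ntp.org"] }
)
```

It gets loaded into chef-server with knife role from file roles/ntp.rb.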
A machine’s NTP client is simple to install and configure. Every systems administrator is already familiar with it, which makes it a great example.
Most software available as a native package in a given linux distribution can be managed with a “package, template, service” design pattern.
Each of those words refers to a Chef resource, which we pass arguments to.
Creating the cookbook skeleton (in the Chef of this era, knife cookbook create ntp) lays down a directory structure for the ntp cookbook: attributes, files, recipes, templates, and friends. You can check it out with ls.
Recipe names are related to cookbook structure. Putting recipe[foo::bar] in a node’s run list results in cookbooks/foo/recipes/bar.rb being downloaded from chef-server and executed.
There is a special recipe in every cookbook called default.rb. It is what runs when a run list names the cookbook without a specific recipe: recipe[foo] is shorthand for recipe[foo::default], and results in cookbooks/foo/recipes/default.rb being executed.
Default.rb is a good place to put common stuff when writing cookbooks with multiple recipes, but we’re going to keep it simple and just use default.rb for everything.
This is where all the fun stuff happens. When using resources, you’re writing things in a declarative fashion. Declarative means you can concentrate on the WHAT without having to worry about the HOW. Chef will take care of that for you with something called a resource provider. When installing a package, it will check to see what your operating system is and use the appropriate methodology. For example, on Debian-based systems it will use apt-get, and on Red Hat-based systems it will use yum.
[default.rb: a package, a template, and a service resource for ntp, in that order]
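In the Chef of this era, a minimal default.rb implementing the pattern would have looked roughly like this sketch (exact arguments may have differed):

```ruby
# cookbooks/ntp/recipes/default.rb
# Install the software...
package "ntp"

# ...lay down its configuration from an ERB template...
template "/etc/ntp.conf" do
  source "ntp.conf.erb"
  owner  "root"
  group  "root"
  mode   "0644"
end

# ...and make sure the daemon is enabled and running.
service "ntpd" do
  action [:enable, :start]
end
```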
Chef recipes are evaluated top down (like a normal ruby program), with each resource being run in the order it appears. Order is important. In the above example, if we were to reverse the order of those three resources, it would first fail to start the service (as the software is not installed yet), then write the configuration file, then finally clobber the file it just wrote by installing the package.
The chef-client needs to somehow get itself installed and running on managed nodes. This process is known as bootstrapping and is accomplished with shell scripting. The method of bootstrap will vary depending on how you go about provisioning your server, and the script will depend on the platform.
Cloud providers like AWS and Rackspace will let you make an API request, then return the IP of your compute resource.
With the EC2 plugin, knife uses the ruby fog library to talk to EC2 and request a server, passing the desired AMI as an argument. Knife then uses net-ssh-multi to SSH into the machine and execute a bootstrapping script. There are a number of other arguments that can be passed to knife, such as the EC2 region, machine size, and which SSH key to use. You can read all about them on the Opscode wiki.
If your method of provisioning servers is “ask your VMware administrator” or “fill out these forms”, then you’ll probably bootstrap via an IP address.
In these provisioning scenarios, you can skip knife completely and put the contents of a bootstrap script into a kickstart file or equivalent.
By default (with no arguments), Chef attempts a gem based installation meant to work on Ubuntu. If you’re not using Ubuntu, or are uncomfortable installing gems directly from rubygems.org, you’ll have to change the script to suit your taste. It works by specifying a template name with the -d flag, SSHing into the machine, and running the rendered script. When using knife to SSH, make sure you have the correct key loaded into your ssh-agent.
What I do in my boot scripts: set up name resolution and the hostname correctly, among other housekeeping (hostname -f has to work properly). After the script is run, chef-client registers with the server, pulls down the cookbooks in its run list, and converges the node.
There is an example of a custom bootstrap script here
At this point, you should have an ntp client installed, configured, and running.
(It’s actually a little bit more complicated than that. For more information about chef-client runs, see http://wiki.opscode.com/display/chef/Anatomy+of+a+Chef+Run )
Data driven infrastructures are all the rage these days. This allows you to do things like change the NTP server all your nodes use by editing a single JSON value in chef-server. You can get really creative with this, so let your imagination run wild.
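As a sketch of the data-driven approach, a data bag item can hold the server list and a recipe can feed it to the template. The bag and item names here are hypothetical; data_bag_item is the standard Chef helper.

```ruby
# data_bags/ntp/servers.json might contain:
#   { "id": "servers", "pools": ["us.pool.ntp.org"] }

# In the recipe, fetch the item and pass its values to the template:
item = data_bag_item("ntp", "servers")

template "/etc/ntp.conf" do
  source    "ntp.conf.erb"
  variables(:pools => item["pools"])
end
```

Changing the JSON value on chef-server changes every node’s rendered config on its next run.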
You can also access data bag data through the search() interface, which you can read about on the opscode wiki.
We’re not quite done yet. Let’s SSH into our shiny new NTP enabled machine and go poking about.
[on the node, /etc/ntp.conf still shows the stock time.nist.gov server]
Wait a sec, isn’t that supposed to be “us.pool.ntp.org”? Not yet. We haven’t enabled our convergence mechanism yet! If we manually run chef-client on the node, we will indeed see that the file has changed.
That file just converged into the correct state. Let’s edit the file again, this time filling it with complete garbage.
Again, the file converged into the correct state. Awesome. Running chef-client by hand on a large cluster of nodes would be a real pain, so it makes sense to set it up automatically. Indeed, often found in a “role[base]” is a “recipe[chef-client]” that configures it to run as a daemon, or from a cron.
It is safe to run the recipes on the nodes time and time again because resources are written to be idempotent. You may remember from math class that a function f is idempotent if, for all values of x, f(f(x))=f(x). That means you can run a function over a resource a bajillion times and it will behave as if it was only done once.
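Idempotence in miniature, in plain Ruby: an “ensure this line exists” function that is safe to apply any number of times. This mirrors what resource providers do — test first, act only if needed. It is an illustration, not Chef code.

```ruby
require 'tempfile'

# Append a line to a file only if it isn't already there.
def ensure_line(path, line)
  lines = File.exist?(path) ? File.readlines(path, chomp: true) : []
  return :unchanged if lines.include?(line)  # already converged; do nothing
  File.open(path, "a") { |f| f.puts(line) }
  :changed
end

conf = Tempfile.new("ntp.conf")
ensure_line(conf.path, "server us.pool.ntp.org")  # first run changes the file
ensure_line(conf.path, "server us.pool.ntp.org")  # every run after is a no-op
```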
This is implemented under the hood as “If it ain’t broke, don’t fix it.” In a file resource, checksums are calculated and compared. In a package resource, the rpm or dpkg databases are consulted to see if the package is installed. The effect of this is that most chef-client runs do absolutely nothing to resources. That is, until you change the function by altering the inputs to the resource providers.
Further examination reveals that the ntpd service is still talking to “time.nist.gov”. This is because during the chef-client run, the resource named “ntpd” ran its idempotency check, and found that it was, in fact, running. It therefore did nothing. If we want ntpd to restart when the contents of /etc/ntp.conf are altered, we have to modify our recipe to set up that relation.
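Concretely, the modified recipe might look like this sketch, with the template notifying the service (the string form of notifies assumes a Chef version that supports it):

```ruby
package "ntp"

template "/etc/ntp.conf" do
  source "ntp.conf.erb"
  owner  "root"
  group  "root"
  mode   "0644"
  # Restart ntpd whenever the rendered file's checksum changes.
  notifies :restart, "service[ntpd]"
end

service "ntpd" do
  action [:enable, :start]
end
```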
Alternatively, we could have set up the “service[ntpd]” resource to subscribe to the “template[/etc/ntp.conf]” resource.
Upload the modified ntp cookbook to chef-server and re-run the client on your demo server to check your work.
Winning.
To save yourself from writing crazy for loops on the command line just to upload every cookbook and role one at a time, somebody was nice enough to write some rake tasks. List them with rake -T, then install your repo into chef-server with “rake install”.
There are two ways to view your infrastructure: through the management console, or from knife.
Remember that nodes, their client certificates, and the machines they’re associated with are three separate entities.
Running knife node delete against a node’s name just deletes the node object from chef-server. The next time the machine runs chef-client, the node object will be recreated. The recreated node object will have an empty run list that will have to be repopulated before chef-client actually does anything.
Running knife client delete removes a node’s client certificate (its public key) from chef-server. The next time the machine runs chef-client, it will get a permission denied error. If this is done by accident, SSH into the machine, delete its client key at /etc/chef/client.pem, and re-run chef-client.
Deleting a machine will be specific to how it was provisioned. On AWS, it would look like “knife ec2 server delete i-DEAFBEEF”. On a VMware cluster, it could be by clicking buttons in a GUI. I once deleted a hardware machine by throwing it off a balcony. YMMV.
I like to keep a special directory called “infrastructures” that contain sub-directories and nodes.sh files. A nodes.sh contains a list of knife commands that can be thought of as the highest level view of the infrastructure. for example:
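A hypothetical nodes.sh might look like the following; the AMI IDs, SSH users, and role names are all placeholders for your own environment.

```shell
# infrastructures/demo/nodes.sh
knife ec2 server create -I ami-xxxxxxxx -x ubuntu -r 'role[base],role[monitoring]'
knife ec2 server create -I ami-xxxxxxxx -x ubuntu -r 'role[base],role[webserver]'
knife ec2 server create -I ami-xxxxxxxx -x ubuntu -r 'role[base],role[database]'
```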
This file can eventually be used to bring up entire infrastructures, but during development, lines are typically pasted into a terminal individually.
This is as close as I’ve gotten to replacing myself with a very small shell script so far. Many a sysadmin has been pursuing this for a long time now. It is here. The journey has just begun.
-s
Chef’s documentation is vast and broken up into many pages on the Opscode wiki. The goal here is to index this information and give a brief explanation of each topic without going into too much depth.
http://wiki.opscode.com/display/chef/Architecture
Chef is a configuration management platform in the same class of tools as Cfengine, Bcfg2, and Puppet. The idea is to define policy at a centralized, version controlled place, and then have the machines under management pull down their policy and converge onto that state at regular intervals. This gives you a single point of administration allowing for easier change management and disaster recovery. Combined with a compute resource provisioning layer (such as knife’s ability to manipulate EC2 or Rackspace servers), entire complex infrastructures can pop into existence in minutes.
http://wiki.opscode.com/display/chef/Chef+Server
Chef server has various components under the hood. Assorted information (cookbooks, databags, client certificates, and node objects) is stored in CouchDB as JSON blobs and indexed for search by chef-solr-indexer, with RabbitMQ queueing objects for indexing. A RESTful API service, chef-server, exposes all of this to the network. If you don’t want to run chef-server yourself, Opscode will do it for you with their Platform service for a meager $5/node/month. The management console is really handy during development, since it provides a nice way to examine JSON data. However, it should be noted that real chefs value knife techniques.
http://wiki.opscode.com/display/chef/Api+Clients
In Chef, the term “client” refers to an SSL certificate for an API user of chef-server. This is often a point of confusion, and should not be confused with chef-client. Most of the time, one machine has one client certificate, which corresponds to one node object.
http://wiki.opscode.com/display/chef/Nodes
Nodes are JSON representations of machines under Chef management. They live in chef-server. They contain two important things: the node’s run list, and a collection of attributes. The run list is a collection of recipe names that will be run on the machine when chef-client is invoked. Attributes are various facts about the node, which can be manipulated either by hand, or from recipes.
http://wiki.opscode.com/display/chef/Attributes
Attributes are arbitrary values set in a node object. Ohai provides a lot of informational attributes about the node, and arbitrary attributes can be set by the line cooks. They can be set from recipes or roles, and have a precedence system that allows you to override them. Examples of arbitrary attributes are listen ports for network services, or the name of a package on a particular linux distribution (httpd vs apache2).
http://wiki.opscode.com/display/chef/Ohai
Ohai is the Chef equivalent of Cfengine’s cf-know and Puppet’s facter. When invoked, it collects a bunch of information about the machine it’s running on, including Chef related stuff, hostname, FQDN, networking, memory, cpu, platform, and kernel data. This information is then output as JSON and used to update the node object on chef-server. It is possible to write custom Ohai plugins, if you’re interested in something not dug up by default.
http://wiki.opscode.com/display/chef/Chef+Client
Managed nodes run an agent called chef-client at regular intervals. This agent can be run as a daemon or invoked from cron. The agent pulls down policy from chef-server and converges the system to the described state. This lets you introduce changes to machines in your infrastructure by manipulating data in chef-server. The pull (vs push) technique ensures machines that are down for maintenance end up in the proper state when turned back on.
http://wiki.opscode.com/display/chef/Resources
Resources are the basic configuration items that are manipulated by Chef recipes. Resources make up the Chef DSL by providing a declarative interface to objects on the machine. Examples of core resources include files, directories, users and groups, links, packages, and services.
http://wiki.opscode.com/display/chef/Recipes
Recipes contain the actual code that gets run on machines by chef-client. Recipes can be made up entirely of declarative resource statements, but rarely are. The real power of Chef stems from a recipe’s ability to search chef-server for information. A recipe can say “give me a list of all the hostnames of my web servers”, and then generate the configuration file for your load balancer. Another recipe might say “give me a list of all my database servers”, so it can configure Nagios to monitor them.
http://wiki.opscode.com/display/chef/Cookbooks
Cookbooks allow you to logically group recipes. Cookbooks come with all the stuff the recipes need to make themselves work, such as files, templates, and custom resources (LWRPs).
http://wiki.opscode.com/display/chef/Roles
Roles allow you to assemble trees of recipe names, which are expanded into run lists. Roles can contain other roles, which serve as vertices, and recipe names, which are the leaves. The tree is walked depth first, which makes ordering intuitive when assembling run lists. It is possible to apply many of these trees to a single node, but you don’t have to. Roles can also contain lists of attributes to apply to nodes, potentially changing recipe behavior.
http://wiki.opscode.com/display/chef/Data+Bags
Databags are arbitrary JSON structures that can be searched for by Chef recipes. They typically contain things like database passwords and other information that needs to be shared between resources on nodes. You can think of them as read only global variables that live on chef-server. They also have a great name that can be used to make various jokes in Campfire.
http://wiki.opscode.com/display/chef/Knife
knife is the CLI interface to the chef-server API. It can manipulate databags, node objects, cookbooks, etc. It can also be used to provision cloud resources and bootstrap systems.
-s