A Fistful of Servers

Sean O’Meara is a Systems Administrator and technology consultant living in NYC.

Promises, Lies, and Dry-Run Mode


Introduction

“I need to know what this will do to my production systems before I run it.” – Ask a Systems Administrator why they want dry-run mode in a management tool, and this is the answer you’ll get almost every single time.

Historically, we have been able to use dry-run as a risk mitigation strategy before applying changes to machines. Dry-run is supposed to report what a tool would do, so that the administrator can determine if it is safe to run. Unfortunately, this only works if the reporting can be trusted as accurate.

In this post, I’ll show why modern configuration management tools behave differently than the classical tool set, and why their dry-run reporting is untrustworthy. While useful for development, it should never be used in place of proper testing.

make -n

Many tools in a sysadmin’s belt have a dry-run mode. Common utilities like make, rsync, rpm, and apt all have it. Many databases will let you simulate updates, and most disk utilities can show you changes before making them.

The make utility is the earliest example I can find of an automation tool with a dry-run option. Dry-run in make -n works by building a list of commands, then printing instead of executing them. This is useful because it can be trusted that make will always run the exact same list in real-run mode. Rsync and others behave the same way.

Convergence-based tools, however, don’t build lists of commands. They build sets of convergent operators instead.

Convergent Operators, Sets and Sequence

Convergent operators ensure state. They have a subject, and two sets of instructions. The first set are tests that determine if the subject is in the desired state, and the second set takes corrective actions if needed. Types are made by grouping common tests and actions. This allows us to talk about things like users, groups, files, and services abstractly.
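
To make that concrete, here is a minimal convergent operator sketched in plain Ruby. This is my own illustration, not code from any particular tool:

ConvergentOperator = Struct.new(:subject, :check, :fix) do
  def converge!
    return :kept if check.call(subject)   # already in the desired state
    fix.call(subject)                     # take corrective action
    :repaired
  end
end

motd_exists = ConvergentOperator.new(
  "/tmp/motd",
  ->(path) { File.exist?(path) },
  ->(path) { File.write(path, "welcome\n") }
)

puts motd_exists.converge!   # :repaired on the first run, :kept afterwards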

CFEngine promise bundles, Puppet manifests, and Chef recipes are all sets of these data structures. Putting them into a feedback loop lets them cooperate over multiple runs, and enables the self-healing behavior that is essential when dealing with large amounts of complexity.

During each run, ordered sets of convergent operators are applied against the system. How order is determined varies from tool to tool, but it is ordered nonetheless.

Promises and Lies

CFEngine models Promise Theory as a way of doing systems management. While Puppet and Chef do not model promise theory explicitly, it is still useful to borrow its vocabulary and metaphors and think about individual, autonomous agents that promise to fix the things they’re concerned with.

When writing policy, imagine every resource statement as a simple little robot. When the client runs, a swarm of these robots run tests, interrogate package managers, inspect files, and examine process tables. Corrective action is taken only when necessary.

When dealing with these agents, it can sometimes seem like they’re lying to you. This raises a few questions. Why would they lie? Under what circumstances are they likely to lie? What exactly is a lie anyway?

A formal examination of promises does indeed include the notion of lies. Lies can be outright deceptions, which are the lies of the rarely-encountered Evil Robots. Lies can also be “non-deceptions”, which are the lies of occasionally-encountered Broken Robots. Most often though, we experience lies from the often-encountered Merely Mis-informed Robots.

The Best You Can Do

The best you can possibly hope to do in a dry-run mode is to build the operator sequences, then interrogate each one about what it would do to repair the system at that exact moment. The problem is that in real-run mode, the system is changing between the tests. Quite often, the results of any given test will be affected by a preceding action.
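
A toy model makes the divergence obvious: two operators, where the first one’s repair changes the outcome of the second one’s test. Dry-run reports one pending repair; real-run performs two. (This is my own sketch, not output from any real tool.)

state = { puppet_installed: false }

ops = [
  { name:  "install puppet",
    check: ->(s) { s[:puppet_installed] },
    fix:   ->(s) { s[:puppet_installed] = true } },
  { name:  "echo hello (only when puppet exists)",
    check: ->(s) { !s[:puppet_installed] },   # nothing to do unless puppet is there
    fix:   ->(s) { puts "hello" } },
]

# dry-run: interrogate every operator against the state at this exact moment
ops.each { |op| puts "would fix: #{op[:name]}" unless op[:check].call(state) }

# real-run: each repair changes the state that later tests observe
ops.each { |op| op[:fix].call(state) unless op[:check].call(state) }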

Configuration operations can have rather large side effects. Sending signals to processes can change files on disk. Mounting a disk will change an entire branch of a directory tree. Packages can drop off one or a million different files, and will often execute pre or post-installation scripts. Installing the Postfix package on an Ubuntu system will not only write the package contents to disk, but also create users and disable Exim before automatically starting the service.

Throw in some notifications and boolean checks and things can get really interesting.

Experiments in Robotics

To experiment with dry-run mode, I wrote a Chef cookbook that configures a machine with initial conditions, then drops off CFEngine and Puppet policies for dry-running.

Three configuration management systems, each with conflicting policies, wreaking havoc on a single machine sounds like a fun way to spend the evening. Let’s get weird.

If you already have a Ruby and Vagrant environment set up on your workstation and would like to follow along, feel free. Otherwise, you can just read the code examples by clicking on the provided links as we go.

Clone the dry-run-lies cookbook from GitHub, then bring up a Vagrant box with Chef.

~/src/$ git clone https://github.com/someara/dry-run-lies-cookbook dry-run-lies
~/src/$ cd dry-run-lies
~/src/dry-run-lies$ bundle install
~/src/dry-run-lies$ bundle exec vagrant up
~/src/dry-run-lies$ bundle exec vagrant ssh

CFEngine –dry-run

When Chef is done configuring the machine, log into it and switch to root. We can test the /tmp/lies-1.cf policy file by running cf-agent with the -n flag.

CFEngine dry-run
root@dry-run-lies:~# cf-agent -K -f /tmp/lies-1.cf -n
-> Would execute script /bin/echo hello from bundle_one. puppet_bin_does_not_exist
 -> Need to execute /usr/bin/aptitude update...

Dry-run mode reports that it would run an echo command in bundle_one. Let’s remove -n and see what happens.

CFEngine real-run
root@dry-run-lies:~# cf-agent -K -f /tmp/lies-1.cf 
Q: ".../bin/echo hello": hello from bundle_one. puppet_bin_does_not_exist
I: Last 1 quoted lines were generated by promiser "/bin/echo hello from bundle_one. puppet_bin_does_not_exist"
Q: ".../bin/echo hello": hello from bundle_three. puppet_bin_exists
I: Last 1 quoted lines were generated by promiser "/bin/echo hello from bundle_three. puppet_bin_exists"

Wait a sec… What’s all this bundle_three business? Did dry-run just lie to me?

Examine the lies-1.cf file here.

The policy said three things. First, “echo hello from bundle one if /usr/bin/puppet does NOT exist”. Second, “make sure the puppet package is installed”. Third, “echo hello from bundle three if /usr/bin/puppet exists.”

In dry-run mode, each agent was interrogated individually. This resulted in a report leading us to believe that only one “echo hello” would happen, when in reality, there were two.

Puppet –noop

Let’s give Puppet a spin. We can test the policy at /tmp/lies-1.pp with the --noop flag to see what Puppet thinks it will do.

Puppet dry-run
root@dry-run-lies:~# puppet apply /tmp/lies-1.pp --noop
notice: /Stage[main]//Mount[/mnt/nfsmount]/ensure: current_value ghost, should be unmounted (noop)
notice: /Stage[main]//Mount[/mnt/nfsmount]: Would have triggered 'refresh' from 1 events
notice: Class[Main]: Would have triggered 'refresh' from 3 events
notice: Stage[main]: Would have triggered 'refresh' from 1 events
notice: Finished catalog run in 0.30 seconds

Dry-run reports that there is one resource to fix. Excellent. Let’s remove the --noop flag and see what happens.

Puppet real-run
root@dry-run-lies:~# puppet apply /tmp/lies-1.pp
notice: /Stage[main]//Mount[/mnt/nfsmount]/ensure: ensure changed 'ghost' to 'unmounted'
notice: /Stage[main]//Mount[/mnt/nfsmount]: Triggered 'refresh' from 1 events
notice: /Stage[main]//File[/mnt/nfsmount/file-1]/ensure: created
notice: /Stage[main]//File[/mnt/nfsmount/file-2]/ensure: created
notice: /Stage[main]//File[/mnt/nfsmount/file-3]/ensure: created
notice: Finished catalog run in 4.37 seconds

Like the CFEngine example, we have the real-run doing things that were not listed in the dry-run report.

The Chef policy that set up the initial machine state mounted an NFS export at /mnt/nfsmount. When interrogated during dry-run, the tests in the file resources saw that the files were present, so they did not report that they needed to be fixed. During the real-run, Puppet unmounts the directory, changing the view of the filesystem and the outcome of the tests.

Check out the policy here.

It should be noted that Puppet’s resource graph model does nothing to enable noop functionality, nor can it affect its accuracy. It is used only for the purposes of ordering and ensuring non-conflicting node names within its model.

Chef –why-run

Finally, we’ll run the original Chef policy with the -W flag to see if it lies like the others.

root@dry-run-lies:~# chef-solo -c /tmp/vagrant-chef-1/solo.rb -j /tmp/vagrant-chef-1/dna.json -Fmin --why-run
Starting Chef Client, version 10.16.4
Compiling cookbooks .......done.
Converging 32 resources .........................U.......UUUS
System converged.

resources updated this run:
* mount[/mnt/nfsmount]
- mount 127.0.0.1:/srv/nfssrv to /mnt/nfsmount

* package[nmap]
- install version 5.21-1.1ubuntu1 of package nmap

* package[puppet]
- remove  package puppet
- purge  package puppet

* package[puppet-common]
- remove  package puppet-common
- purge  package puppet-common

chef client finished, 4 resources updated

Seems legit. Let’s remove the --why-run flag and do it for real.

root@dry-run-lies:~# chef-solo -c /tmp/vagrant-chef-1/solo.rb -j /tmp/vagrant-chef-1/dna.json -Fmin 
Starting Chef Client, version 10.16.4
Compiling cookbooks .......done.
Converging 32 resources .........................U.......UUUU
System converged.

resources updated this run:
* mount[/mnt/nfsmount]
- mount 127.0.0.1:/srv/nfssrv to /mnt/nfsmount

* package[nmap]
- install version 5.21-1.1ubuntu1 of package nmap

* package[puppet]
- remove  package puppet

* package[puppet-common]
- remove  package puppet-common

* execute[hack the planet]
- execute /bin/echo HACKING THE PLANET

chef client finished, 5 resources updated

Right. “HACKING THE PLANET” was definitely not in the dry-run output. Let’s go figure out what happened. See the Chef policy here.

Previously, our CFEngine policy had installed Puppet on the machine. Our Puppet policy ensured nmap was absent. Chef will install nmap, but only if the Puppet binary is present in /usr/bin.
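
The gist of the relevant Chef resources, reconstructed from that description (see the linked policy for the real code), is something like this:

package "nmap" do
  action :install
  only_if { ::File.exist?("/usr/bin/puppet") }  # true, thanks to the earlier CFEngine run
end

execute "hack the planet" do
  command "/bin/echo HACKING THE PLANET"
  only_if "which nmap"                          # evaluated against live machine state
end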

Running Chef in --why-run mode, the test for the 'package[nmap]' resource succeeds because of the pre-conditions set up by the CFEngine policy. Had we not applied that policy, the 'execute[hack the planet]' resource would still not have fired, because nothing installs nmap along the way. In real-run mode, it succeeds because Chef changes the machine state between tests, but it would have failed if we had never run the Puppet policy.

Yikes.

Okay, So What?

The robots were not trying to be deceptive. Each autonomous agent told us what it honestly thought it should do in order to fix the system. As far as they could see, everything was fine when we asked them.

As we automate the world around us, it is important to know how the systems we build fail. We are going to need to fix them, after all. It is even more important to know how our machines lie to us. The last thing we need is an army of lying robots wandering around.

Luckily, there are a number of techniques for testing and introducing change that can be used to help ensure nothing bad happens.

Keeping the Machines Honest

Testing needs to be about observation, not interrogation. In each case, the system converged to the policy, regardless of whether dry-run got confused or not. If we can set up test machines that reproduce a system’s state, we can real-run the policy and observe the behavior. Integration tests can then be written to ensure that the policy achieves what it is supposed to.
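
As a sketch, an observation-style check asserts on the actual state of the converged machine instead of asking the agents anything. These tests are hypothetical, written against the experiments above:

require "minitest/autorun"

class ConvergedStateTest < Minitest::Test
  def test_nmap_is_installed
    assert system("which nmap > /dev/null"), "nmap should be present"
  end

  def test_nfs_directory_is_mounted
    assert_includes File.read("/proc/mounts"), "/mnt/nfsmount"
  end
end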

Ideally, machines are modeled with policy from the ground up, starting with Just Enough Operating System to allow them to run Chef. This ensures all the details of a system have been captured and are reproducible.

Other ways of reproducing state work, but come with the burden of having to drag that knowledge around with you. Snapshots, kickstart or bootstrap scripts, and even manual configuration will all work as long as you can promise they’re accurate.

There are some situations where reproducing a test system is impossible, or modeling it from the ground up is not an option. In this case, a slow, careful, incremental application of policy, aided by dry-run mode and human intuition, is the safest way to start. Chef’s why-run mode can help aid intuition by publishing assumptions about what’s going on. “I would start the service, assuming the software had been previously installed” helps quite a bit during development.

Finally, increasing the resolution of our policies will help the most in the long term. The more robots the better. Ensuring the contents of your configuration files is good. Making sure that they are the only ones present in a conf.d directory is better. As a community, we need to produce as much high quality, trusted, tested, and reusable policy as possible.

Good luck, and be careful out there.

-s

CFEngine Puppet and Chef Part 3


At the end of the last installment, we used Puppet to create a Chef server. That brings us full circle, and the only thing we have left to do is examine how Chef works. We’ll do that by looking at the code that gave us our original CFEngine server.

Chef

Since they’re both written in Ruby, people tend to compare Puppet and Chef. This is natural, since they have a lot in common. Both are convergence-based configuration management tools inspired by CFEngine. Both have standalone discovery agents (facter and ohai, respectively), as well as RESTful APIs for gleaning node information from the server. It turns out, however, that Chef actually has a lot more in common with CFEngine.

Like CFEngine, Chef copies policy from the server and evaluates it on the edges. This allows for high scalability, since the server isn’t doing very much. Think of a web application that does most of its work in the browser instead of on the server.

A Chef recipe is a collection of convergent resource statements, and serves as the basic unit of intent. This is analogous to a CFEngine promise bundle. The Chef run list is how recipe ordering is defined, and is directly comparable to CFEngine’s bundlesequence. Using this approach makes it easy to reason about what’s going on when writing infrastructure as code.

Chef Specials

Imperative programming and declarative interface

While it’s true that Chef is just “pure ruby” and therefore imperative, to say that Chef is imperative without considering the declarative interface to resources is disingenuous at best. Using nothing but Chef resources, recipes look very much like their CFEngine and Puppet counterparts. The non-optimally ordered Chef version of NTP converges in the same number of runs as the CFEngine example from the first installment. This is because the underlying science of convergent operators is the same.

# service
service "ntp" do
  action [ :enable, :start ]
  ignore_failure true
end

# file 
template "/etc/ntp.conf" do
  source "ntp.conf.erb"
  owner "root"
  group "root"
  mode 0644
  notifies :restart, "service[ntp]"
  ignore_failure true
end

# package 
package "ntp" do
  action :install
  ignore_failure true
end

When and where order matters, imperative ordering isolated within a recipe is the most intuitive way for sysadmins to accomplish tasks within the convergent model. “Install a package, edit a config file, and start the service” is how most people think about the task. Imperative ordering of declarative statements gives the best of both worlds. When order does NOT matter, it’s safe to re-arrange recipe ordering in the Chef run list.
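
For comparison, here is the same recipe in its intuitive order (package, template, service), which converges in a single run:

package "ntp" do
  action :install
  ignore_failure true
end

template "/etc/ntp.conf" do
  source "ntp.conf.erb"
  owner "root"
  group "root"
  mode 0644
  notifies :restart, "service[ntp]"
  ignore_failure true
end

service "ntp" do
  action [ :enable, :start ]
  ignore_failure true
end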

Multiphase execution

The real trick to effective Chef cookbook development is to understand the Anatomy of a Chef Run. When a Chef recipe is evaluated in the compilation phase, encountered resources are added to the Resource Collection, which is an array of evaluated resources with deferred execution.

The compile phase of the recipe below would add 99 uniquely named, 12 oz, convergent beer_bottles to the collection, and the configure phase would take them down and pass them around. Subsequent runs would do nothing.

thanks jtimberman!
size = ((2 * 3) * 4) / 2

99.downto(1) do |i|
  beer_bottle "bottle-#{i}" do
    oz size
    action [ :take_down, :pass_around ]
  end
end

The idea is that you can take advantage of the full power of Ruby to make decisions about what to declare about your resources. Most people just use the built-in Chef APIs to consult chef-server for topology information about their infrastructure. However, there’s nothing stopping you from importing random Ruby modules and accessing existing SQL databases instead.
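
As a sketch of that idea, a recipe could generate resources from a database at compile time. The sqlite3 gem, the table, and the template name here are all assumptions for illustration:

require "sqlite3"

db = SQLite3::Database.new("/etc/topology.db")

db.execute("SELECT name, port FROM vhosts").each do |name, port|
  template "/etc/httpd/conf.d/#{name}.conf" do
    source "vhost.conf.erb"
    mode 0644
    variables(:server_name => name, :port => port)
  end
end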

Want to name your name servers after your Facebook friends? Go for it. Want your MOTD to list all James Brown albums released between 1980 and 1990? Not a problem. The important part is that things are ultimately managed with a declarative, idempotent, and convergent resource interface.

cfengine.rb

Let’s take a look at the recipe that gave us our original CFEngine server.

[Code listing unavailable: cookbooks/cfengine/recipes/server.rb]

Topology management

When a node is bootstrapped with Chef, a run list of roles or recipes is requested by the node itself. After that, the host is found by recipes running elsewhere in the infrastructure by searching for roles or attributes. This contrasts with the CFEngine and Puppet techniques of matching classes based on a hostname, FQDN, IP, or other found information.

This approach has the effect of decoupling a node’s name from its functionality. Line 10 in cfengine.rb above searches out node objects, which are later passed to the promises-server.cf.erb template for authorization.

Wrapping up

So there you have it folks. Chef making CFEngine making Puppet making Chef. These tools can be used to automate literally anything, and they’re pretty easy to use once you figure out how they work. I was going to throw some Bcfg2 and LCFG in there just for fun, but I only had so much free time =)

Configuration management is like a portal.

-s

CFEngine Puppet and Chef Part 2


In the previous installment, we used Chef to configure CFEngine to serve policy that allowed us to create a Puppet service. In this one, we’ll have Chef use that Puppet service to create a Chef server. If you think this is a ridiculous thing to do, I would be inclined to agree with you. However, this is my blog so I make the rules.

Puppet

Puppet at its core works like CFEngine. Statements in Puppet are convergent operators: they are declarative (and therefore idempotent), and they check a resource’s state before taking any action. Like the NTP example from the CFEngine installment, non-optimally ordered execution will usually work itself out after repeated Puppet runs.

Unlike CFEngine, where policy is copied and evaluated on the edges, Puppet clients connect to the Puppet server where configuration is determined based on a certificate CN. A catalog of serialized configuration data is shipped back to the client for execution. This catalog is computed based on the contents of the manifests stored on the server, as well as a collection of facts collected from the clients. Puppet facts, like CFEngine hard classes, are discoverable things about a node such as OS version, hostname, kernel version, network information, etc.

Puppet works a bit like the food replicators in Star Trek. Resources make up the basic atoms of a system, and the precise configuration of each must be defined. If a resource is defined twice in a manifest with conflicting states, Puppet refuses to run.

Ordering can be specified through require statements that set up relations between resources. These are used to build a directed graph, which Puppet sorts topologically and uses to determine the final ordering. If a resource in a chain fails for some reason, dependent resources down the graph will be skipped.

This allows for isolation of unrelated resource collections. For example, if a package repository for some reason fails to deliver the ‘httpd’ package, its dependent configuration file and service resources will be skipped. This has nothing to do with an SSH resource collection, so the resources concerning that service will be executed even though the httpd collection had previously failed.

Just be careful not to create the coffee without the cup.

chef.pp

Let’s examine a Puppet manifest that creates a Chef server on CentOS 6.

[Code listing unavailable: cookbooks/cfengine/files/default/server/puppet/manifests/classes/chef.pp]

Picking it apart

Line 1 is a Puppet class definition. This groups the resource statements together, allowing us to assign chef-server to a node based on its hostname. This can be accomplished with an explicit nodes.pp definition, or with an external node classifier.

Line 3 is an exec resource, which we can later refer to by its name: rbel6-release. When using exec resources, it’s up to you to specify a convergence check. In this case, we used the unless keyword to check the return status of an rpm command. The same goes for command promise types in CFEngine, or execute resources in Chef.

Line 9 is an example of an array variable, which is iterated over in line 21, much like a CFEngine slist.

Everything else is a standard Puppet resource declaration, each of which has a type, a name, and an argument list. Like CFEngine promises, each type has various intentions available under the hood. Packages can be installed. Services can be running or stopped, and files can be present with certain contents and permissions.

Refer to the Puppet documentation for more details.

On to Chef

knife bootstrap centos6-3 -r 'role[affs-chef]' -N "affs-chef-1.example.com" -E development -d affs-omnibus-pre -x root

CFEngine Puppet and Chef Part 1


Introduction

Over the past few years, the topic of Infrastructure Automation has received a huge amount of attention. The three most commonly used tools for doing this (in order of appearance) are CFEngine, Puppet, and Chef. This article explores each of them by using one to set up another. If you have a chef-server or Hosted Chef account, you can follow along by following the instructions in the setup section. (Full disclosure: I work for Opscode, creators of Chef.)

Infrastructure

“Infrastructure” turns out to be the hardest thing to explain when discussing automation, yet is the most critical to understand. In this context, Infrastructure isn’t anything physical (or virtualized) like servers or networks. Instead, what we’re talking about is all the “stuff” that is configured across machines to enable an application or service.

In practice, “stuff” translates to operating system baselines, kernel settings, disk mounts, OS user accounts, directories, symlinks, software installations, configuration files, running processes, etc. People of the ITIL persuasion may think of these as Configuration Items. Units of management are composed into larger constructs, and complexity arises as these arrangements become more intricate.

Services running in an Infrastructure need to communicate with each other, and do so via networks. Even when running on a single node, things still communicate over a loopback address or a Unix domain socket. This means that Infrastructure has a topology, which is in itself yet another thing to manage.

Automation

Here is a picture of a duck.

This duck happens to be an automaton. An automaton is a self-operating machine. This one pretends to digest grain. It interacts with its environment by taking input and producing output. To continue operating, the duck requires maintenance. It needs to be wound, cleaned, and repaired. Automated services running on a computer are no different.

Once turned on, an automated service takes input, does something useful, then leaves logs and other data in its wake. Its machinery is the arrangement of software installation, configuration, and the running state of a process. Maintenance is performed in a control loop, where an agent comes around at regular intervals inspecting its parts and fixing anything that’s broken.

In automated configuration management, the name of the game is hosting policy. The agents that build and maintain systems pull down blueprints and set to work building our automatons. When systems come back up from maintenance or new ones spring into existence, they configure themselves by downloading policy from the server.

Setup

If you’d like to follow along by configuring your own machines with knife, follow the setup instructions here. The setup will get your Chef workstation configured, code checked out from my blog git repo, and uploaded to chef-server for use. Otherwise, you can just browse the source here.

CFEngine

CFEngine is a system based on promise theory. Promises are the basic atoms of the CFEngine universe. They have names, types, and intentions (among other things), and each acts as a convergent operator to move its subject toward an intended state. Like the parts in our duck, promises are assembled to create a larger whole.

Promises of various types are capable of different things. Promises of type “package” can interact with a package manager to make sure something is installed or removed, while a promise of type “file” can copy, edit, and set permissions. Processes can be started or stopped, and commands can be run if needed. Read all about them in the CFEngine reference manual.

Promises provide a declarative interface to resources under management, which has the remarkably handy attribute of being idempotent. An idempotent function gives the same result when applied multiple times. This allows our duck-repairing maintenance loop (in the form of cf-agent on a cron) to come around and safely execute instructions without having to worry about side effects. Consider “the line ‘foo’ should exist in the file” vs “append ‘foo’ to the end of the file”; the non-declarative ‘append’ would not be safe to repeat.
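
The difference is easy to demonstrate in a few lines of Ruby:

path = "/tmp/example.conf"

# Non-idempotent: every run appends another copy of the line.
File.open(path, "a") { |f| f.puts "foo" }

# Idempotent: "the line 'foo' should exist in the file". Run it as
# often as you like; the result is the same.
lines = File.exist?(path) ? File.readlines(path, chomp: true) : []
File.open(path, "a") { |f| f.puts "foo" } unless lines.include?("foo")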

Convergent maintenance refers to the continuous repair of a system towards a desired state. At the individual promise level, convergence happens in a single run of the maintenance loop. If a package is supposed to be installed but isn’t, action will be taken to fix it. If a process is not running but should be, action will be taken again. Convergence in a larger system of promises can take multiple runs if things are processed in a non-optimal order. Consider the following:

Start the NTP service.
Make sure the NTP configuration file is correct, restart the NTP service if repaired.
Install the NTP package.

Assuming a system with a base install, the first promise would fail to be kept. The NTP binary is not available, since we haven’t installed its package yet. The second promise would write the configuration file, but fail to restart the service. The third promise would succeed, assuming an appropriate package repo was available and functioning properly. After the first run is complete, the system has converged closer to where we want it to be, but isn’t quite there yet. Applying the functions again gets us closer to our goal.

The second run of the loop would succeed in starting the service, but it would be using the wrong configuration file; the package install from the previous run clobbered the file written before it. Promise number two would fix the config and restart the service, and the third would do nothing because the package is already installed. Finally, we’ve converged to our desired system state. A third loop would take no actions at all.
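
The whole three-run convergence can be simulated with a toy model, where hash flags stand in for the real system:

state = { installed: false, config_ok: false, running: false }

start_service = lambda do |s|
  s[:running] = true if s[:installed] && !s[:running]
end

fix_config = lambda do |s|
  unless s[:config_ok]
    s[:config_ok] = true                  # write the correct file
    s[:running] = true if s[:installed]   # restart the service if repaired
  end
end

install_package = lambda do |s|
  unless s[:installed]
    s[:installed] = true
    s[:config_ok] = false                 # the package ships its own config
  end
end

3.times do |n|
  [start_service, fix_config, install_package].each { |p| p.call(state) }
  puts "after run #{n + 1}: #{state}"
end
# run 1: package installed, but its stock config clobbered ours
# run 2: service started, config repaired, service restarted
# run 3: nothing left to do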

Kicking things off

To set up a CFEngine server, invoke the following Chef command:

knife bootstrap centos6-1 -r 'role[cfengine]' -N "cfengine-1.example.com" -E development -d affs-omnibus-pre -x root

When Chef is done doing its thing, you’ll end up with a functioning CFEngine policy host, happily promising to serve policy. Log into the freshly configured machine and check it out. Two things have happened. First, the cfengine package itself has been installed. Second, two directories have been created and populated: /var/cfengine/inputs and /var/cfengine/masterfiles.

The inputs directory contains configuration for the CFEngine itself, which includes a promise to make the contents of masterfiles available for distribution. When a CFEngine client comes up, it will copy the contents of /var/cfengine/masterfiles from the server into its own inputs directory.

Examining policy

CFEngine’s main configuration file is promises.cf, from which everything else flows. Here’s a short snippet:

promises.cf snippet
body common control
{
  bundlesequence  => {
    "update",
    "garbage_collection",
    "cfengine",
    "puppet_server",
  };

  inputs  => {
    "update.cf",
    "cfengine_stdlib.cf",
    "cfengine.cf",
    "garbage_collection.cf",
    "puppet.cf",
  };
}

The bundlesequence section tells cf-agent which promise bundles to execute, and in what order. The one we’re examining today is named puppet_server, found in puppet.cf.

[Code listing unavailable: cookbooks/cfengine/templates/default/inputs/puppet.cf.erb]

A promise bundle is CFEngine’s basic unit of intent. It’s a place to logically group related promises. Within a bundle, CFEngine processes things with normal ordering. That is, variables are converged first, then classes, then files, then packages, and so on. I wrote the bundle sections in normal order to make it easier to read, but they could be rearranged and still have the same effect. Without going into too much detail about the language, I’ll give a couple hints to help with groking the example.

First, in CFEngine, the word ‘class’ does not mean what it normally does in other programming languages. Instead, classes are boolean flags that describe context. Classes can be ‘hard classes’, which are discovered attributes about the environment (hostname, operating system, time, etc), or ‘soft classes’, which are defined by the programmer. In the above example, puppetmaster_enabled and iptables_enabled are soft classes set based on the return status of a command. In the place of if or case statements, boolean checks on classes are used.

Second, there are no control statements like for or while. Instead, when lists are encountered they are automatically iterated. Check out the packages section for examples of both class decisions and list iteration. Given those two things, you should be able to work your way through the example. However, there’s really no getting around reading the reference manual if you want to learn CFEngine.

On to Puppet

Finally, let’s go ahead and use Chef to bring up a CFEngine client, which will be turned into a Puppet server.

knife bootstrap centos6-2 -r 'role[puppet]' -N "puppet-1.example.com" -E development -d affs-omnibus-pre -x root

The first run will fail, since the host’s IP isn’t yet in the cfengine server’s allowed hosts list. Complete the convergence by running these commands:

knife ssh "role:cfengine" -a ipaddress chef-client
knife ssh "role:puppet" -a ipaddress chef-client
knife ssh "role:puppet" -a ipaddress chef-client

And voilà! A working Puppet server, serving policy.

Configuration Management Strategies


I just watched the “To Package or Not to Package” video from DevOps Days Mountain View. The discussion was great, and there were some moments of hilarity. If you haven’t watched it yet, check it out here.

Stephen Nelson Smith, I salute you, sir.

I’m quite firmly in the “Let your CM tool handle your config files” camp. To explain why, I think it’s worth briefly examining the evolution of configuration management strategies.

In order to keep this post as vague and heady as possible, no distinction between “system” and “application” configurations shall be made.

What is a configuration file?

Configuration files are text files that control the behavior of programs on a machine. That’s it. They are usually read once, when a program is started from a prompt or init script. A process restart or HUP is typically required for changes to take effect.

What is configuration management, really?

When thinking about configuration management, especially across multiple machines, it is easy to equate the task to file management. Configs do live in files, after all. Packages are remarkably good at file management, so it’s natural to want to use them.

However, the task goes well beyond that.

An important attribute of an effective management strategy, config or otherwise, is that it reduces the amount of complexity (aka work) that humans need to deal with. But what is the work that we’re trying to avoid?

Dependency Analysis and Runtime Configuration

Two tasks that systems administrators concern themselves with are dependency analysis and runtime configuration.

Within the context of a single machine, dependency analysis usually concerns software installation. Binaries depend on libraries and scripts depend on binaries. When building things from source, headers and compilers are needed. Keeping the details of all this straight is no small task. Packages capture these relationships in their metadata, the construction of which is painstaking and manual. Modern Linux distributions can be described as collections of packages and the metadata that binds them. Go out and hug a package maintainer today.

Within the context of infrastructure architecture, dependency analysis involves stringing together layers of services and making individual software components act in concert. A typical web application might depend on database, caching, and email relay services being available on a network. A VPN or WiFi service might rely on PKI, Radius, LDAP and Kerberos services.

Runtime configuration is the process of taking all the details gathered from dependency analysis and encoding them into the system. Appropriate software needs to be installed, configuration files need to be populated, and kernels need to be tuned. Processes need to be started, and of course, it should all still work after a reboot.

Manual Configuration

Once upon a time, all systems were configured manually. This strategy is the easiest to understand, but the hardest one to execute. It typically happens in development and small production environments where configuration details are small enough to fit into a wiki or spreadsheet. As a network’s size and scope increases, management efforts become massive, time consuming, and prone to human error. Details end up in the heads of a few key people and reproducibility is abysmal. This is obviously unsustainable.

Scripting

The natural progression away from manual configuration was custom scripting. Scripting reduced management complexity by automating things using languages like Bash and Perl. Tutorials and documentation instructions like “add the following line to your /etc/sshd_config” were turned into automated scripts that grepped, sed’ed, appended, and clobbered. These scripts were typically very brittle and would only produce the desired outcome on their first run.

File Distribution

File distribution was the next logical tactic. In this scheme, master copies of important configuration files are kept in a centralized location and distributed to machines. Distribution is handled in various ways. RDIST, NFS mounts, scp-on-a-for-loop, and rsync pulls are all popular methods.

This is nice for a lot of reasons. Centralization enables version control and reduces the time it takes to make changes across large groups of hosts. Like scripting, file distribution lowers the chance of human error by automating repetitive tasks.

However, these methods have their drawbacks. NFS mounts introduce single points of failure and brittleness. Push based methods miss hosts that happen to be down for maintenance. Pulling via rsync on a cron is better, but lacks the ability to notify services when files change.

Managing configs with packages falls into this category, and is attractive for a number of reasons. Packages can be written to take actions in their post-install sections, creating a way to restart services. It’s also pretty handy to be able to query package managers to see installed versions. However, you still need a way to manage config content, as well as initiate their installation in the first place.

Declarative Syntax

In this scheme, autonomous agents run on hosts under management. The word autonomous is important, because it stresses that the machines manage themselves by interpreting policy remotely set by administrators. The policy could state any number of things about installed software and configuration files.

Policy written as code is run through an agent, letting the manipulation of packages, configuration files, and services all be handled by the same process. Brittle scripts behaving badly are eliminated by exploiting the idempotent nature of a declarative interface.

When first encountered, this is often perceived as overly complex and confusing by some administrators. I believe this is because they have equated the task of configuration management to file management for such a long time. After the initial learning curve and picking up some tools, management is dramatically simplified by allowing administrators to spend time focusing on policy definition rather than implementation.

Configuration File Content Management

This is where things get interesting. We have programs under our command running on every node in an infrastructure, so what should we make them do concerning configuration files?

“Copy this file from its distribution point” is very common, since it allows for versioning of configuration files. Packaging configs also accomplishes this, and lets you make declarations about dependency. But how are the contents of the files determined?

It’s actually possible to do this by hand. Information can be gathered from wikis, spreadsheets, grey matter, and sticky notes. Configuration files can then be assembled by engineers, distributed, and manually modified as an infrastructure changes.

File generation is a much better idea. Information about the nodes in an infrastructure can be encoded into a database, then fed into templates by small utility programs that handle various aspects of dependency analysis. When a change is made, such as adding or removing a node from a cluster, configurations concerning themselves with that cluster can be updated with ease.

Local Configuration Generation

The logic that generates configuration files has to be executed somewhere. This is often done on the machine responsible for hosting the file distribution. A better place is directly on the nodes that need the configurations. This eliminates the need for distribution entirely.

Modifications to the node database now end up in all the correct places during the next agent run. Packaging the configs is completely unnecessary, since they don’t need to be moved from anywhere. Management complexity is reduced by eliminating the task entirely. Instead of worrying about file versioning, all that needs to be ensured is code correctness and the accuracy of the database.
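
A minimal sketch of local generation, using ERB and a hypothetical node database:

require "erb"

# "The truth": node data that would normally come from the server.
nodes = [
  { "fqdn" => "app1.example.com", "ipaddress" => "10.0.0.11" },
  { "fqdn" => "app2.example.com", "ipaddress" => "10.0.0.12" },
]

conf = ERB.new(<<~TPL, trim_mode: "-")
  # generated locally; edit the node database, not this file
  <%- nodes.each do |n| -%>
  server <%= n["ipaddress"] %>  # <%= n["fqdn"] %>
  <%- end -%>
TPL

File.write("/tmp/backends.conf", conf.result(binding))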

Don’t edit config files. Instead, edit the truth.

-s

Full Stack Chef Installers for EL5 and EL6


I started at Opscode this week, just in time for the Velocity Conference in Santa Clara. The amount of brain power in the building right now is absolutely astounding, and I’m having a total blast. Manning the Opscode booth in the exhibition hall, a number of attendees have stopped by inquiring about when we’re going to improve CentOS and Red Hat support. So far, this has largely been a community effort, and the experience tends to lag behind that of Ubuntu users.

One thing coming down the pipe is full stack installers for various distributions. A full stack installer provides everything you need to run Chef, above libc. Telling people about this has generated a lot of excitement and interest, so I went ahead and built them for you, live from the floor of Velocity.

Below are instructions for manual installation on EL5 and EL6, clones and derivatives. I’ll leave the creation of a custom knife bootstrap script as an exercise for the reader.

Enjoy!

-s

For EL5 users:

curl http://rpm.aegisco.com/aegisco/el5/aegisco.repo > /etc/yum.repos.d/aegisco.repo

yum clean all ; yum install gecode

wget http://yum.afistfulofservers.net/affs/fatty/el5/chef-full-0.10.0-1-centos-5.4-x86_64.sh

chmod +x ./chef-full-0.10.0-1-centos-5.4-x86_64.sh

sudo ./chef-full-0.10.0-1-centos-5.4-x86_64.sh

sudo /opt/opscode/setup.sh

For EL6 users:

rpm -Uvh http://mirror.pnl.gov/epel/6/x86_64/epel-release-6-5.noarch.rpm

rpm -Uvh http://yum.afistfulofservers.net/affs/centos/6/noarch/affs-release-el6-6-1.noarch.rpm

yum clean all ; yum -y install gecode

wget http://yum.afistfulofservers.net/affs/fatty/el6/chef-full-0.10.0-1-scientific-6.0-x86_64.sh

chmod +x ./chef-full-0.10.0-1-scientific-6.0-x86_64.sh

sudo ./chef-full-0.10.0-1-scientific-6.0-x86_64.sh

sudo /opt/opscode/setup.sh

-s

Searching Chef Server


Overview

Search is Chef’s killer feature for sure. Searching for the IPs or FQDNs of nodes with particular roles or attributes lets you dynamically string together machines within your infrastructure. This eliminates the need for centralized planning of IP addresses among Chef managed resources. This is especially useful on the clouds or in DHCP environments where you are assigned random IPs.

Munin

Munin is one of the first cookbooks that I read after finding out about Chef, and is pretty much responsible for selling me on it. Below are the recipes from a simplified version of the munin cookbook.

Munin is a system metrics collection tool that gives you a ton of information out of the box with very little configuration. It’s really great for smaller installations and a great way to get some metrics now if you’re in a hurry. The Opscode apache2 cookbook is included without modification to provide a web console for viewing graphs.

You can view the complete cookbook here.

Searching

The cookbook is broken into two recipes, server.rb and client.rb.

The server searches for clients to poll, and the client searches for servers to accept poll connections from. We start out by setting a node attribute in each recipe so the other half has something to search for.

The search syntax comes from Solr, so a node attribute set with node[:foo][:bar][:baz] = "buzz" is searched for with: search(:node, "foo_bar_baz:buzz")

Searches return arrays of node objects (JSON blobs), which are then passed into templates where IP information is dug out and rendered into a config file.
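
The munin.conf.erb template isn’t reproduced in this post, but the idea is roughly this (a sketch, not the cookbook’s actual template):

# generated by Chef
dbdir   /var/lib/munin
htmldir /var/www/html/munin

<% @munin_clients.each do |client| -%>
[<%= client['fqdn'] %>]
    address <%= client['ipaddress'] %>
    use_node_name yes
<% end -%>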

munin::server

munin/recipes/server.rb
node.set[:munin][:server] = true
munin_clients = search(:node, "munin_client:true")

include_recipe "apache2"
include_recipe "apache2::mod_rewrite"
include_recipe "munin::client"

package "munin"

cookbook_file "/etc/cron.d/munin" do
  source "munin-cron"
  mode "0644"
  owner "root"
  group "root"
end

template "/etc/munin/munin.conf" do
  source "munin.conf.erb"
  mode 0644
  variables(:munin_clients => munin_clients)
end

apache_site "000-default" do
  enable false
end

case node[:platform]
  when "fedora", "redhat", "centos", "scientific"
    file "/var/www/html/munin/.htaccess" do
      action [:delete]
    end
end

template "#{node[:apache][:dir]}/sites-available/munin.conf" do
  source "localsystem.apache2.conf.erb"
  mode 0644
  if ::File.symlink?("#{node[:apache][:dir]}/sites-enabled/munin.conf")
    notifies :reload, resources(:service => "apache2")
  end
end

apache_site "munin.conf"

munin::client

munin/recipes/client.rb
node.set[:munin][:client] = true
munin_servers = search(:node, "munin_server:true")

unless munin_servers.empty?
  package "munin-node" do
    action :install
  end

  template "/etc/munin/munin-node.conf" do
    source "munin-node.conf.erb"
    mode 0644
    variables :munin_servers => munin_servers
    notifies :restart, "service[munin-node]"
  end

  service "munin-node" do
    supports :restart => true
    action [ :enable, :start ]
  end
end

Roles vs Attributes

A number of people have asked me why I used attributes rather than roles. The answer is to avoid baking convention into the recipe code whenever I can help it.

Consider the following scenario:

You have a role[monitoring] that includes recipe[nagios::server] and recipe[munin::server]. In the Nagios and Munin client cookbooks, you’ve searched for role[monitoring] and are happily populating your configuration files. A few months pass, and you’ve added more machines to your infrastructure.

One day your monitoring server starts crawling, since it has slow disks and can’t keep up with the IO intensive graph generation. You decide that “monitoring” is an overloaded term, and set off to split your metrics and alerting onto different machines. You edit your role structure, change your node objects’ run list assignments, and bring up some new machines. However, you still have more work to do. Now you have to go into the recipe code and change the searches to use the new roles.

Using attributes as above frees you from having to modify the recipe code when editing role definitions. Don’t get me wrong, there are plenty of scenarios where roles are preferable to attributes, but for things like this, I like to avoid them.

-s

Update: A reader has pointed out that instead of using attributes, I could have used the search syntax search(:node, 'recipes:"munin::server"'). Good to know!

A Brief Chef Tutorial (From Concentrate)


Overview

Chef is a configuration management platform written in Ruby. Configuration management is a large topic that most systems administrators and IT management are just now starting to gain experience with. Historically, infrastructures have been maintained either by hand, with structured scripting, by image cloning, or a combination of those. Chef’s usage model rejects the idea of cloning and maintaining “golden images”. Instead, the idea is to start with an embryonic image and grow it into its desired state. This works much better as infrastructure complexity increases, and eliminates the problem of image sprawl. The convergent nature of the tool allows you to change the infrastructure over time without much fuss. Chef allows you to express your infrastructure as code, which lets you store it in version control.

“A Can of Condensed Chef Documentation” is available here

Prerequisites

Git

Actually you can use any SCM, but git is the most widely adopted in the Chef community. All Chef Git repos begin their lives as clones of the Opscode chef-repo, found here: https://github.com/opscode/chef-repo There is a nice overview of the repo structure (cookbooks, databags, roles, etc) in the README.

chef-server up and running at a known IP or FQDN.

This is easily installed from packages by following the instructions on the Opscode wiki. The process amounts to “add a package repository, install the packages, and turn it on.” Alternatively, you could use the Opscode Platform and go dancing with space robots.

Knife installed on your local system

gem install chef net-ssh net-ssh-multi fog highline

Chef git repo checked out on local file system

git clone https://github.com/opscode/chef-repo

Client certificate creation

A “client” in Chef parlance is an SSL certificate used to access the chef-server API. If the client’s CN is marked “admin” in chef-server, the client can perform restricted operations such as creating and deleting nodes. This is the kind of client needed by knife to manipulate the infrastructure; it normally corresponds to an actual human being, but by no means has to. Nodes have non-admin client certificates, and can only manipulate their own node objects. To create a client certificate, you’ll need to log into the chef-server webui, click on “clients”, think of a name for it (I use someara), and paste the displayed private key into a local file.

Copy the validation key
The validation key is a special key that is shared by all freshly bootstrapped nodes. It has the ability to create new client certificates and nodes objects through the API.

scp root@chefserver:/etc/chef/validation.pem .chef/

Edit configuration files

For more details on this section, please visit http://wiki.opscode.com/display/chef/Chef+Configuration+Settings

.chef/client.rb - This file is copied onto the nodes that are bootstrapped with knife, and needs to be configured to point to the IP or FQDN of your chef server.

example

$ vim client.rb
log_level          :info
log_location       STDOUT
ssl_verify_mode    :verify_none
chef_server_url    "http://y.t.b.d:4000"
file_cache_path    "/var/cache/chef"
pid_file           "/var/run/chef/client.pid"
cache_options({ :path => "/var/cache/chef/checksums", :skip_expires => true})
signing_ca_user "chef"
Mixlib::Log::Formatter.show_time = true
validation_client_name "chef-validator"
validation_key         "/etc/chef/validation.pem"
client_key             "/etc/chef/client.pem"

.chef/knife.rb - This file also needs to be configured to point to your chef-server, and also to the client private key that was created earlier.

example

$ vim knife.rb
log_level            :info
log_location         STDOUT
node_name           'knife'
cache_type          'BasicFile'
cache_options( :path => "~/.chef/checksums" )
client_key       '~/.chef/knife.key.pem'

cookbook_path       [ "~/mychefrepo/cookbooks" ]
cookbook_copyright "example org"
cookbook_email     "cookbooks@example.net"
cookbook_license   "apachev2"

chef_server_url    "http://y.t.b.d:4000"

validation_key      "~/.chef/validation.pem"

# rackspacecloud
knife[:rackspace_api_key] = '00000000000000000000000000000000'
knife[:rackspace_username] = 'rackspace'

# slicehost
knife[:slicehost_password] = '0000000000000000000000000000000000000000000000000000000000000000'

# AFFS aws
knife[:aws_access_key_id]     = '00000000000000000000'
knife[:aws_secret_access_key] = '0000000000000000000000000000000000000000'

#knife[:region]  = "us-east-1"
#knife[:availability_zone] = "us-west-1b"
#knife[:ssh_user] = "root"
#knife[:flavor] = "t1.micro"
#knife[:image] = "ami-10a55279"
#knife[:use_sudo]  = "false"
#knife[:distro] = "affs-fc13"

Roles, recipes, and run lists

As mentioned earlier, run lists are made up from role trees. Here is an example of how you would create a demo server with a correct clock, managed users, and metrics and monitoring capabilities. In this example, six recipes are executed per run, and an unknown number of resources are managed. (To figure that out, you’d have to read the recipes.)

role[demo]
  role[base]                   <---- nested role
  recipe[foo::server]
  recipe[foo::muninplugin]
       
role[base]
  recipe[ntp]
  recipe[localusers::common]
  recipe[munin::client]
  recipe[nagios::client]

expanded run list
  recipe[ntp]
    recipe[localusers::common]
    recipe[munin::client]
    recipe[nagios::client]
    recipe[foo::server]
    recipe[foo::muninplugin]

That’s quite a bit of cooking for a beginner tutorial, so we’re just going to focus on a single node running an NTP client for now. Roles can be written either as .rb files or .json files. I prefer to use the .rb format because they’re easier to read and write. Some people prefer to deal with the JSON formatted version directly, since that’s the way they’re dumped with knife. At the end of the day, it doesn’t really matter, so do whichever makes you happy.

Step One : Creating a demo role file

$ vim roles/demo.rb
name "demo"
description "demo role"
run_list [
    "recipe[ntp]"
    ]

Step Two : Installing the role on chef-server

$ knife role from file roles/demo.rb

Writing Recipes

Hello, NTP!

A machine’s NTP client is simple to install and configure. Every systems administrator is already familiar with it, which makes it a great example.

Most software available as a native package in a given Linux distribution can be managed with a “package, template, service” design pattern.

Each of those words refers to a Chef resource, which we pass arguments to.

Step One : Creating an ntp cookbook

$ knife cookbook create ntp

This creates a directory structure for the ntp cookbook. You can check it out with ls:

$ ls -la cookbooks/ntp/
total 24
drwxr-xr-x  13 someara  staff   442 Mar 14 17:56 .
drwxr-xr-x  36 someara  staff  1224 Mar 15 19:39 ..
-rw-r--r--   1 someara  staff    58 Mar 14 17:56 README.rdoc
drwxr-xr-x   2 someara  staff    68 Mar 14 17:56 attributes
drwxr-xr-x   2 someara  staff    68 Mar 14 17:56 definitions
drwxr-xr-x   3 someara  staff   102 Mar 14 17:56 files
drwxr-xr-x   2 someara  staff    68 Mar 14 17:56 libraries
-rw-r--r--   1 someara  staff   259 Mar 14 17:56 metadata.rb
drwxr-xr-x   2 someara  staff    68 Mar 14 17:56 providers
drwxr-xr-x   4 someara  staff   136 Mar 14 17:56 recipes
drwxr-xr-x   2 someara  staff    68 Mar 14 17:56 resources
drwxr-xr-x   3 someara  staff   102 Mar 14 17:56 templates

Step Two : Deciding what to name the recipe

Recipe names are related to cookbook structure. Putting recipe[foo::bar] in a node’s run list results in cookbooks/foo/recipes/bar.rb being downloaded from chef-server and executed.

There is a special recipe in every cookbook called default.rb. Specifying recipe[foo] with no recipe name is shorthand for recipe[foo::default], which results in cookbooks/foo/recipes/default.rb being executed.

Default.rb is a good place to put common stuff when writing cookbooks with multiple recipes, but we’re going to keep it simple and just use default.rb for everything.

Step Three : Creating a recipe

This is where all the fun stuff happens. When using resources, you’re writing things in a declarative fashion. Declarative means you can concentrate on the WHAT without having to worry about the HOW. Chef will take care of that for you with something called a resource provider. When installing a package, it will check to see what your operating system is and use the appropriate methodology. For example, on Debian-based systems, it will use apt-get, and on Red Hat-based systems, it will use yum.

$ vim cookbooks/ntp/recipes/default.rb
package "ntp" do
  action [:install]
end

template "/etc/ntp.conf" do
  source "ntp.conf.erb"
  variables( :ntp_server => "time.nist.gov" )
end

service "ntpd" do
  action [:enable, :start]
end

Chef recipes are evaluated top down (like a normal ruby program), with each resource being run in the order it appears. Order is important. In the above example, if we were to reverse the order of those three resources, it would first fail to start the service (as the software is not installed yet), then write the configuration file, then finally clobber the file it just wrote by installing the package.

Step Four : Creating the ntp.conf.erb template

$ vim cookbooks/ntp/templates/default/ntp.conf.erb
# generated by Chef.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1
server <%= @ntp_server %>
server  127.127.1.0     # local clock
driftfile /var/lib/ntp/drift
keys /etc/ntp/keys

Step Five : uploading the cookbook to chef-server

$ knife cookbook upload ntp

Bootstraping nodes

The chef-client needs to somehow get itself installed and running on managed nodes. This process is known as bootstrapping and is accomplished with shell scripting. The method of bootstrap will vary depending on how you go about provisioning your server, and the script will depend on the platform.

Clouds

Cloud providers like AWS and Rackspace will let you make an API request, then return the IP of your compute resource.

$ knife ec2 server create "role[demo]" -N "demo.example.net" -i ami-3e02f257

In this example, knife uses the ruby fog library to talk to EC2 and request a server with the desired AMI as an argument. Knife then uses net-ssh-multi to SSH into the machine and execute a bootstrapping script. There are a number of other arguments that can be passed to knife, such as the EC2 region, machine size, and which SSH key to use. You can read all about them on the Opscode wiki.

Meatclouds

If your method of provisioning servers is “ask your VMware administrator” or “fill out these forms”, then you’ll probably bootstrap via an IP address.

knife bootstrap 10.0.0.5 -x root -N demo.example.net -r 'role[demo]' -d pp-centos5

Cobbler / FAI / pxe_dust / Jumpstart / etc

In these provisioning scenarios, you can skip knife completely and put the contents of a bootstrap script into your kickstart file or equivalent.

Customizing the bootstrap

By default (with no arguments), Chef attempts a gem based installation meant to work on Ubuntu. If you’re not using Ubuntu, or are uncomfortable installing gems directly from rubygems.org, you’ll have to change the script to suit your taste. It works by specifying a template name with the -d flag; knife then SSHes into the machine and runs the rendered script. When using knife to SSH, make sure you have the correct key loaded into your ssh-agent.

Example

knife bootstrap 10.0.0.5 -x root -N demo.example.net -r 'role[demo]' -d my-centos5

ends up running this

ssh root@10.0.0.5 bash -c '<contents of rendered .chef/bootstrap/my-centos5.erb template>'

What I do in my boot scripts:

  • Correctly set the hostname to the value of the -N argument. (By correctly, I mean that hostname -f has to work properly)
  • Configure the package repositories
  • Install Chef. I prefer packages from the native package manager
  • Copy the validation key
  • Write /etc/chef/client.rb (points to server)
  • Write a json file with the contents of the -r argument
  • chef-client -j bootstrap.json

After the script is run, chef-client does the following:

  • Ohai!
  • Client registration: SSL CN is FQDN from ohai
  • Node creation: Node name is also FQDN from ohai, run lists are from JSON
  • Expands run list
  • Downloads needed cookbooks
  • Starts executing recipes

There is an example of a custom bootstrap script here

At this point, you should have an ntp client installed, configured, and running.

(It’s actually a little bit more complicated than that. For more information about chef-client runs, see http://wiki.opscode.com/display/chef/Anatomy+of+a+Chef+Run )

Databag Driven Recipes

Data driven infrastructures are all the rage these days. This approach lets you do things like change the NTP server all of your nodes use by editing a single JSON value in chef-server. You can get really creative with this, so let your imagination run wild.

Step One : Create an ntp data bag

$ knife data bag create ntp
$ mkdir -p data_bags/ntp
$ vim data_bags/ntp/default_server.json
{
  "id" : "default_server",
  "value" : "us.pool.ntp.org"
}

Step Two : Upload data bag to chef-server

$ knife data bag from file ntp data_bags/ntp/default_server.json

Step Three : Modify the recipe to take advantage of it

ntp/recipes/default.rb
package "ntp" do
  action [:install]
end

ntp_server = data_bag_item('ntp', 'default_server')

template "/etc/ntp.conf" do
  source "ntp.conf.erb"
  variables( :ntp_server => ntp_server['value'] )
end

service "ntpd" do
  action[:enable,:start]
end

You can also access data bag data through the search() interface, which you can read about on the Opscode wiki.
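For example, a recipe could fetch the same item via search. A minimal sketch, assuming the data bag created above:

# search the ntp data bag for the item whose id is "default_server"
results = search(:ntp, "id:default_server")
ntp_server = results.first['value']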

Step Four : Uploading the cookbook to chef-server

$ knife cookbook upload ntp

Understanding Idempotence and Convergence

We’re not quite done yet. Let’s SSH into our shiny new NTP enabled machine and go poking about.

$ grep server /etc/ntp.conf | head -n 1
server time.nist.gov

Wait a sec, isn’t that supposed to be “us.pool.ntp.org”? Not yet. We haven’t enabled our convergence mechanism yet! If we manually run chef-client on the node, we will indeed see that the file has changed.

Convergence

# chef-client
$ grep server /etc/ntp.conf | head -n 1
server us.pool.ntp.org

That file just converged into the correct state. Let’s edit the file again, this time filling it with complete garbage.

# dd if=/dev/urandom of=/etc/ntp.conf bs=128 count=1
# chef-client
$ grep server /etc/ntp.conf | head -n 1
server us.pool.ntp.org

Again, the file converged into the correct state. Awesome. Running chef-client by hand on a large cluster of nodes would be a real pain, so it makes sense to set it up automatically. Indeed, often found in a “role[base]” is a “recipe[chef-client]” that configures it to run as a daemon, or from cron.
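If you’d rather wire up cron yourself, Chef’s own cron resource will do it in a few lines. A minimal sketch (the 30 minute interval is just illustrative; adjust the path to chef-client for your installation):

# run chef-client every 30 minutes
cron "chef-client" do
  minute "*/30"
  command "/usr/bin/chef-client"
end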

Idempotence

It is safe to run the recipes on the nodes time and time again because resources are written to be idempotent. You may remember from math class that a function f is idempotent if, for all values of x, f(f(x)) = f(x). That means you can run a function over a resource a bajillion times and it will behave as if it had only been run once.

This is implemented under the hood as “If it ain’t broke, don’t fix it.” In a file resource, checksums are calculated and compared. In a package resource, the rpm or dpkg databases are consulted to see if the package is installed. The effect of this is that most chef-client runs do absolutely nothing to resources. That is, until you change the function by altering the inputs to the resource providers.
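Resources that shell out, like execute, don’t get these checks for free; you supply the test yourself with a not_if or only_if guard. A minimal sketch (the script and marker file are hypothetical):

# make an execute resource idempotent with a guard
execute "import-seed-data" do
  command "/usr/local/bin/import_seed_data.sh"
  not_if { ::File.exists?("/var/lib/myapp/.seeded") }
end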

Notifications and Subscriptions

Further examination reveals that the ntpd service is still talking to “time.nist.gov”. This is because during the chef-client run, the resource named “ntpd” ran its idempotency check and found that it was, in fact, running. It therefore did nothing. If we want ntpd to restart when the contents of /etc/ntp.conf are altered, we have to modify our recipe to set up that relationship.

ntp/recipes/default.rb
package "ntp" do
  action [:install]
end

ntp_server = data_bag_item('ntp', 'default_server')

template "/etc/ntp.conf" do
  source "ntp.conf.erb"
  variables( :ntp_server => ntp_server['value'] )
  notifies :restart, "service[ntpd]"
end

service "ntpd" do
  action[:enable,:start]
end

Alternatively, we could have set up the “service[ntpd]” resource to subscribe to the “template[/etc/ntp.conf]” resource.
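That version might look something like this, with the relationship declared on the listening side instead:

service "ntpd" do
  action [:enable, :start]
  subscribes :restart, "template[/etc/ntp.conf]"
end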

Upload the modified ntp cookbook to chef-server and re-run the client on your demo server to check your work.

# chef-client
$ lsof -i | grep ntp | grep pool
ntpd       5673    ntp   19u  IPv4 12481380      0t0  UDP us.pool.ntp.org:ntp

Winning.

Bulk Loading data into chef-server

To save yourself from writing crazy for loops on the command line like

for i in `ls cookbooks` ; do knife cookbook upload $i ; done

Or even worse,

for i in `ls data_bags` ; do
  for j in `ls data_bags/$i/`; do
    knife data bag create $i
    knife data bag from file $i data_bags/$i/$j ;
  done ;
done

… somebody was nice enough to write some rake tasks. List them with rake -T, and then install your repo into chef-server with “rake install”.

Viewing your Infrastructure

There are two ways to view your infrastructure. The first is through the management console, and the other is from knife. Here is a list of handy commands to get you started.

knife node list
knife node show foo.example.net
knife data bag list
knife data bag show whatever

Deleting Clients, Nodes, and Machines

Remember that nodes, their client certificates, and the machines they’re associated with are three separate entities.

Nodes

$ knife node delete foo.example.net

This just deletes the node object from chef-server. The next time the machine runs chef-client, the node object will be recreated in chef-server. This node object will have an empty run list that will have to be repopulated before chef-client actually does anything.

Clients

$ knife client delete foo.example.net

This deletes a node object’s associated public key from chef-server. The next time the machine runs chef-client, it will get a permission denied error. If this is done by accident, SSH into the machine, delete its client key at /etc/chef/client.pem, and re-run chef-client.

Machines

Deleting a machine will be specific to how it was provisioned. On AWS, it would look like “knife ec2 server delete i-DEAFBEEF”. On a VMware cluster, it could be by clicking buttons in a GUI. I once deleted a hardware machine by throwing it off a balcony. YMMV.

nodes.sh

I like to keep a special directory called “infrastructures” that contains sub-directories and nodes.sh files. A nodes.sh contains a list of knife commands that can be thought of as the highest level view of the infrastructure. For example:

knife bootstrap 10.0.0.10 -r 'role[database]'  -N database-01.example.net -x root -d my-fedora13
knife bootstrap 10.0.0.11 -r 'role[database]' -N database-02.example.net -x root -d my-fedora13
knife bootstrap 10.0.0.14 -r 'role[redis]' -N redis01.example.net -x root -d my-fedora13
knife bootstrap 10.0.0.15 -r 'role[redis]' -N redis02.example.net -x root -d my-fedora13
knife bootstrap 10.0.0.14 -r 'role[files]' -N files01.example.net -x root -d my-fedora13
knife bootstrap 10.0.0.15 -r 'role[files]' -N files02.example.net -x root -d my-fedora13
knife bootstrap 10.0.0.16 -r 'role[appdemo]' -N appdemo01.example.net -x root -d my-fedora13
knife bootstrap 10.0.0.17 -r 'role[appdemo]' -N appdemo02.example.net -x root -d my-fedora13

This file can eventually be used to bring up entire infrastructures, but during development, lines are typically pasted into a terminal individually.

This is as close as I’ve gotten to replacing myself with a very small shell script so far. Many a sysadmin has been pursuing this for a long time now. It is here. The journey has just begun.

-s

A Can of Condensed Chef Documentation

| Comments

Overview

Chef’s documentation is vast and broken up into many pages on the Opscode wiki. The goal here is to index this information and give a brief explanation of each topic without going into too much depth.

Architecture

http://wiki.opscode.com/display/chef/Architecture

Chef is a configuration management platform in the same class of tools as CFEngine, Bcfg2, and Puppet. The idea is to define policy at a centralized, version controlled place, and then have the machines under management pull down their policy and converge onto that state at regular intervals. This gives you a single point of administration, allowing for easier change management and disaster recovery. Combined with a compute resource provisioning layer (such as knife’s ability to manipulate EC2 or Rackspace servers), entire complex infrastructures can pop into existence in minutes.

Chef Server

http://wiki.opscode.com/display/chef/Chef+Server

Chef server has various components under the hood. Assorted information (cookbooks, databags, client certificates, and node objects) is stored in CouchDB as JSON blobs. That data is indexed for search by chef-solr-indexer, which consumes updates from a RabbitMQ queue sitting between it and the data store. A RESTful API service exposes all of this to the network as chef-server. If you don’t want to run chef-server yourself, Opscode will do it for you with their Platform service for a meager $5/node/month. The management console is really handy during development, since it provides a nice way to examine JSON data. However, it should be noted that real chefs value knife techniques.

Clients

http://wiki.opscode.com/display/chef/Api+Clients

In Chef, the term “client” refers to an SSL certificate for an API user of chef-server. This is often a point of confusion, and should not be confused with chef-client. Most of the time, one machine has one client certificate, which corresponds to one node object.

Nodes

http://wiki.opscode.com/display/chef/Nodes

Nodes are JSON representations of machines under Chef management. They live in chef-server. They contain two important things: the node’s run list, and a collection of attributes. The run list is a collection of recipe names that will be run on the machine when chef-client is invoked. Attributes are various facts about the node, which can be manipulated either by hand or from recipes.
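Setting an attribute from a recipe is a one-liner. A minimal sketch (the myapp attribute is hypothetical):

# record a fact about this node at normal precedence
node.set[:myapp][:deployed_at] = Time.now.to_s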

Attributes

http://wiki.opscode.com/display/chef/Attributes

Attributes are arbitrary values set in a node object. Ohai provides a lot of informational attributes about the node, and arbitrary attributes can be set by the line cooks. They can be set from recipes or roles, and have a precedence system that allows you to override them. Examples of arbitrary attributes are listen ports for network services, or the name of a package on a particular linux distribution (httpd vs apache2).
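As a sketch, the ntp cookbook could be refactored to read its server from an attribute instead of hardcoding it (this refactor is hypothetical, not part of the tutorial above):

# cookbooks/ntp/attributes/default.rb (hypothetical)
default[:ntp][:server] = "us.pool.ntp.org"

The recipe would then reference node[:ntp][:server], and a role or node could override the value without touching the cookbook.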

Ohai

http://wiki.opscode.com/display/chef/Ohai

Ohai is the Chef equivalent of CFEngine’s cf-know and Puppet’s facter. When invoked, it collects a bunch of information about the machine it’s running on, including Chef related stuff, hostname, FQDN, networking, memory, cpu, platform, and kernel data. This information is then output as JSON and used to update the node object on chef-server. It is possible to write custom Ohai plugins if you’re interested in something not dug up by default.
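A recipe can read any of this straight off the node object. A minimal sketch using the log resource:

# print a few Ohai-collected facts during the chef-client run
log "node-info" do
  message "#{node[:fqdn]} runs #{node[:platform]} #{node[:platform_version]}"
end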

Chef Client

http://wiki.opscode.com/display/chef/Chef+Client

Managed nodes run an agent called chef-client at regular intervals. This agent can be run as a daemon or invoked from cron. The agent pulls down policy from chef-server and converges the system to the described state. This lets you introduce changes to machines in your infrastructure by manipulating data in chef-server. The pull (vs push) technique ensures machines that are down for maintenance end up in the proper state when turned back on.

Resources

http://wiki.opscode.com/display/chef/Resources

Resources are the basic configuration items that are manipulated by Chef recipes. Resources make up the Chef DSL by providing a declarative interface to objects on the machine. Examples of core resources include files, directories, users and groups, links, packages, and services.

Recipes

http://wiki.opscode.com/display/chef/Recipes

Recipes contain the actual code that gets run on machines by chef-client. Recipes can be made up entirely of declarative resource statements, but rarely are. The real power of Chef stems from a recipe’s ability to search chef-server for information. Recipes can say “give me a list of all the hostnames of my web servers”, and then generate the configuration file for your load balancer. Another recipe might say “give me a list of all my database servers”, so it can configure Nagios to monitor them.
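A sketch of that load balancer idea, assuming a role named webserver and a hypothetical haproxy template:

# build a config from the current set of web servers
web_servers = search(:node, "role:webserver")

template "/etc/haproxy/haproxy.cfg" do
  source "haproxy.cfg.erb"
  variables( :servers => web_servers.map { |n| n[:fqdn] } )
end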

Cookbooks

http://wiki.opscode.com/display/chef/Cookbooks

Cookbooks allow you to logically group recipes. Cookbooks come with all the stuff the recipes need to make themselves work, such as files, templates, and custom resources (LWRPs).

Roles

http://wiki.opscode.com/display/chef/Roles

Roles allow you to assemble trees of recipe names, which are expanded into run lists. Roles can contain other roles, which serve as vertices, and recipe names, which are the leaves. The tree is walked depth first, which makes ordering intuitive when assembling run lists. It is possible to apply many of these trees to a single node, but you don’t have to. Roles can also contain lists of attributes to apply to nodes, potentially changing recipe behavior.
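In the Ruby DSL, a role might look like this (names hypothetical):

# roles/demo.rb (hypothetical)
name "demo"
description "An NTP-managed demo machine"
run_list "role[base]", "recipe[ntp]"
default_attributes(
  :ntp => { :server => "us.pool.ntp.org" }
)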

Databags

http://wiki.opscode.com/display/chef/Data+Bags

Databags are arbitrary JSON structures that can be searched for by Chef recipes. They typically contain things like database passwords and other information that needs to be shared between resources on nodes. You can think of them as read only global variables that live on chef-server. They also have a great name that can be used to make various jokes in Campfire.

Knife

http://wiki.opscode.com/display/chef/Knife

knife is the CLI interface to the chef-server API. It can manipulate databags, node objects, cookbooks, etc. It can also be used to provision cloud resources and bootstrap systems.

-s