We're all busy people. Why do any of us want to spend time learning our esoteric software management system? The goal is to minimize the amount of time we spend working on individual systems. Rebuilding software for the same platform again and again (or manually installing the software over and over) takes a lot of effort, allows more mistakes to creep in, and all around is just a bummer. We simply don't have enough people to manage all of our Unix machines individually.
The Andrew environment is designed to allow all of us to share the grunge work of making systems work, without preventing per-machine flexibility. The downside is that it has a learning curve all its own--but one which you will hopefully find well worth it.
depot is primarily responsible for managing collection versioning. depot takes over the management of a directory hierarchy (in our environment, depot manages /usr/local, /usr/contributed, and /usr/host). No changes happen inside this hierarchy without depot making them, ensuring that changes are reversible and reproducible. depot works by linking or copying various collections into the target directory and ensuring that these collections don't conflict. Individual collections can then be installed or upgraded independently, and each file belongs to one and only one collection. depot by itself understands very little about versioning or per-machine customization; we use dpp, the depot pre-processor, for per-machine customization.
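The one-file-one-collection rule is the heart of depot's conflict checking. As a toy sketch (this is not depot's actual code; the function and collection data below are made up for illustration), merging collections while enforcing that invariant looks like:

```python
# Toy sketch of depot's core invariant: every path in the managed
# hierarchy belongs to exactly one collection, so two collections
# shipping the same relative path constitute a conflict.
def merge_collections(collections):
    """collections: dict mapping collection name -> set of relative paths.
    Returns a dict mapping each path to its owning collection, or raises
    ValueError if two collections claim the same path."""
    owner = {}
    for name in sorted(collections):
        for path in sorted(collections[name]):
            if path in owner:
                raise ValueError(
                    f"conflict: {path} in both {owner[path]} and {name}")
            owner[path] = name
    return owner

# Non-conflicting collections merge cleanly...
mapping = merge_collections({
    "gnucc": {"bin/gcc", "lib/libgcc.a"},
    "gdb":   {"bin/gdb"},
})
print(mapping["bin/gdb"])   # prints "gdb"
# ...but two collections both shipping, say, bin/ls would be rejected.
```

Because ownership is unambiguous, depot can later remove or upgrade a single collection without touching files that belong to any other, which is what makes changes reversible.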
package is responsible for management of the operating system and other boot-time configuration. It is the primary method by which we customize individual machines or classes of machines. package itself is very stupid; it merely knows how to make a filesystem resemble its configuration file. package is also usually the most irritating program on our system, since it will delete files that don't match its configuration file--all of us have seen package delete something we wanted to keep. Since package by itself doesn't allow any sort of inheritance, we use yet another pre-processor, mpp, to provide these features for package. Along with our use of mpp, we use a large set of conventions to make our package environment comprehensible.
wsadmin, or /afs/andrew.cmu.edu/wsadmin, is the directory hierarchy on AFS which holds large numbers of fragments of package (and occasionally depot) configuration files. The mpp processor knits these fragments together to form a complete package configuration. This convention allows us to configure Apache on a machine with a single %define doesapache in /etc/package.proto instead of manually inserting the tens to hundreds of lines of package configuration that Apache would normally need.
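For example, under these conventions an Apache machine's /etc/package.proto could be as short as the following sketch (the package.include path here is a guess by analogy with the depot include used elsewhere in this document; check the real wsadmin tree for the actual name):

```
%define doesapache
%include /afs/andrew.cmu.edu/wsadmin/package/src/package.include
```

mpp expands the include, notices the doesapache define, and splices in the Apache fragments from wsadmin to produce the full package configuration.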
emt works with adm to perform delegated software management. emt manages a set of environments (a "beta" and a "gamma" environment for each systype) and allows collection maintainers to release software to those environments. Since emt uses fairly long and annoying commands, the Perl script carpe generates the appropriate command to run after a simple interactive dialog and automatically e-mails it to a bboard (these bboards start with org.acs.asg.request). Individual maintainers can generally affect the beta environment directly. Gatekeepers are generally responsible for releases to the gamma environment.
Most of these things are ideas on what we should be doing, not what we're necessarily doing now.
Clusters are gamma machines.
Computing service desktop machines are generally beta machines. You might
want to have /usr/local/depot/depot.pref.proto:
%define beta
%define tree local
%include /afs/andrew.cmu.edu/wsadmin/depot/src/depot.include
searchpath * ${local}
collection.installmethod copy netscape,lemacs,kerberos,maplev,com_err,gnucc,gdb
collection.installmethod copy gnome,xfree86
collection.installmethod copy openssl,mozilla
Add to the list depending on what applications you use frequently. (This
is only for performance.)
Your workstation will run depot nightly. You can cause depot to copy a specific
version of a collection with a line like
path cyrus ${dest}/cyrus/064
which will cause Cyrus version 064 to be installed on your computer.
This is useful for testing new versions before beta release or examining
how old versions worked.
You should reboot whenever new OS versions are put into beta (see bboards);
around once a month is a good choice.
(Always reboot after running package!)
The primary question for production machines is "how often should they
update?" The more frequently they update, the more times something may break--and
frequent updates mean that people are probably not paying close attention
to each update. On the other hand, less frequent updates make each update
much bigger, which means tracking down which change caused a bustage
can be much more complicated. Infrequent updates can also complicate security
fixes--ideally, a security fix would require a very small software change,
but if a machine is too far behind the times, it will require a special version
or a large update to stabilize.
If possible, production machines should reboot weekly, causing depot and package to run at each reboot. Generally,
redundant services such as SMTP servers, Unix servers, or DNS servers should
have no problems meeting this requirement, since they can reboot on a staggered
schedule and cause few or no user-visible outages. (Our users are remarkably
tolerant of daily outages: the Unix servers are unavailable for 10-30 minutes
every day with few complaints.) A single redundant server can be down for
an extended period of time, so if an environment change has broken the server
it is not a catastrophe.
Non-redundant servers need to balance the need for uptime versus the resources
we want to spend as system administrators. While we've made some changes
to package and depot to have them run faster, our
server hardware tends to reboot slowly. Non-replicated file servers (such
as Cyrus backends or AFS user servers) raise interesting questions. Lately,
we've rebooted AFS servers weekly (with little complaint) but have attempted
to minimize the downtime for Cyrus backends. Non-replicated services can
also suffer from the "unintended upgrade" effect--a seemingly unrelated change
causes downtime, and does so when no system administrator is immediately
available to fix it. Possible remedies to this include:
Most of our modern servers use Unix as a substrate but provide user access
through a well-defined protocol served by specific application software.
Since this application software is usually not run by ordinary users, it
is generally put in the /usr/host tree (to provide versioning using depot and emt) and then copied from host onto
the local disk by package at
boot time. While this provides generally good flexibility, it suffers from
the lack of versioning in the wsadmin area. wsadmin can tell if there's a
beta or gamma environment but is unaware of the exact version of the software
being run. (One possibility is package fragments in /usr/host areas?)