Some people seem to operate on the assumption that if they don’t
think about a disaster, one won’t happen. This is similar to the idea
that if you don’t write a will, you’ll never die—and just about as
realistic. No business owner or system administrator should feel
comfortable about their degree of preparedness without a clear
disaster recovery plan that has been thoroughly tested. Even then, you
should continually look for ways to improve the plan—it should only be
your starting point.
A good disaster recovery plan is one that you are constantly
examining, improving, updating, and testing. But understand your
disaster plan’s limitations: it isn’t perfect, and even the best
disaster recovery plan quickly gets out of date unless you keep
adjusting it.
Planning for disasters or emergencies is not a single step, but
an iterative, ongoing process. Systems are not mountains,
but rivers, constantly moving and changing, and your disaster recovery
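plan needs to change as your environment changes. To put together a
good disaster recovery plan, one you can bet your business on, you need
to follow five steps: identify the risks, identify the resources,
develop the responses, test the responses, and iterate.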
Disasters happen to businesses of all sizes and types. Small
businesses are no more insulated from them than large businesses
are, but generally they don’t have the same levels of resources to
respond to them and recover from them. A large, multinational
corporation with an IT staff of several hundred worldwide certainly
has more resources than a small accounting firm with an IT staff of
one. As you work through the steps to build your disaster recovery
plan, how you plan and implement it will vary depending on the
size of your company and the resources
available.
In the discussion of disaster planning that follows, many of the steps, and
the actions associated with those steps, are quite formal and
probably sound like a bit more than you can manage in your small
business. And, in many cases, you’re right: in a small business, you
can often be substantially more informal. But do
not make the mistake of skipping a step
because it sounds too formal or involved. Rather, adjust the step
and actions to fit within your smaller, but no less important,
business. No matter how small your business, if it uses and depends
on Microsoft Windows Small Business Server (SBS) 2011 Standard, you
have valuable and business-critical assets on your server, so take
the steps to protect them and your business before you have a
disaster. You’ll save money, time, and, most important, business
reputation by being able to withstand and even grow in the face of
disaster.
We’ve been through fires, earthquakes, crashed servers, and
just plain egregious error, and we’ve learned the hard way that
disaster recovery is something you can do a lot better if you’ve
planned for it ahead of time. It’s not sexy, and it’s sometimes hard
to sell to upper management, but it is worth the effort. If you’re
lucky, you’ll never need to use all of your plans for worst-case
scenarios, but if you do need them, you’ll be really,
really glad you have them.
1. Identifying the Risks
The first step in creating a disaster recovery plan is to
identify the risks to your business and the costs associated with
those risks. The risks vary from the simple deletion of a critical
file to the total destruction of your place of business and its
computers. To properly prepare for a disaster, you need to perform a
realistic assessment of the risks, the potential costs and
consequences of each disaster scenario, the likelihood of any given
disaster scenario, and the resources available to address the risks.
Risks that seemed vanishingly remote a few years ago are now part of
our everyday lives.
This isn’t a job for a single person. As with all the tasks
associated with a disaster recovery plan, all concerned parties must
participate. There are two important reasons for this: you want to
make sure that you have commitment and buy-in from the parties
concerned, and you also want to make sure you don’t miss anything
important.
No matter how carefully and thoroughly you try to identify the
risks, you’ll miss at least one. You can account for that missing
risk by including an “unknown risk” item in your list. Treat it just
like any other risk: identify the resources available to address it,
and develop countermeasures to take should it occur. The difference
with this risk, of course, is that your resources and
countermeasures are somewhat more generic, and you can’t really test
your response to the risk, because you don’t yet know what it
is.
Start by trying to list all the possible ways that your
network could fail. Solicit help from everyone with a stake in the
process. The more people involved in the brainstorming, the more
ideas you’ll get, and the more prevention and recovery procedures
you can develop and practice. Be careful at this stage in the
process to not dismiss any idea or concern as trivial, unimportant,
or unlikely.
Next, look at all the ways that some external event could
affect your system. (The current term for this is
threat modeling.) The team of
people responsible for identifying possible external problems is
probably similar to a team looking at internal failures, but with
some important differences. For example, if your business is housed
in a large commercial office building, you’ll want to involve that
building’s security and facilities groups even though they aren’t
employees of your business. Not only will they have important input
into the possible threats to the business, but they’ll also
have information on the resources and preventive measures already
in place.
The risk identification phase is really made up of two parts:
identification and assessment. They are different tasks. During the
identification portion of the phase, you need to identify every
possible risk, no matter how remote or unlikely. No risk suggested
should be regarded as silly—don’t limit the suggestions in any way.
You want to identify every possible risk that anyone can think of.
Then, when you have as complete a list as you can create, move on to
the assessment task. In the risk-assessment task, you will try to
understand and quantify just how likely a particular risk is. If
you’re located in a flood plain, for example, you’re much more
likely to think flood insurance is a good investment.
Note:
Even in a very small business, where there might be only one
person involved in disaster planning, it’s a really good idea to
get others involved somehow in at least the risk-identification
task. Different people think up different scenarios and risk
factors, and soliciting more and different viewpoints will improve
the overall result of the process.
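One simple way to carry out the assessment task is to estimate, for
each identified risk, how likely it is in a given year and what a
single occurrence would cost, and then rank the risks by the product
of the two. The following is a minimal sketch of that idea, not a
formal risk methodology. The Risk class, the rank_risks function, and
every likelihood and cost figure are hypothetical and exist only for
illustration; substitute your own estimates.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One entry in a risk register (hypothetical structure)."""
    name: str
    annual_likelihood: float  # estimated chance of occurring in a year, 0.0 to 1.0
    cost: float               # estimated cost of a single occurrence

    @property
    def expected_annual_loss(self) -> float:
        # Likelihood multiplied by cost gives a rough yearly exposure.
        return self.annual_likelihood * self.cost


def rank_risks(risks):
    """Return the risks sorted by expected annual loss, highest first."""
    return sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)


# Illustrative numbers only; they are assumptions, not measurements.
risks = [
    Risk("Critical file deleted", 0.9, 500),
    Risk("Server hardware failure", 0.2, 8000),
    Risk("Office destroyed by fire", 0.01, 250000),
    Risk("Unknown risk", 0.05, 20000),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: {risk.expected_annual_loss:,.0f} per year")
```

Note that the "unknown risk" item gets its own row, just like any
other risk. A ranking like this is only a starting point for
discussion: a rare but business-ending event can deserve attention
out of proportion to its expected annual loss.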
2. Identifying the Resources
After you’ve identified the risks to your network, you need to identify the
resources available to address those risks. These resources can be
internal or external, people or systems, hardware or
software.
When you’re identifying the resources available to deal with a
specific risk, be as complete as you can, but also be specific.
Identifying everyone in the company as a resource to solve a crashed
server might look good, but realistically only one or two people are
likely to actually be able to rebuild the server. Make sure you
identify those key people for each risk, as well as the more general
secondary resources they can call on, such as Microsoft Customer
Support Services (CSS) and local Microsoft partners. For example,
the primary resource available to recover a crashed server might
consist of your hardware vendor to recover the failed hardware and
your own IT person or primary system consultant to restore the
software and database. General secondary resources could include
Microsoft Support (http://support.microsoft.com/oas/default.aspx?gprid=3208),
Microsoft Partners in your area, and the TechNet Forum for SBS
(http://social.technet.microsoft.com/Forums/en-US/smallbusinessserver/threads).
An important step in identifying resources in your disaster
recovery plan is to specify both the first-line responsibility and
the back-end or supervisory responsibility. Make sure everyone knows
who to go to when the problem is more than they can handle or when
they need additional resources. Also, clearly define when they
should escalate. The best disaster recovery plans include clear,
unambiguous escalation policies. This takes the burden off
individuals to decide when to notify someone and whom to notify, and
it makes escalation simply part of the procedure.
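The idea of first-line responsibility, backup resources, and a named
escalation contact can be captured in a small record per risk. The
sketch below is one possible shape for such a record, assuming a
simple in-order fallback. The ResponsePlan class, the first_available
function, and all the names in the example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResponsePlan:
    """Who responds to a given risk (hypothetical structure)."""
    risk: str
    primary: str                                   # first-line responsibility
    backups: list = field(default_factory=list)    # tried in order
    escalate_to: str = ""                          # supervisory responsibility


def first_available(plan, unavailable):
    """Return the first responder not in `unavailable`.

    Falls back to the escalation contact when neither the primary
    nor any backup can be reached.
    """
    for person in [plan.primary, *plan.backups]:
        if person not in unavailable:
            return person
    return plan.escalate_to


# Example entry; the people and vendors named here are placeholders.
crashed_server = ResponsePlan(
    risk="Crashed server",
    primary="IT person",
    backups=["System consultant", "Hardware vendor"],
    escalate_to="Business owner",
)

print(first_available(crashed_server, unavailable={"IT person"}))
```

Writing the fallback order down ahead of time means nobody has to
decide, in the middle of an emergency, who to call next.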
3. Developing the Responses
An old but relevant adage comes to mind when discussing
disaster recovery scenarios: When you’re up to your elbows in
alligators, it’s difficult to remember that your original objective
was to drain the swamp. This is another way of saying that people
lose track of what’s important when they are overloaded by too many
problems that require immediate attention. To ensure that your swamp
is drained and your network gets back online, you need to take those
carefully researched risks and resources and develop a disaster
recovery plan. There are two important parts of any good disaster
recovery plan: standard operating procedures (SOPs) and standard
escalation procedures (SEPs).
Making sure these procedures are in place and clearly
understood by everyone involved, before a disaster strikes, puts you
in a far better position to recover gracefully and with a minimum of
lost productivity and data.
3.1. Standard Operating Procedures
Emergencies bring out both the best and worst in people. If
you’re prepared for the emergency, you can be one of those who
come out smelling like a rose, but if you’re not prepared and let
yourself get flustered or lose track of what you’re trying to
accomplish, you can make the whole situation worse than it needs
to be.
It’s just plain hard to stay calm and focused when you’re in
the middle of an emergency and there’s a lot of extra stress being
applied by everyone around you. Although no one is ever as
prepared for a system emergency as they’d like to be, careful
planning and preparation can give you an edge in recovering
expeditiously and with a minimal loss of data. It’s a lot easier
to deal with the situation calmly when you know you’ve prepared
for this problem and you have a well-organized, tested SOP to
follow.
Because the very nature of emergencies is that you can’t
predict exactly which one is going to strike, you need to plan and
prepare for as many possibilities as you can. The time to decide
how to recover from a disaster is before the disaster happens, not
in the middle of it when users are screaming and bosses are
standing around looking serious and concerned, if you’re lucky.
(We seem to have been blessed with those who follow the more common
adage, “When in trouble or in doubt, run in circles, scream and
shout.”)
Your risk-assessment phase involved identifying as many
possible disaster scenarios and risks as you could; the
resource-assessment phase identified the resources for those
risks. Now you need to create SOPs for recovering the system from each of the
scenarios. Having an SOP that details how to recover from a failed
server makes that recovery a lot easier.
Reduce your stress and prevent mistakes by planning for
disasters before they occur. Practice recovering from each of your
disaster scenarios. Write down each of the steps, and work through
questionable or unclear areas until you can identify exactly what
it takes to recover from the problem. This is like a fire drill,
and you should do it for the same reasons—not because a fire is
inevitable, but because fires do happen, and the statistics
demonstrate irrefutably that those who prepare for a fire and
practice what to do in a fire are far more likely to survive the
fire.
Even if you know you’re the only resource the company has to
recover from a disaster scenario, write down the basic steps to do
it. You don’t need to go into minute detail, but at the very
least, outline the key steps. This might be something you do for
real only once in your life, so don’t count on being able to
remember everything. Disasters, by their very nature, raise the
overall stress level and cause people to forget important
steps.
Your job as a system administrator is to prepare for
disasters and practice what to do in those disasters—not because
you expect the disaster, but because if you do have one, you want
to be the hero, not the goat. After all, it isn’t often that the
system administrator or IT consultant gets to be a hero, so be
ready when your time comes.
The first step in developing any SOP is to outline the overall steps
you want to accomplish. Keep it general at this point—you’re
looking for the big picture here. Again, you want everyone to be
involved in the process. What you’re really trying to do is make
sure you don’t forget any critical steps, and that’s much easier
when you get the overall plan down first. There will be plenty of
opportunity later to cover the specific details.
After you have a broad, high-level outline for a given
procedure, the people you identified as the actual resources
during the resource-assessment phase should start to fill in the
blanks of the outline. You don’t need every detail at this point,
but you should get down to at least a level below the original
outline. This will help you identify missing resources that are
important to a timely resolution of the problem. Again, don’t get
too bogged down in the details at this point. You’re not actually
writing the SOP, just trying to make sure that you’ve identified
all of its pieces.
When you feel confident that the outline is ready, get the
larger group back together again. Go over the procedure and smooth
out the rough edges, refining the outline and listening to make
sure you haven’t missed anything critical. When everyone agrees
that the outline is complete, you’re ready to add the final
details to it.
The people who are responsible for each procedure should now
work through all the details of the disaster recovery plan and
document the steps thoroughly. They should keep in mind that the
people who actually perform the recovery might not be who they
expect. It’s great to have an SOP for recovering from a failed
router, but if the only person who understands the procedure is
the IT person and she’s on vacation in Bora Bora that week, your
disaster recovery plan has a big hole in it.
When you create the documentation, write down everything.
What seems obvious to you now, while you’re devising the
procedure, will not seem at all obvious in six months or a year
when you suddenly have to follow it under stress.
It’s tempting to centralize your SOPs into a single, easily accessible database.
And you should do that, making sure everyone understands how to
use it. But you’ll also need to have alternative locations and
formats for your procedures. Not only do you not want to keep
the only copy in a single database, you also don’t want to have
only an electronic version—how accessible is the SOP for
recovering a failed server going to be when the server has
failed? Always maintain hard-copy versions as well. The one
thing you don’t want to do is create a single point of failure
in your disaster recovery plan!
Every good server room should have a large binder,
prominently visible and clearly identified, that contains all
the SOPs. Each responsible person should also have one
or more copies of at least the procedures he or she is
either a resource for or likely to become a resource for. We
like to keep copies of all our procedures in several places so
that we can get at them no matter what the source of the
emergency or where we happen to be when one of our pagers goes
off.
Even if you’re the only resource, keep multiple copies of
your procedures and key phone numbers of external resources.
Don’t rely entirely on electronic storage, because even external
electronic storage might be difficult to access if the disaster
is major. But don’t ignore electronic storage, either. Most of
the time, it’s the fastest and easiest to get to, and the most
likely to be completely up to date.
After you have created the SOPs, your job has only begun. You need to keep them
up to date and make sure that they don’t become stale. It’s no
good having an SOP to recover your ISDN connection to the Internet
when you ripped the ISDN line out three years ago and put in a DSL
line with five times the bandwidth at half the cost.
You also need to make sure that all your copies of an SOP
are updated. Electronic ones should probably be stored in a
database or in a folder on SBS that is available offline. However,
hard-copy documents are notoriously tricky to maintain. A good
method is to make yet another SOP that details who updates which
SOPs, how often that person updates them, and who gets fresh copies
whenever a change is made. Then put a version control system into
place and make sure everyone understands his or her role in the
process. Build rewards into the system for timely and consistent
updating of SOPs. If 10 or 20 percent of someone’s bonus depends
on keeping those SOPs up to date and distributed, you can be sure
they’ll be brought up to date at least as often as bonuses are
reviewed.
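The "SOP for updating SOPs" can be backed by a simple review log that
flags procedures whose review date has passed. This is a minimal
sketch under assumed conventions: the SopRecord class, the
overdue_sops function, the owners, and the review intervals are all
hypothetical, and a spreadsheet would serve just as well.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SopRecord:
    """Review metadata for one SOP (hypothetical structure)."""
    title: str
    owner: str                 # person responsible for updating it
    last_reviewed: date
    review_interval_days: int  # how often it must be reviewed


def overdue_sops(records, today):
    """Return the SOPs whose review interval has elapsed as of `today`."""
    return [
        r for r in records
        if today - r.last_reviewed > timedelta(days=r.review_interval_days)
    ]


# Placeholder entries for illustration.
records = [
    SopRecord("Recover failed server", "IT person", date(2011, 1, 10), 180),
    SopRecord("Restore Internet connection", "IT person", date(2010, 3, 1), 365),
]

for sop in overdue_sops(records, today=date(2011, 6, 1)):
    print(f"Overdue for review: {sop.title} (owner: {sop.owner})")
```

Passing the current date in as a parameter, rather than reading the
clock inside the function, keeps the check easy to test.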
3.2. Standard Escalation Procedures
No matter how carefully you’ve identified potential risks,
and how detailed your procedures to recover from them are, you’re
still likely to have situations you didn’t anticipate. An
important part of any disaster recovery plan is a standardized
escalation procedure. Not only should each individual SOP have its
own procedure-specific SEP, but you should also have an overall
escalation procedure that covers everything you haven’t thought
of—because it’s certain you haven’t thought of
everything.
An escalation procedure has two functions: resource
escalation and notification escalation. Both serve the same
goals: to make sure that everyone who needs to know about the
problem is up to date and involved as appropriate, and to keep the
overall noise level down so that the work of resolving the problem
can go forward as quickly as possible. The resource
escalation procedure details the resources that are
available to the people who are trying to recover from the current
disaster so that these people don’t have to try to guess who (or
what) the appropriate resource might be when they run into
something they can’t handle or something doesn’t go as planned.
This procedure helps them stay calm and focused. They know that if
they run into a problem, they aren’t on their own, and they know
exactly who to call when they do need help.
The notification escalation procedure
details who is to be notified of serious problems. Even more
important, it should provide specifics regarding
when notification is to be made. If a
particular print queue crashes but comes right back up, you might
want to send a general message only to the users of that
particular printer letting them know what happened. However, if
your email has been down for more than half an hour, a lot of
folks are going to be concerned. The SEP for email should detail
who needs to be notified when the server is unavailable for longer
than some specified amount of time, and it should probably detail
what happens and who gets notified when it’s still down some
significant amount of time after that.
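A notification escalation procedure like the email example above
boils down to a list of time thresholds, each with the people to
notify once the outage has lasted that long. The sketch below shows
one way to express that. The NotificationStep class, the
who_to_notify function, the 30-minute and 120-minute thresholds, and
the recipient names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationStep:
    """Notify these people once the outage reaches this duration."""
    after_minutes: int
    notify: tuple


# Hypothetical SEP for an email outage; adjust times and names to fit.
EMAIL_SEP = (
    NotificationStep(0, ("IT person",)),
    NotificationStep(30, ("All email users", "Office manager")),
    NotificationStep(120, ("Business owner", "Outside consultant")),
)


def who_to_notify(sep, minutes_down):
    """Return everyone who should have been notified by `minutes_down`."""
    recipients = []
    for step in sep:
        if minutes_down >= step.after_minutes:
            recipients.extend(step.notify)
    return recipients


print(who_to_notify(EMAIL_SEP, minutes_down=45))
```

Because the thresholds are data rather than judgment calls, the
person doing the recovery does not have to decide when or whom to
notify; they only have to look at the clock.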
This notification has two purposes: to make sure that the
necessary resources are made available as required, and to keep
everyone informed and aware of the situation. If you let people
know that you’ve had a server hardware failure and that the vendor
has been called and will be onsite within an hour, you’ll cut down
the number of phone calls dramatically, freeing you to do
whatever you need to do to ensure that you’re ready when the
vendor arrives.
4. Testing the Responses
A disaster recovery plan is nice to have, but it really isn’t
worth a whole lot until it has actually been tested. Needless to
say, the time to test the plan is at your convenience and under
controlled conditions, rather than in the midst of an actual
disaster. It’s a nuisance to discover that your detailed disaster
recovery plan has a fatal flaw in it when you’re testing it under
controlled conditions. It’s a bit more than a nuisance to discover
it when every second counts.
You won’t be able to test everything in your disaster recovery
plans. Even most large organizations don’t have the resources to
create fully realistic simulated natural disasters and test their
response to each of them under controlled conditions, and even fewer
small businesses have those kinds of resources. Nevertheless, there
are things you can do to test your response plans. The details of
how you test them depend on your environment, but they should
include as realistic a test as feasible and should, as much as
possible, cover all aspects of the response plan. Another reason
to test the disaster recovery plan is that it provides a valuable
training ground. If you’ve identified primary and backup resources,
as you should, chances are that the people you’ve identified as
backup resources are not as skilled or knowledgeable in a particular
area as the primary resource. Testing the procedures gives you a
chance to train the backup resources at the same time.
You should also consider using the testing to cross-train
people who are not necessarily in the primary response group. Not
only will they get valuable training, but you’ll also create a
knowledgeable pool of people who might not be directly needed when
the procedure has to be used for real, but who can act as key
communicators with the rest of the community.
5. Iterating
When you finish a particular disaster recovery plan, you might
think your job is done, but it’s not. Standardizing a process is
actually just the first step. You need to continually look for ways
to improve it.
You should make a regular, scheduled practice of pulling out
your disaster recovery plan with those responsible and making sure
it’s up to date. Use the occasion to actually look at it and see how
you can improve on it. Take the opportunity to examine your
environment. What’s changed since you last looked at the plan? What
equipment has been retired, and what has been added? What software
is different? Are all the people on your notification and escalation
lists still working at the company in the same roles? Are the phone
numbers, including home phone numbers, up to date?
Kaizen is a Japanese word and concept
that means “small, continuous improvement.” Its literal
translation is “change (kai) to become good (zen).”
So, why bring a Japanese word and concept into a discussion
about disaster recovery? Because a good disaster recovery plan is
one that you are constantly Kaizening. When you really understand
Kaizen, it becomes a way of life that you can use in many
ways.
The first thing to understand about Kaizen is that you are
not striving for major change or improvement. Small improvements
are the goal. Don’t try to fix or change everything all at once.
Instead, focus on one area, and try to make it just a little bit
better.
The second part of Kaizen is that it is continuous. You must
constantly look for ways to improve and implement those
improvements. Because each improvement is small and incremental,
you can easily implement it and move on to the next one.
Kaizen is very much about teamwork. Good Kaizen balances the
load on a team and finds ways to build the strengths of the team
as a whole. If you practice Kaizen and continually look for small,
incremental ways to improve your work, you will soon have a better
and more enjoyable workplace. As a manager, if you find ways to
encourage and reward those who practice Kaizen, your team and you
will grow and prosper.
Another way to iterate your disaster recovery plan is to use
every disaster as a learning experience. After the disaster or
emergency is over, get everyone together as soon as possible to talk
about what happened. Find out what they think worked and what didn’t
in the plan. What tools did you not have that would have made the
job go quicker or better? Actively solicit suggestions for how the
process could be improved. Then make the changes and
test them. You’ll not only improve your responsiveness to this
particular type of disaster, but you’ll also improve your overall
responsiveness by getting people involved in the process and
enabling them to be part of the solution.
Warning:
Do not use this
post-disaster recovery discussion to assign blame or look for the
cause of the disaster. This is about how to respond to, and
recover from, a disaster better. And to do that, you need to learn
from the experience so that you can do a better job planning for
the next one. If everyone is trying to avoid blame, they won’t
have any energy for improving the process.