Windows Small Business Server 2011 : Disaster Planning - Planning for Disaster

1/14/2013 11:39:15 AM

Some people seem to operate on the assumption that if they don’t think about a disaster, one won’t happen. This is similar to the idea that if you don’t write a will, you’ll never die—and just about as realistic. No business owner or system administrator should feel comfortable about their degree of preparedness without a clear disaster recovery plan that has been thoroughly tested. Even then, you should continually look for ways to improve the plan—it should only be your starting point.

A good disaster recovery plan is one that you are constantly examining, improving, updating, and testing. Understand its limitations, too: no plan is perfect, and even the best one quickly goes out of date unless you keep adjusting it.

Planning for disaster or emergencies is not a single step, but an iterative, ongoing process. Systems are not mountains, but rivers, constantly moving and changing, and your disaster recovery plan needs to change as your environment changes. To put together a good disaster recovery plan—one you can bet your business on—you need to follow these steps:

Identify the risks.
Identify the resources.
Develop the responses.
Test the responses.
Iterate.

REAL WORLD: Size Does Matter

Disasters happen to businesses of all sizes and types. Small businesses are no more insulated from them than large businesses are, but generally they don’t have the same levels of resources to respond to them and recover from them. A large, multinational corporation with an IT staff of several hundred worldwide certainly has more resources than a small accounting firm with an IT staff of one. As you work through the steps to build your disaster recovery plan, how you plan and implement it will vary depending on the size of your company and the resources available.

In the discussion of disaster planning that follows, many of the steps, and the actions associated with those steps, are quite formal and probably sound like a bit more than you can manage in your small business. And, in many cases, you’re right—in a small business, one can often be substantially more informal. But do not make the mistake of ignoring something because it sounds too formal or involved. Rather, adjust the step and actions to fit within your smaller, but no less important, business. No matter how small your business, if it uses and depends on Microsoft Windows Small Business Server (SBS) 2011 Standard, you have valuable and business-critical assets on your server, so take the steps to protect them and your business before you have a disaster. You’ll save money, time, and, most important, business reputation by being able to withstand and even grow in the face of disaster.

We’ve been through fires, earthquakes, crashed servers, and just plain egregious error, and we’ve learned the hard way that disaster recovery is something you can do a lot better if you’ve planned for it ahead of time. It’s not sexy, and it’s sometimes hard to sell to upper management, but it is worth the effort. If you’re lucky, you’ll never need to use all of your plans for worst-case scenarios, but if you do need them, you’ll be really, really glad you have them.

1. Identifying the Risks

The first step in creating a disaster recovery plan is to identify the risks to your business and the costs associated with those risks. The risks vary from the simple deletion of a critical file to the total destruction of your place of business and its computers. To properly prepare for a disaster, you need to perform a realistic assessment of the risks, the potential costs and consequences of each disaster scenario, the likelihood of any given disaster scenario, and the resources available to address the risks. Risks that seemed vanishingly remote a few years ago are now part of our everyday lives.

This isn’t a job for a single person. As with all the tasks associated with a disaster recovery plan, all concerned parties must participate. There are two important reasons for this: you want to make sure that you have commitment and buy-in from the parties concerned, and you also want to make sure you don’t miss anything important.

No matter how carefully and thoroughly you try to identify the risks, you’ll miss at least one. You can account for that missing risk by including an “unknown risk” item in your list. Treat it just like any other risk: identify the resources available to address it, and develop countermeasures to take should it occur. The difference with this risk, of course, is that your resources and countermeasures are somewhat more generic, and you can’t really test your response to the risk, because you don’t yet know what it is.

Start by trying to list all the possible ways that your network could fail. Solicit help from everyone with a stake in the process. The more people involved in the brainstorming, the more ideas you’ll get, and the more prevention and recovery procedures you can develop and practice. Be careful at this stage in the process to not dismiss any idea or concern as trivial, unimportant, or unlikely.

Next, look at all the ways that some external event could affect your system. (The current buzzword for this is threat modeling, if you care.) The team of people responsible for identifying possible external problems is probably similar to a team looking at internal failures, but with some important differences. For example, if your business is housed in a large commercial office building, you’ll want to involve that building’s security and facilities groups even though they aren’t employees of your business. They will not only have important input into the possible threats to the business, but they’ll also have information on the resources and preventive measures already in place.

The risk identification phase is really made up of two parts: identification and assessment. They are different tasks. During the identification portion of the phase, you need to identify every possible risk, no matter how remote or unlikely. No risk suggested should be regarded as silly—don’t limit the suggestions in any way. You want to identify every possible risk that anyone can think of. Then, when you have as complete a list as you can create, move on to the assessment task. In the risk-assessment task, you will try to understand and quantify just how likely a particular risk is. If you’re located in a flood plain, for example, you’re much more likely to think flood insurance is a good investment.
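One simple way to make the assessment task concrete is to multiply each risk's estimated likelihood by its estimated cost and sort the results. The following is only a minimal sketch of that idea: the `Risk` class, the `rank_risks` function, and every number in the sample register are hypothetical placeholders, not figures from any real assessment.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_likelihood: float   # estimated chance of occurring in a year, 0.0 to 1.0
    cost_if_it_happens: float  # estimated cost, in whatever currency you use

    @property
    def expected_annual_cost(self) -> float:
        # Likelihood times cost: a rough yardstick, not a precise forecast.
        return self.annual_likelihood * self.cost_if_it_happens


def rank_risks(risks):
    """Return risks sorted from highest to lowest expected annual cost."""
    return sorted(risks, key=lambda r: r.expected_annual_cost, reverse=True)


# Hypothetical register; replace every entry with your own estimates.
register = [
    Risk("Critical file deleted", 0.9, 500),
    Risk("Server hardware failure", 0.2, 8000),
    Risk("Office flood", 0.01, 100000),
    Risk("Unknown risk", 0.05, 20000),
]

for risk in rank_risks(register):
    print(f"{risk.name}: {risk.expected_annual_cost:.0f}")
```

A ranking like this is a starting point for discussion rather than a verdict. A rare but business-ending event can deserve attention even when its expected cost looks modest.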

Note:

Even in a very small business, where there might be only one person involved in disaster planning, it’s a really good idea to get others involved somehow in at least the risk-identification task. Different people think up different scenarios and risk factors, and soliciting more and different viewpoints will improve the overall result of the process.

2. Identifying the Resources

After you’ve identified the risks to your network, you need to identify the resources available to address those risks. These resources can be internal or external, people or systems, hardware or software.

When you’re identifying the resources available to deal with a specific risk, be as complete as you can, but also be specific. Identifying everyone in the company as a resource to solve a crashed server might look good, but realistically only one or two people are likely to actually be able to rebuild the server. Make sure you identify those key people for each risk, as well as the more general secondary resources they have to call on, such as Microsoft Customer Support Services (CSS) and local Microsoft partners. For example, the primary resource available to recover a crashed server might consist of your hardware vendor to recover the failed hardware and your own IT person or primary system consultant to restore the software and database. General secondary resources could include Microsoft Support (http://support.microsoft.com/oas/default.aspx?gprid=3208), Microsoft Partners in your area, and the TechNet Forum for SBS (http://social.technet.microsoft.com/Forums/en-US/smallbusinessserver/threads).

An important step in identifying resources in your disaster recovery plan is to specify both the first-line responsibility and the back-end or supervisory responsibility. Make sure everyone knows who to go to when the problem is more than they can handle or when they need additional resources. Also, clearly define when they should escalate. The best disaster recovery plans include clear, unambiguous escalation policies. This takes the burden off individuals to decide when to notify someone and whom to notify, and it makes escalation simply part of the procedure.
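The resource list for each risk can be kept as a small structured record, which also makes it easy to spot risks that depend on a single person. This is a sketch under stated assumptions: the `ResponsePlan` fields, the `single_points_of_failure` helper, and the sample names are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResponsePlan:
    risk: str
    first_line: list   # people who do the hands-on recovery
    escalate_to: str   # supervisory contact when first-line help is not enough
    secondary: list = field(default_factory=list)  # vendors, support lines, forums


def single_points_of_failure(plans):
    """Return the risks for which fewer than two first-line people are named."""
    return [p.risk for p in plans if len(p.first_line) < 2]


# Hypothetical plans for a very small office.
plans = [
    ResponsePlan("Crashed server", ["IT person"], "Owner",
                 ["Hardware vendor", "Microsoft CSS"]),
    ResponsePlan("Failed router", ["IT person", "Office manager"], "Owner",
                 ["ISP support"]),
]

print(single_points_of_failure(plans))
```

In a one-person IT shop, most risks will show up in that list. The point is to know where the gaps are so that you can line up an outside consultant or train a backup.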

3. Developing the Responses

An old but relevant adage comes to mind when discussing disaster recovery scenarios: When you’re up to your elbows in alligators, it’s difficult to remember that your original objective was to drain the swamp. This is another way of saying that people lose track of what’s important when they are overloaded by too many problems that require immediate attention. To ensure that your swamp is drained and your network gets back online, you need to take those carefully researched risks and resources and develop a disaster recovery plan. There are two important parts of any good disaster recovery plan:

Standard operating procedures (SOPs)
Standard escalation procedures (SEPs)

Making sure these procedures are in place and clearly understood by everyone involved, before a disaster strikes, puts you in a far better position to recover gracefully and with a minimum of lost productivity and data.

3.1. Standard Operating Procedures

Emergencies bring out both the best and worst in people. If you’re prepared for the emergency, you can be one of those who come out smelling like a rose, but if you’re not prepared and let yourself get flustered or lose track of what you’re trying to accomplish, you can make the whole situation worse than it needs to be.

It’s just plain hard to stay calm and focused when you’re in the middle of an emergency and there’s a lot of extra stress being applied by everyone around you. Although no one is ever as prepared for a system emergency as they’d like to be, careful planning and preparation can give you an edge in recovering expeditiously and with a minimal loss of data. It’s a lot easier to deal with the situation calmly when you know you’ve prepared for this problem and you have a well-organized, tested SOP to follow.

Because the very nature of emergencies is that you can’t predict exactly which one is going to strike, you need to plan and prepare for as many possibilities as you can. The time to decide how to recover from a disaster is before the disaster happens, not in the middle of it when users are screaming and bosses are standing around looking serious and concerned. If you’re lucky. (We seem to have been blessed by those who follow the more common adage, “When in trouble or in doubt, run in circles, scream and shout.”)

Your risk-assessment phase involved identifying as many possible disaster scenarios and risks as you could; the resource-assessment phase identified the resources for those risks. Now you need to create SOPs for recovering the system from each of the scenarios. Having an SOP that details how to recover from a failed server makes that recovery a lot easier.

Reduce your stress and prevent mistakes by planning for disasters before they occur. Practice recovering from each of your disaster scenarios. Write down each of the steps, and work through questionable or unclear areas until you can identify exactly what it takes to recover from the problem. This is like a fire drill, and you should do it for the same reasons—not because a fire is inevitable, but because fires do happen, and the statistics demonstrate irrefutably that those who prepare for a fire and practice what to do in a fire are far more likely to survive the fire.

Even if you know you’re the only resource the company has to recover from a disaster scenario, write down the basic steps to do it. You don’t need to go into minute detail, but at the very least, outline the key steps. This might be something you do for real only once in your life, so don’t count on being able to remember everything. Disasters, by their very nature, raise the overall stress level and cause people to forget important steps.

Your job as a system administrator is to prepare for disasters and practice what to do in those disasters—not because you expect the disaster, but because if you do have one, you want to be the hero, not the goat. After all, it isn’t often that the system administrator or IT consultant gets to be a hero, so be ready when your time comes.

The first step in developing any SOP is to outline the overall steps you want to accomplish. Keep it general at this point—you’re looking for the big picture here. Again, you want everyone to be involved in the process. What you’re really trying to do is make sure you don’t forget any critical steps, and that’s much easier when you get the overall plan down first. There will be plenty of opportunity later to cover the specific details.

After you have a broad, high-level outline for a given procedure, the people you identified as the actual resources during the resource-assessment phase should start to fill in the blanks of the outline. You don’t need every detail at this point, but you should get down to at least a level below the original outline. This will help you identify missing resources that are important to a timely resolution of the problem. Again, don’t get too bogged down in the details at this point. You’re not actually writing the SOP, just trying to make sure that you’ve identified all of its pieces.

When you feel confident that the outline is ready, get the larger group back together again. Go over the procedure and smooth out the rough edges, refining the outline and listening to make sure you haven’t missed anything critical. When everyone agrees that the outline is complete, you’re ready to add the final details to it.

The people who are responsible for each procedure should now work through all the details of the disaster recovery plan and document the steps thoroughly. They should keep in mind that the people who actually perform the recovery might not be who they expect. It’s great to have an SOP for recovering from a failed router, but if the only person who understands the procedure is the IT person and she’s on vacation in Bora Bora that week, your disaster recovery plan has a big hole in it.

When you create the documentation, write down everything. What seems obvious to you now, while you’re devising the procedure, will not seem at all obvious in six months or a year when you suddenly have to follow it under stress.

REAL WORLD: Multiple Copies, Multiple Locations

It’s tempting to centralize your SOPs into a single, easily accessible database. And you should do that, making sure everyone understands how to use it. But you’ll also need to have alternative locations and formats for your procedures. Not only do you not want to keep the only copy in a single database, you also don’t want to have only an electronic version—how accessible is the SOP for recovering a failed server going to be when the server has failed? Always maintain hard-copy versions as well. The one thing you don’t want to do is create a single point of failure in your disaster recovery plan!

Every good server room should have a large binder, prominently visible and clearly identified, that contains all the SOPs. Each responsible person should also have one or more copies of at least the procedures he or she is either a resource for or likely to become a resource for. We like to keep copies of all our procedures in several places so that we can get at them no matter what the source of the emergency or where we happen to be when one of our pagers goes off.

Even if you’re the only resource, keep multiple copies of your procedures and key phone numbers of external resources. Don’t rely entirely on electronic storage, because even external electronic storage might be difficult to access if the disaster is major. But don’t ignore electronic storage, either. Most of the time, it’s the fastest and easiest to get to, and the most likely to be completely up to date.

After you have created the SOPs, your job has only begun. You need to keep them up to date and make sure that they don’t become stale. It’s no good having an SOP to recover your ISDN connection to the Internet when you ripped the ISDN line out three years ago and put in a DSL line with five times the bandwidth at half the cost.

You also need to make sure that all your copies of an SOP are updated. Electronic ones should probably be stored in a database or in a folder on SBS that is available offline. However, hard-copy documents are notoriously tricky to maintain. A good method is to make yet another SOP that details who updates which SOPs, how often that person updates them, and who gets fresh copies whenever a change is made. Then put a version control system into place and make sure everyone understands his or her role in the process. Build rewards into the system for timely and consistent updating of SOPs: if 10 or 20 percent of someone’s bonus is dependent on keeping those SOPs up to date and distributed, you can be sure they’ll be current at least as often as the review process.
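The "SOP about SOPs" described above can be backed by a simple record of who owns each procedure, when it was last reviewed, and who holds copies. The sketch below is one possible shape for that record, not a prescribed tool: `SopRecord`, `overdue`, the 180-day interval, and the sample dates are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SopRecord:
    title: str
    owner: str               # person responsible for keeping it current
    last_reviewed: date
    review_every_days: int   # how often the owner should review it
    copy_holders: tuple      # who gets a fresh copy after each change


def overdue(records, today):
    """Return records whose review interval has elapsed as of `today`."""
    return [
        r for r in records
        if today - r.last_reviewed > timedelta(days=r.review_every_days)
    ]


# Hypothetical records; the dates and intervals are placeholders.
records = [
    SopRecord("Restore failed server", "IT person", date(2012, 1, 10), 180,
              ("Server room binder", "Owner")),
    SopRecord("Recover Internet connection", "IT person", date(2012, 11, 1), 180,
              ("Server room binder",)),
]

for r in overdue(records, today=date(2013, 1, 14)):
    print(f"{r.title}: owner {r.owner}, redistribute to {', '.join(r.copy_holders)}")
```

Listing the copy holders alongside each record is a reminder that a revised SOP is not finished until the binder and every personal copy have been replaced.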

3.2. Standard Escalation Procedures

No matter how carefully you’ve identified potential risks, and how detailed your procedures to recover from them are, you’re still likely to have situations you didn’t anticipate. An important part of any disaster recovery plan is a standardized escalation procedure. Not only should each individual SOP have its own procedure-specific SEP, but you should also have an overall escalation procedure that covers everything you haven’t thought of—because it’s certain you haven’t thought of everything.

An escalation procedure has two functions—resource escalation and notification escalation. Both have the same purpose: to make sure that everyone who needs to know about the problem is up to date and involved as appropriate, and to keep the overall noise level down so that the work of resolving the problem can go forward as quickly as possible. The resource escalation procedure details the resources that are available to the people who are trying to recover from the current disaster so that these people don’t have to try to guess who (or what) the appropriate resource might be when they run into something they can’t handle or something doesn’t go as planned. This procedure helps them stay calm and focused. They know that if they run into a problem, they aren’t on their own, and they know exactly who to call when they do need help.

The notification escalation procedure details who is to be notified of serious problems. Even more important, it should provide specifics regarding when notification is to be made. If a particular print queue crashes but comes right back up, you might want to send a general message only to the users of that particular printer letting them know what happened. However, if your email has been down for more than half an hour, a lot of folks are going to be concerned. The SEP for email should detail who needs to be notified when the server is unavailable for longer than some specified amount of time, and it should probably detail what happens and who gets notified when it’s still down some significant amount of time after that.

This notification has two purposes: to make sure that the necessary resources are made available as required, and to keep everyone informed and aware of the situation. If you let people know that you’ve had a server hardware failure and that the vendor has been called and will be onsite within an hour, you’ll cut down the number of phone calls dramatically, freeing you to do whatever you need to do to ensure that you’re ready when the vendor arrives.
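The time-based notification rules discussed above can be written down as a short table of thresholds, so that nobody has to decide under stress whom to tell and when. This is a minimal sketch: the `EMAIL_SEP` tiers, the contact names, and the two-hour second threshold are hypothetical, and only the half-hour figure echoes the email example in the text.

```python
from datetime import timedelta

# Hypothetical SEP for email: (how long the outage has lasted, who to notify).
EMAIL_SEP = [
    (timedelta(minutes=0), "Affected users"),
    (timedelta(minutes=30), "Business owner"),
    (timedelta(hours=2), "External support vendor"),
]


def who_to_notify(outage, tiers=EMAIL_SEP):
    """Return everyone who should have been notified once an outage has lasted `outage`."""
    return [contact for threshold, contact in tiers if outage >= threshold]


print(who_to_notify(timedelta(minutes=45)))
```

Keeping the thresholds in one place also makes them easy to review when roles or phone numbers change.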

4. Testing the Responses

A disaster recovery plan is nice to have, but it really isn’t worth a whole lot until it has actually been tested. Needless to say, the time to test the plan is at your convenience and under controlled conditions, rather than in the midst of an actual disaster. It’s a nuisance to discover that your detailed disaster recovery plan has a fatal flaw in it when you’re testing it under controlled conditions. It’s a bit more than a nuisance to discover it when every second counts.

You won’t be able to test everything in your disaster recovery plans. Even most large organizations don’t have the resources to create fully realistic simulated natural disasters and test their response to each of them under controlled conditions, and even fewer small businesses have those kinds of resources. Nevertheless, there are things you can do to test your response plans. The details of how you test them depend on your environment, but they should include as realistic a test as feasible and should, as much as possible, cover all aspects of the response plan. The other reason to test the disaster recovery plan is that it provides a valuable training ground. If you’ve identified primary and backup resources, as you should, chances are that the people you’ve identified as backup resources are not as skilled or knowledgeable in a particular area as the primary resource. Testing the procedures gives you a chance to train the backup resources at the same time.

You should also consider using the testing to cross-train people who are not necessarily in the primary response group. Not only will they get valuable training, but you’ll also create a knowledgeable pool of people who might not be directly needed when the procedure has to be used for real, but who can act as key communicators with the rest of the community.

5. Iterating

When you finish a particular disaster recovery plan, you might think your job is done, but it’s not. Standardizing a process is actually just the first step. You need to continually look for ways to improve it.

You should make a regular, scheduled practice of pulling out your disaster recovery plan with those responsible and making sure it’s up to date. Use the occasion to actually look at it and see how you can improve on it. Take the opportunity to examine your environment. What’s changed since you last looked at the plan? What equipment has been retired, and what has been added? What software is different? Are all the people on your notification and escalation lists still working at the company in the same roles? Are the phone numbers, including home phone numbers, up to date?

REAL WORLD: Understand and Practice Kaizen

Kaizen is a Japanese word and concept that means “small, continuous improvement.” Its literal translation is “change (kai) to become good (zen).”

So, why bring a Japanese word and concept into a discussion about disaster recovery? Because a good disaster recovery plan is one that you are constantly Kaizening. When you really understand Kaizen, it becomes a way of life that you can use in many ways.

The first thing to understand about Kaizen is that you are not striving for major change or improvement. Small improvements are the goal. Don’t try to fix or change everything all at once. Instead, focus on one area, and try to make it just a little bit better.

The second part of Kaizen is that it is continuous. You must constantly look for ways to improve and implement those improvements. Because each improvement is small and incremental, you can easily implement it and move on to the next one.

Kaizen is very much about teamwork. Good Kaizen balances the load on a team and finds ways to build the strengths of the team as a whole. If you practice Kaizen and continually look for small, incremental ways to improve your work, you will soon have a better and more enjoyable workplace. As a manager, if you find ways to encourage and reward those who practice Kaizen, your team and you will grow and prosper.

Another way to iterate your disaster recovery plan is to use every disaster as a learning experience. After the disaster or emergency is over, get everyone together as soon as possible to talk about what happened. Find out what they think worked and what didn’t in the plan. What tools did you not have that would have made the job go quicker or better? Actively solicit suggestions for how the process could be improved. Then make the changes and test them. You’ll not only improve your responsiveness to this particular type of disaster, but you’ll also improve your overall responsiveness by getting people involved in the process and enabling them to be part of the solution.

Warning:

Do not use this post-disaster recovery discussion to assign blame or look for the cause of the disaster. This is about how to respond to, and recover from, a disaster better. And to do that, you need to learn from the experience so that you can do a better job planning for the next one. If everyone is trying to avoid blame, they won’t have any energy for improving the process.
