SharePoint 2010 : SharePoint Disaster Recovery Testing and Maintenance

6/23/2011 3:44:21 PM

This article covers the next logical step in your disaster recovery planning process: testing and maintaining your plan. These items are natural and important components of any information technology (IT) project or process, but they’re all too often given little attention or few resources. Given the potential importance of your SharePoint environment and its contents, you drastically increase your risk and decrease the viability of your system if you don’t adequately test and sustain your disaster recovery plan.

Obviously, these two items can occur at different stages in the life cycle of your disaster recovery process, but they’re related. Most notably, the first maintenance activities of your plan are likely going to happen after you conduct its first test. Testing your plan should produce several lessons learned, valuable data, and necessary modifications. These naturally lead you into the maintenance phase of the process. Likewise, as you continue ongoing maintenance for your plan, you should re-execute your tests to validate all the changes that you’ve made to the plan.

Planning Your Test

The quality of the testing you do for your disaster recovery plan can be just as crucial to the success of your plan as the quality of its design and contents. If you don’t conduct an effective test of your plan, you don’t have a comprehensive understanding of how it will be applied and utilized if a disaster is declared involving your SharePoint environment. Testing is the best way to begin identifying potential bottlenecks, weaknesses, and dependencies that you may not have considered during the design process. Testing also provides your team with an outstanding training mechanism. Through execution of the plan, team members are developing a deeper understanding of the plan and gaining realistic experience with it. Testing also helps you to estimate your ability to meet your recovery time objective (RTO) and recovery point objective (RPO) goals, which are of paramount importance to the viability of the disaster recovery plan.

Whenever possible, conduct disaster recovery testing for your SharePoint environment within the context of testing your organization’s overall business continuity plan (BCP). Given the interdependencies between technical systems such as SharePoint and the business users who work with them, most of the time it isn’t sufficient to simply test your disaster recovery plan in a vacuum. You need to know how your design impacts the rest of the BCP, any consequences the BCP may have for your recovery plans, and any other systems in your organization that depend on the restoration of the SharePoint environment for their own success. This information lets you examine your communication plan and its viability, and it allows business users to verify that their expectations and strategies involving the BCP and your SharePoint environment are accurate and realistic. If your testing efforts don’t in some way involve stakeholders or resources from the business side of your organization, you should at a minimum convey the results of your testing effort so these key people are informed of your findings.

Defining the Scope of Your Outage

The first step of defining how to create an outage in your SharePoint environment for purposes of testing is to determine the scope of that outage. As with any type of test or activity, the value of your test results is based on how successfully the test covers the key aspects of your system and assesses the effectiveness of your disaster recovery plan. Running a test that doesn’t impact SharePoint or isn’t likely to actually occur in the real world isn’t a productive use of your time and resources. The following list outlines some of the questions you should be asking yourself as you determine what your disaster recovery test will encompass:

What are the most likely types of outages your system may experience? If your SharePoint environment contains mostly read-only content, there may be little reason to test the retrieval of content that was accidentally deleted by end users. If your servers are located in an area of the world prone to certain types of weather patterns or natural disasters (tornados, hurricanes, earthquakes, and so on), does it make sense to simulate one of those events in your test?
What are your most valuable recovery targets? Your test should confirm your plan’s ability to restore your system’s most important recovery targets. These are likely the items your business users will be looking for first, and your plan must be able to bring them back successfully.
What items have minimal RTOs and RPOs? If you have little time to bring back a resource or need to bring back a resource to a recent state, it’s imperative that you test and verify your ability to meet those requirements.
What are your most vulnerable recovery targets? If your SharePoint farm has components that are more likely than others to be impacted by an outage, such as a WAN connection or Internet-facing servers outside your firewall, you should exercise them during the disaster test.
What resources are available for testing? There may be constraints placed on your test by the resources you have available to execute it with. If your production SharePoint farm contains load-balanced Web front-end (WFE) servers but your test environment doesn’t, you won’t be able to test that high availability aspect of your disaster recovery plan. This evaluation should also include resources external to your SharePoint environment, such as business representatives, data center administrators, or storage area network (SAN) capacity available to your servers.
What components or dependent systems in your SharePoint environment are governed by disaster recovery plans other than your own? Again, consider testing your plan as part of testing your organization’s overall BCP. If you are testing independently of the BCP, your plan may still have dependencies on other plans that you need to examine. In particular, you should be aware of any service-only farms or published Service Applications that your SharePoint farm consumes, because these may tie your recovery plans directly to plans that exist for one or more additional SharePoint environments. It may not be necessary to test these items, but you must verify that these external plans have been tested or are assured by their owners to reduce the risk to your plan.

Organizing Your Resources

The obvious conclusion you may come to when evaluating how to test your SharePoint disaster recovery plan is that your test should mirror the conditions, configurations, and resources found in your production environment as closely as possible. This is certainly one way to approach your test, but you need to determine whether it is the most effective way to test your plan and the most effective use of your resources. Review the requirements and design of your plan, and find an approach for testing that is authentic and challenging without wasting effort or resources.

Testing Your Systems

Again, your plan’s RTO and RPO goals play an important role in deciding what systems or environments to use to conduct your test. If your SharePoint environment is designed to deliver minimal or near-zero RTO and RPO outage windows, it’s probably going to involve multiple duplicate systems, such as replicated SharePoint farms in alternate data centers, clustered databases, and redundant storage. In this case, it may make more sense to actually conduct the test by leveraging these failover systems, even though they’re in a production environment. This gives you a highly accurate profile of how your system will perform in a disaster by using the actual systems that you’ll need to function correctly when something hits the fan. This isn’t to say that a duplicate testing environment is a poor solution. Rather, the point is to consider the best testing solution to give you the most accurate and relevant data possible about how your plan, your SharePoint farm, its dependent systems, and all the involved personnel will perform in a disaster. If it makes the most sense for your organization to create a test environment for this activity, by all means do so. But make sure that you think about how your plan, its requirements, and its constituents are best tested, in addition to considering your test’s available resources and budget.

Also keep in mind that the physical resources your test requires are not just limited to the SharePoint environment needed to run your test. Just as your production SharePoint environment most likely uses several other systems for monitoring, reporting, networking, and other crucial capabilities, your test environment has equivalent dependencies to consider. For example, if you rely on a monitoring system that generates trouble tickets or pages resources when an outage occurs, make sure that system is also monitoring the SharePoint farm hosting your test. But also configure the monitoring system so that production resources aren’t assigned to handle the events generated by your test system during disaster recovery testing, to avoid confusion and service degradation for the production system.

Testing Your People

Make the test as authentic as possible, not just in terms of the IT assets used, but also in terms of the team involved in the test. Assign participants to fill each of the key roles dictated by your disaster recovery plan so that the required actions, abilities, and responsibilities of each role can be assessed and evaluated. Also include business owners or their representatives in the test. This can go a long way toward properly setting their expectations in an outage. It not only gives them an excellent understanding of the communication they can expect when an outage occurs, but also shows them the role(s) they play during plan execution and the overall recovery effort.

Planning for Losses

Seriously consider incorporating certain losses of disaster recovery resources and personnel in your test so that you and your team can understand how to overcome those challenges should something similar occur during an actual outage. Who needs to be informed if the latest set of tape backups is corrupted and an RPO target can’t be met? What if a database administrator is on vacation during an outage? Can your plan still be executed to meet its criteria for success without the presence of key resources? By purposely building losses into your test, you can further identify weaknesses and dependencies in your system.

Verifying Checklists and Preparedness

The initial test of your system is also an excellent opportunity to verify or develop any checklists that you may need as job aids for the disaster recovery plan. During the planning phase of any project, it’s often difficult to capture every necessary activity down to the smallest detail, but it becomes much more feasible to do so during test execution. Creating task and resource lists can make your personnel more effective during an actual outage, improving your disaster recovery team’s efficiency and effectiveness while eliminating common mistakes and missteps. It’s also much easier to learn these lessons during a test than during an actual disaster when business owners are breathing down your neck and everything has to be executed without surprises and errors.

Testing your disaster recovery plan with the people who are likely to execute it in a production environment is a great training exercise for these resources and can identify other areas for additional improvement. It also educates your partners and service providers on what you’ll be counting on them for in the event of an outage in terms of both services and their delivery windows. Remember that your disaster recovery plan is likely going to encompass a group far larger than just your SharePoint team. The more you can do to ensure the preparedness and responsiveness of all parties involved in a recovery effort, the more effective the recovery effort is.

Conducting the Test

Remember that the more authentic your test is and the more accurately it re-creates an outage of your SharePoint environment, the more value it gives you and the more predictable and effective your disaster recovery plan becomes. The test isn’t an excuse to inconvenience your personnel or make unnecessary requests of your external service providers, but all participants should take the test seriously and act as if it’s an actual outage. With business representatives and nontechnical personnel from your organization participating, it’s even more important to take the exercise seriously to build their confidence in your plan, your team’s ability to execute it, and the stability of your SharePoint environment in general.

Encouraging Communication

At all stages of the test, encourage communication among the test’s participants and provide them with all the information necessary to fully participate in the test. This starts with the test’s kickoff activities, where the participants are introduced to the test SharePoint farm, assigned their roles within it, informed of the outage, and provided with the specific details of the catastrophic event that has occurred in the test environment. All participants must understand their role within the test; otherwise, the test may not be fully implemented or, worse, may be executed incorrectly.

Throughout the test, the recovery team should have regular meetings to communicate status and findings. The frequency of meetings can follow the communication requirements of the disaster recovery plan, but you might need to provide updates more frequently as participants execute, learn, and troubleshoot the plan. Record all the key findings, tips, issues, and communications made during the test so that you can review them once the exercise is completed and incorporate them into the revised plan.

Tip

Because recording information and observations during a test can take a significant amount of time, assign a note-taking observer for each person carrying out some part of the recovery plan. Taking this step ensures that execution of the recovery plan isn’t slowed and that the feedback gathered is objective in nature. It also encourages recovery plan participants to stay focused on the work they’re doing rather than taking notes.

After the test has been completed, you can take several steps to gather further information about it. Collect any and all notes that participants made during their activities, and survey all contributors to collect general thoughts and responses about the test. Once you’ve gathered all the data, communicate a summary and findings report to all participants. Make sure that the personnel executing the test are given feedback on their work so they know what they did well during the test and what they need to work on and improve in the event of an actual disaster.

Observing the Test

In addition to the notes, thoughts, and data generated by the note-taking observers assigned to each of the test’s participants, it’s important to assign certain members of your team to observe the overall test as it progresses. These independent observers should especially be on the lookout for items that are not addressed but need to be added to the larger disaster recovery plan, for different streams of recovery that may conflict with one another, and for activities that depend on other activities, on timing, or on some other outside influence. You may find that you’re best served by assigning this task to team members closely familiar with the disaster recovery plan so they can spend their time observing the test, as opposed to constantly referencing the plan to confirm one detail or another. Assigning experienced members to observe also gives your less experienced team members more hands-on time with the plan to build their knowledge and expertise.

Validating the Plan

The nice thing about testing your disaster recovery plan is that the plan itself should already provide the criteria you need to evaluate whether you passed. Your SharePoint environment’s disaster recovery plan should not only define the benchmarks and goals you need to meet for a successful recovery from an outage, but also state the RTO and RPO goals you’re required to meet to fully satisfy your business owners’ requirements. Once the test has completed, validate its output against these standards and determine how successful you were at meeting them. If you’re unable to meet the RTO and RPO requirements of your plan, you’ll need to perform additional analysis to determine how to remedy that issue and update the plan accordingly.
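Checking test results against RTO and RPO goals comes down to comparing a few timestamps. The sketch below shows one possible way to do that comparison, assuming your observers recorded when the outage was declared, when each target was confirmed restored, and the timestamp of the data that was actually recovered. The `RecoveryResult` structure, its field names, and the sample figures are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryResult:
    """Timestamps recorded for one recovery target (hypothetical structure)."""
    target: str
    outage_start: datetime      # when the simulated outage was declared
    service_restored: datetime  # when the target was confirmed usable again
    last_good_backup: datetime  # timestamp of the data that was actually restored
    rto: timedelta              # maximum tolerated downtime, from the plan
    rpo: timedelta              # maximum tolerated data-loss window, from the plan

    def downtime(self) -> timedelta:
        return self.service_restored - self.outage_start

    def data_loss(self) -> timedelta:
        return self.outage_start - self.last_good_backup

    def met_rto(self) -> bool:
        return self.downtime() <= self.rto

    def met_rpo(self) -> bool:
        return self.data_loss() <= self.rpo


def summarize(results):
    """Return (target, met_rto, met_rpo) tuples suitable for a findings report."""
    return [(r.target, r.met_rto(), r.met_rpo()) for r in results]


# Sample figures, invented for this example only.
outage = datetime(2011, 6, 1, 9, 0)
results = [
    RecoveryResult("Intranet portal", outage, outage + timedelta(hours=3),
                   outage - timedelta(minutes=30),
                   rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    RecoveryResult("Records center", outage, outage + timedelta(hours=10),
                   outage - timedelta(hours=2),
                   rto=timedelta(hours=8), rpo=timedelta(hours=4)),
]

for target, rto_ok, rpo_ok in summarize(results):
    print(f"{target}: RTO {'met' if rto_ok else 'missed'}, "
          f"RPO {'met' if rpo_ok else 'missed'}")
```

In this sample, the second target restores its data within the RPO window but takes longer than its RTO allows, which is the kind of result that should send you back to the plan for further analysis.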

Redesigning the Plan

After you’ve validated your test and reviewed its output, you may need to redesign your plan based on your findings. Although you can’t expect your disaster recovery plan to account for every complication or calamity that may arise during the recovery of your SharePoint farm, an effective test of your plan often results in some valuable information and changes to the plan. Your responsibility, once the test is completed, is to refactor the plan based on those conclusions and then retest it to verify the accuracy of your modifications.

Performing Ongoing Maintenance of Your Disaster Recovery Plan

In life, and in IT administration in particular, the only constant is change. One challenging aspect of creating a disaster recovery plan is that the system you’re designing against is likely to go through frequent modifications, even during the course of your design process. It is not uncommon for a system to have grown, matured, and been updated within six months of its plan’s completion and approval, to the point that the plan is no longer fully relevant. That’s why it’s important not only to write your plan in such a way that it can be easily modified and updated, but also to re-evaluate and update it on a regular basis to keep it in line with the SharePoint environment it addresses.

Analyzing Your Systems: As-Is/To-Be

One way to anticipate changes that may be required for your SharePoint disaster recovery plan is by creating some key lists that track the current and future state of your environment. Organizations are constantly evaluating their IT systems to determine if they’re able to meet their specific needs and learn what modifications, additions, or subtractions they may make to them in the future. Often this analysis is broken into two sections: As-Is and To-Be. As-Is analysis of a system examines the business’s current users, processes, and data and compares it to the existing IT system. This comparison is then used to evaluate how well the system serves the needs and actions of the business and to establish a baseline for the future state of the system. The future state is defined in the To-Be analysis. The To-Be list defines the vision for the business’s IT systems of the future, prioritizes features and functionality, and establishes goals that upgrades should meet or exceed.

An effective disaster recovery plan is designed to meet the requirements and conditions set forth by the As-Is list of an organization while keeping an eye toward the state described by the To-Be list. A plan must encompass the current system’s entire configuration, workflows, and data but also be flexible enough to either handle or be modified to accommodate the projected future state of the system. If a disaster recovery plan can’t grow with your SharePoint farm as its role within your organization grows, and thus its IT footprint grows to match, it quickly loses its effectiveness.

If your organization doesn’t have official As-Is and To-Be lists that include your SharePoint environment, consider compiling these items before finalizing your SharePoint disaster recovery plan. You need to have a concrete understanding of your system, its strengths and weaknesses, and its projected future state to effectively know what needs to be preserved and restored and how that could yield changes to your disaster recovery plan in the coming years.

Modifying Your Plan

In general, your organization should have procedures that govern the review and update of approved documentation so that all documents are evaluated on a regular basis (for example, every year) and updated accordingly. You may find that, based on how your SharePoint system evolves and grows, your disaster recovery plan requires more frequent care and feeding. Take care to establish certain criteria that can trigger an update to your plan, such as a major release for your system, the deployment of new hardware, or the installation of service packs or version upgrades for your software.
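The review criteria described above can be written down as a simple rule: review the plan when the scheduled interval has elapsed or when a qualifying change has occurred since the last review. The following sketch illustrates that rule only. The one-year interval, the event names, and the `review_reasons` function are assumptions made for this example rather than part of any SharePoint tooling, and your organization’s documentation procedures should define the real values.

```python
from datetime import date, timedelta

# Assumed policy values; substitute your organization's own.
REVIEW_INTERVAL = timedelta(days=365)
TRIGGER_EVENTS = {"major_release", "new_hardware", "service_pack", "version_upgrade"}


def review_reasons(last_review, today, events_since_review):
    """Return the list of reasons the plan should be reviewed (empty if none)."""
    reasons = []
    if today - last_review >= REVIEW_INTERVAL:
        reasons.append("scheduled review overdue")
    for event in events_since_review:
        if event in TRIGGER_EVENTS:
            reasons.append(f"trigger event: {event}")
    return reasons


# A service pack was installed five months after the last review;
# the unrelated event name is ignored because it is not a defined trigger.
print(review_reasons(date(2011, 1, 10), date(2011, 6, 23),
                     ["service_pack", "new_site_collection"]))
```

Keeping the trigger list explicit, in whatever form you record it, makes it harder for the plan to drift out of date unnoticed between scheduled reviews.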

When you do modify the plan, create a new version of its documentation so that you can maintain and track a history of its changes over time. Ensure that the document again goes through a full review and approval process so that all stakeholders are made aware of the changes that have occurred in the system and the disaster recovery plan itself. Allowing the plan to gather dust while the state of your production SharePoint system evolves presents a major risk to the plan’s relevance and effectiveness and your ability to actually recover the system in a catastrophe.

Tip

Specialized applications and systems, such as SunGard’s Living Disaster Recovery Planning System (LDRPS), exist to serve and address the needs of disaster recovery planners. These applications and systems can greatly simplify the processes of disaster recovery documentation, change tracking, and ongoing plan maintenance. If your organization contains a group with formalized disaster recovery responsibility, check with them to see if you could or should be leveraging such a system for your SharePoint disaster recovery planning purposes. If the decision is in your hands, investigate the use of one of these systems. It can save time, effort, and most importantly, confusion—particularly when disaster strikes.

Expecting and Budgeting for Ongoing Maintenance

To make changes to your disaster recovery plan, you need to expend at least some resources: the time necessary to redesign the plan to meet the changing needs of your systems, as well as any additional hardware or software that the redesigned plan may require. Be prepared for expenses beyond time if the scope of your SharePoint farm grows, because you’ll likely require further physical resources such as expanded storage space or more servers, not to mention the possibility of specialized backup and restore software. All these items can add definite costs to your budget that you may not necessarily anticipate once the disaster recovery plan is in place, but you should expect them as part of your plan’s ongoing maintenance. As economic circumstances fluctuate and available budgets grow and shrink, you must make sure that sufficient resources are made available to support ongoing maintenance of the plan.

Tip

The yearly cost of disaster recovery maintenance is often tied to the disaster recovery design that is implemented for a SharePoint farm. A best practice for most corporate SharePoint farm owners is to calculate and budget for the cost of ongoing disaster recovery maintenance at the same time they prepare a capital asset request for the acquisition of a SharePoint environment and the initial implementation of its disaster recovery strategy and design.
