The content
covered in this article is the next logical step in your disaster
recovery planning process: testing and maintaining your plan. These
items are natural and important components of any information technology
(IT) project or process, but they’re all too often given little
attention or resources. Given the potential importance of your
SharePoint environment and its contents, you drastically increase
your risk and decrease the viability of your system if you don’t
adequately test and sustain your disaster recovery plan.
Obviously, these two items can
occur at different stages in the life cycle of your disaster recovery
process, but they’re related. Most notably, the first maintenance
activities of your plan are likely going to happen after you conduct its
first test. Testing your plan should produce several lessons learned,
valuable data, and necessary modifications. These naturally lead you
into the maintenance phase of the process. Likewise, as you continue
ongoing maintenance for your plan, you should re-execute your tests to
validate all the changes that you’ve made to the plan.
Planning Your Test
The quality of the testing
you do for your disaster recovery plan can be just as crucial to the
success of your plan as the quality of its design and contents. If you
don’t conduct an effective test of your plan, you don’t have a
comprehensive understanding of how it will be applied and utilized if a
disaster is declared involving your SharePoint environment. Testing is
the best way to begin identifying potential bottlenecks, weaknesses, and
dependencies that you may not have considered during the design
process. Testing also provides your team with an outstanding training
mechanism. Through execution of the plan, team members develop a
deeper understanding of the plan and gain realistic experience with
it. Testing also helps you to estimate your
ability to meet your recovery time objective (RTO) and recovery point
objective (RPO) goals, which are of paramount importance to the
viability of the disaster recovery plan.
Whenever possible,
conduct disaster recovery testing for your SharePoint environment within
the context of testing your organization’s overall business continuity
plan (BCP). Given the interdependencies between technical systems such
as SharePoint and the business users who work with them, most of the
time it isn’t sufficient to simply test your disaster recovery plan in a
vacuum. You need to know how your design impacts the rest of the BCP,
any consequences the BCP may have for your recovery plans, and any other
systems in your organization that depend on the restoration of the
SharePoint environment for their own success. This information lets you
examine your communication plan and its viability, and it allows
business users to verify that their expectations and strategies
involving the BCP and your SharePoint environment are accurate and
realistic. If your testing efforts don’t in some way involve
stakeholders or resources from the business side of your organization,
you should at a minimum convey the results of your testing effort so
these key people are informed of your findings.
Defining the Scope of Your Outage
The first step of defining
how to create an outage in your SharePoint environment for purposes of
testing is to determine the scope of that outage. As with any type of
test or activity, the value of your test results is based on how
successfully the test covers the key aspects of your system and assesses
the effectiveness of your disaster recovery plan. Running a test that
doesn’t impact SharePoint or that simulates an outage unlikely to occur
in the real world isn’t a productive use of your time and resources. The following
list outlines some of the questions you should be asking yourself as you
determine what your disaster recovery test will encompass:
What are the most likely types of outages your system may experience?
If your SharePoint environment contains mostly read-only content, there
may be little reason to test the retrieval of content that was
accidentally deleted by end users. If your servers are located in an
area of the world prone to certain types of weather patterns or natural
disasters (tornados, hurricanes, earthquakes, and so on), does it make
sense to simulate one of those events in your test?
What are your most valuable recovery targets?
Your test should confirm your plan’s ability to restore your system’s
most important recovery targets. These are likely the items your
business users will be looking for first, and your plan must be able to
bring them back successfully.
What items have minimal RTOs and RPOs?
If you have little time to bring back a resource or need to bring back a
resource to a recent state, it’s imperative that you test and verify
your ability to meet those requirements.
What are your most vulnerable recovery targets?
If your SharePoint farm has components that are more likely than others
to be impacted by an outage, such as a WAN connection or Internet-facing servers outside your firewall, you should exercise them during the disaster test.
What resources are available for testing?
There may be constraints placed on your test by the resources you have
available to execute it with. If your production SharePoint farm
contains load-balanced Web front-end (WFE) servers but your test
environment doesn’t, you won’t be able to test that high availability
aspect of your disaster recovery plan. This evaluation should also
include resources external to your SharePoint environment, such as
business representatives, data center administrators, or storage area
network (SAN) capacity available to your servers.
What components or dependent systems in your SharePoint environment are governed by disaster recovery plans other than your own?
Again, consider testing your plan as part of testing your
organization’s overall BCP. If you are testing independently of the BCP,
your plan may still have dependencies on other plans that you need to
examine. In particular, you should be aware of any service-only farms or
published Service Applications that your SharePoint farm consumes,
because these may tie your recovery plans directly to plans that exist
for one or more additional SharePoint environments. It may not be
necessary to test these items, but you must verify that these external
plans have been tested or are assured by their owners to reduce the risk
to your plan.
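The questions above can be turned into a rough ranking of what to exercise first. The following is a minimal sketch, not a prescribed method: the `RecoveryTarget` fields, the 1-to-5 scoring scales, and the sample targets are all hypothetical and would need to come from your own plan and business owners.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """One item the plan must restore. All fields are illustrative."""
    name: str
    rto: timedelta          # maximum tolerable time to restore
    rpo: timedelta          # maximum tolerable window of lost data
    business_value: int     # 1 (low) to 5 (high), assigned by business owners
    vulnerability: int      # 1 (low) to 5 (high), assigned by the recovery team


def rank_for_testing(targets):
    """Order targets by highest business value, then highest vulnerability,
    then shortest RTO, then shortest RPO."""
    return sorted(
        targets,
        key=lambda t: (-t.business_value, -t.vulnerability, t.rto, t.rpo),
    )


# Hypothetical sample data, for illustration only.
targets = [
    RecoveryTarget("Team site archive", timedelta(hours=72), timedelta(hours=24), 2, 1),
    RecoveryTarget("Intranet portal", timedelta(hours=4), timedelta(hours=1), 5, 3),
    RecoveryTarget("Internet-facing WFE", timedelta(hours=4), timedelta(hours=8), 4, 5),
]

for target in rank_for_testing(targets):
    print(target.name)
```

The sort order here is one possible weighting. If your plan treats tight RTOs as more important than business value, move `t.rto` to the front of the key.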
Organizing Your Resources
The obvious conclusion you may
come to when evaluating how to test your SharePoint disaster recovery
plan is that your test should mirror the conditions,
configurations, and resources found in your production environment as
closely as possible. This is certainly one way to approach your test,
but you need to determine if this is the most effective way to test your
plan and the most effective use of your resources. Review the
requirements and design of your plan, and find an approach for testing
that is authentic and challenging without wasting efforts or resources.
Testing Your Systems
Again, your plan’s RTO and
RPO goals play an important role in deciding what systems or
environments to use to conduct your test. If your SharePoint environment
is designed to deliver minimal or near-zero RTO and RPO outage windows,
it’s probably going to involve multiple duplicate systems, such as
replicated SharePoint farms in alternate data centers, clustered
databases, and redundant storage. In this case, it may make more sense
to actually conduct the test by leveraging these failover systems, even
though they’re in a production environment. This gives you a highly
accurate profile of how your system will perform in a disaster by using
the actual systems that you’ll need to function correctly when something
hits the fan. This isn’t to say that a duplicate testing environment is
a poor solution. Rather, the point is to consider the best testing
solution to give you the most accurate and relevant data possible about
how your plan, your SharePoint farm, its dependent systems, and all the
involved personnel will perform in a
disaster. If it makes the most sense for your organization to create a
test environment for this activity, by all means do so. But make sure
that you think about how your plan, its requirements, and its
constituents are best tested, in addition to considering your test’s
available resources and budget.
Also keep in mind that the
physical resources your test requires are not just limited to the
SharePoint environment needed to run your test. Just as your production
SharePoint environment most likely uses several other systems for
monitoring, reporting, networking, and other crucial capabilities, your
test environment has equivalent dependencies to consider. For example,
if you rely on a monitoring system that generates trouble tickets or
pages resources when an outage occurs, make sure that system is also
monitoring the SharePoint farm hosting your test. But also configure the
monitoring system so that production resources aren’t assigned to
handle the events generated by your test system during disaster recovery
testing, to avoid confusion and service degradation for the production
system.
Testing Your People
Make the
test as authentic as possible, not just in terms of the IT assets used,
but also the team involved in the test. Assign participants to fill each
of the key roles dictated by your disaster recovery plan so that the
required actions, abilities, and responsibilities of each role can be
assessed and evaluated. Also include business owners or their
representatives in the test. This can go a long way toward properly
setting their expectations for an outage: it not only gives them an
excellent understanding of the communication they can expect when an
outage occurs but also shows them the role(s) they play during plan execution
and the overall recovery effort.
Planning for Losses
Seriously consider
incorporating certain losses of disaster recovery resources and
personnel in your test so that you and your team can understand how to
overcome those challenges should something similar occur during an
actual outage. Who needs to be informed if the latest set of tape
backups is corrupted and an RPO target can’t be met? What if a database
administrator is on vacation during an outage? Can your plan still be
executed to meet its criteria for success without the presence of key
resources? By purposely building losses into your test, you can further
identify weaknesses and dependencies in your system.
Verifying Checklists and Preparedness
The initial test of your system
is also an excellent opportunity to verify or develop any checklists
that you may need as job aids for the disaster recovery plan. During the
planning phase of any project, it’s often difficult to capture every
necessary activity down to the smallest detail, but it becomes much more
feasible to do so during test execution. Creating task and resource
lists can make your personnel more effective during an actual outage,
improving your disaster recovery team’s efficiency and effectiveness
while eliminating common mistakes and missteps. It’s also much easier to
learn these lessons during a test than during an actual disaster when
business owners are breathing down your neck and everything has to be
executed without surprises and errors.
Testing
your disaster recovery plan with the people who are likely to execute
it in a production environment is a great training exercise for these
resources and can identify other areas for additional improvement. It
also educates your partners and service providers on what you’ll be
counting on them for in the event of an outage in terms of both services
and their delivery windows. Remember that your disaster recovery plan
is likely going to encompass a group far larger than just your
SharePoint team. The more you can do to ensure the preparedness and
responsiveness of all parties involved in a recovery effort, the more
effective the recovery effort is.
Conducting the Test
Remember that the more
authentic your test is and the more accurately it re-creates an outage
of your SharePoint environment, the more value it gives you and the more
predictable and effective your disaster recovery plan becomes. The test
isn’t an excuse to inconvenience your personnel or make unnecessary
requests of your external service providers, but all participants should
take the test seriously and act as if it’s an actual outage. With
business representatives and nontechnical personnel from your
organization participating, it’s even more important to take the
exercise seriously to build their confidence in your plan, your team’s
ability to execute it, and the stability of your SharePoint environment
in general.
Encouraging Communication
At all stages of the test,
encourage communication among the test’s participants and provide them
with all the information necessary to fully participate in the test.
This starts with the test’s kickoff activities, where the participants
are introduced to the test SharePoint farm, assigned their roles within
it, informed of the outage, and provided with the specific details of
the catastrophic event that has occurred in the test environment. All
participants must understand their role within the test; otherwise, the
test may not be fully implemented or, worse, may be executed
incorrectly.
Throughout the test, the
recovery team should have regular meetings to communicate status and
findings. The frequency of meetings can follow the communication
requirements of the disaster recovery plan, but you might need to
provide updates more frequently as participants execute,
learn, and troubleshoot the plan. Record all the key findings, tips,
issues, and communications made during the test so that you can review
them once the exercise is completed and incorporate them into the
revised plan.
Tip
Because recording
information and observations during a test can take a significant amount
of time, assign a note-taking observer for each person carrying out
some part of the recovery plan. Taking this step ensures that execution
of the recovery plan isn’t slowed and that the feedback gathered is
objective in nature. It also encourages recovery plan participants to
stay focused on the work they’re doing rather than taking notes.
After
the test has been completed, you can take several steps to gather
further information about it. Collect any and all notes that
participants made during their activities, and survey all contributors
to collect general thoughts and responses about the test. Once you’ve
gathered all the data, communicate a summary and findings report to all
participants. Make sure that the personnel executing the test are given
feedback on their work so they know what they did well during the test
and what they need to work on and improve in the event of an actual
disaster.
Observing the Test
In addition to the notes,
thoughts, and data generated by the note-taking observers assigned to
each of the test’s participants, it’s important to assign certain
members of your team to observe the overall test as it progresses. These
independent observers should especially be on the lookout for items
that are not addressed but need to be added to the larger disaster
recovery plan; different streams of recovery that may conflict with one
another; and activities that depend on other activities,
timing, or some other outside influence. You may find that you’re best
served by assigning this task to team members closely familiar with the
disaster recovery plan so they can spend their time observing the test,
as opposed to constantly referencing the plan to confirm one detail or
another. This arrangement also gives your less experienced team members
more hands-on time with the plan to build their knowledge and
expertise.
Validating the Plan
The nice thing about
testing your disaster recovery plan is that it should already provide
you with the criteria you need to evaluate whether you passed. Your
SharePoint environment’s disaster recovery plan should not only define
the benchmarks and goals you need to meet for a successful recovery from
an outage, but it should inform you of the RTO and RPO goals you’re
required to meet to fully satisfy your business owners’ requirements.
Once the test has completed, validate its output against these standards
and determine how successful you were at meeting them. If you’re unable
to meet the RTO and RPO requirements of your plan, you’ll need to
perform additional analysis to determine how to remedy that issue and
update the plan accordingly.
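The comparison described above comes down to two subtractions. The following is a minimal sketch under stated assumptions: the function name, the timestamps, and the four-hour RTO and one-hour RPO are hypothetical values chosen for illustration, not figures from any real plan.

```python
from datetime import datetime, timedelta


def validate_test(outage_start, service_restored, last_recoverable_data, rto, rpo):
    """Compare figures measured during the test with the plan's goals.

    Returns a dict with the measured values and pass/fail flags.
    """
    recovery_time = service_restored - outage_start      # measured against RTO
    data_loss = outage_start - last_recoverable_data     # measured against RPO
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical test results: outage declared at 09:00, service back at 14:30,
# newest restorable data from 08:15.
result = validate_test(
    outage_start=datetime(2024, 3, 2, 9, 0),
    service_restored=datetime(2024, 3, 2, 14, 30),
    last_recoverable_data=datetime(2024, 3, 2, 8, 15),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(result["rto_met"], result["rpo_met"])  # prints: False True
```

In this example the data loss window of 45 minutes satisfies the RPO, but the recovery time of five and a half hours misses the RTO, which is the kind of result that calls for further analysis and a plan update.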
Redesigning the Plan
After you’ve validated your
test and reviewed its output, you may need to redesign your plan based
on your findings. Although you can’t expect your disaster recovery plan
to account for every complication or calamity that may arise during the
recovery of your SharePoint farm, an effective test of your plan often
results in some valuable information and changes to the plan. Your
responsibility, once the test is completed, is to refactor the plan
based on those conclusions and then retest it to verify the accuracy of
your modifications.
Performing Ongoing Maintenance of Your Disaster Recovery Plan
In
life and in IT administration in particular, the only constant is
change. One challenging aspect of creating a disaster recovery plan is
that the system you’re designing against is likely to go through
frequent modifications, even during the course of your design process.
It is not uncommon that, as soon as six months after your plan is
completed and approved, the system you designed it for has grown,
matured, and been updated to the point that the plan is no longer fully
relevant. That’s why it’s important not only to write your plan in such a
way that it can be easily modified and updated, but also to re-evaluate and
update it on a regular basis to keep it in line with the SharePoint
environment it addresses.
Analyzing Your Systems: As-Is/To-Be
One way to
anticipate changes that may be required for your SharePoint disaster
recovery plan is by creating some key lists that track the current and
future state of your environment. Organizations are constantly
evaluating their IT systems to determine if they’re able to meet their
specific needs and learn what modifications, additions, or subtractions
they may make to them in the future. Often this analysis is broken into
two sections: As-Is and To-Be. As-Is analysis of a system examines the
business’s current users, processes, and data and compares them to the
existing IT system. This comparison is then used to evaluate how well
the system serves the needs and actions of the business and to establish
a baseline for the future state of the system. The future state is
defined in the To-Be analysis. The To-Be list defines the vision for the
business’s IT systems of the future, prioritizes features and
functionality, and establishes goals that upgrades should meet or
exceed.
An effective disaster
recovery plan is designed to meet the requirements and conditions set
forth by the As-Is list of an organization while keeping an eye toward
the state described by the To-Be list. A plan must encompass the current
system’s entire configuration, workflows, and data but also be flexible
enough to either handle or be modified to accommodate the projected
future state of the system. If a disaster recovery plan can’t grow with
your SharePoint farm as the farm’s role within your organization, and
thus its IT footprint, grows, the plan quickly loses its
effectiveness.
If your organization
doesn’t have official As-Is and To-Be lists that include your SharePoint
environment, consider compiling these items before finalizing your
SharePoint disaster recovery plan. You need to have a concrete
understanding of your system, its strengths and weaknesses, and its
projected future state to effectively know what needs to be preserved
and restored and how that could yield changes to your disaster recovery
plan in the coming years.
Modifying Your Plan
In general, your
organization should have procedures that govern the review and update of
approved documentation so that all documents are evaluated on a regular
basis (for example, every year) and updated accordingly. You may find
that, based on how your SharePoint system evolves
and grows, your disaster recovery plan requires more frequent care and
feeding. Take care to establish certain criteria that can trigger an
update to your plan, such as a major release for your system, the
deployment of new hardware, or the installation of service packs or
version upgrades for your software.
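A combination of a scheduled review and event-based triggers can be expressed as a simple check. The following is a minimal sketch, assuming a yearly interval and a hypothetical set of event names; neither the names nor the function come from SharePoint or any particular tool.

```python
from datetime import date, timedelta

# Assumed values for illustration; set these to match your own procedures.
REVIEW_INTERVAL = timedelta(days=365)
TRIGGER_EVENTS = {"major_release", "new_hardware", "service_pack", "version_upgrade"}


def review_needed(last_review, today, events_since_review):
    """Return the reasons (possibly none) that a plan review is due."""
    reasons = []
    if today - last_review >= REVIEW_INTERVAL:
        reasons.append("scheduled review interval elapsed")
    # Only events on the trigger list count; everything else is ignored.
    for event in sorted(TRIGGER_EVENTS & set(events_since_review)):
        reasons.append(f"trigger event: {event}")
    return reasons


print(review_needed(date(2024, 1, 15), date(2024, 6, 1),
                    ["service_pack", "new_list_created"]))
```

An empty list means no review is due yet. Returning the reasons rather than a bare true or false gives you something to record in the plan's change history.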
When you do modify the plan,
create a new version of its documentation so that you can maintain and
track a history of its changes over time. Ensure that the document again
goes through a full review and approval process so that all
stakeholders are made aware of the changes that have occurred in the
system and the disaster recovery plan itself. Allowing the plan to
gather dust while the state of your production SharePoint system evolves
presents a major risk to the plan’s relevance and effectiveness and
your ability to actually recover the system in a catastrophe.
Tip
Specialized applications and
systems, such as SunGard’s Living Disaster Recovery Planning System
(LDRPS), exist to serve and address the needs of disaster recovery
planners. These applications and systems can greatly simplify the
processes of disaster recovery documentation, change tracking, and
ongoing plan maintenance. If your organization contains a group with
formalized disaster recovery responsibility, check with them to see if
you could or should be leveraging such a system for your SharePoint
disaster recovery planning purposes. If the decision is in your hands,
investigate the use of one of these systems. It can save time, effort,
and most importantly, confusion—particularly when disaster strikes.
Expecting and Budgeting for Ongoing Maintenance
To make changes to your disaster recovery plan, you need to expend at least some
resources in the form of the time necessary to redesign the plan to
meet the changing needs of your systems as well as any additional
hardware or software that the redesigned plan may require. Be prepared
for expenses beyond time if the scope of your SharePoint farm grows,
because you’ll likely require further physical resources such as
expanded storage space or more servers, not to mention the possibility
of specialized backup and restore software. All these items can add
definite costs to your budget that you may not necessarily anticipate
once the disaster recovery plan is in place, but you should expect them
as part of your plan’s ongoing maintenance. As economic circumstances
fluctuate and available budgets grow and shrink, you must make sure that
sufficient resources are made available to support ongoing maintenance
of the plan.
Tip
The yearly cost of disaster
recovery maintenance is often tied to the disaster recovery design that
is implemented for a SharePoint farm. A best practice for most corporate
SharePoint farm owners is to calculate and budget for the cost of
ongoing disaster recovery maintenance at the same time they prepare a
capital asset request for the acquisition of a SharePoint environment
and the initial implementation of its disaster recovery strategy and
design.