The domain of business
continuity planning has its own set of concepts, terms,
and processes. To build on the concepts and drivers
associated with disaster recovery planning, Figure 1
zooms out to look at the larger process of business
continuity planning and where SharePoint disaster recovery planning fits
into it.
As illustrated in Figure 1, business continuity planning involves three distinct
stages:
The risk assessment. The risk assessment is where disaster recovery planning
begins. It entails the analysis of a SharePoint farm and the business
processes tied to it from the perspective of vulnerabilities, threats,
and general exposures that are introduced simply by having the farm in
production and in use by business users. The identified risks
typically map to one or more SharePoint functions or usage scenarios.
“Collaboration on XYZ project,” “business intelligence functions
leveraged by executives,” and “workflow that is used to approve public
communications in the ABC document library” are examples of such
functions and scenarios.
The business impact analysis (BIA). The results of the risk assessment
serve as the input to the BIA. The BIA attempts to
equate the loss of a particular SharePoint capability or function (such
as the loss of business intelligence functions leveraged by executives)
with the projected magnitude or expected monetary impact associated with
the loss (for example, $10,000 per day in investments). Equating
outages to exact losses is difficult at this stage due to all the
variables that are typically in play, but the results of the analysis
serve as a valuable prioritization tool in the next stage of the
business continuity planning process.
The business continuity plan (BCP). Armed with the results of the BIA,
business continuity planners possess the data they need to prioritize
and address the risk areas identified during the risk assessment. Risk
areas that the BIA identifies as carrying the largest
potential for loss or adverse business exposure are addressed more
urgently, whereas those with lesser potential impact are addressed when
the opportunity arises or is most cost effective. As described earlier,
the BCP that results from this process addresses both the technological
areas included in the disaster recovery plan (such as “restore the
system and associated databases from backup”) and associated business
processes (for example, “have the accounts payable team begin using the
new repository at URL http://DRAccountsPayable instead of the standard production URL”). A
BCP typically includes other prescriptive advice and workarounds to
minimize or mitigate the impact of an outage.
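The way BIA results feed prioritization in the BCP can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `RiskItem` class, the `prioritize` function, the outage durations, and every dollar figure other than the $10,000-per-day example are assumptions made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class RiskItem:
    """One SharePoint function or usage scenario from the risk assessment."""
    function: str
    daily_loss_usd: float        # projected loss per day of outage (estimate)
    expected_outage_days: float  # hypothetical planning assumption

    @property
    def projected_impact(self) -> float:
        # Rough BIA figure: daily loss multiplied by the assumed outage length.
        return self.daily_loss_usd * self.expected_outage_days


def prioritize(items):
    """Order risk items from largest to smallest projected impact."""
    return sorted(items, key=lambda r: r.projected_impact, reverse=True)


# Hypothetical inputs; only the $10,000-per-day figure comes from the text.
risks = [
    RiskItem("Collaboration on XYZ project", 2_000, 3),
    RiskItem("Business intelligence functions leveraged by executives", 10_000, 2),
    RiskItem("Approval workflow in the ABC document library", 500, 5),
]

ranked = prioritize(risks)
for item in ranked:
    print(f"{item.projected_impact:>10,.0f}  {item.function}")
```

Items at the top of the ranked list are the ones a planner would address most urgently. As the text notes, the estimates are inexact, so the ordering matters more than the absolute numbers.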
As shown in Figure 1,
a disaster recovery plan is one component of the ultimate business
continuity plan that results from both the risk assessment and BIA of
identified risks. Of course, the disaster recovery plan does not simply
arise from a determination regarding the potential impact of an outage.
The purposes for which a
SharePoint farm is used, along with acceptable outage windows in the
event of a disaster, ultimately drive the technological aspects of the
disaster recovery plan that an organization crafts and implements. Two
key concepts determine what constitutes an “acceptable” outage window:
Recovery time objective (RTO). The RTO of a disaster
recovery plan defines the amount of time that can elapse between the
occurrence of a disaster and the affected system being returned to an
agreed-upon level of operational readiness. Put simply, an RTO defines
the time you have to get a system back up and running after a disaster.
It is typically during this period that the steps of a disaster recovery
plan are executed. A highly critical SharePoint system may have a
zero or near-zero RTO (that is, the failure of a production system immediately
results in a backup system taking over). At the other extreme, a farm
that handles tertiary business functions may have an RTO that is
measured in weeks to support the acquisition of new hardware and the
ultimate rebuild of the farm from scratch.
Recovery point objective (RPO). Whereas RTOs are forward-looking, an
RPO defines a period of time prior to any disaster during which data loss may
(and likely will) occur. Put another way, an RPO defines
the maximum amount of data loss that’s deemed acceptable in a disaster.
Data that existed prior to the point in time defined by the RPO can be
restored or recovered, whereas data after that point may not. As you
might expect, a highly critical SharePoint system may have a disaster
recovery plan with a zero or near-zero RPO that tolerates little or no data
loss. Tertiary systems, on the other hand, may have RPOs that are
measured in hours or days.
To illustrate the concepts of
RTO and RPO, consider the disaster recovery plan profile shown in
Figure 2. The requirements in this plan are typical of
less-critical systems, where some amount of data loss and downtime is
deemed acceptable in the event of a disaster.
In this disaster recovery
plan, a disaster occurs and is declared at 7 a.m. The disaster recovery
plan mandates an RPO of 12 hours and an RTO of 24 hours. To satisfy the
RPO requirement of this plan, a backup or some capture of relevant data
and state must have been performed in the 12 hours leading up to the
declaration of the disaster (that is, no earlier than 7 p.m. the previous
evening). At the same time, the RTO requirement
states that the system must be restored to a functional state (qualified
within the disaster recovery plan) within 24 hours of the disaster’s
occurrence (that is, by 7 a.m. the following day).
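The arithmetic behind this plan can be sketched as follows. The function name `recovery_windows` and the calendar date are arbitrary choices made for illustration; only the 7 a.m. declaration time, the 12-hour RPO, and the 24-hour RTO come from the plan described above.

```python
from datetime import datetime, timedelta


def recovery_windows(disaster_time, rpo, rto):
    """Return (oldest acceptable backup time, restore deadline).

    The RPO looks backward from the disaster; the RTO looks forward.
    """
    return disaster_time - rpo, disaster_time + rto


# Disaster declared at 7 a.m.; the date itself is an arbitrary placeholder.
disaster = datetime(2024, 1, 15, 7, 0)

oldest_backup, restore_deadline = recovery_windows(
    disaster,
    rpo=timedelta(hours=12),
    rto=timedelta(hours=24),
)

print(oldest_backup)     # 2024-01-14 19:00:00
print(restore_deadline)  # 2024-01-16 07:00:00
```

Any usable backup must be no older than the first value, and the farm must be back in its agreed-upon functional state by the second.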
Figure 3
presents a different set of requirements for recovery when the disaster
is declared at 7 a.m. The RTO and RPO shown are more typical of a
SharePoint farm that is of greater importance to the organization that
uses it. With an RPO window of one hour and an RTO window of 30
minutes, the potential overall outage window is significantly smaller
than the one illustrated in Figure 2.
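To make the contrast between the two profiles concrete, the sketch below checks one hypothetical recovery against both sets of objectives. The function `meets_objectives`, the backup time, and the restore time are invented for the example; the RPO and RTO values are the ones given for the two plans.

```python
from datetime import datetime, timedelta


def meets_objectives(disaster, last_backup, restored_at, rpo, rto):
    """Return (rpo_met, rto_met) for a single recovery scenario."""
    rpo_met = disaster - last_backup <= rpo   # data loss window
    rto_met = restored_at - disaster <= rto   # downtime window
    return rpo_met, rto_met


# Hypothetical scenario: the date and both timestamps are placeholders.
disaster = datetime(2024, 1, 15, 7, 0)      # declared at 7 a.m.
last_backup = datetime(2024, 1, 14, 23, 0)  # 8 hours before the disaster
restored_at = datetime(2024, 1, 15, 13, 0)  # 6 hours after the disaster

# Less-critical profile: RPO of 12 hours, RTO of 24 hours.
relaxed = meets_objectives(disaster, last_backup, restored_at,
                           timedelta(hours=12), timedelta(hours=24))

# More important farm: RPO of one hour, RTO of 30 minutes.
strict = meets_objectives(disaster, last_backup, restored_at,
                          timedelta(hours=1), timedelta(minutes=30))

print(relaxed)  # (True, True)
print(strict)   # (False, False)
```

The same backup schedule and recovery effort that comfortably satisfies the first plan fails both objectives of the second.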
As you might imagine,
implementing a disaster recovery solution to address the RTO and RPO
requirements illustrated by the plan shown in Figure 3
carries a different set of challenges than meeting the requirements for
the plan shown in Figure 2. Technical strategies and
supplemental equipment requirements vary significantly between the two.
In a perfect world,
all disaster recovery strategies would involve no loss of data (that is,
have a zero RPO window) and provide instant failover (zero RTO).
Unfortunately, the cost of such strategies for SharePoint farms is
prohibitive for all but the most critical of business
uses. As part of their disaster recovery planning, most organizations
discover that as RPO and RTO target windows shrink, the cost of an
associated disaster recovery strategy goes up. The challenge then
becomes balancing data loss and downtime against the total cost of
implementing an appropriate and effective disaster recovery strategy.
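One way to picture this balancing act is to pick the least expensive strategy that still satisfies the target windows. The sketch below is purely illustrative: the strategy names, their RPO and RTO capabilities, and the annual costs are all hypothetical values, not figures for any real SharePoint deployment.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    rpo: timedelta        # worst-case data loss the strategy allows
    rto: timedelta        # worst-case time to restore service
    annual_cost_usd: int  # hypothetical cost figure


# Hypothetical candidates; every number here is an assumption.
CANDIDATES = [
    Strategy("Nightly backup, rebuild on demand",
             timedelta(hours=24), timedelta(hours=72), 10_000),
    Strategy("Frequent backups plus warm standby farm",
             timedelta(hours=1), timedelta(hours=4), 60_000),
    Strategy("Continuous replication with automatic failover",
             timedelta(0), timedelta(0), 250_000),
]


def cheapest_meeting(target_rpo, target_rto,
                     candidates=CANDIDATES) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets both targets, if any."""
    eligible = [s for s in candidates
                if s.rpo <= target_rpo and s.rto <= target_rto]
    return min(eligible, key=lambda s: s.annual_cost_usd, default=None)


relaxed_choice = cheapest_meeting(timedelta(hours=12), timedelta(hours=24))
strict_choice = cheapest_meeting(timedelta(hours=1), timedelta(minutes=30))

print(relaxed_choice.name, relaxed_choice.annual_cost_usd)
print(strict_choice.name, strict_choice.annual_cost_usd)
```

With these made-up numbers, tightening the targets from a 12-hour RPO and 24-hour RTO to a one-hour RPO and 30-minute RTO forces a move to the most expensive option, which mirrors the relationship between shrinking windows and rising cost described above.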