A key to creating a valuable backup and recovery plan
is to have a clear understanding of how the computer and network
infrastructure is configured, as well as having an understanding of how
the business operates and utilizes the infrastructure. This discovery
process involves mapping out both the computer and network systems in
place and also documenting and understanding the business processes that
depend on the infrastructure. For example, an organization might
process incoming orders from field sales representatives via fax
transmissions of contracts that are accepted by a Windows Server 2008 R2
fax service. If the fax service is not available, no orders are
processed. This is just a simple example of when downtime of a Windows
server can directly affect business operations. Understanding which
systems and services are most important to the business helps IT
staff prioritize which systems will be recovered first
in the event of a large-scale disaster.
Identifying the Different Services and Technologies
Each deployed role, role
service, feature, or application provided by a Windows Server 2008 R2
system provides a key system function, which in many cases is critical
to the organization. Each application, role service, role, and feature
installed on a Windows Server 2008 R2 system should be identified and
documented so the IT group can have a clear view of the complexity of
the environment as backup and recovery plans are being developed. It is
very common for server and web-based applications to require special
backup and restore procedures, and these are especially important to
identify for disaster recovery purposes.
Identifying Single Points of Failure
A single point of failure is
a device, application, or service on a computer and networking
infrastructure that provides an exclusive function with no redundancy. A
common single point of failure in smaller organizations is a network
switch that provides the connectivity between all of the servers, client
workstations, firewalls, wireless access points, and routers on a
network. Within a Windows Server 2008 R2 Active Directory infrastructure,
for example, Active Directory Domain Services (AD DS) inherently
comes with its own set of single points of failure: its Flexible
Single Master Operations (FSMO) roles. These roles provide an exclusive
function to the entire Active Directory forest or just a single domain,
and if the designated domain controller hosting that role fails, these
hosted FSMO roles become unavailable. Even though the FSMO roles are
single points of failure, recovering a domain controller can be very
simple and painless if proper backup and recovery planning is performed.
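As a minimal sketch of how single points of failure might be flagged from a documented inventory, the following Python example groups providers by the function they deliver and reports any function with exactly one provider. The inventory entries, server names, and the single_points_of_failure helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical inventory of (function, provider) pairs. The names are
# illustrative assumptions and do not describe any real environment.
INVENTORY = [
    ("core switching", "SWITCH01"),
    ("DNS", "DC01"),
    ("DNS", "DC02"),
    ("PDC emulator (FSMO)", "DC01"),
    ("fax service", "FAX01"),
]


def single_points_of_failure(inventory):
    """Return {function: provider} for functions that have exactly one provider."""
    providers = defaultdict(set)
    for function, provider in inventory:
        providers[function].add(provider)
    return {
        function: next(iter(hosts))
        for function, hosts in providers.items()
        if len(hosts) == 1
    }


if __name__ == "__main__":
    for function, host in sorted(single_points_of_failure(INVENTORY).items()):
        print(f"{function}: provided only by {host}")
```

In this made-up inventory, DNS is served by two hosts and is not flagged, while the switch, the FSMO role holder, and the fax server are.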
Evaluating Different Disaster Scenarios
Before
a backup and disaster recovery plan can be formulated, IT managers and
administrators should meet with the business owners to discuss and
decide on which types of failures or disasters should be planned for.
Planning for every disaster
scenario is nearly impossible or, more commonly, will exceed an
organization’s backup and recovery budget, but discussing the likelihood
of each scenario and evaluating how the scenario can impact the
business is necessary.
Physical Disaster
A physical disaster is
anything that can keep employees or customers from reaching their
desired office or store location. Examples include natural disasters
such as floods, fires, earthquakes, hurricanes, or tornadoes that can
destroy an office. A physical disaster can also be a physical
limitation, such as a damaged bridge or highway blockage caused by a car
accident. When only physical access is limited or restricted, a remote
access solution could reestablish connectivity between users and the
corporate network.
Power Outage or Rolling Blackouts
Power outages can occur
unexpectedly at any time. Some power outages are caused by bad weather
and other natural disasters; others are caused by high
power consumption that overloads the power system. When power systems are
overloaded, rolling blackouts may occur. A rolling blackout is when a
power company shuts off power to certain power subscribers or areas of
service, so that it maintains power to critical services, such as fire
departments, police departments, hospitals, and traffic lights. The
rolling part of rolling blackouts is that the blackout is managed; after
a predetermined amount of time, the power company will shut down a
different power grid and restore power to a previously shut-down grid. Of
course, during power outages, many businesses are unable to function
because the core of their work is conducted on computers or even
telephone systems that require power to function.
Network Outage
Organizations that
share data and applications between multiple offices and require access
to the Internet as part of their daily business operations are
susceptible to network outages that can cause severe loss of employee
productivity and possibly revenue. Network outages can affect just a
single computer, the entire office, or multiple offices depending on the
cause of the outage. IT staff must take network outages into
consideration when creating the backup and recovery plans.
Hardware Failures
Hardware failures seem to be the
most common disaster encountered and, not coincidentally, are the most common
type of problem organizations plan for. Server hardware failures include
failed motherboards, processors, memory, network interface cards,
network cables, fiber cables, disk and HBA controllers, power supplies,
and, of course, the hard disks in the local server or in a storage area
network (SAN). Each of these failures can be dealt with differently, but
to provide system- or server-level redundancy, key services should be deployed
in a redundant cluster configuration, such as that provided by Windows
Server 2008 R2 Enterprise Edition Failover Clustering or Network Load
Balancing (NLB).
Hard Drive Failure
Hard drives are indeed the most
common type of computer- and network-related hardware failure
organizations have to deal with. Windows Server 2008 R2 supports
hot-swappable hard drives and two types of disks: basic disks, which
provide backward compatibility, and dynamic disks, which allow
software-level disk arrays to be configured without a separate
hardware-based disk array controller. Also, both basic and dynamic
disks, when used as data disks, can be moved to other servers easily to
provide data or disk capacity elsewhere if a system hardware failure
occurs and the data on these disks needs to be made available as soon as
possible. Windows Server 2008 R2 also contains tools to provision,
connect, and configure storage located on a SAN and can easily mount VHD
files as operating system disks using Disk Management or diskpart.
Note
If hardware-level RAID is
configured, the controller card stores the disk array configuration, and
the manufacturer should be contacted to provide the tools or
documentation necessary to back up, restore, rebuild, or re-create the
configuration should a controller failure occur or should the disks need to
be moved to a different machine with the same type of controller.
Software Corruption
Software corruption can occur
at many different levels. Operating system files could be corrupted,
antivirus software can interfere with the writing of a file or database
causing corruption, or a new application or driver installation could
overwrite a critical file leaving a system unstable or in a failed
state. More commonly in today’s networks, a security,
application, or system update conflicts with an existing application or
service and causes undesirable side effects.
Prioritizing the Recovery
After all of the computer
services and applications used on a network are identified, and the
disaster scenarios to be covered by the
backup and recovery plan are decided, the next step is to prioritize how
the recovery of critical systems and services will be executed. The
prioritization usually involves getting the most critical services up
and running first; this usually requires networking services such as DNS
and DHCP, as well as Active Directory domain controllers, especially on
corporate networks that utilize Microsoft Windows servers and client
operating systems.
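One way to turn such a prioritization into a concrete recovery order is to record which services each service depends on and sort them so that dependencies come first. The sketch below uses the topological sorter from the Python standard library (Python 3.9 or later); the dependency map and the recovery_order helper are hypothetical assumptions, and real dependencies must come from the discovery process described earlier.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each service lists the services that must be
# running before it can be recovered. These relationships are illustrative
# assumptions, not a statement about any particular network.
DEPENDS_ON = {
    "DNS": [],
    "DHCP": [],
    "Active Directory domain controller": ["DNS"],
    "file server": ["Active Directory domain controller"],
    "fax service": ["Active Directory domain controller"],
}


def recovery_order(depends_on):
    """Return the services ordered so that every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, service in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(f"{step}. {service}")
```

The sorter raises an error if the map contains a circular dependency, which is itself a useful finding during planning.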
Maintaining up-to-date
backup and recovery plans requires following strict processes when
changing an organization’s computer and network infrastructure. With an
up-to-date technology priority list, administrators can tackle the
planning for the most important services first to ensure that if a
disaster strikes sooner rather than later, the most important systems
are always protected and recoverable.
Identifying Bare Minimum Services
The
bare minimum services are the fewest possible services and applications
that must be up and running for business operations to continue. Only
the top few services and applications in the technology prioritized list
will become part of the bare minimum services list. For example, a bare
minimum computer service for a retail outlet could be a server that
runs the retail software package and manages the register and receipt
printer. For a web-based company, it could be the web and e-commerce
servers that process online orders.
Determining the Service-Level Agreement and Return-to-Operation Requirements
A service-level
agreement (SLA) is an estimated planned uptime or availability time
frame for a system, service, or application. SLAs are usually defined by
hours per day, week, month, or year and are expressed in percentages.
For example, if the corner grocery store claims to be open 24 hours a
day, every day of the year, the grocery store SLA is 100%. Another
example could be an organization’s electronic fax services that should
be available 7 days a week between the hours of 5:00 a.m. and 11:00 p.m.
Many organizations
hope to achieve and maintain operation of the most critical services 24
hours a day, 7 days a week, or as close to 100% planned uptime as is logistically
possible. A few common SLA targets are included in the following list:
99.999% planned uptime results in 5.26 minutes of planned downtime or maintenance per year.
99.99% planned uptime results in 52.6 minutes of planned downtime or maintenance per year.
99.9% planned uptime results in 8 hours, 45.6 minutes of planned downtime or maintenance per year.
99.7% planned uptime results in 26 hours and 17 minutes of planned downtime or maintenance per year.
99% planned uptime results in 87 hours and 36 minutes of planned downtime or maintenance per year.
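The figures in the preceding list follow from simple arithmetic: the downtime allowance is the fraction of the year not covered by the uptime percentage. The following Python sketch reproduces the calculation, assuming a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(uptime_percent):
    """Minutes per year that may be spent down at the given planned uptime."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


def describe(uptime_percent):
    """Format the yearly downtime allowance as hours and minutes."""
    hours, minutes = divmod(allowed_downtime_minutes(uptime_percent), 60)
    return f"{uptime_percent}% uptime: {int(hours)} h {minutes:.1f} min per year"


if __name__ == "__main__":
    for target in (99.999, 99.99, 99.9, 99.7, 99):
        print(describe(target))
```

A leap year adds one day to the total, which changes the results only slightly.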
Executives and managers alike
know that maintaining 100% planned uptime is not usually
possible because of a number of factors. Many professionals might
also consider that the SLA must account for the time to recover after a
failure or disaster is encountered. Ensure that everyone understands
whether the SLA covers only “planned” uptime or “planned and unplanned” time.
The difference is huge. A recommendation is that an SLA be defined as
planned uptime. The unplanned recovery time frame is referred to as the
Return to Operation (RTO) number for the remainder of this section.
The RTO defines how long it
will take to recover a system, service, application, or business
operation after a failure or disaster has occurred. Of course, the
shorter the RTO time frame, the higher the backup and recovery
solution costs are likely to be. For example, deploying a Windows Server
2008 R2 failover cluster can provide system recovery within seconds or
minutes, but the hardware and software licensing costs would easily exceed
the costs of a recovery plan that included diagnosing a hardware issue
and waiting for a replacement part to arrive within a 4-hour window. The
business owners or executives of an organization need to clearly
understand how long it will take to recover from certain failures;
that understanding will help determine the final accepted backup and recovery solution.
Separating the SLA
and RTO in disaster recovery documentation can be a very valuable tool
to use when presenting the current or proposed computer and network
infrastructure disaster recovery solution to executives, managers,
auditors, and customers. For example, a service might be presented to
customers with a 99.99% SLA. In the finer details, the same system can be
presented as having a maximum RTO of 8 hours, which will still meet
99.9% uptime in the event of a major disaster. This can also be worded
as “This service will provide 99.9% to 99.99% availability.”
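The relationship between an RTO and an availability percentage can be checked with a short calculation. The Python sketch below assumes a 365-day year and a single outage that lasts exactly as long as the RTO; the function names and the optional planned downtime parameter are illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def availability_after_outage(outage_hours, planned_downtime_hours=0.0):
    """Percent availability for a year that contains the given downtime."""
    downtime = outage_hours + planned_downtime_hours
    return 100 * (1 - downtime / HOURS_PER_YEAR)


def meets_target(outage_hours, target_percent, planned_downtime_hours=0.0):
    """True if the year still reaches the target availability percentage."""
    achieved = availability_after_outage(outage_hours, planned_downtime_hours)
    return achieved >= target_percent


if __name__ == "__main__":
    print(f"{availability_after_outage(8):.4f}% after one 8-hour outage")
    print("Meets 99.9%: ", meets_target(8, 99.9))
    print("Meets 99.99%:", meets_target(8, 99.99))
```

Under these assumptions, one 8-hour outage stays within the 99.9% allowance of 8 hours, 45.6 minutes. The margin is thin, however. If the year also uses the full planned maintenance allowance of a 99.99% SLA (about 0.876 hours), the combined downtime slightly exceeds the 99.9% allowance.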