Backing Up the Windows Server 2008 R2 Environment : Understanding Your Backup and Recovery Needs and Options

5/17/2011 4:37:43 PM

A key to creating a valuable backup and recovery plan is a clear understanding of how the computer and network infrastructure is configured, as well as how the business operates and uses that infrastructure. This discovery process involves mapping out the computer and network systems in place and documenting the business processes that depend on them. For example, an organization might process incoming orders from field sales representatives via fax transmissions of contracts that are accepted by a Windows Server 2008 R2 fax service. If the fax service is not available, no orders are processed. This is a simple example of how downtime of a Windows server can directly affect business operations. Understanding which systems and services are most important to the business helps IT staff decide which systems to recover first in the event of a large-scale disaster.

Identifying the Different Services and Technologies

Each deployed role, role service, feature, or application provided by a Windows Server 2008 R2 system provides a key system function, which in many cases is critical to the organization. Each application, role service, role, and feature installed on a Windows Server 2008 R2 system should be identified and documented so the IT group can have a clear view of the complexity of the environment as backup and recovery plans are being developed. It is very common for server and web-based applications to require special backup and restore procedures, and these are especially important to identify for disaster recovery purposes.

Identifying Single Points of Failure

A single point of failure is a device, application, or service in a computer and network infrastructure that provides an exclusive function with no redundancy. A common single point of failure in smaller organizations is a network switch that provides the connectivity between all of the servers, client workstations, firewalls, wireless access points, and routers on a network. In a Windows Server 2008 R2 Active Directory infrastructure, for example, Active Directory Domain Services (AD DS) inherently comes with its own set of single points of failure: its Flexible Single Master Operations (FSMO) roles. Each role provides an exclusive function either to the entire Active Directory forest or to a single domain, and if the domain controller hosting a role fails, that role becomes unavailable. Even though the FSMO roles are single points of failure, recovering a domain controller can be very simple and painless if proper backup and recovery planning is performed.
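As a rough illustration of the definition above, the following Python sketch flags every function in an inventory that is provided by exactly one host. The inventory, the host names, and the function name `single_points_of_failure` are hypothetical and exist only for this example; a real inventory would come from the discovery process described earlier.

```python
from typing import Dict, List

# Hypothetical inventory: each function mapped to the hosts that provide it.
INVENTORY: Dict[str, List[str]] = {
    "Core network switch": ["SW01"],
    "DNS": ["DC01", "DC02"],
    "Schema master (FSMO role)": ["DC01"],
    "Fax service": ["FAX01"],
}


def single_points_of_failure(inventory: Dict[str, List[str]]) -> List[str]:
    """Return the functions that depend on exactly one host, sorted by name."""
    return sorted(
        function
        for function, hosts in inventory.items()
        if len(set(hosts)) == 1  # one distinct host means no redundancy
    )


for function in single_points_of_failure(INVENTORY):
    print(f"No redundancy: {function} (only on {INVENTORY[function][0]})")
```

A FSMO role will always appear in such a report because only one domain controller can hold it at a time, so the point of listing it is to make sure its recovery is planned.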

Evaluating Different Disaster Scenarios

Before a backup and disaster recovery plan can be formulated, IT managers and administrators should meet with the business owners to discuss and decide on which types of failures or disasters should be planned for. Planning for every disaster scenario is nearly impossible or, more commonly, will exceed an organization’s backup and recovery budget, but discussing the likelihood of each scenario and evaluating how the scenario can impact the business is necessary.

Physical Disaster

A physical disaster is anything that can keep employees or customers from reaching their desired office or store location. Examples include natural disasters such as floods, fires, earthquakes, hurricanes, or tornadoes that can destroy an office. A physical disaster can also be a physical limitation, such as a damaged bridge or highway blockage caused by a car accident. When only physical access is limited or restricted, a remote access solution could reestablish connectivity between users and the corporate network.

Power Outage or Rolling Blackouts

Power outages can occur unexpectedly at any time. Some power outages are caused by bad weather and other natural disasters, while others are caused by high power consumption that overloads the system. When power systems are overloaded, rolling blackouts may occur. A rolling blackout is when a power company shuts off power to certain power subscribers or areas of service, so that it maintains power to critical services, such as fire departments, police departments, hospitals, and traffic lights. The rolling part of rolling blackouts is that the blackout is managed; after a predetermined amount of time, the power company will shut down a different power grid and restore power to a grid that was previously shut down. Of course, during power outages, many businesses are unable to function because the core of their work is conducted on computers or even telephone systems that require power to function.

Network Outage

Organizations that share data and applications between multiple offices and require access to the Internet as part of their daily business operations are susceptible to network outages that can cause severe loss of employee productivity and possibly revenue. Network outages can affect just a single computer, the entire office, or multiple offices depending on the cause of the outage. IT staff must take network outages into consideration when creating the backup and recovery plans.

Hardware Failures

Hardware failures seem to be the most common disaster encountered and are also the most common type of problem organizations plan for. Server hardware failures include failed motherboards, processors, memory, network interface cards, network cables, fiber cables, disk and HBA controllers, power supplies, and, of course, the hard disks in the local server or in a storage area network (SAN). Each of these failures can be dealt with differently, but to provide system- or server-level redundancy, key services should be deployed in a redundant configuration, such as the Failover Clustering provided with Windows Server 2008 R2 Enterprise Edition or Network Load Balancing (NLB).

Hard Drive Failure

Hard drives are indeed the most common type of computer- and network-related hardware failure organizations have to deal with. Windows Server 2008 R2 supports hot-swappable hard drives and two types of disks: basic disks, which provide backward compatibility, and dynamic disks, which allow software-level disk arrays to be configured without a separate hardware-based disk array controller. Also, both basic and dynamic disks, when used as data disks, can be moved to other servers easily to provide data or disk capacity elsewhere if a system hardware failure occurs and the data on these disks needs to be made available as soon as possible. Windows Server 2008 R2 also contains tools to provision, connect, and configure storage located on a SAN and can easily mount VHD files as operating system disks using Disk Management or diskpart.

Note

If hardware-level RAID is configured, the controller card stores the disk array configuration, and the manufacturer should be contacted to provide the tools or documentation necessary to back up, restore, rebuild, or re-create the configuration should a controller failure occur or should the disks need to be moved to a different machine with the same type of controller.

Software Corruption

Software corruption can occur at many different levels. Operating system files could be corrupted, antivirus software can interfere with the writing of a file or database and cause corruption, or a new application or driver installation could overwrite a critical file, leaving a system unstable or in a failed state. More common in today’s networks, a security, application, or system update can conflict with an existing application or service and cause undesirable issues.

Prioritizing the Recovery

After all of the computer services and applications used on a network are identified, and the disaster scenarios to be covered by the backup and recovery plan are decided, the next step is to prioritize how the recovery of critical systems and services will be executed. The prioritization usually involves getting the most critical services up and running first; this usually means networking services such as DNS and DHCP, as well as Active Directory domain controllers, especially on corporate networks that utilize Microsoft Windows servers and client operating systems.

Maintaining up-to-date backup and recovery plans requires following strict processes when changing an organization’s computer and network infrastructure. With an up-to-date technology priority list, administrators can tackle the planning for the most important services first to ensure that if a disaster strikes sooner rather than later, the most important systems are always protected and recoverable.

Identifying Bare Minimum Services

The bare minimum services are the fewest possible services and applications that must be up and running for business operations to continue. Only the top few services and applications in the technology priority list will become part of the bare minimum services list. For example, a bare minimum computer service for a retail outlet could be a server that runs the retail software package and manages the register and receipt printer. For a web-based company, it could be the web and e-commerce servers that process online orders.
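The relationship between the priority list and the bare minimum services list can be sketched in a few lines of Python. This is only an illustration: the service names, the numeric priorities, and the helper names `recovery_order` and `bare_minimum` are assumptions made for the example, and a real plan would assign priorities together with the business owners.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Service:
    name: str
    priority: int  # hypothetical ranking: 1 is recovered first


# Hypothetical technology priority list for a small retail organization.
SERVICES = [
    Service("File server", 5),
    Service("DNS", 1),
    Service("Retail point-of-sale server", 3),
    Service("Active Directory domain controller", 1),
    Service("DHCP", 2),
    Service("Intranet wiki", 8),
]


def recovery_order(services: List[Service]) -> List[Service]:
    """Sort by priority, breaking ties by name so the order is stable."""
    return sorted(services, key=lambda s: (s.priority, s.name))


def bare_minimum(services: List[Service], count: int) -> List[Service]:
    """Return only the top `count` services from the priority list."""
    return recovery_order(services)[:count]


for service in bare_minimum(SERVICES, 4):
    print(f"{service.priority}: {service.name}")
```

Keeping the cut-off (`count`) separate from the ranking makes it easy to revisit how many services belong on the bare minimum list without reordering the full list.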

Determining the Service-Level Agreement and Return-to-Operation Requirements

A service-level agreement (SLA) is an estimated planned uptime or availability time frame for a system, service, or application. SLAs are usually defined by hours per day, week, month, or year and are expressed in percentages. For example, if the corner grocery store claims to be open 24 hours a day, every day of the year, the grocery store SLA is 100%. Another example could be an organization’s electronic fax services that should be available 7 days a week between the hours of 5:00 a.m. and 11:00 p.m.

Many organizations hope to keep their most critical services operating 24 hours a day, 7 days a week, or as close to 100% planned uptime as is logistically possible. A few common SLA targets are included in the following list:

99.999% planned uptime results in about 5.26 minutes of planned downtime or maintenance per year.
99.99% planned uptime results in about 52.56 minutes of planned downtime or maintenance per year.
99.9% planned uptime results in 8 hours and 45.6 minutes of planned downtime or maintenance per year.
99.7% planned uptime results in about 26 hours and 17 minutes of planned downtime or maintenance per year.
99% planned uptime results in 87 hours and 36 minutes of planned downtime or maintenance per year.
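The downtime allowed by each target is simple arithmetic: the fraction of the year not covered by the uptime percentage. The Python sketch below reproduces the calculation, assuming a 365-day year (525,600 minutes); the function names are chosen for this example only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


def describe(minutes: float) -> str:
    """Format a number of minutes as whole hours plus remaining minutes."""
    hours, remainder = divmod(minutes, 60)
    return f"{int(hours)} h {remainder:.1f} min"


for target in (99.999, 99.99, 99.9, 99.7, 99.0):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {describe(budget)} of downtime per year")
```

The same function can be used for any other target an organization is considering, which makes it easier to discuss what each additional "nine" actually buys.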

Executives and managers alike know that maintaining 100% of planned uptime is not usually possible because of a number of factors. Many professionals might also consider that the SLA must account for the time to recover after a failure or disaster is encountered. Ensure that everyone understands whether the SLA refers to “planned” uptime only or to “planned and unplanned” uptime. The difference is huge. A recommendation is that an SLA is defined as planned uptime. The unplanned recovery time frame is defined as the Return to Operation (RTO) number for the remainder of this section.

The RTO defines how long it will take to recover a system, service, application, or business operation after a failure or disaster has occurred. Of course, the shorter the RTO time frame is, the more likely the backup and recovery solution costs will increase. For example, deploying a Windows Server 2008 R2 failover cluster can provide system recovery within seconds or minutes, but the hardware and software licensing costs would easily exceed the costs of a recovery plan that included diagnosing a hardware issue and waiting for a replacement part to arrive within a 4-hour window. The business owners or executives of an organization need to clearly understand how long it will take to recover from certain failures and that will help derive the final accepted backup and recovery solution.

Separating the SLA and RTO in disaster recovery documentation can be a very valuable tool to use when presenting the current or proposed computer and network infrastructure disaster recovery solution to executives, managers, auditors, and customers. For example, a service might be presented to customers with a 99.99% SLA. The same system can be presented in the finer details to have a maximum of an 8-hour RTO, which will still meet a 99.9% uptime in the event of a major disaster. This can also be worded as “This service will provide 99.9% to 99.99% availability.”
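The 8-hour example can be checked with a short calculation. The Python sketch below assumes a single outage lasting the full RTO in a 365-day year, measured against round-the-clock availability; the function names are hypothetical and used only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; assumes a non-leap year


def availability_after_outage(outage_hours: float) -> float:
    """Percent availability for a year that contains one outage of this length."""
    return 100 * (1 - outage_hours / HOURS_PER_YEAR)


def rto_fits_target(rto_hours: float, target_percent: float) -> bool:
    """True if one outage lasting the full RTO still meets the uptime target."""
    return availability_after_outage(rto_hours) >= target_percent


rto = 8  # hours, as in the example above
print(f"Availability with one {rto}-hour outage: "
      f"{availability_after_outage(rto):.3f}%")
print("Meets 99.9%: ", rto_fits_target(rto, 99.9))
print("Meets 99.99%:", rto_fits_target(rto, 99.99))
```

Under these assumptions, an 8-hour outage fits inside the roughly 8 hours and 45.6 minutes allowed by a 99.9% target but not inside the roughly 52.56 minutes allowed by 99.99%, which is why the two figures are quoted as a range.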
