Building fault-tolerant Windows Server 2008 R2
systems by utilizing the built-in clustering technologies consists of
carefully planning and configuring server hardware and software,
planning and configuring the network devices that connect the server to
the network, and providing reliable power for the server. Purchasing
high-quality server and network hardware is a good start to building a
fault-tolerant system, but the proper configuration and selection of
this hardware is equally important. Also, providing this equipment with
reliable power and redundant backup power from battery sources and
possibly generators can increase reliability of the servers as well as
the networking infrastructure. Last, but not least, properly tuning the
server operating systems to streamline performance for the desired
roles, role services, features, and applications helps enhance server
availability and stability.
Powering the Computer and Network Infrastructure
Powering Windows Server
2008 R2 servers and network hardware with battery or generator-backed
power sources not only provides these devices with conditioned line
power by removing voltage spikes and providing steady line voltage
levels, but it also provides alternative power when unexpected blackouts
or brownouts occur. Many organizations cannot afford to implement
redundant power sources or generators to power the offices, data
centers, and server rooms. For these organizations, the best approach to
providing reliable power to the computer and network infrastructure is
to deploy uninterruptible power supplies (UPSs) with battery-backed
power. With an online (double-conversion) UPS, power is normally
supplied from the batteries, which are continually charged by the
utility line power; standby and line-interactive units switch to battery
only when line power fails or sags. When the line power fails, a
properly sized UPS provides ample time for end users to save their data
to the server and for the server or network device to be shut down
gracefully without risk of damaging hardware or corrupting data.
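What "properly sized" means can be approximated with simple arithmetic.
The following is a minimal sketch, not a vendor sizing method: the
function name, the 90 percent inverter efficiency, and the 80 percent
usable battery fraction are all illustrative assumptions, and real
runtime figures should come from the UPS manufacturer's runtime charts.

```python
def estimate_runtime_minutes(battery_wh, load_watts,
                             inverter_efficiency=0.9, usable_fraction=0.8):
    """Rough UPS runtime estimate, in minutes, for a constant load.

    The efficiency and usable-fraction defaults are assumptions for
    illustration only; they are not taken from any specific UPS model.
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    # Energy that can actually be delivered to the load.
    usable_wh = battery_wh * usable_fraction * inverter_efficiency
    # Watt-hours divided by watts gives hours; convert to minutes.
    return usable_wh / load_watts * 60


if __name__ == "__main__":
    # Hypothetical example: a 600 Wh battery pack feeding a 400 W server.
    runtime = estimate_runtime_minutes(battery_wh=600, load_watts=400)
    print(f"Estimated runtime: {runtime:.1f} minutes")
```

Comparing the estimate against the time needed to save data and shut
down cleanly gives a first indication of whether a given UPS is large
enough for the load it protects.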
UPS
manufacturers commonly provide software that can send network
notifications, run scripts, or even gracefully shut down servers
automatically when power thresholds are reached. Of course, if end-user
data is important, each end-user workstation and the network switches
that connect these workstations to the computer and network
infrastructure should also be protected with UPSs that can provide at
least 5 to 10 minutes of battery-backup power.
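The threshold-driven behavior described above can be sketched as a small
decision function. This is only an illustration of the idea and not the
logic of any particular vendor's UPS software; the function name, the
action names, and the 5-minute shutdown threshold are hypothetical.

```python
def choose_action(on_battery, runtime_minutes, shutdown_below=5.0):
    """Pick an action for the current UPS state.

    Returns "none" while on utility power, "notify" while running on
    battery with runtime to spare, and "shutdown" once the remaining
    runtime reaches the (assumed) shutdown threshold.
    """
    if not on_battery:
        return "none"
    if runtime_minutes <= shutdown_below:
        return "shutdown"
    return "notify"


if __name__ == "__main__":
    # Hypothetical sequence of readings during a power outage.
    for minutes_left in (12.0, 8.0, 4.5):
        print(minutes_left, choose_action(True, minutes_left))
```

In practice the "notify" and "shutdown" results would be mapped to
whatever notification or script hooks the UPS software provides.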
Finally, most computer and network hardware manufacturers offer device
configurations that incorporate redundant power supplies designed to
keep the system powered up in the event of a single power supply
failure.
Designing Fault-Tolerant IP Networks
Network design can
also incorporate fault tolerance by creating redundant network routes
and by utilizing technologies that can group devices together for the
purposes of load balancing and device failover. Load balancing is the
process of spreading requests across multiple devices to keep individual
device load at an acceptable level. Failover is the process of moving
services offered on one device to another upon device failure, to
maintain availability. Common scenarios for creating fault-tolerant IP
networks can include, but are not limited to, the following:
Acquiring multiple network connections between the data center and the Internet—
This includes using different Internet service providers. Ideally, the
connections should not terminate at the same external telco box on the
street, because that box becomes a single point of failure if it is hit
by a vehicle or otherwise cut off from communications.
Deploying
multiple and redundant firewalls, virtual private networks (VPNs), and
network routers that will failover to one another—
This usually involves software or hardware configurations that allow
each of the devices to communicate with one another to detect failures.
These devices, when deployed in redundant configurations, can be
leveraged in an active/passive configuration where only a single primary
device is used and the secondary device only comes online when the
primary fails. Alternatively, in many cases these devices can be used in
an active/active configuration that disperses or distributes the load
and requests across each device and when a single device fails, the
remaining device handles the entire load.
Deploying critical servers with multiple network adapters connected to separate network switches—
This allows a server to be connected and available on different
switches in case a single network card in the server fails or if the
port or the entire network switch or blade fails.
Deploying hardware-based NLB devices—
Many network switches, routers, and certain devices created just for
this purpose can provide some, if not all, of the functionality included
in Windows Server 2008 R2 NLB. This, of course, might be the best
choice for load balancing at the network level when organizations deploy
and support systems other than Windows Server 2008 R2 and when they
also need to load-balance network devices, such as firewalls and VPN
devices.
Deploying servers with multiple network adapters using third-party network teaming software— This
configuration uses third-party software installed and configured on a
server to create a new virtual network adapter that provides access to
the server through one or all of the physical network adapters that are
part of the team.
Windows Server 2008 R2 supports teamed network adapters as long as the
drivers and software are certified to work with Windows Server 2008 R2.
Note
If the Windows Server 2008 R2 system utilizes iSCSI storage, teaming is
not supported on the network adapters designated for iSCSI
communications.
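The difference between the active/active and active/passive
configurations described in this section can be illustrated with a short
sketch. It models only the routing decision; the class, method, and
device names are hypothetical, and real devices detect failures through
their own heartbeat or health-check mechanisms rather than through
manual calls such as mark_failed.

```python
class DevicePool:
    """Toy model of a redundant device group (illustrative only)."""

    def __init__(self, devices):
        self.devices = list(devices)      # priority order
        self.healthy = set(self.devices)
        self._next = 0                    # round-robin position

    def mark_failed(self, device):
        self.healthy.discard(device)

    def mark_recovered(self, device):
        if device in self.devices:
            self.healthy.add(device)

    def route_active_active(self):
        """Load balancing: rotate requests across all healthy devices."""
        for _ in range(len(self.devices)):
            device = self.devices[self._next % len(self.devices)]
            self._next += 1
            if device in self.healthy:
                return device
        raise RuntimeError("no healthy devices available")

    def route_active_passive(self):
        """Failover: use the first healthy device in priority order."""
        for device in self.devices:
            if device in self.healthy:
                return device
        raise RuntimeError("no healthy devices available")


if __name__ == "__main__":
    pool = DevicePool(["fw1", "fw2"])     # hypothetical firewall names
    print([pool.route_active_active() for _ in range(4)])
    pool.mark_failed("fw1")
    # After fw1 fails, the remaining device handles every request.
    print([pool.route_active_active() for _ in range(2)])
    print(pool.route_active_passive())
```

In both modes the surviving device ends up carrying the entire load,
which is why each device in a redundant pair should be sized to handle
the full load on its own.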
Designing Fault-Tolerant Server Disks
Many Windows Server 2008 R2
systems that will be used for NLB or failover clusters are deployed with
local disk storage. The local disks commonly store the operating system
files as well as the necessary service or application files. Each
system that will participate in a cluster should have the local disks
and volumes configured exactly the same, including drive letters and any
mount point assignments. When local disks are used to provide the
operating system and application or service core files, the local disks
should be deployed using redundant, fault-tolerant configurations. There
are two main ways to add fault tolerance to the local disks in a Windows
Server 2008 R2 system. The first is creating a redundant array of
inexpensive disks (RAID) using the disk controller configuration
utilities (known as hardware-level RAID), and the second is creating
RAID volumes on dynamic disks using the Disk Management console from
within the operating system (known as software-level RAID).
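The requirement that every cluster node use identical drive letters and
mount points lends itself to a simple consistency check. The sketch
below compares two layouts supplied as plain dictionaries; it does not
query Windows, and the function name, the dictionary structure, and the
sample volumes are hypothetical.

```python
def layout_differences(reference, other):
    """Return the sorted drive letters or mount points whose settings differ.

    Each layout maps a drive letter or mount point to a dictionary of
    properties; an entry missing from either node counts as a difference.
    """
    keys = set(reference) | set(other)
    return sorted(k for k in keys if reference.get(k) != other.get(k))


if __name__ == "__main__":
    # Hypothetical layouts collected from two prospective cluster nodes.
    node1 = {
        "C:": {"label": "System", "size_gb": 100},
        "D:": {"label": "Apps", "size_gb": 200},
        "D:\\Logs": {"label": "Logs", "size_gb": 50},
    }
    node2 = {
        "C:": {"label": "System", "size_gb": 100},
        "D:": {"label": "Apps", "size_gb": 300},
    }
    print(layout_differences(node1, node2))
```

An empty result indicates that the two layouts match; anything else
lists the volumes to correct before the system joins the cluster.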
Using two or more disks,
different RAID-level arrays can be configured to provide fault tolerance
that can withstand disk failures and still provide uninterrupted disk
access. Implementing hardware-level RAID configured, stored, and managed
by the system’s disk controllers is preferred over the software-level
RAID configurable within Windows Server 2008 R2. Windows Server 2008 R2
dynamic disk mirrored and RAID-5 volumes are managed by the operating
system and add some processing load to the server. Another reason to
prefer hardware-level RAID is that the configuration of the disks does
not depend on the operating system, which gives administrators greater
flexibility when it comes to recovering server systems and performing
upgrades.
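The trade-off between usable space and fault tolerance for the two
redundant levels mentioned here, RAID-1 and RAID-5, can be worked out
with a short calculation. This is a simplified sketch: the function name
is hypothetical, all disks are assumed to be the same size, and RAID-1
is modeled as a plain two-disk mirror.

```python
def raid_summary(level, disk_count, disk_size_gb):
    """Return usable capacity and tolerated disk failures for RAID-1 or RAID-5."""
    if level == 1:
        if disk_count != 2:
            raise ValueError("this sketch models RAID-1 as a two-disk mirror")
        # Both disks hold the same data, so only one disk of space is usable.
        usable_gb = disk_size_gb
    elif level == 5:
        if disk_count < 3:
            raise ValueError("RAID-5 requires at least three disks")
        # One disk's worth of space is consumed by distributed parity.
        usable_gb = (disk_count - 1) * disk_size_gb
    else:
        raise ValueError("only RAID levels 1 and 5 are modeled here")
    # Both levels survive the loss of a single disk.
    return {"usable_gb": usable_gb, "tolerated_failures": 1}


if __name__ == "__main__":
    print(raid_summary(1, 2, 500))   # mirrored pair of 500 GB disks
    print(raid_summary(5, 4, 500))   # four 500 GB disks with parity
```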
As a best practice, Windows
Server 2008 R2 can be deployed with the operating system disks stored on
RAID-1, or mirrored, disks and presented to the operating system as the
“C” volume. A second volume in the system can be used to store
application data and files and, when possible, this data should be
placed on different redundant disks or at least on separate volumes to
prevent impact to the space available in the operating system volume.
Increasing Service and Application Availability
A
service and/or application’s reliability is greatly dependent on the
underlying software code, the hardware the system is running on, and how
it interacts with the host operating system. Windows Server 2008 R2 is a
very stable platform partly because third-party applications and
services must use only the system files provided by Microsoft when
interacting with the operating system and the system hardware.
Furthermore, when third-party services and applications require
additional drivers, these drivers must be certified for Windows Server
2008 R2 and digitally signed by the Windows Hardware Quality Labs (WHQL)
to ensure the highest reliability. Administrators can
disable the strict device driver signing requirements, but on failover
clusters, this would place the system in an unsupported configuration
and is not advisable. Remember that the only reason to deploy failover
clusters or NLB clusters is to provide high availability or very
scalable services; deploying systems using unsigned or untested drivers
can reduce the overall reliability of each system and the entire
cluster.
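One way to audit a system for drivers that are not reported as signed is
to parse a CSV driver listing. The sketch below assumes a listing with
DeviceName and IsSigned columns, which is the shape expected from
running driverquery /si /fo csv on the server; that column layout is an
assumption to verify on the target system, and the function name and
sample rows are hypothetical.

```python
import csv
import io


def drivers_not_reported_signed(csv_text):
    """Return device names whose IsSigned value is anything other than TRUE.

    Assumes the CSV text has DeviceName and IsSigned columns; rows with
    a missing or unexpected IsSigned value are included so that they get
    reviewed rather than silently skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row.get("DeviceName", "")
        for row in reader
        if (row.get("IsSigned") or "").strip().upper() != "TRUE"
    ]


if __name__ == "__main__":
    # Hypothetical sample data, not real output from any server.
    sample = "\n".join([
        '"DeviceName","InfName","IsSigned","Manufacturer"',
        '"Example NIC","oem1.inf","TRUE","Contoso"',
        '"Example HBA","oem2.inf","FALSE","Fabrikam"',
    ])
    print(drivers_not_reported_signed(sample))
```

Any device the function returns is a candidate for review before the
system is added to a failover cluster or NLB cluster.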