The most common computer hardware malfunction is
probably a hard disk failure. Even though hard disks have become more
reliable over time, they are still subject to failure, especially during
their first month or so of use. They are also vulnerable to both
catastrophic and degenerative failures caused by power problems.
Fortunately, disk arrays have become the norm for servers, and good
fault-tolerant hardware RAID systems are available and supported on SBS.
The choice of RAID and the particulars of how you configure your RAID
system can significantly affect the cost of your servers. To make an
informed choice for your environment and needs, you must understand the
tradeoffs and the differences in fault tolerance, speed,
configurability, and so on.
1. Hardware vs. Software
RAID can be implemented at the
hardware level, using RAID controllers, or at the software level, either
by the operating system or by a third-party add-on. SBS supports both hardware RAID and its own software RAID.
Hardware RAID implementations require dedicated controllers and cost somewhat more than an equivalent level of software
RAID. However, for that extra price, you get a faster, more flexible,
and more fault-tolerant RAID. When compared to the software RAID provided in SBS, a good hardware RAID controller supports more levels
of RAID, on-the-fly reconfiguration of the arrays, hot-swap and
hot-spare drives, and dedicated
caching of both reads and writes.
Software RAID requires that
you convert your disks to dynamic disks. We don’t recommend converting
your system disk or boot disks, because dynamic disks can be more
difficult to access if a problem occurs, and the SBS setup and
installation program provides only limited support. For maximum fault
tolerance, we recommend using hardware mirroring (RAID-1) on your system
drive. Dynamic disks, and the software RAID they support, are also a
problem for virtualization and should not be used when you are
virtualizing SBS.
2. RAID Levels for Fault Tolerance
Except for level 0, RAID is a
mechanism for storing sufficient information on a group of hard disks
so that even if one hard disk in the group fails, no information is
lost. Some RAID arrangements go even further, providing protection in
the event of multiple hard disk failures. The more common levels of RAID
and their appropriateness in a fault-tolerant environment are shown in Table 1.
Table 1. RAID levels and their fault tolerance
LEVEL | NUMBER OF DISKS | SPEED | FAULT TOLERANCE | DESCRIPTION |
---|---|---|---|---|
0 | N | +++ | - - - | Striping alone. Not fault-tolerant (it actually increases your risk of failure), but it provides the fastest read and write performance. |
1 | 2N | + | ++ | Mirror or duplex. Slightly faster reads than a single disk, but no gain during write operations. Failure of any single disk causes no loss of data and a minimal performance hit. |
3 | N+1 | ++ | + | Byte-level parity. Data is striped across multiple drives at the byte level, with the parity information written to a single dedicated drive. Reads are much faster than with a single disk, but writes are slightly slower than with a single disk because parity information must be generated and written to a single disk. Failure of any single disk causes no loss of data but can cause a significant loss of performance. |
4 | N+1 | ++ | + | Block-level parity with a dedicated parity disk. Similar to RAID-3 except that data is striped at the block level. |
5 | N+1 | + | ++ | Interleaved block-level parity. Parity information is distributed across all drives. Reads are much faster than with a single disk, but writes are significantly slower. Failure of any single disk causes no loss of data but results in a major reduction in performance. |
6 | N+2 | + | +++ | Replicated interleaved block-level parity. Parity information is distributed across all drives, with two parity blocks on separate drives for every stripe. Reads are much faster than with a single disk, but writes are significantly slower. Failure of any two disks causes no loss of data but results in a major reduction in performance. |
0+1 and 10 | 2N | +++ | ++ | Striped mirrored disks or mirrored striped disks. Data is striped across multiple mirrored disks, or multiple striped disks are mirrored. Failure of any one disk causes no data loss and no speed loss. Failure of a second disk could result in data loss. Faster than a single disk for both reads and writes. |
Other | Varies | +++ | +++ | Array of RAID arrays. Different hardware vendors have different proprietary names for this RAID concept. Excellent read and write performance. Failure of any one disk results in no loss of performance and continued redundancy. |
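The "number of disks" column in Table 1 translates directly into usable capacity. The following Python function is a minimal sketch of that arithmetic, assuming identical drives; the function name and the level strings are illustrative and not part of any SBS or controller tool.

```python
# Sketch only: usable capacity and guaranteed fault tolerance by RAID level,
# following the "number of disks" column of Table 1. Assumes identical drives.

# Parity levels: how many drives' worth of space go to parity information.
_PARITY_DRIVES = {"3": 1, "4": 1, "5": 1, "6": 2}
# Mirrored levels: half of the raw space holds a second copy of the data.
_MIRRORED = {"1", "0+1", "10"}


def raid_summary(level: str, drives: int, drive_gb: float) -> tuple[float, int]:
    """Return (usable_gb, disk_failures_always_survived) for an array."""
    if level == "0":
        return drives * drive_gb, 0  # striping only, no redundancy
    if level in _MIRRORED:
        minimum = 2 if level == "1" else 4
        if drives < minimum or drives % 2:
            raise ValueError("mirrored levels need an even number of drives")
        # A second failure may or may not be survivable, so only 1 is guaranteed.
        return (drives // 2) * drive_gb, 1
    if level in _PARITY_DRIVES:
        parity = _PARITY_DRIVES[level]
        if drives < parity + 2:
            raise ValueError("too few drives for this parity level")
        return (drives - parity) * drive_gb, parity
    raise ValueError(f"unknown RAID level: {level}")


if __name__ == "__main__":
    for level, count in [("0", 4), ("1", 2), ("5", 4), ("6", 4), ("10", 4)]:
        usable, survives = raid_summary(level, count, 72)
        print(f"RAID-{level}: {count} drives, {usable:g} GB usable, "
              f"survives {survives} failure(s)")
```

The second value is the number of failures the array is guaranteed to survive; mirrored arrays can sometimes survive more, depending on which disks fail.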
Note:
RAID
is an excellent solution for fault tolerance, but it can’t protect you
against corruption caused by hardware or software failures. Only a good
backup of data from before the corruption can protect against that.
When choosing the RAID level to use for a given application or server, consider the following factors:
Intended use
Will this application be primarily read-intensive, such as file
serving, or will it be predominantly write-intensive, such as a
transactional database? SBS servers are heavily write-intensive, at
least on the disks that Microsoft Exchange uses. Virtualization is also
highly disk-intensive.
Fault tolerance
How critical is this data, and how much can you afford to lose?
Availability
Does this server or application need to be available at all times, or
can you afford to reboot it or otherwise take it offline for brief
periods?
Performance
Is this application or server heavily used, with large amounts of data
being transferred to and from it, or is this server or application less
I/O-intensive? If this is your main SBS server, it’s heavily used.
Cost
Are you on a tight budget for this server or application, or is the
cost of data loss or unavailability the primary driving factor?
You need to evaluate each
of these factors when you decide which type of RAID to use for a server
or portion of a server. No single answer fits all cases; the right
answer comes from carefully weighing each of these factors and balancing
them against your situation and your needs. The following sections take
a closer look at each factor and how it weighs in the overall
decision-making process.
2.1. Intended Use
The intended use, and the kind of disk access associated with that use, plays an important role in determining the best RAID level
for your application. Think about how write-intensive the application
is and whether the manner in which the application uses the data is more
sequential or random. Is your application a three-square-meals-a-day
kind of application, with relatively large chunks of data being read or
written at a time, or is it more of a grazer or nibbler, reading and
writing little bits of data from all sorts of different places?
If your application is relatively write-intensive, you’ll want to avoid software RAID or RAID-5 and RAID-6
if other considerations don’t require them. With RAID-5 and RAID-6, any
application whose disk I/O is more than 50 percent writes is
likely to be at least somewhat slower, if not much slower, than it would
be on a single disk or a RAID-1 mirror. You can mitigate this to some
extent by using more but smaller drives in your array and by using a
hardware controller with a large cache to offload the parity processing
as much as possible. RAID-1, in either a mirror or duplex configuration,
provides a high degree of fault tolerance with no significant penalty
during write operations—a good choice for the system disk.
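To get a feel for how the read/write mix affects the comparison, a commonly used planning rule assigns each RAID level a "write penalty": the number of back-end disk operations that one small random write generates. The sketch below uses the widely cited rule-of-thumb values (2 for mirroring, 4 for RAID-5, 6 for RAID-6). These are assumptions for illustration, not measurements, and the model ignores controller caching and full-stripe writes. The per-drive figure of 150 I/O operations per second is hypothetical.

```python
# Rule-of-thumb back-end I/Os generated by one small random write.
# Commonly cited planning figures, not measurements of any real controller.
WRITE_PENALTY = {"0": 1, "1": 2, "10": 2, "5": 4, "6": 6}


def effective_iops(level: str, drives: int, drive_iops: float,
                   write_fraction: float) -> float:
    """Estimate host-visible IOPS for a given mix of reads and writes."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    raw = drives * drive_iops  # total back-end operations available
    # Each read costs one back-end I/O; each write costs the level's penalty.
    cost_per_request = (1.0 - write_fraction) + write_fraction * WRITE_PENALTY[level]
    return raw / cost_per_request


if __name__ == "__main__":
    for level in ("10", "5", "6"):
        print(level, round(effective_iops(level, 4, 150, 0.5)))
```

Under these assumptions, four drives at a 50 percent write mix yield roughly 400 IOPS as RAID-10 but only about 240 as RAID-5, which is the kind of gap the paragraph above describes.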
If your application is primarily read-intensive and the data is stored and referenced sequentially, RAID-3 or RAID-4
might be a good choice. Because the data is striped across many drives,
you have parallel access to it, improving your throughput. And because
the parity information is stored on a single drive rather than dispersed
across the array, sequential read operations don’t have to skip over
the parity information and are therefore faster. However, write
operations are substantially slower, and the single parity drive can
become an I/O bottleneck during write operations.
Note:
RAID-3 and RAID-4 have been largely supplanted by other RAID
technologies, primarily RAID-5 and RAID-10. In an SBS environment,
RAID-3 and RAID-4 are unlikely to be an appropriate choice, and you
should consider them only for specialized applications.
If your application is
primarily read-intensive and not necessarily sequential, RAID-5 and
RAID-6 are obvious choices. They provide a good balance of speed and
fault tolerance, and the cost is substantially lower than the cost of
RAID-1 or RAID-10. Disk accesses are evenly distributed across multiple
drives, and no single drive has the potential to be an I/O bottleneck.
However, writes require calculation of the parity information and the
extra write of that parity, slowing write operations down significantly.
Windows Small Business Server file shares are a good fit for RAID-5 and
RAID-6, but avoid these levels for any volume that holds write-intensive
database files.
If your application
provides other mechanisms for data recovery or uses large amounts of
temporary storage that doesn’t require fault tolerance, a simple RAID-0,
with no fault tolerance but fast reads and writes, is a possibility.
However, we strongly advise against RAID-0 on an SBS server unless you
clearly understand that anything on a RAID-0 array is completely
unprotected and is actually more likely to fail than a single disk.
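The reason a RAID-0 array is more likely to fail than a single disk is simple probability: the array is lost if any one of its drives fails. A minimal sketch, assuming that drives fail independently and that each has the same failure probability over the period of interest (the 3 percent figure is purely hypothetical):

```python
def raid0_failure_probability(drive_failure_prob: float, drives: int) -> float:
    """Probability that at least one drive fails, which destroys a RAID-0 array.

    Assumes independent failures with identical probability per drive.
    """
    if not 0.0 <= drive_failure_prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if drives < 1:
        raise ValueError("need at least one drive")
    # The array survives only if every single drive survives.
    return 1.0 - (1.0 - drive_failure_prob) ** drives


if __name__ == "__main__":
    p = 0.03  # hypothetical per-drive failure probability
    for n in (1, 2, 4, 8):
        print(n, f"{raid0_failure_probability(p, n):.3f}")
```

With these assumed numbers, a four-drive stripe is lost with a probability of about 11.5 percent, compared with 3 percent for a single disk.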
2.2. Fault Tolerance
Carefully examine the fault tolerance of each of the possible RAID choices for your intended use. All RAID
levels except RAID-0 provide some degree of fault tolerance, but the
effect of a failure and the ability to recover from subsequent failures
are different.
If a drive in a RAID-1 mirror or
duplex array fails, a full, complete, exact copy of the data remains.
Access to your data or application is unimpeded, and performance
degradation is minimal, although you do lose the benefit gained on read
operations of being able to read from either disk. Until the failed disk
is replaced, however, you have no fault tolerance on the remaining
disk. Once you replace the failed disk, overall performance is
significantly reduced while the new disk is initialized and the mirror
is rebuilt. Modern RAID controllers can vary the speed of data
reconstruction when replacing a failed disk, allowing you to balance the
speed of regeneration against the performance degradation.
In a RAID-3 or RAID-4 array, if
one of the data disks fails, a significant performance degradation
occurs because the missing data needs to be reconstructed from the
parity information. Also, you’ll have no fault tolerance until the
failed disk is replaced. If the parity disk fails, you’ll have no fault
tolerance until it is replaced, but also no performance degradation.
Once you replace the failed disk, overall performance is significantly
reduced while the new disk is initialized and the parity information or
data is rebuilt.
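Reconstructing missing data from parity is an exclusive-OR (XOR) operation: the parity block is the XOR of the data blocks in a stripe, so any one missing block equals the XOR of everything that survives. The following sketch demonstrates the idea on three tiny in-memory "disks". The block contents and names are made up for illustration, and real controllers do this per stripe in hardware.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data disks, plus a parity block.
data_blocks = [b"SBS1", b"mail", b"file"]
parity = xor_blocks(data_blocks)

# Simulate losing the second data disk, then rebuild its block
# from the surviving data blocks and the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Every read of the failed disk's data requires reading all the surviving disks, which is why a degraded parity array is so much slower than a degraded mirror.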
In a RAID-5 array, the loss of any
disk results in a significant performance degradation, and your fault
tolerance will be gone until you replace the failed disk. Once you
replace the disk, you won’t return to fault tolerance until the entire
array has a chance to rebuild itself, and performance is seriously
degraded during the rebuild process.
In a RAID-6
array, the loss of any disk results in a significant performance
degradation, but you will still be fault tolerant. The failure of a
second disk will not cause data loss, but it will leave you with no
fault tolerance. Once you replace a failed disk, you won’t return to
full fault tolerance until the entire array has a chance to rebuild
itself, and performance is seriously degraded during the rebuild
process.
If a drive in a RAID 0+1 or RAID-10
array fails, a full, complete, exact copy of the data remains. Access
to your data or application is unimpeded, and performance degradation is
minimal. Until the failed disk is replaced, however, you have
incomplete fault tolerance on the array. A second disk failure causes data
loss if it strikes the only remaining copy of the failed disk's data, that is, if it
occurs on the opposite side of the mirror. Once
you replace the failed disk, overall performance is significantly
reduced while the new disk is initialized and the mirror is rebuilt.
Modern RAID controllers can vary the speed of data reconstruction when
replacing a failed disk, allowing you to balance the speed of
regeneration against the performance degradation.
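How likely a second failure is to be fatal depends on the layout. The sketch below compares the two layouts under stated assumptions: the second failure is equally likely to hit any surviving drive; in RAID-10 (a stripe of mirror pairs) only the failed drive's partner is fatal; and in RAID 0+1 (a mirror of two stripe sets) the first failure takes its whole stripe set offline, so any drive in the other set is fatal. Some controllers may behave differently.

```python
from fractions import Fraction


def second_failure_fatal(layout: str, drives: int) -> Fraction:
    """Chance that a second, random drive failure loses data.

    Assumes one drive has already failed and has not yet been replaced.
    """
    if drives < 4 or drives % 2:
        raise ValueError("need an even number of drives, at least four")
    remaining = drives - 1
    if layout == "10":
        # Only the mirror partner of the failed drive holds the last copy.
        return Fraction(1, remaining)
    if layout == "0+1":
        # The failed drive's stripe set is offline, so every drive in the
        # other stripe set now holds the only copy of its data.
        return Fraction(drives // 2, remaining)
    raise ValueError(f"unknown layout: {layout}")


if __name__ == "__main__":
    for layout in ("10", "0+1"):
        print(layout, second_failure_fatal(layout, 8))
```

For an eight-drive array under these assumptions, the chance is 1 in 7 for RAID-10 and 4 in 7 for RAID 0+1, which is one reason to replace a failed disk promptly in either layout.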
RAID
systems that are arrays of arrays can provide for multiple failure
tolerance. These arrays provide for multiple levels of redundancy and
are appropriate for mission-critical applications that must be able to
withstand the failure of more than one drive in an array.
2.3. Availability
All levels of RAID,
except RAID-0, provide higher availability than a single drive.
However, if availability is expanded to also include the overall
performance level during failure mode, some RAID levels provide definite advantages over others. Specifically, RAID-1 and its derivatives, RAID-10 and RAID
0+1, provide enhanced availability when compared to RAID levels 3, 4,
5, and 6 during failure mode. The performance degradation is minimal
when compared to a single disk if one half of a mirror fails, whereas a RAID-5 or RAID-6 array has substantially compromised performance until the failed disk is replaced and the array is rebuilt.
In addition, RAID systems that
are based on an array of arrays can provide higher availability than
RAID levels 1 through 6. Running on multiple controllers, these arrays
are able to tolerate the failure of more than one disk and the failure
of one of the controllers, providing protection against the single point
of failure inherent in any single-controller arrangement. RAID-1 that
uses duplexed disks running on different controllers (as opposed to
RAID-1 that uses mirroring on the same controller) also provides this additional protection and improved availability.
Hot-swap drives and, especially, hot-spare
drives can further improve
availability in critical environments. By
providing for automatic failover and rebuilding, hot spares can reduce your
exposure to catastrophic failure and provide for maximum availability.
2.4. Performance
The relative performance of
each RAID level depends on the intended use. The best compromise for
many situations is arguably RAID-5 or RAID-6, but you should question
the adequacy of that compromise if your application is fairly
write-intensive. Especially for relational database data and index files
where the database is moderately or highly write-intensive, the
performance hit of using RAID-5 or RAID-6 can be substantial. A better
alternative is to use RAID 0+1 or RAID-10.
Whatever level of
RAID you choose for your particular application, it will benefit from
using more small disks rather than a few large disks. The more drives
contributing to the stripe of the array, the greater the benefit of
parallel reading and writing you’ll be able to realize—and your array’s
overall throughput will improve.
2.5. Cost
The delta in cost between RAID
configurations is primarily the cost of drives, potentially including
the cost of additional array enclosures because more drives are required
for a particular level of RAID. RAID-1—either duplexing or mirroring—is
the most expensive of the conventional RAID levels because it requires
at least 33 percent more raw disk space for a given amount of net
storage space than other RAID levels.
Another consideration is
that RAID levels that include mirroring or duplexing must use drives in
pairs. Therefore, it’s more difficult (and more expensive) to add on to
an array if you need additional space on the array. A net 144-gigabyte
(GB) RAID 0+1 array, comprising four 72-GB drives, requires four more
72-GB drives to double in size—a somewhat daunting prospect if your
array cabinet has bays for only six drives, for example. A net 144-GB
RAID-5 array of three 72-GB drives, however, can be doubled in size
simply by adding two more 72-GB drives, for a total of five drives.
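The drive counts in this section follow from simple arithmetic, sketched below under the assumption of identical drives. The function name and level strings are illustrative only.

```python
import math


def drives_needed(level: str, net_gb: float, drive_gb: float) -> int:
    """Number of identical drives required for a given net capacity."""
    data_drives = math.ceil(net_gb / drive_gb)
    if level in ("1", "0+1", "10"):
        return 2 * data_drives      # every data drive needs a mirror
    if level == "5":
        return data_drives + 1      # one drive's worth of parity
    if level == "6":
        return data_drives + 2      # two drives' worth of parity
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    for level in ("0+1", "5"):
        now = drives_needed(level, 144, 72)
        doubled = drives_needed(level, 288, 72)
        print(f"RAID-{level}: {now} drives now, {doubled - now} more to double")
```

For a net 144 GB on 72-GB drives, the mirrored array needs four drives against three for RAID-5, which is the 33 percent overhead mentioned earlier; the gap widens as the array grows, since doubling takes four more drives for RAID 0+1 but only two more for RAID-5.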
RAID arrays based on
2.5-inch drives are rapidly replacing arrays built from traditional 3.5-inch drives. The
smaller 2.5-inch drives take up less physical space for the same amount
of total storage, while consuming substantially less power and
generating less heat. The initial cost of the array is essentially
similar to that of an equivalent array using 3.5-inch drives, but the
ongoing costs are less. Our current preferred array system uses eight
2.5-inch SAS drives configured as RAID 0+1. The entire array fits in the
space of a pair of standard CD/DVD drives.
3. Hot-Swap and Hot-Spare Disk Systems
Hardware
RAID systems can provide for both hot-swap and hot-spare capabilities. A
hot-swap disk system allows failed hard disks to be removed and a
replacement disk to be inserted into the array without powering down the
system or rebooting the server. When the new disk is inserted, it is
automatically recognized and either will be automatically configured
into the array or can be manually configured into it. Additionally, many
hot-swap RAID
systems allow you to add hard disks into empty slots dynamically and
automatically or manually increase the size of the RAID volume on the
fly without a reboot.
A hot-spare
RAID configuration uses an additional, preconfigured disk or disks to
automatically replace a failed disk. These systems can be configured to
automatically regenerate the array in the event of a failure, thus
maintaining maximal redundancy. When combined with a RAID configuration
that can withstand multiple drive failures, such as RAID-6, a hot-spare
system provides a very high degree of redundancy and availability.
Even where you don’t have a
hot-spare drive already configured into your array, it makes sense to
always keep a matching spare drive available in your replacement-parts
cabinet. Hard drives aren’t all that expensive, and having a spare will
save you time if you have a drive failure in your array. Plus, with
drive sizes and technology changing rapidly, it can be annoying to try
to find a matching drive two or three years after you buy the original
array.