Establishing Service Level Agreements
The most common question
from Exchange administrators is “How should I be doing my backups?” The
answer to this question is quite simple. You should be doing them such
that they support your service level agreements around recoverability
for Exchange services.
Based on this concept,
it quickly becomes apparent that the first step in planning out your
backups is to determine exactly what you’ve committed yourself to. This
is commonly referred to as a service level agreement or simply an SLA.
Establishing a Service Level Agreement for Each Critical Service
Exchange 2007 is
often deployed such that roles are distributed across multiple servers.
This distribution of roles might vary from site to site. However, the
SLAs will likely remain constant across the enterprise.
It
is important to understand the implication of SLAs for each aspect of
Exchange 2007 because the SLA drives your design. It must be considered up
front, not as an afterthought to a deployed Exchange 2007
environment.
Determining SLAs for Mailbox Servers
One of the most
important aspects of Exchange 2007 is the Mailbox server. If the
Mailbox server isn’t up, users can’t access their mail. This is usually
the first thing that triggers the help desk phone to ring. Most
companies start their SLAs around the Mailbox servers. In most
environments, a 2-hour recovery for a mailbox database is acceptable.
This means that if your database fails, you need to be able to recover
that data within 2 hours. If you know that your system is capable of
restoring 10GB of data per hour, you know that, based on your backup
process, you can only support 20GB per database.
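The sizing rule in this example is simple arithmetic: the largest database you can support is the restore rate multiplied by the recovery window. A minimal Python sketch of that calculation follows; the function name and parameters are illustrative and are not part of any Exchange tooling.

```python
def max_database_size_gb(restore_rate_gb_per_hour, sla_hours):
    """Largest database that can be restored within the SLA window."""
    if restore_rate_gb_per_hour <= 0 or sla_hours <= 0:
        raise ValueError("restore rate and SLA window must be positive")
    return restore_rate_gb_per_hour * sla_hours


# Figures from the example above: 10GB per hour and a 2-hour SLA.
print(max_database_size_gb(10, 2))  # 20
```

The restore rate should come from a timed test restore on your own hardware; a vendor's rated throughput is only a starting assumption.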
If your SLA for an
entire Mailbox server recovery is 4 hours and you know that it takes 2
hours to rebuild a new server with Exchange 2007, then you only have 2
hours to restore data, which, at the 10GB-per-hour restore rate from the
preceding example, means you can only have 20GB of data on the server.
If you had planned to allow users 200MB of storage each, this limits the
server to 100 users. To support more users per server, you either have
to relax the SLA or change your backup strategy so that you can restore
more data in the same period of time. A faster restore process is what
allows you to safely support large numbers of users while still meeting
aggressive SLAs, so you have to balance the cost of the backup/restore
system against the cost of adding servers.
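The server-level calculation can be sketched as follows: subtract the rebuild time from the SLA, convert the remaining restore window into capacity, and divide by the mailbox quota. This is an illustrative sketch only. The function names are hypothetical, and it counts 1GB as 1,000MB so that the result matches the 100-user figure above. The last function turns the question around and estimates the restore rate that a larger user count would require.

```python
def restore_window_hours(server_sla_hours, rebuild_hours):
    """Time left for restoring data once the server itself is rebuilt."""
    window = server_sla_hours - rebuild_hours
    if window <= 0:
        raise ValueError("rebuild time consumes the entire SLA")
    return window


def max_users(restore_rate_gb_per_hour, window_hours, quota_mb):
    """Users a server can hold if every mailbox is at quota (1GB = 1,000MB)."""
    capacity_mb = restore_rate_gb_per_hour * window_hours * 1000
    return int(capacity_mb // quota_mb)


def required_restore_rate(users, quota_mb, window_hours):
    """Restore rate (GB per hour) needed to hold the given number of users."""
    return users * quota_mb / 1000 / window_hours


window = restore_window_hours(4, 2)             # 2 hours left for data
print(max_users(10, window, 200))               # 100
print(required_restore_rate(500, 200, window))  # 50.0
```

Under these assumptions, hosting 500 users at the same quota would require restoring roughly 50GB per hour, which is the kind of figure to weigh against the cost of a second server.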
Determining SLAs for Client Access Servers
Another major
component of Exchange 2007 is the Client Access server (CAS). These are
the systems that allow mobile devices and web browsers to access users’
email. When determining SLAs for this function, it is helpful to view
the service and the servers as two entities. Although you likely want
very high availability on the service, you can likely worry less about
the servers individually if they are designed with redundancy in mind.
So, if you have two or more CASs, you have plenty of time to rebuild one
server if it fails because there is already another that is taking up
the load. Keep this in mind when designing your Exchange environment.
Also keep in mind that the data on a CAS is mostly static. Building a
new CAS might be faster than restoring an existing one.
Determining SLAs for Edge Transport Servers
For systems like the
Edge Transport servers in Exchange 2007, it is more useful to view the
SLA for this role as being for the service as opposed to the servers
themselves. In the case of Edge Transport servers, the service they
provide is sending and receiving external email to and from the
Internet. In this sense, most companies try to enforce a fairly
aggressive SLA on the service itself. For example, if Internet mail
connectivity were to fail, they’d want the service restored within an
hour or two. In most environments, this is fairly easy to accomplish
because there are typically two or more Edge Transport servers to provide
redundancy and minimize wide area network (WAN) traffic. In the case of
the SLAs on the servers themselves, typically a 1-day recovery is
acceptable. Because the Edge Transport servers don’t hold any unique
data, they can easily be replaced in the event of a failure.
Determining SLAs for Hub Transport Servers
The
role of the Hub Transport server is to transfer mail from one site to
another connected site. When a Hub Transport server fails, the
site it served is effectively cut off from other sites, so a
company would most likely want a fairly aggressive SLA on the Hub
Transport servers. In most environments, the Hub Transport server role
is combined with other roles because, in most cases, it won’t justify
being on an isolated server. As a result, the SLA for recovery is often
overridden by the SLA for another role hosted on the same server. It is
recommended that, when possible, two or more systems per site
host the Hub Transport server role so that the failure of one server
does not isolate the site.
Supporting Backups with Documentation
Performing
trustworthy backups is a critical process in any Exchange environment.
One of the simplest ways to ensure that your backups are being done
properly is to document your requirements and your processes.
A mechanism needs to be in
place to track the success of backups, along with a process to follow if a
backup fails. Sticking to this process and staying within the set
policies helps ensure that backups are valid and recoverable in the event of a
failure.
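One way to picture such a tracking mechanism is a simple record per backup job plus a filter that surfaces the failures for the documented follow-up process. The Python sketch below is a hypothetical illustration: the class, the job names, and the result data are invented for the example, and a real deployment would read results from the backup product's own logs or reports.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class BackupResult:
    job_name: str
    finished_at: datetime
    succeeded: bool
    detail: str = ""


def failed_jobs(results):
    """Return the results that need the documented failure process."""
    return [r for r in results if not r.succeeded]


def escalation_notes(results):
    """Build one line per failed job for whoever owns the escalation path."""
    notes = []
    for r in failed_jobs(results):
        reason = r.detail or "no detail recorded"
        notes.append(f"{r.finished_at:%Y-%m-%d %H:%M} {r.job_name} FAILED: {reason}")
    return notes


# Hypothetical results for one night's jobs.
results = [
    BackupResult("MBX01-SG1", datetime(2007, 6, 1, 2, 15), True),
    BackupResult("MBX01-SG2", datetime(2007, 6, 1, 3, 40), False, "tape drive offline"),
]
for line in escalation_notes(results):
    print(line)
```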
Documenting Backup Policy and Procedures
When building
your documentation around your backups, it is best to start with a
policy that not only supports the SLAs for your Exchange environment
but also complies with any existing rules from your Information
Security group or Regulatory Compliance group.
Management should
review and approve your backup policies to ensure that they are in line
with any established SLAs. Policies should include items such as the
following:
Frequency and type of backups
Acceptable standards for offsite storage and retrieval
Escalation path for failed backups
Decision criteria for overrun jobs
Clear statement of what is and isn’t backed up
Whether the backups are password protected
Data retention periods
In this way, everyone
knows what is and isn’t covered by Exchange backups and there are no
surprises in the future. Having this policy documented is also very
helpful if you are required to pass any audits or verify regulatory
compliance.
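The policy items listed above can also be captured as a structured record so that an incomplete draft is easy to spot before it goes to management for review. The following Python sketch is one hypothetical way to do that; the field names are invented to mirror the list and the sample values are placeholders.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class BackupPolicy:
    frequency_and_type: str = ""
    offsite_storage_standard: str = ""
    escalation_path: str = ""
    overrun_decision_criteria: str = ""
    backed_up: list = field(default_factory=list)
    not_backed_up: list = field(default_factory=list)
    password_protected: Optional[bool] = None  # None means "not yet decided"
    retention_period: str = ""


def missing_items(policy):
    """Names of policy items that have not been filled in yet."""
    missing = []
    for f in fields(policy):
        value = getattr(policy, f.name)
        if value is None or value == "" or value == []:
            missing.append(f.name)
    return missing


# Placeholder draft with only three items decided so far.
draft = BackupPolicy(
    frequency_and_type="nightly full",
    password_protected=False,
    retention_period="30 days",
)
print(missing_items(draft))
```

Note that an explicit answer of False for password protection counts as documented; only an undecided item is reported as missing.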
Maintaining Documentation on the Exchange Environment
Systems
like Exchange often outlast the employees who built them. This means
that it’s easy to lose track of exactly how systems are deployed, where
various roles are located, and the specific needs of each participating
system. For this reason, it is very important to maintain accurate
documentation regarding the server configurations, the network, and the
path of mail flow. In addition, it is also important to track the
configuration of firewalls and switches that could potentially impact
the overall Exchange environment if they were to fail and need to be
replaced.
Server Configuration Documentation
Server
documentation is essential for any environment regardless of size,
number of servers, or disaster recovery budget. A server configuration
document contains a server’s name, network configuration information,
hardware and driver information, disk and volume configuration, and
information about the applications installed. This complete server
configuration document contains all the necessary configuration
information a qualified administrator would need if the server needed to
be restored and the operating system could not be restored efficiently.
A server configuration document can also be used as a reference when
server information needs to be collected.
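Some of the items in a server configuration document can be gathered by script and then completed by hand. The sketch below uses only the Python standard library and is an assumption-laden illustration, not a replacement for a full inventory tool: it records the server name, operating system details, and the capacity of one volume, and it does not collect network settings, drivers, or installed applications.

```python
import json
import os
import platform
import shutil
import socket


def collect_server_facts(volume_path=None):
    """Gather a small, partial set of facts for a server configuration document."""
    if volume_path is None:
        # Root of the current drive or file system.
        volume_path = os.path.abspath(os.sep)
    usage = shutil.disk_usage(volume_path)
    return {
        "server_name": socket.gethostname(),
        "os_platform": platform.system(),
        "os_version": platform.version(),
        "processor": platform.processor(),
        "volume": {
            "path": volume_path,
            "total_gb": round(usage.total / 1024**3, 1),
            "free_gb": round(usage.free / 1024**3, 1),
        },
    }


print(json.dumps(collect_server_facts(), indent=2))
```

Saving the output alongside a date makes it easier to tell later whether the document still matches the server.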
Tip
To assist
with gathering information, administrators can use the WINMSD tool to
collect server data and configuration information to assist in producing
server build documents. In the Run dialog box, type winmsd in the Open
text box, and click OK to view the System Information window in Windows
Server 2003.
The Server Build Document
A server build document
contains step-by-step instructions on how to build a particular type of
server for an organization. The details of this document should be
tailored to the skill of the person intended to rebuild the server. For
example, if this document was created for disaster recovery purposes, it
might be detailed enough that anyone with basic computer skills could
rebuild the server. This type of information could also be used to help
information technology (IT) staff follow a particular server build
process to ensure that when new servers are added to the network, they
all meet company server standards.
Hardware Inventory
Documenting the
hardware inventory of an entire network might not be necessary. If the
entire network does need to be inventoried, and if the organization is
large, the Microsoft Systems Management Server can help automate the
hardware inventory task. If the entire network does not need to be
inventoried, hardware inventory can be collected for all the production
and lab servers and networking hardware, including specifications such
as serial numbers, amount of memory, disk space, processor speed, and
operating system platform and version.
Network Configurations
Network
configuration documentation is essential when network outages occur.
Current, accurate network configuration documentation and network
diagrams can help simplify and isolate network troubleshooting when a
failure occurs.
WAN Connection
WAN connectivity
should be documented for enterprise networks that contain many sites to
help IT staff understand the enterprise network topology. This document
is very helpful when a server is restored and data must be
synchronized enterprisewide after the restore. Knowing the link
performance between sites helps administrators understand how long an
update made in Site A will take to reach Site B. This document should
contain information about each WAN link, including circuit numbers,
Internet service provider (ISP) contact names, ISP technical support
phone numbers, and the network configuration on each end of the
connection, and can be used to troubleshoot and isolate WAN connectivity
issues.
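A WAN link record and a rough transfer-time estimate can be sketched as follows. This is a simplified illustration: the class and field names are hypothetical, the sample link is invented, and the usable_fraction parameter is an assumed share of nominal bandwidth, not a measured value. Real synchronization times also depend on latency, replication schedules, and competing traffic.

```python
from dataclasses import dataclass


@dataclass
class WanLink:
    site_a: str
    site_b: str
    circuit_number: str
    isp_contact: str
    isp_support_phone: str
    bandwidth_mbps: float  # nominal link speed in megabits per second


def transfer_hours(link, data_gb, usable_fraction=0.5):
    """Rough time to move data_gb across the link (1GB = 8,000 megabits)."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be between 0 and 1")
    megabits = data_gb * 8 * 1000
    seconds = megabits / (link.bandwidth_mbps * usable_fraction)
    return seconds / 3600


# Invented example link; every value here is a placeholder.
link = WanLink("Site A", "Site B", "CKT-0000", "ISP contact name",
               "555-0100", bandwidth_mbps=10.0)
print(transfer_hours(link, data_gb=9))  # 4.0
```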
Router, Switch, and Firewall Configurations
Firewalls,
routers, and, sometimes, switches can run proprietary operating systems
with a configuration that is exclusive to the device. During a system
recovery, certain gateway connections, configuration routing
information, routing table data, and other information might need to be
reset on the restored server. Information should be collected from these
devices, including logon passwords and current configurations. When a
configuration change is planned for any of these devices, the
proposed configuration should be drafted using a text or graphical
editor and approved before it is applied to the
production device. A rollback plan should also be created before the
change is made to ensure that the device can be restored to its
original state if the change does not deliver the desired results.
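When device configurations are kept as text, the difference between the current and proposed versions can be generated for the approver, and the change can be blocked until both an approval and a rollback plan exist. The Python sketch below uses the standard difflib module; the device name and configuration lines are invented for illustration, and the helper functions are hypothetical.

```python
import difflib


def config_diff(current, proposed, device):
    """Unified diff of two configuration texts for change review."""
    lines = difflib.unified_diff(
        current.splitlines(),
        proposed.splitlines(),
        fromfile=f"{device} (current)",
        tofile=f"{device} (proposed)",
        lineterm="",
    )
    return "\n".join(lines)


def ready_to_apply(approved, rollback_plan):
    """A change may proceed only with approval and a non-empty rollback plan."""
    return bool(approved) and bool(rollback_plan.strip())


# Invented configuration text for a hypothetical firewall.
current = "hostname edge-fw01\nip route 10.2.0.0 255.255.0.0 10.1.0.1\n"
proposed = "hostname edge-fw01\nip route 10.2.0.0 255.255.0.0 10.1.0.254\n"

print(config_diff(current, proposed, "edge-fw01"))
print(ready_to_apply(approved=True, rollback_plan=""))
```

Keeping the current configuration text with the change record also gives the rollback plan something concrete to restore.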
Updating Documentation
One of the most
important, yet sometimes overlooked, areas around documentation is
maintaining its accuracy as changes are applied to server systems.
Updating documentation is tedious, but outdated documentation can be worthless if
changes have occurred to a server’s software configuration since the
document was created. For example, if a server configuration document
was used to re-create a server from scratch but many changes were
applied to the server after the document was created, the correct
security patches might not be applied, applications might be configured
incorrectly, or data restore attempts could be unsuccessful. Whenever a
change will be made to a network device, printer, or server,
documentation outlining the previous configuration, proposed changes,
and rollback plan should be created before the change is approved and
carried out on the production device. After the change is carried out
and the device is functioning as desired, the documentation associated
with that device or server should be updated.