3. System and Environment Health Models
The term system health model
refers to the process of ordering and tracking the overall system state
of individual servers throughout the environment as time progresses.
Within the enterprise, this process is particularly crucial because it
gives you an indication of how much productivity is being lost because
of equipment failures and a lack of application availability.
In some corporations, the loss
of a single server can bring the entire company to its knees in terms of
production, especially if there is no backup. However, usually in a
large enterprise the loss of one machine just means that the state of
the overall environment is "other than 100 percent." In other words, the
environment is still functioning but not as well as it theoretically
could be functioning.
Windows Server health, and
server health in general, usually falls into several categories, each of
which can be monitored and evaluated on an individual level. These
categories include the following:
Server availability
Server uptime
Server downtime
CPU usage
Memory usage
Page file usage
Disk utilization
Network utilization
Service availability
Service downtime
Application availability
Application downtime
Backup availability
Backup downtime
Most of these are
discussed throughout the process of becoming a Microsoft Certified
Windows Server Professional. At the enterprise administrator level,
you're interested primarily in four areas: server availability, uptime,
downtime, and hardware statistics (CPU, memory, page file, disk, and
network usage).
3.1. Server Availability
Server availability
refers to the period of time in which a server is up, running, and
accessible. This can be achieved only when a server is operating at
its full capacity. For example, a server cannot be considered
"available" if it is running but its network cable has somehow
become unplugged, leaving it inaccessible to the rest of the network
infrastructure. Thus, for a server to count as available, it
has to be set up properly, functioning, and reachable.
As shown in Figure 2,
most administrators keep a day-to-day chart of the availability of their
servers, scored from 0 percent to 100 percent. Most companies publish
this chart to all IT staff in a centralized intranet or Internet
location so that it can be consulted for future study.
This chart matters
because it indicates the overall state of the
enterprise. The closer it is to 100 percent, the better off the entire
organization is. With anything less than 100 percent, the company is not
functioning as well as it could be.
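The 0 to 100 percent score described above can be illustrated with a minimal sketch. It assumes availability is recorded as the number of minutes per day in which a server answered its checks; the names DailyAvailability and environment_availability, and the simple unweighted average across servers, are assumptions made for illustration and are not taken from any Microsoft tool.

```python
from dataclasses import dataclass


@dataclass
class DailyAvailability:
    """One server's availability record for a single day (hypothetical structure)."""
    server: str
    minutes_reachable: int          # minutes the server was up and accessible
    minutes_in_day: int = 24 * 60   # length of the observation period

    def percent(self) -> float:
        """Availability score for the day, from 0.0 to 100.0."""
        return 100.0 * self.minutes_reachable / self.minutes_in_day


def environment_availability(records: list[DailyAvailability]) -> float:
    """Unweighted mean of the per-server scores (an assumed aggregation rule)."""
    if not records:
        raise ValueError("at least one server record is required")
    return sum(r.percent() for r in records) / len(records)


if __name__ == "__main__":
    day = [
        DailyAvailability("web01", 1440),  # reachable all day
        DailyAvailability("db01", 720),    # reachable for half the day
    ]
    for record in day:
        print(record.server, record.percent())
    print("environment", environment_availability(day))
```

An unweighted mean treats every server as equally important. An environment in which some servers matter more than others would probably want a weighted score instead.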
3.2. Server Uptime
Server uptime
refers to the period of time that the server has been running with
power in which it has not experienced a software- or hardware-based
failure resulting in the loss of the critical components of an operating
system. Generally, the causes of a loss of server uptime include the
following:
Usually, server
uptime is used in conjunction with server availability to determine
whether software on the operating system is causing failures.
Additionally, this indicates the status of power availability and
reliability throughout the infrastructure. Normally, administrators who
need a justification for expensive hardware, such as battery backups,
will use this statistic along with server availability to illustrate
that most productivity loss comes from a lack of available power in the
case of a failure.
3.3. Server Downtime
The opposite of server uptime is server downtime,
which refers to the period of time in which the operating system is not
up, not running, not receiving power, or not functioning as it should.
Ideally, the amount of server downtime is zero. Whenever
this statistic is greater than zero, it means the network is not
functioning as well as it could be.
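Because uptime and downtime are complements over any observation period, one can be derived from the other. The following sketch assumes outages are logged as (start, end) timestamp pairs that do not overlap one another; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def downtime_in_window(outages, window_start, window_end):
    """Sum the portion of each outage that falls inside the window.

    Assumes the outage intervals do not overlap each other.
    """
    total = timedelta(0)
    for start, end in outages:
        # Clip each outage to the observation window before adding it.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            total += clipped_end - clipped_start
    return total


def uptime_in_window(outages, window_start, window_end):
    """Uptime is whatever part of the window is not downtime."""
    window = window_end - window_start
    return window - downtime_in_window(outages, window_start, window_end)


if __name__ == "__main__":
    day_start = datetime(2024, 1, 1)
    day_end = day_start + timedelta(days=1)
    logged = [
        # Began the previous evening; only 15 minutes fall inside this day.
        (datetime(2023, 12, 31, 23, 0), datetime(2024, 1, 1, 0, 15)),
        (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 30)),
    ]
    print("downtime:", downtime_in_window(logged, day_start, day_end))
    print("uptime:  ", uptime_in_window(logged, day_start, day_end))
```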
3.4. Service Availability
When working with Windows Server, an important component of overall environment health is service availability.
When using Active Directory, the failure of a simple network service such as NetLogon
can render the entire Active Directory infrastructure
useless if the outage spans the whole environment. With this statistic, you
can pay careful attention to the availability of services overall.
If these services are
automatic, you can check them against a chart that compares whether the
automatic services are being enabled as they should be (based on need).
If these services are manual, you can even do a human task-oriented
analysis to see whether these services are enabled as they should be
based on job roles and duties.
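The comparison of automatic services against a chart can be sketched as a simple set check. This example assumes the chart is a set of service names expected to be running and that the observed state has already been collected into a dictionary by some other means; the function name and the sample data are hypothetical, and no Windows API is called here.

```python
def find_service_gaps(expected_automatic, observed_running):
    """Return the expected services that are not currently running.

    expected_automatic: set of service names the chart says should be running.
    observed_running:   dict mapping service name to True (running) or False.
    A service missing from the observed data is treated as not running.
    """
    return sorted(
        name
        for name in expected_automatic
        if not observed_running.get(name, False)
    )


if __name__ == "__main__":
    # Illustrative chart for a domain controller (sample data only).
    chart = {"Netlogon", "DNS", "W32Time"}
    observed = {"Netlogon": False, "DNS": True}
    for service in find_service_gaps(chart, observed):
        print("not running:", service)
```

Treating unreported services as not running is a deliberately cautious choice: a service that cannot be confirmed is flagged for review rather than assumed healthy.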
Overall, service
availability plays a large role in overall server availability, because
it is one of the deciding factors in determining how a computer is
functioning. A server can be up, running, and operating but not have a
service started and therefore not be fulfilling its given purpose,
especially in a specialized client environment.
3.5. Hardware Statistics
The next general category
of system health covers the many different
components of general server health. This includes measurements such as
memory usage, disk access, and overall hardware utilization. At the
enterprise level, you won't be quite as concerned with this as the
operator of an individual server would be. You're much more concerned with the
overall health of the entire infrastructure. However, it's important to
note this general category.
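One way to look at hardware statistics from the enterprise level rather than server by server is to roll the per-server readings up into a list of machines that exceed a limit. The sketch below assumes readings are already expressed as percentages; the metric names, threshold values, and function name are illustrative assumptions, not recommended settings.

```python
def servers_over_threshold(samples, thresholds):
    """Group servers by the metrics on which they exceed a limit.

    samples:    {server name: {metric name: percent used}}
    thresholds: {metric name: percent limit}
    Returns {metric name: sorted list of servers above the limit}.
    A metric missing from a server's sample is treated as 0 percent.
    """
    flagged = {metric: [] for metric in thresholds}
    for server, metrics in samples.items():
        for metric, limit in thresholds.items():
            if metrics.get(metric, 0.0) > limit:
                flagged[metric].append(server)
    return {metric: sorted(servers) for metric, servers in flagged.items()}


if __name__ == "__main__":
    readings = {
        "web01": {"cpu": 95.0, "memory": 40.0},
        "db01": {"cpu": 30.0, "memory": 92.0},
    }
    limits = {"cpu": 90.0, "memory": 90.0}  # assumed limits, for illustration
    print(servers_over_threshold(readings, limits))
```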