3. Smart Database Design
More than a few databases fail to adhere
to the principles of information architecture and, as a result, fail to
meet the organization's needs. In nearly every case, the root cause of the
failure is the database design: too complex, too clumsy, or just
plain inadequate. The side effects of a poor database design include
poorly written code because developers work around, not with, the
database schema; poor performance because the database engine is dealing
with improperly structured data; and an inflexible model that can't
grow with the organization it is supposed to support. The bottom line is
that good database design makes life easier for anyone who touches the
database. The database schema is the foundation of the database project,
and an elegant, simple design outperforms a complex one, both during the
development process and in the final performance of the database
application. This is the basic idea behind Smart Database Design.
Database System
A database system is a complex system
consisting of multiple components that interact with one another.
The performance of one component affects the performance of the other
components and thus the entire system. Stated another way, the design of
one component can set up the other components, and the whole system,
either to work well together or to frustrate those trying to make the
system work.
Every database system contains four broad
technologies or components: the database, the server platform, the
maintenance jobs, and the client's data access code, as shown in Figure 2. Each component affects the overall performance of the database system:
- The server environment is the physical hardware configuration
(CPUs, memory, disk spindles, and I/O bus), the operating system, and
the SQL Server instance configuration, which together provide the
working environment for the database. The server environment is
typically optimized by balancing the CPUs, memory, and I/O, and
identifying and eliminating bottlenecks.
- The database maintenance jobs are the steps that keep the
database running optimally (index defragmentation, DBCC integrity
checks, and maintaining index statistics); a brief sketch of typical
maintenance commands follows this list.
- The client application is the collection of data access
layers, middle tiers, front-end applications, ETL (extract, transform,
and load) scripts, report queries, or SQL Server Integration Services
(SSIS) packages that access the database. These can not only affect the
user's perception of database performance, but also reduce the
overall performance of the database system.
- Finally, the database component includes everything within
the data file: the physical schema, T-SQL code (queries, stored
procedures, user-defined functions [UDFs], and views), indexes, and
data.
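For example, the core maintenance steps typically map to a handful of
T-SQL commands. The following is a minimal sketch; the database, table,
and index names are hypothetical:

    -- Hypothetical examples of routine maintenance steps
    ALTER INDEX IX_SalesOrder_CustomerID ON dbo.SalesOrder REORGANIZE; -- defragment a fragmented index
    DBCC CHECKDB (N'SalesDB');                                         -- run an integrity check
    UPDATE STATISTICS dbo.SalesOrder;                                  -- refresh index statistics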
All four database components must function well
together to produce a high-performance database system; if one of the
components is weak, then the database system will fail or perform
poorly.
However, of these four components, the database is
the most difficult to design and the one that drives the
design of the other three. For example, the database workload
determines the hardware requirements. Maintenance jobs and data access
code are both designed around the database, and an overly complex
database can complicate both the maintenance jobs and the data access
code.
Physical Schema
The base layer of Smart Database Design
is the database's physical schema. The physical schema includes the
database's tables, columns, primary and foreign keys, and constraints.
Basically, the “physical” schema is what the server creates when you run
Data Definition Language (DDL) commands. Designing an elegant,
high-performance physical schema typically involves a team effort and
requires numerous design iterations and reviews.
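For example, a small fragment of a physical schema might be created with
DDL commands such as those in the following sketch; the table, column,
and constraint names are hypothetical:

    -- Hypothetical two-table schema: tables, columns, keys, and constraints
    CREATE TABLE dbo.Customer (
        CustomerID   INT IDENTITY(1,1) NOT NULL,
        CustomerName NVARCHAR(100)     NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY (CustomerID)
    );

    CREATE TABLE dbo.SalesOrder (
        SalesOrderID INT IDENTITY(1,1) NOT NULL,
        CustomerID   INT               NOT NULL,
        OrderDate    DATE              NOT NULL,
        OrderTotal   MONEY             NOT NULL,
        CONSTRAINT PK_SalesOrder PRIMARY KEY (SalesOrderID),
        CONSTRAINT FK_SalesOrder_Customer
            FOREIGN KEY (CustomerID) REFERENCES dbo.Customer (CustomerID),
        CONSTRAINT CK_SalesOrder_OrderTotal CHECK (OrderTotal >= 0)
    );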
Well-designed physical schemas avoid
over-complexity by generalizing similar types of objects, thereby
creating a schema with fewer entities. While designing the physical
schema, make the data obvious to the developer and easy to query. The
prime consideration when converting the logical database design into a
physical schema is how much work is required for a query to navigate the
data structures while maintaining a correctly normalized design. Not
only is such a schema a joy to use, but it is also easier to
code against, which reduces the chance of data integrity errors caused by
faulty queries.
Conversely, a poorly designed (either
non-normalized or overly complex) physical schema encourages developers
to write iterative code, code that uses temporary buckets to manipulate
data, or code that will be difficult to debug or maintain.
Agile Modeling
Agile development is popular for
good reasons. It gets the job done quickly and often produces a better
result than traditional methods. Agile development also fits well with
database design and development.
The traditional waterfall process
steps through four project phases: requirements gathering, design,
development, and implementation. Although this method may work well for
some endeavors, when creating software, the users often don't know what
they want until they see it, which pushes discovery beyond the
requirements gathering phase and into the development phase.
Agile development addresses this
problem by replacing the single long waterfall with numerous short
cycles or iterations. Each iteration builds out a working model that can
be tested and enables users to play with the software and further
discover their needs. When users see rapid progress and trust that new
features can be added, they become more willing to allow features to be
planned into the life cycle of the software, instead of insisting that
every feature be implemented in the next version.
A project might consist of a dozen of
these tight iterations; and with each iteration, more features are
fleshed out in the database and code. The principle of extensibility is
highlighted by an agile development process: as you cycle through
iterations, an extensible database absorbs new business requirements with
less refactoring.
Frankly, this is how requirements evolve. You never know everything up
front, so plan for and embrace the idea that your database needs to
evolve along with the rest of the project. This might include time built
into your schedule for design refactoring, aligning database design
tasks with iterative application coding cycles, or deferring design
decisions until requirements become more robust.
Set-Based Queries
SQL Server is designed to handle data in
sets. SQL is a declarative language, meaning that the SQL query
describes the problem, and the Query Optimizer generates an execution
plan to resolve the problem as a set.
Iterative T-SQL code is code that acts upon data
one row at a time instead of as a set. It is typically implemented via
cursors and forces the database engine to perform thousands of wasteful
single-row operations, instead of handling the problem in one larger,
more efficient set. The performance cost of these single-row operations
is huge. Depending on the task, SQL cursors perform about half as well
as set-based code, and the performance differential grows with the size
of the data. This is why set-based queries, based on an obvious physical
schema, are so critical to database performance.
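To illustrate the difference, the following hypothetical sketch flags old
orders for archiving first with a cursor and then as a single set-based
statement; the table and column names are invented for the example:

    -- Iterative approach: a cursor touches one row at a time
    DECLARE @SalesOrderID INT;
    DECLARE OrderCursor CURSOR FOR
        SELECT SalesOrderID FROM dbo.SalesOrder WHERE OrderDate < '2012-01-01';
    OPEN OrderCursor;
    FETCH NEXT FROM OrderCursor INTO @SalesOrderID;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE dbo.SalesOrder SET IsArchived = 1 WHERE SalesOrderID = @SalesOrderID;
        FETCH NEXT FROM OrderCursor INTO @SalesOrderID;
    END;
    CLOSE OrderCursor;
    DEALLOCATE OrderCursor;

    -- Set-based approach: one statement describes the whole problem and
    -- lets the Query Optimizer resolve it as a single set operation
    UPDATE dbo.SalesOrder
    SET IsArchived = 1
    WHERE OrderDate < '2012-01-01';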
A good physical schema and set-based queries set
up the database for excellent indexing, further improving the
performance of the query (refer to Figure 2).
However, queries cannot overcome the errors of a
poor physical schema and won't solve the performance issues of poorly
written code. It's simply impossible to fix a clumsy database design by
throwing code at it. Poor database designs tend to require extra code,
which performs poorly and is difficult to maintain. Unfortunately,
poorly designed databases also tend to have code that is tightly coupled
(referring directly to tables), instead of code that accesses the
database's abstraction layer (stored procedures and views). This makes
the database harder to refactor.
Indexing
An index is an organized pointer used to
locate information in a larger collection. An index is only useful when
it matches the needs of a question. In this case, it becomes the
shortcut between a question and the right answer. The key is to design
the fewest number of shortcuts between the right questions and the right
answers.
A sound indexing strategy identifies a handful of
queries that represent 90 percent of the workload and, with judicious
use of clustered indexes and covering indexes, solves the queries
without expensive lookup operations.
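For example, a covering index for one such frequent query might be
sketched as follows (hypothetical table and column names); the INCLUDE
columns allow the index to answer the query without a lookup back to the
base table:

    -- Key column supports the seek; included columns cover the query
    CREATE NONCLUSTERED INDEX IX_SalesOrder_CustomerID
        ON dbo.SalesOrder (CustomerID)
        INCLUDE (OrderDate, OrderTotal);

    -- This query can now be answered entirely from the index
    SELECT OrderDate, OrderTotal
    FROM dbo.SalesOrder
    WHERE CustomerID = 42;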
An elegant physical schema, well-written set-based
queries, and excellent indexing reduce transaction duration, which
implicitly improves concurrency and sets up the database for
scalability.
Nevertheless, indexes cannot overcome the
performance difficulties of iterative code. Poorly written SQL code that
returns unnecessary columns is difficult to index and will likely not
take advantage of covering indexes. Moreover, it's difficult to properly
index an overly complex or non-normalized physical schema.
Concurrency
SQL Server, as an ACID-compliant
database engine, supports transactions that are atomic, consistent,
isolated, and durable. Whether the transaction is a single statement or
an explicit transaction within BEGIN TRAN…COMMIT TRAN
statements, locks are typically used to prevent one transaction from
seeing another transaction's uncommitted data. Transaction isolation is
great for data integrity, but locking and blocking hurt performance.
Multi-user concurrency can be tuned by limiting
the extraneous code within logical transactions, setting the transaction
isolation level no higher than required, and keeping trigger code to a
minimum.
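For example, a tight logical transaction might look like the following
sketch (the table is hypothetical): the isolation level is set no higher
than the work requires, and only the statements that must succeed or fail
together sit inside BEGIN TRAN…COMMIT TRAN:

    -- Keep the transaction short: no extraneous work inside it
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    BEGIN TRAN;
        UPDATE dbo.Account SET Balance = Balance - 100 WHERE AccountID = 1;
        UPDATE dbo.Account SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT TRAN;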
A database with an excellent physical schema,
well-written set-based queries, and the right set of indexes will have
tight transactions and perform well with multiple users.
When a poorly designed database displays symptoms
of locking and blocking issues, transaction isolation level tuning only
partially alleviates the problem. The sources of the concurrency issue
are the long transactions and additional workload caused by the poor
database schema, lack of set-based queries, or missing indexes.
Concurrency tuning cannot overcome the deficiencies of a poor database
design.
Advanced Scalability
With each release, Microsoft has
consistently enhanced SQL Server for the enterprise. The following
technologies can extend the scalability of heavily transactional databases.
The Resource Governor can restrict the resources
available for different sets of queries, enabling the server to maintain
the service-level agreement (SLA) for some queries at the expense of
other less critical queries.
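A minimal Resource Governor configuration might resemble the following
sketch, with hypothetical pool, group, and login names; the classifier
function routes reporting sessions into a capped pool so they cannot
starve the critical workload:

    -- Cap reporting queries at 20 percent of CPU
    CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 20);
    CREATE WORKLOAD GROUP ReportGroup USING ReportPool;
    GO

    -- Classifier function (created in master) assigns sessions to the group
    CREATE FUNCTION dbo.fnClassifier() RETURNS SYSNAME
    WITH SCHEMABINDING
    AS
    BEGIN
        IF SUSER_SNAME() = N'ReportLogin'
            RETURN N'ReportGroup';
        RETURN N'default';
    END;
    GO

    ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifier);
    ALTER RESOURCE GOVERNOR RECONFIGURE;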
Indexed views were introduced in SQL Server 2000.
They actually materialize the view as a clustered index and can enable
queries to select from joined data without hitting the joined tables, or
to pre-aggregate data. In effect, an indexed view is a custom covering
index that can cover across multiple tables.
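For example, an indexed view that pre-aggregates order totals might be
sketched as follows (hypothetical names); COUNT_BIG(*) is required when
an indexed view uses GROUP BY:

    -- Schema-bound view that pre-aggregates order totals per customer
    CREATE VIEW dbo.vCustomerOrderTotals
    WITH SCHEMABINDING
    AS
    SELECT CustomerID,
           SUM(OrderTotal) AS TotalSales,
           COUNT_BIG(*)    AS OrderCount
    FROM dbo.SalesOrder
    GROUP BY CustomerID;
    GO

    -- Materializing the view as a unique clustered index lets queries read
    -- the aggregate without touching the base table
    CREATE UNIQUE CLUSTERED INDEX IX_vCustomerOrderTotals
        ON dbo.vCustomerOrderTotals (CustomerID);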
Partitioned tables can automatically segment data
across multiple filegroups, which can serve as an auto-archive device.
By reducing the size of the active data partition, the requirements for
maintaining the data, such as defragging the indexes, are also reduced.
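A minimal partitioning sketch might look like the following, with
hypothetical names; in practice each range would typically map to its own
filegroup rather than to the primary filegroup:

    -- Partition orders by year so the active partition stays small
    CREATE PARTITION FUNCTION pfOrderYear (DATE)
        AS RANGE RIGHT FOR VALUES ('2013-01-01', '2014-01-01', '2015-01-01');

    CREATE PARTITION SCHEME psOrderYear
        AS PARTITION pfOrderYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.SalesOrderArchive (
        SalesOrderID INT   NOT NULL,
        OrderDate    DATE  NOT NULL,
        OrderTotal   MONEY NOT NULL
    ) ON psOrderYear (OrderDate);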
Service Broker can collect transactional data and
process it after the fact, thereby providing an “over time” load
leveling as it spreads a 5-second peak load over a 1-minute execution
without delaying the calling transaction.
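A heavily condensed Service Broker sketch (hypothetical names) might look
like the following; the calling transaction only sends a message, and a
separate activation procedure or job drains the queue and does the heavy
processing later:

    -- One-time setup: message type, contract, queue, and service
    CREATE MESSAGE TYPE OrderPlaced VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT OrderContract (OrderPlaced SENT BY INITIATOR);
    CREATE QUEUE OrderQueue;
    CREATE SERVICE OrderService ON QUEUE OrderQueue (OrderContract);

    -- Inside the calling transaction: just send the message and move on
    DECLARE @Handle UNIQUEIDENTIFIER;
    BEGIN DIALOG CONVERSATION @Handle
        FROM SERVICE OrderService TO SERVICE 'OrderService'
        ON CONTRACT OrderContract
        WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @Handle
        MESSAGE TYPE OrderPlaced (N'<Order ID="1001" />');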
Columnstore indexes, introduced in SQL Server
2012, are column-based indexes (rather than traditional row-based
indexes) that can greatly improve query speed in certain database
environments.
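For example, a nonclustered columnstore index of the SQL Server 2012
style might be sketched as follows (hypothetical table); note that in
SQL Server 2012 this type of index makes the table read-only:

    -- Column-oriented storage speeds up large scan-and-aggregate queries
    CREATE NONCLUSTERED COLUMNSTORE INDEX IX_SalesOrder_ColumnStore
        ON dbo.SalesOrder (CustomerID, OrderDate, OrderTotal);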
Although these high-scalability features can
extend the scalability of a well-designed database, they are limited in
their ability to add performance to a poorly designed database, and they
cannot overcome long transactions caused by a lack of indexes,
iterative code, or the many other problems caused by an overly
complex database design.
The database component is the principal factor
determining the overall monetary cost of the database. A well-designed
database minimizes hardware costs, simplifies data access code and
maintenance jobs, and significantly lowers both the initial and the
total cost of the database system.
A Performance Framework
By describing the dependencies between
the schema, queries, indexing, transactions, and scalability, Smart
Database Design is a framework for performance.
The key to mastering Smart Database Design is to
understand the interaction, or cause-and-effect relationship, between
these hierarchical layers (schema, queries, indexing, and concurrency).
Each layer enables the next layer; conversely, no layer can overcome
deficiencies in lower layers. The practical application of Smart
Database Design takes advantage of these dependencies when developing or
optimizing a database by employing the right best practices within each
layer to support the next layer.
Reducing the aggregate workload of the database
component has a positive effect on the rest of the database system. An
efficient database component reduces the performance requirements of the
server platform, increasing capacity. Maintenance jobs are easier to
plan and also execute faster when the database component is designed
well. There is less client access code to write and the code that needs
to be written is easier to write and maintain. The result is an overall
database system that's simpler to maintain, cheaper to run, easier to
connect to from the data access layer, and that scales beautifully.
Although it's not a perfect analogy, picturing a
water fountain on a hot summer day can help demonstrate how shorter
transactions improve overall database performance. If everyone takes a
small, quick sip from the fountain, then no queue forms; but as soon as
someone fills up a liter-sized bottle, others begin to wait. Regardless
of the amount of hardware resources available to a database, time is
finite, and the greatest performance gain is obtained by eliminating the
excess work of wastefully long transactions. Striving for database
design excellence is a smart business move with an excellent estimated
return on investment (ROI). Further, early investment in thoughtful
database design can pay huge dividends in saved development and
maintenance time. In the long term, it's far cheaper to design the
database correctly than to throw money or labor at project overruns or
hardware upgrades.
The cause-and-effect relationship between the
layers helps diagnose performance problems as well. When a system is
experiencing locking and blocking problems, the cause is likely found in
the indexing or query layers. These issues can be caused by poorly
written code. However, the root cause isn't always the code; it is often
the overly complex, non-normalized database design that is driving the
developers to write horrid code.
The bottom line? Designing an elegant database
schema is the first step toward maximizing the performance of the overall
database system while reducing costs.
Issues and Objections
Some common objections to the Smart
Database Design framework are addressed here. Some say that buying more
hardware is the best way to improve performance. Although this is a
viable and sometimes necessary solution, it can mask underlying data
architecture issues that will probably crop up later. Performance
problems tend to grow exponentially as database size grows, whereas
hardware performance grows more or less linearly over time. You can almost
predict when the “best” hardware available no longer suffices to get
acceptable performance. Sometimes companies spend incredible amounts to
upgrade their hardware and see little or no improvement because the
bottleneck is the transaction locking and blocking and poor code.
Sometimes, a faster CPU only waits faster. Strategically, reducing the
workload is cheaper than increasing the capacity of the hardware.
Some argue that they would like to apply Smart
Database Design but can't because the database is a third-party
database, and they can't modify the schema or the code. True, for most
third-party products, the database schema and queries are not open for
optimization, and this can be frustrating if the database needs
optimization. However, most vendors are interested in improving their
product and keeping their clients happy. (Both clients and vendors have
contracted with the author to help identify areas of opportunity and
suggest solutions for the next revision.)
Some say they'd like to apply Smart Database
Design, but they can't because any change to the schema would break
hundreds of other objects. It's true — databases without abstraction
layers are expensive to alter. An abstraction layer decouples the
database from the client applications, making it possible to change the
database component without affecting the client applications. In the
absence of a well-designed abstraction layer, the first step toward
gaining system performance is to create one. As expensive as it may seem
to refactor the database and every application so that all
communications go through an abstraction layer, the cost of not doing so
could be that IT can't respond to the organization's needs, forcing the
company to outsource or develop wasteful extra databases. At the worst,
the failure of the database to be extensible could force the end of the
organization.
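For example, a minimal abstraction layer might consist of views and
stored procedures such as the following sketch (hypothetical names);
client applications call these objects instead of referencing the tables
directly, so the underlying schema can be refactored without breaking the
callers:

    -- View that abstracts the join between customers and orders
    CREATE VIEW dbo.vOrderSummary
    AS
    SELECT c.CustomerName, o.OrderDate, o.OrderTotal
    FROM dbo.SalesOrder AS o
    JOIN dbo.Customer   AS c ON c.CustomerID = o.CustomerID;
    GO

    -- Stored procedure that abstracts the insert
    CREATE PROCEDURE dbo.pSalesOrder_Insert
        @CustomerID INT,
        @OrderDate  DATE,
        @OrderTotal MONEY
    AS
    INSERT INTO dbo.SalesOrder (CustomerID, OrderDate, OrderTotal)
    VALUES (@CustomerID, @OrderDate, @OrderTotal);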
In both cases, the third-party database and the
missing abstraction layer, it's still a good idea to optimize at the
lowest level possible and then move up the layers. However, the best
performance gains are made when you can start optimizing at the lowest
level of the database component, the physical schema.