3. Smart Database Design
More than a few databases fail to adhere
to the principles of information architecture and, as a result, fail to
meet the organization's needs. In nearly every case, the root cause of the
failure is the database design: too complex, too clumsy, or just
plain inadequate. The side effects of a poor database design include
poorly written code because developers work around, not with, the
database schema; poor performance because the database engine is dealing
with improperly structured data; and an inflexible model that can't
grow with the organization it is supposed to support. The bottom line is
that good database design makes life easier for anyone who touches the
database. The database schema is the foundation of the database project,
and an elegant, simple design outperforms a complex one, both during the
development process and in the final performance of the database
application. This is the basic idea behind Smart Database Design.
Database System
A database system is a complex system
consisting of multiple components that interact with one another.
The performance of one component affects the performance of the other
components and thus the entire system. Stated another way, the design of
one component can set up the other components, and the whole system,
either to work well together or to frustrate those trying to make the
system work.
Every database system contains four broad
technologies or components: the database, the server platform, the
maintenance jobs, and the client's data access code, as shown in Figure 2. Each component affects the overall performance of the database system:
- The server environment is the physical hardware configuration
(CPUs, memory, disk spindles, and I/O bus), the operating system, and
the SQL Server instance configuration, which together provide the
working environment for the database. The server environment is
typically optimized by balancing the CPUs, memory, and I/O, and
identifying and eliminating bottlenecks.
- The database maintenance jobs are the steps that keep the
database running optimally (index defragmentation, DBCC integrity
checks, and maintaining index statistics); a brief sketch of typical
maintenance commands follows this list.
- The client application is the collection of data access
layers, middle tiers, front-end applications, ETL (extract, transform,
and load) scripts, report queries, or SQL Server Integration Services
(SSIS) packages that access the database. These can not only affect the
user's perception of database performance, but also reduce the
overall performance of the database system.
- Finally, the database component includes everything within
the data file: the physical schema, T-SQL code (queries, stored
procedures, user-defined functions [UDFs], and views), indexes, and
data.
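For example, the core maintenance steps typically map to a handful of
T-SQL commands. The following is a minimal sketch; the database, table,
and index names are hypothetical:

    -- Hypothetical examples of routine maintenance steps
    ALTER INDEX IX_SalesOrder_CustomerID ON dbo.SalesOrder REORGANIZE; -- defragment a fragmented index
    DBCC CHECKDB (N'SalesDB');                                         -- run an integrity check
    UPDATE STATISTICS dbo.SalesOrder;                                  -- refresh index statistics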
All four database components must function well
together to produce a high-performance database system; if one of the
components is weak, then the database system will fail or perform
poorly.
However, of these four components, the database is
the most difficult to design and the one that drives the
design of the other three. For example, the database workload
determines the hardware requirements. Maintenance jobs and data access
code are both designed around the database, and an overly complex
database can complicate both the maintenance jobs and the data access
code.
Physical Schema
The base layer of Smart Database Design
is the database's physical schema. The physical schema includes the
database's tables, columns, primary and foreign keys, and constraints.
Basically, the “physical” schema is what the server creates when you run
Data Definition Language (DDL) commands. Designing an elegant,
high-performance physical schema typically involves a team effort and
requires numerous design iterations and reviews.
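For example, a small fragment of a physical schema might be created with
DDL commands such as those in the following sketch; the table, column,
and constraint names are hypothetical:

    -- Hypothetical two-table schema: tables, columns, keys, and constraints
    CREATE TABLE dbo.Customer (
        CustomerID   INT IDENTITY(1,1) NOT NULL,
        CustomerName NVARCHAR(100)     NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY (CustomerID)
    );

    CREATE TABLE dbo.SalesOrder (
        SalesOrderID INT IDENTITY(1,1) NOT NULL,
        CustomerID   INT               NOT NULL,
        OrderDate    DATE              NOT NULL,
        OrderTotal   MONEY             NOT NULL,
        CONSTRAINT PK_SalesOrder PRIMARY KEY (SalesOrderID),
        CONSTRAINT FK_SalesOrder_Customer
            FOREIGN KEY (CustomerID) REFERENCES dbo.Customer (CustomerID),
        CONSTRAINT CK_SalesOrder_OrderTotal CHECK (OrderTotal >= 0)
    );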
Well-designed physical schemas avoid
over-complexity by generalizing similar types of objects, thereby
creating a schema with fewer entities. While designing the physical
schema, make the data obvious to the developer and easy to query. The
prime consideration when converting the logical database design into a
physical schema is how much work is required for a query to navigate the
data structures while maintaining a correctly normalized design. Not
only is such a schema a joy to use, but it is also easier to
code against, which reduces the chance of data integrity errors caused by
faulty queries.
Conversely, a poorly designed (either
non-normalized or overly complex) physical schema encourages developers
to write iterative code, code that uses temporary buckets to manipulate
data, or code that will be difficult to debug or maintain.
Agile Modeling
Agile development is popular for
good reasons. It gets the job done quickly and often produces a better
result than traditional methods. Agile development also fits well with
database design and development.
The traditional waterfall process
steps through four project phases: requirements gathering, design,
development, and implementation. Although this method may work well for
some endeavors, when creating software, the users often don't know what
they want until they see it, which pushes discovery beyond the
requirements gathering phase and into the development phase.
Agile development addresses this
problem by replacing the single long waterfall with numerous short
cycles or iterations. Each iteration builds out a working model that can
be tested and enables users to play with the software and further
discover their needs. When users see rapid progress and trust that new
features can be added, they become more willing to allow features to be
planned into the life cycle of the software, instead of insisting that
every feature be implemented in the next version.
A project might consist of a dozen of
these tight iterations; and with each iteration, more features are
fleshed out in the database and code. The principle of extensibility is
highlighted by an agile development process: as you cycle through
iterations, an extensible database absorbs new business requirements with
less refactoring.
Frankly, this is how requirements evolve. You never know everything up
front, so plan for and embrace the idea that your database needs to
evolve along with the rest of the project. This might include time built
into your schedule for design refactoring, aligning database design
tasks with iterative application coding cycles, or deferring design
decisions until requirements become more robust.
Set-Based Queries
SQL Server is designed to handle data in
sets. SQL is a declarative language, meaning that the SQL query
describes the problem, and the Query Optimizer generates an execution
plan to resolve the problem as a set.
Iterative T-SQL code is code that acts upon data
one row at a time instead of as a set. It is typically implemented via
cursors and forces the database engine to perform thousands of wasteful
single-row operations, instead of handling the problem in one larger,
more efficient set. The performance cost of these single-row operations
is huge. Depending on the task, SQL cursors perform about half as well
as set-based code, and the performance differential grows with the size
of the data. This is why set-based queries, based on an obvious physical
schema, are so critical to database performance.
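To illustrate the difference, the following hypothetical sketch flags old
orders for archiving first with a cursor and then as a single set-based
statement; the table and column names are invented for the example:

    -- Iterative approach: a cursor touches one row at a time
    DECLARE @SalesOrderID INT;
    DECLARE OrderCursor CURSOR FOR
        SELECT SalesOrderID FROM dbo.SalesOrder WHERE OrderDate < '2012-01-01';
    OPEN OrderCursor;
    FETCH NEXT FROM OrderCursor INTO @SalesOrderID;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE dbo.SalesOrder SET IsArchived = 1 WHERE SalesOrderID = @SalesOrderID;
        FETCH NEXT FROM OrderCursor INTO @SalesOrderID;
    END;
    CLOSE OrderCursor;
    DEALLOCATE OrderCursor;

    -- Set-based approach: one statement describes the whole problem and
    -- lets the Query Optimizer resolve it as a single set operation
    UPDATE dbo.SalesOrder
    SET IsArchived = 1
    WHERE OrderDate < '2012-01-01';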
A good physical schema and set-based queries set
up the database for excellent indexing, further improving the
performance of the query (refer to Figure 2).
However, queries cannot overcome the errors of a
poor physical schema and won't solve the performance issues of poorly
written code. It's simply impossible to fix a clumsy database design by
throwing code at it. Poor database designs tend to require extra code,
which performs poorly and is difficult to maintain. Unfortunately,
poorly designed databases also tend to have code that is tightly coupled
(referring directly to tables), instead of code that accesses the
database's abstraction layer (stored procedures and views). This makes
the database harder to refactor.
Indexing
An index is an organized pointer used to
locate information in a larger collection. An index is only useful when
it matches the needs of a question. In this case, it becomes the
shortcut between a question and the right answer. The key is to design
the fewest number of shortcuts between the right questions and the right
answers.
A sound indexing strategy identifies a handful of
queries that represent 90 percent of the workload and, with judicious
use of clustered indexes and covering indexes, solves the queries
without expensive lookup operations.
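For example, a covering index for one such frequent query might be
sketched as follows (hypothetical table and column names); the INCLUDE
columns allow the index to answer the query without a lookup back to the
base table:

    -- Key column supports the seek; included columns cover the query
    CREATE NONCLUSTERED INDEX IX_SalesOrder_CustomerID
        ON dbo.SalesOrder (CustomerID)
        INCLUDE (OrderDate, OrderTotal);

    -- This query can now be answered entirely from the index
    SELECT OrderDate, OrderTotal
    FROM dbo.SalesOrder
    WHERE CustomerID = 42;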
An elegant physical schema, well-written set-based
queries, and excellent indexing reduce transaction duration, which
implicitly improves concurrency and sets up the database for
scalability.
Nevertheless, indexes cannot overcome the
performance difficulties of iterative code. Poorly written SQL code that
returns unnecessary columns is difficult to index and will likely not
take advantage of covering indexes. Moreover, it's difficult to properly
index an overly complex or non-normalized physical schema.
Concurrency
SQL Server, as an ACID-compliant
database engine, supports transactions that are atomic, consistent,
isolated, and durable. Whether the transaction is a single statement or
an explicit transaction within BEGIN TRAN…COMMIT TRAN
statements, locks are typically used to prevent one transaction from
seeing another transaction's uncommitted data. Transaction isolation is
great for data integrity, but locking and blocking hurt performance.
Multi-user concurrency can be tuned by limiting
the extraneous code within logical transactions, setting the transaction
isolation level no higher than required, and keeping trigger code to a
minimum.
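For example, a tight logical transaction might look like the following
sketch (the table is hypothetical): the isolation level is set no higher
than the work requires, and only the statements that must succeed or fail
together sit inside BEGIN TRAN…COMMIT TRAN:

    -- Keep the transaction short: no extraneous work inside it
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    BEGIN TRAN;
        UPDATE dbo.Account SET Balance = Balance - 100 WHERE AccountID = 1;
        UPDATE dbo.Account SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT TRAN;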
A database with an excellent physical schema,
well-written set-based queries, and the right set of indexes will have
tight transactions and perform well with multiple users.
When a poorly designed database displays symptoms
of locking and blocking issues, transaction isolation level tuning only
partially alleviates the problem. The sources of the concurrency issue
are the long transactions and additional workload caused by the poor
database schema, lack of set-based queries, or missing indexes.
Concurrency tuning cannot overcome the deficiencies of a poor database
design.
Advanced Scalability
With each release, Microsoft has
consistently enhanced SQL Server for the enterprise. The following
technologies can extend the scalability of heavily transactional databases.
The Resource Governor can restrict the resources
available for different sets of queries, enabling the server to maintain
the service-level agreement (SLA) for some queries at the expense of
other less critical queries.
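A minimal Resource Governor configuration might resemble the following
sketch, with hypothetical pool, group, and login names; the classifier
function routes reporting sessions into a capped pool so they cannot
starve the critical workload:

    -- Cap reporting queries at 20 percent of CPU
    CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 20);
    CREATE WORKLOAD GROUP ReportGroup USING ReportPool;
    GO

    -- Classifier function (created in master) assigns sessions to the group
    CREATE FUNCTION dbo.fnClassifier() RETURNS SYSNAME
    WITH SCHEMABINDING
    AS
    BEGIN
        IF SUSER_SNAME() = N'ReportLogin'
            RETURN N'ReportGroup';
        RETURN N'default';
    END;
    GO

    ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifier);
    ALTER RESOURCE GOVERNOR RECONFIGURE;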
Indexed views were introduced in SQL Server 2000.
They actually materialize the view as a clustered index and can enable
queries to select from joined data without hitting the joined tables, or
to pre-aggregate data. In effect, an indexed view is a custom covering
index that can cover across multiple tables.
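For example, an indexed view that pre-aggregates order totals might be
sketched as follows (hypothetical names); COUNT_BIG(*) is required when
an indexed view uses GROUP BY:

    -- Schema-bound view that pre-aggregates order totals per customer
    CREATE VIEW dbo.vCustomerOrderTotals
    WITH SCHEMABINDING
    AS
    SELECT CustomerID,
           SUM(OrderTotal) AS TotalSales,
           COUNT_BIG(*)    AS OrderCount
    FROM dbo.SalesOrder
    GROUP BY CustomerID;
    GO

    -- Materializing the view as a unique clustered index lets queries read
    -- the aggregate without touching the base table
    CREATE UNIQUE CLUSTERED INDEX IX_vCustomerOrderTotals
        ON dbo.vCustomerOrderTotals (CustomerID);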
Partitioned tables can automatically segment data
across multiple filegroups, which can serve as an auto-archive device.
By reducing the size of the active data partition, the requirements for
maintaining the data, such as defragging the indexes, are also reduced.
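A minimal partitioning sketch might look like the following, with
hypothetical names; in practice each range would typically map to its own
filegroup rather than to the primary filegroup:

    -- Partition orders by year so the active partition stays small
    CREATE PARTITION FUNCTION pfOrderYear (DATE)
        AS RANGE RIGHT FOR VALUES ('2013-01-01', '2014-01-01', '2015-01-01');

    CREATE PARTITION SCHEME psOrderYear
        AS PARTITION pfOrderYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.SalesOrderArchive (
        SalesOrderID INT   NOT NULL,
        OrderDate    DATE  NOT NULL,
        OrderTotal   MONEY NOT NULL
    ) ON psOrderYear (OrderDate);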
Service Broker can collect transactional data and
process it after the fact, thereby providing an “over time” load
leveling as it spreads a 5-second peak load over a 1-minute execution
without delaying the calling transaction.
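A heavily condensed Service Broker sketch (hypothetical names) might look
like the following; the calling transaction only sends a message, and a
separate activation procedure or job drains the queue and does the heavy
processing later:

    -- One-time setup: message type, contract, queue, and service
    CREATE MESSAGE TYPE OrderPlaced VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT OrderContract (OrderPlaced SENT BY INITIATOR);
    CREATE QUEUE OrderQueue;
    CREATE SERVICE OrderService ON QUEUE OrderQueue (OrderContract);

    -- Inside the calling transaction: just send the message and move on
    DECLARE @Handle UNIQUEIDENTIFIER;
    BEGIN DIALOG CONVERSATION @Handle
        FROM SERVICE OrderService TO SERVICE 'OrderService'
        ON CONTRACT OrderContract
        WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @Handle
        MESSAGE TYPE OrderPlaced (N'<Order ID="1001" />');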
Columnstore indexes, introduced in SQL Server
2012, are column-based indexes (rather than traditional row-based
indexes) that can greatly improve query speed in certain database
environments.
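For example, a nonclustered columnstore index of the SQL Server 2012
style might be sketched as follows (hypothetical table); note that in
SQL Server 2012 this type of index makes the table read-only:

    -- Column-oriented storage speeds up large scan-and-aggregate queries
    CREATE NONCLUSTERED COLUMNSTORE INDEX IX_SalesOrder_ColumnStore
        ON dbo.SalesOrder (CustomerID, OrderDate, OrderTotal);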
Although these high-scalability features can
extend the scalability of a well-designed database, they are limited in
their ability to add performance to a poorly designed database, and they
cannot overcome long transactions caused by a lack of indexes,
iterative code, or the many other problems caused by an overly
complex database design.
The database component is the principal factor
determining the overall monetary cost of the database. A well-designed
database minimizes hardware costs, simplifies data access code and
maintenance jobs, and significantly lowers both the initial and the
total cost of the database system.
A Performance Framework
By describing the dependencies between
the schema, queries, indexing, transactions, and scalability, Smart
Database Design is a framework for performance.
The key to mastering Smart Database Design is to
understand the interaction, or cause-and-effect relationship, between
these hierarchical layers (schema, queries, indexing, and concurrency).
Each layer enables the next layer; conversely, no layer can overcome
deficiencies in lower layers. The practical application of Smart
Database Design takes advantage of these dependencies when developing or
optimizing a database by employing the right best practices within each
layer to support the next layer.
Reducing the aggregate workload of the database
component has a positive effect on the rest of the database system. An
efficient database component reduces the performance requirements of the
server platform, increasing capacity. Maintenance jobs are easier to
plan and also execute faster when the database component is designed
well. There is less client access code to write and the code that needs
to be written is easier to write and maintain. The result is an overall
database system that's simpler to maintain, cheaper to run, easier to
connect to from the data access layer, and that scales beautifully.
Although it's not a perfect analogy, picturing a
water fountain on a hot summer day can help demonstrate how shorter
transactions improve overall database performance. If everyone takes a
small, quick sip from the fountain, then no queue forms; but as soon as
someone fills up a liter-sized bottle, others begin to wait. Regardless
of the amount of hardware resources available to a database, time is
finite, and the greatest performance gain is obtained by eliminating the
excess work of wastefully long transactions. Striving for database
design excellence is a smart business move with an excellent estimated
return on investment (ROI). Further, early investment in thoughtful
database design can pay huge dividends in saved development and
maintenance time. In the long term, it's far cheaper to design the
database correctly than to throw money or labor at project overruns or
hardware upgrades.
The cause-and-effect relationship between the
layers helps diagnose performance problems as well. When a system is
experiencing locking and blocking problems, the cause is likely found in
the indexing or query layers. These issues can be caused by poorly
written code. However, the root cause isn't always the code; it is often
the overly complex, non-normalized database design that is driving the
developers to write horrid code.
The bottom line? Designing an elegant database
schema is the first step toward maximizing the performance of the overall
database system while reducing costs.
Issues and Objections
Some common objections to the Smart
Database Design framework are addressed here. Some say that buying more
hardware is the best way to improve performance. Although this is a
viable and sometimes necessary solution, it can mask underlying data
architecture issues that will probably crop up later. Performance
problems tend to grow exponentially as database size grows, whereas
hardware performance grows more or less linearly over time. You can almost
predict when the “best” hardware available no longer suffices to get
acceptable performance. Sometimes companies spend incredible amounts to
upgrade their hardware and see little or no improvement because the
bottleneck is the transaction locking and blocking and poor code.
Sometimes, a faster CPU only waits faster. Strategically, reducing the
workload is cheaper than increasing the capacity of the hardware.
Some argue that they would like to apply Smart
Database Design but can't because the database is a third-party
database, and they can't modify the schema or the code. True, for most
third-party products, the database schema and queries are not open for
optimization, and this can be frustrating if the database needs
optimization. However, most vendors are interested in improving their
product and keeping their clients happy. (Both clients and vendors have
contracted with the author to help identify areas of opportunity and
suggest solutions for the next revision.)
Some say they'd like to apply Smart Database
Design, but they can't because any change to the schema would break
hundreds of other objects. It's true — databases without abstraction
layers are expensive to alter. An abstraction layer decouples the
database from the client applications, making it possible to change the
database component without affecting the client applications. In the
absence of a well-designed abstraction layer, the first step toward
gaining system performance is to create one. As expensive as it may seem
to refactor the database and every application so that all
communications go through an abstraction layer, the cost of not doing so
could be that IT can't respond to the organization's needs, forcing the
company to outsource or develop wasteful extra databases. At the worst,
the failure of the database to be extensible could force the end of the
organization.
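For example, a minimal abstraction layer might consist of views and
stored procedures such as the following sketch (hypothetical names);
client applications call these objects instead of referencing the tables
directly, so the underlying schema can be refactored without breaking the
callers:

    -- View that abstracts the join between customers and orders
    CREATE VIEW dbo.vOrderSummary
    AS
    SELECT c.CustomerName, o.OrderDate, o.OrderTotal
    FROM dbo.SalesOrder AS o
    JOIN dbo.Customer   AS c ON c.CustomerID = o.CustomerID;
    GO

    -- Stored procedure that abstracts the insert
    CREATE PROCEDURE dbo.pSalesOrder_Insert
        @CustomerID INT,
        @OrderDate  DATE,
        @OrderTotal MONEY
    AS
    INSERT INTO dbo.SalesOrder (CustomerID, OrderDate, OrderTotal)
    VALUES (@CustomerID, @OrderDate, @OrderTotal);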
In both cases, the third-party database and the
missing abstraction layer, it's still a good idea to optimize at the
lowest level possible and then move up the layers. However, the best
performance gains are made when you can start optimizing at the lowest
level of the database component, the physical schema.