SQL Server 2008 : Index Vocabulary, Structure, and Concepts

6/17/2011 11:35:35 AM

1. Heaps

When data is stored for a table within your database when no clustered indexes are present, the table is stored in what is called a heap. System objects, like sys.indexes and sys.partitions, with index ID values of zero identify the table as a heap table. In most instances, heaps are not the preferred method of storing your tables because of the access method required for retrieving data. When no indexes exist on a table, then scanning the entire table is the only method for retrieving data from the table. After the execution of a query against the table, SQL Server has to check every row in the table to determine if it meets the conditions of the query. The method of scanning an entire heap table is referred to as a table scan. Most of the time, table scans are not the most efficient method of retrieving data from a table; however, there are times when SQL Server optimizer will determine that a table scan is the most efficient method to retrieve the data.

2. Clustered Indexes

Clustered indexes are one of the two main types of indexes supported in SQL Server. A clustered index stores the rows of its underlying table in the sorted order of the clustered index keys. The clustered index keys are the column(s) you select to represent how you want data ordered on disk. Since the order of the table is determined by the clustered index, only one clustered index can be defined on a table. Clustered indexes are represented in system tables and views, like sys.objectssys.partitions, by the value 1 in the index ID column. and

Queries against tables that contain clustered indexes may perform at extremely fast rates or dreadfully slow rates. When indexes are involved, you can scan an entire index looking for the data that meets your criteria or seek a record based on the supplied criteria. In other words, if you are looking for an employee with an identification number of 100 and the table has a clustered index on the identification number, then SQL Server will seek employee 100, which is very fast. However, if you are looking for every employee who has a first name of Joe and you have a clustered index on an identification number, then SQL Server is going to scan the entire clustered index looking for every Joe. That can take a while if you have a large number of employees in the table.

3. Nonclustered Indexes

Nonclustered indexes are the second main type of index supported in SQL Server. A nonclustered index contains the index key value and the row identifier for finding the rest of the data for every row in the table. Nonclustered indexes can be created on clustered index and heap tables. When a nonclustered index is created on a clustered index table, the row identifier holds the clustered index key and that key points to the actual data. If the nonclustered index exists on a heap table, then the row identifier points to the address of the data pages. Regardless of the table type, clustered or heap, retrieving the actual data pages requires additional steps, thus increasing IO. Ideally, you want to decrease IO, not increase it.

SQL Server 2008 allows you to create up to 999 clustered indexes on a given table, and you can specify the list of columns for each of those. But keep this in mind: Just because SQL Server 2008 allows you to perform certain actions does not mean that you are required to do so.

Nonclustered indexes are represented in SQL Server 2008 system tables and views with an index ID greater than 1. SQL Server 2008 will scan or seek on a nonclustered index when the index keys are useful in finding the data based on the criteria of the query being executed.

4. Structure of Indexes and the Heap

Indexes within SQL Server 2008 are stored in B-tree data structure. The B-tree data structure is divided into three basic parts: the root (top-most level of the tree), the leaf level (bottom-most portion of the tree), and the intermediate level (everything in between). The root and intermediate levels contain the keys specified during creation and depending on the index type, clustered or nonclustered index, the row identifier with the clustered index key or the address of the data pages. On the intermediate and leaf level pages, SQL Server uses a double linked list to navigate from page to page. Based on the search criteria of the query, SQL Server will traverse the tree starting at the root heading downward until it reaches the leaf level pages. Let's take a closer look at the structure of both types of indexes.

4.1. Clustered Index Structure

Keep in mind that clustered indexes determine the order of the data on disk. The key of a clustered index will be sorted in ascending or descending order. For the next couple of examples, let's assume that you have a table with over 300 employees in it. The employee table has several columns that include employee ID, first name, and last name. If you cluster the employee table on the employee ID, then Figure 1 demonstrates the structure of that clustered index on disk. The root level page contains employee ID 100 and 200, with pointers to the pages that are greater than 200 or less than 200. The intermediate level contains the key value of employee IDs 1 and 100 and 200 and 300, along with pointers to the page where that data is stored. The leaf level pages actually contain all the data of each employee ID within the table.

Figure 1. The clustered index structure

The following steps list the process that SQL Server goes through to retrieve data from clustered index structures.

Query the sys.system_internals_allocation_units to identify the root node page address for the index.
Compare the value from the query to the values on the root page.
Identify the uppermost key value that is less than or equal to the value from the query.
Trace the pointer to the page specified key value down to the next level.
Repeat steps 3 and 4 until you reach the leaf level pages.
Once the leaf level pages are reached, SQL Server searches through data pages looking for the query value. If no data pages are found, the query will return no results.

4.2. Nonclustered Index Structure

Nonclustered index structures look highly similar to the clustered index structures. Nonclustered indexes have three main levels as well: the root level, the leaf level, and the intermediate levels. The main difference between the clustered index structure and the nonclustered index structure is the data stored within the leaf level and the data row locator within the structure. Remember that the leaf level pages in a nonclustered index contain references to data pages or to index keys instead of the actual data.

Figure 2 shows the structure of the nonclustered index. The example uses a first name for the index key. The root level of the nonclustered index contains the first name along with the pointer to the next page and the data row locator. The intermediate level has the first name ranges divided evenly, as well as the next page and data row locators. The leaf level pages contain the keys, or first names, in this example. If the underlying table is organized by a clustered index, then the leaf level pages of the nonclustered index will contain clustered index keys (also called data row locators). If the underlying table is a heap, then the leaf level pages will contain pointers to data pages in the heap.

Figure 2. The nonclustered indexes with pointers to the clustered index

4.3. Heap Structure

We would like to briefly mention the structure of heaps in this section. Unlike clustered and nonclustered indexes, the data stored within a heap table has no structure. The data stored within a heap table is linked by the index allocation map (IAM) pages that are allocated for the object. The sys.system_internals_allocation_units systems object points to the first IAM page where the data is stored. Because of the lack of order within the heap, SQL Server has to scan each page of the entire heap to determine what data meets the criteria of any given query. Because of this, table scans on heaps are often times inefficient. You should avoid scans on heaps as much as possible.

5. Indexes Created by Constraints

This section ensures you understand the terminology used when people mention primary and unique indexes.

5.1. Primary Indexes

What in the world is a primary index? I really don't know, but I have heard this term used frequently by non-database administrators. (Well, that's my story and I'm sticking to it.) Generally, most people who use the term primary index are actually referring to the index created after they specify the primary key for a table. The primary key can create a clustered or nonclustered index that forces uniqueness within the key values and does not allow null values. When someone says that the primary index is the main index used for retrieving data, just smile and say yes; the clustered index will determine the order of the data and will be used frequently. You can correct the person if you want, but we often find it easier to be subtle and slide the correct terms in there when you get the chance.

5.2. Unique Indexes

Unique indexes are created on the key columns of an index. A unique index forces distinctness among the data for the specified columns. Tables can have multiple unique indexes. Unique indexes allow nulls to exist in the key columns. From a SQL Server perspective, it does not matter if a unique index or a unique constraint is created first; the validation process of the data is the same, and the result will consist of a unique index. After creating the index, validation of the index keys occur for the data in the table to ensure that the values are unique.

The validation of the key values of a unique index happens before you insert data into the table. If you are attempting to insert multiple records into a table and one record violates the unique index, then none of the data will be inserted and the entire statement will fail unless you set the IGNORE_DUP_KEY option to ON.

In batches where unique indexes are violated, the IGNORE_DUP_KEY option allows insertion of the records that don't violate the unique index, while only the offenders fail.

6. Other Ways to Categorize Indexes

There are many ways to categorize index types. The distinction between clustered and nonclustered is fundamental, and you'll commonly encounter the terms primary index and unique index. But there are other ways to divide indexes into different types. The following sections describe some of the types you'll encounter in your work as a DBA.

6.1. Composite Indexes

Composite indexes are those created on objects that contain multiple columns. Composite indexes may include up to 16 columns that are all from the same table or view. The combined column values of a composite index cannot be greater than 900 bytes. Pay close attention to the structure of your columns while creating your indexes. Make sure the index keys are in the same order as the WHERE clause of your queries. SQL Server is more likely to use the index when the two are in the same order.

6.2. Filtered Indexes

Filtered indexes are a new feature of SQL Server 2008 whereby you can create nonclustered indexes on subsets of data from the table. Think about a WHERE clause on a table: If you want to minimize the number of records retrieved from a table, then you specify a condition within the WHERE clause to prevent that data from being returned. Well, the filtered index works in a similar fashion—you specify the rules to exclude data from an index. Remember, nonclustered indexes contain rows for all the data within a table. Filtered indexes will prevent all of the data from being stored within the index.

Filtered indexes provide benefits like improved query performance and data storage. From a performance perspective, looking for data within a smaller subset of records will decrease the seek time and increase query performance. Filtered indexes will also utilize less space on disk because the index does not have to store a record for every row within the table.

6.3. XML Indexes

SQL Server 2008 supports indexes on the XML data type. XML indexes support two types: a primary index and a secondary index. The primary XML index must be created first and represents all of the tags, data, and paths for each row in the XML column within the table. To increase performance, you can create secondary XML indexes on the path, value, and properties of the XML column depending on how the XML columns are queried. Once you understand the data that will be retrieved from the XML column, spend some time designing the appropriate index strategy on the column.

7. Other Index Concepts and Terminology

The following sections describe other terms and concepts that you'll encounter when talking about indexes. Understanding and applying these concepts will increase your chances of creating effective and useful indexes.

7.1. Include Columns

Include columns are non-key columns stored on the leaf level of nonclustered indexes. Keep in mind that key columns of nonclustered indexes are stored at the intermediate and leaf levels of indexes, while include columns are only stored on the leaf levels. Include columns also allow you to bypass the restrictions set for the composite keys by allowing you to have more than 16 columns of a size larger than 900 bytes. You can have 1023 columns in an INCLUDE statement without a size limit on disk. The primary benefit of include columns and composite keys stem from the usefulness of covering your queries.

7.2. Covering Queries

Covering your query generally means having the values needed for the SELECT clause and the WHERE clause available in your nonclustered indexes. Because the data is available in your nonclustered indexes, SQL Server does not have to perform costly lookup operations to gather the remaining information needed for the query. Your first thought may be to cover all queries, but there is a cost associated with that. You have to be careful and determine just which queries you want to cover based on the priority of the queries within your system.

7.3. Searchable Arguments

Searchable arguments (SARGs) are mentioned in SQL Server when discussing methods for minimizing or filtering the results of a query. In other words, a searchable argument is used when comparing a constant to columns in WHERE clauses and ON statements in a JOIN clause. Examples of searchable arguments include the following:

LastName = 'carstarphen'
income > 50000
birthMonth between 'January' and 'February'
jobTitle like '%Administrator%'

Using SARGs are the methods SQL Server uses to identify an index to aid in data retrieval. When creating indexes, you definitely want to think about the frequently accessed searchable arguments used by the application.

7.4. Cardinality

When creating indexes in SQL Server, you always want to think about the cardinality of your data. Cardinality in SQL Server refers to the uniqueness of the data within the columns. The higher the cardinality, the less the data is duplicated in a given column. The lower the cardinality, the more the data is the same within the column. The query optimizer will utilize the cardinality of the data when determining the most efficient execution plan to use.

Indexes that have a lot of duplication, or low cardinality, will not be nearly as useful as an index with very little duplication, or a high cardinality. Think about a person table that has three columns in it: Social Security Number, Age, and Favorite Color (only basic colors). The Social Security Number column will have a higher cardinality than the Age column, which will have a higher cardinality than Favorite Color. When creating your indexes, you want to create them on Social Security number first, followed by age, and then by color if you absolutely have to. Clearly, looking for someone's Social Security number will be faster than finding a person by their favorite color. You are going to return more rows when searching by colors, requiring some additional filtering after returning the data.

Other -----------------

- BizTalk 2009 : The Enterprise Service Bus Toolkit 2.0 - The Architecture (part 2) - Adapter Providers & Mediation Policies

- BizTalk 2009 : The Enterprise Service Bus Toolkit 2.0 - The Architecture (part 1) - Mediation Components & Resolvers

- BizTalk 2009 : The Enterprise Service Bus Toolkit 2.0 - BizTalk and the ESB Concept

- SQL Server 2008 High Availability : Log Shipping (part 2) - SharePoint and Log Shipping

- SQL Server 2008 High Availability : Log Shipping (part 1) - How to Configure Log Shipping

- Windows Server 2008 R2 : Manage Internet Information Services (part 2) - Remotely Manage IIS Servers & Manage IIS with PowerShell

- Windows Server 2008 R2 : Manage Internet Information Services (part 1) - Work with the IIS Management Console

- Microsoft Dynamics CRM 2011 : Merging Account or Contact Records

- Microsoft Dynamics CRM 2011 : Assigning Accounts and Contacts to Other Users

- BizTalk 2009 : Host Integration Server 2009 - Performance Testing and Tuning