SharePoint 2010 Search : Setting Up the Crawler - The Search Service Application & Indexing

8/18/2011 4:31:25 PM

1. The Search Service Application

SharePoint 2010 is designed to support many business tasks, and a logical structure is needed to control and organize all those functions. For this reason, many of the essential services delivered by SharePoint are separated into what Microsoft calls service applications, each of which independently controls one of the tasks that SharePoint performs. Service applications can also be individually configured for performance and scaling.

The Search components of SharePoint 2010, for many reasons, including scaling, configurability, and performance, are therefore isolated into the Search service application, which is an application layer for configuring the back-end functionality of SharePoint search. Almost all the configuration directly related to the search components is done in the Search service application. However, as we will see, a great deal of supporting configuration may be required in the User Profile service application, the Managed Metadata service, or the Business Data Connectivity service. These services help extend SharePoint 2010 Search to address a variety of business needs.

There are often many ways to get to the same pages in SharePoint. The most direct route is outlined here.

Open Central Administration. On the main page of SharePoint Central Administration, there are eight sections. Under Application Management (as shown in Figure 1), choose "Manage service applications".

Figure 1. Choose "Manage service applications" from the Application Management menu.

The Service Applications page shows all the service applications running in the SharePoint farm and their status. Scroll down and choose the Search Service Application option (Figure 2).

Figure 2. The Search Service Application option

The Search Service Application page shows a System Status section and a Crawl History section, as well as a navigation pane on the left with four sections: Administration, Crawling, Queries and Results, and Reports. Examine the information in the System Status section. This is the starting point for most Search-related administration tasks.

1.1. Default Content Access Account

SharePoint's crawler needs a user account to access content. It makes requests to SharePoint and other content sources in much the same way that a user requests content through a browser and waits for a reply. The reply it gets often depends on the account that makes the request. Some content sources restrict access to specific content based on user credentials, so assigning the wrong user as SharePoint's default content access account (Figure 3) can adversely affect the outcome of crawls.

Make sure a user with appropriate permissions to crawl SharePoint is set as the default content access account on the Search Service Application page. This user should have read access to all content that should be crawled. This user should not be an administrator, because an administrator can also read documents in an unpublished state, and those documents would then be crawled.

Figure 3. The default content access account

If there are content sources that do not recognize the default content access account, special crawl rules can be created to use a different user for those sources.
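The relationship between the default content access account and crawl rules can be sketched in a few lines of Python. This is only an illustration of the idea, not SharePoint's implementation: the `CrawlRule` class, the `account_for` function, the account names, and the URLs are all hypothetical, and the "first matching rule wins" behavior is an assumption made for the sketch.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Sequence

# Hypothetical account name used only for this sketch.
DEFAULT_CONTENT_ACCESS_ACCOUNT = r"CONTOSO\svc-crawl"


@dataclass(frozen=True)
class CrawlRule:
    """Hypothetical crawl rule: a URL pattern plus an optional override account."""
    path_pattern: str
    account: Optional[str] = None  # None means "use the default account"


def account_for(url: str, rules: Sequence[CrawlRule]) -> str:
    """Pick the account for a URL; the first matching rule wins (an assumption)."""
    for rule in rules:
        # Compare case-insensitively; '*' matches any run of characters.
        if fnmatchcase(url.lower(), rule.path_pattern.lower()):
            return rule.account or DEFAULT_CONTENT_ACCESS_ACCOUNT
    return DEFAULT_CONTENT_ACCESS_ACCOUNT


# One source that does not recognize the default account gets its own user.
rules = [CrawlRule("http://legacy.example.com/*", account=r"CONTOSO\svc-legacy")]

legacy_account = account_for("http://legacy.example.com/docs/a.aspx", rules)
intranet_account = account_for("http://intranet.example.com/", rules)
```

Here the legacy source is crawled with its own account, and every other URL falls back to the default content access account.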

2. Indexing

Indexing is the process of collecting data and storing it in a data structure that a search application can query to locate the original data. This data structure is usually called a search index. Some indexes contain all the searchable information. Others, such as SharePoint's, store the words found in the documents and pointers to more information about those documents in another database. In SharePoint the index is held on the query servers, and the document data and data related to the crawler and its administration are held on the database servers. However, for the purpose of this section, we will discuss only indexing as the process to create both the indexes and the related search databases.
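The split between an index of words and a separate store of document details can be illustrated with a small inverted index. This is a minimal sketch of the general technique, not SharePoint's actual index format; the sample documents, the `document_store` dictionary, and the function names are invented for the example.

```python
import re
from collections import defaultdict

# Stand-in for the search databases: document details keyed by document ID.
document_store = {
    1: {"title": "Crawl schedules", "url": "http://intranet/docs/1"},
    2: {"title": "Query servers", "url": "http://intranet/docs/2"},
    3: {"title": "Crawler accounts", "url": "http://intranet/docs/3"},
}

# The raw text the crawler collected for each document.
texts = {
    1: "Configure the crawler schedule for full and incremental crawls",
    2: "The query server holds the index",
    3: "Crawler accounts need read access",
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Look the word up in the index, then follow the pointers to the store."""
    doc_ids = sorted(index.get(word.lower(), set()))
    return [document_store[doc_id]["title"] for doc_id in doc_ids]


index = build_index(texts)
crawler_hits = search(index, "crawler")
```

The index itself holds only words and document IDs; titles and URLs are fetched from the separate store once the matching IDs are known.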

SharePoint 2010 can crawl and index a number of different file types and content types from different sources. In this section, we will discuss the different content sources and how to set up the crawler to index each one.

Out of the box, SharePoint can index the following content sources:

Web content (HTTP and HTTPS)
SharePoint user profile databases
Lotus Notes
Exchange public folders
File shares
Business Connectivity Services-connected content
Other sources where a connector is provided (e.g., Documentum)

These sources can be divided into two types: structured and unstructured content.

2.1. Structured Content

Structured content is content that has a defined structure that can generally be queried to retrieve specific items. Relational databases, such as Microsoft SQL Server, allow their content to be retrieved if the user or the user interface knows where the data sits, for example the key of the row and the column that holds the value. Most relational databases have their own indices to help locate these rows. These indices generally do not perform well for free text search. A search engine's index structure will perform much better at finding all of the occurrences of a particular term in a timely manner.
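The difference between a keyed lookup and a free text query can be seen in a database's query plan. The sketch below uses SQLite from the Python standard library purely as a convenient stand-in for a relational database (it is not SQL Server, and the `tickets` table and its rows are invented). It asks the engine how it would run each query rather than timing anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO tickets (id, body) VALUES (?, ?)",
    [
        (1, "Crawler fails on the file share"),
        (2, "Search results page is slow"),
        (3, "Please reset the crawler account password"),
    ],
)


def plan(sql, params=()):
    """Return the human-readable steps of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


# Keyed lookup: the engine can jump straight to the row.
by_id_plan = plan("SELECT body FROM tickets WHERE id = ?", (2,))

# Free text lookup: every row has to be examined for the term.
free_text_plan = plan("SELECT id FROM tickets WHERE body LIKE ?", ("%crawler%",))

matching_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM tickets WHERE body LIKE ? ORDER BY id", ("%crawler%",)
    )
]
```

In SQLite the first plan is reported as a `SEARCH` on the primary key and the second as a `SCAN` of the whole table, which is the cost a dedicated search index is designed to avoid.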

When we combine unstructured and structured content, or even two disparate structured content sources, we lose the ability to simply look up a known location to find the specific data. Additionally, different databases' indices seldom, if ever, work together. This is where a search engine becomes crucial. SharePoint's search components can index both unstructured and structured content, store them together, return them in a homogenized result set, filter based on determined metadata, and lead the end user to the specific source system.
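As a rough illustration of a homogenized result set, the following sketch normalizes a database row and a file into one common item shape, searches them together, and filters on metadata. Everything in it is hypothetical (the `IndexedItem` class, the source names, the URLs, and the `owner` field), and it scans the items linearly for brevity instead of building a real index.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IndexedItem:
    """Common shape for results, whatever system the content came from."""
    title: str
    text: str
    source: str  # which system the item came from
    url: str     # where to send the user to open the item
    metadata: Dict[str, str] = field(default_factory=dict)


def from_database_row(row: dict) -> IndexedItem:
    """Normalize a structured record (hypothetical CRM row)."""
    return IndexedItem(
        title=row["name"],
        text=row["name"] + " " + row["notes"],
        source="crm-database",
        url="crm://customers/%d" % row["id"],
        metadata={"owner": row["owner"]},
    )


def from_document(path: str, text: str, author: str) -> IndexedItem:
    """Normalize an unstructured document (hypothetical file share path)."""
    return IndexedItem(
        title=path.rsplit("/", 1)[-1],
        text=text,
        source="file-share",
        url="file:" + path,
        metadata={"owner": author},
    )


def search(items: List[IndexedItem], term: str, **filters: str) -> List[IndexedItem]:
    """Return items containing the term whose metadata matches every filter."""
    term = term.lower()
    return [
        item
        for item in items
        if term in item.text.lower()
        and all(item.metadata.get(key) == value for key, value in filters.items())
    ]


items = [
    from_database_row(
        {"id": 7, "name": "Contoso Ltd", "notes": "Renewal contract signed", "owner": "alice"}
    ),
    from_document("//fileserver/legal/contract-template.docx", "Standard contract template", "bob"),
]

all_hits = search(items, "contract")
alice_hits = search(items, "contract", owner="alice")
```

Both sources come back in the same list, each hit carries the `source` and `url` needed to lead the user to the originating system, and the `owner` filter narrows the set using metadata alone.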

SharePoint 2010 has a powerful feature for indexing structured content. This feature, called Business Connectivity Services, allows administrators to define connectors to structured data sources and index the content from them in a logical and organized manner, making that data searchable and useful from SharePoint.

BCS is capable of collecting content out of the box from:

Microsoft SQL Server databases
.NET assemblies

Additionally, custom connectors can be created to allow it to index almost any other content source, including:

Other databases
Line-of-business applications such as Siebel and SAP
Other enterprise resource planning (ERP) systems
Many other applications and databases

2.2. Unstructured Content

Unstructured content refers to content that is not set in a strict structure such as a relational database. Unstructured content can be e-mails, documents, or web pages. Unstructured content is the biggest challenge for searching as it requires the search engine to look for specific terms across a huge corpus of free text. Unstructured search is often referred to as "free text" search.

Out of the box, SharePoint 2010 can index the following unstructured content sources: