1. The Search Service Application
SharePoint 2010 is designed
to achieve many business tasks, and a logical structure is important to
control and organize all those functions. For this reason, SharePoint
is broken into separate services. Many of the essential services
delivered by SharePoint are broken into what Microsoft has called
service applications, which can control, independently, the different
tasks that SharePoint performs. They can also be individually configured
for performance and scaling.
The Search components of
SharePoint 2010, for many reasons, including scaling, configurability,
and performance, are therefore isolated into the Search service
application, which is an application layer for configuring the back-end
functionality of SharePoint search. Almost all the configuration
directly related to the search components is done in the Search service
application. However, as we will see, a great deal of supporting
configuration may be required in the User Profile service application,
the Managed Metadata service, or the Business Data Connectivity service.
These services help extend SharePoint 2010 Search to address a variety
of business needs.
There are often many
ways to get to the same pages in SharePoint. The most direct route is
outlined here.
Open
Central Administration. On the main page of SharePoint Central
Administration, there are eight sections. Under Application Management
(as shown in Figure 1), choose "Manage service applications".
The
Service Applications page shows all the service applications running in
the SharePoint farm and their status. Scroll down and choose the Search
Service Application option (Figure 2).
The
Search Service Application page shows a System Status and a Crawl
History section as well as a navigation to the left with four sections:
Administration, Crawling, Queries and Results, and Reports. Examine the
information in the System Status section. This is the starting point for
most Search-related administration tasks.
1.1. Default Content Access Account
SharePoint's crawler requires a
user to access content and makes requests to SharePoint and other
content sources. It makes standard requests to these content sources
much the same way that a user requests content through a browser and
waits for a reply. The reply it gets often depends on what user it makes
those requests with. Some content sources may restrict access to
specific content based on user credentials, and having the wrong user
applied to SharePoint's default content access account (Figure 3) can adversely affect the outcome of crawls.
Make sure a user with
appropriate permissions to crawl SharePoint is set on the default
content access account on the Search Service Application page. This user
should have read access to all content that should be crawled. This
user should not be an administrator, as documents in an unpublished
state would be crawled.
If there are content
sources that do not recognize the default content access account,
special crawl rules can be created to use a different user for those
sources.
2. Indexing
Indexing is the process of
collecting data and storing it in a data structure that can be accessed
by an application that can query the index and point to data in a
database. This data structure is usually called a search index. Some
indexes contain all the searchable information. Others, such as
SharePoint's, store the words found in the documents and pointers to
more information about those documents in another database. In
SharePoint the index is held on the query servers, and the document data
and data related to the crawler and its administration are held on the
database servers. However, for the purpose of this section, we will
discuss only indexing as the process to create both the indexes and the
related search databases.
SharePoint 2010 can crawl and
index a number of different file types and content types from different
sources. In this section, we will discuss the different content sources
and how to set up the crawler to index each one.
Out of the box, SharePoint can index the following content sources:
Web content (HTTP and HTTPS)
SharePoint user profile databases
Lotus Notes
Exchange public folders
File shares
Business Connectivity Services-connected content
Other sources where a connector is provided (e.g., Documentum)
These different sources can be divided into two different types: structured and unstructured content.
2.1. Structured Content
Structured content is
content that has a defined structure that can generally be queried to
retrieve specific items. Relational databases, such as Microsoft SQL
Server, are structures that allow their content to be retrieved if you
know the row and column ID of the cell where that data sits. Databases
allow their content to be retrieved if the user or the user interface
knows how to acquire the location of the data. Most relational databases
have their own indices to help locate these IDs. These are generally
not very performant and do not support free text search well. A search
engine database structure will perform much better at finding all of the
occurrences of a particular term in a timely manner.
When we marry unstructured
and structured content or even two disparate structured content sources,
we lose the ability to simply look up cell IDs to find the specific
data. Additionally, different databases' indices seldom, if ever, work
together. This is where a search engine becomes crucial. SharePoint's
search components can index both unstructured and structured content,
store them together, return them in a homogenized result set, filter
based on determined metadata, and lead the end user to the specific
source system.
SharePoint 2010 has a
powerful feature for indexing structured content. This feature, called
Business Connectivity Services, allows administrators to define
connectors to structured data sources and index the content from them in
a logical and organized manner, making that data searchable and useful
from SharePoint.
BCS is capable of collecting content out of the box from
MS SQL Databases
.Net assemblies
Additionally, custom connectors can be created to allow it to index almost any other content source, including
Other databases
Line-of-business applications such as Seibel and SAP
Other enterprise resource planning (ERP) systems
Many other applications and databases
2.2. Unstructured Content
Unstructured content
refers to content that is not set in a strict structure such as a
relational database. Unstructured content can be e-mails, documents, or
web pages. Unstructured content is the biggest challenge for searching
as it requires the search engine to look for specific terms across a
huge corpus of free text. Unstructured search is often referred to as
"free text" search.
Out of the box, SharePoint 2010 can index the following unstructured content sources: