1. Introducing Enterprise Search
With the release of SharePoint
2010, Microsoft has significantly improved its enterprise search
capabilities. The acquisition of FAST provided Microsoft with its
flagship enterprise search product.
The goals for these enterprise search solutions are to
Ensure that
enterprise data from multiple systems can be indexed. This includes
collaborative data stored in SharePoint sites, files in file shares, Web
pages in other websites, third-party repositories, and other
line-of-business systems such as CRM databases or ERP solutions.
Ensure that content from multiple enterprise repositories can be queried both independently and from within the context of your business applications.
Ensure that search results are ranked accurately; accurate ranking is essential if you expect users to adopt and use the search capabilities.
Ensure that your enterprise search solution identifies people and expertise within your organization.
SharePoint 2010 provides an enterprise search platform for fulfilling these aims. As a brief overview,
SharePoint 2010 includes a Connector Framework that enables the crawler
to index files, metadata, and other types of data from various types of
content sources. It also provides an indexing engine that stores the
crawled data in an efficient manner in index files, and it provides
query servers, query object models, and user interfaces for performing
searches on the indexed data.
The suite of SharePoint 2010 technologies now includes five search products.
Microsoft SharePoint Foundation 2010
Microsoft Search Server 2010 Express
Microsoft Search Server 2010
Microsoft SharePoint 2010
FAST Search Server 2010 for SharePoint
The first two have no additional licensing cost beyond that of the server operating system and IIS connectivity. Table 1 provides a high-level comparison of the search features provided by each product.
Table 1. Comparing Features of the SharePoint 2010 Search Products
FEATURE | SHAREPOINT FOUNDATION 2010 | SEARCH SERVER 2010 EXPRESS | SEARCH SERVER 2010 | SHAREPOINT 2010 | FAST SEARCH SERVER 2010 FOR SHAREPOINT |
---|---|---|---|---|---|
Basic site search | X | X | X | X | X |
Best bets | | X | X | X | X |
Visual best bets | | | | | X |
Similar results | | | | | X |
Duplicate results | | | | | X |
Search scopes | | X | X | X | X |
Search enhancement with user context | | | | | X |
Crawled and managed properties | | X | X | X | X |
Query federation | | X | X | X | X |
Query suggestions | | X | X | X | X |
Sort results on managed properties or rank profiles | | | | | X |
Relevancy tuning by document or site promotions | | X | X | X | X |
Shallow results refinement | | X | X | X | X |
Deep results refinement | | | | | X |
Document Preview | | | | | X |
Windows 7 Federation | | X | X | X | X |
People search | | | | X | X |
Social search | | | | X | X |
Taxonomy integration | | | | X | X |
Multi-tenant hosting | | | | X | X |
Rich Web indexing support | | | | | X |
Although there are no hard-coded limits on the number of items that can be indexed by any of the products listed in Table 1, performance imposes some practical guidelines.
SharePoint Foundation 2010 can index and search up to 10 million items per search server.
Search
Server 2010 Express can index and search up to 300,000 items if it is
used with SQL Server Express, due to database size limitations. With SQL
2008, it can index 10 million items but is limited to a single
index/query server role.
A scaled-out Search Server 2010 farm or SharePoint 2010 farm can index and search up to 100 million items.
A FAST Search Server 2010 for SharePoint installation can support extreme scale and can index and search over a billion items.
2. Search Architecture
SharePoint Server 2007 defined two search roles, Query and Index. With the modularity of SharePoint 2010, these functions are defined as components, because any server can host multiple instances of each. Also, with the changes to the indexing process, the Index role has been renamed the crawl component.
In SharePoint Server 2007,
there were several bottlenecks in the search components. The most
obvious was the single index server. However, even with multiple query
components, the index was a single component and its size presented
issues with replication, loading into memory, and traversing to retrieve
results.
The design of the search components in SharePoint 2010 was driven by scalability goals. These goals were accomplished by separating the system into components, some of which can be scaled out to remove bottlenecks or duplicated for resiliency. These components will be introduced as part of the discussion of the search processes.
2.1. Search Tools
Before discussing the
major functions of Search, you need to be familiar with three tools that
are critical to both crawl and query processes. In addition, there are a
couple of tools used only by the crawl process.
2.1.1. Language Detection
SharePoint 2010 search supports 53 languages, covering the languages spoken by 97 percent of the world's population. During crawling and querying, the language of the content is detected automatically so that the appropriate language-specific components, such as word breakers, can be applied.
2.1.2. Word Breakers
When the formatting is
removed from crawled content and query strings, part of that formatting
includes the spaces between words. A process is then required to break
the string of characters into words before other processes can be
applied to the words. Word breakers (or tokenizers) separate text into “tokens” during both indexing and querying. In general, word breakers split words at spaces, punctuation, and special characters, except underscores.
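For illustration, the following Windows PowerShell snippet approximates the default splitting rule described above; it is only a sketch, not the actual language-aware word breaker that SharePoint uses.
# Illustration only: approximate the default rule of splitting at spaces,
# punctuation, and special characters, while keeping underscores.
$text = "Contoso_Sales report: Q4-2010 results (final)"
$tokens = $text -split '[^\w]+' | Where-Object { $_ }   # \w includes the underscore
$tokens
# Tokens produced: Contoso_Sales, report, Q4, 2010, results, final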
This process is language
dependent. SharePoint can recognize the language within the stream even
if the language changes within a document. Along with language
detection, word breakers have been greatly enhanced in this new
product, especially in languages other than English. Particular
attention has been focused on compound words, which are handled
differently in different languages.
2.1.3. Custom Dictionaries
Custom
dictionaries are still supported in SharePoint 2010. A custom
dictionary file defines words or terms for the word breaker to consider
as complete words. Custom
dictionaries are not provided by default and are language specific, so
you must create a separate custom dictionary for each language in which
you need to modify the word breaker’s behavior. The language-neutral
word breaker does not use a custom dictionary. Thesaurus and Noise Word
files can be application specific, but custom dictionaries apply farm-wide to all search applications.
Custom dictionaries instruct the word breaker for their language that a particular combination of characters is a complete word that should not be broken up into separate tokens. Both the indexing and querying processes use any existing custom dictionary that matches the language and dialect of the word breaker being used. If a word exists in the custom dictionary, it is considered complete for indexing.
Terms that include special characters would be prime candidates for custom
dictionaries. For instance, AT&T includes an ampersand that would
be used by a word breaker to separate the term into two tokens, “AT” and
“T,” which are both noise (or stop) words and would be ignored in
search queries. If the term “AT&T” were included in a custom
dictionary for the appropriate language, then it would be treated as a
unique word in both instances. You might also have hyphenated words that
your organization’s search system needs to treat as complete unique
words; you would add such words to a custom dictionary.
Many industries have number sequences that need to be indexed. For instance, a number sequence in the “725.5046.1.1” format would be broken into tokens at the decimal points. To index the complete sequence, you must add each such number to the custom dictionaries for every language in which it appears so that the entire number is treated as a single token by the index and query processes.
Custom dictionaries can be created in Notepad but must be saved in Unicode format. The file name must follow the CustomNNNN.lex convention, where NNNN is the hex code of the language.
Within the file, each entry must be on its own line, terminated by a carriage return (CR) and line feed (LF). Other custom dictionary rules are as follows.
Entries are not case sensitive.
The pipe (|) character is not permitted.
No spaces are permitted.
The pound sign (#) character can be used anywhere in an entry except at the beginning.
Any other alphanumeric characters, punctuation, symbols, and breaking characters are valid.
The maximum length of an entry is 128 (Unicode) characters.
The custom dictionary files
must be saved in the folder that contains the word breakers on all index
and query servers in the farm. By default, this is the C:\Program
Files\Microsoft Office Servers\14.0\bin folder.
You must restart the
osearch14 service on all index and query servers in the farm after these
custom dictionary files have been created or modified. Do not use the
Services On Server page in Central Administration to stop and start the
service. Either use the Services MMC in Administrative Tools or type net stop osearch14 followed by net start osearch14 at the command line.
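As a minimal sketch, the following Windows PowerShell commands create an English (US) custom dictionary (0409 is the en-US language hex code) in the default location and then restart the service; the entries and path are illustrative.
# Sketch: write two sample entries to an English (US) custom dictionary.
# Each entry is on its own line, and the file is saved in Unicode format.
$entries = "AT&T", "725.5046.1.1"
$path = "C:\Program Files\Microsoft Office Servers\14.0\bin\Custom0409.lex"
$entries | Out-File -FilePath $path -Encoding Unicode

# Restart the search service so the new dictionary is picked up.
net stop osearch14
net start osearch14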
Note:
For a list of the supported language hex codes, see http://technet.microsoft.com/en-us/library/cc263242.aspx.
2.1.4. iFilters
iFilters tell the crawler how to “crack open” a file and identify and index its contents. They are used only by the crawl process. iFilters for many file types ship with SharePoint 2010, including ascx, asm,
asp, aspx, bat, c, cmd, cpp, css, cxx, dev, dic, doc, docm, docx, dot,
eml, h, hhc, hht, hta, htm, html, htw, lnk, mht, mhtml, mpx, msg, odc,
pot, pps, ppt, pptm, pptx, pub, stm, tif, trf, vsd, xlb, xlc, xls, xlsm,
xlsx, xlt, and xml.
If you do not have an iFilter
for a defined file type, the crawler gathers only the metadata for that
document. You can add third-party iFilters or write your own. For
example, if you need to index PDF files, then you can go to the Adobe
website and download its iFilter for indexing PDF files. The download includes the installer and supporting files needed to install the PDF iFilter successfully.
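After the iFilter is installed, the file extension must also appear in the search service application’s file types list. A minimal sketch, assuming the standard SharePoint 2010 cmdlets Get-SPEnterpriseSearchServiceApplication and New-SPEnterpriseSearchCrawlExtension:
# Sketch: register the pdf extension with the search service application
# so the crawler will process PDF files with the newly installed iFilter.
$ssa = Get-SPEnterpriseSearchServiceApplication
New-SPEnterpriseSearchCrawlExtension -SearchApplication $ssa -Name "pdf"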
2.1.5. Connectors
SharePoint 2010 indexing can use either the Connector Framework or the protocol handler API, so your custom protocol handlers written for SharePoint Server 2007 will still work. The
protocol handler API will not be supported in future products, however.
The SharePoint 2010 indexing connector has improved reliability and performance, provided by a higher-fidelity conversation with the SharePoint Web service. It also reduces the load on SharePoint sites with the addition of security-only crawls, improved batching of change log processing, and caching of common list item information. Improvements in telemetry and reporting provide views of security transactions processed in crawls and new crawl log functionality, such as filtering the crawl log by “top-level errors.”
The connector framework provides improvements over the protocol handler API.
Attachments can now be crawled.
Item-level security descriptors can now be retrieved for external data exposed by Business Connectivity Services.
When crawling a Business Connectivity Services entity, additional entities can be crawled, maintaining the entity relationships.
Time stamp–based incremental crawls are supported.
Change log crawls that can remove items that have been deleted since the last crawl are supported.
Standard shared SharePoint 2010 indexing connectors that are also used by FAST Search for SharePoint include the following.
SharePoint content
The crawler accesses data through the SharePoint Web service and uses
Windows authentication (integrated) credentials. This connector supports
full crawl using enumeration of content, but for incremental crawls it
uses the change logs, including deleted items. It has built-in support
for both NT and pluggable security trimming.
File shares
The crawler accesses the file shares through their hierarchical
structure with Windows authentication (integrated). ACLs are collected
during crawl time and changes can be detected with time stamps or
changes to the ACL.
Websites This protocol
handler uses link traversal as the crawl method but does not provide a
security trimmer. SharePoint sites can be configured to use this
connector when they permit anonymous access, which reduces the overhead
on both sides of the crawling effort. By default, if SharePoint sites
are listed in a content source that uses this connector, the crawler
will automatically switch to the SharePoint content connector.
People profiles
People profiles are enumerated and crawled via the profile pages of the
My Site host. Since the “Exposed” information selections are not truly
NT ACLs, only information within a profile that is exposed to “Everyone”
will be crawled.
Lotus Notes
This connector used a protocol handler in Microsoft Office SharePoint
Server 2007 but has been changed to use the Connector Framework in this
version.
Exchange public folders Likewise, Exchange public folders were crawled with an HTTP protocol handler previously but now use the Connector Framework.
External systems
Custom connectors are much easier to build in SharePoint Designer 2010
using the Connector Framework than with a protocol handler. These custom
connectors can be built by creating external content types for the
Business Data Connectivity Service or by using existing external content
types.
Note:
Microsoft SharePoint 2010 Indexing Connector for Documentum can be downloaded from http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=32d0980a-4b9a-4b0d-868e-9be9c0d75435.
2.2. Search Components and Processes
The term content source
refers to both the target servers that contain the content that needs
to be indexed and the definitions of the start addresses, indexing
schedules, and basic rules to instruct the crawler on how to crawl
(index) the content. Crawl
rules are used to further define how the content of specific URLs will
be crawled, such as defining a different security context under which
the crawler should crawl the content source instead of using the default
crawling account.
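As a minimal sketch, a content source and a crawl rule can be created from the SharePoint 2010 Management Shell; the share path, names, and exclusion below are illustrative.
# Sketch: define a file-share content source and a crawl rule that excludes an archive path.
$ssa = Get-SPEnterpriseSearchServiceApplication
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Name "Departmental Shares" -Type File -StartAddresses "\\fileserver\departments"

# Exclude the archive folder from the crawl.
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa -Path "\\fileserver\departments\archive\*" -Type ExclusionRule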
Connectors provide the ability to connect to different content sources using different connection protocols, such as FTP, HTTP, or RPC. Protocol
handlers used in SharePoint Server 2007 are still supported but are
being replaced by the Connector Framework. Some connectors can be built
in Microsoft SharePoint Designer without writing managed code.
There are really two parts to
gathering the information from a content source. The first part is the
enumeration of the content items that should be crawled.
The connectors will connect to the content source and will walk the URL
addresses of each content item, depositing each URL of each content
item in the MSSCrawlQueue
table. Although the basic search engine runs under Mssearch.exe (the OSearch14 service), when a crawl is started, a new process under Mssearch.exe named Mssdmn.exe is spawned, and it is under this process that the content source is both enumerated and then crawled. After a sufficient
number of URLs are placed in the MSSCrawlQueue, another Mssdmn.exe is
spawned, and this process goes back to the content source connecting to
each URL in batches (as determined by the crawler impact rules), opening
each document at each URL in the current batch and then downloading
first the metadata about the document and then the document’s contents.
Both of these data streams are run through several components in the
process pipeline and then the content is placed in the full-text index
while the metadata is placed in the SQL database property store.
When you start a new crawl of a
content source, you’ll notice that for a brief period of time, the
status of the content source appears as Starting. During this time, the Mssdmn.exe process (if it isn’t already started) is being spawned, the connector is connecting to the content source to establish
the connection through which the content will be indexed, and the URLs
of each content item are being deposited into the MSSCrawlQueue. You’ll
see the status change to Crawling when the crawler is ready to open
documents and download their content.
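A minimal sketch of starting a crawl and watching this status from the SharePoint 2010 Management Shell; the content source name is illustrative, and CrawlStatus is assumed to be the property that surfaces the Starting/Crawling state.
# Sketch: start a full crawl of a content source and check its status,
# which should move from Starting to Crawling as described above.
$ssa = Get-SPEnterpriseSearchServiceApplication
$cs  = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Departmental Shares"
$cs.StartFullCrawl()
$cs.CrawlStatus   # assumed property name for the current crawl state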
Increasing the number
of start addresses within a single content source or increasing the
number of content sources that are crawling multiple start addresses
does not necessarily increase the number of Mssdmn.exe threads that are
spawned.
In SharePoint 2010 search, the number of crawl components across multiple servers can be increased as the workload increases, with addresses distributed automatically or, if desired, assigned to specific crawl components through Host Distribution Rules. The reduced workload provided by crawl
partitioning speeds the crawling, which provides a faster refresh of
indexes. In addition, the crawl component only builds and retains
portions of the full text index until all designated query components
obtain their copy. The crawl component never retains a full copy of the
index.
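A rough sketch of scaling out the crawl role with the SharePoint 2010 topology cmdlets; the server name is illustrative, and the exact parameter names are assumptions.
# Rough sketch: clone the crawl topology, add a crawl component on another
# server, and activate the new topology.
$ssa  = Get-SPEnterpriseSearchServiceApplication
$topo = New-SPEnterpriseSearchCrawlTopology -SearchApplication $ssa -Clone
$db   = Get-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa | Select-Object -First 1
$ssi  = Get-SPEnterpriseSearchServiceInstance | Where-Object { $_.Server.Name -eq "CRAWLSERVER2" }
New-SPEnterpriseSearchCrawlComponent -CrawlTopology $topo -CrawlDatabase $db -SearchServiceInstance $ssi
$topo | Set-SPEnterpriseSearchCrawlTopology -Active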
The information used to
track crawling now resides in tables of a separate
Search_Service_Application database so that the crawling process does
not impact other databases. This permits the crawling process to be
stateless, since its status is stored in the SQL database, not in crawl
logs on the crawl server. As a result, an alternate server can complete a crawl if the first crawl component fails and is unable to complete the assigned tasks.
The indexing process extracts information from items retrieved by crawl components
and places it in index format. This information includes full text
index, metadata, URLs, and ACLs. Query components accept requests from
users for queries against the SharePoint indexes and return results in
the requested XML format.
Index partitions are
introduced by SharePoint 2010 as subsets of the overall index. With
index partitions, no single query component searches the entire index.
The workload is spread across multiple servers, reducing the query
response time even though maintaining index partitions slightly
increases the crawl
effort. Multiple query components can host the same index partition,
providing both reliability and throughput. Index partitions can also be hosted by mirrored query components, providing resiliency in case of failures.
Index partitioning
is based on a hash of the documentID assigned to each document. This
basis permits indexes to remain roughly equivalent in size, which is
essential to optimal response time. Query components can be identified
as failover nodes that would host the same index as their partners,
similar to a mirrored SQL database. A failover query component will
receive queries only if the primary query component for an index
partition fails.
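A rough sketch of defining a query topology in which a second query component mirrors the same index partition as a failover, using the SharePoint 2010 query topology cmdlets; server names and exact parameters are assumptions.
# Rough sketch: create a one-partition query topology, host the partition on
# one server, mirror it with a failover-only component on a second server,
# and then activate the topology.
$ssa  = Get-SPEnterpriseSearchServiceApplication
$qt   = New-SPEnterpriseSearchQueryTopology -SearchApplication $ssa -Partitions 1
$ip   = Get-SPEnterpriseSearchIndexPartition -QueryTopology $qt
$ssi1 = Get-SPEnterpriseSearchServiceInstance | Where-Object { $_.Server.Name -eq "QUERYSERVER1" }
$ssi2 = Get-SPEnterpriseSearchServiceInstance | Where-Object { $_.Server.Name -eq "QUERYSERVER2" }
New-SPEnterpriseSearchQueryComponent -QueryTopology $qt -IndexPartition $ip -SearchServiceInstance $ssi1
New-SPEnterpriseSearchQueryComponent -QueryTopology $qt -IndexPartition $ip -SearchServiceInstance $ssi2 -FailoverOnly
$qt | Set-SPEnterpriseSearchQueryTopology -Active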
Index propagation works much
like it does in SharePoint Server 2007, including the crawl components
pushing index files to the query components and waiting until they all successfully absorb the index before acknowledging that the documents are successfully crawled.
Indexes pushed to
query components for a partitioned index are just the appropriate part
of that index. The current propagation information is stored in the
Search_Service_Application database in the MSSPropagationTasks table, and the MSSPropagationLog
table keeps records of past events. The MSSPropagationTasks table is
populated and depopulated by the crawl components, and the query
components populate the MSSPropagationTaskCompletions table in response. The MSSPropagationErrors
table will reflect any current deficiency, and that information is also written every 10 minutes as a warning-level event on the search admin component’s server.
Indexes are absorbed
by query components but aren’t necessarily served in queries for a few
seconds until the appropriate merges have occurred. Index propagation
tasks that have stalled for at least five minutes because of a lack of
success from a query component trigger a re-crawl of the contained data.
Query components can be taken offline so they don’t hold up the
crawling process.
Query federation is the formatting of queries according to the OpenSearch definition so that they can be processed by any OpenSearch-compliant query component, including SharePoint. In SharePoint 2010, the search object model and all query
components are built around the OpenSearch
model. Essentially, federated queries go to multiple query servers that
respond individually, and the results are compiled to be presented in a
Web Part.
Since no single query component holds the complete index, the Query
Processor service must manage distributing the queries and processing the
multiple results lists returned. This is accomplished, using a
round-robin load-balanced method, by one of the servers running the Search
Query and Site Settings service (an Internet Information Services [IIS]
service). By default, this service runs on each server that hosts a
search query component. The service manages the query processing tasks,
which include sending queries to one or more of the appropriate query components
and building the consolidated results set to be returned to the Web
front-end (WFE) server that constructed the query.
Note:
At least one instance of
the Search Query and Site Settings service must be running to serve
queries. The service should be started on all servers that host query
components that can be identified in the Search Application Topology
section of the Search Service Administration page. It can be started
from the Services On Server page in Central Administration or with the
following Windows PowerShell cmdlet.
Start-SPEnterpriseSearchQueryAndSiteSettingsServiceInstance
Each query component
responds to queries and sends the results from the index partition that
it holds to the query processor from which it received the query. The
query component is also responsible for the word breaking, noise word
removal, and stemming (if stemming is enabled) for the search terms
provided by the query processor. The multiple responses are combined to
produce the results list. Since each partition contains only a portion
of the complete index, the workload of compiling results lists is spread
across multiple query components, producing a faster query response
time. Each partition can also be mirrored on separate query components,
providing increased performance or resiliency should a single instance
of the partition fail.
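To see this from the query side, a keyword query can be submitted through the query object model. A minimal sketch in Windows PowerShell follows; the site URL and query text are illustrative.
# Sketch: submit a keyword query through the search query object model and
# list the title and path of the relevant results.
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.Office.Server.Search")
$site = Get-SPSite "http://intranet"
$kq = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
$kq.QueryText   = "annual report"
$kq.ResultTypes = [Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults
$results = $kq.Execute()

# The relevant results come back as an IDataReader; load them into a DataTable for display.
$table = New-Object System.Data.DataTable
$table.Load($results[[Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults])
$table | Select-Object Title, Path
$site.Dispose()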
The search information stored in
SQL databases has also been spread across additional databases. Just as
the full text index can now be partitioned, the metadata or property
databases can be divided and placed on separate SQL servers for
performance or can be mirrored for resiliency.
Finally, the search administration component synchronizes the crawling
and query activities using information stored in the admin database. It
is the admin component that assigns tasks to specific servers and
reassigns them in case of a server failure. There can only be one search
administration component per farm, and it resides on the server where
the search service application was created.
A built-in load balancer distributes hosts from content sources across crawl databases unless overruled by a Host
Distribution Rule. The crawl components then retrieve the content
assigned to their crawl database when initiated by the admin component. A
Host Distribution Rule can assign a specific host to a crawl database.
This is particularly useful if a third-party connector is licensed per
server or if crawling specific content requires additional crawl
component resources.
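A minimal sketch of adding a second crawl database, over which the built-in load balancer (or a Host Distribution Rule) can then spread hosts; the database name is illustrative.
# Sketch: add a second crawl database to the search service application.
$ssa = Get-SPEnterpriseSearchServiceApplication
New-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa -DatabaseName "SearchApp_CrawlDB2"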
Whereas SharePoint Server 2007 depended on SQL Server for cluster and mirror failover, SharePoint 2010 has native support for SQL mirroring, and all of its databases can be mirrored.