SharePoint 2010 : Search Server 2010 and FAST Search - Search Architecture


1. Introducing Enterprise Search

With the release of SharePoint 2010, Microsoft has significantly improved its enterprise search capabilities. The acquisition of FAST provided Microsoft with its flagship enterprise search product.

The goals for these enterprise search solutions are to

  • Ensure that enterprise data from multiple systems can be indexed. This includes collaborative data stored in SharePoint sites, files in file shares, Web pages in other websites, third-party repositories, and other line-of-business systems such as CRM databases or ERP solutions.

  • Ensure that content from multiple enterprise repositories can be queried both independently and from within the context of your business applications.

  • Ensure that search results are ranked accurately; users will adopt and use search capabilities only if the results are relevant.

  • Ensure that your enterprise search solution identifies people and expertise within your organization.

SharePoint 2010 provides an enterprise search platform for fulfilling these aims. As a brief overview, SharePoint 2010 includes a Connector Framework that enables the crawler to index files, metadata, and other types of data from various types of content sources. It also provides an indexing engine that stores the crawled data in an efficient manner in index files, and it provides query servers, query object models, and user interfaces for performing searches on the indexed data.
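
To make the query side concrete, here is a minimal PowerShell sketch that runs a keyword query through the server-side search object model. It assumes the SharePoint 2010 Management Shell on a farm server; the proxy name "Search Service Application" and the query text are placeholders for your environment.

# Minimal sketch: run a keyword query via the server object model.
# The proxy name and query text are placeholders.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$proxy = Get-SPEnterpriseSearchServiceApplicationProxy "Search Service Application"
$query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($proxy)
$query.QueryText   = "sales report"
$query.ResultTypes = [Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults

# Execute() returns a ResultTableCollection; pull the relevant-results table
$results = $query.Execute()
$table   = $results.Item([Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults)

# ResultTable implements IDataReader, so load it into a DataTable to inspect
$dataTable = New-Object System.Data.DataTable
$dataTable.Load($table)
$dataTable.Rows | Select-Object Title, Path -First 10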

The suite of SharePoint 2010 technologies now includes five search products.

  • Microsoft SharePoint Foundation 2010

  • Microsoft Search Server 2010 Express

  • Microsoft Search Server 2010

  • Microsoft SharePoint 2010

  • FAST Search Server 2010 for SharePoint

The first two have no additional licensing cost above that of the server operating system and IIS connectivity. Table 1 illustrates the high-level comparisons of the search features provided by each product.

Table 1. Comparing Features of the SharePoint 2010 Search Products
(SPF = SharePoint Foundation 2010; SSX = Search Server 2010 Express; SS = Search Server 2010; SPS = SharePoint 2010; FAST = FAST Search Server 2010 for SharePoint)

Feature                                             | SPF | SSX | SS  | SPS | FAST
Basic site search                                   |  X  |  X  |  X  |  X  |  X
Best bets                                           |     |  X  |  X  |  X  |  X
Visual best bets                                    |     |     |     |     |  X
Similar results                                     |     |     |     |     |  X
Duplicate results                                   |     |     |     |     |  X
Search scopes                                       |     |  X  |  X  |  X  |  X
Search enhancement with user context                |     |     |     |     |  X
Crawled and managed properties                      |     |  X  |  X  |  X  |  X
Query federation                                    |     |  X  |  X  |  X  |  X
Query suggestions                                   |     |  X  |  X  |  X  |  X
Sort results on managed properties or rank profiles |     |     |     |     |  X
Relevancy tuning by document or site promotions     |     |  X  |  X  |  X  |  X
Shallow results refinement                          |     |  X  |  X  |  X  |  X
Deep results refinement                             |     |     |     |     |  X
Document Preview                                    |     |     |     |     |  X
Windows 7 Federation                                |     |  X  |  X  |  X  |  X
People search                                       |     |     |     |  X  |  X
Social search                                       |     |     |     |  X  |  X
Taxonomy integration                                |     |     |     |  X  |  X
Multi-tenant hosting                                |     |     |     |  X  |  X
Rich Web indexing support                           |     |     |     |     |  X

Although there are no hard-coded limits on the number of items that can be indexed by any of the products listed in Table 1, performance considerations suggest some practical guidelines.

  • SharePoint Foundation 2010 can index and search up to 10 million items per search server.

  • Search Server 2010 Express can index and search up to 300,000 items if it is used with SQL Server Express, due to database size limitations. With SQL Server 2008, it can index 10 million items but is limited to a single index/query server role.

  • A scaled-out Search Server 2010 farm or SharePoint 2010 farm can index and search up to 100 million items.

  • A FAST Search Server 2010 for SharePoint installation can support extreme scale and can index and search over a billion items.


2. Search Architecture

SharePoint Server 2007 defined two search roles, Query and Index. With the modularity of SharePoint 2010, these functions are defined as components, because any server can host multiple instances of each. Also, with the changes to the indexing process, the Index role has been renamed the crawl component.

In SharePoint Server 2007, there were several bottlenecks in the search components. The most obvious was the single index server. However, even with multiple query components, the index was a single component and its size presented issues with replication, loading into memory, and traversing to retrieve results.

The design of the search components in SharePoint 2010 targeted three goals, each addressed through scalability.

  • Sub-second query latencies at large scale

  • Fresher indexes

  • Better resiliency and higher availability

These goals were accomplished by separating the system into components, some of which can be scaled out to remove bottlenecks or can be duplicated for resiliency. These components will be introduced as part of the discussion of search processes.

2.1. Search Tools

Before discussing the major functions of Search, you need to be familiar with three tools that are critical to both crawl and query processes. In addition, there are a couple of tools used only by the crawl process.

2.1.1. Language Detection

SharePoint 2010 language support for search includes 53 languages, covering the native languages of roughly 97 percent of the world's population. Language detection identifies the language of crawled content and of submitted queries so that the appropriate language-specific components, such as word breakers, can be applied.

2.1.2. Word Breakers

When formatting is removed from crawled content and query strings, the spaces between words are removed along with it, so a process is required to break the resulting stream of characters into words before other processes can be applied to them. Word breakers (or tokenizers) separate words into “tokens” both when content is indexed and when queries are submitted. In general, word breakers separate words at spaces, punctuation, and special characters, with the exception of underscores.

This process is language dependent. SharePoint can recognize the language within the stream even if the language changes within a document. Along with language detection, word breakers have been greatly enhanced in SharePoint 2010, especially for languages other than English. Particular attention has been focused on compound words, which are handled differently in different languages.

2.1.3. Custom Dictionaries

Custom dictionaries are still supported in SharePoint 2010. A custom dictionary file defines words or terms for the word breaker to consider as complete words. Custom dictionaries are not provided by default and are language specific, so you must create a separate custom dictionary for each language in which you need to modify the word breaker’s behavior. The language-neutral word breaker does not use a custom dictionary. Thesaurus and Noise Word files can be application specific, but custom dictionaries apply farm-wide to all search applications.

Custom dictionaries instruct the word breaker for their language that a particular combination of characters is a complete word that should not be broken up into separate tokens. Both the indexing and querying processes use the instructions in any existing custom dictionary that matches the language and dialect of the word breaker being used. If a word exists in the custom dictionary, it is considered complete for indexing.

Terms that include special characters would be prime candidates for custom dictionaries. For instance, AT&T includes an ampersand that would be used by a word breaker to separate the term into two tokens, “AT” and “T,” which are both noise (or stop) words and would be ignored in search queries. If the term “AT&T” were included in a custom dictionary for the appropriate language, then it would be treated as a unique word in both instances. You might also have hyphenated words that your organization’s search system needs to treat as complete unique words; you would add such words to a custom dictionary.

Many industries have number sequences that need to be indexed. For instance, a number in the “725.5046.1.1” format would be broken into separate tokens at the periods. To index the complete sequence, you must add the number to a custom dictionary for each language in which it appears so that the entire number is treated as a single token by the index and query processes.

Custom dictionaries can be created in Notepad but must be saved in Unicode format. The file name must follow the CustomNNNN.lex convention, where NNNN is the hexadecimal code of the language.

Within the file, each entry must be on its own line, terminated by a carriage return (CR) and line feed (LF). Other custom dictionary rules are

  • Entries are not case sensitive.

  • The pipe (|) character is not permitted.

  • No spaces are permitted.

  • The pound sign (#) character can be used anywhere in an entry except at the beginning.

  • Any other alphanumeric characters, punctuation, symbols, and breaking characters are valid.

  • The maximum length of an entry is 128 (Unicode) characters.

The custom dictionary files must be saved in the folder that contains the word breakers on all index and query servers in the farm. By default, this is the C:\Program Files\Microsoft Office Servers\14.0\bin folder.

You must restart the osearch14 service on all index and query servers in the farm after these custom dictionary files have been created or modified. Do not use the Services On Server page in Central Administration to stop and start the service. Either use the Services MMC in Administrative Tools or type net stop osearch14 or net start osearch14 at the command line.
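
Putting the rules above together, the following is a minimal sketch that creates a custom dictionary for English (US), whose hexadecimal language code is 0409, and restarts the search service. The example terms and the default installation path are assumptions you should adjust for your farm.

# Sketch: build Custom0409.lex (0409 = English, US) with example terms and
# restart the search service so the dictionary is loaded.
$terms = "AT&T", "725.5046.1.1"
$path  = "C:\Program Files\Microsoft Office Servers\14.0\bin\Custom0409.lex"

# One entry per line; -Encoding Unicode writes the required Unicode format
# with CR/LF line endings
$terms | Set-Content -Path $path -Encoding Unicode

# Restart from the command line, not from the Services On Server page;
# repeat on every index and query server in the farm
net stop osearch14
net start osearch14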


Note:

MORE INFO For a list of the supported language hex codes, see http://technet.microsoft.com/en-us/library/cc263242.aspx.


2.1.4. iFilters

iFilters tell the crawler how to “crack open” a file and to identify and index its contents. They are only used by the crawling process. Many iFilters have shipped with the new versions of SharePoint, including ascx, asm, asp, aspx, bat, c, cmd, cpp, css, cxx, dev, dic, doc, docm, docx, dot, eml, h, hhc, hht, hta, htm, html, htw, lnk, mht, mhtml, mpx, msg, odc, pot, pps, ppt, pptm, pptx, pub, stm, tif, trf, vsd, xlb, xlc, xls, xlsm, xlsx, xlt, and xml.

If you do not have an iFilter for a given file type, the crawler gathers only the metadata for documents of that type. You can add third-party iFilters or write your own. For example, if you need to index PDF files, you can download Adobe’s iFilter from the Adobe website; the download includes the executable and supporting files needed to install it.
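
To verify which iFilter is registered for a particular extension, you can inspect the registry location that SharePoint 2010 search typically uses, as in the hedged sketch below; the registry path and the .pdf example should be verified on your own index servers.

# Sketch: look up the iFilter registration for .pdf on a SharePoint 2010
# search server. The registry path is the typical location; verify it locally.
$key = "HKLM:\SOFTWARE\Microsoft\Office Server\14.0\Search\Setup\" +
       "ContentIndexCommon\Filters\Extension\.pdf"

if (Test-Path $key) {
    # The default value holds the CLSID of the registered iFilter
    (Get-ItemProperty -Path $key).'(default)'
} else {
    "No iFilter registered for .pdf; only document metadata will be indexed."
}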

2.1.5. Connectors

SharePoint 2010 indexing can use either the Connector Framework or the protocol handler API, so custom protocol handlers written for SharePoint Server 2007 will still work. The protocol handler API will not be supported in future products, however.

The SharePoint 2010 indexing connector has improved reliability and performance, provided by a higher-fidelity conversation with the SharePoint Web service. It also reduces the load on SharePoint sites through the addition of security-only crawls, improved batching of change log processing, and caching of common list item information. Improvements in telemetry and reporting provide views of security transactions processed in crawls and new crawl log functionality, such as filtering the crawl log by “top level errors.”

The connector framework provides improvements over the protocol handler API.

  • Attachments can now be crawled.

  • Item-level security descriptors can now be retrieved for external data exposed by Business Connectivity Services.

  • When crawling a Business Connectivity Services entity, additional entities can be crawled, maintaining the entity relationships.

  • Time stamp–based incremental crawls are supported.

  • Change log crawls that can remove items that have been deleted since the last crawl are supported.

Standard shared SharePoint 2010 indexing connectors that are also used by FAST Search for SharePoint include the following.

  • SharePoint content: The crawler accesses data through the SharePoint Web service and uses Windows authentication (integrated) credentials. This connector supports full crawl using enumeration of content, but for incremental crawls it uses the change logs, including deleted items. It has built-in support for both NT and pluggable security trimming.

  • File shares: The crawler accesses the file shares through their hierarchical structure with Windows authentication (integrated). ACLs are collected during crawl time, and changes can be detected with time stamps or changes to the ACL.

  • Websites: This protocol handler uses link traversal as the crawl method but does not provide a security trimmer. SharePoint sites can be configured to use this connector when they permit anonymous access, which reduces the overhead on both sides of the crawling effort. By default, if SharePoint sites are listed in a content source that uses this connector, the crawler will automatically switch to the SharePoint content connector.

  • People profiles: People profiles are enumerated and crawled via the profile pages of the My Site host. Since the “Exposed” information selections are not truly NT ACLs, only information within a profile that is exposed to “Everyone” will be crawled.

  • Lotus Notes: This connector used a protocol handler in Microsoft Office SharePoint Server 2007 but has been changed to use the Connector Framework in this version.

  • Exchange public folders: Likewise, Exchange public folders were crawled with an HTTP protocol handler previously but now use the Connector Framework.

  • External systems: Custom connectors are much easier to build in SharePoint Designer 2010 using the Connector Framework than with a protocol handler. These custom connectors can be built by creating external content types for the Business Data Connectivity Service or by using existing external content types.


Note:

Microsoft SharePoint 2010 Indexing Connector for Documentum can be downloaded from http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=32d0980a-4b9a-4b0d-868e-9be9c0d75435.


2.2. Search Components and Processes

The term content source refers to both the target servers that contain the content that needs to be indexed and the definitions of the start addresses, indexing schedules, and basic rules to instruct the crawler on how to crawl (index) the content. Crawl rules are used to further define how the content of specific URLs will be crawled, such as defining a different security context under which the crawler should crawl the content source instead of using the default crawling account.
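
As an illustrative sketch of these definitions, the following PowerShell creates a file-share content source and a crawl rule that crawls one host under a dedicated security context; the names, start address, URL pattern, and account are placeholder assumptions.

# Sketch: define a content source and a crawl rule. Names, start addresses,
# and the crawl account are placeholders for your environment.
$ssa = Get-SPEnterpriseSearchServiceApplication

# A content source holds the start addresses and crawl schedules for a target
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Name "Corporate File Shares" -Type File `
    -StartAddresses "\\fileserver\departments"

# A crawl rule overrides how specific URLs are crawled, here by supplying a
# dedicated account instead of the default crawling account
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://legacy-app/*" -Type InclusionRule `
    -AuthenticationType BasicAccountRuleAccess `
    -AccountName "CONTOSO\svc-crawl" `
    -AccountPassword (Read-Host "Password" -AsSecureString)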

Connectors provide the ability to connect to different content sources using different connection protocols, such as FTP, HTTP, or RPC. Protocol handlers used in SharePoint Server 2007 are still supported but are being replaced by the Connector Framework. Some connectors can be built in Microsoft SharePoint Designer without writing managed code.

There are really two parts to gathering the information from a content source. The first is the enumeration of the content items that should be crawled: the connector connects to the content source, walks the URL of each content item, and deposits each URL in the MSSCrawlQueue table. Although the basic search engine runs under Mssearch.exe (OSearch14), when a crawl is started, a new child process of Mssearch.exe named Mssdmn.exe is spawned, and it is under this process that the content source is both enumerated and then crawled. After a sufficient number of URLs are placed in the MSSCrawlQueue, another Mssdmn.exe is spawned. This process goes back to the content source, connecting to each URL in batches (as determined by the crawler impact rules), opening each document at each URL in the current batch, and then downloading first the metadata about the document and then the document’s contents. Both of these data streams are run through several components in the process pipeline; the content is then placed in the full-text index, while the metadata is placed in the SQL database property store.

When you start a new crawl of a content source, you’ll notice that for a brief period the status of the content source appears as Starting. This means that the Mssdmn.exe process is being spawned (if it isn’t already running), the connector is establishing the connection to the content source through which the content will be indexed, and the URLs of the content items are being deposited into the MSSCrawlQueue. The status changes to Crawling when the crawler is ready to open documents and download their content.

Increasing the number of start addresses within a single content source or increasing the number of content sources that are crawling multiple start addresses does not necessarily increase the number of Mssdmn.exe threads that are spawned.

In Search 2010, the number of crawl components across multiple servers can be increased as the workload increases, with addresses distributed automatically or, if desired, with specific addresses assigned to specific crawl components through Host Distribution Rules. The reduced per-component workload provided by crawl partitioning speeds up crawling, which provides a faster refresh of indexes. In addition, a crawl component builds and retains only portions of the full-text index, and only until all designated query components have obtained their copy; the crawl component never retains a full copy of the index.
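
As a hedged sketch of this scale-out, the following adds a crawl component on a second server and attaches it to an existing crawl database; the server name is a placeholder, and the cmdlets reflect the SharePoint 2010 topology model.

# Sketch: add a crawl component on another farm server to share the crawl
# workload. "SEARCH02" and the crawl database choice are placeholders.
$ssa      = Get-SPEnterpriseSearchServiceApplication
$instance = Get-SPEnterpriseSearchServiceInstance |
            Where-Object { $_.Server.Name -eq "SEARCH02" }
$crawlDb  = Get-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa |
            Select-Object -First 1

New-SPEnterpriseSearchCrawlComponent -SearchApplication $ssa `
    -SearchServiceInstance $instance -CrawlDatabase $crawlDb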

The information used to track crawling now resides in tables of a separate Search_Service_Application database so that the crawling process does not impact other databases. This makes the crawling process stateless, because its status is stored in the SQL database rather than in crawl logs on the crawl server, and it allows an alternate server to complete a crawl if the first crawl component fails and is unable to complete its assigned tasks.

The indexing process extracts information from items retrieved by crawl components and places it in index format. This information includes full text index, metadata, URLs, and ACLs. Query components accept requests from users for queries against the SharePoint indexes and return results in the requested XML format.

Index partitions are introduced by SharePoint 2010 as subsets of the overall index. With index partitions, no single query component searches the entire index. The workload is spread across multiple servers, reducing the query response time even though maintaining index partitions slightly increases the crawl effort. Multiple query components can host the same index partition, providing both reliability and throughput. Index partitions also can be supported by clusters of mirrored crawl components, providing resiliency in case of failures.

Index partitioning is based on a hash of the documentID assigned to each document, which keeps the index partitions roughly equal in size; this is essential to optimal response time. Query components can be identified as failover nodes that host the same index partition as their partners, similar to a mirrored SQL database. A failover query component receives queries only if the primary query component for an index partition fails.
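
The following sketch, under the assumption of an existing inactive query topology and a second query server named QUERY02, adds a failover-only query component for an index partition; the modified topology must still be activated (for example, from the Search Administration pages in Central Administration) before it serves queries.

# Sketch: mirror an index partition with a failover-only query component on a
# second server. "QUERY02" and the topology/partition selection are
# placeholders; activate the modified topology before it will serve queries.
$ssa       = Get-SPEnterpriseSearchServiceApplication
$topology  = Get-SPEnterpriseSearchQueryTopology -SearchApplication $ssa |
             Where-Object { $_.State -eq "Inactive" } | Select-Object -First 1
$partition = Get-SPEnterpriseSearchIndexPartition -QueryTopology $topology |
             Select-Object -First 1
$instance  = Get-SPEnterpriseSearchServiceInstance |
             Where-Object { $_.Server.Name -eq "QUERY02" }

New-SPEnterpriseSearchQueryComponent -QueryTopology $topology `
    -IndexPartition $partition -SearchServiceInstance $instance -FailoverOnly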

Index propagation works much like it does in SharePoint Server 2007, including the crawl components pushing index files to the query components and waiting until they all successfully absorb the index before acknowledging that the documents are successfully crawled.

Indexes pushed to query components for a partitioned index contain just the appropriate part of that index. Current propagation information is stored in the Search_Service_Application database in the MSSPropagationTasks table, and the MSSPropagationLog table keeps records of past events. The MSSPropagationTasks table is populated and depopulated by the crawl components, and the query components populate the MSSPropagationTaskCompletions table in response. The MSSPropagationErrors table reflects any current deficiency, and its contents are written every 10 minutes as a warning-level event on the server hosting the search administration component.

Indexes are absorbed by query components but aren’t necessarily served in queries for a few seconds until the appropriate merges have occurred. Index propagation tasks that have stalled for at least five minutes because of a lack of success from a query component trigger a re-crawl of the contained data. Query components can be taken offline so they don’t hold up the crawling process.

Query federation is the formatting of queries according to the OpenSearch definition so that they can be processed by any OpenSearch-compliant query provider, including SharePoint. In SharePoint 2010, the search object model and all query components are built around the OpenSearch model. Essentially, federated queries go to multiple query servers that respond individually, and the results are compiled for presentation in a Web Part.

Since no single query component holds the complete index, the query processor must manage dispatching the queries and processing the multiple result lists returned. This is accomplished, using a round-robin load-balanced method, by one of the servers running the Search Query and Site Settings service (an Internet Information Services [IIS] service). By default, this service runs on each server that hosts a search query component. The service manages the query processing tasks, which include sending queries to one or more of the appropriate query components and building the consolidated result set to be returned to the Web front-end (WFE) server that constructed the query.


Note:

At least one instance of the Search Query and Site Settings service must be running to serve queries. The service should be started on all servers that host query components that can be identified in the Search Application Topology section of the Search Service Administration page. It can be started from the Services On Server page in Central Administration or with the following Windows PowerShell cmdlet.


Start-SPEnterpriseSearchQueryAndSiteSettingsServiceInstance

Each query component responds to queries and sends the results from the index partition that it holds to the query processor from which it received the query. The query component is also responsible for the word breaking, noise word removal, and stemming (if stemming is enabled) for the search terms provided by the query processor. The multiple responses are combined to produce the results list. Since each partition contains only a portion of the complete index, the workload of compiling results lists is spread across multiple query components, producing a faster query response time. Each partition can also be mirrored on separate query components, providing increased performance or resiliency should a single instance of the partition fail.

The search information stored in SQL databases has also been spread across additional databases. Just as the full text index can now be partitioned, the metadata or property databases can be divided and placed on separate SQL servers for performance or can be mirrored for resiliency.

Finally, the search administration component synchronizes the crawling and query activities using information stored in the admin database. It is the admin component that assigns tasks to specific servers and reassigns them in case of a server failure. There can be only one search administration component per search service application, and it resides on the server where the search service application was created.

A built-in load balancer distributes hosts from content sources across crawl databases unless overruled by a Host Distribution Rule. The crawl components then retrieve the content assigned to their crawl database when initiated by the admin component. A Host Distribution Rule can assign a specific host to a crawl database. This is particularly useful if a third-party connector is licensed per server or if crawling specific content requires additional crawl component resources.

Whereas SharePoint Server 2007 depended on SQL Server clustering and mirroring for failover, SharePoint 2010 has native support for SQL mirroring, and all of its databases can be mirrored.
