SharePoint 2010 Search : Setting Up the Crawler - Crawler Impact Rules & Crawler Scheduling

10/7/2011 5:33:50 PM

1. Crawler Impact Rules

Crawler impact rules let the administrator dictate whether the crawler requests content from a site more or less aggressively than the default. This is useful if a site has no mirror dedicated to crawling, or if the site is outside the SharePoint farm and may not handle the load. When crawling an external web site, it may be wise to limit the load placed on that site, because its administrator may block the crawler if the load looks like a possible denial-of-service attack. Conversely, in environments where the content servers have capacity to spare, it may be desirable to increase the load the crawler places on them in order to reduce crawl time.

Adding crawler impact rules is easy but requires consideration. To add a rule, click Crawler Impact Rules under the Crawling section of the left navigation on the Search service application, and then choose Add Rule. Enter the name of the site or content source that the rule should apply to. To restrict the impact on the site, lower the number of simultaneous requests (the default is 8). If the target source can handle more load, increase the number. If the site is particularly sensitive to load, choose "Request one document at a time and wait the specified time between requests," and enter the time in seconds that the site needs to recover after serving each request. Remember that limiting the number of connections and adding a pause between requests will substantially lengthen the crawl. See Figure 1.

Figure 1. Adding a crawler impact rule
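
To get a feel for how much a rule can lengthen a crawl, the two options can be compared with a rough back-of-the-envelope model. The sketch below is not a SharePoint API; the function name, its parameters, and the assumption that every request takes the same time are all hypothetical simplifications. Only the default of 8 simultaneous requests comes from the text above.

```python
import math

DEFAULT_SIMULTANEOUS_REQUESTS = 8  # default stated in the text above


def estimate_crawl_seconds(documents, seconds_per_request,
                           simultaneous_requests=DEFAULT_SIMULTANEOUS_REQUESTS,
                           wait_seconds=None):
    """Rough crawl-time estimate for one site under a crawler impact rule.

    Hypothetical model: every request takes `seconds_per_request`.
    If `wait_seconds` is given, documents are requested one at a time with
    that pause between consecutive requests; otherwise they are fetched in
    batches of `simultaneous_requests`.
    """
    if documents < 0 or seconds_per_request < 0:
        raise ValueError("documents and seconds_per_request must be >= 0")
    if documents == 0:
        return 0.0
    if wait_seconds is not None:
        if wait_seconds < 0:
            raise ValueError("wait_seconds must be >= 0")
        # One request at a time, pausing between consecutive requests.
        return documents * seconds_per_request + (documents - 1) * wait_seconds
    if simultaneous_requests < 1:
        raise ValueError("simultaneous_requests must be >= 1")
    # Each batch of parallel requests is assumed to finish together.
    batches = math.ceil(documents / simultaneous_requests)
    return batches * seconds_per_request


if __name__ == "__main__":
    docs, per_request = 8000, 0.5
    print("default rule:      ", estimate_crawl_seconds(docs, per_request))
    print("2 at a time:       ", estimate_crawl_seconds(docs, per_request, 2))
    print("1 at a time + 1 s: ", estimate_crawl_seconds(docs, per_request,
                                                        wait_seconds=1))
```

Even in this simplified model, moving from the default to one document at a time with a one-second pause increases the estimated crawl time by more than an order of magnitude, which is why such a rule is best reserved for sites that really need it.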

2. Crawler Scheduling

SharePoint's crawlers can be scheduled to perform full and incremental crawls at different intervals and for different periods. Scheduling is set separately for each content source, so static content can be kept out of recurring crawls while dynamic or frequently updated content is refreshed constantly. The schedule is configured at the bottom of each content source's Edit Content Source page.

SharePoint content should have a relatively aggressive incremental crawl schedule, tempered by how often the content is actually updated and by any hardware limitations. For other content sources, consider how each one is used before scheduling incremental crawls.

It is wise to schedule a full crawl on a regular basis to ensure database consistency. However, this regular schedule will depend largely on the time it takes to perform a full crawl. Some organizations with large repositories may choose to avoid full crawls after their initial index is populated.

Figures 2 and 3 show the part of the Edit/Add Content Sources page where a full crawl or incremental crawl can be scheduled and the Manage Schedules page (accessed through the "Create schedule" and "Edit schedule" links) with the options for scheduling those crawls.

Figure 2. The scheduling section of the Edit/Add Content Source page

Figure 3. The Manage Schedules page
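
As an illustration of what "different intervals and for different periods" means in practice, the following sketch lists the start times that a repeating incremental schedule would produce within one day. It is a plain Python model, not SharePoint code: the function name and parameters are hypothetical, and treating the end of the repeat window as exclusive is an assumption.

```python
from datetime import datetime, timedelta


def incremental_start_times(first_start, repeat_every_minutes, for_minutes):
    """List crawl start times for a schedule that repeats within a day.

    Assumption: a crawl starts at `first_start` and then every
    `repeat_every_minutes`, as long as the start falls strictly inside
    the `for_minutes` window.
    """
    if repeat_every_minutes <= 0:
        raise ValueError("repeat_every_minutes must be positive")
    if for_minutes < 0:
        raise ValueError("for_minutes must be >= 0")
    times = []
    offset = 0
    while offset < for_minutes:
        times.append(first_start + timedelta(minutes=offset))
        offset += repeat_every_minutes
    return times


if __name__ == "__main__":
    # Example values only: start at 06:00, repeat every 30 minutes for 2 hours.
    start = datetime(2011, 10, 7, 6, 0)
    for t in incremental_start_times(start, 30, 120):
        print(t.strftime("%H:%M"))
```

Listing the start times this way makes it easy to check that an interval is not shorter than the time an incremental crawl of that content source usually takes.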

2.1. Full vs. Incremental Crawls

SharePoint 2010 has two types of crawl, full and incremental. Incremental crawls are more efficient and can keep the index up to date in near real time (Figure 4). However, every content source requires at least one full crawl, and there may be other occasions when a full crawl is required.

During a full crawl, the crawler queries the content source and requests all of its content as if for the first time. It then saves that data in the index and crawl database with date stamps and item IDs. Every time a full crawl is launched, this process starts from scratch and the old data is abandoned.

A full crawl is required when:

A new content source is added—any new content source requires a full crawl initially.
A new file type is added—new file types cannot be picked up on an incremental crawl.
A new managed property is mapped from a crawled property.
Managed property mappings are changed or a new crawled property is mapped to an existing managed property.
New crawl rules are added, changed, or removed—crawl rule modification requires a full crawl to take effect.
The index becomes corrupted or performs irregularly—this should almost never happen but should not be ruled out.

During an incremental crawl, the crawler looks at the crawl database to determine what has been crawled or not crawled, and then requests updated information from the source depending on the content source type. In this way, the crawler can collect only documents that have been added or updated since the last crawl or remove documents from the index that have been removed from the content source.
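
The comparison described above can be sketched as a simple diff between what the crawl database recorded and what the source currently holds. This is a simplified model rather than SharePoint's actual mechanism, which, as noted, depends on the content source type; the function name and the dictionaries mapping item IDs to last-modified date stamps are hypothetical.

```python
from datetime import datetime


def plan_incremental_crawl(crawl_db, source):
    """Work out what an incremental crawl has to do.

    Both arguments map item ID -> last-modified date stamp.
    `crawl_db` is what the previous crawl recorded; `source` is the
    current state of the content source (hypothetical inputs).
    """
    added = sorted(i for i in source if i not in crawl_db)
    updated = sorted(i for i in source
                     if i in crawl_db and source[i] > crawl_db[i])
    removed = sorted(i for i in crawl_db if i not in source)
    return {"added": added, "updated": updated, "removed": removed}


if __name__ == "__main__":
    last_crawl = {
        "doc-1": datetime(2011, 10, 1),
        "doc-2": datetime(2011, 10, 1),
        "doc-3": datetime(2011, 10, 1),
    }
    current = {
        "doc-1": datetime(2011, 10, 1),   # unchanged, skipped
        "doc-2": datetime(2011, 10, 5),   # modified since the last crawl
        "doc-4": datetime(2011, 10, 6),   # new item
    }                                     # doc-3 was deleted at the source
    print(plan_incremental_crawl(last_crawl, current))
```

Only the added and updated items need to be requested from the source, and the removed items are dropped from the index, which is why an incremental crawl is so much cheaper than a full one.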

If an incremental crawl is inappropriate or not possible, SharePoint starts a full crawl instead. This ensures that scheduled crawling does not stop and that no crawl is launched that would corrupt the index.
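
The full-crawl conditions listed earlier and the fallback just described can be summarized in a small decision function. This is an illustrative sketch only: the `Change` names and `resolve_crawl_type` are hypothetical and do not correspond to any SharePoint API, and SharePoint's own checks may cover cases not modeled here.

```python
from enum import Enum, auto


class Change(Enum):
    """Events since the last crawl (illustrative names, not a SharePoint API)."""
    NEW_CONTENT_SOURCE = auto()
    NEW_FILE_TYPE = auto()
    MANAGED_PROPERTY_MAPPING_CHANGED = auto()  # new or changed mapping
    CRAWL_RULES_MODIFIED = auto()              # added, changed, or removed
    INDEX_CORRUPTED = auto()
    CONTENT_EDITED = auto()                    # ordinary document changes


# Everything except ordinary content edits forces a full crawl.
FULL_CRAWL_TRIGGERS = frozenset(Change) - {Change.CONTENT_EDITED}


def resolve_crawl_type(requested, changes):
    """Return the crawl type that should actually run.

    A requested incremental crawl is promoted to a full crawl when any
    full-crawl trigger is present; a requested full crawl stays full.
    """
    if requested not in ("full", "incremental"):
        raise ValueError("requested must be 'full' or 'incremental'")
    if requested == "incremental" and any(
            change in FULL_CRAWL_TRIGGERS for change in changes):
        return "full"
    return requested


if __name__ == "__main__":
    print(resolve_crawl_type("incremental", [Change.CONTENT_EDITED]))
    print(resolve_crawl_type("incremental", [Change.NEW_FILE_TYPE]))
```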

Figure 4. Crawl control in SharePoint 2010
