1. Crawling File Shares
SharePoint 2010 Search is
truly designed as an enterprise search tool. Many do not appreciate SharePoint's
extensive search capabilities because of all its other enterprise functionality.
SharePoint 2010 has broad indexing capabilities, and one of the most useful is
the indexing of file shares. We have yet to encounter an organization without
hoards of data stored away on file shares. Some or all of this data may not be
interesting, and care should be taken as to what is included in an index.
However, the potential value of indexing and exposing the treasure troves of
information locked away in these data graveyards is vast. This section outlines
how to quickly and easily set up the SharePoint crawler to index file shares.
As with adding SharePoint
sites, setting the crawler to index file shares is done on the Add Content
Source page in the Search service application (see Figure 1). Paths defined in
the Start Addresses section must be either UNC paths or paths using the file
protocol (file://fileshare). It is advisable to test the paths by mapping the
drives on the server before adding them as content sources, and to make sure
the crawl account has read access to the file shares.
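For administrators who prefer scripting, a file share content source can also be created with Windows PowerShell. The following is a minimal sketch; the service application name ("Search Service Application") and the share path (\\fileserver\documents) are placeholders that should be replaced with values from the environment.

```powershell
# Load the SharePoint snap-in when running from a plain PowerShell console
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Get the Search service application (the name is environment specific)
$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Create a file share content source; start addresses may be UNC or file:// paths
$cs = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Name "File Shares" -Type File -StartAddresses "\\fileserver\documents"

# Kick off an initial full crawl of the new content source
$cs.StartFullCrawl()
```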
Crawled files may also contain
metadata that can be used by the search refiners in SharePoint 2010 or in
search scopes. Unlike many of the document properties in documents managed in
SharePoint document libraries, this metadata is usually not made available by
default. Making sure it is crawled and mapped to managed properties in
SharePoint allows it to be used in refiners and scopes.
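The mapping can be created on the Metadata Properties page of the Search service application or scripted. The sketch below assumes a crawled property named MyCustomProperty in the Basic category and a new managed property named CustomProperty; the actual crawled property names appear only after the content has been crawled at least once.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Look up a crawled property discovered during a crawl (name and category are examples)
$category = Get-SPEnterpriseSearchMetadataCategory -SearchApplication $ssa -Identity "Basic"
$crawled = Get-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $ssa `
    -Category $category -Name "MyCustomProperty"

# Create a text managed property (in SharePoint 2010, Type 1 = Text)
$managed = New-SPEnterpriseSearchMetadataManagedProperty -SearchApplication $ssa `
    -Name "CustomProperty" -Type 1

# Map the crawled property to the managed property so it can back refiners and scopes
New-SPEnterpriseSearchMetadataMapping -SearchApplication $ssa `
    -ManagedProperty $managed -CrawledProperty $crawled
```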
NOTE
File shares are often
filled with document types that are not indexed by default by
SharePoint's crawler. Luckily, SharePoint has the ability, via the
Windows operating system, to extract and crawl text from other file
types using iFilters. iFilters can be written for custom file types
or purchased from third-party vendors.
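In addition to installing the iFilter on the crawl servers, the file extension typically has to be added to the crawler's file types list before documents of that type are included in crawls. A possible PowerShell sketch, using .xyz as a stand-in extension:

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Add the extension to the list of file types included in crawls (".xyz" is a placeholder)
New-SPEnterpriseSearchCrawlExtension -SearchApplication $ssa -Name "xyz"

# Review the extensions the crawler currently includes
Get-SPEnterpriseSearchCrawlExtension -SearchApplication $ssa
```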
2. Crawling Web Sites
SharePoint 2010 can also
crawl web sites and has a unique crawling mechanism for indexing web
content. Although SharePoint itself is essentially a very powerful web
site used for portals, the web crawling mechanism of SharePoint 2010
differs in that it has special capabilities for parsing HyperText Markup
Language (HTML) and extracting content from HTML tags. When crawling
SharePoint sites, the crawler instead uses a different application
programming interface (API) to collect documents and a rich set of
associated information, such as custom properties.
It is generally
recommended that SharePoint sites, even those used as external web
sites, be indexed as SharePoint sites. When indexing a web site built on
some other content management system, or all or part of an external web
site, the web site definition should be used. Crawling sites as web sites
limits the crawler to indexing content retrievable on the presentation
tier of the web site, that is, the site as anonymous visitors see it.
There are times when it may be
necessary or desirable to index a SharePoint site as a web site, and this
is also possible: for example, if the SharePoint site is a public site not
owned or operated by the organization, if the site is behind a firewall,
or if the site is based on a different version of SharePoint and cannot
be indexed as a SharePoint site.
Web sites should be added by
entering the full HyperText Transfer Protocol (HTTP) path or, for sites
using Secure Sockets Layer (SSL), the HTTPS path. See Figure 2.
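A web content source can also be created with PowerShell. The sketch below reuses the assumed service application name and uses http://www.example.com as a placeholder start address; for the farm's own sites, the content source type would be SharePoint rather than Web.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

# Create a web site content source; use -Type SharePoint to crawl SharePoint sites instead
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Name "External Web Site" -Type Web -StartAddresses "http://www.example.com"
```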
Web sites are anything but a
standard group of items. There are nearly as many variations in how web
sites are built as there are web sites. Even though standards exist for
HTML and web content, it is very difficult to find a site that follows
them. Browser developers deserve respect for giving end users such a rich
experience given the state of most web sites, although the argument exists
that if browsers were not so forgiving, web developers would be more
careful. Most crawlers are not as forgiving, because a crawler needs to
make sense of the content it is getting in the HTML, not just display it.
Many factors make crawling web pages tricky, including
JavaScript: Crawlers generally cannot understand JavaScript and will ignore it.
Flash: Crawlers will not index the content of Flash objects in most cases.
Images:
Web crawlers do not make sense of images beyond their meta tags or
alt tags. Scanned text is a special problem: although users see text,
crawlers see only an image.
Broken HTML: Although browsers will display poorly formatted HTML, crawlers can often stumble on it.
Poor or missing metadata:
Web pages can hold metadata, and this can improve the richness of the
content. However, most content management systems do a poor job of
managing and publishing metadata. Custom sites are even worse.
Page- or site-based crawl rules: Robots meta tags or robots.txt files may instruct crawlers to skip pages or entire sections of a site.
These issues make crawling
web sites difficult, and the content collected from them is often not as
good as the content collected from a SharePoint site. If the administrator
has control of the web site, correcting these issues can make the web
content more accessible to SharePoint Search, as well as to global search
engines if the site is exposed to the World Wide Web.