SharePoint 2010 Search : Relevancy Algorithms

7/27/2011 5:11:51 PM

Relevancy algorithms can be complicated because many factors must be weighed when determining which document best matches a given term.

Many ranking algorithms, including Google's famous PageRank, are published and can be inspected freely on the Internet. However, developers creating a ranking algorithm will almost always add their own modifications to whatever base they are using, in order to produce a unique and functional ranking algorithm. Google's present ranking mechanism is certainly far more complex than it was when its founders first invented and published it while at Stanford. Whatever the math behind the ranking algorithm, experience and testing can confirm that the ranking in SharePoint 2010 is highly effective for enterprise content and brings relevant documents to the top of the search results page.

As mentioned, SharePoint's default algorithm is purportedly based on an algorithm called BM25. There is some reference to neural networks in literature on the Internet about SharePoint's ranking model. This is supported by a patent granted to Microsoft's engineering team, US Patent No. 7,840,569 B2, in November 2010. In this context, the term neural network suggests that the algorithm is designed to learn from user behavior, something that SharePoint's algorithm certainly does. However, there is no way to confirm from the outside that the patented algorithm is the core of SharePoint's ranking.

To simplify things, however, we can think of an enterprise search ranking algorithm as a formula that ranks documents based on how frequently the query terms appear in the matching documents, while also considering the overall comparative value of each document. SharePoint applies different ranking considerations to the different fields that make up a document. For SharePoint specifically, these fields extend to properties and other data associated with the documents. So a single document becomes a collection of differently weighted ranking considerations. These ranking considerations include, but may not be limited to, the following areas:

- Keyword matches
  - Body
  - Title
  - Author properties
  - Other property tags
  - Intersite anchor
  - URL
- Proximity
  - Query segmentation
  - Query term association
- Static document relationships
  - Click distance
  - URL depth
  - File type
  - Length
  - Language
- User-driven weighting
  - Click popularity
  - Skips
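The idea of a document as a collection of differently weighted fields can be illustrated with a minimal sketch. The field names and weight values below are hypothetical and chosen only for illustration; the text does not give SharePoint's actual weights, and this is not SharePoint's implementation.

```python
# Hypothetical field weights (assumption); SharePoint's real values are not given in the text.
FIELD_WEIGHTS = {"title": 5.0, "author": 3.0, "anchor": 2.0, "url": 2.0, "body": 1.0}


def field_score(document, query_terms):
    """Sum the query-term matches in each field, scaled by that field's weight."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = document.get(field, "").lower().split()
        score += weight * sum(tokens.count(term) for term in query_terms)
    return score


# One match in the title outweighs two matches in the body under these weights.
titled = {"title": "Budget report", "body": "quarterly numbers"}
untitled = {"title": "Notes", "body": "budget budget"}
print(field_score(titled, ["budget"]), field_score(untitled, ["budget"]))
```

Under these assumed weights, a single title match scores higher than two body matches, which mirrors the general point that where a term appears matters as much as how often it appears.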

The first step SharePoint search takes when receiving a query is to pass the query through a word breaker so that the query terms match the form of the terms stored in the index. SharePoint has a specific word breaker for each language that can tell where to break compound terms and tokenize them. This word breaking or tokenization happens during both the crawling of terms and the querying of terms to ensure that streams of indexed text are broken into simple items. The neutral or default word breaker in SharePoint breaks terms only at white space or filler characters like hyphens and slashes. Other language word breakers do a more complex analysis based on the grammar of the given language. Some languages have no white space between terms, so they require a special understanding of the characters.

Next, the broken or tokenized terms are sent to a stemmer to reduce them to their root form. These terms are finally checked against terms in the index, and a result set is assembled from all the documents in the corpus that contain matching terms. The result set is then prioritized, with the item with the highest ranking value at the top of the first page and subsequent matches listed in descending order.
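The query pipeline described above (word breaking, stemming, then lookup in the index) can be sketched roughly as follows. This is a toy illustration, not SharePoint's code: the suffix list, the function names, and the in-memory index are all assumptions, and real stemmers and word breakers are language-specific and far more careful.

```python
import re

# Hypothetical suffix list for a toy stemmer (assumption, English only).
SUFFIXES = ("ing", "ed", "s")


def break_words(text):
    """Neutral-style word breaker: split on white space, hyphens, and slashes."""
    return [t for t in re.split(r"[\s\-/]+", text.lower()) if t]


def stem(token):
    """Strip one common suffix, keeping at least three characters of the root."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Apply the same word breaking and stemming at crawl time and query time."""
    return [stem(t) for t in break_words(text)]


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def match(index, query):
    """Return ids of documents containing any analyzed query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits
```

Because the same `analyze` function runs on both documents and queries, a query such as "crawled site" can match a document that contains "Crawling web-sites". Ranking the matched set is a separate step.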

SharePoint search applies what it calls static and dynamic ranking. Static ranking is applied to documents at crawl time and relates to the static elements of the documents, such as distance from authoritative pages, language, file type, and length. Dynamic ranking is applied at query time and assigns value for the specific terms queried, specifically keyword matches and proximity weighting.

In addition to or in support of these conditions, the factors discussed in the following sections are considered in ranking, but not necessarily in the order they are presented here.

1. Keyword Matches

The total number of times the terms appear in the document is, of course, important for ranking. The most obvious reason to rank a document as a match for a given term is the raw number of times that word is mentioned in the document. Documents that mention a specific topic numerous times are likely more relevant to that topic. Similarly, keywords that appear frequently in a corpus will likely have a lower overall relevancy value than those that appear relatively few times in the corpus. For example, a word like the name of the company will likely appear in almost every document. The term frequency–inverse document frequency ranking algorithm that the SharePoint search ranking rules are based upon will lower the overall value of that term and boost terms that appear in only a few documents, like product codes or project numbers.
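The effect of term frequency and inverse document frequency can be shown with a small sketch. It uses the textbook formula (raw term count multiplied by the logarithm of total documents over documents containing the term), which is an assumption about the general technique and not SharePoint's exact formula; the sample documents and the product code "px100" are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(docs, term):
    """Score each document for one term: term frequency times inverse document frequency."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    containing = sum(1 for tokens in tokenized.values() if term in tokens)
    if containing == 0:
        return {doc_id: 0.0 for doc_id in docs}
    idf = math.log(len(docs) / containing)
    return {doc_id: Counter(tokens)[term] * idf for doc_id, tokens in tokenized.items()}


# Invented corpus: the company name appears everywhere, the product code in one document.
docs = {
    "memo": "contoso memo contoso",
    "spec": "contoso px100 spec px100",
    "plan": "contoso plan",
}
print(tf_idf_scores(docs, "contoso"))
print(tf_idf_scores(docs, "px100"))
```

In this corpus the company name occurs in every document, so its inverse document frequency is zero and it contributes nothing to ranking, while the product code scores only in the one document that mentions it.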

1.1. Terms in the Body Text

Probably the most obvious place to match the terms is in the body of the documents. This is where the bulk of the text will be found, and it is the kind of ranking that most people understand because of the way that global search engines treat web pages. Also, the use of metadata to identify documents is relatively limited, so much of a document's thematic value lies in the body text. This is also, unfortunately, the place where it is most difficult to improve content to affect ranking. Having good headings and using accepted terminology are two ways to influence ranking in body text.

1.2. Terms in Titles

Titles are important indicators of a document's purpose. Although there are many poorly titled documents, if a term or phrase appears in the document's title, there is a good chance that the document is about that term or phrase. The chance is so good, in fact, that titles are usually given the highest ranking values. Good titling is getting more and more attention in the enterprise, so this ranking value is increasingly effective. Most things in life are given titles by people, including people themselves. And these titles, although sometimes misleading, tell us something essential about the thing that is titled. Therefore, improving titles on documents in an information handling system such as SharePoint is one of the easiest and most useful ways to influence enterprise search ranking.

1.3. Terms in Author Properties and Other Property Tags

Metadata in SharePoint 2010, often referred to as properties, is also important for ranking. Properly applied metadata gives documents purposeful associations to terms and themes that may not be prevalent in the document. One of the most common property tags and most essential to collaboration in SharePoint is the author property. This property is often applied on a number of different document types as well as lists, libraries, and web content. It is also associated with documents that are added to SharePoint, so there is a high probability that a document will have some sort of author associated with it.

SharePoint 2010 has the new ability to include associated metadata with document data in the index, improving search performance for metadata lookup and improving the ranking of documents based on that metadata. It also has the capability of adding inferred metadata based on terms or fields from within the body of the document.

1.4. Terms in Anchor Text

SharePoint 2010 adds ranking value to the documents based on the text in referring links. For sites where users are publishing blogs, wikis, or text content on content managed pages, this referring text consideration can be very useful. When people are placing a link to another document, it is natural to describe what that document is about on the link, and usually a short and descriptive text is used. Considering this in the ranking can have a positive influence but only when it is reasonable to have a descriptive link.

2. Proximity

Proximity refers to the relative closeness of the query terms found in a document. However, there is no indication that closeness of query terms that are not in a phrase has any influence in SharePoint search. Tests indicate that a document with two terms that are simply near each other would rank evenly with one that has the two terms at either end of the document. For SharePoint, proximity is based on how terms are grouped into segments or if they are found in phrases.

2.1. Query Segmentation

With multi-term queries, a document can often match the query terms in several different groupings. Some of those groupings match the query better than others, based on how the terms relate. For example, the query terms "great search engine" may return one document with the phrase "great enterprise search engine" and another with the phrase "great search for a used engine". Both of these documents have matches for all the terms. However, how these terms are broken into groups can dictate whether the document about search engines is ranked above the document about a great search. SharePoint takes segmentation rules into consideration when ranking, but such considerations generally offer little influence, and other values like frequency will often override such nuances.

2.2. Query Term Association

In search ranking generally, when multiple terms are queried in a phrase, documents in which the terms appear together as that phrase are naturally ranked highest, and terms that are close together are given higher ranking than terms that appear farther apart in a document. If one searches for "Microsoft Exchange", one would expect a document with the phrase in it to appear above a document with the sentence "Microsoft's filings to the US Securities and Exchange Commission." However, there is no evidence that SharePoint discriminates based on word location or closeness outside of phrase matches.
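The difference between an exact phrase match and a document that merely contains all the terms can be sketched as a simple check over tokenized text. The function name and the simplified sample sentences are assumptions for illustration; how SharePoint detects phrases internally is not described in the text.

```python
def contains_phrase(tokens, phrase_terms):
    """Return True if phrase_terms occur as one contiguous run inside tokens."""
    n = len(phrase_terms)
    return any(tokens[i:i + n] == phrase_terms for i in range(len(tokens) - n + 1))


query = ["microsoft", "exchange"]
with_phrase = "install microsoft exchange server".split()
scattered = "microsoft filings to the securities and exchange commission".split()

print(contains_phrase(with_phrase, query))  # the terms are adjacent and in order
print(contains_phrase(scattered, query))    # both terms are present but not as a phrase
```

A ranker could add a bonus to documents for which this check succeeds. Consistent with the observation above, the sketch gives no partial credit for terms that are merely near each other.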

3. Static Document Relationships

Static document relationship ranking considerations are those made at crawl time. The crawler breaks the text it streams into the database into what it finds as unique terms and then applies values for the documents in which those terms were found, based on a few factors such as placement in the site, language, file type, and distance from authoritative pages.

3.1. Click Distance and URL Depth

The click distance is the measure of the number of clicks it takes to get from one document to another. There are two elements to consider for click distance: click distance from what is set as or considered an authoritative page, and depth of the document in the site or URL depth. SharePoint site collections have a pyramid structure with one main entry page that leads off to sites, subsites, lists, libraries, pages, and documents. The distance between these is taken into consideration when applying ranking values. Top-level content will get higher ranking, and lower-level content with a deeper click depth will get a lower ranking, because the top-level sites are naturally considered more important and easier to access by users. The distance a document is from an authoritative page also counts. So the ranking can be influenced by setting authoritative pages close to important content. See the section on tuning search.
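Both measures can be approximated with a short sketch: a breadth-first search over the link graph gives the fewest clicks from any authoritative page, and counting path segments gives the URL depth. The link graph, page paths, and host name below are invented, and this is only one plausible way to compute these values, not necessarily how SharePoint does it.

```python
from collections import deque
from urllib.parse import urlparse


def click_distances(links, authoritative):
    """Breadth-first search: fewest clicks from any authoritative page to each reachable page."""
    distance = {page: 0 for page in authoritative}
    queue = deque(authoritative)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


def url_depth(url):
    """Count the non-empty path segments in a URL."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])


# Invented site structure: page -> pages it links to.
links = {
    "/": ["/hr", "/it"],
    "/hr": ["/hr/policies"],
    "/hr/policies": ["/hr/policies/leave.docx"],
}
print(click_distances(links, ["/"]))
print(click_distances(links, ["/", "/hr/policies"]))
print(url_depth("http://intranet/hr/policies/leave.docx"))
```

In the second call, marking "/hr/policies" as authoritative reduces the click distance of the leave document from three to one, which is the effect of setting authoritative pages close to important content.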

3.2. File Type

Certain file types are given higher static ranking on SharePoint than others. According to the latest available information, the document ranking order is web pages, PowerPoint presentations, Word documents, XML files, Excel spreadsheets, plain text files, and finally list items.

3.3. Length

Long documents have more terms and would generally be ranked higher than short documents if length were not taken into consideration. Therefore, the ranking is adjusted to consider the density and relative value of the query term to the entire document.
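The standard Okapi BM25 term weight shows one common way to make this adjustment. The formula below is the published BM25 term-frequency component; the parameter values `k1=1.2` and `b=0.75` are commonly cited defaults and are assumptions here, since the text does not state what SharePoint uses.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component with document length normalization.

    tf is the raw count of the term in the document. Documents longer than
    the corpus average are penalized; shorter ones are boosted.
    """
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Same raw count of three matches, in a short and a long document.
short_doc = bm25_term_weight(tf=3, doc_len=100, avg_doc_len=500)
long_doc = bm25_term_weight(tf=3, doc_len=2000, avg_doc_len=500)
print(short_doc > long_doc)
```

With the same three occurrences of a term, the short document receives the higher weight because the term makes up a larger share of its text. Setting `b=0` would switch length normalization off entirely.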

3.4. Language

For sites with documents in many languages, documents in the language of the search user's interface should be given ranking priority over other languages. In some cases, documents contain terms in more than one language or are mostly in one language but have a matching term from another language. In this case, documents are given a static rank at crawl time for the language SharePoint thinks is most likely the main language of the document. Additional ranking value is given at query time once the user's interface language is determined.

4. User-Driven Weighting

New to SharePoint 2010's ranking is the inclusion of social elements. This includes the adjustment of static rank values based on whether a document was selected frequently from the search result list.

4.1. Click Popularity

An additional relevancy mechanism in SharePoint 2010 is the weighting of results based on their click popularity in the result set. The links that are chosen for a specific query in a search result list add value to that specific document for that specific search term. Click-through relevancy weighting can help the organization to leverage the expertise of users by allowing them to choose specific documents from a result list and promote them. This is done without any added interaction or specific interaction by the end users. Their well-meaning information discovery helps the entire organization.
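A minimal sketch of click-through weighting follows, assuming clicks are counted per query and document pair and converted into a boost. The class name, method names, and the logarithmic damping are all hypothetical choices for illustration; the text does not describe how SharePoint stores or scales click data.

```python
import math
from collections import defaultdict


class ClickLog:
    """Count result clicks per (query, document) pair (hypothetical design)."""

    def __init__(self):
        self._clicks = defaultdict(int)

    def record_click(self, query, doc_id):
        """Register that a user chose doc_id from the results for query."""
        self._clicks[(query.lower(), doc_id)] += 1

    def boost(self, query, doc_id):
        """Logarithmic boost, so heavy click counts give diminishing returns (assumption)."""
        return math.log1p(self._clicks.get((query.lower(), doc_id), 0))


log = ClickLog()
log.record_click("Travel policy", "doc1")
log.record_click("travel policy", "doc1")
print(log.boost("travel policy", "doc1"), log.boost("travel policy", "doc2"))
```

The boost applies only to the specific document for the specific query, matching the description above: clicks on a result for one query do not raise that document for unrelated queries.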

NOTE

The mathematics behind the BM25F relevancy algorithm, which is the base of SharePoint's default ranking algorithm, is explained at http://en.wikipedia.org/wiki/Okapi_BM25 . Thanks to Mark Stone, technical product manager at Microsoft, for his help with SharePoint's ranking algorithm.
