Relevancy algorithms can be complicated because many factors must be
weighed when attempting to determine exactly which document best
matches a given term.
Many algorithms,
including Google's famous PageRank, have been published and can be
inspected freely on the Internet. However, developers creating a ranking
algorithm will almost always add their own modifications to whatever
base they are using, to produce a unique and functional ranking algorithm.
Google's present ranking mechanism is certainly far more
complex now than it was when its founders first devised it and
published it while at Stanford. Whatever the math behind the ranking
algorithm, experience and testing can confirm that the ranking in
SharePoint 2010 is highly effective for enterprise content and brings
relevant documents to the top of the search results page.
As mentioned,
SharePoint's default algorithm is purportedly based on an algorithm
called BM25. Some literature on the Internet about SharePoint's ranking
model also refers to neural networks. This is supported by a patent
filed by Microsoft's engineering team, which was granted as US Patent
No. 7,840,569 B2 in November 2010. The term neural network suggests that
the algorithm is designed to learn from user behavior, something that
SharePoint's algorithm certainly does. However, there is no way to know
whether the patented ranking algorithm is the core of SharePoint's.
To simplify things, however,
we can think of an enterprise search ranking algorithm as a formula that
ranks documents by how frequently the query terms appear in the matching
documents and by each document's overall value relative to the others.
SharePoint applies various ranking considerations to the different
fields that make up a document. For SharePoint specifically, these
fields extend to properties and other data associated with the
documents. So a single document becomes a collection of differently
weighted ranking
considerations. These ranking considerations include, but may not be
limited to, the following areas:
The first step SharePoint
search takes when receiving a query is to pass the query through a word
breaker to ensure the query terms match terms that may be stored in the
index. SharePoint has a specific word breaker for each language that can
tell where to break compound terms and tokenize them. This word
breaking or tokenization happens during both the crawling of terms and
the querying of terms to ensure that streams of indexed text are broken
into simple items. The neutral or default word breaker in SharePoint
breaks terms only at white space or filler characters such as hyphens and
slashes. Other language word breakers do a more complex analysis based
on the grammar of the given language. Some languages have no white space
between terms, so they require a special understanding of the
characters. Next, the broken or tokenized terms are sent to a stemmer to
reduce them to their root form. These terms are finally checked against
terms in the index, and a result set is assembled based on all the
documents that contain matching terms from the entire corpus. The result
set is then prioritized with the item with the highest ranking value at
the top of the first page and subsequent matches listed in descending
order.
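The query path described above (word breaking, stemming, index lookup, and ordering by rank) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `break_words` imitates the neutral word breaker alone, `stem` is a toy suffix stripper standing in for a real language stemmer, and the "rank" is a plain occurrence count rather than SharePoint's actual ranking value. All function names are hypothetical.

```python
import re
from collections import defaultdict

def break_words(text):
    """Toy neutral word breaker: split on white space, hyphens, and slashes."""
    return [t for t in re.split(r"[\s\-/]+", text.lower()) if t]

def stem(term):
    """Toy stemmer (an assumption): strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def build_index(docs):
    """Crawl-time step: map each stemmed term to {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in break_words(text):
            root = stem(term)
            index[root][doc_id] = index[root].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Query-time step: break and stem the query, then order matches by score."""
    scores = defaultdict(int)
    for term in break_words(query):
        for doc_id, count in index.get(stem(term), {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Because the same breaker and stemmer run at both crawl time and query time, a query for "search engines" can match a document that contains "Searching" and "search-engine".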
SharePoint search applies
what it calls static and dynamic ranking. Static ranking is applied to
documents at crawl time and relates to the static elements of the
documents, such as distance from authoritative pages, language, file
type, length, etc. Dynamic ranking is applied at query time and assigns
value for the specific terms queried, chiefly through keyword matches
and proximity weighting.
In addition to or in support of
these conditions, the factors discussed in the following sections are
considered in ranking, but not necessarily in the order they are
presented here.
1. Keyword Matches
The most obvious signal that a document matches a given term is the raw
number of times the term appears in the document. Documents that mention
a specific topic numerous times are likely more relevant to that topic.
Conversely, keywords that appear frequently across the corpus will likely
carry a lower overall relevancy value than those that appear relatively
few times in the corpus. For example, the name of the company will likely
appear in almost every document. The term frequency–inverse document
frequency ranking algorithm that the SharePoint search ranking rules are
based upon will lower the overall value of that term and boost terms that
appear in only a few documents, like product codes or project numbers.
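The effect described above can be seen with a textbook term frequency–inverse document frequency calculation. The sketch below is a generic TF-IDF variant with a smoothed logarithm, not SharePoint's actual formula; the function name `tf_idf`, the company name "contoso", and the product code "px100" are made up for illustration.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Raw term frequency multiplied by a smoothed inverse document frequency."""
    tf = doc_tokens.count(term)
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    # Smoothed IDF: a term found in every document scores log(1) = 0.
    idf = math.log((1 + len(corpus)) / (1 + docs_with_term))
    return tf * idf

# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["contoso", "report", "px100"],
    ["contoso", "budget"],
    ["contoso", "memo"],
]
company_score = tf_idf("contoso", corpus[0], corpus)  # term appears in every document
product_score = tf_idf("px100", corpus[0], corpus)    # term appears in one document
```

Here the company name contributes nothing to the score because it appears in every document, while the rare product code receives a positive weight.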
1.1. Terms in the Body Text
Probably the most obvious
place to match the terms is in the body of the documents. This is where
the bulk of the text is found, and it is the kind of ranking that most
people understand because of the way global search engines treat web
pages. Also, the use of metadata to identify documents is relatively
limited, so much of a document's thematic value lies in the body text.
This is also, unfortunately, the place where it is most difficult to
improve content to affect ranking. Having good headings and using
accepted terminology are two ways to influence ranking in body text.
1.2. Terms in Titles
Titles are important
indicators of a document's purpose. Although there are many poorly
titled documents, if a term or phrase appears in a document's title,
there is a good chance that the document is about that term or phrase.
The chance is so good, in fact, that titles are usually given the highest
ranking values. Good titling is getting more and more attention in the
enterprise, so this ranking value is increasingly effective. People give
titles to most things, including themselves, and these titles, although
sometimes misleading, tell us something essential about the thing that
is titled. Therefore, improving titles on documents in an information
handling system such as SharePoint is one of the easiest and most useful
ways to influence enterprise search ranking.
1.3. Terms in Author Properties and Other Property Tags
Metadata in SharePoint
2010, often referred to as properties, is also important for ranking.
Properly applied metadata gives documents purposeful associations to
terms and themes that may not be prevalent in the document. One of the
most common property tags and most essential to collaboration in
SharePoint is the author property. This property is often applied on a
number of different document types as well as lists, libraries, and web
content. It is also associated with documents that are added to
SharePoint, so there is a high probability that a document will have
some sort of author associated with it.
SharePoint 2010 has the new
ability to include associated metadata with document data in the index,
improving search performance for metadata lookup and improving the
ranking of documents based on that metadata. It also has the capability
of adding inferred metadata based on terms or fields from within the
body of the document.
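The field-by-field treatment described so far (body text, titles, and properties) can be pictured as a weighted sum of term counts per field. The following is a minimal Python sketch under that assumption. The field names, the weight values, and the function `weighted_term_frequency` are invented for illustration; the text does not state SharePoint's actual weights, only that titles usually count most.

```python
# Hypothetical per-field weights: title highest, then a property, then body.
FIELD_WEIGHTS = {"title": 5.0, "author": 2.0, "body": 1.0}

def weighted_term_frequency(term, document, weights=FIELD_WEIGHTS):
    """Sum the term's occurrences in each field, scaled by that field's weight."""
    total = 0.0
    needle = term.lower()
    for field, weight in weights.items():
        tokens = document.get(field, "").lower().split()
        total += weight * tokens.count(needle)
    return total

titled = {"title": "Travel policy", "body": "Rules for expenses"}
mentioned = {"title": "Memo", "body": "See the travel policy for travel rules"}
```

With these made-up weights, a single title match in `titled` outweighs two body matches in `mentioned`, which mirrors why improving titles is such an effective tuning step.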
1.4. Terms in Anchor Text
SharePoint 2010 adds ranking
value to the documents based on the text in referring links. For sites
where users are publishing blogs, wikis, or text content on content
managed pages, this referring text consideration can be very useful.
When people are placing a link to another document, it is natural to
describe what that document is about on the link, and usually a short
and descriptive text is used. Considering this in the ranking can have a
positive influence but only when it is reasonable to have a descriptive
link.
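One way to picture anchor-text ranking is as an extra searchable field built from the text of every link that points at a document. The sketch below assumes a list of (source, target, anchor text) tuples gathered at crawl time and skips generic link text, reflecting the caveat that only descriptive links help. The function name, the tuple layout, and the list of generic phrases are all assumptions, not SharePoint behavior.

```python
from collections import defaultdict

# Hypothetical examples of link text that says nothing about the target.
GENERIC_ANCHORS = {"click here", "here", "link", "read more"}

def collect_anchor_text(links):
    """Group descriptive anchor text by the document each link points to."""
    anchors = defaultdict(list)
    for _source, target, text in links:
        text = text.strip()
        if text and text.lower() not in GENERIC_ANCHORS:
            anchors[target].append(text)
    return dict(anchors)

links = [
    ("blog/1", "docs/policy.docx", "Travel expense policy"),
    ("wiki/a", "docs/policy.docx", "click here"),
    ("wiki/b", "docs/plan.pptx", "  2011 sales plan "),
]
anchor_fields = collect_anchor_text(links)
```

The collected text could then be scored like any other field of the target document.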
2. Proximity
Proximity refers to
the relative closeness of the query terms found in a document. However,
there is no indication that closeness of query terms that are not in a
phrase has any influence in SharePoint search. Tests indicate that a
document with two terms that are simply near each other would rank
equally with one that has the two terms at opposite ends of the document.
For SharePoint, proximity is based on how terms are grouped into
segments or if they are found in phrases.
2.1. Query Segmentation
In multi-term queries, a document often contains several sets of terms
that may match the query. Some of those sets may match the query better
than others, based on how the terms relate. For example, the query
"great search engine" may return one document with the phrase "great
enterprise search engine" and another with the phrase "great search for
a used engine". Both of these documents have matches for all the terms.
However, how these terms are broken into groups can dictate whether the
document about search engines is ranked above the document about a great
search. SharePoint takes segmentation rules into consideration when
ranking, but such considerations generally offer little influence, and
other values like frequency will often override such nuances.
2.2. Query Term Association
When multiple terms are queried as a phrase, one would naturally expect
documents in which the terms appear together as that phrase to rank
highest, and documents in which the terms are close together to rank
above those in which they are farther apart. If one searches for
"Microsoft Exchange", one would expect a document with the phrase in it
to appear above a document with the sentence "Microsoft's filings to the
US Securities and Exchange Commission." However, there is no evidence
that SharePoint discriminates based on word location or closeness
outside of phrase matches.
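The distinction between a phrase match and a document that merely contains all the terms can be sketched as a contiguous-token check. This is an illustrative assumption about how phrase detection might work, not SharePoint's implementation; `contains_phrase` and the sample token lists are hypothetical.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

query = ["microsoft", "exchange"]
phrase_doc = "install microsoft exchange server".split()
scattered_doc = "microsoft filings to the us securities and exchange commission".split()

# Both documents contain every query term, but only one contains the phrase.
both_have_terms = all(t in phrase_doc and t in scattered_doc for t in query)
```

A ranker that rewards only phrase matches would boost `phrase_doc` and treat `scattered_doc` like any other document containing both terms, regardless of how far apart they are.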
3. Static Document Relationships
Static document
relationship ranking considerations are those made at crawl time. The
crawler breaks the text it streams into the database into what it
identifies as unique terms and then assigns values to the documents in
which those terms were found, based on a few factors such as placement
in the site, language, file type, and distance from authoritative pages.
3.1. Click Distance and URL Depth
The click distance is the
measure of the number of clicks it takes to get from one document to
another. There are two elements to consider for click distance: click
distance from what is set as or considered an authoritative page, and
depth of the document in the site or URL depth. SharePoint site
collections have a pyramid structure with one main entry page that leads
off to sites, subsites, lists, libraries, pages, and documents. The
distance between these is taken into consideration when applying ranking
values. Top-level content gets a higher ranking, and lower-level
content with a deeper click depth gets a lower ranking, because
top-level sites are naturally considered more important and easier for
users to access. The distance of a document from an authoritative page
also counts, so ranking can be influenced by setting authoritative
pages close to important content. See the section on tuning search.
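Both measures described above are easy to compute on a small link graph. The sketch below uses a breadth-first search from one or more authoritative pages to obtain click distance, and counts path segments to obtain URL depth. The function names, the `links` dictionary layout, and the sample URLs are assumptions for illustration; how SharePoint converts these numbers into rank values is not stated in the text.

```python
from collections import deque
from urllib.parse import urlparse

def click_distances(links, authoritative_pages):
    """Breadth-first search: fewest clicks from any authoritative page to each page."""
    distances = {page: 0 for page in authoritative_pages}
    queue = deque(authoritative_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances

def url_depth(url):
    """Number of non-empty path segments in the URL."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

# Hypothetical site structure: page -> pages it links to.
links = {
    "home": ["hr", "it"],
    "hr": ["policies"],
    "policies": ["travel.docx"],
}
from_home = click_distances(links, ["home"])
with_extra_authority = click_distances(links, ["home", "policies"])
```

Marking "policies" as authoritative reduces the click distance of "travel.docx" from 3 to 1, which is the kind of adjustment the tuning advice above relies on.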
3.2. File Type
Certain file types are given
higher static ranking in SharePoint than others. According to the
latest available information, the ranking order, from highest to lowest,
is web pages, PowerPoint presentations, Word documents, XML files, Excel
spreadsheets, plain text files, and finally list items.
3.3. Length
Long documents have more
terms and would generally be ranked higher than short documents if
length were not taken into consideration. Therefore, the ranking is
adjusted to consider the density and relative value of the query term to
the entire document.
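One common way to make this adjustment is the length normalization used in the Okapi BM25 family, where a term's count is damped according to how long the document is relative to the average. The sketch below shows that standard factor. The parameter names `k1` and `b` and their defaults (1.2 and 0.75) are conventional textbook values, not SharePoint settings, and the function name is hypothetical.

```python
def normalized_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25-style term weight: saturating in tf and penalized for long documents."""
    # b = 0 disables length normalization; b = 1 applies it fully.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# The same three occurrences count for more in an average-length document
# than in one ten times longer.
average_doc = normalized_tf(3, doc_len=100, avg_doc_len=100)
long_doc = normalized_tf(3, doc_len=1000, avg_doc_len=100)
```

With `b` set to 0 the two documents would score identically, which is the unadjusted behavior the paragraph above warns against.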
3.4. Language
For sites with documents
in many languages, documents in the language of the search user's
interface should be given ranking priority over other languages. In some
cases, documents contain terms in more than one language or are mostly
in one language but have a matching term from another language. In this
case, documents are given a static rank at crawl time for the language
SharePoint thinks is most likely the main language of the document.
Additional ranking value is given at query time once the user's
interface language is determined.
4. User-Driven Weighting
New to SharePoint 2010's
ranking is the inclusion of social elements. This includes the
adjustment of static rank values based on whether a document was
selected frequently from the search result list.
4.1. Click Popularity
An additional relevancy
mechanism in SharePoint 2010 is the weighting of results based on their
click popularity in the result set. The links that are chosen for a
specific query in a search result list add value to that specific
document for that specific search term. Click-through relevancy
weighting helps the organization leverage the expertise of its users,
whose choice of specific documents from a result list promotes those
documents. This happens without any additional action by the end users.
Their well-meaning information discovery helps the entire organization.
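The idea of per-query click weighting can be sketched as a counter keyed by query and document. The class below is a hypothetical illustration: the name `ClickLog`, the logarithmic boost, and the query normalization are all assumptions, since the text does not describe how SharePoint stores or applies click data.

```python
import math
from collections import Counter

class ClickLog:
    """Records which result was clicked for which query."""

    def __init__(self):
        self._clicks = Counter()

    @staticmethod
    def _key(query, doc_id):
        return (query.strip().lower(), doc_id)

    def record(self, query, doc_id):
        """Call once each time a user clicks doc_id in the results for query."""
        self._clicks[self._key(query, doc_id)] += 1

    def boost(self, query, doc_id):
        """Multiplier of 1.0 with no clicks, growing slowly as clicks accumulate."""
        return 1.0 + math.log1p(self._clicks[self._key(query, doc_id)])

log = ClickLog()
for _ in range(3):
    log.record("Travel Policy", "docs/policy.docx")
```

The boost applies only to the query under which the clicks were recorded, matching the description that a click adds value to a specific document for a specific search term.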
NOTE
The mathematics
behind the BM25F relevancy algorithm, which is the base of SharePoint's
default ranking algorithm, is explained at http://en.wikipedia.org/wiki/Okapi_BM25. Thanks to Mark Stone, technical product manager at Microsoft, for his help with SharePoint's ranking algorithm.