SharePoint 2010 Search : Tuning Search (part 2) - The Thesaurus & Custom Dictionaries

8/2/2011 3:08:11 PM

4. The Thesaurus

SharePoint can match query terms with potential synonyms and return documents based on those synonyms. For example, a user may be looking for the project plan for the windmill the company is consulting on without realizing that the documents actually refer to a wind turbine (windmills mill grain, wind turbines produce electricity). Searching for "windmill" will return no hits, which leads to frustration and probably a call to the engineers, followed by some laughter and derision (engineers can be callous). To avoid this, a synonym match between "windmill" and "wind turbine" can be entered in the thesaurus if the mistake proves common enough to warrant it.

The thesaurus files are installed in the same folder as the stop word files, C:\Program Files\Microsoft Office Servers\14.0\Data\Office Server\Config, and similarly there is one file for each supported language as well as a language-neutral file. These are the original, unmodified set of thesaurus and stop word files. When a Search service application is created in the SharePoint farm, SharePoint copies a set to every query server, into C:\Program Files\Microsoft Office Servers\14.0\Data\Applications\GUID\Config. If the original files are edited, the edited versions are staged out whenever a new Search service application is created. The thesaurus files of existing Search service applications are not overwritten, however, and must be edited individually.
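Because existing Search service applications keep their own copies, an edited file has to be pushed into each Config folder by hand or by script. The following is a minimal sketch of such a script, not a SharePoint tool: the function name stage_thesaurus and the idea of passing in a list of folders are assumptions made for illustration, and the GUID part of each folder path must be looked up on the servers themselves.

```python
import shutil
from pathlib import Path


def stage_thesaurus(edited_file, config_dirs):
    """Copy one edited thesaurus file into each Config folder.

    edited_file: path to the edited file, e.g. tsneu.xml.
    config_dirs: folders to copy into (hypothetical; one per query server
                 and Search service application).
    Returns the paths of the copies that were written.
    """
    source = Path(edited_file)
    copies = []
    for folder in config_dirs:
        target = Path(folder) / source.name
        # copy2 overwrites an existing file with the same name
        shutil.copy2(source, target)
        copies.append(target)
    return copies
```

The folders could be local paths or administrative shares on the query servers; either way the search service still has to be restarted afterward for the change to be picked up.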

By default, the thesaurus files contain no active entries; the examples within them are commented out. The administrator must therefore edit the files for them to take effect and copy them to each query server. A file is used only if the search is initiated in the language context for that file. The different language files are shown in Table 1.

Table 1. The Thesaurus Files in SharePoint*
Language	File name	Language	File name	Language	File name	Language	File name
Language-neutral	tsneu.xml	English (United States)	tsenu.xml	Lithuanian	tslit.xml	Serbian (Latin)	tssbl.xml
Arabic	tsara.xml	Finnish	tsfin.xml	Malay (Malaysian)	tsmal.xml	Slovak	tssvk.xml
Bengali	tsben.xml	French (Standard)	tsfra.xml	Malayalam	tsmly.xml	Slovenian	tsslo.xml
Bulgarian	tsbul.xml	German (Standard)	tsdeu.xml	Marathi	tsmar.xml	Spanish	tsesn.xml
Catalan	tscat.xml	Gujarati	tsguj.xml	Norwegian (Bokmal)	tsnor.xml	Swedish	tssve.xml
Chinese (Simplified)	tschs.xml	Hungarian	tshun.xml	Polish	tsplk.xml	Tamil	tstam.xml
Chinese (Traditional)	tscht.xml	Icelandic	tsice.xml	Portuguese (Brazil)	tsptb.xml	Telugu	tstel.xml
Croatian	tscro.xml	Indonesian	tsind.xml	Portuguese (Portugal)	tspor.xml	Thai	tstha.xml
Czech	tsces.xml	Italian	tsita.xml	Punjabi	tspun.xml	Turkish	tstur.xml
Danish	tsdan.xml	Japanese	tsjpn.xml	Romanian	tsrom.xml	Ukrainian	tsukr.xml
Dutch (Netherlands)	tsnld.xml	Kannada	tskan.xml	Russian	tsrus.xml	Urdu (Pakistan)	tsurd.xml
English (United Kingdom)	tseng.xml	Korean	tskor.xml	Serbian (Cyrillic)	tssbc.xml	Vietnamese	tsvie.xml
*Source: http://technet.microsoft.com/en-us/library/dd361734.aspx

The file tsneu.xml is the language-neutral file that is the default file utilized if no specific language value is passed. On installation it has the following structure and values:

<XML ID="Microsoft Search Thesaurus">
<!--
  Commented out

    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
  -->
  </XML>

The first thing to do when using the file is to remove the comment tags (<!-- and -->). This makes the entries active. The first tag to consider is the diacritics sensitivity tag. By default it has a value of 0, which turns it off. To enable it, set it to 1. Diacritical marks, such as accents or umlauts, are marks that some languages use to modify the sounds of certain letters. Many European languages will benefit from diacritical sensitivity. English is not one of them.

The remaining sections are either expansion or replacement sections. Expansion is used to add synonyms to a given term. If a term in a <sub></sub> tag pair is searched for, the terms in the other sub tag pairs of the same set are searched for as well. Any term in an expansion set triggers the other terms in the set. Each set of synonyms must be enclosed in its own expansion tag pair, with each term in its own sub tag pair.

<expansion>
    <sub>windmill</sub>
    <sub>wind turbine</sub>
    <sub>generator</sub>
</expansion>
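The expansion behavior can be pictured with a few lines of Python. This is only a toy model of the rule described above, not how the search engine is implemented; the function name expand and the list-of-lists layout are assumptions for illustration, and matching here is exact.

```python
def expand(term, expansion_sets):
    """Return the queried term plus every synonym from sets that contain it."""
    results = [term]
    for group in expansion_sets:
        if term in group:
            # any member of a set pulls in all the other members
            results.extend(member for member in group if member != term)
    return results


# One list per <expansion> section, one string per <sub> entry.
EXPANSIONS = [["windmill", "wind turbine", "generator"]]

print(expand("windmill", EXPANSIONS))
print(expand("project plan", EXPANSIONS))
```

A term that is in no set, such as "project plan" here, comes back unchanged.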

The replacement section is used to substitute one or more terms for the term that was typed. This is useful when dealing with spelling mistakes or unused synonym terms. The query term itself is not searched for; only the replacement terms are.

<replacement>
    <pat>scarepoint</pat>
    <sub>sharepoint</sub>
    <sub>SP2010</sub>
</replacement>
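The difference from expansion is that the typed term is dropped. A toy model of that rule, again with hypothetical names (apply_replacement, REPLACEMENTS) and exact matching as simplifying assumptions:

```python
def apply_replacement(term, replacement_sets):
    """Return the terms that are actually searched for when `term` is queried."""
    for patterns, substitutes in replacement_sets:
        if term in patterns:
            # the pattern itself is not searched, only its substitutes
            return list(substitutes)
    return [term]


# One (pat entries, sub entries) pair per <replacement> section.
REPLACEMENTS = [(["scarepoint"], ["sharepoint", "SP2010"])]

print(apply_replacement("scarepoint", REPLACEMENTS))
print(apply_replacement("sharepoint", REPLACEMENTS))
```

Note that "sharepoint" is not a pattern, so searching for it does not pull in "SP2010"; that would require an expansion set.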

As is shown in these examples, there can be a one-to-many relationship with both expansion and replacement sections. Microsoft does not recommend more than 10,000 entries in a single thesaurus file. Each entry where a term is defined (<sub> or <pat>) is considered one entry.
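To keep an eye on the 10,000-entry recommendation, the entries in a file can be counted with a short script. The sketch below uses Python's standard xml.etree.ElementTree module on an inline sample that combines the two examples above; count_entries and MAX_ENTRIES are names chosen here for illustration, and the script assumes the comment tags have already been removed from the file.

```python
import xml.etree.ElementTree as ET

MAX_ENTRIES = 10_000  # the recommended ceiling quoted above

SAMPLE = """<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>windmill</sub>
            <sub>wind turbine</sub>
            <sub>generator</sub>
        </expansion>
        <replacement>
            <pat>scarepoint</pat>
            <sub>sharepoint</sub>
            <sub>SP2010</sub>
        </replacement>
    </thesaurus>
</XML>"""


def count_entries(xml_text):
    """Count every <sub> and <pat> element, ignoring the XML namespace."""
    root = ET.fromstring(xml_text)
    total = 0
    for element in root.iter():
        # tags look like "{x-schema:tsSchema.xml}sub"; keep the local name
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("sub", "pat"):
            total += 1
    return total


total = count_entries(SAMPLE)
status = "within" if total <= MAX_ENTRIES else "over"
print(total, "entries,", status, "the recommended limit")
```

For a real file, the text would be read from disk with the file's own encoding before being passed to count_entries.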

Save the thesaurus files as Unicode. If you are editing them in Notepad, this is the default encoding. Other text editors may require special care. After updating a thesaurus file, the Search service application needs to be restarted before changes will take effect. This can be accomplished by opening the services snap-in and restarting SharePoint Server Search 14.
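When the files are generated by a script rather than edited in Notepad, the encoding has to be set explicitly. The sketch below assumes that "Unicode" means what Notepad's Unicode option produces, namely UTF-16 little-endian with a byte order mark; save_as_unicode is a hypothetical helper name.

```python
import codecs
from pathlib import Path


def save_as_unicode(path, text):
    """Write text as UTF-16 little-endian with a leading byte order mark."""
    # Assumption: this matches the "Unicode" encoding offered by Notepad.
    data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    Path(path).write_bytes(data)
```

Writing the byte order mark by hand keeps the result the same on every platform, whereas Python's plain "utf-16" codec picks the byte order of the machine it runs on.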

5. Custom Dictionaries

Custom dictionaries are lists of words that the search engine may match exactly and pass as a query. These dictionaries supersede the built-in word breakers in SharePoint.

Word breakers are a hidden part of the index and query processes of SharePoint search that manage how terms are handled by the query process. They are small programs or routines that break complex terms into shorter, more understandable terms. As we have seen in the "Stop Words" section, not all words are interesting to search, and the most common ones can be disregarded safely. Similarly, there are many characters that do not conform to the standard conception of what makes up a word. Special characters, such as ampersands (&), dollar signs ($), stars (*), the "at" character (@), and hyphens (-), among many others, are very common in digital information. Many organizations rely on combinations of these characters with letters and numbers to identify documents or products. The "at" character is seen in every e-mail address.

Usually, when put in context, many of these characters can be seen as word separators and hold little contextual value. For this reason, word breakers are employed to break these terms into smaller terms that are more likely to be searched for and make sense. For example, the phrase "search-driven application" contains a hyphen, linking "search" and "driven". It's common to combine words like this, but I might search for "search driven application" and expect to get results. If the search engine keeps "search-driven" as a single term, I won't find the document with the hyphenated version. Therefore, a word breaker is employed to break apart the term and allow for both variations to be searched.

This doesn't always make sense. Say, for example, an oil drilling company has a pipeline with many valves, and each valve has a unique ID with letters, numbers, and hyphens (e.g., VLV-123-456). If the valve is turned off without checking a document to see what the consequences will be, the whole pipeline could be shut down, or worse, a catastrophic failure could be caused. If the word breaker is allowed to break apart the term, all documents containing vlv, 123, and 456 will be returned. That can be a large number of documents for the user to sift through. (Mind you, documents containing "vlv 123 456" as a phrase should be returned first; the problem gets worse when partial terms are searched and wildcards are used.) So having the search term treated as a single term and not broken into its parts can be valuable. This is where custom dictionaries come into play.
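The effect can be illustrated with a deliberately simple stand-in for a word breaker. The real word breakers are language-specific and far more involved; the function below only splits on anything that is not a letter or digit, and it treats a term found in the custom dictionary as unbreakable. The name break_words and its parameters are assumptions for illustration.

```python
import re


def break_words(text, custom_dictionary=()):
    """Split text into terms, keeping custom dictionary entries whole."""
    # dictionary entries are compared without regard to case
    protected = {entry.lower() for entry in custom_dictionary}
    terms = []
    for token in text.split():
        if token.lower() in protected:
            terms.append(token)
        else:
            # toy rule: every run of non-alphanumeric characters is a separator
            terms.extend(part for part in re.split(r"[^0-9A-Za-z]+", token) if part)
    return terms


print(break_words("search-driven application"))
print(break_words("Inspect VLV-123-456 today"))
print(break_words("Inspect VLV-123-456 today", custom_dictionary=["vlv-123-456"]))
```

Without the dictionary the valve ID is broken into VLV, 123, and 456; with it, the ID survives as one term while "search-driven" is still split as before.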

Here are the rules that must be observed when creating custom dictionaries:

Each supported language has its own custom dictionary.
Custom dictionaries (like stop word files and thesaurus files) should be saved in Unicode.
Custom dictionaries have the file type .lex and are named CustomXXXX, where XXXX is the four-digit hexadecimal language code.
Entries in the custom dictionaries are not case-sensitive.
The pipe character (|) is not accepted.
Entries cannot contain blank spaces (white space).
The pound character or number sign (#) cannot be used at the beginning of an entry, but it can be used within it or at the end, e.g., #Test is not acceptable but T#st and Test# are OK.
Aside from the foregoing exceptions, any other character is acceptable.
The maximum length of a single entry is 128 (Unicode) characters.
There must be a copy of the custom dictionary files on each query server.
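The character and length rules above lend themselves to a quick check before a dictionary is deployed. The following is a minimal sketch of such a check; entry_problems is a hypothetical helper, and it assumes that the 128-character limit counts Unicode characters the way Python's len() does.

```python
def entry_problems(entry):
    """Return a list of the rules above that a dictionary entry breaks."""
    problems = []
    if "|" in entry:
        problems.append("contains the pipe character")
    if any(ch.isspace() for ch in entry):
        problems.append("contains white space")
    if entry.startswith("#"):
        problems.append("starts with #")
    # Assumption: the limit is counted in Unicode characters (code points).
    if len(entry) > 128:
        problems.append("longer than 128 characters")
    return problems


for candidate in ["VLV-123-456", "#Test", "T#st", "Test#", "wind turbine"]:
    print(candidate, entry_problems(candidate) or "OK")
```

An empty list means the entry passes these particular checks; the per-language and per-server rules still have to be handled when the file is deployed.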

Here are the steps for creating a custom dictionary:

Create a new text file in a text editor (like Notepad).
Add your terms, taking into consideration the foregoing limitations and rules.
Save the file with the appropriate file name (e.g., Custom0009.lex) in the %ProgramFiles%\Microsoft Office Servers\14.0\Bin folder.
Restart the Search service application by running services.msc from the Start menu and restarting the SharePoint Server Search 14 service.
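The first three steps can also be scripted. The sketch below is one possible way to do it under stated assumptions: the helper names are hypothetical, only three of the language codes from Table 2 are included, terms are written one per line, and "Unicode" is taken to mean UTF-16 little-endian with a byte order mark. Restarting the service remains a manual step.

```python
import codecs
from pathlib import Path

# A few hexadecimal language codes copied from Table 2 (not the full list).
LANGUAGE_CODES = {"English": "0009", "French": "000c", "German": "0007"}


def dictionary_file_name(language):
    """Build the CustomXXXX.lex file name for a language listed above."""
    return "Custom" + LANGUAGE_CODES[language] + ".lex"


def write_dictionary(folder, language, terms):
    """Write the terms, one per line (an assumption), as a Unicode .lex file."""
    path = Path(folder) / dictionary_file_name(language)
    text = "\r\n".join(terms) + "\r\n"
    # Assumption: UTF-16 LE with a byte order mark, as Notepad's "Unicode" writes.
    path.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
    return path
```

Calling write_dictionary with the Bin folder from step 3, "English", and a list of terms would produce Custom0009.lex there; the same file then has to be copied to every query server.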

Table 2. Supported Languages for Custom Dictionaries and Their Language Codes*
Language / dialect	LCID	Language hexadecimal code	Language / dialect	LCID	Language hexadecimal code
Arabic	1025	0001	Malay	1086	003e
Bengali	1093	0045	Malayalam	1100	004c
Bulgarian	1026	0002	Marathi	1102	004e
Catalan	1027	0003	Norwegian_Bokmaal	1044	0414
Croatian	1050	001a	Portuguese	2070	0816
Danish	1030	0006	Portuguese_Braz	1046	0416
Dutch	1043	0013	Punjabi	1094	0046
English	1033	0009	Romanian	1048	0018
French	1036	000c	Russian	1049	0019
German	1031	0007	Serbian_Cyrillic	3098	0c1a
Gujarati	1095	0047	Serbian_Latin	2074	081a
Hebrew	1037	000d	Slovak	1051	001b
Hindi	1081	0039	Slovenian	1060	0024
Icelandic	1039	000f	Spanish	3082	000a
Indonesian	1057	0021	Swedish	1053	001d
Italian	1040	0010	Tamil	1097	0049
Japanese	1041	0011	Telugu	1098	004a
Kannada	1099	004b	Ukrainian	1058	0022
Latvian	1062	0026	Urdu	1056	0020
Lithuanian	1063	0027	Vietnamese	1066	002a
*Source: http://technet.microsoft.com/en-us/library/cc263242.aspx