4. The Thesaurus
SharePoint can match query terms with potential synonyms and return
documents based on those synonyms. For example, a user may be looking
for the project plan for the windmill the company is consulting on,
without realizing that the documents actually refer to a wind turbine
(windmills mill grain, wind turbines produce electricity). Searching for
"windmill" will return no hits, which ultimately causes frustration, and
probably a call to the engineers followed by some laughter and derision
(engineers can be callous). To avoid this, a synonym match between
windmill and wind turbine can be entered in the thesaurus if the mistake
proves common enough to warrant it.
The thesaurus files in SharePoint are installed into the same folder as
the stop word files, C:\Program Files\Microsoft Office Servers\14.0\Data\Office Server\Config,
and similarly there is one file for each supported language as well as a
language-neutral file. These are a pristine set of the thesaurus and
stop word files. When a Search service application is created in the
SharePoint farm, SharePoint copies a set out to all query servers in the
location C:\Program Files\Microsoft Office Servers\14.0\Data\Applications\GUID\Config.
If the pristine files are edited, the edited files will be staged out
whenever a new Search service application is created. The thesaurus
files of existing Search service applications are not overwritten,
however, and must be edited individually.
By default, the thesaurus files have no active entries, and the examples
within them are commented out. The administrator must therefore edit the
files for them to function and copy them out to each query server. A
file is called only if the search is initiated in the language context
for that file. The different language files are shown in Table 1.
Table 1. The Thesaurus Files in SharePoint*
Language | File name | Language | File name | Language | File name | Language | File name |
---|---|---|---|---|---|---|---|
Language-neutral | tsneu.xml | English (United States) | tsenu.xml | Lithuanian | tslit.xml | Serbian (Latin) | tssbl.xml |
Arabic | tsara.xml | Finnish | tsfin.xml | Malay (Malaysian) | tsmal.xml | Slovak | tssvk.xml |
Bengali | tsben.xml | French (Standard) | tsfra.xml | Malayalam | tsmly.xml | Slovenian | tsslo.xml |
Bulgarian | tsbul.xml | German (Standard) | tsdeu.xml | Marathi | tsmar.xml | Spanish | tsesn.xml |
Catalan | tscat.xml | Gujarati | tsguj.xml | Norwegian (Bokmal) | tsnor.xml | Swedish | tssve.xml |
Chinese (Simplified) | tschs.xml | Hungarian | tshun.xml | Polish | tsplk.xml | Tamil | tstam.xml |
Chinese (Traditional) | tscht.xml | Icelandic | tsice.xml | Portuguese (Brazil) | tsptb.xml | Telugu | tstel.xml |
Croatian | tscro.xml | Indonesian | tsind.xml | Portuguese (Portugal) | tspor.xml | Thai | tstha.xml |
Czech | tsces.xml | Italian | tsita.xml | Punjabi | tspun.xml | Turkish | tstur.xml |
Danish | tsdan.xml | Japanese | tsjpn.xml | Romanian | tsrom.xml | Ukrainian | tsukr.xml |
Dutch (Netherlands) | tsnld.xml | Kannada | tskan.xml | Russian | tsrus.xml | Urdu (Pakistan) | tsurd.xml |
English (United Kingdom) | tseng.xml | Korean | tskor.xml | Serbian (Cyrillic) | tssbc.xml | Vietnamese | tsvie.xml |
The file tsneu.xml is the language-neutral file, which is the default
used when no specific language value is passed. On installation it has
the following structure and values:
<XML ID="Microsoft Search Thesaurus">
<!--
Commented out
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
-->
</XML>
The first thing to do when using the file is to remove the comment
wrapper: the opening <!-- together with the "Commented out" line that
follows it, and the closing -->. This makes the entries active. The
first tag to consider is the diacritics sensitivity tag. By default it
has a value of 0, which is the setting for off. To enable it, set it to
1. Diacritical marks are marks that some languages use to modify the
sounds of certain letters, such as accents or umlauts. Many European
languages will benefit from diacritic sensitivity. English is not one of
them.
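When several language files need the same treatment, the uncommenting step can be scripted. The following Python sketch is only an illustration: the function name activate_thesaurus is hypothetical, and it assumes the file still carries the default wrapper shown above, with the "Commented out" label directly after the opening comment tag.

```python
import re
import xml.etree.ElementTree as ET

# Matches the default wrapper: "<!--", the "Commented out" label,
# the wrapped body (captured), and the closing "-->".
_WRAPPER = re.compile(r"<!--\s*Commented out\s*(.*?)-->", re.DOTALL)


def activate_thesaurus(xml_text):
    """Return xml_text with the default comment wrapper removed.

    Text that does not contain the wrapper is returned unchanged.
    """
    return _WRAPPER.sub(lambda match: match.group(1), xml_text, count=1)


DEFAULT_FILE = """<XML ID="Microsoft Search Thesaurus">
<!--
Commented out
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>run</sub>
<sub>jog</sub>
</expansion>
</thesaurus>
-->
</XML>
"""

active = activate_thesaurus(DEFAULT_FILE)
# The result should still be well-formed XML, now with a live <thesaurus>.
root = ET.fromstring(active)
print([child.tag for child in root])
```

Parsing the result, as the last lines do, is a cheap way to catch a file that was left malformed by hand editing before it is copied to the query servers.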
The remaining sections are either expansion or replacement. Expansion is
used to add synonyms to a given term. If the term in a <sub></sub> tag
pair is searched for, the terms in the other sub tag pairs of the same
set will also be searched for. Any term in an expansion set will trigger
the other terms in the set. Each set of synonyms must be enclosed in its
own expansion tag pair, with each term in its own sub tag pair.
<expansion>
    <sub>windmill</sub>
    <sub>wind turbine</sub>
    <sub>generator</sub>
</expansion>
The replacement section is used to substitute one term for another. This
is useful when dealing with spelling mistakes or synonyms that are not
used in the content. A query term that matches a <pat> entry is not
actually searched for, but the replacement terms in the <sub> entries
are.
<replacement>
    <pat>scarepoint</pat>
    <sub>sharepoint</sub>
    <sub>SP2010</sub>
</replacement>
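To make the difference between the two section types concrete, here is a small Python model of the behavior just described. It is a sketch, not SharePoint's implementation: the function terms_searched is hypothetical, it matches only whole queries, and it assumes matching is case-insensitive.

```python
import xml.etree.ElementTree as ET

NS = "{x-schema:tsSchema.xml}"  # namespace declared on <thesaurus>

THESAURUS = """<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>windmill</sub>
<sub>wind turbine</sub>
<sub>generator</sub>
</expansion>
<replacement>
<pat>scarepoint</pat>
<sub>sharepoint</sub>
<sub>SP2010</sub>
</replacement>
</thesaurus>
</XML>"""


def terms_searched(query, xml_text=THESAURUS):
    """Return the terms that would be searched for a one-term query."""
    root = ET.fromstring(xml_text)
    wanted = query.casefold()

    # Replacement: a matching <pat> is dropped, only the <sub> terms remain.
    for rule in root.iter(NS + "replacement"):
        patterns = [p.text.casefold() for p in rule.findall(NS + "pat")]
        if wanted in patterns:
            return [s.text for s in rule.findall(NS + "sub")]

    # Expansion: any <sub> in the set triggers every term in the set.
    for rule in root.iter(NS + "expansion"):
        subs = [s.text for s in rule.findall(NS + "sub")]
        if wanted in (s.casefold() for s in subs):
            return subs

    # No thesaurus entry applies: the query is searched as typed.
    return [query]


print(terms_searched("windmill"))
print(terms_searched("scarepoint"))
```

Note that "windmill" stays in its own result list, while "scarepoint" disappears from its list. That is the whole distinction between expansion and replacement.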
As is shown in these
examples, there can be a one-to-many relationship with both expansion
and replacement sections. Microsoft does not recommend more than 10,000
entries in a single thesaurus file. Each entry where a term is defined
(<sub> or <pat>) is considered one entry.
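A quick way to see how close a file is to that recommendation is to count its <sub> and <pat> elements. The following Python sketch does so with the standard library; the function names are hypothetical, and entries that are still commented out are not counted because the XML parser discards comments.

```python
import xml.etree.ElementTree as ET

RECOMMENDED_MAX_ENTRIES = 10_000  # figure quoted in the text above


def count_entries(xml_text):
    """Count <sub> and <pat> elements, ignoring any XML namespace prefix."""
    root = ET.fromstring(xml_text)
    return sum(
        1
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] in ("sub", "pat")
    )


def within_recommendation(xml_text):
    """True if the file stays at or below the recommended entry count."""
    return count_entries(xml_text) <= RECOMMENDED_MAX_ENTRIES
```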
Save the thesaurus files as Unicode. If you are editing them in Notepad,
select Unicode as the encoding in the Save As dialog. Other text editors
may require special care. After updating a thesaurus file, the search
service needs to be restarted before changes will take effect. This can
be accomplished by opening the Services snap-in and restarting
SharePoint Server Search 14.
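If the files are generated or edited by a script rather than in Notepad, the encoding has to be set explicitly. The Python sketch below assumes that "Unicode" here means what Notepad writes under that name, that is, UTF-16 little-endian with a byte order mark; the function names are hypothetical.

```python
import codecs


def encode_as_unicode(text):
    """Encode text as UTF-16 LE with a BOM (assumed meaning of "Unicode")."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def is_unicode_file(data):
    """True if the raw bytes start with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

The bytes returned by encode_as_unicode can be written with a file opened in binary mode ("wb"), and is_unicode_file can be run over the copies on each query server to spot a file that another editor saved in a different encoding.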
5. Custom Dictionaries
Custom dictionaries are lists of terms that the search engine should
match exactly and pass on as a query without breaking them apart. These
dictionaries supersede the built-in word breakers in SharePoint.
Word breakers are a hidden part
of the index and query processes of SharePoint search that manage how
terms are handled by the query process. They are small programs or
routines that break complex terms into shorter, more understandable
terms. As we have seen in the "Stop Words"
section, not all words are interesting to search, and the most common
ones can be disregarded safely. Similarly, there are many characters
that do not conform to the standard conception of what makes up a word.
Special characters, such as ampersands (&), dollar signs ($), asterisks
(*), the "at" character (@), and hyphens (-), among many others, are
very common in digital information. Many organizations rely on
combinations of these characters with letters and numbers to identify
documents or products. The "at" character is seen in every e-mail
address.
Usually, when put in context,
many of these characters can be seen as word separators and hold little
contextual value. For this reason, word breakers are employed to break
these terms into smaller terms that are more likely to be searched for
and make sense. For example, the phrase search-driven application
contains a hyphen, linking "search" and "driven". It's common to combine
words like this, but I might search for "search driven application" and
expect to get results. If the search engine keeps "search-driven" as a
single term, I won't find the document with the hyphenated version.
Therefore, a word breaker is employed to break apart the term and allow
for both variations to be searched.
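The effect can be mimicked in a few lines of Python. This is a deliberately naive stand-in for a word breaker, not the routine SharePoint uses: the hypothetical break_words function simply splits at anything that is not a letter or a digit and lowercases the result.

```python
import re


def break_words(text):
    """Split text into lowercase terms at every non-alphanumeric character."""
    return [term.lower() for term in re.split(r"[\W_]+", text) if term]


hyphenated = break_words("search-driven application")
spaced = break_words("search driven application")
print(hyphenated == spaced)
```

Because both spellings break down to the same terms, a query typed one way can match a document written the other way.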
This doesn't always make sense. Say, for example, an oil drilling
company has a pipeline with many valves, and each valve has a unique ID
made up of letters, numbers, and hyphens (e.g., VLV-123-456). If a valve
is turned off without checking a document to see what the consequences
will be, the whole pipeline could be shut down, or worse, a catastrophic
failure could be caused. If the word breaker is allowed to break apart
the ID, all documents containing vlv, 123, and 456 would be returned.
That may be a large number of documents to sift through (mind you,
documents containing "vlv 123 456" as a phrase should be returned first;
the problem grows when partial terms are searched and wildcards are
used). So having the ID treated as a single term and not broken into its
parts can be valuable. This is where custom dictionaries come into play.
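The following Python sketch shows the idea of a dictionary taking precedence over word breaking. It is an illustration under simplifying assumptions, not SharePoint's algorithm: the tokenize function is hypothetical, it only recognizes dictionary entries that stand alone between spaces, and it compares them without regard to case.

```python
import re


def tokenize(text, custom_dictionary=()):
    """Break text into terms, keeping custom dictionary entries whole."""
    protected = {entry.casefold() for entry in custom_dictionary}
    tokens = []
    for chunk in text.split():
        if chunk.casefold() in protected:
            # Dictionary hit: keep the chunk exactly as written.
            tokens.append(chunk)
        else:
            # No hit: fall back to naive breaking at non-alphanumerics.
            tokens.extend(t for t in re.split(r"[\W_]+", chunk) if t)
    return tokens


print(tokenize("Close valve VLV-123-456"))
print(tokenize("Close valve VLV-123-456", ["vlv-123-456"]))
```

Without the dictionary the valve ID falls apart into three generic terms; with it, the ID survives as one term that can be matched exactly.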
Here are the rules that must be observed when creating custom dictionaries:
Each supported language has its own custom dictionary.
Custom dictionaries (like stop word files and thesaurus files) should be saved in Unicode.
Custom dictionaries have the file type .lex and are named CustomXXXX, where XXXX is the four-digit hexadecimal language code.
Entries in the custom dictionaries are not case-sensitive.
The pipe character (|) is not accepted.
Entries cannot contain blank spaces (white space).
The pound character or number sign (#) cannot be used at the beginning of an entry, but it can be used within it or at the end, e.g., #Test is not acceptable but T#st and Test# are OK.
Aside from the foregoing exceptions, any other character is acceptable.
The maximum length of a single entry is 128 (Unicode) characters.
There must be a copy of the custom dictionary files on each query server.
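The per-entry rules above lend themselves to a simple pre-flight check before a dictionary is deployed. The Python sketch below covers only the rules that apply to a single entry; the function name entry_problems is hypothetical, and counting characters with len() is an assumption about how the 128-character limit is measured.

```python
MAX_ENTRY_LENGTH = 128  # maximum entry length stated in the rules above


def entry_problems(entry):
    """Return a list of rule violations for one custom dictionary entry."""
    problems = []
    if "|" in entry:
        problems.append("contains the pipe character")
    if any(ch.isspace() for ch in entry):
        problems.append("contains white space")
    if entry.startswith("#"):
        problems.append("starts with #")
    if len(entry) > MAX_ENTRY_LENGTH:
        problems.append("longer than 128 characters")
    return problems


for candidate in ["VLV-123-456", "#Test", "T#st", "Test#", "wind turbine"]:
    print(candidate, entry_problems(candidate))
```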
Here are the steps for creating a custom dictionary:
Create a new text file in a text editor (like Notepad).
Add your terms, taking into consideration the foregoing limitations and rules.
Save the file with the appropriate file name (e.g., Custom0009.lex) in the %ProgramFiles%\Microsoft Office Servers\14.0\Bin folder.
Restart the search service by running services.msc from the Start menu and restarting the SharePoint Server Search 14 service.
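The first three steps can be prepared in a script when the term list comes from another system, such as an asset register. The Python sketch below only builds the file name and the file content; both function names are hypothetical, and it assumes one entry per line with Windows line endings, which the steps above do not spell out.

```python
def dictionary_file_name(hex_code):
    """Build the CustomXXXX.lex name from a four-digit hexadecimal code."""
    code = hex_code.lower()
    if len(code) != 4 or any(c not in "0123456789abcdef" for c in code):
        raise ValueError("expected a four-digit hexadecimal language code")
    return "Custom" + code + ".lex"


def dictionary_text(entries):
    """Join entries one per line (assumed layout), dropping repeats.

    Entries are not case-sensitive, so repeats are detected without
    regard to case and the first spelling wins.
    """
    seen = set()
    lines = []
    for entry in entries:
        key = entry.casefold()
        if key not in seen:
            seen.add(key)
            lines.append(entry)
    return "\r\n".join(lines) + "\r\n"


print(dictionary_file_name("0009"))
print(repr(dictionary_text(["VLV-123-456", "vlv-123-456", "AT&T"])))
```

The resulting text still has to be saved in Unicode under the generated name and copied to every query server, as the rules require.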
Table 2. Supported Languages for Custom Dictionaries and Their Language Codes*
Language / dialect | LCID | Language hexadecimal code | Language / dialect | LCID | Language hexadecimal code |
---|---|---|---|---|---|
Arabic | 1025 | 0001 | Malay | 1086 | 003e |
Bengali | 1093 | 0045 | Malayalam | 1100 | 004c |
Bulgarian | 1026 | 0002 | Marathi | 1102 | 004e |
Catalan | 1027 | 0003 | Norwegian_Bokmaal | 1044 | 0414 |
Croatian | 1050 | 001a | Portuguese | 2070 | 0816 |
Danish | 1030 | 0006 | Portuguese_Braz | 1046 | 0416 |
Dutch | 1043 | 0013 | Punjabi | 1094 | 0046 |
English | 1033 | 0009 | Romanian | 1048 | 0018 |
French | 1036 | 000c | Russian | 1049 | 0019 |
German | 1031 | 0007 | Serbian_Cyrillic | 3098 | 0c1a |
Gujarati | 1095 | 0047 | Serbian_Latin | 2074 | 081a |
Hebrew | 1037 | 000d | Slovak | 1051 | 001b |
Hindi | 1081 | 0039 | Slovenian | 1060 | 0024 |
Icelandic | 1039 | 000f | Spanish | 3082 | 000a |
Indonesian | 1057 | 0021 | Swedish | 1053 | 001d |
Italian | 1040 | 0010 | Tamil | 1097 | 0049 |
Japanese | 1041 | 0011 | Telugu | 1098 | 004a |
Kannada | 1099 | 004b | Ukrainian | 1058 | 0022 |
Latvian | 1062 | 0026 | Urdu | 1056 | 0020 |
Lithuanian | 1063 | 0027 | Vietnamese | 1066 | 002a |
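The relationship between the LCID and the hexadecimal code in Table 2 follows a pattern that can be expressed in code. In the Python sketch below, the rule is an observation drawn from the table rather than a documented formula: most codes equal the low ten bits of the LCID, while five dialects use the full LCID. The function name dictionary_code is hypothetical.

```python
# LCIDs whose code in Table 2 is the full LCID in hexadecimal:
# Norwegian_Bokmaal, Portuguese, Portuguese_Braz, Serbian_Cyrillic, Serbian_Latin.
FULL_LCID_CODES = {1044, 2070, 1046, 3098, 2074}


def dictionary_code(lcid):
    """Return the four-digit hexadecimal code Table 2 lists for an LCID."""
    value = lcid if lcid in FULL_LCID_CODES else lcid & 0x3FF
    return format(value, "04x")


print("Custom" + dictionary_code(1033) + ".lex")
```

For any language not listed in Table 2 the result of this function is a guess, so the table remains the reference.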