
SharePoint 2010 Search : Tuning Search (part 2) - The Thesaurus & Custom Dictionaries

8/2/2011 3:08:11 PM

4. The Thesaurus

SharePoint can match query terms with potential synonyms and return documents based on those synonyms. For example, a user may be looking for the project plan for the windmill his or her company is consulting on without realizing that the documents actually refer to a wind turbine (windmills mill grain, wind turbines produce electricity). Searching for "windmill" will return no hits, causing frustration and probably a call to the engineers followed by some laughter and derision (engineers can be callous). To avoid this, a synonym match between windmill and wind turbine can be entered in the thesaurus if the mistake proves common enough to warrant it.

The thesaurus files are installed into the same folder as the stop word files, C:\Program Files\Microsoft Office Servers\14.0\Data\Office Server\Config, and similarly there is one file for each supported language as well as a language-neutral file. These are a pristine master set of the thesaurus and stop word files. When a Search service application is created in the SharePoint farm, SharePoint copies a set to all query servers in the location C:\Program Files\Microsoft Office Servers\14.0\Data\Applications\GUID\Config. If the master files are edited, the edited files will be staged out whenever a new Search service application is created. They will not, however, be copied over to existing Search service applications, whose thesaurus files must be edited individually.
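Because each existing Search service application keeps its own copy of the files, an edited thesaurus file has to be placed into every GUID\Config folder by hand or by script. The following is a minimal sketch of such a script, not a SharePoint tool: the function name stage_thesaurus and its parameters are hypothetical, and it assumes it is run on each query server (or pointed at a share that exposes the Applications folder). The demonstration at the end uses a temporary folder in place of the real installation path.

```python
import shutil
import tempfile
from pathlib import Path


def stage_thesaurus(edited_file, applications_root):
    """Copy edited_file into every <applications_root>/<GUID>/Config folder.

    Returns the list of destination paths that were written.
    """
    edited = Path(edited_file)
    copied = []
    for config_dir in sorted(Path(applications_root).glob("*/Config")):
        if config_dir.is_dir():
            destination = config_dir / edited.name
            shutil.copy2(edited, destination)
            copied.append(destination)
    return copied


# Demonstration against a throwaway folder layout (hypothetical GUID names).
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    for guid in ("guid-one", "guid-two"):
        (root / "Applications" / guid / "Config").mkdir(parents=True)
    source = root / "tsneu.xml"
    source.write_text("<XML/>", encoding="utf-8")
    staged = stage_thesaurus(source, root / "Applications")
    print(len(staged))  # 2
```

On a real server the second argument would be the Applications folder named above; backing up the existing files before overwriting them is a sensible precaution.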

By default, the thesaurus files contain no active entries; the examples within are commented out. The administrator must therefore edit the files and copy them out to each query server before they have any effect. A file is consulted only if the search is initiated in the language context for that file. The language files are shown in Table 1.

Table 1. The Thesaurus Files in SharePoint*
Language                  File name  Language                 File name  Language               File name  Language         File name
Language-neutral          tsneu.xml  English (United States)  tsenu.xml  Lithuanian             tslit.xml  Serbian (Latin)  tssbl.xml
Arabic                    tsara.xml  Finnish                  tsfin.xml  Malay (Malaysian)      tsmal.xml  Slovak           tssvk.xml
Bengali                   tsben.xml  French (Standard)        tsfra.xml  Malayalam              tsmly.xml  Slovenian        tsslo.xml
Bulgarian                 tsbul.xml  German (Standard)        tsdeu.xml  Marathi                tsmar.xml  Spanish          tsesn.xml
Catalan                   tscat.xml  Gujarati                 tsguj.xml  Norwegian (Bokmal)     tsnor.xml  Swedish          tssve.xml
Chinese (Simplified)      tschs.xml  Hungarian                tshun.xml  Polish                 tsplk.xml  Tamil            tstam.xml
Chinese (Traditional)     tscht.xml  Icelandic                tsice.xml  Portuguese (Brazil)    tsptb.xml  Telugu           tstel.xml
Croatian                  tscro.xml  Indonesian               tsind.xml  Portuguese (Portugal)  tspor.xml  Thai             tstha.xml
Czech                     tsces.xml  Italian                  tsita.xml  Punjabi                tspun.xml  Turkish          tstur.xml
Danish                    tsdan.xml  Japanese                 tsjpn.xml  Romanian               tsrom.xml  Ukrainian        tsukr.xml
Dutch (Netherlands)       tsnld.xml  Kannada                  tskan.xml  Russian                tsrus.xml  Urdu (Pakistan)  tsurd.xml
English (United Kingdom)  tseng.xml  Korean                   tskor.xml  Serbian (Cyrillic)     tssbc.xml  Vietnamese       tsvie.xml
*Source: http://technet.microsoft.com/en-us/library/dd361734.aspx

The file tsneu.xml is the language-neutral file, which is used by default if no specific language value is passed. On installation it has the following structure and values:

<XML ID="Microsoft Search Thesaurus">
<!--
Commented out

<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>Internet Explorer</sub>
<sub>IE</sub>
<sub>IE5</sub>
</expansion>
<replacement>
<pat>NT5</pat>
<pat>W2K</pat>
<sub>Windows 2000</sub>
</replacement>
<expansion>
<sub>run</sub>
<sub>jog</sub>
</expansion>
</thesaurus>
-->
</XML>


The first thing to do when using the file is to remove the comment tags <!-- and -->, along with the "Commented out" text that follows the opening tag. This makes the entries active. The first tag to consider is the diacritics sensitivity tag. By default it has a value of 0, which means off. To enable it, set it to 1. Diacritical marks are marks, such as accents or umlauts, that some languages use to modify the sounds of certain letters. Many European languages benefit from diacritical sensitivity. English is not one of them.
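To make the effect of the diacritics setting concrete, here is a small Python illustration of what sensitive and insensitive comparison mean. It is only a model of the idea, not SharePoint's matching code: the function names are hypothetical, and folding case during comparison is an assumption added for the example.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that 'é' compares equal to 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def terms_match(a, b, diacritics_sensitive):
    """Compare two terms, ignoring diacritics when sensitivity is off."""
    if not diacritics_sensitive:
        a, b = strip_diacritics(a), strip_diacritics(b)
    return a.casefold() == b.casefold()  # case folding is an assumption


print(terms_match("résumé", "resume", diacritics_sensitive=False))  # True
print(terms_match("résumé", "resume", diacritics_sensitive=True))   # False
```

With the setting at 0, a thesaurus term written without accents would be treated like its accented form; with 1, the two are distinct terms.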

The remaining sections are either expansion or replacement sections. Expansion adds synonyms to a given term: if any term in a <sub></sub> tag pair is searched for, the other terms in the set's sub tag pairs are searched for as well. Each set of synonyms must be enclosed in its own expansion tag pair, with each term in its own sub tag pair.

<expansion>
<sub>windmill</sub>
<sub>wind turbine</sub>
<sub>generator</sub>
</expansion>

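The replacement section substitutes one or more terms for another term. The term to be replaced goes in a <pat></pat> tag pair, and each substitute goes in a <sub></sub> tag pair. This is useful for handling spelling mistakes or synonyms that are not used in the content. The query term itself is not searched for; only the replacement terms are.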

<replacement>
<pat>scarepoint</pat>
<sub>sharepoint</sub>
<sub>SP2010</sub>
</replacement>
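The difference between the two section types can be modeled in a few lines of Python. This is a simplified sketch of the behavior described here, not SharePoint's query pipeline: rewrite_query is a hypothetical name, and both the case-insensitive comparison and the choice to check replacements before expansions are assumptions made for the illustration.

```python
def rewrite_query(term, expansions, replacements):
    """Return the list of terms that would be searched for a query term."""
    key = term.casefold()
    # Replacement: the matched pattern is dropped, only substitutes remain.
    for patterns, substitutes in replacements:
        if key in (p.casefold() for p in patterns):
            return list(substitutes)
    # Expansion: any member of a set triggers every member of that set.
    for group in expansions:
        if key in (g.casefold() for g in group):
            return list(group)
    return [term]


expansions = [["windmill", "wind turbine", "generator"]]
replacements = [(["scarepoint"], ["sharepoint", "SP2010"])]

print(rewrite_query("windmill", expansions, replacements))
# ['windmill', 'wind turbine', 'generator']
print(rewrite_query("scarepoint", expansions, replacements))
# ['sharepoint', 'SP2010']
print(rewrite_query("pipeline", expansions, replacements))
# ['pipeline']
```

Note that "windmill" stays in its own result list, while "scarepoint" disappears from its list.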

As is shown in these examples, there can be a one-to-many relationship in both expansion and replacement sections. Microsoft does not recommend more than 10,000 entries in a single thesaurus file. Each term defined in a <sub> or <pat> tag counts as one entry.
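For a large thesaurus it can help to count entries before deploying the file. The sketch below uses Python's standard xml.etree.ElementTree parser to count <sub> and <pat> elements; the function name count_entries and the sample text are made up for the example. Entries that are still inside the comment block are not counted, because the parser discards comments.

```python
import xml.etree.ElementTree as ET

MAX_RECOMMENDED_ENTRIES = 10_000  # recommendation quoted in the text

SAMPLE = """<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>windmill</sub>
<sub>wind turbine</sub>
<sub>generator</sub>
</expansion>
<replacement>
<pat>scarepoint</pat>
<sub>sharepoint</sub>
<sub>SP2010</sub>
</replacement>
</thesaurus>
</XML>"""


def count_entries(xml_text):
    """Count active <sub> and <pat> elements in thesaurus XML text."""
    root = ET.fromstring(xml_text)
    # Tags carry the namespace as '{x-schema:tsSchema.xml}sub'; keep the local name.
    names = (element.tag.rsplit("}", 1)[-1] for element in root.iter())
    return sum(1 for name in names if name in ("sub", "pat"))


total = count_entries(SAMPLE)
print(total)  # 6
print(total <= MAX_RECOMMENDED_ENTRIES)  # True
```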

Save the thesaurus files as Unicode. If you are editing them in Notepad, this is the default encoding. Other text editors may require special care. After updating a thesaurus file, the Search service application needs to be restarted before changes will take effect. This can be accomplished by opening the services snap-in and restarting SharePoint Server Search 14.
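If the thesaurus is generated by a script rather than typed in Notepad, the encoding has to be set explicitly. The sketch below builds a file in the same layout as the default tsneu.xml and writes it out; it assumes that "Unicode" here means what Notepad of that era calls Unicode, namely UTF-16 little-endian with a byte order mark. The helper names build_thesaurus and save_as_unicode are hypothetical, and the demonstration writes to a temporary folder rather than to the Config folder.

```python
import tempfile
from pathlib import Path
from xml.sax.saxutils import escape


def build_thesaurus(expansions, replacements, diacritics_sensitive=False):
    """Return thesaurus XML text in the layout of the default file."""
    lines = [
        '<XML ID="Microsoft Search Thesaurus">',
        '<thesaurus xmlns="x-schema:tsSchema.xml">',
        f"<diacritics_sensitive>{int(diacritics_sensitive)}</diacritics_sensitive>",
    ]
    for group in expansions:
        lines.append("<expansion>")
        lines += [f"<sub>{escape(term)}</sub>" for term in group]
        lines.append("</expansion>")
    for patterns, substitutes in replacements:
        lines.append("<replacement>")
        lines += [f"<pat>{escape(term)}</pat>" for term in patterns]
        lines += [f"<sub>{escape(term)}</sub>" for term in substitutes]
        lines.append("</replacement>")
    lines += ["</thesaurus>", "</XML>"]
    return "\r\n".join(lines) + "\r\n"


def save_as_unicode(path, text):
    """Write text as UTF-16 little-endian with a byte order mark (assumed encoding)."""
    Path(path).write_bytes(b"\xff\xfe" + text.encode("utf-16-le"))


xml_text = build_thesaurus(
    expansions=[["windmill", "wind turbine", "generator"]],
    replacements=[(["scarepoint"], ["sharepoint", "SP2010"])],
)
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "tsneu.xml"
    save_as_unicode(target, xml_text)
    print(target.read_bytes()[:2] == b"\xff\xfe")  # True
```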

5. Custom Dictionaries

Custom dictionaries are lists of words that the search engine may match exactly and pass as a query. These dictionaries supersede the built-in word breakers in SharePoint.

Word breakers are a hidden part of the index and query processes of SharePoint search that manage how terms are handled by the query process. They are small programs or routines that break complex terms into shorter, more understandable terms. As we have seen in the "Stop Words" section, not all words are interesting to search, and the most common ones can be disregarded safely. Similarly, there are many characters that do not conform to the standard conception of what makes up a word. Special characters, such as ampersands (&), dollar signs ($), asterisks (*), the "at" character (@), and hyphens (-), among many others, are very common in digital information. Many organizations rely on combinations of these characters with letters and numbers to identify documents or products. The "at" character is seen in every e-mail address.

Usually, when put in context, many of these characters can be seen as word separators and hold little contextual value. For this reason, word breakers are employed to break these terms into smaller terms that are more likely to be searched for and make sense. For example, the phrase "search-driven application" contains a hyphen, linking "search" and "driven". It's common to combine words like this, but I might search for "search driven application" and expect to get results. If the search engine keeps "search-driven" as a single term, I won't find the document with the hyphenated version. Therefore, a word breaker is employed to break apart the term and allow for both variations to be searched.

Breaking terms apart doesn't always make sense. Say, for example, an oil drilling company has a pipeline with many valves, and each valve has a unique ID made up of letters, numbers, and hyphens (e.g., VLV-123-456). If a valve is turned off without checking a document to see what the consequences will be, the whole pipeline could be shut down or, worse, a catastrophic failure could result. If the word breaker is allowed to break apart the ID, every document containing vlv, 123, and 456 will be returned. That may be a large number of documents to sift through (mind you, "vlv 123 456" as a phrase should be returned first; the problem gets worse when partial terms are searched and wildcards are used). So having the ID treated as a single term and not broken into its parts can be valuable. This is where custom dictionaries come into play.
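The valve example can be mimicked with a toy word breaker in Python. This is purely an illustration of the concept: SharePoint's word breakers are language-specific components whose real rules are not shown in this article, and break_words, its regular expression, and its case-insensitive lookup are all assumptions made for the sketch.

```python
import re


def break_words(text, custom_dictionary=()):
    """Split text into terms, keeping custom dictionary entries whole."""
    protected = {entry.casefold() for entry in custom_dictionary}
    terms = []
    for token in text.split():
        if token.casefold() in protected:
            terms.append(token)  # listed entry: do not break it apart
        else:
            # Treat every run of non-alphanumeric characters as a separator.
            terms.extend(part for part in re.split(r"[^0-9A-Za-z]+", token) if part)
    return terms


print(break_words("valve VLV-123-456"))
# ['valve', 'VLV', '123', '456']
print(break_words("valve VLV-123-456", ["vlv-123-456"]))
# ['valve', 'VLV-123-456']
```

Without the dictionary entry the ID dissolves into three common fragments; with it, the ID survives as one searchable term.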

Here are the rules that must be observed when creating custom dictionaries:

  • Each supported language has its own custom dictionary.

  • Custom dictionaries (like stop word files and thesaurus files) should be saved in Unicode.

  • Custom dictionaries have the file type .lex and are named CustomXXXX, where XXXX is the four-digit hexadecimal language code.

  • Entries in the custom dictionaries are not case-sensitive.

  • The pipe character (|) is not accepted.

  • Entries cannot contain blank spaces (white space).

  • The pound character or number sign (#) cannot be used at the beginning of an entry, but it can be used within it or at the end, e.g., #Test is not acceptable but T#st and Test# are OK.

  • Aside from the foregoing exceptions, any other character is acceptable.

  • The maximum length of a single entry is 128 (Unicode) characters.

  • There must be a copy of the custom dictionary files on each query server.
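The character and length rules in the list above are easy to check mechanically before a dictionary is deployed. The following sketch is one way to do that in Python; check_entry is a hypothetical helper, and it assumes that the 128-character limit can be measured with Python's len(), which counts code points.

```python
MAX_ENTRY_LENGTH = 128  # maximum length of a single entry, per the rules above


def check_entry(entry):
    """Return a list of rule violations for one custom dictionary entry."""
    problems = []
    if "|" in entry:
        problems.append("contains the pipe character")
    if any(ch.isspace() for ch in entry):
        problems.append("contains white space")
    if entry.startswith("#"):
        problems.append("starts with #")
    if len(entry) > MAX_ENTRY_LENGTH:
        problems.append("longer than 128 characters")
    return problems


for candidate in ("VLV-123-456", "#Test", "T#st", "Test#", "two words", "a|b"):
    print(candidate, check_entry(candidate))
```

Entries that return an empty list satisfy these rules; the language, encoding, and per-server requirements still have to be met separately.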

Here are the steps for creating a custom dictionary:

  1. Create a new text file in a text editor (like Notepad).

  2. Add your terms, taking into consideration the foregoing limitations and rules.

  3. Save the file with the appropriate file name (e.g., Custom0009.lex) in the %ProgramFiles%\Microsoft Office Servers\14.0\Bin folder.

  4. Restart the Search service application by running services.msc from the Start menu and restarting the SharePoint Server Search 14 service.
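Steps 1 through 3 can also be scripted. The sketch below is a hedged example rather than an official procedure: write_dictionary and dictionary_file_name are hypothetical helpers, one entry per line is an assumption about the file layout, and "Unicode" is assumed to mean UTF-16 little-endian with a byte order mark. The demonstration writes to a temporary folder; on a server the target would be the Bin folder named in step 3, and the service restart in step 4 is still required.

```python
import string
import tempfile
from pathlib import Path


def dictionary_file_name(hex_code):
    """Build the CustomXXXX.lex name from a four-digit hexadecimal language code."""
    if len(hex_code) != 4 or any(ch not in string.hexdigits for ch in hex_code):
        raise ValueError("language code must be four hexadecimal digits")
    return f"Custom{hex_code}.lex"


def write_dictionary(folder, hex_code, entries):
    """Write entries to the custom dictionary file and return its path."""
    path = Path(folder) / dictionary_file_name(hex_code)
    text = "".join(entry + "\r\n" for entry in entries)  # one entry per line (assumed)
    path.write_bytes(b"\xff\xfe" + text.encode("utf-16-le"))
    return path


with tempfile.TemporaryDirectory() as folder:
    created = write_dictionary(folder, "0009", ["VLV-123-456", "VLV-123-457"])
    print(created.name)  # Custom0009.lex
```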

Table 2. Supported Languages for Custom Dictionaries and Their Language Codes*
Language / dialect  LCID  Language hexadecimal code  Language / dialect  LCID  Language hexadecimal code
Arabic              1025  0001                       Malay               1086  003e
Bengali             1093  0045                       Malayalam           1100  004c
Bulgarian           1026  0002                       Marathi             1102  004e
Catalan             1027  0003                       Norwegian_Bokmaal   1044  0414
Croatian            1050  001a                       Portuguese          2070  0816
Danish              1030  0006                       Portuguese_Braz     1046  0416
Dutch               1043  0013                       Punjabi             1094  0046
English             1033  0009                       Romanian            1048  0018
French              1036  000c                       Russian             1049  0019
German              1031  0007                       Serbian_Cyrillic    3098  0c1a
Gujarati            1095  0047                       Serbian_Latin       2074  081a
Hebrew              1037  000d                       Slovak              1051  001b
Hindi               1081  0039                       Slovenian           1060  0024
Icelandic           1039  000f                       Spanish             3082  000a
Indonesian          1057  0021                       Swedish             1053  001d
Italian             1040  0010                       Tamil               1097  0049
Japanese            1041  0011                       Telugu              1098  004a
Kannada             1099  004b                       Ukrainian           1058  0022
Latvian             1062  0026                       Urdu                1056  0020
Lithuanian          1063  0027                       Vietnamese          1066  002a
*Source: http://technet.microsoft.com/en-us/library/cc263242.aspx
