Optimizing Apachesolr for non-English languages

Apachesolr is an awesome search engine. I particularly like the faceted search that comes with Apachesolr on Drupal.

But the current module has one drawback: it is optimized for the English language. All the text analysis that is done when building the search index or when performing a search query is based on English.

Stemming and text analysis

One important step in text analysis is stemming. Stemming reduces word variants to one basic form. For example, a stemmer would reduce the words "fishing", "fished", "fish", and "fisher" to the root word "fish". Another example of stemming is the reduction from plural to singular, such as "houses" becoming "house".

When the search index is built, all documents that should be searchable are sent to the text analyzer. The text analyzer breaks the documents up into a stream of word tokens. In the stemming step, these word tokens are reduced to their root forms. If a document contained, for example, the words "fishing", "fisher" and "fish", only the word "fish" is left, but note that it now occurs three times! Finally, the resulting list of tokens is stored in the index together with the number of occurrences. In the example above, we would store the word "fish" together with the number of occurrences (3) and the relation to the source document X.

There are many algorithms for stemming (see http://en.wikipedia.org/wiki/Stemming ). But it is obvious that the stemming rules depend highly on the language. A German stemmer would reduce

Häuser -> Haus

An English stemmer reduces plural forms by removing the final "s", like

houses -> house

Stemming in Apachesolr

For Apachesolr, the text analysis pipeline is defined in the config file schema.xml. This section is responsible for the index and query text analysis:

<analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
        mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
        dictionary="my_dictionary.txt" />
    <filter class="solr.SnowballPorterFilterFactory" 
        language="English"
        protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
        mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" 
        language="English"
        protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>

 

As you can see, the first block, <analyzer type="index">, defines the text filter chain of the index analyzer, whereas the second block, <analyzer type="query">, defines the text filter chain of the query analyzer. Please note: in every analyzer chain, the character filter is followed by a tokenizer, then by filter 1, filter 2, etc. The stream of tokens is generated by the tokenizer and then passed from filter to filter, being transformed at every step.
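For orientation, both analyzers normally sit inside a single field type definition in schema.xml. The fragment below is only a minimal sketch of that layout. The field type name "text_example" and the positionIncrementGap value are assumptions made for illustration, and the chains are shortened to a tokenizer plus one filter, so compare it with your own schema.xml rather than copying it.

```xml
<!-- Sketch only: "text_example" and positionIncrementGap="100" are assumed values -->
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
    <!-- Applied to documents when they are added to the index -->
    <analyzer type="index">
        <!-- The tokenizer splits the text into a stream of tokens -->
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <!-- Filters then transform the token stream in the order listed -->
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
    <!-- Applied to the search terms when a query is executed -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>
```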

The stemming language is defined in the solr.SnowballPorterFilterFactory filter, which appears in both chains. Its language="English" attribute tells Apachesolr to use the English stemmer. According to the excellent Solr Wiki, you can easily switch to other languages like German, Dutch, Danish, French, Russian, etc. The complete list of available languages can be found in the SnowballPorterFilter documentation.
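As a sketch of such a switch, the stemmer filter for a German-language site could look like the fragment below. It assumes that "German" is the language name given in the SnowballPorterFilter documentation and that protwords.txt holds the words that should not be stemmed. The same change would be made in both the index and the query analyzer, so that documents and queries are stemmed by the same rules.

```xml
<!-- Replaces the English stemmer in BOTH <analyzer type="index"> and <analyzer type="query"> -->
<filter class="solr.SnowballPorterFilterFactory"
    language="German"
    protected="protwords.txt" />
```

Content that was indexed before the change still carries tokens stemmed by the English rules, so it has to be re-indexed before the new stemmer takes effect. The entries in stopwords.txt and protwords.txt are probably language-specific as well and worth reviewing.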

If you are running Apachesolr for a language other than English, this step will improve the quality of search results dramatically.

Comments

Harshal Choksi

Hi, we have the Apache Solr search engine for our Sitecore product, but our Solr search engine does not support multiple languages. How can we enable the search engine for multilingual search, and what do we need to do? We can see that the index has been built successfully for the other, non-English languages.