Language support

Lunr includes optional, experimental support for languages other than English via the Natural Language Toolkit (NLTK). To install Lunr with this feature, run pip install lunr[languages].

The currently supported languages are:

  • Arabic
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
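
The languages parameter used throughout this section takes ISO 639-1 codes rather than language names. As a rough aid, the sketch below maps each language in the list above to its two-letter code. The names SUPPORTED_CODES and unsupported are made up for this illustration and are not part of Lunr's API.

```python
# Hypothetical lookup table: ISO 639-1 code -> language name, built from the
# list above. This is an illustration, not something exported by Lunr.
SUPPORTED_CODES = {
    "ar": "Arabic",
    "da": "Danish",
    "nl": "Dutch",
    "en": "English",
    "fi": "Finnish",
    "fr": "French",
    "de": "German",
    "hu": "Hungarian",
    "it": "Italian",
    "no": "Norwegian",
    "pt": "Portuguese",
    "ro": "Romanian",
    "ru": "Russian",
    "es": "Spanish",
    "sv": "Swedish",
}


def unsupported(codes):
    """Return the codes missing from the table; accepts a string or a list."""
    if isinstance(codes, str):
        codes = [codes]
    return [code for code in codes if code not in SUPPORTED_CODES]
```

Checking codes up front, for instance with unsupported(['es', 'ja']), makes it easier to spot a language that is not in the list before building an index.
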

The examples in this section index the following documents:

>>> documents = [
...   {
...     "id": "a",
...     "text": (
...         "Este es un ejemplo inventado de lo que sería un documento en el "
...         "idioma que más se habla en España."),
...     "title": "Ejemplo de documento en español"
...   },
...   {
...     "id": "b",
...     "text": (
...         "Según un estudio que me acabo de inventar porque soy un experto en "
...         "idiomas que se hablan en España."),
...     "title": "Español es el tercer idioma más hablado del mundo"
...   },
... ]

New in 0.5.1: the lunr function now accepts more than one language

Specify one or more ISO 639-1 codes for the language(s) of your documents in the languages parameter of the lunr function.

In versions of Lunr prior to 0.5.0 the parameter was named language and accepted a single string.
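
If the same code has to run against both older and newer releases, the rename can be hidden behind a small helper. The following is only a sketch: language_kwargs is a hypothetical name, the version is passed in as a tuple by the caller, and the thresholds simply follow the two notes above. How 0.5.0 itself behaves is not stated, so treating it like later releases is an assumption.

```python
def language_kwargs(lunr_version, codes):
    """Build the language keyword argument for a given Lunr version.

    lunr_version is a tuple such as (0, 4, 3); codes is a string or a list
    of ISO 639-1 codes. Hypothetical helper, not part of Lunr.
    """
    if isinstance(codes, str):
        codes = [codes]
    if lunr_version < (0, 5, 0):
        # Older releases: the parameter is `language` and takes one string.
        if len(codes) != 1:
            raise ValueError("versions before 0.5.0 accept a single language")
        return {"language": codes[0]}
    # Assumption: 0.5.0 and later all use `languages` (string or list).
    return {"languages": codes[0] if len(codes) == 1 else codes}
```

The result is meant to be unpacked into the call, for example lunr('id', ['title', 'text'], documents, **language_kwargs((0, 5, 1), ['es', 'en'])).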

If you have a single language you can pass the language code in languages:

>>> from lunr import lunr
>>> idx = lunr('id', ['title', 'text'], documents, languages='es')
>>> idx.search('inventando')
[{'ref': 'a', 'score': 0.130, 'match_data': <MatchData "invent">},
{'ref': 'b', 'score': 0.089, 'match_data': <MatchData "invent">}]


To construct stemmers, trimmers and stop word filters, Lunr imports corpus data from NLTK, which fetches the data from GitHub and caches it in your home directory under nltk_data by default. You may see some log output from this activity while the index is being created.
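
To see what has been cached, you can list the contents of that directory. The sketch below uses only the standard library and assumes the default ~/nltk_data location mentioned above; cached_nltk_entries is a hypothetical helper name, and the base argument exists only so that a different location can be inspected.

```python
from pathlib import Path


def cached_nltk_entries(base=None):
    """List what is stored under the NLTK cache directory, if it exists.

    base defaults to ~/nltk_data, the default location mentioned above;
    pass another path if your NLTK setup stores data elsewhere.
    """
    root = Path(base) if base is not None else Path.home() / "nltk_data"
    if not root.is_dir():
        return []
    # Report entries two levels deep, relative to the cache root.
    return sorted(path.relative_to(root).as_posix() for path in root.glob("*/*"))
```

An empty list means nothing has been downloaded to that location yet, which is expected before the first index with a languages argument is built.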

If you have documents in multiple languages, pass a list of language codes:

>>> documents.append({
...     "id": "c",
...     "text": "Let's say you also have documents written in English",
...     "title": "A document in English"
... })
>>> idx = lunr('id', ['title', 'text'], documents, languages=['es', 'en'])
>>> idx.search('english')
[{'ref': 'c', 'score': 1.106, 'match_data': <MatchData "english">}]

Notes on language support

  • Using multiple languages means the terms will be stemmed once per language. This can yield unexpected results.
  • Compatibility with Lunr.js is ensured for languages that are supported by both platforms; however, results might differ slightly.
    • Languages supported by Lunr.js but not by Lunr.py:
      • Thai
      • Japanese
      • Turkish
    • Languages supported by Lunr.py but not by Lunr.js:
      • Arabic
  • Use of the language feature is subject to the NLTK corpus licensing clauses.
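
The first note, about terms being stemmed once per language, can be illustrated with a toy model. The sketch below does not use Lunr or NLTK: strip_suffixes, stem_es, stem_en and stem_all are hypothetical stand-ins with made-up suffix rules, far simpler than real stemmers. It only shows how running a second stemmer over an already stemmed term can shorten it further.

```python
def strip_suffixes(suffixes):
    """Build a toy stemmer that removes the first matching suffix."""
    def stem(token):
        for suffix in suffixes:
            # Keep at least three characters so short words survive intact.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token
    return stem


# Toy rules only; real stemmers are far more elaborate.
stem_es = strip_suffixes(["ando", "ados", "os", "es"])
stem_en = strip_suffixes(["ing", "ed", "es", "s"])


def stem_all(token, stemmers):
    """Apply each stemmer in turn, once per configured language."""
    for stem in stemmers:
        token = stem(token)
    return token


print(stem_en("houses"))                           # hous
print(stem_all("houses", [stem_es, stem_en]))      # hou
print(stem_all("inventando", [stem_es, stem_en]))  # invent
```

In this toy model the English word loses an extra letter once the Spanish rules run first, while the Spanish word is unaffected by the English rules. Over-stemming of this kind is one way the unexpected results mentioned in the note can arise.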