Created: 2011-07-13 04:29
Updated: 2015-10-22 00:02
License: mit

This Python module takes text input and guesses what natural language it's in:

>>> from greek_to_me import make_pundit
>>> p = make_pundit('models')   # The models/ dir in this distro
>>> p.best_guess('hello world')
'en'   # English
>>> p.best_guess('hola mundo')
'es'   # Español

You can also build new models, and you can ask the pundit for more detail when you want a measure of confidence or need to make finer discriminations, e.g. to combine this textual evidence with an Accept-Language header.
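As a rough illustration of combining textual evidence with an Accept-Language header, here is a minimal sketch. It does not use this module's API: `text_scores` is a hypothetical dict of per-language probabilities that you would get from whatever detailed output the pundit provides, and `parse_accept_language`, `combine`, and the `floor` parameter are names invented for this example. The weighting scheme (multiply by the header's q-value, with a small floor for unlisted languages) is just one plausible choice.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into {primary language: q-value}."""
    weights = {}
    for item in header.split(','):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(';')
        q = 1.0  # a missing q parameter means q=1
        params = params.strip()
        if params.startswith('q='):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        # Keep only the primary subtag, so 'es-ES' counts as 'es'.
        lang = tag.strip().lower().split('-')[0]
        weights[lang] = max(weights.get(lang, 0.0), q)
    return weights


def combine(text_scores, header, floor=0.1):
    """Pick a language by weighting text scores with header preferences.

    text_scores: hypothetical {language: probability} from the text model.
    floor: weight given to languages the header does not mention (assumed).
    """
    prior = parse_accept_language(header)
    combined = {
        lang: score * max(prior.get(lang, 0.0), floor)
        for lang, score in text_scores.items()
    }
    return max(combined, key=combined.get)


header = 'es-ES,es;q=0.9,en;q=0.5'
# A near tie in the text evidence is settled by the header...
close_call = combine({'en': 0.55, 'es': 0.45}, header)
# ...but strong text evidence still wins.
clear_call = combine({'en': 0.99, 'es': 0.01}, header)
```

With these toy numbers the header tips the near tie toward Spanish, while strong textual evidence for English overrides it.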

See the code for docs. shows some sample usage.

The judgments use a character n-gram model of each language. Supplied with this module in models/ are some bigram models built from the Europarl and Leipzig parallel corpora, mostly for European languages. (In code not supplied here, I first used guess-language to screen out text in other languages like Mandarin. So why not use guess-language for the whole job? Because it works poorly on very short inputs like search queries; our approach needs less evidence to reach a reasonable judgment.)
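To make the idea of a character bigram model concrete, here is a toy sketch. It is not this module's implementation or model format: the function names, the add-one smoothing, and the `vocab_size` constant are all assumptions made for illustration, and the training text is a single sentence per language rather than a real corpus.

```python
import math
from collections import Counter


def bigrams(text):
    """Return overlapping character bigrams, padded so word edges count."""
    padded = ' ' + ' '.join(text.lower().split()) + ' '
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(sample):
    """Build a bigram count table from sample text."""
    return Counter(bigrams(sample))


def log_likelihood(model, text, vocab_size=10000):
    """Score text under a model using add-one smoothing.

    vocab_size is an assumed bound on the number of distinct bigrams.
    """
    total = sum(model.values())
    return sum(
        math.log((model[gram] + 1) / (total + vocab_size))
        for gram in bigrams(text)
    )


def best_guess(models, text):
    """Return the language whose model gives the text the highest score."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))


# Toy training text; real models would be built from large corpora.
toy_models = {
    'en': train('the quick brown fox jumps over the lazy dog '
                'and says hello to the world'),
    'es': train('el rapido zorro marron salta sobre el perro perezoso '
                'y dice hola al mundo'),
}

guess_en = best_guess(toy_models, 'hello world')
guess_es = best_guess(toy_models, 'hola mundo')
```

Even two short words yield around a dozen bigrams each, which is why this kind of model can reach a judgment on inputs as short as a search query.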

IIRC trigram models do noticeably better but take an order of magnitude more space; I didn't feel like checking 4MB into this repo.

See for a similar but more sophisticated package in Java.

