This is a refined and re-implemented version of the archived ElasticSearch plugin elasticsearch-langdetect, which itself builds upon the original work by Nakatani Shuyo, found at https://github.com/shuyo/language-detection. Nakatani Shuyo's implementation serves as the default language detection component in Apache Solr.
The library leverages an n-gram probabilistic model, using n-grams of sizes 1 to 3 (inclusive), alongside a Bayesian classifier (a Naive Bayes classification algorithm; see LanguageDetector#detectBlock(String)) that incorporates various normalization techniques and feature sampling methods.
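To make the classification step concrete, below is a minimal, illustrative sketch of Naive Bayes scoring over character n-grams. It deliberately omits the library's trial-based sampling and periodic renormalization, and all names in it are hypothetical; the authoritative logic lives in LanguageDetector#detectBlock(String).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveBayesSketch {

    // Hypothetical model: n-gram -> per-language probability, e.g. "the" -> {en: 0.012f}
    static final Map<String, Map<String, Float>> LANGUAGE_PROFILES = new HashMap<>();
    static final float ALPHA = 0.5f; // additive smoothing (cf. the alpha parameter later in this README)

    static Map<String, Float> score(final List<String> ngrams, final List<String> languages) {
        final Map<String, Float> probabilities = new HashMap<>();
        for (final String language : languages) {
            probabilities.put(language, 1.0f / languages.size()); // uniform prior
        }
        for (final String ngram : ngrams) {
            final Map<String, Float> byLanguage = LANGUAGE_PROFILES.getOrDefault(ngram, Map.of());
            for (final String language : languages) {
                // Smoothing keeps an unseen n-gram from zeroing out a language
                final float likelihood = byLanguage.getOrDefault(language, 0.0f) + ALPHA / 10_000f;
                probabilities.merge(language, likelihood, (acc, p) -> acc * p);
            }
        }
        return probabilities; // a real implementation renormalizes to avoid float underflow
    }
}
```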
The accuracy exceeds 99% across 79 languages, a diverse range that includes five (5) actively spoken Celtic languages and ten (10) languages of the African continent.
See the following PR description to read about the benchmarks done by @yanirs: jprante/elasticsearch-langdetect#69
The current version of the library introduces several enhancements compared to previous implementations, which may offer improvements in efficiency and performance under specific conditions.
For clarity, I'm linking these enhancements to the original implementation with examples (an illustrative sketch combining them follows the list):

- Eliminating unnecessary ArrayList resizing during n-gram extraction from the input string. In the current implementation, the ArrayList is pre-allocated based on the estimated number of n-grams, thereby reducing the overhead caused by element copying during resizing. See the original code here.
- Removing per-character normalization at runtime. In the current implementation, instead of normalizing characters during execution, all 65,535 Unicode BMP characters are pre-normalized into a char[] array, making runtime normalization a simple array lookup. See the original code here.
- Circular buffer optimization when extracting n-grams. This refined implementation replaces the original StringBuilder approach with a fixed-size circular buffer, which provides a deterministic memory footprint and significantly reduces the frequency of object allocations during processing, making it suitable for performance-sensitive applications or environments where garbage-collection pauses are undesirable. As a result, this can lead to more consistent throughput and lower latency, particularly under sustained load. While the buffer management logic adds implementation complexity, this is a standard trade-off for improved performance in resource-sensitive applications. See the original code here.
- Using float-level precision. Since Java's double-level precision is not necessary for this library, probabilities are stored and computed using the float type. This improves memory efficiency and may also yield a slight performance boost, although modern CPUs are so efficient at floating-point arithmetic that any speedup is likely small.
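The following sketch combines the three performance ideas above: a pre-sized result list, a table-driven normalization lookup, and a fixed-size circular buffer. It is illustrative only (the normalization rule is a placeholder), not the library's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    private static final int MIN_NGRAM = 1;
    private static final int MAX_NGRAM = 3;

    // Pre-computed normalization table covering the Unicode BMP:
    // runtime normalization becomes a single array lookup.
    private static final char[] NORMALIZED = new char[65536];
    static {
        for (int c = 0; c < NORMALIZED.length; c++) {
            NORMALIZED[c] = Character.isWhitespace((char) c) ? ' ' : (char) c; // placeholder rule
        }
    }

    static List<String> extractNGrams(final String text) {
        // Pre-allocate: at most (MAX_NGRAM - MIN_NGRAM + 1) n-grams per character,
        // avoiding repeated ArrayList grow-and-copy cycles.
        final List<String> ngrams = new ArrayList<>(text.length() * (MAX_NGRAM - MIN_NGRAM + 1));
        final char[] buffer = new char[MAX_NGRAM]; // fixed-size circular buffer
        int filled = 0;
        for (int i = 0; i < text.length(); i++) {
            buffer[i % MAX_NGRAM] = NORMALIZED[text.charAt(i)];
            filled = Math.min(filled + 1, MAX_NGRAM);
            for (int n = MIN_NGRAM; n <= filled; n++) {
                // Read the last n characters back out of the circular buffer;
                // unlike a shared StringBuilder, no per-character shifting is needed.
                final StringBuilder gram = new StringBuilder(n);
                for (int k = n - 1; k >= 0; k--) {
                    gram.append(buffer[(i - k) % MAX_NGRAM]);
                }
                ngrams.add(gram.toString());
            }
        }
        return ngrams;
    }
}
```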
For more information on how this library compares against other open-source language detectors, please see Language detection benchmarks against other libraries.
The following is a list of languages and their ISO 639-1 language codes supported by the library:
Language | ISO 639-1 | Language family | Country | Flag |
---|---|---|---|---|
Afrikaans | af | Indo-European / Germanic | South Africa | 🇿🇦 |
Albanian | sq | Indo-European / Albanoid | Albania | 🇦🇱 |
Amharic | am | Afro-Asiatic / Semitic | Ethiopia | 🇪🇹 |
Arabic | ar | Afro-Asiatic / Semitic | UAE | 🇦🇪 |
Armenian | hy | Indo-European / Armenian | Armenia | 🇦🇲 |
Azerbaijani | az | Turkic / Western Oghuz | Azerbaijan | 🇦🇿 |
Bangla | bn | Indo-European / Indo-Iranian | Bangladesh | 🇧🇩 |
Basque | eu | Isolate | Spain | 🇪🇸 |
Breton | br | Indo-European / Celtic | France | 🇫🇷 |
Bulgarian | bg | Indo-European / Balto-Slavic | Bulgaria | 🇧🇬 |
Catalan | ca | Indo-European / Italic | Spain | 🇪🇸 |
Chinese (China) | zh-cn | Sino-Tibetan / Sinitic | China | 🇨🇳 |
Chinese (Taiwan) | zh-tw | Sino-Tibetan / Sinitic | Taiwan | 🇹🇼 |
Cornish (Kernewek) | kw | Indo-European / Celtic | United Kingdom | 🇬🇧 |
Croatian | hr | Indo-European / Balto-Slavic | Croatia | 🇭🇷 |
Czech | cs | Indo-European / Balto-Slavic | Czech Republic | 🇨🇿 |
Danish | da | Indo-European / Germanic | Denmark | 🇩🇰 |
Dutch | nl | Indo-European / Germanic | Netherlands | 🇳🇱 |
English | en | Indo-European / Germanic | United States | 🇺🇸 |
Estonian | et | Uralic / Finnic | Estonia | 🇪🇪 |
Filipino | tl | Austronesian / Malayo-Polynesian | Philippines | 🇵🇭 |
Finnish | fi | Uralic / Finnic | Finland | 🇫🇮 |
French | fr | Indo-European / Italic | France | 🇫🇷 |
Georgian | ka | Kartvelian / Karto-Zan | Georgia | 🇬🇪 |
German | de | Indo-European / Germanic | Germany | 🇩🇪 |
Greek | el | Indo-European / Hellenic | Greece | 🇬🇷 |
Gujarati | gu | Indo-European / Indo-Iranian | India | 🇮🇳 |
Hausa | ha | Afro-Asiatic / Chadic | Nigeria | 🇳🇬 |
Hebrew | he | Afro-Asiatic / Semitic | Israel | 🇮🇱 |
Hindi | hi | Indo-European / Indo-Iranian | India | 🇮🇳 |
Hungarian | hu | Uralic / Ugric | Hungary | 🇭🇺 |
Indonesian | id | Austronesian / Malayo-Polynesian | Indonesia | 🇮🇩 |
Irish | ga | Indo-European / Celtic | Ireland | 🇮🇪 |
Italian | it | Indo-European / Italic | Italy | 🇮🇹 |
Japanese | ja | Japonic | Japan | 🇯🇵 |
Kannada | kn | Dravidian / Southern Dravidian | India | 🇮🇳 |
Kazakh | kk | Turkic / Common Turkic | Kazakhstan | 🇰🇿 |
Korean | ko | Koreanic | South Korea | 🇰🇷 |
Kyrgyz | ky | Turkic / Common Turkic | Kyrgyzstan | 🇰🇬 |
Latvian | lv | Indo-European / Balto-Slavic | Latvia | 🇱🇻 |
Lithuanian | lt | Indo-European / Balto-Slavic | Lithuania | 🇱🇹 |
Luxembourgish | lb | Indo-European / Germanic | Luxembourg | 🇱🇺 |
Macedonian | mk | Indo-European / Balto-Slavic | North Macedonia | 🇲🇰 |
Malayalam | ml | Dravidian / Southern Dravidian | India | 🇮🇳 |
Manx | gv | Indo-European / Celtic | Isle of Man | 🇮🇲 |
Marathi | mr | Indo-European / Indo-Iranian | India | 🇮🇳 |
Mongolian | mn | Mongolic / Central Mongolic | Mongolia | 🇲🇳 |
Nepali | ne | Indo-European / Indo-Iranian | Nepal | 🇳🇵 |
Norwegian | no | Indo-European / Germanic | Norway | 🇳🇴 |
Oromo | om | Afro-Asiatic / Cushitic | Kenya | 🇰🇪 |
Persian | fa | Indo-European / Indo-Iranian | Iran | 🇮🇷 |
Polish | pl | Indo-European / Balto-Slavic | Poland | 🇵🇱 |
Portuguese | pt | Indo-European / Italic | Portugal | 🇵🇹 |
Punjabi | pa | Indo-European / Indo-Iranian | India | 🇮🇳 |
Romanian | ro | Indo-European / Italic | Romania | 🇷🇴 |
Russian | ru | Indo-European / Balto-Slavic | Russia | 🇷🇺 |
Serbian | sr | Indo-European / Balto-Slavic | Serbia | 🇷🇸 |
Shona | sn | Niger–Congo / Atlantic–Congo | Zimbabwe | 🇿🇼 |
Sinhala | si | Indo-European / Indo-Iranian | Sri Lanka | 🇱🇰 |
Slovak | sk | Indo-European / Balto-Slavic | Slovakia | 🇸🇰 |
Slovenian | sl | Indo-European / Balto-Slavic | Slovenia | 🇸🇮 |
Somali | so | Afro-Asiatic / Cushitic | Somalia | 🇸🇴 |
Spanish | es | Indo-European / Italic | Spain | 🇪🇸 |
Swahili | sw | Niger–Congo / Atlantic–Congo | Tanzania | 🇹🇿 |
Swedish | sv | Indo-European / Germanic | Sweden | 🇸🇪 |
Tajik | tg | Indo-European / Indo-Iranian | Tajikistan | 🇹🇯 |
Tamil | ta | Dravidian / Southern Dravidian | India | 🇮🇳 |
Telugu | te | Dravidian / South-Central Dravidian | India | 🇮🇳 |
Thai | th | Kra-Dai / Tai | Thailand | 🇹🇭 |
Tibetan | bo | Sino-Tibetan / Tibeto-Burman | China | 🇨🇳 |
Tigrinya | ti | Afro-Asiatic / Semitic | Eritrea | 🇪🇷 |
Turkish | tr | Turkic / Common Turkic | Turkey | 🇹🇷 |
Ukrainian | uk | Indo-European / Balto-Slavic | Ukraine | 🇺🇦 |
Urdu | ur | Indo-European / Indo-Iranian | Pakistan | 🇵🇰 |
Vietnamese | vi | Austroasiatic / Vietic | Vietnam | 🇻🇳 |
Welsh | cy | Indo-European / Celtic | United Kingdom | 🇬🇧 |
Yiddish | yi | Indo-European / Germanic | Israel | 🇮🇱 |
Yoruba | yo | Niger–Congo / Atlantic–Congo | Nigeria | 🇳🇬 |
Zulu | zu | Niger–Congo / Atlantic–Congo | South Africa | 🇿🇦 |
The model parameters defined in src/main/resources/model/parameters.json can be overridden via ENV variables to modify language detection behavior at runtime.
Use with caution: you do not need to modify the default settings, and this list is provided only for completeness. Before modifying any model parameters, study the source code (see LanguageDetector#detectBlock(String)) to familiarize yourself with probabilistic matching using the Naive Bayes classification algorithm over character n-grams. See also Ted Dunning, Statistical Identification of Language, 1994. A hypothetical sketch of resolving these ENV overrides follows the table.
Name | Configured by the ENV variable | Description |
---|---|---|
baseFrequency | LANGUAGE_DETECT_BASE_FREQUENCY | Default: 10000 |
iterationLimit | LANGUAGE_DETECT_ITERATION_LIMIT | Safeguard to break the loop. Default: 10000 |
numberOfTrials | LANGUAGE_DETECT_NUMBER_OF_TRIALS | Number of trials (affects CPU usage). Default: 7 |
alpha | LANGUAGE_DETECT_ALPHA | Naive Bayes classifier smoothing parameter to prevent zero probabilities and improve the robustness of the classifier. Default: 0.5 |
alphaWidth | LANGUAGE_DETECT_ALPHA_WIDTH | The width of smoothing. Default: 0.05 |
convergenceThreshold | LANGUAGE_DETECT_CONVERGENCE_THRESHOLD | Detection is terminated when the normalized probability exceeds this threshold. Default: 0.99999 |
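As an illustration of how such an override could be resolved, here is a minimal hypothetical helper (not part of the library's API) that reads an ENV variable and falls back to the bundled default:

```java
public class EnvConfigSketch {

    // Reads an ENV variable by name, falling back to the default when the variable is unset.
    static float envOrDefault(final String name, final float defaultValue) {
        final String raw = System.getenv(name);
        return raw == null ? defaultValue : Float.parseFloat(raw);
    }

    public static void main(final String[] args) {
        // Resolves to the documented default of 0.5 when LANGUAGE_DETECT_ALPHA is unset
        final float alpha = envOrDefault("LANGUAGE_DETECT_ALPHA", 0.5f);
        System.out.println("alpha = " + alpha);
    }
}
```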
Furthermore, the library offers a highly accurate CJK language detection mode designed specifically for short strings that can contain a mix of CJK, Latin, and numeric characters.
For such limited or mixed text, the library bypasses the performance bottlenecks of traditional machine learning or n-gram based solutions, which are ill-suited to it. By iterating directly over characters, the library efficiently identifies CJK script usage, enabling rapid and precise language classification. This direct character analysis is significantly faster and simpler for short texts, avoiding the complexities of statistical models.
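A hedged sketch of such a direct character scan is shown below; the chosen script set and ratio computation are illustrative assumptions, not the library's exact code.

```java
public class CjkHeuristicSketch {

    // Returns the proportion of code points belonging to CJK-related scripts.
    static double cjkRatio(final String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int cjk = 0;
        for (int i = 0; i < text.length(); ) {
            final int codePoint = text.codePointAt(i);
            final Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
            if (script == Character.UnicodeScript.HAN
                    || script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA
                    || script == Character.UnicodeScript.HANGUL) {
                cjk++;
            }
            i += Character.charCount(codePoint);
        }
        return (double) cjk / text.codePointCount(0, text.length());
    }

    public static void main(final String[] args) {
        // Prints roughly 0.45 for a mixed Japanese/Latin string: 5 of 11 code points are CJK
        System.out.println(cjkRatio("こんにちは world"));
    }
}
```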
The language detection library can be integrated into your code using a Builder with a fluent API. The API is simple to use, enabling easy configuration of the language detector. Additionally, the public API of the library is designed to never return null.
The following is a reasonable configuration:
```java
final LanguageDetectionSettings languageDetectionSettings =
    LanguageDetectionSettings
        .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn") // or: en, ja, es, fr, de, it, zh-cn
        .withClassifyChineseAsJapanese()
        .build();

final LanguageDetectionOrchestrator orchestrator =
    LanguageDetectionOrchestrator.fromSettings(languageDetectionSettings);
final Language language = orchestrator.detect("languages are awesome");

final String languageCode = language.getIsoCode639_1();
final float probability = language.getProbability();
```
In some classification tasks, you may already know that your language data is not written in the Latin script, such as with languages that use different alphabets. In these situations, the accuracy of language detection can improve by either excluding unrelated languages from the process or by focusing specifically on the languages that are relevant:
.fromAllIsoCodes639_1()
- Default: N/A
- Description: Enables the library to perform language detection for all supported languages, identified by their ISO 639-1 codes
```java
LanguageDetectionSettings
    .fromAllIsoCodes639_1()
    .build();
```
.fromIsoCodes639_1(String)
- Default: N/A
- Description: Enables the library to perform language detection for specific languages, identified by their ISO 639-1 codes
```java
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .build();
```
.withMaxTextChars(Integer)
- Default: 2,000. The default limit is 2,000 characters (roughly a two- to three-page document). For comparison, in Solr, the default maximum text length is set to 20,000 characters.
- Description: Restricts the maximum number of characters from the input text that will be processed for language detection. This is valuable because the library does not need to analyze the entire document to detect the language accurately; a sufficient portion of the text is often enough to achieve reliable results.
```java
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withMaxTextChars(2000)
    .build();
```
.withoutInputSanitize()
- Default: false (input sanitization is enabled by default). By default, the library sanitizes input strings by removing URLs from any part of the input text, as these are irrelevant to language detection.
- Description: Invoking this API bypasses the input sanitization process, allowing the text to be processed without such modifications (see the sketch after the following code block).
```java
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withoutInputSanitize()
    .build();
```
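For illustration, URL stripping before detection could look like the sketch below; the regex and helper are assumptions, not the library's actual sanitizer.

```java
import java.util.regex.Pattern;

public class SanitizeSketch {

    // URLs carry no linguistic signal, so they are replaced with whitespace before detection
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static String stripUrls(final String input) {
        return URL.matcher(input).replaceAll(" ");
    }

    public static void main(final String[] args) {
        System.out.println(stripUrls("bonjour https://example.com le monde")); // "bonjour   le monde"
    }
}
```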
.withCjkDetectionThreshold(Double)
- Default: 0.1. When the proportion of CJK characters in the input string exceeds 10%, the library bypasses statistical detection via Naive Bayes. This decision is based on a heuristic threshold indicating that the input is likely CJK.
- Description: When the threshold is set to a value greater than zero, the library first applies a heuristic check to determine whether the input string contains CJK characters. If the heuristic confirms the presence of CJK text, statistical detection via Naive Bayes is not performed.
```java
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withCjkDetectionThreshold(0.25)
    .build();
```
.withClassifyChineseAsJapanese()
- Default: false (Chinese text is not classified as Japanese)
- Description: Invoking this API enables the classification of Kanji-only text (text containing only Chinese characters, without any Japanese Hiragana or Katakana characters) or mixed text containing both Latin and Kanji characters as Japanese. This is particularly important when optimizing for more accurate detection of Japanese text in order to minimize its misclassification. It also proves useful when indexing short strings such as #7_pj_12345_ABCD_戦 or SOMETHING_2010下_詳細_20130304.xls.
```java
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withClassifyChineseAsJapanese()
    .build();
```
.withMininumCertainty(Double)
- Default: 0.1. Specifies a certainty threshold value between 0...1.
- Description: The library requires that the language identification probability surpass a predefined threshold for any detected language. If the probability falls short of this threshold, the library filters out those languages, excluding them from the results (a sketch of this filtering follows the code block below).

Please be aware that the .withMininumCertainty(Double) method cannot be used in conjunction with the .withTopLanguageMininumCertainty(Double, String) method (explained in the next section). The setting that is applied last during configuration takes priority.
```java
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withMininumCertainty(0.65)
    .build();
```
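A minimal sketch of this threshold filtering, assuming a hypothetical Detected value type (the library's own result type is Language, shown earlier):

```java
import java.util.ArrayList;
import java.util.List;

public class CertaintyFilterSketch {

    static final class Detected {
        final String isoCode;
        final float probability;

        Detected(final String isoCode, final float probability) {
            this.isoCode = isoCode;
            this.probability = probability;
        }
    }

    // Languages whose probability falls below the threshold are dropped from the results
    static List<Detected> filterByCertainty(final List<Detected> candidates, final float minimum) {
        final List<Detected> kept = new ArrayList<>();
        for (final Detected candidate : candidates) {
            if (candidate.probability >= minimum) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```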
.withTopLanguageMininumCertainty(Double, String)
- Default: Not set. Specifies a certainty threshold value between 0...1 and a fallback ISO 639-1 language code.
- Description: The language identification probability must exceed the threshold value for the top detected language. If this threshold is not met, the library defaults to the configured ISO 639-1 fallback code, treating it as the top and sole detected language (a sketch of this fallback rule follows the code block below).

Please be aware that the .withTopLanguageMininumCertainty(Double, String) method cannot be used in conjunction with the .withMininumCertainty(Double) method (explained in the previous section). The setting that is applied last during configuration takes priority.
```java
LanguageDetectionSettings
    .fromIsoCodes639_1("en,ja,es,fr,de,it,zh-cn")
    .withTopLanguageMininumCertainty(0.65, "en")
    .build();
```
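By contrast, the fallback rule can be sketched as follows; the method and names are illustrative, not the library's API.

```java
public class FallbackSketch {

    // If the best candidate misses the threshold, the configured fallback ISO code wins
    static String topLanguageOrFallback(
            final String topIsoCode, final float topProbability,
            final float minimum, final String fallbackIsoCode) {
        return topProbability >= minimum ? topIsoCode : fallbackIsoCode;
    }

    public static void main(final String[] args) {
        // A weak 0.4 guess for "fr" with a 0.65 threshold falls back to "en"
        System.out.println(topLanguageOrFallback("fr", 0.4f, 0.65f, "en"));
    }
}
```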
This library provides a suite of benchmarks to assess its performance against other language detection libraries. The benchmark uses a fixed set of languages: Japanese (ja), English (en), French (fr), Spanish (es), Italian (it), and German (de). These languages are part of the multilingual mMARCO dataset. The dataset consists of 59,096 files per language, with each file containing one to four sentence paragraphs.
Currently, the following libraries are evaluated in terms of accuracy and speed of execution:
- Default (the current library)
- Optimaize (Optimaize GitHub)
- Lingua with low accuracy mode enabled (Lingua GitHub)
- Lingua with the default high accuracy mode (Lingua GitHub)
- Apache OpenNLP language detector (Apache OpenNLP)
- Apache Tika with the Optimaize language detector (Apache Tika)
- Apache Tika with the OpenNLP language detector (Apache Tika). This is based on OpenNLP's language detector; however, they have built their own ProbingLanguageDetector and their own language models.
- jFastText, a Java wrapper for Facebook's fastText
Please note that the test dataset is quite large when unzipped. When running the benchmarks for the first time, the dataset located in src/benchmarkTest/dataset.tar.gz (approximately 95 MB) will be extracted into the build/resources/benchmarkTest directory. This extraction requires about 1.05 GB of disk space.
To run benchmarks across all the libraries and datasets, execute the following Gradle command:
```bash
./gradlew runBenchmarks
```
Please note that, by default, all the benchmarks run on a single worker thread; currently, this is not configurable (PR pending).
Alternatively, you can run benchmarks for specific language detectors and datasets by specifying the desired options. The following -Pdetector arguments are currently supported:
- default - the current library
- optimaize - Optimaize language detector (Optimaize GitHub)
- lingua_low - Lingua with low accuracy mode enabled (Lingua GitHub)
- lingua_high - Lingua with the default high accuracy mode enabled (Lingua GitHub)
- opennlp - Apache OpenNLP language detector (Apache OpenNLP)
- tika_optimaize - Apache Tika with the Optimaize language detector (Apache Tika)
- tika_opennlp - Apache Tika with the OpenNLP language detector (Apache Tika); based on OpenNLP's detector, with Tika's own ProbingLanguageDetector and language models
- jfasttext - jFastText, a Java wrapper for Facebook's fastText
For example, to run benchmarks using the Optimaize, Apache Tika with Optimaize, and Default language detectors on the en (English) and ja (Japanese) datasets, use the following command:

```bash
./gradlew runBenchmarks -Pdetector=optimaize,default,tika_optimaize -PisoCodesCsv=en,ja
```
Once the benchmark process completes, a report will be generated showing the accuracy of each detector. Here's an example of how to interpret the results:
For instance, in a row like DE-optimaize, the output indicates that the Optimaize detector processed the German dataset of 59,096 files. Of those, 58,880 files were correctly identified as German, while the remaining files were misidentified as other languages.
Each group of <UPPERCASE_ISO-CODE-639-1>-<DETECTOR_NAME> rows is sorted in descending order by the count in the corresponding ISO 639-1 code column, i.e., by detection accuracy.
```text
|---------------------|---------|---------|---------|---------|---------|---------|---------|
| Dataset-to-Detector | de | en | es | fr | it | ja | unknown |
|---------------------|---------|---------|---------|---------|---------|---------|---------|
| DE-jfasttext | 59008 | 73 | 5 | 2 | 0 | 0 | 8 |
| DE-lingua_high | 58916 | 163 | 4 | 11 | 2 | 0 | 0 |
| DE-default | 58914 | 171 | 3 | 5 | 3 | 0 | 0 |
| DE-lingua_low | 58889 | 184 | 4 | 13 | 5 | 0 | 1 |
| DE-optimaize | 58880 | 209 | 3 | 1 | 3 | 0 | 0 |
| DE-tika_optimaize | 58880 | 209 | 3 | 1 | 3 | 0 | 0 |
| DE-opennlp | 58633 | 257 | 1 | 3 | 3 | 0 | 199 |
| DE-tika_opennlp | 58434 | 307 | 0 | 3 | 4 | 0 | 348 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|
| EN-optimaize | 8 | 59070 | 5 | 5 | 8 | 0 | 0 |
| EN-tika_optimaize | 8 | 59070 | 5 | 5 | 8 | 0 | 0 |
| EN-jfasttext | 9 | 59068 | 5 | 2 | 2 | 0 | 10 |
| EN-default | 17 | 59041 | 8 | 22 | 8 | 0 | 0 |
| EN-opennlp | 0 | 58976 | 2 | 0 | 2 | 0 | 116 |
| EN-lingua_high | 50 | 58972 | 30 | 35 | 9 | 0 | 0 |
| EN-lingua_low | 62 | 58942 | 33 | 48 | 11 | 0 | 0 |
| EN-tika_opennlp | 1 | 58794 | 4 | 4 | 6 | 0 | 287 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|
| ES-jfasttext | 4 | 27 | 59025 | 23 | 0 | 0 | 17 |
| ES-default | 6 | 154 | 58906 | 11 | 19 | 0 | 0 |
| ES-optimaize | 9 | 160 | 58906 | 9 | 12 | 0 | 0 |
| ES-tika_optimaize | 9 | 160 | 58906 | 9 | 12 | 0 | 0 |
| ES-lingua_high | 8 | 173 | 58889 | 10 | 16 | 0 | 0 |
| ES-lingua_low | 10 | 180 | 58871 | 17 | 18 | 0 | 0 |
| ES-tika_opennlp | 0 | 146 | 58644 | 4 | 17 | 0 | 285 |
| ES-opennlp | 0 | 200 | 58351 | 5 | 16 | 0 | 524 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|
| FR-jfasttext | 1 | 30 | 0 | 59063 | 0 | 0 | 2 |
| FR-default | 12 | 144 | 7 | 58930 | 3 | 0 | 0 |
| FR-opennlp | 1 | 117 | 1 | 58909 | 2 | 0 | 66 |
| FR-optimaize | 13 | 161 | 12 | 58907 | 3 | 0 | 0 |
| FR-tika_optimaize | 13 | 161 | 12 | 58907 | 3 | 0 | 0 |
| FR-lingua_high | 23 | 239 | 9 | 58822 | 3 | 0 | 0 |
| FR-lingua_low | 37 | 257 | 12 | 58786 | 4 | 0 | 0 |
| FR-tika_opennlp | 2 | 184 | 3 | 58706 | 11 | 0 | 190 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|
| IT-jfasttext | 9 | 52 | 4 | 1 | 59025 | 0 | 5 |
| IT-default | 6 | 214 | 7 | 5 | 58864 | 0 | 0 |
| IT-optimaize | 6 | 257 | 4 | 2 | 58827 | 0 | 0 |
| IT-tika_optimaize | 6 | 257 | 4 | 2 | 58827 | 0 | 0 |
| IT-opennlp | 0 | 248 | 3 | 0 | 58742 | 0 | 103 |
| IT-tika_opennlp | 1 | 235 | 4 | 1 | 58667 | 0 | 188 |
| IT-lingua_high | 19 | 467 | 58 | 17 | 58535 | 0 | 0 |
| IT-lingua_low | 26 | 492 | 65 | 24 | 58489 | 0 | 0 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|
| JA-default | 0 | 2 | 0 | 0 | 1 | 59093 | 0 |
| JA-jfasttext | 0 | 8 | 0 | 0 | 1 | 59076 | 11 |
| JA-lingua_low | 10 | 31 | 4 | 1 | 1 | 59049 | 0 |
| JA-lingua_high | 7 | 36 | 2 | 3 | 1 | 59047 | 0 |
| JA-opennlp | 5 | 51 | 14 | 9 | 9 | 58742 | 266 |
| JA-tika_opennlp | 8 | 110 | 14 | 3 | 13 | 58386 | 562 |
| JA-optimaize | 1054 | 5421 | 352 | 535 | 441 | 51287 | 6 |
| JA-tika_optimaize | 1054 | 5421 | 352 | 535 | 441 | 51287 | 6 |
|---------------------|---------|---------|---------|---------|---------|---------|---------|
```
As the benchmarks are running, the console will display the execution times for each language detector on the selected datasets.
Here's an example output that displays the execution times:
```text
Will process datasets for 6 ISO 639-1 code names: [de, en, es, fr, it, ja]
default processes dataset [de]
default processes dataset [en]
default processes dataset [es]
default processes dataset [fr]
default processes dataset [it]
default processes dataset [ja]
Detector default completed in 29 seconds and 054 millis
jfasttext processes dataset [de]
jfasttext processes dataset [en]
jfasttext processes dataset [es]
jfasttext processes dataset [fr]
jfasttext processes dataset [it]
jfasttext processes dataset [ja]
Detector jfasttext completed in 32 seconds and 700 millis
lingua_high processes dataset [de]
lingua_high processes dataset [en]
lingua_high processes dataset [es]
lingua_high processes dataset [fr]
lingua_high processes dataset [it]
lingua_high processes dataset [ja]
Detector lingua_high completed in 94 seconds and 833 millis
lingua_low processes dataset [de]
lingua_low processes dataset [en]
lingua_low processes dataset [es]
lingua_low processes dataset [fr]
lingua_low processes dataset [it]
lingua_low processes dataset [ja]
Detector lingua_low completed in 92 seconds and 096 millis
opennlp processes dataset [de]
opennlp processes dataset [en]
opennlp processes dataset [es]
opennlp processes dataset [fr]
opennlp processes dataset [it]
opennlp processes dataset [ja]
Detector opennlp completed in 85 seconds and 664 millis
optimaize processes dataset [de]
optimaize processes dataset [en]
optimaize processes dataset [es]
optimaize processes dataset [fr]
optimaize processes dataset [it]
optimaize processes dataset [ja]
Detector optimaize completed in 32 seconds and 195 millis
tika_opennlp processes dataset [de]
tika_opennlp processes dataset [en]
tika_opennlp processes dataset [es]
tika_opennlp processes dataset [fr]
tika_opennlp processes dataset [it]
tika_opennlp processes dataset [ja]
Detector tika_opennlp completed in 169 seconds and 209 millis
tika_optimaize processes dataset [de]
tika_optimaize processes dataset [en]
tika_optimaize processes dataset [es]
tika_optimaize processes dataset [fr]
tika_optimaize processes dataset [it]
tika_optimaize processes dataset [ja]
Detector tika_optimaize completed in 38 seconds and 451 millis
```
From the Accuracy report and Speed of execution sections, we can conclude:
- jFastText consistently ranks among the top three for detection accuracy, with Default (the current library) securing the second position, just behind jFastText in terms of accuracy.
- Default, jFastText, and Optimaize are the fastest libraries, offering a robust balance of high accuracy and speed and outperforming the other libraries in both categories.
- OpenNLP, Apache Tika OpenNLP, and Lingua (across all accuracy modes) are noticeably slower in comparison, which may affect efficiency.
- The plugin keeps Java 11 source compatibility at the moment
- At least JDK 11 is required
Before your first commit, run this command in the root project directory:
```bash
cp pre-commit .git/hooks
```
If you forget to do this, there is a Gradle task defined in build.gradle that installs the hook for you.
The plugin uses Gradle as a build system.
For a list of all the available Gradle tasks, run the following command:
```bash
./gradlew tasks
```
Building and packaging can be done with the following command:
```bash
./gradlew build
```
The sources will be auto-formatted using Google Java Format upon each commit. But should there be a need to format manually, run the following command:
```bash
./gradlew googleJavaFormat
```
To run unit tests, run the following command:
```bash
./gradlew test
```
The classification accuracy analysis helps improve our understanding of how the library performs on texts of various lengths and types; see src/accuracyTest/java/io/github/azagniotov/language/LanguageDetectorAccuracyTest.java
To run the classification accuracy tests and generate an accuracy report CSV, run the following command:
```bash
./gradlew clean accuracyTest
```
The generated report can be found under build/reports/accuracy/accuracy-report-<UNIX_TIMESTAMP>.csv