Trigrams for 460+ languages.
This package exposes all trigrams for natural languages.
The trigrams are based on the most translated copyright-free document on this planet: the Universal Declaration of Human Rights (UDHR).
Use this package when you are dealing with natural language detection.
This package is ESM only.
In Node.js (version 14.14+, 16.0+), install with npm:
npm install trigrams
In Deno with esm.sh:
import {top, min} from 'https://esm.sh/trigrams@5'
In browsers with esm.sh:
<script type="module">
import {top, min} from 'https://esm.sh/trigrams@5?bundle'
</script>
import {top, min} from 'trigrams'
console.log((await top()).pam)
console.log((await min()).nld)
Yields:
{ // 300 top trigrams.
'isa': 6,
'upa': 6,
'i k': 6,
// …
'ang': 273,
'ing': 282,
'ng ': 572 // Most common trigram with how often it was found.
}
[ // 300 top trigrams.
' ar',
'eer',
'tij',
// …
'de ',
'an ',
'en ' // Most common trigram.
]
This package exports the identifiers top and min.
There is no default export.
top() gets the top trigrams together with their occurrence counts.
Returns a promise resolving to an object mapping UDHR in Unicode codes to objects mapping the top 300 trigrams to occurrence counts (Promise<Record<string, Record<string, number>>>).
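The counts can be turned into an ordered list when only the ranking matters. The following is a minimal sketch, not the package's own code: the `counts` object copies the abbreviated `pam` output from the usage example, and `toSortedTrigrams` is a hypothetical helper name. Whether the package derives its `min()` data this way is an assumption.

```javascript
// Counts copied from the abbreviated `pam` output shown in the usage example.
const counts = {isa: 6, upa: 6, 'i k': 6, ang: 273, ing: 282, 'ng ': 572}

// Hypothetical helper: order trigrams from least to most occurring.
// `Array.prototype.sort` is stable, so ties keep their original order.
function toSortedTrigrams(record) {
  return Object.entries(record)
    .sort((left, right) => left[1] - right[1])
    .map(([trigram]) => trigram)
}

const sorted = toSortedTrigrams(counts)
console.log(sorted)
// The last entry is the most common trigram.
console.log(sorted[sorted.length - 1])
```

With real data, the same helper could be applied to `(await top()).pam`.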
min() gets the top trigrams without counts.
Returns a promise resolving to an object mapping UDHR in Unicode codes to arrays containing the top 300 trigrams, sorted from least occurring to most occurring (Promise<Record<string, Array<string>>>).
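To illustrate how such sorted lists might feed language detection, here is a toy sketch. It is not the algorithm of any particular detector: `sample` is a hypothetical stand-in for the object that `await min()` resolves to, cut down to the six trigrams per language shown in the example output, and `trigramsOf`, `score`, and `guess` are illustrative names. The input is assumed to be cleaned already.

```javascript
// Hypothetical stand-in for the data resolved by `await min()`.
const sample = {
  nld: [' ar', 'eer', 'tij', 'de ', 'an ', 'en '],
  pam: ['isa', 'upa', 'i k', 'ang', 'ing', 'ng ']
}

// Split a string into overlapping three-character slices.
function trigramsOf(value) {
  const result = []
  for (let index = 0; index + 3 <= value.length; index++) {
    result.push(value.slice(index, index + 3))
  }
  return result
}

// Score text against one list: lists run from least to most occurring,
// so a match later in the list adds more weight.
function score(value, list) {
  let total = 0
  for (const trigram of trigramsOf(value)) {
    const position = list.indexOf(trigram)
    if (position !== -1) total += position + 1
  }
  return total
}

// Return the code with the highest score, or `undefined` when nothing matches.
function guess(value, data) {
  let best
  let bestScore = 0
  for (const [code, list] of Object.entries(data)) {
    const current = score(value, list)
    if (current > bestScore) {
      best = code
      bestScore = current
    }
  }
  return best
}

console.log(guess('en de an', sample)) // nld
console.log(guess('ang ing', sample)) // pam
```

With the full 300-trigram lists, `indexOf` on every trigram becomes wasteful; building a `Map` from trigram to position per language once would be a reasonable refinement.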
The trigrams are based on the Unicode versions of the Universal Declaration of Human Rights.
The files are created from all paragraphs made available by wooorm/udhr and do not include headings and such.
Before creating trigrams,
- the Unicode characters from \u0021 to \u0040 (inclusive) are removed
- one or more white space characters (\s+) are replaced with a single space
- alphabetic characters are lower cased ([A-Z])
Additionally, the input is padded with two spaces on both sides.
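The cleaning steps above can be sketched as follows. This is a literal reading of the description, not the package's build script: `clean` and `trigramCounts` are hypothetical names, and the padding is assumed to mean two spaces before and two spaces after the cleaned text.

```javascript
// Apply the three cleaning steps in the order they are listed.
function clean(value) {
  return value
    // Remove U+0021 ("!") through U+0040 ("@"): ASCII punctuation and digits.
    .replace(/[\u0021-\u0040]/g, '')
    // Collapse each run of white space into a single space.
    .replace(/\s+/g, ' ')
    // Lower case the characters matched by `[A-Z]`.
    .replace(/[A-Z]/g, (character) => character.toLowerCase())
}

// Count every overlapping three-character slice of the padded, cleaned text.
function trigramCounts(value) {
  // Assumption: "two spaces on both sides" means two before and two after.
  const padded = '  ' + clean(value) + '  '
  const counts = {}
  for (let index = 0; index + 3 <= padded.length; index++) {
    const trigram = padded.slice(index, index + 3)
    counts[trigram] = (counts[trigram] || 0) + 1
  }
  return counts
}

console.log(JSON.stringify(clean('Hello,  World 42!'))) // "hello world "
console.log(trigramCounts('ab'))
```

Sorting the resulting counts and keeping the 300 most frequent entries would give data in the shape described for `top()`.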
| Code | Name |
| - | - |
| 007 | Sãotomense |
| 008 | Crioulo, Upper Guinea (008) |
| 009 | Mbundu (009) |
| 010 | Tetun Dili |
| 011 | Umbundu (011) |
| 013 | (Mijisa) |
| 014 | (Maiunan) |
| 016 | (Minjiang, spoken) |
| 017 | (Minjiang, written) |
| 020 | Drung |
| 021 | (Muzzi) |
| 022 | (Klau) |
| 025 | (Bizisa) |
| 026 | (Yeonbyeon) |
| 027 | Gumuz |
| 028 | Kafa |
| 029 | Sidamo |
| 030 | Kituba (2) |
| 032 | South Azerbaijani |
| 041 | Latvian (2) |
| 042 | Spanish (resolution) |
| 043 | Zarma |
| aar | Afar |
| abk | Abkhaz |
| ace | Aceh |
| acu | Achuar-Shiwiar |
| acu_1 | Achuar-Shiwiar (1) |
| ada | Dangme |
| ady | Adyghe |
| afr | Afrikaans |
| agr | Aguaruna |
| aii | Assyrian Neo-Aramaic |
| ajg | Aja |
| aka_akuapem | Twi (Akuapem) |
| aka_asante | Twi (Asante) |
| aka_fante | Fante |
| als | Albanian, Tosk |
| alt | Altai, Southern |
| amc | Amahuaca |
| ame | Yaneshaʼ |
| amh | Amharic |
| ami | Amis |
| amr | Amarakaeri |
| arb | Arabic, Standard |
| arl | Arabela |
| arn | Mapudungun |
| ast | Asturian |
| auc | Waorani |
| auv | Occitan (Auvergnat) |
| ayr | Aymara, Central |
| azj_cyrl | Azerbaijani, North (Cyrillic) |
| azj_latn | Azerbaijani, North (Latin) |
| bam | Bamanankan |
| ban | Bali |
| bax | Bamun |
| bba | Baatonum |
| bci | Baoulé |
| bcl | Bicolano, Central |
| bel | Belarusan |
| bem | Bemba |
| ben | Bengali |
| bfa | Bari |
| bho | Bhojpuri |
| bin | Edo |
| bis | Bislama |
| blt | Tai Dam |
| blu | Hmong Njua |
| boa | Bora |
| bod | Tibetan, Central |
| bos_cyrl | Bosnian (Cyrillic) |
| bos_latn | Bosnian (Latin) |
| bre | Breton |
| btb | Bulu |
| buc | Bushi |
| bug | Bugis |
| bul | Bulgarian |
| cab | Garifuna |
| cak | Kaqchikel, Central |
| cat | Catalan-Valencian-Balear |
| cbi | Chachi |
| cbr | Cashibo-Cacataibo |
| cbs | Cashinahua |
| cbt | Chayahuita |
| cbu | Candoshi-Shapra |
| ccx | Zhuang, Yongbei |
| ceb | Cebuano |
| ces | Czech |
| cha | Chamorro |
| chj | Chinantec, Ojitlán |
| chk | Chuukese |
| chr_cased | Cherokee (cased) |
| chr_uppercase | Cherokee (uppercase) |
| chv | Chuvash |
| cic | Chickasaw |
| cjk | Chokwe |
| cjk_AO | Chokwe (Angola) |
| cjs | Shor |
| ckb | Kurdish, Central |
| cnh | Chin, Haka |
| cni | Asháninka |
| cnr | Montenegrin |
| cof | Colorado |
| cos | Corsican |
| cot | Caquinte |
| cpu | Ashéninka, Pichis |
| crh | Crimean Tatar |
| crs | Seselwa Creole French |
| csa | Chinantec, Chiltepec |
| csw | Cree, Swampy |
| ctd | Chin, Tedim |
| cym | Welsh |
| dag | Dagbani |
| dan | Danish |
| ddn | Dendi |
| deu_1901 | German, Standard (1901) |
| deu_1996 | German, Standard (1996) |
| dga | Dagaare, Southern |
| dip | Dinka, Northeastern |
| div | Maldivian |
| dyo | Jola-Fonyi |
| dyu | Jula |
| dzo | Dzongkha |
| ell_monotonic | Greek (monotonic) |
| ell_polytonic | Greek (polytonic) |
| emk | Maninkakan, Eastern |
| eml | Romagnolo |
| eng | English |
| epo | Esperanto |
| ese | Ese Ejja |
| est | Estonian |
| eus | Basque |
| eve | Even |
| evn | Evenki |
| ewe | Éwé |
| fao | Faroese |
| fij | Fijian |
| fin | Finnish |
| fkv | Finnish, Kven |
| flm | Chin, Falam |
| fon | Fon |
| fra | French |
| fri | Frisian, Western |
| fuf | Pular |
| fur | Friulian |
| fuv | Fulfulde, Nigerian |
| fuv2 | Fulfulde, Nigerian (2) |
| fvr | Fur |
| gaa | Ga |
| gag | Gagauz |
| gax | Oromo, Borana-Arsi-Guji |
| gjn | Gonja |
| gkp | Kpelle, Guinea |
| gla | Gaelic, Scottish |
| gld | Nanai |
| gle | Gaelic, Irish |
| glg | Galician |
| glv | Manx |
| gsw1 | Alemannisch (Elsassisch) |
| guc | Wayuu |
| gug | Guaraní, Paraguayan |
| guj | Gujarati |
| guu | Yanomamö |
| gyr | Guarayu |
| hat_kreyol | Haitian Creole French (Kreyol) |
| hat_popular | Haitian Creole French (Popular) |
| hau_NE | Hausa (Niger) |
| hau_NG | Hausa (Nigeria) |
| hau_3 | Hausa |
| haw | Hawaiian |
| hea | Hmong, Northern Qiandong |
| heb | Hebrew |
| hil | Hiligaynon |
| hin | Hindi |
| hlt | Chin, Matu |
| hms | Hmong, Southern Qiandong |
| hna | Gen |
| hni | Hani |
| hns | Hindustani, Sarnami |
| hrv | Croatian |
| hsb | Sorbian, Upper |
| hsf | Huastec (Sierra de Otontepec) |
| hun | Hungarian |
| hus | Huastec (Veracruz) |
| huu | Huitoto, Murui |
| hva | Huastec (San Luís Potosí) |
| hye | Armenian |
| ibb | Ibibio |
| ibo | Igbo |
| ido | Ido |
| idu | Idoma |
| ijs | Ijo, Southeast |
| ike | Inuktitut, Eastern Canadian |
| ilo | Ilocano |
| ina | Interlingua |
| ind | Indonesian |
| isl | Icelandic |
| ita | Italian |
| jav | Javanese (Latin) |
| jav_java | Javanese (Javanese) |
| jiv | Shuar |
| jpn | Japanese |
| jpn_osaka | Japanese (Osaka) |
| jpn_tokyo | Japanese (Tokyo) |
| kaa | Karakalpak |
| kal | Inuktitut, Greenlandic |
| kan | Kannada |
| kat | Georgian |
| kaz | Kazakh |
| kbd | Kabardian |
| kbp | Kabiyé |
| kde | Makonde |
| kdh | Tem |
| kea | Kabuverdianu |
| kek | Q'eqchi' |
| kha | Khasi |
| khk | Mongolian, Halh (Cyrillic) |
| khm | Khmer, Central |
| kin | Rwanda |
| kir | Kirghiz |
| kjh | Khakas |
| kkh_lana | Khün |
| kmb | Mbundu |
| kmr | Kurdish, Northern |
| knc | Kanuri, Central |
| kng | Koongo |
| kng_AO | Koongo (Angola) |
| koi | Komi-Permyak |
| koo | Konjo |
| kor | Korean |
| kqn | Kaonde |
| kqs | Kissi, Northern |
| kri | Krio |
| krl | Karelian |
| ktu | Kituba |
| kwi | Awa-Cuaiquer |
| lad | Ladino |
| lao | Lao |
| lat | Latin |
| lat_1 | Latin (1) |
| lav | Latvian |
| lia | Limba, West-Central |
| lij | Ligurian |
| lin | Lingala |
| lin_tones | Lingala (tones) |
| lit | Lithuanian |
| lld | Ladin |
| lnc | Occitan (Languedocien) |
| lns | Lamnso' |
| lob | Lobi |
| lot | Otuho |
| loz | Lozi |
| ltz | Luxembourgeois |
| lua | Luba-Kasai |
| lue | Luvale |
| lug | Ganda |
| lun | Lunda |
| lus | Mizo |
| mad | Madura |
| mag | Magahi |
| mah | Marshallese |
| mai | Maithili |
| mal | Malayalam |
| mal_chillus | Malayalam |
| mam | Mam, Northern |
| mar | Marathi |
| maz | Mazahua Central |
| mcd | Sharanahua |
| mcf | Matsés |
| men | Mende |
| mfq | Moba |
| mic | Micmac |
| min | Minangkabau |
| miq | Mískito |
| mkd | Macedonian |
| mlt | Maltese |
| mly_arab | Malay (Arabic) |
| mly_latn | Malay (Latin) |
| mnw | Mon |
| mor | Moro |
| mos | Mòoré |
| mri | Maori |
| mto | Mixe, Totontepec |
| mxi | Mozarabic |
| mxv | Mixtec, Metlatónoc |
| mya | Burmese |
| mzi | Mazatec, Ixcatlán |
| nav | Navajo |
| nba | Nyemba |
| nbl | Ndebele |
| ndo | Ndonga |
| nds | Saxon, Low |
| nep | Nepali |
| nhn | Nahuatl, Central |
| nio | Nganasan |
| niu | Niue |
| niv | Gilyak |
| njo | Naga, Ao |
| nku | Kulango, Bouna |
| nld | Dutch |
| nno | Norwegian, Nynorsk |
| nob | Norwegian, Bokmål |
| not | Nomatsiguenga |
| nso | Sotho, Northern |
| nya_chechewa | Nyanja (Chechewa) |
| nya_chinyanja | Nyanja (Chinyanja) |
| nym | Nyamwezi |
| nyn | Nyankore |
| nzi | Nzema |
| oaa | Orok |
| oci_1 | Occitan (Francoprovençal, Fribourg) |
| oci_2 | Occitan (Francoprovençal, Savoie) |
| oci_3 | Occitan (Francoprovençal, Vaud) |
| oci_4 | Occitan (Francoprovençal, Valais) |
| ojb | Ojibwa, Northwestern |
| oki | Okiek |
| orh | Oroqen |
| oss | Osetin |
| ote | Otomi, Mezquital |
| pam | Pampangan |
| pan | Panjabi, Eastern |
| pap | Papiamentu |
| pau | Palauan |
| pbb | Páez |
| pbu | Pashto, Northern |
| pcd | Picard |
| pcm | Pidgin, Nigerian |
| pes_1 | Farsi, Western |
| pes_2 | Dari |
| pis | Pijin |
| piu | Pintupi-Luritja |
| plt | Malagasy, Plateau |
| pnb | Panjabi, Western |
| pol | Polish |
| pon | Pohnpeian |
| por_BR | Portuguese (Brazil) |
| por_PT | Portuguese (Portugal) |
| pov | Crioulo, Upper Guinea |
| ppl | Pipil |
| prv | Occitan |
| quc | K'iche', Central |
| qud | Quechua (Unified Quichua, old Hispanic orthography) |
| qug | Quichua, Chimborazo Highland |
| quy | Quechua, Ayacucho |
| quz | Quechua, Cusco |
| qva | Quechua, Ambo-Pasco |
| qvc | Quechua, Cajamarca |
| qvh | Quechua, Huamalíes-Dos de Mayo Huánuco |
| qvm | Quechua, Margos-Yarowilca-Lauricocha |
| qvn | Quechua, North Junín |
| qwh | Quechua, Huaylas Ancash |
| qxa | Quechua, South Bolivian |
| qxn | Quechua, Northern Conchucos Ancash |
| qxu | Quechua, Arequipa-La Unión |
| rar | Rarotongan |
| rmn | Romani, Balkan |
| rmn_1 | Romani, Balkan (1) |
| rmy | Aromanian |
| roh | Romansch |
| roh_puter | Romansch (Puter) |
| roh_rumgr | Romansch (Grischun) |
| roh_surmiran | Romansch (Surmiran) |
| roh_sursilv | Romansch (Sursilvan) |
| roh_sutsilv | Romansch (Sutsilvan) |
| roh_vallader | Romansch (Vallader) |
| ron_1953 | Romanian (1953) |
| ron_1993 | Romanian (1993) |
| ron_2006 | Romanian (2006) |
| run | Rundi |
| rus | Russian |
| sag | Sango |
| sah | Yakut |
| san | Sanskrit |
| sco | Scots |
| sey | Secoya |
| shk | Shilluk |
| shn | Shan |
| shp | Shipibo-Conibo |
| sin | Sinhala |
| skr | Seraiki |
| slk | Slovak |
| slr | Salar |
| slv | Slovenian |
| sme | Saami, North |
| smo | Samoan |
| sna | Shona |
| snk | Soninke |
| snn | Siona |
| som | Somali |
| sot | Sotho, Southern |
| spa | Spanish |
| src | Sardinian, Logudorese |
| srp_cyrl | Serbian (Cyrillic) |
| srp_latn | Serbian (Latin) |
| srr | Serer-Sine |
| ssw | Swati |
| suk | Sukuma |
| sun | Sunda |
| sus | Susu |
| swb | Comorian, Maore |
| swe | Swedish |
| swh | Swahili |
| tah | Tahitian |
| tam | Tamil |
| tam_LK | Tamil (Sri Lanka) |
| tat | Tatar |
| tbz | Ditammari |
| tca | Ticuna |
| tel | Telugu |
| tem | Themne |
| tet | Tetun |
| tgk | Tajiki |
| tgl | Tagalog |
| tha | Thai |
| tha2 | Thai (2) |
| tir | Tigrigna |
| tiv | Tiv |
| tly | Talysh |
| tob | Toba |
| toi | Tonga |
| toj | Tojolabal |
| ton | Tongan |
| top | Totonac, Papantla |
| tpi | Tok Pisin |
| tsn | Tswana |
| tso_MZ | Tsonga (Mozambique) |
| tso_ZW | Tsonga (Zimbabwe) |
| tsz | Purepecha |
| tuk_cyrl | Turkmen (Cyrillic) |
| tuk_latn | Turkmen (Latin) |
| tur | Turkish |
| tyv | Tuva |
| tzc | Tzotzil (Chamula) |
| tzh | Tzeltal, Oxchuc |
| tzm | Tamazight, Central Atlas |
| udu | Uduk |
| uig_arab | Uyghur (Arabic) |
| uig_latn | Uyghur (Latin) |
| ukr | Ukrainian |
| umb | Umbundu |
| ura | Urarina |
| urd | Urdu |
| urd_2 | Urdu (2) |
| uzn_cyrl | Uzbek, Northern (Cyrillic) |
| uzn_latn | Uzbek, Northern (Latin) |
| vai | Vai |
| vec | Venetian |
| ven | Venda |
| ven2 | Venda |
| vep | Veps |
| vie | Vietnamese |
| vmw | Makhuwa |
| war | Waray-Waray |
| wln | Walloon |
| wol | Wolof |
| wwa | Waama |
| xho | Xhosa |
| xsm | Kasem |
| yad | Yagua |
| yao | Yao |
| yap | Yapese |
| ydd | Yiddish, Eastern |
| ykg | Yukaghir, Northern |
| yor | Yoruba |
| yrk | Nenets |
| yua | Maya, Yucatán |
| zam | Zapotec, Miahuatlán |
| zdj | Comorian, Ngazidja |
| zgh | Tamazight, Standard Morocan |
| zro | Záparo |
| ztu | Zapotec, Güilá |
| zul | Zulu |
This package is fully typed with TypeScript.
It exports no additional types.
This package is at least compatible with all maintained versions of Node.js.
As of now, that is Node.js 14.14+ and 16.0+.
It also works in Deno and modern browsers.
Yes please!
See How to Contribute to Open Source.
This package is safe.
MIT © Titus Wormer