Distant origins | Uralic languages
- galois.ai
- Dec 20, 2023
Updated: Jan 20
After a trip back to the time of the Cold War, let's look a little more closely at Uralic languages, their history and some of their characteristics, in particular a phonological one: vowel harmony. Some might object: why dwell on these historical and typological considerations when, at this very moment, billions of billions of artificial neurons are lighting up in the planet's thousands of data centers, learning language far faster than archaic human beings do? We authors, technologists and linguists have the right, and even the duty, to entertain ourselves by stretching the limits of knowledge and promoting its dissemination.
But there is more to say on this subject, even if, in the vast theater that is the Universe, entertaining oneself and others at one's own good pleasure seems important enough to justify the elaborations on the ancestors of Finno-Ugric that follow. (In short, it amounts to building and operating, within the mind, an independent theme park with completely personal and original decors.)
Machines learn language far more efficiently than humans, without affect and simply by frequenting massive data. How much faster can machines learn? Let's make a few inconsequential estimates. First of all, let's consider second language acquisition, i.e. in adolescence or adulthood. A clever typology by the Foreign Service Institute divides languages into four categories of difficulty, with Level I, or easy (e.g., Dutch, French, Italian, Danish), requiring the average native English learner to study for around 600-750 hours, or 24-30 weeks at a rate of 5 hours a day, Saturdays and Sundays off - while Level IV, very difficult (e.g., Korean, Chinese, Arabic), requires almost 88 weeks. We could compare second-language learning to fine-tuning, as the term is applied to deep learning by artificial intelligences. Human neural networks have already been trained - pre-trained, if we stick to computing jargon - in language in general, at the time of acquisition - prodigious at such a young age - of the mother tongue(s). When learning a second language, language-related categories are already present - words and meanings, lexical fields, grammar and syntax, modes and tenses, sounds, phonetic landscape, tones and registers, nuance and ornamentation. All that remains is to make "small adjustments" to a new language: a new vocabulary, certainly, but one whose words correspond (for a large number of language pairs) quite bijectively to the vocabulary items of the native or mastered language; a new grammar, but one which follows certain very expected patterns; a conjugation that presents a fairly predictable panel of tenses and modes, often from the indicative to the subjunctive and conditional, from past to future, with simple and compound tenses, and so on.
In short, even if the learner is ignorant of the new language in the sense of not being able to understand or speak it, he already knows in part what to expect: in addition to a vocabulary that designates just about everything the mother tongue is capable of designating, it will have certain characteristics - of syntax, grammar, logic, structure - and it's by paying particular attention to these characteristics - new, but certainly along linguistic lines similar to those exemplified by the first languages - that learning will gain momentum. Some, hypothesizing the existence of a universal grammar, would argue that these categories are at least partly innate - under this premise, we could say that the human baby is born with a neural network that is pre-trained in the computing sense, and that even first-language acquisition is no more than fine-tuning. But to dwell on this subject would be to support collateral debates that would distract us from our leisure: the genealogy of Uralic languages, and under this pretext, the comparative practice of a little Hungarian and Estonian. And we have yet to estimate the machine's relative efficiency. In any case, we can say that general language learning - pre-training (perhaps in part pre-natal, for the proponents of universal grammar) plus fine-tuning - requires several years, in fact the full span of childhood and adolescence, say, fifteen years of playground practice, against a few months (say three months) for a modern large language model. Fifteen years are 180 months, so we have a factor of 60: machines are 60 times faster at learning language.
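For readers who like to see the arithmetic, here is a small Python sketch of the two estimates above. The function names are ours, and the figures are only the rough ones quoted in the text (5 hours a day, weekends off, fifteen years against three months), not measurements.

```python
# Back-of-the-envelope figures taken from the text; nothing here is measured.
HOURS_PER_DAY = 5
DAYS_PER_WEEK = 5  # Saturdays and Sundays off


def weeks_of_study(total_hours: float) -> float:
    """Convert study hours into weeks at 5 hours a day, 5 days a week."""
    return total_hours / (HOURS_PER_DAY * DAYS_PER_WEEK)


def speedup(human_years: float, machine_months: float) -> float:
    """Ratio of human to machine learning time, both expressed in months."""
    return human_years * 12 / machine_months


print(weeks_of_study(600), weeks_of_study(750))   # 24.0 30.0
print(speedup(human_years=15, machine_months=3))  # 60.0
```

Changing either assumption (say, ten years of childhood or six months of training) moves the factor accordingly, which is why we call the estimate inconsequential.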
(Language) learning brings pleasure through the release of neurotransmitters that contribute to well-being; it satisfies a natural curiosity about the environment and the entire universe; from a narcissistic point of view, it strengthens self-image, which expands the potential to contribute to furthering science, to the greater good, to overcoming the barriers of imagination; it allows entry into the circle of initiates and researchers, and promotes rewarding and satisfying interpersonal exchanges. In short, the fact that machines learn faster is certainly not an argument for resigning and handing intellectual effort over to the machine. And if these days efforts are massively focused on perfecting artificial intelligences - which holds its share of amazement - it is high time to catch up and start working on perfecting humans. Fortunately, there is a lot of thrilling, brain-chemical-producing research to be carried out to improve human language learning.
A second note on humans and machines before we delve into Uralic languages. If we look at the history of machine translation systems from the 1950s to the present day, we see an exponential rise in their performance, paralleled by their equally rapid simplification. The first systems, from the 1950s to the late 1980s, implemented Rule-Based Machine Translation (RBMT): perhaps strongly inspired by a spontaneous approach to translation, the kind achieved by an expert human operator armed with in-depth knowledge of the linguistic characteristics, including syntax, morphology and semantics, of the source and target languages, they generated the translation by analyzing the source text, to which they applied a set of rules. They therefore required the storage of detailed registers of these rules, as well as exhaustive dictionaries compiled by linguists. Thereafter, the statistical approach to machine translation (SMT) dominated until the mid-2010s. Translation then gained naturalness, but continued to falter on complex, specialized or idiomatic structures. Each SMT system was still specialized in a given source-target language pair. But instead of applying explicit linguistic rules inventoried in advance by experts, the system learns the statistical patterns of translation for that language pair by applying statistical methods to large parallel corpora of texts and their translations, and at inference time comes up with the most likely translation for each sentence or phrase. Finally, the most recent advances in deep learning have led to the advent of neural machine translation (NMT). The most recent models are designed around an encoder-decoder architecture with transformers, mainly made up of successive blocks of neurons organized to "learn to pay attention".
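To give "paying attention" a concrete shape, here is a minimal Python sketch of scaled dot-product attention, the basic operation inside transformer blocks, for a single query. It is an illustration only: the vectors are made-up toy values, and real models add learned projections, many heads and many layers.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query with each key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy vectors, invented for the illustration
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([1.0, 0.0], keys, values)
print(weights)  # the first and third keys receive more weight than the second
print(output)
```

The weights say where the query "looks"; training adjusts the vectors so that looking there reduces the translation error.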
The parameters of the neural network - weights that determine how intensely and where ("to which linguistic aspects") to pay attention in order to master language - are optimized in successive iterations over large parallel datasets consisting of source-language sentences and their target-language translations. At each iteration, all the millions or billions of parameters (340 million for BERT, up to 11 billion for T5, 175 billion for GPT-3) are updated so as to minimize the error between the translations obtained with the model's current weights and the ideal translations from the original dataset. While NMT achieves a naturalness and fluency much closer to human expectations and possibilities than earlier machine translation systems, it is less interpretable (black box). We design the model so that it can learn to "pay attention to certain language facts", and we know that it succeeds in doing so, since the result is conclusive, but we don't know precisely which features (to speak in the old categories, which phonological, morphological, syntactic, semantic, grammatical, etc. features) each "attention" block has paid particular attention to. An abundance of scientific literature is beginning to address the reverse-engineering of models and interpretability issues. Confronting these questions is also our aim: we believe that dissecting the workings of artificial intelligences will enable us to reapply them to humans. Artificial neural networks were designed by imitating biological and very human neural networks, and have evolved to become superior (see comment above on the speed of language learning). Their "ways of doing things", which are still a little obscure but in the process of being discovered, can now be "reinjected" into human practices, to boost language learning. But now back to our quick history of major machine translation systems. If there's no doubt about exponential progress, recently boosted by transformers, how can we talk of simplification?
Developing deep learning in general, and models based on the attention mechanism in particular, required decades of complex R&D. The models themselves, with their sophisticated arrangements of billions of virtual neurons, learn to translate in a way that seems all the more complicated as the details of their functioning are inaccessible to us. By contrast, the need for a human expert - the expert linguist who established the rules in the RBMT, or the expert statistician and linguist who developed the SMT's pattern-finding methods - has vanished. Expertise has somehow been transferred to the machine, dethroning the capabilities of the human counterpart, while remaining obscure as to its processes: the new expert in a box certainly delivers outstanding results, but jealously guards his recipe. And we can say that machine translation systems have simplified (seen from the outside) in the sense that processing complexity has been transferred (inside) alongside expertise. Or, we have gone from the cacophony of rules - each specific to a given language or pair of languages - of RBMT systems to the pre-Babelian unity of large language models, which learn language in general in the underground of data centers.
Calls are now being made for a little retrospection, and some are seeking the possible benefits of augmenting neural machine translation systems, with human specialists instilling them with linguistic facts specific to the languages involved: a sort of adjuvant to the very generic large language models, inculcating them with some of the rules (grammar, logic, syntax, semantics) that govern the source and target languages. In [1] for instance, published in November 2023, the authors augment a transformer-based encoder-decoder architecture with a module designed to maximize semantic similarity between source and target sentences - in other words, guaranteeing that each sentence has (nearly) the same meaning as its translation. This Sentence-Transformer resorts to vertical and horizontal fusion methods integrating various features at different levels of the neural network - from low-level features pertaining to individual words to high-level features pertaining to whole sentences - and linked to diverse linguistic aspects - syntax, semantics, context, pragmatics. A similar approach is investigated by [2], where a semantic-unit-level sentence representation, obtained by modeling the integral meanings of semantic units within a sentence, is concatenated with the token-level one, the combination then serving as input to the encoder. Note that these two examples do not, strictly speaking, inject knowledge of linguistic aspects (such as grammar) specific to a given language pair, but they do make explicit in the neural model a module dedicated to sentence semantics, which may also identify and map smaller semantic units.
This trend towards a more hybrid approach in NMT suggests a future where machine translation is not only about processing vast amounts of text data but also about understanding and incorporating the intricate rules and nuances of human languages. Such hybridization holds the potential to yield more accurate, reliable and context-sensitive machine translations where the data-driven approach alone falters despite its great capabilities - when handling complex grammatical structures, or dealing with low-resource languages or specific linguistic contexts.
And, needless to say, we are now thinking about how to enhance our human learner. It seems that language methods have followed a path quite parallel to that of NMT systems. At first scholastic, separating aspects of language, working here with grammar by means of systematic exercises, jotting down vocabulary there in a small notebook, listening elsewhere to pedagogical texts to be repeated, and writing essays on restricted lexical fields; then more integral approaches, total linguistic immersion, multisource linguistic data aggregation without any particular sorting. Following the dialectic of machines, this integrative method (of which we are the first supporters) shall be infused with precise elements of target-language linguistics, in the right dosage.
Uralic languages share some interesting features. One phonological feature common to many of them is vowel harmony. It occurs in Hungarian and Finnic languages with the exception of Livonian, Estonian and Veps. Let's try to figure it out in Finnish, Võro and Hungarian.
In Finnish, we learn that
Suomen kielessä missään omakantaisessa, yhdistämättömässä sanassa ei esiinny sekä etu- /ä ö y/ että takavokaaleja /a o u/, vaan sanat ovat joko etuvokaalisia tai takavokaalisia. Kuitenkin /i/ ja /e/ ovat vokaalisoinnun kannalta lähes neutraaleja ja voivat esiintyä sekä etu- että takavokaalisissa sanavartaloissa, vaikka ne foneettisilta ominaisuuksiltaan kuuluvat etuvokaaleihin. Yhdyssanan pääte määräytyy sanan viimeisen osan mukaan.
and a good automatic translator gives us the following Hungarian equivalent
A finn nyelvben egyetlen eredeti, nem összevont szóban sem szerepel elülső /ä ö y/ és hátsó magánhangzó /a o u/, de a szavakban vagy elő- vagy hátsó magánhangzók találhatók. Az /i/ és az /e/ azonban szinte semlegesek a magánhangzók akkordját tekintve, és az elülső és hátsó magánhangzós szótestekben is megjelenhetnek, pedig fonetikai tulajdonságaik az első magánhangzókhoz tartoznak. Az összetett szó végét a szó utolsó része határozza meg.
Knowing a bit of the latter language and very little of the former, let's attempt some elements of exegesis. Let's break the whole down into elemental semantic units, and identify the parallels.
Suomen kielessä missään omakantaisessa, yhdistämättömässä sanassa ei esiinny sekä etu- /ä ö y/ että takavokaaleja /a o u/, vaan sanat ovat joko etuvokaalisia tai takavokaalisia.
A finn nyelvben egyetlen eredeti, nem összevont szóban sem szerepel elülső /ä ö y/ és hátsó magánhangzó /a o u/, de a szavakban vagy elő- vagy hátsó magánhangzók találhatók.
"In the Finnish language", so far so good.
Suomen kielessä missään omakantaisessa, yhdistämättömässä sanassa ei esiinny sekä etu- /ä ö y/ että takavokaaleja /a o u/, vaan sanat ovat joko etuvokaalisia tai takavokaalisia.
A finn nyelvben egyetlen eredeti, nem összevont szóban sem szerepel elülső /ä ö y/ és hátsó magánhangzó /a o u/, de a szavakban vagy elő- vagy hátsó magánhangzók találhatók.
Let's do a visual acuity test. The Hungarian sentence has the word "word" twice in different forms, szóban in the inessive (place in which one is) singular case, therefore, "in a word", and szavakban, inessive plural, "in words". The equivalent word in Finnish - repeated, but a little different - is sanassa / sanat. Word, then.
Suomen kielessä missään omakantaisessa, yhdistämättömässä sanassa ei esiinny sekä etu- /ä ö y/ että takavokaaleja /a o u/, vaan sanat ovat joko etuvokaalisia tai takavokaalisia.
A finn nyelvben egyetlen eredeti, nem összevont szóban sem szerepel elülső /ä ö y/ és hátsó magánhangzó /a o u/, de a szavakban vagy elő- vagy hátsó magánhangzók találhatók.
Two prefixes apply successively to magánhangzók, elő- and hátsó - two times. Precisely, a little linguistics, as elő- vagy hátsó magánhangzók means front or back vowel. etu and taka appear to be the corresponding Finnish prefixes, as the formula appears twice.
Suomen kielessä missään omakantaisessa, yhdistämättömässä sanassa ei esiinny sekä etu- /ä ö y/ että takavokaaleja /a o u/, vaan sanat ovat joko etuvokaalisia tai takavokaalisia.
A finn nyelvben egyetlen eredeti, nem összevont szóban sem szerepel elülső /ä ö y/ és hátsó magánhangzó /a o u/, de a szavakban vagy elő- vagy hátsó magánhangzók találhatók.
Knowing Finnish and Hungarian for word, vowel, back and front, and now seeing clearly for the whole Hungarian clause the translation but in words are found either front vowels or back vowels, we infer the Finnish meaning of vaan: but, joko..., tai...: either..., or..., ovat: are.
Suomen kielessä missään omakantaisessa, yhdistämättömässä sanassa ei esiinny sekä etu- /ä ö y/ että takavokaaleja /a o u/, vaan sanat ovat joko etuvokaalisia tai takavokaalisia.
sanassa is the Finnish inessive for word, missään is nowhere because ei signals a negative clause, so missään sanassa, in no words. ...ei esiinny sekä etu- että takavokaaleja, ...don't appear both front and back vowels.
Here are the basic elements of harmony. Front vowels, /ä ö y/, and back vowels, /a o u/, and never in any simple word a mixture of back and front vowels.
The remainder of the short Finnish text is already less obscure, as we decipher a few words we've already seen (or whose meaning is obvious to an English speaker).
Kuitenkin /i/ ja /e/ ovat vokaalisoinnun kannalta lähes neutraaleja ja voivat esiintyä sekä etu- että takavokaalisissa sanavartaloissa, vaikka ne foneettisilta ominaisuuksiltaan kuuluvat etuvokaaleihin. Yhdyssanan pääte määräytyy sanan viimeisen osan mukaan.
Let's consider the English equivalent and proceed further with some identifications.
However, /i/ and /e/ are almost neutral in terms of vowel chord and can appear in both front and back vowel word bodies, even though their phonetic properties belong to front vowels. The ending of a compound word is determined by the last part of the word.
Kuitenkin /i/ ja /e/ ovat vokaalisoinnun kannalta lähes neutraaleja ja voivat esiintyä sekä etu- että takavokaalisissa sanavartaloissa, vaikka ne foneettisilta ominaisuuksiltaan kuuluvat etuvokaaleihin. Yhdyssanan pääte määräytyy sanan viimeisen osan mukaan.
Two adverbs, kuitenkin for however and lähes for almost.
kannalta illustrates the Finnish ablative, the sixth of the locative cases with the meaning "from, off, of", and calls for the genitive, here vokaalisoinnun: from the perspective of the vowel chord, that is, in terms of vowel chord.
The conjunction vaikka translates into even though.
The verb kuulua, to belong, conjugates in kuuluvat in the present 3rd person plural. ominaisuuksiltaan is the previous ablative, this time in the plural, from their properties or given their properties. etuvokaaleihin is declined in the illative case, normally used for describing movement towards something, here required by belong, belong to vowels. In total, "even though they belong to front vowels in terms of their phonetic characteristics".
The postposition mukaan, accompanied by the genitive, means based on or according to. The genitive in question is viimeisen osan, viimeisen meaning last and osan, part, so, based on the last part.
määräytyä is a reflexive verb - often in -tyä in Finnish - meaning to be determined, to be decided - literally, to decide or determine itself. määräytyy conjugates it in the third person singular of the present indicative.
pääte in the nominative form means ending. The entire last sentence with pääte as the subject is therefore: the ending of a compound word is determined according to the last part of the word. So much for vowel harmony in compound nouns.
Can we illustrate Finnish vowel harmony with some of the words we met earlier? sanavartaloissa, with the back vowels /a o/ and the neutral vowel /i/, is back-vocalic. määräytyy and pääte are clearly front-vocalic. etuvokaalisia means front-vocalic yet is itself back-vocalic, featuring the back vowels /a o u/ as well as the neutral /e i/. foneettisilta has its ablative in -lta because it is back-vocalic. The front-vocalic ystävä (friend) becomes ystävältä in the ablative. The suffixes of the different grammatical cases are therefore tuned according to harmony.
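The classification we just did by eye can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names are ours, words are treated as plain strings of letters, and the ablative helper merely appends -lta or -ltä to a stem supplied by hand, ignoring the rest of Finnish morphology.

```python
FRONT = set("äöy")
BACK = set("aou")
# /i/ and /e/ are neutral: they are ignored when classifying a word.


def harmony_class(word: str) -> str:
    """Classify a simple (non-compound) word as 'front', 'back' or 'mixed'."""
    letters = set(word.lower())
    has_front = bool(letters & FRONT)
    has_back = bool(letters & BACK)
    if has_front and has_back:
        return "mixed"  # compounds and loanwords can end up here
    return "back" if has_back else "front"  # neutral-only words pattern as front


def ablative(stem: str) -> str:
    """Append -lta or -ltä according to the last non-neutral vowel of the stem."""
    for ch in reversed(stem.lower()):
        if ch in BACK:
            return stem + "lta"
        if ch in FRONT:
            return stem + "ltä"
    return stem + "ltä"  # only neutral vowels: front suffix


for w in ["sanavartaloissa", "määräytyy", "pääte", "etuvokaalisia"]:
    print(w, harmony_class(w))
print(ablative("ystävä"))      # ystävältä
print(ablative("foneettisi"))  # foneettisilta
```

Scanning from the end of the stem is a deliberate choice: it mirrors the rule quoted above that, in a compound, the ending follows the last part of the word.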
What about Võro?
A võro nyelv az uráli nyelvek finn ágához tartozó nyelv. Hagyományosan az észt nyelv egyik déli nyelvjárásának tekintették, de ma már önálló irodalmi nyelvvel rendelkezik, és arra törekednek, hogy hivatalosan is elfogadják mint autochton regionális nyelvet Észtországban.
Let's try to make sense of this short Hungarian description of the Võro language.
A võro nyelv az uráli nyelvek finn ágához tartozó nyelv.
This speaks of the Võro language, and a võro nyelv is the grammatical subject of the first sentence.
A võro nyelv az uráli nyelvek finn ágához tartozó nyelv.
The verb to be is omitted: "When the verb is used as a copula i.e. if one speaks about what someone or something is, it is omitted in the third person singular and plural of the present tense." az ... nyelv is the predicate or subject complement, specifying what the subject a võro nyelv is.
A võro nyelv az uráli nyelvek finn ágához tartozó nyelv.
The highlighted group completes the noun nyelv. tartozó is a present participle used as an attributive adjective, meaning belonging to and calling for the allative case, here in -hoz, which generally describes a movement towards. finn ágához means to the Finnish branch and finn ágához tartozó, belonging to the Finnish branch. uráli nyelvek characterizes ágához: the Finnish branch of Uralic languages, with ágához bearing the possessive mark. (We recall here that in Hungarian, the thing possessed (grammatically) bears the possessive suffix, while the possessor can also, but need not always, be marked by the dative case.)
In total, we have
The Võro language is a language belonging to the Finnish branch of the Uralic languages.
and we read further.
Hagyományosan az észt nyelv egyik déli nyelvjárásának tekintették, de ma már önálló irodalmi nyelvvel rendelkezik, és arra törekednek, hogy hivatalosan is elfogadják mint autochton regionális nyelvet Észtországban.
hagyományosan is an adverb. Hungarian forms adverbs with the suffix -an/-en or -ul/-ül, where the choice of vowel depends on, well, vowel harmony. Just one of the characteristics of (most) Uralic languages, around which we're playing here. As in Finnish, Hungarian words, with the exception of recent loanwords, have either back vowels only, or front vowels only.
Hungarian back vowels: a, á, (i), (í), o, ó, u, ú
Hungarian front vowels: e, é, (i), (í), ö, ő, ü, ű
i, í do not contribute to the classification as back/front vowel word if other vowels are present. For instance, tekintették is front-vocalic, because of the vowel e. i is neutral in the designation as front-vocalic. irodalmi is back-vocalic.
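As a small Python sketch, the two vowel lists above translate directly into a classifier. The function names are ours, and the adverb helper covers only the -an/-en alternation on the words met in the text, not the other adverbial suffixes or stem changes of Hungarian; treating neutral-only words like front ones is our assumption.

```python
BACK = set("aáoóuú")
FRONT = set("eéöőüű")
# i and í are neutral: they count only when no other vowel is present.


def classify(word: str) -> str:
    """Return 'back', 'front', 'mixed' or 'neutral' for a Hungarian word."""
    letters = set(word.lower())
    has_back = bool(letters & BACK)
    has_front = bool(letters & FRONT)
    if has_back and has_front:
        return "mixed"  # typically compounds or recent loanwords
    if has_back:
        return "back"
    if has_front:
        return "front"
    return "neutral"


def adverb(adjective: str) -> str:
    """Append -an to back-vocalic adjectives and -en otherwise (simplified)."""
    return adjective + ("an" if classify(adjective) == "back" else "en")


print(classify("tekintették"))  # front
print(classify("irodalmi"))     # back
print(adverb("hagyományos"))    # hagyományosan
print(adverb("hivatalos"))      # hivatalosan
```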
hagyományosan means traditionally. tekintették illustrates the use of the third person plural active form to express the passive voice. There are several ways of expressing the passive in Hungarian, and the use of an explicit passive form is the rarest. The first is the one used here: the third person plural active form, literally "they" have considered, with an implicit, impersonal "they", or in more common English, has been considered. Hungarian also expresses the passive with the middle voice. With middle voice verbal forms (also known as unaccusative verbs) the subject is not the agent but the patient or experiencer of the action. The suffixes -ul/-ül are commonly used to turn a verb with an active meaning into a middle voice form. Examples: épül (to be built) from épít (to build), alakul (to form or take shape) from alakít (to form or shape something). tekintették is here accompanied by the dative, nyelvjárásának: has been considered a dialect of.... Finally, az észt nyelv egyik déli nyelvjárásának features the possessive: a southern dialect (thing possessed, in the grammatical sense, egyik déli nyelvjárásának - note that the dative marked by the suffix -nak does not participate in the expression of the possessive, but is required by tekintették; the dative mark of the possessive, if any, is borne by the possessor) of the Estonian language (possessor, in the grammatical sense, az észt nyelv).
In total, here is how the Võro language has long been considered.
It has traditionally been considered a southern dialect of the Estonian language,
Let's analyze further.
Hagyományosan az észt nyelv egyik déli nyelvjárásának tekintették, de ma már önálló irodalmi nyelvvel rendelkezik, és arra törekednek, hogy hivatalosan is elfogadják mint autochton regionális nyelvet Észtországban.
The highlighted part translates into
but today it has an independent literary language
rendelkezik accompanied by the instrumental -val/-vel means has, possesses. Since nyelv is front-vocalic (because of the vowel e), the instrumental gives nyelvvel. önálló, independent, is built from ön, self, and álló, standing, the present participle of the verb áll, to stand.
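The choice between -val and -vel can be sketched in Python as follows. This is a deliberately simplified illustration, and the function name is ours: it assumes that any back vowel makes the word back-vocalic, that the v of the suffix assimilates to a final single-letter consonant, and it leaves aside digraphs (sz, ny, ...) and the lengthening of a final a or e.

```python
BACK_VOWELS = set("aáoóuú")
ALL_VOWELS = set("aáeéiíoóöőuúüű")


def instrumental(noun: str) -> str:
    """Form a simplified Hungarian instrumental in -val/-vel."""
    # Harmony: a back vowel anywhere in the noun selects the back suffix vowel
    back = any(ch in BACK_VOWELS for ch in noun.lower())
    vowel = "a" if back else "e"
    last = noun[-1]
    if last.lower() in ALL_VOWELS:
        return noun + "v" + vowel + "l"  # after a vowel the v is kept
    return noun + last + vowel + "l"     # after a consonant the v assimilates


print(instrumental("nyelv"))  # nyelvvel
print(instrumental("ház"))    # házzal
print(instrumental("autó"))   # autóval
```

With nyelv the assimilation is invisible, since the final consonant is already a v; ház (house) and autó (car), which do not appear in our text, show the two other cases.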
And finally
Hagyományosan az észt nyelv egyik déli nyelvjárásának tekintették, de ma már önálló irodalmi nyelvvel rendelkezik, és arra törekednek, hogy hivatalosan is elfogadják mint autochton regionális nyelvet Észtországban.
we learn that our Võro language
is striving to be officially accepted as an autochthonous regional language in Estonia.
Note the use of accusative nyelvet with mint, which introduces the predicate nominative, mint autochton regionális nyelvet: as an autochthonous regional language.
We learned a lot about the Võro language, but also about vowel harmony in Hungarian. What about vowel harmony in Võro?
Vabahelle kokkokõla (ka vabahelükokkokõla) vai vokaalharmoonia om keele helüoppusõ (kavvõassimilatsiooni) säädüs, mink perrä sõna lõpupoolõ vabahelüq piät olõma samma sorti ku sõna alostusõ vabahelüq. Nii saavaq vabahelükokkokõlaga keelin üten sõnan ollaq õnnõ ütte sorti vabahelüq, hariligult kas õnnõ edepoolidsõq vai õnnõ tagapoolidsõq.
It's in Võro. Is the (official, North-) Estonian equivalent close?
Vabahäälikute harmoonia (ka vokaalharmoonia) on keeleteaduse (kaashäälikuassimilatsiooni) reegel, mille kohaselt peavad sõna lõpuosas olema samatüübilised vabahäälikud nagu sõna alguses. Nii võivad vokaalharmooniaga keeltes ühes sõnas esineda ainult üht tüüpi vabahäälikud, tavaliselt kas eespool asuvad või tagapool asuvad vabahäälikud.
This translation was produced by a generative AI, within the limits of its current knowledge. Given the example of a language, Võro, which can certainly be described as rare, its accessible corpus (on the Internet, for example) being limited, as are the resources for learning it (lexicon, translation tools, grammar), let's illustrate once again our method of learning languages by proceeding like machines: by paying attention. We try to divide each sentence into clauses or units of meaning, and find out which part of the Estonian sentence each of these subsets "pays attention" to. We can then rejoice in deciphering Võro without a dictionary, insofar as we manage to reconstruct the meaning, word by word, word group by word group, in the presence of a translation in a slightly less unfamiliar language.
The closeness is obvious from the very first clause.
Võro
Vabahelle kokkokõla (ka vabahelükokkokõla) vai vokaalharmoonia om keele helüoppusõ (kavvõassimilatsiooni) säädüs,
Estonian
Vabahäälikute harmoonia (ka vokaalharmoonia) on keeleteaduse (kaashäälikuassimilatsiooni) reegel,
The word for vowel is highlighted: the same root vabah- appears in both languages.
Vabahelle kokkokõla (ka vabahelükokkokõla) vai vokaalharmoonia om keele helüoppusõ (kavvõassimilatsiooni) säädüs,
Vabahäälikute harmoonia (ka vokaalharmoonia) on keeleteaduse (kaashäälikuassimilatsiooni) reegel,
to be in the third person present tense is on in Estonian, and probably om in Võro.
Vabahelle kokkokõla (ka vabahelükokkokõla) vai vokaalharmoonia om keele helüoppusõ (kavvõassimilatsiooni) säädüs,
Vabahäälikute harmoonia (ka vokaalharmoonia) on keeleteaduse (kaashäälikuassimilatsiooni) reegel,
Let's wager that säädüs means rule. This is all the more likely as Finnish has sääntö for rule, related as it is to the Hungarian word sző. All of them are assumed to stem from Proto-Uralic *säŋä, which later branched out into Proto-Finnic *sää. And reegel betrays an ancient borrowing from Low German, further illustrating those Germanic influences on Estonian we spoke of a short while ago.
Vabahelle kokkokõla (ka vabahelükokkokõla) vai vokaalharmoonia om keele helüoppusõ (kavvõassimilatsiooni) säädüs,
Vabahäälikute harmoonia (ka vokaalharmoonia) on keeleteaduse (kaashäälikuassimilatsiooni) reegel,
The exact same word, keele, for the genitive of language, but teaduse means science and might not be a precise translation of helüoppusõ. õppima in Estonian means to learn or to study, and heli means sound or tone. (keele) helüoppusõ could be the science of sounds, phonology.
Vabahelle kokkokõla (ka vabahelükokkokõla) vai vokaalharmoonia om keele helüoppusõ (kavvõassimilatsiooni) säädüs,
Vabahäälikute harmoonia (ka vokaalharmoonia) on keeleteaduse (kaashäälikuassimilatsiooni) reegel,
The similarity is still going strong.
mink perrä sõna lõpupoolõ vabahelüq piät olõma samma sorti ku sõna alostusõ vabahelüq
mille kohaselt peavad sõna lõpuosas olema samatüübilised vabahäälikud nagu sõna alguses
peavad olema and piät olõma both mean must be.
mink perrä sõna lõpupoolõ vabahelüq piät olõma samma sorti ku sõna alostusõ vabahelüq
mille kohaselt peavad sõna lõpuosas olema samatüübilised vabahäälikud nagu sõna alguses
mille kohaselt means according to which, and finds its obvious equivalent in mink perrä.
mink perrä sõna lõpupoolõ vabahelüq piät olõma samma sorti ku sõna alostusõ vabahelüq
mille kohaselt peavad sõna lõpuosas olema samatüübilised vabahäälikud nagu sõna alguses
Very close. At the end of the word. So, peavad sõna lõpuosas olema or sõna lõpupoolõ piät olõma mean must be in the end part of the word.
mink perrä sõna lõpupoolõ vabahelüq piät olõma samma sorti ku sõna alostusõ vabahelüq
mille kohaselt peavad sõna lõpuosas olema samatüübilised vabahäälikud nagu sõna alguses
Likewise, the highlighted expressions, in Võro and Estonian respectively, correspond to each other: at the beginning of the word.
mink perrä sõna lõpupoolõ vabahelüq piät olõma samma sorti ku sõna alostusõ vabahelüq
mille kohaselt peavad sõna lõpuosas olema samatüübilised vabahäälikud nagu sõna alguses
samatüübilised vabahäälikud nagu means the same type of vowels as, just like samma sorti ku ... vabahelüq. The kinship is obvious, samma and samatüübilised, vabahelüq and vabahäälikud.
In short, according to the linguistic rules of vowel harmony, the end of the word must have the same type of free vowels as the beginning of the word.
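The pairing we have just done by eye can be imitated, very crudely, in Python. The sketch below is our own illustration, not the way a translation model works: it uses the standard library's difflib.SequenceMatcher to score how similar two spellings are and, for each Võro word, picks the closest Estonian word. Surface similarity is only a stand-in for "attention", and it will pair some words wrongly.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Spelling similarity between two words, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(source_tokens, target_tokens):
    """For each source token, return the most similar target token and its score."""
    pairs = []
    for s in source_tokens:
        best = max(target_tokens, key=lambda t: similarity(s, t))
        pairs.append((s, best, round(similarity(s, best), 2)))
    return pairs


voro = ("mink perrä sõna lõpupoolõ vabahelüq piät olõma "
        "samma sorti ku sõna alostusõ vabahelüq").split()
estonian = ("mille kohaselt peavad sõna lõpuosas olema "
            "samatüübilised vabahäälikud nagu sõna alguses").split()

for v, e, score in align(voro, estonian):
    print(f"{v:12} -> {e:16} {score}")
```

Identical words such as sõna align perfectly, and close cognates such as olõma and olema score high; pairs like perrä and kohaselt, which share a meaning but not a shape, are exactly where the human reader still has to reason from context.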
We'll confine ourselves to broadly observing the similarities in the second sentence.
Nii saavaq vabahelükokkokõlaga keelin üten sõnan ollaq õnnõ ütte sorti vabahelüq, hariligult kas õnnõ edepoolidsõq vai õnnõ tagapoolidsõq.
Nii võivad vokaalharmooniaga keeltes ühes sõnas esineda ainult üht tüüpi vabahäälikud, tavaliselt kas eespool asuvad või tagapool asuvad vabahäälikud.
The English translation confirms what we already know about vowel harmony.
Thus, in languages with vowel harmony, only one type of free vowel can occur in a word, usually either front or back free vowels.
Ultimately, the back vowels in Võro are o, a, u, õ and the front vowels ö, ä, ü, e, while i is neutral. Looking at vabahelüq, however, harmony is not perfect in Võro: e and ü are front vowels, whereas a is a back vowel.
Here's a brief explanation in Võro.
Võro keelen käü-üi hariligult vabahelükokkokõla ala sääntseq sõnalõpuq nigu -o, -he, -ga(q), -ku(q)/-gu(q), nt nägo, ilosahe, imäga, elägu, tütrigu. Niisama küünü-üi vabahelle kokkokõla üle liitsõna piiri, nt süküskuu, elotüü.
A few exceptions confirm the rule.
In the Võro language, vowel harmony usually does not apply to certain word endings like -o, -he, -ga(q), -ku(q)/-gu(q), for example nägo (face), ilosahe (beautifully), imäga (with mother), elägu (may live), tütrigu (daughter). Similarly, vowel harmony does not extend across the boundaries of compound words, for instance süküskuu (autumn month), elotüü (life's work).
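These two exceptions can be folded into a small Python checker. It is a sketch under our own assumptions: the vowel classes are the ones listed above, compound boundaries are marked by hand with a "+" (a notation we introduce for the purpose), and at most one exempt ending is stripped from each member by simple string matching, which is far cruder than a real morphological analysis.

```python
BACK = set("aouõ")
FRONT = set("äöüe")
# i is neutral. Endings outside harmony, longest first so that -gaq wins over -ga.
EXEMPT_ENDINGS = ("gaq", "kuq", "guq", "ga", "ku", "gu", "he", "o")


def vowel_classes(part: str) -> set:
    """Set of vowel classes ('back', 'front') found in a word part."""
    classes = set()
    for ch in part.lower():
        if ch in BACK:
            classes.add("back")
        elif ch in FRONT:
            classes.add("front")
    return classes


def strip_exempt_ending(part: str) -> str:
    """Remove one exempt ending, if the part ends with one."""
    for ending in EXEMPT_ENDINGS:
        if part.endswith(ending) and len(part) > len(ending):
            return part[: -len(ending)]
    return part


def is_harmonic(word: str, boundary: str = "+") -> bool:
    """Check each member of a compound separately, ignoring exempt endings."""
    return all(
        len(vowel_classes(strip_exempt_ending(p))) <= 1
        for p in word.split(boundary)
    )


for w in ["nägo", "ilosahe", "imäga", "elägu", "tütrigu", "süküs+kuu", "elo+tüü"]:
    print(w, is_harmonic(w))
print("süküskuu", is_harmonic("süküskuu"))  # False without the boundary mark
```

Under the same assumptions, vabahelüq passes once it is split as vaba+helüq, which suggests that its apparent disharmony is a matter of compounding.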
The above texts clearly illustrate that standard Estonian does not comply with vowel harmony. For example, in alguses and vabahäälikud, front and back vowels coexist (even within the same element of a compound noun). Let's say, then, that Estonian disobeys vowel harmony more often or less regularly than Võro - depending on how we look at it.
Finally, back to our human and machine preoccupations: knowledge of certain characteristics of Uralic languages undoubtedly speeds up analysis and comprehension in these languages. Lots of data and a little linguistics.
-------------------------------------------------------------------------------------------------------------------------------------------------------
[1] Li, Jiaxin & Jin, Rize & Paik, Joon-Young & Chung, Tae-Sun. (2023). Neural Machine Translation with an Awareness of Semantic Similarity. 10.1007/978-981-99-7022-3_20. https://www.researchgate.net/publication/375550753_Neural_Machine_Translation_with_an_Awareness_of_Semantic_Similarity
[2] Huang, Langlin & Gu, Shuhao & Zhang, Zhuocheng & Feng, Yang. (2023). Enhancing Neural Machine Translation with Semantic Units. pp. 2264-2277. 10.18653/v1/2023.findings-emnlp.149. https://www.researchgate.net/publication/376401539_Enhancing_Neural_Machine_Translation_with_Semantic_Units