
How brains disambiguate

Updated: Jan 20

Let's look closely.  


Az üzletek nemrégiben bezártak.

The shops have recently closed.


Íme a hatalmas épületek listája. A bezártak között van egy új iskola is.

Here is the list of huge buildings. Among those closed is a new school.


A brain new to Hungarian may find itself a little lost when, in a hasty reading, it has to interpret the word highlighted above. bezártak in Hungarian can serve two functions: it can be the third person plural indefinite past tense of the verb bezár (to close), or the past participle bezárt (closed) taken here as a noun, in the nominative plural. Two different forms, perfect homographs, which can confuse the novice translator. Perhaps even more eloquently:


A zárunk mindig megbízható.

Our lock is always reliable.


Este be is zárunk, elnézést kérünk a kellemetlenségekért.

We close in the evening, we apologize for the inconvenience.


Above, we had the same spelling for two verbal forms (one of which was substantivized). This time, the homographs' grammatical functions are even more distinct: in the first sentence, zárunk is a noun in the nominative bearing the 1st person plural possessive (our lock), while in the second, it is the conjugated part of the verb bezárunk (be is a separable particle, or coverb), in the indefinite 1st person plural present indicative.


We have just illustrated that two perfectly homographic "marked" (Hungarian) forms can have very different grammatical functions, requiring specific translations. (In linguistics, a "marker" is a morpheme that specifies grammatical functions such as tense, case, person or number. Words (or phrases) bearing such markers are said to be "marked", which distinguishes them from their more neutral or basic form (often the lemma found in dictionaries).) In the first case, -tak marks the 3rd person plural past tense, while the second bezártak adds to the same stem bezár the marker -t of the past participle and then -ak of the nominative plural. In the second case, the exact same morpheme, -unk, can mark either the present 1st person plural of a verb or the possessive 1st person plural of a noun. In both cases, we're perplexed. Shouldn't a perfect language be unambiguous?
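To fix ideas, here is a minimal sketch - toy data, not a real morphological analyzer - of how a single surface form can map to several competing analyses; the entries simply restate the readings discussed above:

```python
# Toy analysis bank restating the readings discussed above (illustrative only).
ANALYSES = {
    "bezártak": [
        {"lemma": "bezár", "pos": "VERB", "gloss": "past, 3rd person plural, indefinite"},
        {"lemma": "bezárt", "pos": "NOUN", "gloss": "substantivized participle, nominative plural"},
    ],
    "zárunk": [
        {"lemma": "zár", "pos": "NOUN", "gloss": "possessive 1st person plural, nominative"},
        {"lemma": "bezár", "pos": "VERB", "gloss": "present, 1st person plural (coverb 'be' separated)"},
    ],
}

def readings(form):
    """Every grammatical reading recorded for a surface form (empty if unknown)."""
    return ANALYSES.get(form, [])

for form in ("bezártak", "zárunk"):
    print(form, "->", len(readings(form)), "competing readings")
```

A disambiguator's whole job is to pick one entry from each such list, using context.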


Clearly, the ambiguity described in the opening does not have the same magnitude for a native speaker and a new learner. Disambiguation is in most cases completely unconscious in the mother tongue: attribution to one or another grammatical or functional category, the concomitant attribution of a meaning, is more or less immediate, and if there is momentary confusion - lack of attention, more complex sentence than average -, this can be corrected instantly. This would seem to apply to the polyfunctionality at work here (when the same morphological marker performs several grammatical or semantic functions, as with the suffix -unk). (One might even imagine that if the morphological ambiguity has persisted, it's precisely because it has posed few elucidation problems so far). Interpretation is much more problematic for a stranger to Hungarian, especially if he speaks a distant language, or for a machine translation program.


There are still other ways of confusing a translator - other forms of morphological ambiguity - with homonymy, when words of the same spelling (homographs) or sound (homophones) (or both) carry quite distinct meanings, or with polysemy, when the same word may convey several meanings, often related. Here again, anyone fluent in Hungarian will have no trouble distinguishing, depending on the context, between the noun vár, of Iranian origin, meaning castle, and the verb vár, of Finno-Ugric descent, for to wait. Or distinguishing between the meanings of the verb kerül, depending on its use:


  • intransitive use with a lative suffix to mean to get somewhere,


Biztos nehéz helyzetbe került.

He must have got into a difficult situation.


  • transitive use to mean to avoid,


Mindig igyekszik kerülni a zárt tereket.

He always tries to avoid closed spaces.


  •  intransitive use with an illative case (-ba/-be) to mean to cost.


Az orvoshoz járás túl sok pénzbe kerül.

Going to the doctor costs too much money.


Other ambiguities can be more problematic, resisting even the conscientious reader for several seconds, not to mention machines. The Winograd Schema Challenge tests the ability of artificial intelligence to resolve referential ambiguities. A Winograd Schema consists of a pair of sentences, differing by only one or two words, containing an ambiguous pronoun whose referent must be found, a disambiguation that requires some prior knowledge and reasoning based on common sense - the most widely shared thing in the world, among humans, according to Descartes, but let's see the full quote:


Le bon sens est la chose du monde la mieux partagée : car chacun pense en être si bien pourvu, que ceux même qui sont les plus difficiles à contenter en toute autre chose, n’ont point coutume d’en désirer plus qu’ils en ont.


Common sense is the most widely shared thing in the world: for everyone thinks they are so well endowed with it that even those who are the most difficult to please in all other respects are not wont to desire more of it than they have. [1]


Here is an example of a Hungarian Winograd schema, the kind that puts common sense to the test.


A városi tanácsnokok megtagadták a tüntetőktől az engedélyt, mert féltek az erőszaktól.

A városi tanácsnokok megtagadták a tüntetőktől az engedélyt, mert az erőszakot pártolták.


The English translation features the same puzzle.


The city councilmen refused the demonstrators a permit because they feared violence.

The city councilmen refused the demonstrators a permit because they advocated violence.


Who was fearful, and who was promoting violence? An anaphoric demonstrative pronoun seems to reduce ambiguity, azok in


A városi tanácsnokok megtagadták a tüntetőktől az engedélyt, mert azok az erőszakot pártolták.


and an English translation making the reference resolution clear could be


The city councilmen refused the demonstrators a permit because of their advocacy of violence.


We can see here that the coreference problem can be defused by rephrasing. In the same way, if the context does not allow a precise, unambiguous interpretation of a polysemic word or a word struck by homonymy, wouldn't it be simpler to reformulate? If there is any ambiguity, the speaker need only express himself more clearly. Yet we must recognize that a) despite every effort at clarity, it is impossible to get rid of the ambiguity of language altogether, and b) clarity depends on the degree of initiation into a given language, from native speaker to newcomer (to the machine). Let's be more precise.


Language maps the "thoughts" of a transmitter (in quotation marks, now that the latter can be a simple machine) onto a sequence of signs. A receiver, sometimes human, can extract meaning from this amalgamation of signs, because he is sufficiently formed (trained) in deciphering it. Nothing more here than a reminder of a few obvious facts: language imperfectly fulfils the function of transmitting information. By analogy with information theory, we could say that the channel is noisy, that compression - of thought into sign - and decompression - the reverse - potentially (necessarily) entail a loss of information along the line, and that language works as a model of thought which necessarily, out of pragmatism, for the sake of expediency, simplifies its contours.


Here, language is seen as a simple vehicle for communication, and its ambiguity seems to be a flaw, the result of a fundamental imperfection: the imperfection of the information transfer channel it constitutes. Ants, whales and even plants also communicate, via their own form of language, sound and waves, gestures and dance, chemistry in the form of biogenic volatile organic compounds. Human language, a very recent evolutionary phenomenon on the scale of life on earth, probably appeared between 100,000 and 70,000 years ago. According to the Strong Minimalist Thesis, it stands out from less sophisticated means of communication by its hierarchical syntactic structure, in other words, a structure based on a merge rule, "a single, repeatable operation that takes exactly two syntactic elements a and b and assembles them to form the set {a, b}". (Bolhuis, Johan & Tattersall, Ian & Chomsky, Noam & Berwick, Robert, 2014, [2])


Questioning how language emerged in the course of evolution, the aforementioned authors argue that it is not (only) an evolutionary branch of living communication systems in general, nor the strict corollary of a specialization, a growth of the organs and faculties of speech (and listening, language assimilated to speech) or sight (language assimilated to signs). While acquisition of a particular (native) language is a matter of geographical or contextual contingency, the merge function that "gives language its humanity" is innate. Several proofs are adduced. For example, it is verified that an essential property of human language, that of "displacement", can be derived from it. Also called "duality of semantic patterning", this property is illustrated when a word assumes two roles in a given context, when it is interpreted twice according to different functions. In "(Guess) what boys eat", to mimic the authors' example, "what" is interpreted both as the object - what the boys eat - and as an interrogative operator. "Merge" is applied twice: between {boys} and {eat, what}, then between the result, {boys, {eat, what}}, and the interrogative {what} - that's one way of looking at things: {what, {boys, {eat, what}}}.
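The double role of "what" can be mimicked with a one-operation sketch of Merge - a deliberately naive rendering, using frozensets for the unordered pairs:

```python
def merge(a, b):
    """Chomsky's Merge: take exactly two syntactic objects, form the set {a, b}."""
    return frozenset([a, b])

what = "what"
ew = merge("eat", what)        # {eat, what}
inner = merge("boys", ew)      # {boys, {eat, what}}
question = merge(what, inner)  # {what, {boys, {eat, what}}}

# The same element "what" is a member of two sets: object down below,
# interrogative operator at the top - displacement from one operation.
print(what in ew, what in question)
```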


Seen in this light, ambiguity is just as constitutive of language as it is when we consider language a noisy channel as defined by information theory. But it has now a far more advantageous connotation: it is no longer the effect of a limitation, but of an evolutionary refinement that makes human language exceptional. It's this fundamental property, its hierarchical syntactic structure, that gives language its infinite flexibility, so that it can best espouse thought. And if we are to revive the metaphor of language as a simple vehicle for communicating thought, language is certainly an imperfect model (no model is perfect), but it is an extremely evolved and rich one, capable of conveying all the nuances of thought.


Regardless of one's linguistic school, there is an obvious compositional character to language, evident to any language learner when recognizing in words, for example, stems and functional markers. Syntax, the rules for combining words into clauses and their tree representations; semantic analysis, how words combine to create meanings; markedness, whether words are stems or carry grammatical or semantic markers such as affixes; derivation, illustrated by word families, ancestral links, or cognates across languages; phonological patterns, the arrangement of phonemes; and prosody, the interplay of rhythms and intonation in full sentences - all these aspects speak for composition, which implies some basic components, some atoms: roots, concepts, ancestors, or sounds, assembled from the simplest to the most complex.


Let's try to derive from these simple observations some implications for language learning. Since language (and thought) is by nature compositional, there might be a limited set of basic concepts (certain "atoms of computation", to speak with Chomsky [3]) that we can seek to master to speed up the process of learning a new language; given the higher frequency of these basic elements, we can also rely on asymptotic statistics to ascertain that we will start by learning these concepts anyway if we expose ourselves to a large amount of textual data. Therefore, the brain that disambiguates is certainly a statistical brain. By scanning massive data, it has become accustomed to the most frequent atoms of language and thought. It knows how to spot invariants - the roots, the stems - and, by exclusion, what varies - the markers, the meaning modulators and their patterns. Of course, we would like to confirm this hypothesis by experiment. Two comments on a challenging procedure.


The ideal experiment consists of subjecting a sufficient number of complete beginners in Hungarian, natives of a distant language (an Indo-European one, for example), to ultra-attentive reading, condensed over a short period of time, of a significant volume of Hungarian literature, in various registers: belles lettres, sciences, gazette, daily life, etc. Obviously, we can't expect a large number of volunteers to read for dozens of hours without understanding a word, and enthusiastic candidates are likely to display psycho-cognitive properties that make them unrepresentative of the population, which would limit the possibility of drawing conclusions. Otherwise, a vast paid campaign would be needed - profession: cryptic reader - but without intrinsic motivation, we can hardly guarantee that sufficient and sustained attention would be mobilized. Our point is that, in the course of such an obscure monolingual reading (perhaps aided by a protocol we shall detail), the most common language atoms are spotted, imprinted and retained, along with certain patterns of varying elements, conjugation markers, grammatical cases and so on. The rhythm, the color of the Hungarian sentence becomes evident, even before a meaning can be posited.


In anticipation of larger-scale works, and hoping to set an example, we lend ourselves to the exercise, and peruse a large number of pages by Mikszáth Kálmán. [4] (We are not perfect subjects, as we already know a great deal of basic Hungarian grammar and some vocabulary. However, we are perfectly capable of observing the positive effects of the experiment, starting from a non-zero level, or, as we used to say so as not to slight forgetful Latinists, as a grand débutant - French for "absolute beginner".)


Simple statistics on the Hungarian Web Corpus [5], comprising a large number of web documents for a total of x words, show that the lemma (dictionary entry) sok (many, much) appears more than 2866 times per million words, the adjective nagy (big) 1975 times, and még - an adverb that, in many ways, qualifies the more of something - 2631 times. még interestingly exemplifies the previously introduced concept of polysemy. The adverb can mean the continuation (in time) of an occurrence (still) or, with a negation, its opposite, the continuation of absence, that is, the not-yet. (The three quotes hereafter are from Mikszáth Kálmán's A Tisza.)


"Mert még csak gyermek volt akkor a mi kis Mariskánk."

Because our little Mariska was only a child then.


Literally, "was still only a child" where still is rendered by még and restriction, only, by csak.


Furthermore, több, the comparative of sok (many, much), translates as more. még then expresses the even-more, the outbidding (még több, even more).


Hálás, elismerő leszek, még az imádságomba is beszövöm, míg élek, egész házanépét.

I'll be grateful, I'll be appreciative, I'll even include in my prayers, as long as I live, all his household.


These concepts, of "even longer" and "even more", are not perfectly distinct (eating more of something is equivalent to eating longer of that something), and the words that express them lie on a continuum, insisting respectively on time or quantity at either end, so that some occurrences of még may accept a translation by still, as well as by more (in particular, more time).


Van-e még idő?

Is there still time? (or Is there more time, is there time left?)


So, let's go back to our monolingual reading experiment on massive data. By the law of large numbers and the concrete properties of the Hungarian corpus, the brain becomes accustomed to very frequent words such as sok, még, or nagy. As the human brain is considerably slower than the machine, it necessarily focuses on a tiny subspace of the corpus (here, a short story by Mikszáth Kálmán whose title refers to an Eastern European river, a major tributary of the Danube), and is likely to encounter essentially one or a few major lexical fields - a few concepts and lemmas that recur constantly. Here, once the small, common words have been run through, certain "meaningful" words top the frequency chart, such as szegény (poor) occurring 5 times, gyerek (child), asszonyság (madam) and fejét (head, in the accusative) 4 times, or öreg (old) 3 times. Let's try to dissect how a brain may operate along such an attentive but "uncomprehending" perusal.
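Counts of this kind - whether per-million figures on a web corpus or raw tallies in a short story - follow one recipe; a toy sketch on a made-up sample (not the actual corpus):

```python
from collections import Counter

def per_million(tokens):
    """Occurrences of each token, normalized per million tokens of the sample."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c * 1_000_000 / total for t, c in counts.items()}

# Tiny invented sample; real figures require corpora of millions of words,
# counted over lemmas rather than raw forms - hence the need to lemmatize.
sample = "sok még nagy még sok ház még".split()
freq = per_million(sample)
print(round(freq["még"]))  # 3 of 7 tokens ≈ 428571 per million
```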


First encounter with szegény.


Szegény öreg nótáknak az a sorsa!


The structure, at least, is clear to us: the present indicative verb to be is omitted, and we have a form of expression of possession where the thing possessed (grammatically), sorsa, bears the 3rd person possessive ending, and the dative in -nak applies to the possessor. In all, something like "This is the something of the [szegény] such something!".


Mi lesz a szegény kis Mariskánkból? Oh, istenem, istenem! Kétségbeesik az a gyerek... megőrül... a vízbe ugrik. Se ennivalója, se egy garasa, elhagyatva, eltépve tőlünk, mit fog csinálni szegényke? Kihez megy ott Szolnokon?


Let's interpret the first sentence assuming nothing about its lexicon. A little word like mi opening a question sounds like a fairly simple interrogative pronoun, like what or who or maybe which. And then we remember the first occurrence of szegény, and that the word there looked quite like an adjective. This is confirmed: here, after the article a, the adjective szegény qualifies Mariskánkból. The nominal group emerges (in the elative case, the space from which one comes out), and lesz resembles a verbal form. Something like "What does out of the szegény such something?". Well, we are on the right track. In fact, rather than "doing from somewhere", it is more likely to "depart" from somewhere or to "become" from some state. In other words, the elative helps us interpret the verb, and a little background knowledge tells us that lesz can be the future tense of van, to be, or the verb of becoming: to become, to get, to turn into, which pairs well with -ból. So, "What will become of the szegény such something?".


The following proposition containing szegény is also a question, and it begins with mit, clearly related to mi - we confirm the family of interrogative pronouns. -ni in csinálni marks an infinitive, and fog is a good candidate for a verbal form, the subject being readily identified: szegényke, a substantivized version of our word of main interest (in fact, we might know that -ke forms a diminutive). How do you know that -ni marks the infinitive if you don't know it? Well, Kálmán's short text has already presented, at this stage of our scanning, 9 infinitive verbs in -ni. Our attention to each of the corresponding sentences has already shed a lot of new light. So, "What [should/could/must/can/...] the little something do?", where we don't really know the modality expressed by fog and assume the simplest possible meaning for the infinitive, do. Why mit and not mi? Earlier, mi had the function of subject. Here, it's probably the object. So we learn, without even noticing, that the object case, the accusative, is marked by the suffix -t.
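The hypothesis-forming at work here - same stem, varying endings - can be caricatured in a few lines: a naive longest-common-prefix heuristic, applied to a hypothetical family of forms around csinálni:

```python
def stem_and_endings(forms):
    """Guess a shared stem (longest common prefix) and list the varying endings -
    the learner's candidate grammatical markers."""
    stem = forms[0]
    for form in forms[1:]:
        while not form.startswith(stem):
            stem = stem[:-1]  # shrink the candidate stem until it fits
    return stem, [form[len(stem):] for form in forms]

# Hypothetical sibling forms of csinálni (to do), as a reader might collect them.
stem, endings = stem_and_endings(["csinálni", "csinál", "csinálta"])
print(stem, endings)  # csinál ['ni', '', 'ta']
```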


The last interesting occurrence of our adjective happens here.


A zsebkendője után nyúlt, de a dohányzacskóját húzta ki szegény, azzal törölgette zavarában a szemeit, minélfogva a dohányportól elkezdett irtózatosan prüszkölni, de ő bizony azt észre sem vette ebben a mindent fölülmúló szerencsétlen helyzetében.


Focusing on the proposition de a dohányzacskóját húzta ki szegény, the first thing we notice is the -t, a potential accusative mark, in dohányzacskóját. This ties in with our fundamental claim, which underlies the exercise and which this progressive example aims to prove: close reading provokes hypotheses, and these hypotheses are confirmed from sentence to sentence, from one occurrence to the next of the same "phenomenon" - lexicon, marker, order, structure. In fact, this is the fourth time in the short story that we encounter an -át ending (the accent on the á is important) in a word other than the very common hát, meaning well, then, of course, etc. The long á (lengthened, as the accent indicates) carries a possessive, and the final -t the accusative, together forming the possessive accusative form. szegény seems to be substantivized and to assume the role of subject. Perhaps we already know that the coverb ki in the (probable) verb kihúzta often means out. So, "..., but the something did out his something else, ...". The meaning of the verb becomes clearer, possibly to get out (of one's bag), to pull out, etc. And if we miss the meaning of dohány (tobacco), it may become clearer when we encounter it again in the next sentence - in any case, we will recognize it, the spelling will imprint itself, and it will be less and less foreign to us.


Now is the time to detail how we could help this monolingual reading (supposedly teaching us Hungarian without our having to learn anything voluntarily) to multiply its effectiveness. In addition to observation and attention, we could also search the dictionary, very sparingly. Consult, to give an order of magnitude and a recipe, 5% of new words. Or look up all the words on a given page, then continue reading "blind". Mere statistical repetition then takes care of the confirmation, of the imprinting in memory.


As we have already explained, we would like to carry out this experiment on a large number of subjects, and the present comment illustrates sufficiently the difficulty of obtaining "good" participants. Now, a question we missed or did not take seriously enough when we talked about még: does the reading brain really disambiguate between its different senses or, rather, nuances (even more, longer, still, yet)? For in the realm of polysemy, the different meanings are linked and intermingled. Is it a matter of "découpage des sens" (slicing up the senses)?


« Le phénomène si typique du langage naturel qu'est la polysémie pose au moins trois problèmes étroitement liés (...) celui du découpage des sens, c'est-à-dire de leur découverte et de leur définition; celui des relations que ces sens entretiennent et celui de la levée des ambiguïtés au plan du discours. »  


« Polysemy, that phenomenon so typical of natural language, poses at least three closely linked problems (...): that of the division of meanings, that is to say, of their discovery and their definition; that of the relationships these meanings maintain; and that of the resolution of ambiguities at the level of discourse. »


(R. Martin, La polysémie verbale. Esquisse d'une analyse formelle de la polysémie., 1972)


The statistical brain of our reading exercise (in fact, a monumental exercise, since it's not a question of reading a short story about a branch of the Danube, but an entire library in Hungarian) doesn't seem to go into that much detail. The renewed impression of the same roots, of the same lemmas, declined in various forms, gradually confirms the semantic halo that surrounds them.


Therefore, as we have now sufficiently demonstrated, our disambiguating brain is primarily a lemmatizing (statistical) brain. How does it lemmatize, or how should it optimally lemmatize? We could look at how machines do, which brings us to the second comment promised earlier, around our experiment to prove that brains disambiguate thanks to statistics tallied on massive corpora - the first being that the experiment lacks volunteers. As artificial neural networks were designed by analogy with the living, there is every chance that, now that they have evolved to the point of threatening to dethrone their initial biological models, their reverse-engineering could teach us best practices applicable to human brains. So the question we are now interested in is, how do machines lemmatize?


A recent study evaluates refinements of HuSpaCy - a Python package providing Hungarian models for the well-known industrial-strength natural language processing pipeline spaCy - aimed at improving on existing lemmatization techniques. [6]


HuSpaCy's lemmatization system comprises three subcomponents:


a) a dictionary-based lemmatizer for unambiguous cases, where replacing a word with its lemma is straightforward;


b) a neural network-based end-to-end lemmatizer for all other cases, which automatically learns rules from training data on how to transform words into their corresponding lemmas;


c) hand-made rules (e.g., suffix-cutting) for special cases where approach b) fails due to insufficient training data.
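The three-stage architecture a)-c) amounts to a cascade with fallbacks; a minimal sketch (all interfaces here are hypothetical, not HuSpaCy's actual API):

```python
def lemmatize(word, dictionary, model, rules):
    """Cascade: a) unambiguous dictionary lookup, b) learned model, c) suffix-cutting rules."""
    if word in dictionary:                  # a) straightforward replacements
        return dictionary[word]
    guess = model(word)                     # b) learned lemmatizer (may abstain)
    if guess is not None:
        return guess
    for suffix, replacement in rules:       # c) hand-made fallback rules
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word                             # give up: the form is its own lemma

# Toy run: the dictionary is silent, the model abstains, a suffix rule fires.
print(lemmatize("házakat", {}, lambda w: None, [("akat", ""), ("ak", "")]))  # ház
```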


Our primary interest, of course, is in ambiguous cases. We shall take a look at the deep learning models at play, and see if their philosophy can also apply to human learning.


The neural network-based approach b) implements the Lemming lemmatizer. [7] Here, we must delve into the mathematics and algorithms at work in order to draw some conclusions. (We use the notations in the source document, but the precise wording is our own.) At its core, Lemming is a log-linear model,

p(l | w, m) ∝ exp( λ · f(l, w, m) )

Let's reformulate once more. Given a word w (with morphological properties m), our model assigns each candidate lemma l a probability of being "the true one", the lemma that a human linguist would deem to correspond to the form w.
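In code, such a log-linear scorer is just a softmax over summed feature weights; a toy sketch with invented features and weights (not Lemming's actual feature set):

```python
import math

def lemma_probabilities(candidates, features, weights):
    """p(l | w, m) proportional to exp(sum of weights of candidate l's active features)."""
    scores = {l: sum(weights.get(name, 0.0) for name in features(l)) for l in candidates}
    z = sum(math.exp(s) for s in scores.values())           # normalization constant
    return {l: math.exp(s) / z for l, s in scores.items()}

# Two candidate lemmas for the form "bezártak"; feature names and weights are made up.
probs = lemma_probabilities(
    ["bezár", "bezárt"],
    features=lambda l: [f"lemma={l}", "tree=strip-tak" if l == "bezár" else "tree=strip-ak"],
    weights={"lemma=bezár": 1.0, "tree=strip-tak": 0.5},
)
print(max(probs, key=probs.get))  # bezár
```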


For each form, the model computes a specific set of candidate lemmas, proceeding in the following way. A corpus preprocessing step prior to training has accumulated (we shall see how in a moment) a general catalog of all possible transformations from a form to a lemma. (An "edit tree" enters the catalog if it is found in at least two form-lemma pairs <w, l>.) Then, for each form, the set of candidates is obtained by applying all the applicable edit trees (some are clearly not suited) and adding the lemmas (if any) already associated with that form beforehand.


Now, a word about how to extract an edit tree from a given pair <w, l> (which occurs during preprocessing). The algorithm finds the longest common substring (LCS), then recursively models the prefix and the suffix. When no LCS is found, the transformation is considered a substitution. The edit tree retains the lengths of the affixes and the substitutes at substitution nodes. The illustration below, the edit tree extracted from the pair <umgeschaut, umschauen> in German, is from the original research paper.

[Figure: edit tree for <umgeschaut, umschauen> - LCS "schau", prefix transformation ge → ε, suffix transformation t → en]


 

Let's get back to the human brain for a second. By silently reading Mikszáth Kálmán's incomprehensible work, it may have stored a large number of edit trees, from forms to lemmas - but not in exactly the same way as Lemming, because its learning lacks supervision. Unless it checks a dictionary, it is not given the correct lemma for a form, which would enable it to extract the tree. It can, however, identify forms likely to be associated with the same lemma, compute their longest common substring and recurse on the affixes.


The short story A Tisza, for instance, features the sequence h-o-z-z four times, in elhozzák, mihozzánk, hozza and hozzá, in that order. Encountering mihozzánk, we remember elhozzák, perhaps hastily scan the text above to confirm the exact form, and extract a kind of edit tree transforming elhozzák into mihozzánk, around the LCS hozz. The following occurrences further confirm the hypotheses about a lemma - probably close to hoz or hozz or hozza - and its affixes. (The lemma is in fact hoz - the dictionary entry being the 3rd person singular present indicative of the verb - for hozza and elhozzák, while hozzá and mihozzánk are lemmas themselves.)
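The extraction procedure described above - find the longest common substring, then recurse on what remains on either side - fits in a short function; a sketch, simplified relative to the paper's implementation:

```python
def longest_common_substring(a, b):
    """(start_a, start_b, length) of the longest common substring, or None."""
    best = None
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > 0 and (best is None or k > best[2]):
                best = (i, j, k)
    return best

def edit_tree(form, lemma):
    """Substitution leaves; inner nodes store the form's prefix/suffix lengths."""
    if not form and not lemma:
        return None
    match = longest_common_substring(form, lemma)
    if match is None:
        return ("sub", form, lemma)              # no shared part: plain substitution
    i, j, k = match
    return ("node", i, len(form) - i - k,
            edit_tree(form[:i], lemma[:j]),          # recurse on the prefixes
            edit_tree(form[i + k:], lemma[j + k:]))  # recurse on the suffixes

# The paper's German example: keeps "schau", drops "ge", swaps "t" for "en".
print(edit_tree("umgeschaut", "umschauen"))
```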


Let us now recall the form of the Lemming model. The output probabilities it computes are proportional to an exponential function of features. In machine learning, the features of a model are "what the model is called upon to pay attention to". The model's parameters adjust, as training proceeds, to minimize its error of judgment, by focusing on the features indicated by its architecture - or, in this case, hand-crafted by engineers based on recipes from previous models.

 

Lemming must pay attention to five types of features during training:

i) edit tree features - the edit tree itself, the word-edit-tree pair, and for all affixes, the affix-edit-tree pair;

ii) alignment features - including the alignment of lemma and form, character by character for the LCS and block by block for substitution nodes, e.g., u-u, m-m, ge-ε, s-s, c-c, h-h, a-a, u-u, t-en;

iii) lemma features - the lemma itself, and lemma-related affixes (frequent affixes);

iv) dictionary features - whether the lemma appears in the dictionary, whether it appears a significant number of times on Wikipedia;

v) the form's part-of-speech (e.g. verb, noun, adjective, etc.) and its morphological properties.

 

Earlier, we outlined the lemmatization model, which yields the probability of each candidate lemma knowing the word and its morphological properties,

p(l | w, m) ∝ exp( λ · f(l, w, m) )
In fact, Lemming is slightly more complex than that. It considers words in context, that is, at each position, the morphological tags not only of the current position, but also of previous ones. More precisely, the sequence of these tags is modeled by a Conditional Random Field (CRF), and if we denote f the features i) to iv) and g the morphological features, the model becomes

p(l_1…l_N, m_1…m_N | w_1…w_N) ∝ ∏_i exp( λ · f(l_i, w_i, m_i) + θ · g(m_i, m_{i-1}) )
The optimization procedure then iteratively adjusts the model parameters, θ and λ. The calibrated model obtained predicts with a high degree of accuracy the lemma most likely to be associated with a given form, regardless of whether this form was seen during training or not.
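A caricature of this joint, context-sensitive scoring - not Lemming's actual code - shows how a neighboring tag can tip the balance, using the zárunk example from the beginning:

```python
import math

def joint_score(words, lemmas, tags, f, g, lam, theta):
    """Unnormalized score of one joint (lemma, tag) assignment:
    product over positions of exp(λ·f(l, w, m) + θ·g(m, previous m))."""
    total, prev = 0.0, "<s>"
    for w, l, m in zip(words, lemmas, tags):
        total += lam * f(l, w, m) + theta * g(m, prev)
        prev = m
    return math.exp(total)

# Invented features: f rewards lemmas prefixing the form, g rewards likely tag bigrams.
f = lambda l, w, m: 1.0 if w.startswith(l) else 0.0
g = lambda m, prev: 1.0 if (prev, m) in {("<s>", "DET"), ("DET", "NOUN")} else 0.0

words = ["a", "zárunk"]  # "a zárunk" - our lock, with its article
noun = joint_score(words, ["a", "zár"], ["DET", "NOUN"], f, g, 1.0, 1.0)
verb = joint_score(words, ["a", "bezár"], ["DET", "VERB"], f, g, 1.0, 1.0)
print(noun > verb)  # the article's tag pushes the reading towards the noun
```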

 

At this stage, a proximity between Lemming and the human learner is no more than a hypothesis, which we are anxious to demonstrate in laboratory settings. (And we shall be pondering the delicate design of compelling experiments in the near future.) In any case, it is not absurd to imagine that the disambiguating brain, especially if new to the language of interest, looks for analogies and longest common substrings. That it surveys texts in search of lemmas, that edit trees are collected, confirmed or adjusted: an implicit, hardly conscious learning, by example, of morphological rules. That attention is focused, during learning, on very specific features, chosen for their effectiveness. That optimization, towards a robust lemmatization model, considers not only the current word, but also its context.


Let us conclude these prolegomena to the future understanding and augmentation of the cognitive functions of language learning with some practical considerations. The tedious experiment we subjected ourselves to, whose asymptotic form consists in reading the entire works of Mikszáth Kálmán in the original, without any prior knowledge of Hungarian, is apparently a matter of unsupervised learning. The monolingual text represents our input data, with no labels to guide us. However, we need to nuance this idea: in the course of reading and concomitant linguistic progress, the brain likely builds up, as it were, a bank of labels - e.g., the edit trees of lemmatization; it then becomes capable of labeling a new sentence in a rapid preliminary preprocessing.


We illustrate with a passage from A Tisza, assuming that the guinea pig reader has already read a fifth of Mikszáth Kálmán's work.


Mindenki úgy javasolta, kényelmesebb is, rövidebb is. Mondanom sem kell, hogy a tiszai hajóknak még akkoriban híre sem volt.


Everyone suggested that it is more comfortable and shorter. Needless to say, the Tisza ships had no reputation even at that time.


This is how we can assume his brain tags a sentence at first sight, employing the arsenal of linguistic means developed so far by simple observation.


Mindenki úgy javasolta, kényelmesebb is, rövidebb is. etc.

form            | Mindenki  | úgy    | javasolta | kényelmesebb | is          | rövidebb   | is
part of speech  | pronoun   | adverb | verb      | adjective    | conjunction | adjective  | conjunction
lemma           | minden    | úgy    | javasol   | kényelmes    | is          | rövid      | is
edit tree       | e1        | x      | e2        | e3           | x           | e3         | x
meaning         | everybody | so     | x         | more-x       | also        | more-rapid | also

etc.


This pre-processing gives learning an increasingly supervised character. So we bet we can make our exercise (both more user-friendly and) more efficient by making the learning explicitly (semi-)supervised: the text is interspersed with tags of various kinds - the translation of the form being deliberately excluded. The aim is to train the brain to recognize forms, to resolve ambiguities.
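Interspersing such tags is easy to mechanize once a label bank exists; a sketch (the bank entries echo the guesses from the table above; the function itself is hypothetical):

```python
# Label bank accumulated by reading - toy entries echoing the table above.
TAG_BANK = {
    "javasolta": ("verb", "javasol"),
    "kényelmesebb": ("adjective", "kényelmes"),
    "rövidebb": ("adjective", "rövid"),
}

def annotate(sentence):
    """Intersperse known (part of speech, lemma) tags; translations excluded."""
    out = []
    for token in sentence.split():
        bare = token.strip(".,!?")
        if bare in TAG_BANK:
            pos, lemma = TAG_BANK[bare]
            out.append(f"{token}[{pos}:{lemma}]")
        else:
            out.append(token)
    return " ".join(out)

print(annotate("Mindenki úgy javasolta, kényelmesebb is, rövidebb is."))
```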




-------------------------------------------------------------------------------------------------------------------------------------------------------------

[1] René Descartes, 1637, Discourse on the method (Discourse on the Method of Rightly Conducting One's Reason and of Seeking Truth in the Sciences; French: Discours de la Méthode Pour bien conduire sa raison, et chercher la vérité dans les sciences)


[2] Bolhuis, Johan & Tattersall, Ian & Chomsky, Noam & Berwick, Robert. (2014). How Could Language Have Evolved?. PLoS biology. 12. e1001934. 10.1371/journal.pbio.1001934.


[3] Berwick, Robert & Chomsky, Noam. (2016). Why Only Us: Language and Evolution. 10.7551/mitpress/9780262034241.001.0001.


[4] Mikszáth Kálmán, see Wikisource in Hungarian, https://hu.wikisource.org/wiki/Szerz%C5%91:Miksz%C3%A1th_K%C3%A1lm%C3%A1n


[5] The great online tool Sketch Engine for corpora statistics and analysis, and in particular, the description of the Hungarian Web Corpus: https://www.sketchengine.eu/hutenten-hungarian-corpus/


[6] Orosz, György & Szabó, Gergő & Berkecz, Péter & Szántó, Zsolt & Farkas, Richárd. (2023). Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines. 10.1007/978-3-031-40498-6_6.


[7] Müller, Thomas & Cotterell, Ryan & Fraser, Alexander & Schütze, Hinrich. (2015). Joint Lemmatization and Morphological Tagging with Lemming. 2268-2274. 10.18653/v1/D15-1272.

