Preliminary thoughts on Voynichese Part of Speech tagging

Marco Ponzi, Feb 1

Software tools for unsupervised Part of Speech (POS) tagging have been around for several years.

Algorithms mostly look for which words appear next to the same words.

On a large enough corpus, such behaviours correlate with the grammatical role of each word.

The application of such techniques to Voynichese (the unknown language of the Voynich manuscript, Beinecke ms 408) has been mentioned in the thesis of Gianluca Bosi (Bologna University) and I am sure others must have experimented with this approach.

The simple experiments discussed here are based on software developed for Alexander Clark’s 2003 paper Combining Distributional and Morphological Information for Part of Speech Induction.

Clark discusses a number of ideas that could help in dealing with rare words (one of the problems with Voynichese), but here I will only focus on an older algorithm re-implemented by Clark (originally described in Hermann Ney, Ute Essen, and Reinhard Kneser, 1994, On structuring probabilistic dependencies in stochastic language modelling).

 I am not sure I understand much of the theory: I use the software as a black box.

I think the algorithm works on bigrams, i.e. considering couples of consecutive words: words are assigned to classes in such a way that the possibility of guessing the next word given the current word is maximized.

There will be classes which tend to appear consecutively and others that rarely or never follow each other.

The number of word classes is one of the parameters of the algorithm.
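A minimal sketch of this idea, assuming a class-based bigram model in which the probability of the next word is approximated by P(next class | current class) * P(next word | next class). This is not Clark's code: the function names and the naive greedy loop are hypothetical, and a real implementation would update counts incrementally instead of rescoring the whole text after every trial move.

```python
import math
from collections import Counter

def class_bigram_log_likelihood(words, assignment):
    """Log-likelihood of the word sequence under a class-based bigram model."""
    classes = [assignment[w] for w in words]
    class_count = Counter(classes)
    word_count = Counter(words)
    prev_count = Counter(classes[:-1])          # classes that have a successor
    transition_count = Counter(zip(classes, classes[1:]))
    total = 0.0
    for c1, c2, w2 in zip(classes, classes[1:], words[1:]):
        p_class = transition_count[(c1, c2)] / prev_count[c1]
        p_word = word_count[w2] / class_count[c2]
        total += math.log(p_class * p_word)
    return total

def exchange_clustering(words, n_classes, max_sweeps=10):
    """Greedy exchange: move each word to the class that most improves the score."""
    by_freq = [w for w, _ in Counter(words).most_common()]
    # Arbitrary deterministic start: spread words over classes by frequency rank.
    assignment = {w: i % n_classes for i, w in enumerate(by_freq)}
    best = class_bigram_log_likelihood(words, assignment)
    for _ in range(max_sweeps):
        improved = False
        for w in by_freq:
            best_class = assignment[w]
            for c in range(n_classes):
                assignment[w] = c
                score = class_bigram_log_likelihood(words, assignment)
                if score > best + 1e-9:
                    best, best_class, improved = score, c, True
            assignment[w] = best_class
        if not improved:
            break
    return assignment, best

words = ("it ascends from the earth to the heaven and again "
         "it descends to the earth").split()
assignment, score = exchange_clustering(words, n_classes=3)
print(score, assignment)
```

On a text this short the result depends heavily on the starting assignment, so the classes it finds need not match those produced by the actual software.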

A simple sentence

As an example, this is the three classes tagging of a simple sentence:

0:it 2:ascends 1:from 0:the 2:earth 1:to 0:the 2:heaven 1:and 1:again 0:it 2:descends 1:to 0:the 2:earth

The two most frequent words (“it” and “the”) are assigned to class:0.
The words that precede those in class:0 are assigned to class:1.
The words that follow those in class:0 are assigned to class:2.

Basically, the detected structure is 0,2,1,0,2,1,0,2,1… Even this simple example is not totally regular and in one case a class:1 word is followed by a second class:1 word (“1:and 1:again”).

But class:0 is always followed by class:2, and class:2 is always followed by class:1.
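These regularities can be checked by counting the pairs of consecutive classes in the tagged sentence. The sketch below assumes the `class:word` format shown above; the helper names are made up for illustration.

```python
from collections import Counter

tagged = ("0:it 2:ascends 1:from 0:the 2:earth 1:to 0:the 2:heaven "
          "1:and 1:again 0:it 2:descends 1:to 0:the 2:earth")

def parse_tagged(text):
    """Turn 'class:word' tokens into a list of (class_id, word) pairs."""
    pairs = []
    for token in text.split():
        cls, word = token.split(":", 1)
        pairs.append((int(cls), word))
    return pairs

def class_bigram_counts(pairs):
    """Count how often each class is immediately followed by each class."""
    classes = [c for c, _ in pairs]
    return Counter(zip(classes, classes[1:]))

counts = class_bigram_counts(parse_tagged(tagged))
for (c1, c2), n in sorted(counts.items()):
    print(f"{c1}_{c2} {n}")
```

Only four of the nine possible class pairs occur: 0_2, 2_1, 1_0 and the single irregular 1_1.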

This example also shows how such a system can be useful for Part of Speech identification:

The two prepositions (“to” and “from”) are tagged as class:1, together with two other “function words” (“and” and “again”).
All verbs and nouns (ascends, descends, heaven, earth) are tagged as class:2.

Of course, the variability in all natural languages is so high that a meaningful tagging can only be produced on the basis of a very large corpus.

English vs Voynichese

In order to avoid difficulties with the differences between Currier languages and have as uniform a text as possible, I focused on a single section of the Voynich manuscript: quire 20.

As a benchmark, I used a portion of the Book of Genesis from the King James Bible, considering a similar number of words (about 10,500).

In both cases, I only used 5 POS classes.

Such a number is obviously too small to correctly represent all the different part-of-speech categories, but it makes analysis and discussion easier.

I fed the whole text to the algorithm, without any kind of punctuation or sentence boundary marks.

These are the most frequent 20 words for each of the classes for the English text:

C:0           C:1           C:2           C:3           C:4
tokens:1860   tokens:2159   tokens:2095   tokens:2048   tokens:2294
types:131     types:84      types:122     types:399     types:439
ratio:14.1    ratio:25.7    ratio:17.1    ratio:5.1     ratio:5.2
hapax:67      hapax:31      hapax:41      hapax:190     hapax:189

and 1117      the 866       of 441        it 93         earth 99
that 141      he 146        in 163        be 87         said 93
upon 58       his 131       to 131        him 79        lord 87
which 52      i 109         unto 121      all 76        years 65
lived 34      god 102       was 109       thee 67       will 53
forth 27      a 98          shall 95      abram 59      sons 47
into 24       thou 82       is 79         them 53       man 45
came 24       every 69      after 69      not 43        hundred 44
as 24         thy 68        were 67       noah 41       had 37
out 22        they 62       for 66        me 37         name 34
on 21         their 36      with 65       her 37        wife 33
but 21        my 35         begat 64      there 31      land 33
up 15         s 33          from 61       went 30       waters 32
saying 13     she 30        shalt 34      also 26       day 32
then 12       an 29         when 25       you 20        have 30
because 12    two 21        made 25       this 20       days 30
at 12         cain 18       called 23     one 20        seed 27
wives 10      three 15      make 22       old 20        ark 26
therefore 9   five 14       behold 22     daughters 20  flesh 25
where 7       nine 12       let 21        eat 19        son 23

class:0 mostly conjunctions and adverbs
class:1 determiners and numbers (but also subject pronouns)
class:2 12 verbs + 8 prepositions
class:3 contains several object pronouns (him, thee, them, me, her), but it is quite mixed
class:4 16 nouns + 4 verbs (3 of which are auxiliary); even if they do not appear among the most frequent words, several adjectives are also assigned to this class

“Hapax” is the number of hapax legomena (i.e. words which only occur once in the whole text).

This value is obviously anti-correlated with the tokens/type ratio.

Classes that include fewer word types tend to have fewer hapax legomena.

The number of tokens per class is roughly constant.
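For reference, the four figures reported for each class (tokens, types, tokens/type ratio, hapax legomena) can be computed from a tagged text along these lines. This is a sketch that assumes each word type is always assigned to the same class, so that a word occurring once in its class also occurs once in the whole text; the function name is hypothetical and the data is the toy sentence from the earlier example.

```python
from collections import Counter, defaultdict

def class_statistics(pairs):
    """pairs: iterable of (class_id, word). Returns {class_id: stats dict}."""
    per_class = defaultdict(Counter)
    for cls, word in pairs:
        per_class[cls][word] += 1
    stats = {}
    for cls, counts in sorted(per_class.items()):
        tokens = sum(counts.values())   # running words in the class
        types = len(counts)             # distinct words in the class
        stats[cls] = {
            "tokens": tokens,
            "types": types,
            "ratio": tokens / types,
            "hapax": sum(1 for n in counts.values() if n == 1),
        }
    return stats

tagged = ("0:it 2:ascends 1:from 0:the 2:earth 1:to 0:the 2:heaven "
          "1:and 1:again 0:it 2:descends 1:to 0:the 2:earth")
pairs = [(int(t.split(":")[0]), t.split(":")[1]) for t in tagged.split()]
stats = class_statistics(pairs)
for cls, row in stats.items():
    print(cls, row)
```

Even in this tiny example, the class holding the two most frequent words (class 0) has the highest tokens/type ratio and no hapax legomena.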

It could be that function words concentrate in classes with a high tokens/type ratio and a low number of hapax legomena.

The most frequent sequences of two consecutive classes are (the numbers correspond to occurrences of the sequence in the tagged text):

1_4 1828
2_3 1010
3_0 917
2_1 868
4_2 862
4_0 804
0_1 801
3_2 500

They can be represented by the following graph:

[Figure: graph of the most frequent class sequences in the English text]

Some sequences that match those illustrated in the graph:

1:the 4:days 2:of 3:enos 2:were 1:nine 4:hundred 0:and 1:five 4:years
1:the 4:bow 2:shall 3:be 2:in 1:the 4:cloud
1:a 4:dove 2:from 3:him 2:to 3:see 2:if 1:the 4:waters 2:were
0:and 1:the 4:lord 2:plagued 3:pharaoh 0:and 1:his 4:house
0:that 1:his 4:brother 2:was 3:taken

Of course, many more sequences appear a significant number of times.

It is also evident that word classes are not clear-cut.

Yet the results illustrate how this software can detect something relevant, at least for the most frequent words, even with a relatively short text.

These are the most frequent 20 words for each of the five classes for Voynich Quire 20 (using the EVA transcription by Zandbergen and Landini, including uncertain spaces):

C:0           C:1           C:2           C:3           C:4
tokens:1817   tokens:2585   tokens:1692   tokens:1893   tokens:3125
types:613     types:618     types:550     types:787     types:507
ratio:2.9     ratio:4.1     ratio:3.0     ratio:2.4     ratio:6.1
hapax:465     hapax:396     hapax:407     hapax:623     hapax:203

aiin 208      chedy 200     qokeey 159    ar 149        qokaiin 121
al 138        chey 125      qokeedy 136   or 72         daiin 121
y 122         shedy 116     qokedy 61     otar 59       l 111
ol 113        okeey 97      oteedy 57     otain 52      qokain 100
okain 68      shey 82       lchedy 52     r 49          okaiin 97
ain 58        cheey 78      qokey 40      cheo 39       otaiin 77
dain 48       oteey 63      lkeey 34      s 32          chol 63
air 35        otedy 60      okedy 32      char 32       o 55
am 27         cheol 55      qoteedy 31    qotar 29      otal 52
sain 24       sheey 49      lkeedy 30     dair 27       dar 44
a 21          okeedy 46     qoky 29       lkar 22       qokar 43
aiiin 20      chdy 38       qol 27        sar 21        okal 43
cheeo 18      chckhy 38     qotedy 25     ch 21         raiin 41
qokol 13      keedy 34      qoteey 22     lor 20        qokal 41
shody 11      sheedy 26     qoty 19       kar 19        okar 41
oteol 9       sheol 25      qotal 18      otair 18      qotaiin 40
ral 8         cheedy 24     qokeeey 17    chor 16       lkaiin 39
okol 8        shol 22       lkedy 17      chear 15      kaiin 36
oiin 8        keey 21       otey 15       tar 14        saiin 35
cheeody 8     chedaiin 21   okeeey 15     aiir 14       qotain 34

class:0 aiin and variants; mostly very short words
class:1 initial bench (ch/sh) -y or -l ending; 13 words start with a bench ending with -l or -y (vs a total of 3 in the top 20 words of the other four classes)
class:2 [ql].[yl]: 16 words start with q- or l- and end with -y or -l (vs a total of 2 in the top 20 words of the other four classes)
class:3 final -r (16 words vs a total of 4 in the top 20 words of the other 4 classes)
class:4 -aiin -ain -al as suffixes i.e. attached to preceding characters (14 words vs 7 in the top 20 words of the other 4 classes)

I thought that some of the classes might correspond to line-initial / line-final or paragraph-initial / paragraph-final words, which are known to be peculiar.

But this is not the case:

       l.init.   l.final   p.init.   p.final
C:0    248       217       57        32
C:1    202       203       33        55
C:2    144       138       54        28
C:3    218       251       58        52
C:4    272       275       26        61

The most frequent sequences in the tagged text:

1_4 1147
3_0 1135
4_1 1006
1_2 945
0_1 797
4_3 746
4_4 630
2_4 540

These frequent sequences can be represented by this graph:

[Figure: graph of the most frequent class sequences in Voynich Quire 20]

The following are a few fragments that illustrate word sequences compatible with the most frequent sequences:

<f104v.13> 4:lkaiin 3:cheetar 0:aiin 1:cheitaiin
<f105v.11> 4:okaiin 3:os 0:aiin 1:chckhodu 2:qoteedy
<f106r.40> 3:ar 0:aiin 1:sheey 4:lkaiin 1:sheedy
<f111v.32> 1:chey 4:tain 3:chkar 0:alkar 1:chey 2:qol
<f112v.18> 4:saiin 3:or 0:aiin 1:chey 2:qokeedy
<f112v.46> 1:cheky 4:chokain 3:char 0:am 1:chey 4:kain
<f113v.47> 4:lkaiin 4:tair 1:shey 4:qotain 3:ar 0:akal 1:shey

Again, this is highly simplified: words in each class do not entirely fall into the morphological patterns I described, and, though those listed above are the most frequent consecutive occurrences of word classes, all other combinations occur, even if some of them are extremely rare.

Discussion

In the output for Voynichese, words in each class are morphologically similar.

This can be verified quantitatively by a similarity function like Levenshtein ratio (1 for identical words, 0 for totally different words).

The average similarity among the 100 words made up of the 20 most frequent words of each of the 5 classes can be used as a benchmark.
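A minimal sketch of such a measure, under two assumptions that the text does not settle: the ratio is taken as 1 minus the edit distance divided by the length of the longer word (libraries that offer a "Levenshtein ratio" may normalize differently), and the average is taken over all unordered pairs of words in the set. Each EVA letter is treated as one character. The function names are illustrative only.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions all cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """1.0 for identical words, 0.0 for words with nothing aligned."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def average_similarity(words):
    """Mean similarity over all unordered pairs of distinct list entries."""
    pairs = list(combinations(words, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

print(average_similarity(["chedy", "shedy", "chey"]))
print(average_similarity(["and", "the", "of"]))
```

With a different normalization the absolute values would change, so only comparisons made with the same function are meaningful.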

The values of this average are:

English Genesis 0.195
Voynich Quire20 0.266

Possibly due to the well-known rigidity of Voynich morphology, the average similarity is considerably higher in Voynichese than in English.

If we compare the average similarity among the 20 most frequent words within each class, we can see that in English the values are close to the benchmark.

On the contrary, in Voynichese, only one of the classes (c:0) has a value close to the overall average.

All the other classes have higher values, with the similarity ratio of c:2 being more than double the average value.

[Figure: Similarity between the top 20 words in each class]

There are several non-mutually exclusive reasons that could explain this similarity:

Similar words correspond to morphological variants of a unique root. E.g. in many languages the singular and plural of a word are different but quite similar words.

Similar words correspond to accidental spelling differences.

In medieval manuscripts, it is not uncommon to see the same word written differently.

The fact that prefixes and suffixes of Voynichese words are dependent on the suffix and prefix of the preceding word is well known.

According to the Transformation Theory by Emma May Smith, this could be the effect of phonological adaptations.

This phenomenon could be so pervasive that it misleads the algorithm into classifying morphology rather than part of speech.

It is also worth noting that “loop sequences” with repeated occurrences of the same class are much more frequent in Voynichese than in English.

       English   Voynich
0_0    37        119
1_1    18        187
2_2    75        423
3_3    212       393
4_4    190       630
TOT    532       1752

Only a small part of these 1752 “loop sequences” in Voynichese is due to exact reduplication (like “daiin daiin”).

158 cases of exact reduplication occur in quire 20.

These involve 92 words, which are spread on all of the 5 classes:

c0 8
c1 18
c2 22
c3 16
c4 28

My superficial impression is that the high number of loops is a symptom of a poor classification; grammatically, one can expect that some word classes do not appear consecutively.

We can observe this in the English example, where c0 and c1 (correlating with conjunctions and articles respectively) have particularly rare consecutive occurrences of members of the same class.

Also, one can expect that reduplication is restricted to only a few part-of-speech classes (the fact that reduplicating words are assigned to all classes is surprising).
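Counts of this kind can be obtained with a single pass over the tagged text. The sketch below assumes the text is treated as one continuous sequence (as it was fed to the algorithm, without boundary marks), so a repetition across a line break would also be counted. The function name and the short tagged fragment are made up for illustration.

```python
from collections import Counter

def count_loops(pairs):
    """pairs: list of (class_id, word) in text order.

    Returns a Counter of same-class consecutive pairs per class and the
    number of exact reduplications (the same word twice in a row).
    """
    loops = Counter()
    reduplications = 0
    for (c1, w1), (c2, w2) in zip(pairs, pairs[1:]):
        if c1 == c2:
            loops[c1] += 1
        if w1 == w2:
            reduplications += 1
    return loops, reduplications

# Invented fragment, tagged with the classes these words have in the tables.
example = [(4, "daiin"), (4, "daiin"), (1, "chey"), (1, "shedy"), (3, "ar")]
loops, reduplications = count_loops(example)
print(dict(loops), sum(loops.values()), reduplications)
```

Here two loops are found (one in class 4, one in class 1), and only the first is an exact reduplication.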

Further research

This line of investigation promises a number of potentially interesting experiments.

Here are some ideas:

Performing a more detailed analysis of the composition of each class.

Here I focussed on the 20 most frequent words of each class, but looking at all words could provide different insights.

Using a higher number of classes.

Introducing a “sentence-start” marker.

This is an option mentioned by Clark.

Quire20 is divided into a number of paragraphs: a precious piece of linguistic information.

Of course, one can expect that Grove words will be grouped into a single class (they are the peculiar Voynichese words that appear in the first position of most paragraphs).

One could introduce a “line marker” as well.

We know that in Voynichese a line of text appears to be “a functional unit”, with other peculiar words (different from Grove words) appearing in the first and last position of most lines.

Clark’s software allows setting a threshold that forces all words with fewer than that number of occurrences into a fixed class.

5 is suggested as a reasonable value.

One could experiment with different values for this threshold (which I did not use in the taggings discussed here).

Clark has developed a variant of Ney’s algorithm that favours the grouping of similar words into the same class.

We see that this is already happening “spontaneously” with Voynichese, but it should happen much more with Clark’s algorithm.
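Two of the preprocessing ideas above, boundary markers and a frequency threshold, could be prototyped before running any tagger. The sketch below is only an illustration: the marker strings `<par>` and `<line>`, the function names and the sample lines are all invented, and they say nothing about how Clark's software expects such options to be passed.

```python
from collections import Counter

def add_boundary_markers(paragraphs, paragraph_marker="<par>", line_marker="<line>"):
    """paragraphs: list of paragraphs, each a list of lines, each a list of words.

    Returns a flat token list with a marker before every paragraph and line.
    """
    tokens = []
    for paragraph in paragraphs:
        tokens.append(paragraph_marker)
        for line in paragraph:
            tokens.append(line_marker)
            tokens.extend(line)
    return tokens

def words_below_threshold(tokens, threshold=5):
    """Words with fewer than `threshold` occurrences (candidates for a fixed class)."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n < threshold}

# Invented two-paragraph sample, not an actual transcription.
paragraphs = [
    [["okaiin", "ar", "aiin"], ["chey", "qokeedy"]],
    [["saiin", "or", "aiin", "chey"]],
]
tokens = add_boundary_markers(paragraphs)
rare = words_below_threshold(tokens, threshold=2)
print(tokens)
print(sorted(rare))
```

Because the markers are ordinary tokens, a bigram-based algorithm would treat them like very frequent words, which is what would let paragraph-initial and line-initial words cluster around them.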
