The Bibliography

Articles
Is the Distribution of L-Motifs Inherited from the Word Lengths Distribution?
In Sequences in language and text (2015).


The distribution of L-motifs (measured on a text T) is similar to the L-motifs distribution measured on the pseudotext T’ constructed by random transposition of all tokens within the text T. This suggests that the distribution of L-motifs is inherited from the word length distribution (in other words, that the word length distribution of a text implies the distribution of L-motifs). The paper shows that, despite the similarity, an L-motif structure independent of the word length distribution can be detected.
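The comparison described in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definition of an L-motif as a maximal run of non-decreasing word lengths, and it measures word length in characters; both choices, and all function names, are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def word_lengths(tokens):
    """Word length in characters (an assumption; syllables are another common unit)."""
    return [len(token) for token in tokens]


def l_motifs(lengths):
    """Split a sequence of lengths into maximal non-decreasing runs."""
    motifs = []
    current = []
    for value in lengths:
        # A drop in length closes the current motif and starts a new one.
        if current and value < current[-1]:
            motifs.append(tuple(current))
            current = []
        current.append(value)
    if current:
        motifs.append(tuple(current))
    return motifs


def motif_size_distribution(motifs):
    """Count how many motifs consist of 1, 2, 3, ... words."""
    return Counter(len(motif) for motif in motifs)


def shuffled_pseudotext(tokens, seed=0):
    """Pseudotext T': the same tokens in random order."""
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled
```

Running `motif_size_distribution(l_motifs(word_lengths(tokens)))` on a text and on its `shuffled_pseudotext` gives the two distributions that the abstract compares; shuffling keeps the word length distribution unchanged by construction.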

Menzerath's Law: The whole is greater than the sum of its parts.
In Journal of Quantitative Linguistics 2/21 (2014).


Reinhard Köhler (1984) proposed that the linguistic constructs which have to be processed by the human parser consist of plain information (which needs to be communicated) and structure information, and that this can explain Menzerath's law. Our paper assumes that the amount of plain information and the amount of structure information are mutually independent. A new model of the nested structure of text and of Menzerath's law can be based on this assumption. A formula derived from the model is successfully tested and the results are compared to the classical Menzerath-Altmann law.

(With Georgios Mikros) Distribution of the Menzerath’s Law on the Syllable Level in Greek texts.
In Gabriel Altmann, Radek Čech, Ján Mačutek, Ludmila Uhlířová (eds.) Empirical Approaches to Text and Language Analysis. RAM-Verlag 2014 Lüdenscheid.


Examining a large corpus of Greek texts, we found that the average syllable length in disyllabic words is lower than the average syllable length in monosyllabic words and also lower than that in trisyllabic words. This peculiar phenomenon can be interpreted as a counterexample to Menzerath's Law.

Quotations, Relevance and Time Depth: Medieval Arabic Literature in Grids and Networks.
In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLfL) (2014).


This contribution deals with the use of quotations (repeated n-grams; the algorithm tolerates small lexical changes) in works of medieval Arabic literature. The analysis is based on a historical corpus of Arabic comprising 420 million words. Based on quotations repeated from work to work, a network is constructed and used to interpret various aspects of Arabic literature. Two short case studies are presented, concentrating on the centrality and relevance of individual works, and on the analysis of the time depth and resulting impact of a given work in various periods.
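The general idea of building a network from repeated n-grams can be sketched as follows. This is a simplified illustration only: it matches n-grams exactly, whereas the algorithm in the paper is tolerant of small lexical changes, and the function names and the default n-gram size are assumptions.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Set of all contiguous n-token sequences in a work."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def quotation_network(works, n=5):
    """Build weighted edges between works that share at least one n-gram.

    works: dict mapping a work's title to its list of tokens.
    Returns a dict mapping (title_a, title_b) to the number of shared n-grams.
    """
    grams = {title: ngrams(tokens, n) for title, tokens in works.items()}
    edges = {}
    for a, b in combinations(sorted(grams), 2):
        shared = len(grams[a] & grams[b])
        if shared:
            edges[(a, b)] = shared
    return edges
```

The resulting edge weights could then feed a centrality measure; adding the dates of the works would be needed for the time-depth analysis the abstract mentions.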

(With Miroslav Kubát): Vocabulary Richness Measure in Genres
In Journal of Quantitative Linguistics 4/20 (2013).

This article deals with one of the oldest and most traditional fields in quantitative linguistics, the concept of vocabulary richness. Although there are several methods for measuring vocabulary richness, all of them are influenced by text size. Therefore, the authors propose a new way of measuring vocabulary richness that does not depend on text length. In the second part of the article, the new method is used for a genre analysis of texts written by the Czech writer Karel Čapek. Furthermore, differences between authors and between languages are studied with this method.
The software used in the paper is available here.

Rank-frequency Relation and Type-token Relation: Two Sides of the Same Coin
In Ivan Obradović, Emmerich Kelih and Reinhard Köhler (Eds.) Methods and Applications of Quantitative Linguistics – Selected papers of the 8th International Conference on Quantitative Linguistics (QUALICO). 2013


Presented at QUALICO 2012, Belgrade.
This paper shows that the type-token relation, the hapax-token relation and, more generally, the relation between types of a certain frequency and tokens can be computed from the rank-frequency relation or from any type of frequency distribution, and that the type-token relation can be computed from the hapax-token relation. No approximations or assumptions are needed; the formulae can be derived purely algebraically. The second part of the paper observes that, for very large corpora, the ratio between the number of hapax legomena and the number of types converges to a constant Z, Z > 0. Under this assumption an approximation is built that enables us to predict the type-token relation and the other aforementioned relations from the single parameter Z. This approximation is only valid for very large corpora. As the last chapter shows, this assumption implies that as the number of tokens increases without bound, the number of types also increases beyond any limit.
Available here.
The software based on the model is available here.
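To illustrate how a type-token and a hapax-token relation can follow exactly from a frequency list, here is a minimal sketch. It uses the standard hypergeometric expectation for a random sample of n tokens drawn without replacement from the text; this is an assumption chosen for illustration and may differ from the formulae derived in the paper. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_types(frequencies, n):
    """Expected number of distinct types among n tokens sampled without replacement.

    frequencies: the frequency of each type, e.g. read off a rank-frequency list.
    """
    total = sum(frequencies)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the number of tokens")
    denom = comb(total, n)
    # A type with frequency f is missing from the sample with
    # probability C(total - f, n) / C(total, n).
    return sum(1 - Fraction(comb(total - f, n), denom) for f in frequencies)


def expected_hapaxes(frequencies, n):
    """Expected number of types occurring exactly once among n sampled tokens."""
    total = sum(frequencies)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the number of tokens")
    if n == 0:
        return Fraction(0)
    denom = comb(total, n)
    # Choose one of the f tokens of the type and n - 1 tokens of other types.
    return sum(Fraction(f * comb(total - f, n - 1), denom) for f in frequencies)
```

Evaluating both functions for n = 1, 2, ..., N traces the whole curve; exact fractions are used so that no rounding enters the computation, at the cost of speed on large frequency lists.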

Minimal Ratio: An Exact Metric for Keywords, Collocations etc.
In Czech and Slovak Linguistic Review 1/2012.


The paper defines the Minimal Ratio and shows how to use it. The Minimal Ratio is an exact metric that expresses the ratio between the measured value and the limits of the confidence interval calculated according to the formula on which Fisher’s exact test is based. The metric is meant to assist with the extraction of keywords and collocations and with comparing texts or corpora according to the distribution of word types or other similar criteria.
The software based on the metric is available here.

Valency and Information Structure: A quantitative approach to from – to juxtaposition in Arabic
In Proceedings of CL Birmingham 2011.


Presented at CL 2011, Birmingham.
In Arabic, the mutual order of prepositional phrases syntactically dependent on one head is neither fixed nor random. This paper explores the factors affecting the order of the prepositions from and to. Many factors related to syntax, morphology and phonology are taken into account and analysed with a corpus-driven approach.
Available in the conference proceedings or on my website.

A Combinatorial Method for a Context Comparison
In Issues in Quantitative Linguistics 2. Lüdenscheid 2011.


When comparing the use of two word types within one text, we can do so by comparing the contexts in which they occur. We pick all the tokens that occur, e.g., immediately to the right of the word A and immediately to the right of the word B, thus getting two multiple subsets of the text. This paper offers a method for comparing such subsets (its use is not limited to the field of linguistics). The method is based on comparing the cardinality of the intersection of the two multiple subsets with a model which characterizes the average cardinality of all possible subsets of a given length from the given text. The model is derived algebraically.
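The first step described above, collecting right-hand contexts and measuring the cardinality of their intersection, can be sketched with multisets. The function names are hypothetical, and the sketch does not reproduce the algebraic model of the average cardinality that the paper derives.

```python
from collections import Counter


def right_context(tokens, word):
    """Multiset of tokens that occur immediately to the right of `word`."""
    return Counter(
        tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == word
    )


def intersection_cardinality(context_a, context_b):
    """Size of the multiset intersection: each shared token counts min(a, b) times."""
    return sum((context_a & context_b).values())
```

For example, with `tokens = "the cat sat the dog sat a cat ran a dog sat".split()`, the right contexts of "cat" and "dog" share the token "sat" once, so the cardinality of their intersection is 1.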

Type-token & Hapax-token Relation: A Combinatorial Model
In Glottotheory. International Journal of Theoretical Linguistics 2/1 (2009).


If we consider the type-token relation to be a feature of text and not of language, we can arrive at a theoretically based and precise description of this relation. Such a description will suit the demands of text linguistics better than the empirical laws in use today. This paper offers a model of the relation based on a combinatorial characterization of the distribution of types in text. The method is subsequently used to formulate a model of the hapax-token relation, and the subject is generalized.
2. 8. 2008
Type-token & Hapax-token Relation: A Combinatorial Model. Software based on formulae from this article.

Published in Czech
Konfidenční intervaly v empirické lingvistice.
In Lingvistika Praha 2014.


Empirical linguistics and confidence intervals
The paper attempts to introduce confidence intervals to (Czech) empirical linguistics. First, classical inference tests are discussed, with the claim that they cannot determine real-life significance. Then confidence intervals are defined and the basic idea underlying the method for computing confidence intervals for binary data is described. It is shown how the intervals can be useful when exploring binary quaternities and relations between two variables. The last section deals with the relevance of the method for Czech linguistic discourse.
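As a small illustration of a confidence interval for binary data, the sketch below computes the Wilson score interval for a proportion. The abstract does not say which method the paper uses, so the choice of the Wilson interval, the function name and the default z value (about 95% coverage) are all assumptions.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of observations with the feature; n: sample size.
    z: normal quantile, 1.96 for roughly 95% confidence (an assumed default).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

For instance, a feature observed in 50 of 100 sentences gives an interval of roughly 0.40 to 0.60, which conveys the uncertainty of the estimate more directly than a bare significance test.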

With Jan Chromý: Experimentální zkoumání stylotvorných faktorů: první výstupy
(Experimental research on style-forming factors: first outcomes)
In Naše řeč 95/4(2012).


The paper introduces an experiment on the role of preparedness in writing. The experiment took place in 2010. Participants (N = 51; students of Charles University in Prague) were randomly divided into two groups: group N (N = 24) and group P (N = 27). Their main task was to describe the plot of the short animated film Quest. Group N started to write right after seeing each part of the film, while group P had 5 minutes to prepare. Significant differences in sentence length and number of revisions were found between the two groups. It is claimed that preparedness is a valid style-forming factor, i.e. it influences both the process and the result of writing. Furthermore, the same method could be used to analyse the role of other style-forming factors in the writing process.
In Naše řeč 95/4 (2012) pp 181–186 ISSN 0027-8203

Knihtisk v dějinách islámské kultury.
In Nový orient 64/2 (2009).


(Typography and the Islamic culture)
The article examines the phenomenon of typography in the course of Islamic history. In the Islamic world, printing with movable type and printing blocks was unacceptable. Using such a technology to copy a text written in the Arabic script was illegal. The article asks how the society could resist the temptation of this innovation and describes the distressful influence of typography on the life of Muslims.
9. 5. 2007; Full version (Czech): Cesta k arabskému knihtisku na Blízkém východě.
2009 Abbreviated version published in the Nový Orient (64/2), Knihtisk v dějinách islámské kultury.

*   *   *