The program is based on the algorithms described by Jiří Milička in his article "Rank-frequency Relation and
Type-token Relation: Two Sides of the Same Coin". The algorithms are valid only for large corpora (the number
of tokens should exceed 100,000,000).
The application helps you plan the size of a new corpus.
- Specify the number of tokens (or "positions") in your corpus / corpora
- Specify the number of types (wordforms, lemmas) in your corpus / corpora
- Specify the number of hapax legomena (i.e. types that occur only once) in your corpus / corpora
- Fill in any two of the following three values:
  - a) The number of tokens in your planned corpus, if you need b) types of frequency c)
  - b) The number of types of frequency c) in a corpus of size a)
  - c) The frequency of b) types in a corpus that contains a) tokens
- The remaining value is calculated when you press the main button.
The computation is quite resource-intensive, so please be patient (no progress indicator is shown).
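
If you do not yet have the three input counts (tokens, types, hapax legomena) for your existing corpus, they can be obtained with a few lines of code. The following is a minimal sketch in Python, not part of the application itself. It assumes the corpus is already tokenized into a sequence of strings; the names `corpus_counts`, `is_large_enough` and `MIN_TOKENS` are hypothetical and chosen only for illustration. It does not implement the estimation algorithms from the article.

```python
from collections import Counter

# Validity threshold stated above: the number of tokens should exceed this.
MIN_TOKENS = 100_000_000


def corpus_counts(tokens):
    """Return (number of tokens, number of types, number of hapax legomena)."""
    freq = Counter(tokens)                      # frequency of each type
    n_tokens = sum(freq.values())               # all positions in the corpus
    n_types = len(freq)                         # distinct word forms or lemmas
    n_hapax = sum(1 for count in freq.values() if count == 1)  # types seen once
    return n_tokens, n_types, n_hapax


def is_large_enough(n_tokens):
    """Check the corpus size against the validity threshold."""
    return n_tokens > MIN_TOKENS


# Toy input only; a real corpus would be read from files.
sample = "the cat sat on the mat and the dog sat too".split()
print(corpus_counts(sample))       # (11, 8, 6)
print(is_large_enough(len(sample)))  # False
```

Whether the types are word forms or lemmas depends only on what the token sequence contains, so the same function covers both cases.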