Processes of numbers of different words and their elementary probabilistic model
Language | English
---|---
Participant type | Sectional
Conference | 2023 China Russia Symposium on Probability Theory, 28 Aug - 1 Sep 2023, Beijing
Authors |
Affiliations |
Abstract:
The well-known Zipf's law (Zipf, 1936) states that the rank-frequency distribution of words in a text follows a power law. A lesser-known law, discovered by Herdan (1960) and commonly referred to as Heaps' law (Heaps, 1978), describes how the number of different words changes along a text: as the text gets longer, the number of different words grows as a power function of its length. The connection between Zipf's and Heaps' laws has been substantiated in a number of works, but the first substantiation within an elementary probabilistic model was proposed by Bahadur (1960): if words are chosen independently of each other from an infinite dictionary according to a distribution whose pmf decreases according to a power law, then the expected number of distinct words grows asymptotically according to a power law, and the same is true of the number of distinct words itself. Bahadur proved this equivalence in the weak sense (in probability), and Karlin (1967) proved it in the strong sense (almost surely). Zipf's parameter alone is not enough to describe real texts: both Zipf's law and Heaps' law show significant discrepancies with real data. Mandelbrot (1965) proposed a modification of Zipf's law. We use Mandelbrot's model, but we reserve separate probabilities for some initial number of words whose occurrence probabilities differ from Mandelbrot's formula. We analyze texts on the basis of this elementary probabilistic model: we assume that the words of a text are selected independently of each other from an infinite dictionary. We show that this elementary model describes Zipf's and Heaps' laws effectively. The main application of this study is a technique for text homogeneity analysis. The statistical test decides whether a text was written by one author or by several, based on an omega-square type statistic for the difference between the processes of numbers of different words obtained by reading the text forward and backward.
This approach is based on papers of Chebunin and Kovalevskii (2016), Abebe et al. (2022).
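
The elementary model described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the author's code: it assumes a Zipf-Mandelbrot pmf of the form p_i proportional to (i + shift)^(-exponent), truncates the infinite dictionary to a finite one so that it can be sampled, and omits the separately reserved probabilities for the initial words. The function names and all parameter values (`exponent`, `shift`, `n_words`, `n_tokens`) are hypothetical choices made for illustration only.

```python
import numpy as np


def zipf_mandelbrot_pmf(n_words, exponent, shift):
    """Truncated Zipf-Mandelbrot pmf: p_i proportional to (i + shift)^(-exponent).

    The finite dictionary size is an assumption that stands in for the
    infinite dictionary of the model.
    """
    ranks = np.arange(1, n_words + 1)
    weights = (ranks + shift) ** (-exponent)
    return weights / weights.sum()


def distinct_word_counts(tokens):
    """Return R_1, ..., R_n: the number of distinct words among the first k tokens."""
    seen = set()
    counts = np.empty(len(tokens), dtype=int)
    for k, token in enumerate(tokens):
        seen.add(int(token))
        counts[k] = len(seen)
    return counts


# Illustrative parameter values, not taken from the talk.
n_tokens, n_words = 20_000, 100_000
rng = np.random.default_rng(0)
pmf = zipf_mandelbrot_pmf(n_words, exponent=1.5, shift=2.0)

# Words are drawn independently of each other, as the model assumes.
tokens = rng.choice(n_words, size=n_tokens, p=pmf)

# Processes of numbers of different words, reading forward and backward.
forward = distinct_word_counts(tokens)
backward = distinct_word_counts(tokens[::-1])

# Rough Heaps exponent: slope of log R_k against log k over the later part of the text.
k = np.arange(1, n_tokens + 1)
tail = k >= 1000
slope, intercept = np.polyfit(np.log(k[tail]), np.log(forward[tail]), 1)
print(f"estimated Heaps exponent: {slope:.3f}")
print(f"largest forward/backward gap: {np.abs(forward - backward).max()}")
```

For a homogeneous simulated text the forward and backward processes stay close to each other and end at the same value. The homogeneity test in the abstract is built on the difference between these two processes; the omega-square type statistic itself is not reproduced here.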
Cite:
Kovalevskii A.
Processes of numbers of different words and their elementary probabilistic model
2023 China Russia Symposium on Probability Theory 28 Aug - 1 Sep 2023