
Text Segmentation Via Processes that Count the Number of Different Words Forward and Backward
Full article

Journal: Journal of Quantitative Linguistics
ISSN: 1744-5035
Output data: Year: 2024, Volume: 31, Number: 1, Pages: 1-18, Pages count: 18, DOI: 10.1080/09296174.2023.2275342
Tags: change-point detection
Authors: Abebe Berhane 1,2, Chebunin Mikhail 1,3, Kovalevskii Artyom 1,4,5
Affiliations
1 Novosibirsk State University
2 Mainefhi College of Science
3 Karlsruhe Institute of Technology, Institute of Stochastics
4 Sobolev Institute of Mathematics
5 Novosibirsk State Technical University

Funding (1)

1 Sobolev Institute of Mathematics FWNF-2022-0010

Abstract: This paper develops a new statistical approach to the automatic partitioning of texts into parts written by different authors. The approach is based on the analysis of processes that count the number of different words forward and backward. The theoretical study of these processes rests on the assumptions of an elementary probability model with a change point. We prove the consistency of our statistical estimate of the concatenation point in the case where the concatenated texts have different Zipf exponents. The method is tested on the Brown corpus and on newspaper texts in different languages. Testing shows that the concatenation point is estimated well. The method can be used in parallel with other text segmentation methods.
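As a rough illustration of the two counting processes named in the abstract, the sketch below computes, for a tokenized text, the number of different words seen when reading from the start (forward) and from the end (backward). The function names and the whitespace tokenization are assumptions made for this example. The abstract does not give the paper's change-point estimator, so none is implemented here.

```python
from typing import List


def distinct_counts_forward(words: List[str]) -> List[int]:
    """Return counts[k] = number of different words among words[0..k]."""
    seen = set()
    counts = []
    for w in words:
        seen.add(w)
        counts.append(len(seen))
    return counts


def distinct_counts_backward(words: List[str]) -> List[int]:
    """Return counts[k] = number of different words among the last k+1 words."""
    return distinct_counts_forward(words[::-1])


if __name__ == "__main__":
    # Toy input; a real study would use a properly tokenized corpus.
    tokens = "a b a c b".split()
    print(distinct_counts_forward(tokens))   # [1, 2, 2, 3, 3]
    print(distinct_counts_backward(tokens))  # [1, 2, 3, 3, 3]
```

In a concatenation of two texts with different vocabulary growth, the forward and backward curves would be expected to behave differently on either side of the junction, which is the kind of signal a change-point estimate could exploit.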
Cite: Abebe B., Chebunin M., Kovalevskii A.
Text Segmentation Via Processes that Count the Number of Different Words Forward and Backward
Journal of Quantitative Linguistics. 2024. V.31. N1. P.1-18. DOI: 10.1080/09296174.2023.2275342
Dates:
Published online: Nov 12, 2023
Published print: Jan 15, 2024
Identifiers:
Web of science: WOS:001100158000001
Scopus: 2-s2.0-85176726465
OpenAlex: W4388608298
Citing:
OpenAlex: 2
Web of science: 2
Scopus: 3