
Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts
Scientific publication

Journal Glottometrics
ISSN: 1617-8351, E-ISSN: 2625-8226
Publication details Year: 2025, Volume: 58, Pages: 19-34, Page count: 16, DOI: 10.53482/2025_58_423
Keywords Large Language Model, Zipf’s Law, rare words.
Authors Kudryavtseva Anna 1, Kovalevskii Artyom 1,2,3
Affiliations
1 Novosibirsk State University
2 Novosibirsk State Technical University
3 Sobolev Institute of Mathematics

Funding information (1)

1 Sobolev Institute of Mathematics SB RAS FWNF-2022-0010

Abstract: We classify texts using relative word frequencies. The task is to distinguish human-written texts from those generated by a computer using modern algorithms. We study two essay datasets, each containing an equal number of human-written and computer-generated essays. Zipf diagrams show that the generated texts have a significantly smaller vocabulary than the human-written ones. However, the relative frequency of rare words (those not among the 1000 most common) does not allow us to classify the texts confidently. As additional features, we used the relative frequencies of the four most frequent words, as well as the ratio of the number of hapax legomena to the total number of different words. This feature significantly improves the classification. Using these six features allows us to determine fairly confidently whether a text is computer-generated.
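
The six features named in the abstract (one rare-word frequency, four frequent-word frequencies, one hapax ratio) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the record does not specify the tokenizer, the 1000-word list, or which four words are meant, so the function name `word_frequency_features` and the parameters `common_words` and `top_words` are hypothetical and supplied by the caller.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: tokens are lowercase ASCII words, optionally with one apostrophe part.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_frequency_features(text, common_words, top_words):
    """Return [rare-word share, share of each word in top_words..., hapax ratio].

    common_words: set standing in for the "1000 most common" words (assumed input).
    top_words: the four most frequent words to track (assumed input).
    """
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)

    # Relative frequency of tokens that are not in the common-word list.
    rare_share = sum(c for w, c in counts.items() if w not in common_words) / n

    # Relative frequency of each tracked frequent word (0.0 if absent).
    top_shares = [counts[w] / n for w in top_words]

    # Hapax legomena (words occurring exactly once) over the number of distinct words.
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / len(counts)

    return [rare_share, *top_shares, hapax_ratio]


# Toy usage: a two-word "common" list stands in for a real 1000-word list.
features = word_frequency_features(
    "the cat and the dog saw the bird",
    common_words={"the", "and"},
    top_words=("the", "of", "and", "to"),
)
```

The resulting six-element vector could then be passed to any standard classifier; the record does not say which classifier the authors used.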
Bibliographic reference: Kudryavtseva A., Kovalevskii A.
Comparative Statistical Analysis of Word Frequencies in Human-Written and AI-Generated Texts
Glottometrics. 2025. V.58. P.19-34. DOI: 10.53482/2025_58_423
Dates:
Published in print: 6 Aug 2025
Published online: 6 Aug 2025
Database identifiers:
Web of Science: WOS:001545112900002
Scopus: 2-s2.0-105013211258
OpenAlex: W4413000682
Citations in databases: No citations yet