Approximate approach for frequent itemsets mining on massive distributed data beyond computing capacity | Sciact - Research Activity Tracking System of IM SB RAS

Approximate approach for frequent itemsets mining on massive distributed data beyond computing capacity (Scientific publication)

Journal

Expert Systems with Applications
ISSN: 0957-4174 , E-ISSN: 1873-6793

Publication details

Year: 2026, Volume: 318, Article number: 132043, Pages: 15, DOI: 10.1016/j.eswa.2026.132043

Keywords

Parallel and distributed algorithms; Frequent itemsets mining; Spark; Big data analytics; Distributed data files; Sampling techniques

Authors

Ngueilbaye Alladoumbaye ^1,2,3 , Sibagatullin Ratmir ^2,3 , Cai Yongda ^4 , Mahmud Mohammad Sultan ^3,5 , Sun Xudong ^6 , Nechesov Andrey ^7 , Goncharov Sergey S. ^8 , Huang Joshua Zhexue ^2,3,9

Affiliations

1	School of Artificial Intelligence, Shenzhen University, 518060, Shenzhen, China
2	National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, 518060, Shenzhen, China
3	Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, 518060, Shenzhen, China
4	School of Computer Sciences, Guangdong Polytechnic Normal University, 510665, Guangzhou, Guangdong, China
5	School of Artificial Intelligence, Shenzhen Technology University, 518118, Shenzhen, China
6	College of Management, Shenzhen University, China
7	The Artificial Intelligence Center, Novosibirsk State University, 630090, Novosibirsk, Russia
8	Sobolev Institute of Mathematics, Siberian Branch of the Russian Academy of Sciences, 630090, Novosibirsk, Russia
9	Guangdong Laboratory of Artificial Intelligence and Digital Economy, 518107, Shenzhen, China

Frequent itemsets mining (FIM) is a fundamental task in data mining; however, traditional methods struggle with massive distributed data that exceeds available memory and computing resources. Mining frequent itemsets (FIs) from a massive static distributed data file (MSDDF) on a cluster with limited memory is therefore a challenging problem. In this paper, we propose Approximate Frequent Itemsets Mining (ApproxFIM), a novel two-stage solution that combines a new sampling method with an approximation approach to reduce computational cost under strict resource constraints. In the first stage, a bounded number of data blocks is randomly selected from the MSDDF and converted into representative random sample partitions. Theoretical guarantees are derived that bound the number of selected data blocks, ensure the quality of the random sample, and show that each constructed sample remains representative of the entire dataset. In the second stage, frequent itemsets are mined independently and in parallel from the sampled partitions using FP-Growth, and the resulting patterns are aggregated into a final approximate set of FIs. ApproxFIM is implemented in Apache Spark using the Local Operations with Global Operations (LOGO) computing paradigm and evaluated on both real-world and synthetic datasets. Experimental results demonstrate that ApproxFIM scales effectively, significantly reduces memory and execution time requirements, and produces accurate approximations, making it well suited for practical mining of massive static distributed data on small clusters with limited resources.
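The two-stage flow described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names (`to_random_sample_partitions`, `mine_partition`, `approx_fim`), the pool-shuffle-split conversion, the brute-force enumeration that stands in for FP-Growth, and the voting rule with mean local support used for aggregation are all assumptions made for illustration, since the abstract does not specify these details. The number of sampled blocks is passed in directly rather than derived from the paper's theoretical bound.

```python
import random
from collections import Counter
from itertools import combinations


def to_random_sample_partitions(blocks, num_blocks, num_partitions, rng):
    """Stage 1 (assumed form): pick blocks at random, pool and shuffle their
    transactions, then split them into equally sized partitions."""
    selected = rng.sample(blocks, num_blocks)
    pooled = [t for block in selected for t in block]
    rng.shuffle(pooled)
    return [pooled[i::num_partitions] for i in range(num_partitions)]


def mine_partition(partition, min_support, max_len=3):
    """Return {itemset: relative support} for itemsets frequent in one partition.

    Brute-force enumeration up to max_len items; a stand-in for FP-Growth.
    """
    counts = Counter()
    for transaction in partition:
        items = sorted(set(transaction))
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    n = len(partition)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def approx_fim(blocks, num_blocks, num_partitions, min_support,
               min_votes=0.5, seed=0):
    """Stage 2 (assumed aggregation): mine each partition independently, keep
    itemsets frequent in at least `min_votes` of the partitions, and report
    their mean support over the partitions that found them."""
    rng = random.Random(seed)
    partitions = to_random_sample_partitions(blocks, num_blocks,
                                             num_partitions, rng)
    # Each call is independent, so this loop could run in parallel.
    local_results = [mine_partition(p, min_support) for p in partitions]

    votes, support_sum = Counter(), Counter()
    for result in local_results:
        for itemset, support in result.items():
            votes[itemset] += 1
            support_sum[itemset] += support
    return {s: support_sum[s] / votes[s]
            for s in votes if votes[s] / num_partitions >= min_votes}
```

For example, `approx_fim(blocks, num_blocks=2, num_partitions=2, min_support=0.5)` mines only two of the available blocks, where `blocks` is a list of blocks and each block is a list of transactions (lists of items). In a Spark setting the per-partition mining step would map naturally onto a per-partition operation, which is one way to read the local/global split of the LOGO paradigm.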

Ngueilbaye A. , Sibagatullin R. , Cai Y. , Mahmud M.S. , Sun X. , Nechesov A. , Goncharov S.S. , Huang J.Z.
Approximate approach for frequent itemsets mining on massive distributed data beyond computing capacity
Expert Systems with Applications. 2026. V.318. 132043:1-15. DOI: 10.1016/j.eswa.2026.132043

Received:	Oct 13, 2025
Accepted for publication:	Mar 10, 2026
Published online:	Mar 14, 2026
Published in print:	Jul 1, 2026

Web of Science:	WOS:001721167100001
Scopus:	2-s2.0-105034621405
OpenAlex:	W7135405357