Development of a parallel clustering algorithm for the k-medoids problem in high-dimensional space for large-scale datasets Conference attendances
Language | Английский | ||||||
---|---|---|---|---|---|---|---|
Participant type | Секционный | ||||||
URL | https://motor2023.uran.ru/docs/technical--program-v3.pdf | ||||||
Conference |
22nd International conference "Mathematical Optimization Theory and Operations Research" 02-08 Jul 2023 , Екатеринбург |
||||||
Authors |
|
||||||
Affiliations |
|
Abstract:
The k-medoids (k-median) clustering problem is an important problem in the field of unsupervised learning. It is used to partition a set of data points into k clusters, where each cluster is represented by one of the data points, called a medoid. This problem is widely used in various fields such as data mining, machine learning, and image processing. It has been applied to a variety of real-world applications such as data reduction, anomaly detection, and market segmentation. K-medoids algorithm is more robust to noise and has an ability to handle different types of data, including categorical and numerical data. Despite its importance, finding high-quality solutions for large-scale datasets remains a challenging task. In this paper, we improve previous work for the k-medoids problem [1] and propose a parallel primal-dual heuristic algorithm
in high-dimensional space. We developed an algorithm that addresses the limitations of existing solutions. These limitations include: time-consuming calculation of the distance matrix, unefficient searching for nearest neighbours, the absence of efficient GPU parallel algorithms, difficulties working with largescale datasets. Our algorithm employs an efficient parallel implementation of nearest neighbor search, a parallel subgradient column generation method, and a novel subgradient search algorithm. During the work, a comprehensive literature review was conducted to critically evaluate contemporary methods for addressing kmeans, nearest neighbour, and k-medoids problems using both CPU and GPUbased approaches. The review culminated in the identification of the most promising algorithms, which were subsequently chosen for in-depth study with the aim of incorporating their key principles into the design and implementation of a novel algorithm.
As a result of the research, the execution time of the algorithm on the BIRCH dataset was studied in relation to variations in the initial parameters and different implementations of specific components of the algorithm. Additionally, a comparative analysis of nearest neighbor search times was performed, utilising algorithm [2] as a reference point. Furthermore, the feasibility of applying the techniques proposed in article [3] to compute the approximate distance matrix was explored. Through our experiments, we demonstrate that our algorithm finds highquality solutions for large-scale datasets and outperforms this k-medoids clustering algorithms by achieving significant speedup.
1) A. Ushakov, I. Vasilyev. Near-optimal large-scale k-medoids clustering,
Information Sciences., 2020
2) V. Garcia. Fast k-nearest neighbor search using GPU, 2009
3) J. Johnson, M. Douze. Billion-scale similarity search with GPUs, 2017
Cite:
Vandanov S.A.
, Plyasunov A.V.
, Ushakov A.V.
Development of a parallel clustering algorithm for the k-medoids problem in high-dimensional space for large-scale datasets
22nd International conference "Mathematical Optimization Theory and Operations Research" 02-08 Jul 2023
Development of a parallel clustering algorithm for the k-medoids problem in high-dimensional space for large-scale datasets
22nd International conference "Mathematical Optimization Theory and Operations Research" 02-08 Jul 2023