Privacy-Preserving K-means Clustering on Distributed Datasets using Secure Multi-Party Computation
Project Code: 25P4U19
Abstract
This research explores the design and implementation of a privacy-preserving k-means clustering algorithm for large-scale datasets, built on a MapReduce framework and Secure Multi-Party Computation (MPC). The primary objective is to enable efficient and accurate clustering while keeping the underlying data confidential. MPC techniques let the participating parties compute on encrypted data, so no party has to disclose its raw records, which reduces the risk of data breaches. The results demonstrate a viable approach to privacy-preserving clustering in distributed environments, with accuracy comparable to traditional k-means and strong privacy guarantees. Future work will focus on optimizing performance and scalability for even larger datasets.
Introduction
K-means clustering is a widely used unsupervised machine learning technique with applications across many domains. Applying k-means to large datasets, however, often requires a distributed computing framework such as MapReduce. At the same time, growing concerns about data privacy call for methods that can compute on sensitive data without revealing its contents. Existing methods either compromise privacy or lack scalability. This research addresses that gap by proposing a privacy-preserving k-means algorithm that combines the efficiency of MapReduce with the security of MPC, aiming to achieve both scalability and privacy.
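To illustrate why k-means maps naturally onto MapReduce, the sketch below splits one iteration into a map phase (assign each point to its nearest centroid) and a reduce phase (average the points of each cluster). It is a minimal single-process illustration without any privacy protection, not the project's implementation; the function names map_phase, reduce_phase, and kmeans_step are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def map_phase(partition, centroids):
    """Map: emit (cluster_id, (point, 1)) for every point in one data partition."""
    return [(nearest(x, centroids), (x, 1)) for x in partition]


def reduce_phase(pairs, centroids):
    """Reduce: sum points and counts per cluster, then divide to get new centroids."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for cid, (x, n) in pairs:
        counts[cid] += n
        for d in range(dim):
            sums[cid][d] += x[d]
    # A cluster that received no points keeps its previous centroid.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(len(centroids))
    ]


def kmeans_step(partitions, centroids):
    """One k-means iteration over all partitions (simulated sequentially here)."""
    pairs = [kv for part in partitions for kv in map_phase(part, centroids)]
    return reduce_phase(pairs, centroids)


if __name__ == "__main__":
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_step(partitions, [(0.0, 0.0), (10.0, 10.0)]))
```

The reduce phase only needs per-cluster sums and counts, which is the part of the computation a privacy-preserving variant would have to protect.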
Objectives
- Implement a privacy-preserving k-means clustering algorithm using a MapReduce framework.
- Integrate secure multi-party computation (MPC) protocols to protect data privacy during computation.
- Evaluate the performance and accuracy of the proposed algorithm against a baseline k-means implementation.
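The second objective can be made concrete with a small sketch of one common MPC building block, additive secret sharing, applied to the per-cluster sums and counts that a k-means update needs. The source does not say which MPC protocol the project uses, so this is an assumption for illustration only: it simulates all parties in one process, assumes honest-but-curious participants, and uses hypothetical names (share, secure_sum) and assumed parameters (PRIME, SCALE).

```python
import secrets

PRIME = 2**61 - 1  # field modulus (assumed for this sketch)
SCALE = 10**4      # fixed-point scale for encoding real numbers (assumed)


def encode(x):
    """Map a real number to a field element using fixed-point encoding."""
    return round(x * SCALE) % PRIME


def decode(v):
    """Inverse of encode; elements above PRIME // 2 represent negative values."""
    if v > PRIME // 2:
        v -= PRIME
    return v / SCALE


def share(value, n_parties):
    """Split a field element into n additive shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values):
    """Sum one private value per party without any party revealing its input."""
    n = len(private_values)
    # Party i splits its value into n shares; share j is sent to party j.
    distributed = [share(encode(v), n) for v in private_values]
    # Party j adds the shares it received; each share alone looks random.
    partials = [sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Combining the partial sums reveals only the total.
    return decode(sum(partials) % PRIME)


if __name__ == "__main__":
    # Each party holds a local coordinate sum and point count for one cluster.
    local_sums = [4.0, 6.0, 5.0]
    local_counts = [2, 3, 5]
    centroid_coord = secure_sum(local_sums) / secure_sum(local_counts)
    print(centroid_coord)
```

In this sketch the aggregate sum and count are revealed to compute the centroid; a stricter design would keep those values shared as well, at additional protocol cost.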
Domain: Cybersecurity, Data Classification
Year: 2025
Technologies: Python, Data Analysis, Case Study Research, Visualization Tools
Platform: Cross-platform (Web-based or Desktop tool)