Efficient Long-Term Data Archival in Hadoop using Erasure Coding: The aHDFS Approach
Project Code: 25P4U21
Abstract
This research proposes aHDFS, a novel Hadoop Distributed File System (HDFS) extension for cost-effective long-term data archival. aHDFS leverages erasure coding to significantly reduce storage redundancy while maintaining high data availability and reliability. The system addresses the challenges of managing massive, infrequently accessed datasets by optimizing storage utilization and minimizing retrieval latency. Our results demonstrate that aHDFS achieves substantial storage savings compared to traditional HDFS replication, with minimal impact on data access performance, making it a viable solution for large-scale data archiving in Hadoop environments.
Introduction
The exponential growth of data necessitates efficient and cost-effective archival solutions. HDFS is robust, but its default replication policy stores three copies of every block, which amounts to a 200% storage overhead. That redundancy ensures high availability but is wasteful for cold data that is rarely accessed. Existing approaches often lack scalability or are not fully integrated into the Hadoop ecosystem. aHDFS is motivated by the need for a scalable, cost-effective, Hadoop-native solution for long-term data archival. Key challenges include minimizing storage overhead, maintaining data integrity and availability under node failures, and ensuring efficient data retrieval.
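To make the storage cost of replication concrete, the following sketch compares the overhead of n-way replication with that of a (k data, m parity) erasure code. It is illustrative arithmetic only: the class and method names are hypothetical, the 100 TB archive size is an assumed figure, and the Reed-Solomon (6, 3) layout is used because it is one of the built-in erasure coding policies in Hadoop 3, not because the source states that aHDFS uses it.

```java
// Illustrative arithmetic only; the class and method names are hypothetical
// and are not part of aHDFS or of the Hadoop API.
final class StorageOverhead {

    /** Extra storage, as a fraction of the logical data size, for n-way replication. */
    static double replicationOverhead(int replicas) {
        if (replicas < 1) {
            throw new IllegalArgumentException("replicas must be >= 1");
        }
        return replicas - 1;
    }

    /** Extra storage, as a fraction of the logical data size, for a (k data, m parity) code. */
    static double erasureCodingOverhead(int dataUnits, int parityUnits) {
        if (dataUnits < 1 || parityUnits < 0) {
            throw new IllegalArgumentException("need dataUnits >= 1 and parityUnits >= 0");
        }
        return (double) parityUnits / dataUnits;
    }

    /** Raw storage consumed for a given logical size and overhead fraction. */
    static double rawSize(double logicalSize, double overhead) {
        return logicalSize * (1.0 + overhead);
    }

    public static void main(String[] args) {
        double logicalTb = 100.0; // assumed archive size, for illustration

        double replication = replicationOverhead(3);   // HDFS default replication factor
        double reedSolomon = erasureCodingOverhead(6, 3); // assumed RS(6, 3) layout

        System.out.printf("3x replication: %.0f%% overhead, %.1f TB raw%n",
                replication * 100, rawSize(logicalTb, replication));
        System.out.printf("RS(6, 3):       %.0f%% overhead, %.1f TB raw%n",
                reedSolomon * 100, rawSize(logicalTb, reedSolomon));
    }
}
```

Under these assumptions the erasure-coded layout needs half as much raw storage as triple replication, while tolerating the loss of any three of its nine units, compared with two of three replicas.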
Objectives
- Design a Hadoop-native archival system using erasure coding techniques.
- Reduce storage redundancy for cold datasets while maintaining data availability.
- Evaluate performance trade-offs between replication and erasure coding in HDFS.
- Ensure efficient retrieval with minimal latency under node failure scenarios.
- Demonstrate integration feasibility and scalability within existing Hadoop clusters.
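The objectives of keeping data available and retrievable under node failures rest on the basic property of erasure codes: a lost block can be rebuilt from the surviving blocks plus parity. The sketch below shows that idea with a single XOR parity block, which can recover one lost block. This is a deliberately simplified stand-in, not the aHDFS implementation and not the Reed-Solomon codes HDFS uses; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Simplified single-parity (XOR) erasure code for illustration only.
// Hypothetical names; HDFS erasure coding uses Reed-Solomon codes instead.
final class XorParityDemo {

    /** Computes one parity block as the bytewise XOR of equally sized data blocks. */
    static byte[] parity(byte[][] blocks) {
        if (blocks.length == 0) {
            throw new IllegalArgumentException("need at least one data block");
        }
        byte[] p = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            if (block.length != p.length) {
                throw new IllegalArgumentException("blocks must have equal length");
            }
            for (int i = 0; i < p.length; i++) {
                p[i] ^= block[i];
            }
        }
        return p;
    }

    /** Rebuilds the block at index lost from the parity block and the surviving blocks. */
    static byte[] recover(byte[][] blocks, int lost, byte[] parity) {
        byte[] rebuilt = parity.clone();
        for (int j = 0; j < blocks.length; j++) {
            if (j == lost) {
                continue; // contents of the lost block are never read
            }
            for (int i = 0; i < rebuilt.length; i++) {
                rebuilt[i] ^= blocks[j][i];
            }
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        byte[][] blocks = {
            "cold".getBytes(), "data".getBytes(), "kept".getBytes()
        };
        byte[] p = parity(blocks);

        // Simulate losing the node that held block 1, then rebuild it.
        byte[] rebuilt = recover(blocks, 1, p);
        System.out.println(Arrays.equals(rebuilt, "data".getBytes()));
    }
}
```

A single parity block survives only one failure. Codes with several parity units, such as Reed-Solomon, extend the same principle to multiple simultaneous failures, at the cost of more computation during reconstruction.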
Domain: Big Data, Hadoop, Distributed Systems
Year: 2025
Technologies: Java, Hadoop HDFS, Erasure Coding, MapReduce
Platform: Linux, Hadoop Cluster