
Efficient Long-Term Data Archival in Hadoop using Erasure Coding: The aHDFS Approach

Project Code: 25P4U21

Abstract

This research proposes aHDFS, an extension to the Hadoop Distributed File System (HDFS) for cost-effective long-term data archival. aHDFS leverages erasure coding to sharply reduce storage redundancy while preserving high data availability and reliability. The system addresses the challenge of managing massive, infrequently accessed datasets by optimizing storage utilization and minimizing retrieval latency. Our results show that aHDFS achieves substantial storage savings over HDFS's default three-way replication, with minimal impact on data access performance, making it a viable solution for large-scale data archiving in Hadoop environments.

Introduction

The exponential growth of data necessitates efficient and cost-effective archival solutions. Hadoop, while robust, incurs high storage costs through its default replication mechanism: with a replication factor of three, every byte of user data occupies three bytes of raw storage. This redundancy ensures high availability but is wasteful for cold data that is rarely accessed. Existing approaches often lack scalability or are not fully integrated into the Hadoop ecosystem, which motivates a scalable, cost-effective, Hadoop-native solution for long-term data archiving. Key challenges include minimizing storage overhead, maintaining data integrity and availability under node failures, and ensuring efficient data retrieval.
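The cost gap between replication and erasure coding can be made concrete with a back-of-the-envelope calculation. The sketch below (class and method names are illustrative, not part of aHDFS) compares the raw-storage multiplier of three-way replication against a Reed-Solomon RS(6,3) layout, i.e. six data blocks plus three parity blocks, which is the default striped erasure-coding policy in Hadoop 3.x:

```java
// Illustrative only: raw-storage cost of n-way replication vs. RS(k, m)
// erasure coding. Class and method names are hypothetical.
public class StorageOverhead {

    // Raw bytes stored per logical byte under n-way replication.
    static double replicationCost(int replicas) {
        return replicas; // e.g. 3 copies -> 3.0x raw storage (200% overhead)
    }

    // Raw bytes stored per logical byte under RS(k, m): (k + m) / k.
    static double erasureCodingCost(int dataBlocks, int parityBlocks) {
        return (double) (dataBlocks + parityBlocks) / dataBlocks;
    }

    public static void main(String[] args) {
        double rep = replicationCost(3);       // 3.0x
        double ec  = erasureCodingCost(6, 3);  // 1.5x (50% overhead)
        System.out.printf("replication: %.1fx, RS(6,3): %.1fx, savings: %.0f%%%n",
                rep, ec, 100 * (1 - ec / rep));
        // prints: replication: 3.0x, RS(6,3): 1.5x, savings: 50%
    }
}
```

Both layouts tolerate the loss of up to three blocks of a stripe, yet the erasure-coded layout needs only half the raw capacity, which is the core economic argument for archival workloads.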

Objectives

  • Design a Hadoop-native archival system using erasure coding techniques.
  • Reduce storage redundancy for cold datasets while maintaining data availability.
  • Evaluate performance trade-offs between replication and erasure coding in HDFS.
  • Ensure efficient retrieval with minimal latency under node failure scenarios.
  • Demonstrate integration feasibility and scalability within existing Hadoop clusters.
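To illustrate the recovery property the objectives above depend on, here is a minimal, self-contained sketch of erasure coding using a single XOR parity block (RAID-5 style) rather than the Reed-Solomon codes HDFS actually uses; all names are hypothetical. Any one lost block, data or parity, can be rebuilt by XOR-ing the survivors:

```java
import java.util.Arrays;

// Minimal erasure-coding sketch: k data blocks protected by one XOR parity
// block. Hypothetical demo code, not the aHDFS implementation.
public class XorParityDemo {

    // Compute the parity block for k equal-sized data blocks.
    static byte[] parity(byte[][] data) {
        byte[] p = new byte[data[0].length];
        for (byte[] block : data)
            for (int i = 0; i < p.length; i++)
                p[i] ^= block[i];
        return p;
    }

    // Rebuild the data block at index `lost` from the survivors plus parity.
    static byte[] recover(byte[][] data, byte[] parity, int lost) {
        byte[] r = parity.clone();
        for (int b = 0; b < data.length; b++)
            if (b != lost)
                for (int i = 0; i < r.length; i++)
                    r[i] ^= data[b][i];
        return r;
    }

    public static void main(String[] args) {
        byte[][] data = {
            "cold-arc".getBytes(),
            "hive-log".getBytes(),
            "parquet1".getBytes()
        };
        byte[] p = parity(data);
        byte[] rebuilt = recover(data, p, 1);           // simulate losing block 1
        System.out.println(Arrays.equals(rebuilt, data[1]));  // prints: true
    }
}
```

Reed-Solomon generalizes this idea: RS(6,3) keeps three independent parity blocks, so any three of the nine blocks in a stripe can be lost and still reconstructed, matching the failure tolerance of three-way replication at far lower cost.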


Project Information

Domain: Big Data, Hadoop, Distributed Systems

Year: 2025

Technologies: Java, Hadoop HDFS, Erasure Coding, MapReduce

Platform: Linux, Hadoop Cluster