Press "Enter" to skip to content

Apache Hadoop Database Formats for Beginners

Hadoop is a collection of software utilities in the big data ecosystem that uses the combined power of a large number of computers to solve computational problems involving massive volumes of data. Hadoop primarily offers a software framework that enables distributed storage and big data processing using MapReduce. It was originally built for clusters of commodity hardware, though it is also used on higher-end machines. The Hadoop modules are designed so that hardware failures and similar occurrences are handled by the framework itself without disrupting processing.

Hadoop architecture

The core component of Hadoop is its storage system, technically known as HDFS (Hadoop Distributed File System). Next is the processing part, known as MapReduce. Files are split into large blocks and distributed among the various nodes of a cluster. The framework then ships packaged code to those nodes so that each can process its portion of the data in parallel.

This is a unique approach that benefits from data locality: each node manipulates the data it already holds. Altogether, it allows a dataset to be processed much faster and more efficiently than in a conventional supercomputer architecture, where data must travel over a high-speed network between the compute nodes and a parallel file system.
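
To make the block-splitting idea concrete, here is a minimal Java sketch using the standard HDFS FileSystem client API to print where the blocks of a file live; the path /data/example.txt is a hypothetical placeholder, and the client is assumed to be configured (via core-site.xml) to point at a cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            // Assumes the client configuration points at the cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file path, for illustration only.
            Path file = new Path("/data/example.txt");
            FileStatus status = fs.getFileStatus(file);

            // Each BlockLocation names the datanodes holding one block of the
            // file -- the information used to place tasks near their data.
            for (BlockLocation block :
                    fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }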

Primary modules of Apache Hadoop framework

  1. Hadoop Common – the utilities and libraries needed by the other Hadoop modules.
  2. HDFS – a distributed file system that stores data across the machines of a cluster, offering high aggregate bandwidth.
  3. Hadoop YARN – responsible for managing the computing resources in the cluster and using them to schedule users’ applications.
  4. MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

When people say Hadoop, they often refer to this group of applications with its base modules and sub-modules, the surrounding ecosystem, and the collection of software packages that can be installed alongside Hadoop, including but not limited to Pig, Hive, Phoenix, HBase, ZooKeeper, Spark, Flume, Cloudera Impala, Oozie, Sqoop, and Storm.

Hadoop is written mostly in Java, with some native code in C and command-line utilities written as shell scripts. Even though Java is the most common choice, you can use any programming language with Hadoop Streaming.

As we said above, the vital part of Hadoop is HDFS, the distributed file system. Unlike other file systems, HDFS is best used in combination with a data processing tool like Spark or MapReduce. These processing systems operate on various forms of textual data such as web content, location data, or server logs. In the rest of this article, we will look at the various file formats in Hadoop.

Hadoop storage formats

Storage format refers to the way a file stores its information. You can usually recognize it by checking the file extension. For example, there are many different storage formats for images, such as JPG, PNG, and GIF. The same image can be stored in any of these formats, but each has its own characteristics: a JPG file may be smaller, but it stores a compressed image of lower quality.

As noted by RemoteDBA.com, when dealing with the Hadoop file system, you can choose among many traditional formats, as well as some newer Hadoop-specific file formats, for both structured and unstructured data.

Some of the best Hadoop storage formats are:

  • Plain text, such as CSV and TSV
  • Parquet
  • Sequence Files
  • Avro

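As a quick taste of one of the Hadoop-specific formats in the list above, here is a minimal sketch that writes a Sequence File of key/value pairs using Hadoop's standard SequenceFile API; the output path is a hypothetical placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/data/pairs.seq"); // hypothetical output path

            // A Sequence File stores binary key/value pairs; here we use
            // Text keys and IntWritable values.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                writer.append(new Text("alpha"), new IntWritable(1));
                writer.append(new Text("beta"), new IntWritable(2));
            }
        }
    }
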
Importance of storage formats

The major weakness of MapReduce-style HDFS applications is the time it takes to find relevant data in one location and the time it takes to write that data to another location. These problems grow with huge datasets, and become harder still when schemas evolve over time. The newer Hadoop formats emerged as a way to deal with these issues across various use cases. Choosing a proper file format can contribute to:

  • Faster read and write times.
  • File splitting, so you don’t have to read a file as a whole but only the parts you need.
  • Support for schema evolution, which allows you to change the fields of a dataset.
  • Compression support, so files can be compressed with a compression codec (see the sketch after this list).
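
For example, here is a minimal sketch of how output compression is typically switched on for a MapReduce job, using the standard FileOutputFormat helpers and the built-in gzip codec.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedOutputSetup {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "compressed-output");

            // Compress the job's output files. Gzip is shown here; note
            // that gzip output is not splittable, so splittable codecs or
            // block-compressed container formats are often preferred.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            return job;
        }
    }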

Some of these file formats are designed for general use (as in the case of MapReduce and Spark), others for specific use cases (such as powering a database), and some are designed around particular data characteristics. So there are plenty of choices for users.

In an HDFS-enabled environment, MapReduce, Hive, and Spark are the three common ways of interacting with the files on Hadoop. All of these frameworks offer libraries that let you process files in various formats.
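
As an illustration of the Spark side, here is a minimal sketch using Spark's Java API to read a CSV file and write the same data back as Parquet; both paths are hypothetical placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class FormatSwap {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("format-swap")
                    .getOrCreate();

            // Read CSV, then write the same data back as Parquet.
            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .csv("hdfs:///data/input.csv");
            df.write().parquet("hdfs:///data/output.parquet");

            spark.stop();
        }
    }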

In MapReduce, file format support is provided by two classes:

  • InputFormat
  • OutputFormat

An example of a simple MapReduce read and write task is shown below.
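
This is a minimal sketch rather than production code: a map-only job that reads plain text through TextInputFormat and writes it back through TextOutputFormat, with input and output paths taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class ReadWriteJob {
        // Pass-through mapper: emits each input line unchanged.
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                context.write(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "read-write-example");
            job.setJarByClass(ReadWriteJob.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setNumReduceTasks(0); // map-only job
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // InputFormat controls how files are split and parsed into
            // records; OutputFormat controls how results are written out.
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }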

Users can implement their own InputFormat classes to store data in whatever customized format they like. Spark and Hive use the same mechanism to read and write the custom formats described by an InputFormat. So InputFormat can be seen as the gateway to the different file formats in Hadoop. Even though you can swap storage formats in Hadoop, it is not as easy as switching two lines of code: different storage formats may expose different data types to their users.
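
As a sketch of what that looks like, the hypothetical class below is a minimal custom InputFormat; it simply delegates to Hadoop's stock LineRecordReader, where a real custom format would implement its own record parsing.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Minimal custom InputFormat: reuses the stock line-based reader.
    // A real custom format would parse its own on-disk layout here.
    public class MyCustomInputFormat
            extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new LineRecordReader();
        }
    }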

The hottest trend in Hadoop file formats right now is columnar storage. Instead of storing the values of each row close to one another, this model stores the values of each column next to each other, so datasets are partitioned both horizontally and vertically. This is especially useful when a processing framework needs access only to a subset of the data: it can read the values of one column quickly without reading entire records.
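
The difference is easy to see in a toy sketch, with plain Java arrays standing in for the on-disk layout (real columnar formats such as Parquet add encoding, compression, and metadata on top of this idea):

    public class LayoutDemo {
        public static void main(String[] args) {
            // Row-oriented: each record's fields sit together.
            Object[][] rows = {
                {1, "alice", 30},
                {2, "bob",   25},
            };

            // Column-oriented: each column's values sit together, so a
            // query that only needs "age" scans one small array instead
            // of reading every full record.
            int[]    ids   = {1, 2};
            String[] names = {"alice", "bob"};
            int[]    ages  = {30, 25};

            int sum = 0;
            for (int age : ages) { // touches only the age column
                sum += age;
            }
            System.out.println("average age = " + (double) sum / ages.length);
        }
    }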

This is not an exhaustive guide to Hadoop data formats, but an introduction. Once you dive in and start exploring, you will find more ways to decide on the most appropriate format for the needs at hand.