Hadoop Introduction

Introduction to big data and hadoop

A problem that led to hadoop --


Before getting into technicalities in this Hadoop tutorial blog, let me begin with an interesting story on how Hadoop came into the picture and why is it so popular in the industry nowadays. So, it all started with two people, Doug Cutting and Mike Cafarella , who were in the process of building a search engine system that can index 1 billion pages. After their research, they estimated that such a system will cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. However, they soon realised that their architecture will not be capable enough to work around with billions of pages on the web.
They came across a paper, published in 2003, that described the architecture of Google’s distributed file system, called GFS, which was being used in production at Google. Now, this paper on GFS proved to be something that they were looking for, and soon, they realized that it would solve all their problems of storing very large files that are generated as a part of the web crawl and indexing process. Later in 2004, Google published one more paper that introduced MapReduce to the world. Finally, these two papers led to the foundation of  the framework called “Hadoop“. Doug quoted on Google’s contribution in the development of Hadoop framework:
“Google is living a few years in the future and sending the rest of us messages.”
So, by now you would have realized how powerful Hadoop is. But the question that arises over here is, how Hadoop provides such great functionalities? I would request you to bear with me and trust me guys, all of your doubts will be cleared by the time you have finished this blog.

What is Big Data?
Have you ever wondered how technologies evolve to fulfill emerging needs? For example, earlier we had landline phones, but now we have shifted to smartphones. Similarly, how many of you remember floppy drives that were extensively used back in 90’s? These Floppy drives have been replaced by hard disks because these floppy drives had very low storage capacity and transfer speed. Thus, this makes floppy drives insufficient for handling the amount of data with which we are dealing today. In fact, now we can store terabytes of data on the cloud without being bothered about size constraints.
Now, let us talk about various drivers that contribute to the generation of data. 
Have you heard about IoT? IoT connects your physical device to the internet and makes it smarter. Nowadays, we have smart air conditioners, televisions etc. Your smart air conditioner constantly monitors your room temperature along with the outside temperature and accordingly decides what should be the temperature of the room. Now, in order to do this, it first collects the data of the temperature outside the room from the internet. It continuously stores the data received from its sensors. Finally, with the help of these two data, it infers the required change in room temperature. Now imagine how much data would be generated in a year by smart air conditioner installed in tens & thousands of houses. By this you can understand how IoT is contributing a major share to Big Data.
Now, let us talk about the largest contributor to the Big Data which is, nothing but, social media. Social media is actually one of the most important factors in the evolution of Big Data as it provides information about the people’s behaviour.



Grocery Store Analogy


Let us take a simple example of a grocery store to understand the problems associated with Big Data and how Hadoop solved that problem.

Piyush is a businessman who has opened a small grocery store. Initially, in his grocery store, he used to receive two orders per hour and he had one worker with one shelf in his store which was sufficient enough to handle all the orders and fulfil the requirements

Now let us compare the restaurant example with the traditional scenario where data was getting generated at a steady rate and our traditional systems like RDBMS is capable enough to handle it, just like Piyush's worker. Here, you can relate the data storage with the groceries store shelf and the traditional processing unit with the worker.


After few months, Piyush thought of expanding his business and therefore, he started taking online orders and added vegatables section in the store in order to engage a larger audience. Because of this transition, the rate at which they were receiving orders rose to an alarming figure of 10 orders per hour and it became quite difficult for a single worker to cope up with the current situation. Aware of the situation in processing the orders, Piyush started thinking about the solution. 

Similarly, in Big Data scenario, the data started getting generated at an alarming rate because of the introduction of various data growth drivers such as social media, smartphones etc. Now, the traditional system, just like worker in Piyush’s store, was not efficient enough to handle this sudden change. Thus, there was a need for a different kind of solutions strategy to cope up with this problem. 

After a lot of research, Piyush came up with a solution where he hired 4 more workers to tackle the huge rate of orders being received. Everything was going quite well but, this solution led to one more problem. Since four worker were sharing the same store bag, the very shelf becoming the bottleneck of the whole process. Hence, the solution was not that efficient as Piyush had thought.

Similarly, to tackle the problem of processing huge datasets, multiple processing units were installed so as to process the data parallelly (just like Piyush hired 4 workers). But even in this case bringing multiple processing units was not an effective solution because: the centralized storage unit is the bottleneck. In other words, the performance of the whole system is driven by the performance of the central storage unit. Therefore, the moment our central storage goes down, the whole system gets compromised. Hence, again there was a need to resolve this single point of failure. 

Piyush came up with another efficient solution, he divided all the workers in two hierarchies, i.e. junior and head worker and assigned each junior worker with a shelf. Let us assume that the order is 1 kg of sugar and 2 kg of wheat. Now, according to Piyush’s plan, one junior worker will get the sugar and the other junior worker will get wheat. Moving ahead they will transfer both sugar and wheat to the head worker, where the head worker will combine both products, which then will be delivered as the final order.

Hadoop functions in a similar fashion as Piyush’s grocry store. As the store shelf is distributed in Piyush’s grocery, similarly, in Hadoop, the data is stored in a distributed fashion with replications, to provide fault tolerance. For parallel processing, first the data is processed by the slaves where it is stored for some intermediate results and then those intermediate results are merged by master node to send the final result.

Major challenges with Big Data

  1. The first problem is storing the colossal amount of data. Storing huge data in a traditional system is not possible. The reason is obvious, the storage will be limited to one system and the data is increasing at a tremendous rate.
  2. The second problem is storing heterogeneous data. Now we know that storing is a problem, but let me tell you it is just one part of the problem. The data is not only huge, but it is also present in various formats i.e. unstructured, semi-structured and structured. So, you need to make sure that you have a system to store different types of data that is generated from various sources.
  3. Finally let’s focus on the third problem, which is the processing speed. Now the time taken to process this huge amount of data is quite high as the data to be processed is too large.
To solve the storage issue and processing issue, two core components were created in Hadoop – HDFS and YARN. HDFS solves the storage issue as it stores the data in a distributed fashion and is easily scalable. And, YARN solves the processing issue.

Hadoop as a solution



The first problem is storing huge amount of data
As you can see in the above image, HDFS provides a distributed way to store Big Data. Your data is stored in blocks in DataNodes and you specify the size of each block. Suppose you have 158 MB of data and you have configured HDFS such that it will create 64 MB of data blocks. Now, HDFS will divide data into 3 blocks and stores it across different DataNodes. While storing these data blocks into DataNodes, data blocks are replicated on different DataNodes to provide fault tolerance.
Hadoop follows horizontal scaling instead of vertical scaling. In horizontal scaling, you can add new nodes to HDFS cluster on the run as per requirement, instead of increasing the hardware stack present in each node. 







Next problem was storing the variety of data
As you can see in the above image, in HDFS you can store all kinds of data whether it is structured, semi-structured or unstructured. In HDFS, there is no pre-dumping schema validation. It also follows write once and read many model. Due to this, you can just write any kind of data once and you can read it multiple times for finding insights.




If you can recall, the third challenge was about processing the data faster. In order to solve this, we move processing unit to data instead of moving data to processing unit. So, what does it mean by moving the computation unit to data? It means that instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where data is stored so as that each node can process a part of data in parallel. Finally, all of the intermediary output produced by each node is merged together and the final response is sent back to the client.

Hadoop Features


Reliability: When machines are working in tandem, if one of the machines fails, another machine will take over the responsibility and work in a reliable and fault tolerant fashion. Hadoop infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable. 
Economical: Hadoop uses commodity hardware (like your PC, laptop). For example, in a small Hadoop cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-10 TB hard disk and Xeon processors, but if I would have used hardware-based RAID with Oracle for the same purpose, I would end up spending 5x times more at least. So, the cost of ownership of a Hadoop-based project is pretty minimized. It is easier to maintain the Hadoop environment and is economical as well. Also, Hadoop is an open source software and hence there is no licensing cost.
Scalability: Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if you are installing Hadoop on a cloud, you don’t need to worry about the scalability factor because you can go ahead and procure more hardware and expand your setup within minutes whenever required.
Flexibility: Hadoop is very flexible in terms of ability to deal with all kinds of data. Data can be of any kind and Hadoop can store and process them all, whether it is structured, semi-structured or unstructured data.
These 4 characteristics make Hadoop a front-runner as a solution to Big Data challenges. Now that we know what is Hadoop, we can explore the core components of Hadoop.

Lets Dive-in Deep


Now that we know what is Hadoop, we can explore the core components of Hadoop.
Let us understand, what are the core components of Hadoop.

Hadoop Core Components

While setting up a Hadoop cluster, you have an option of choosing a lot of services as part of your Hadoop platform, but there are two services which are always mandatory for setting up Hadoop. One is HDFS (storage) and the other is YARN (processing). HDFS stands for Hadoop Distributed File System, which is a scalable storage unit of Hadoop whereas YARN (Yet Another Resource Negotiator) is used for process the data i.e. stored in the HDFS in a distributed and parallel fashion.
Let us go ahead with HDFS first. The main components of HDFS are: NameNode and DataNode. Let us talk about the roles of these two components in detail. 
NameNode
  • It is the master daemon that maintains and manages the DataNodes (slave nodes)
  • It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
  • It records each and every change that takes place to the file system metadata
  • If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live
  • It keeps a record of all the blocks in the HDFS and DataNode in which they are stored
  • It has high availability and federation features which I will discuss in HDFS architecture in detail
DataNode
  • It is the slave daemon which run on each slave machine
  • The actual data is stored on DataNodes
  • It is responsible for serving read and write requests from the clients
  • It is also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode
  • It sends heartbeats to the NameNode periodically to report the overall health of HDFS, by default, this frequency is set to 3 seconds

Now, let move ahead to our second fundamental unit of Hadoop i.e. YARN. YARN comprises of two major component: ResourceManager and NodeManager.

ResourceManager 
  • It is a cluster level (one for each cluster) component and runs on the master machine
  • It manages resources and schedule applications running on top of YARN
  • It has two components: Scheduler & ApplicationManager
  • The Scheduler is responsible for allocating resources to the various running applications
  • The ApplicationManager is responsible for accepting job submissions and negotiating the first container for executing the application
  • It keeps a track of the heartbeats from the Node Manager
NodeManager
  • It is a node level component (one on each node) and runs on each slave machine
  • It is responsible for managing containers and monitoring resource utilization in each container
  • It also keeps track of node health and log management
  • It continuously communicates with ResourceManager to remain up-to-date

Hadoop Ecosystem In A Nut-Shell 



Comments

Popular posts from this blog

Hadoop calculate maximum temperature explained

Hadoop Map-Reduce Word Count Java Example

Sqoop In Depth