Posts

Showing posts from August, 2017

HDFS Block Concepts

File system blocks: A file system controls how data is stored and retrieved. Without a file system, information placed on a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins. A block is the smallest unit of data that can be stored on or retrieved from the disk, and filesystems deal with data in whole blocks. Filesystem blocks are normally a few kilobytes in size. Even if a file's contents are smaller than the block size, it still occupies a full block on disk. Blocks are transparent to the user performing filesystem operations such as read and write. The need for distributed filesystems: When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesyst...
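To make the block idea concrete in HDFS terms, here is a minimal sketch (not part of the original post) that uses Hadoop's Java FileSystem API to print a file's block size and block locations; the path /user/demo/sample.txt is a hypothetical placeholder for a file in your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration (core-site.xml / hdfs-site.xml on the classpath)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with a file that actually exists in your HDFS
        Path file = new Path("/user/demo/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size : " + status.getBlockSize() + " bytes");
        System.out.println("File length: " + status.getLen() + " bytes");

        // Each BlockLocation describes one block and the datanodes that hold it
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Run against a file larger than the configured block size (128 MB by default in recent Hadoop releases), this prints one BlockLocation per block, which is exactly the unit HDFS distributes across datanodes.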

Hadoop Maximum Temperature Example Explained

Analyzing the Data with Hadoop Using MapReduce: To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job, because the MapReduce framework manages the parallel processing by itself. MapReduce divides the processing into two phases: the map phase and the reduce phase. Each phase takes its input in the form of key-value pairs, and both phases produce their output as key-value pairs. The output generated by the map phase is given to the reduce phase as its input. It is the programmer's responsibility to specify two functions: the map function and the reduce function. Let's take an example where the input to the map phase is the NCDC data at the link below: https://raw.githubusercontent.com/lmsamarawickrama/Hadoop-MapReduce/master/NCDC%20weather%20files/1901 Using this data, we need to calculate the maximum temperature per year. While writing MapReduce code, we choose a text input...
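As a sketch of what the post builds toward, here is the classic mapper/reducer pair for the maximum-temperature job, assuming the NCDC fixed-width record layout used in the standard Hadoop example (year in columns 15-19, signed temperature in tenths of a degree Celsius in columns 87-92, quality code in column 92); class names are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: parses each NCDC record and emits (year, temperature)
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999; // NCDC sentinel for a missing reading

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't accept a leading plus sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

// Reduce phase: receives all temperatures for a year and keeps the maximum
class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
```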

Hadoop Map-Reduce Word Count Java Example

This Hadoop tutorial aims to give developers a great start in the world of Hadoop MapReduce programming by giving them hands-on experience in developing their first Hadoop-based WordCount application. The Hadoop MapReduce WordCount example is the standard example with which Hadoop developers begin their hands-on programming. This tutorial will help Hadoop developers learn how to implement the WordCount example in MapReduce to count the number of occurrences of a given word in an input file. Prerequisites for following this Hadoop WordCount example tutorial: Hadoop must be installed, or you should have a sandbox running on VirtualBox (or VMware); if you have installed Hadoop on your machine, a single-node Hadoop cluster must be configured and running; optionally, an IDE (IntelliJ, Eclipse, or any other) should be installed. Hadoop MapReduce Example - Word Count: How does it work? The Hadoop WordCount operation occurs in 3 stages: the Mapper Phase, the Shuffle Phase, and the Reducer Ph...
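For reference, a minimal sketch of the three stages in code: the mapper emits (word, 1) pairs, the framework's shuffle phase groups them by word, and the reducer sums the counts. This follows the canonical Hadoop WordCount; class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper phase: emits (word, 1) for every token in the input line
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer phase: after the shuffle groups values by word, sums the counts
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```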

Hadoop Introduction

Introduction to Big Data and Hadoop: The problem that led to Hadoop. Before getting into the technicalities in this Hadoop tutorial blog, let me begin with an interesting story about how Hadoop came into the picture and why it is so popular in the industry nowadays. It all started with two people, Doug Cutting and Mike Cafarella, who were in the process of building a search engine system that could index 1 billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. However, they soon realised that their architecture would not be capable of scaling to the billions of pages on the web. They then came across a paper, published in 2003, that described the architecture of Google's distributed file system, called GFS, which was being used in production at Google. This paper on GFS proved to be exactly what they were looking for, and...