Posts

STEPS TO RUN WORD COUNT MAP-REDUCE

If you already have Hadoop installed (the Hadoop tar file extracted), please ignore the installation steps below.
Installation:
==> Replace "user" with the appropriate user name on your system.
terminal> cd /home/user
==> To download the tarball, either paste the link below into a web browser (which will start the download) or use the wget command.
terminal> wget http://redrockdigimark.com/apachemirror/hadoop/common/stable/hadoop-2.9.1.tar.gz
==> Untar the file using the command below.
terminal> tar -xzf hadoop-2.9.1.tar.gz
==> Use the command below to check whether the files have been extracted successfully.
terminal> cd hadoop-2.9.1/bin
==> The ls command below will list all the commands related to Hadoop.
terminal> ls
==> The command below will show you the files in the current directory…

HADOOP MAP-REDUCE WORD COUNT JAVA CODE

WordCount.java

package org.myorg.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
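  // --- The post excerpt is truncated above. What follows is a minimal completion,
  // --- sketched to match the stock Hadoop WordCount example; it is an assumption
  // --- on the editor's part, not text recovered from the original post.

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] is the HDFS input path, args[1] the output path (must not already exist).
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}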

Sqoop In Depth

In an earlier post we saw why we need Sqoop, what the Sqoop commands are, and which file formats Sqoop supports. Now let's dive into the functional side of Sqoop: how Sqoop imports/exports data from/to an RDBMS.
Sqoop Import
Sqoop's import tool runs a MapReduce job that connects to the MySQL database and reads the table. By default, it uses four map tasks in parallel. Each task writes its imported results to a different file. If a distributed Hadoop cluster is being used, localhost (as in jdbc:mysql://localhost/dbname) should not be specified in the connect string, because map tasks that are not running on the same machine as the database will fail to connect. For example, suppose we have 3 nodes (machines) in a cluster with the IPs 192.168.0.1, 192.168.0.2 and 192.168.0.3 (consider these IP addresses just for example purposes), and MySQL is installed on machine 192.168.0.2. After running sqoop import with the connect string "jdbc:mysql://localhost/dbname", the map tasks launched on 192.168.0.1 and 192.168.0.3 would look for MySQL on their own machines and fail to connect…
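To make that point concrete, here is a minimal sketch (not taken from the post) that launches a Sqoop import from Java through org.apache.sqoop.Sqoop.runTool, assuming Sqoop 1.4.x is on the classpath; the same arguments could equally be passed to the sqoop command line. Note that the connect string points at the database host's address (192.168.0.2 from the example above) rather than localhost, and -m 4 requests the default four parallel map tasks. The database name, table, credentials, and target directory are placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // The connect string uses the MySQL host's address, NOT localhost,
        // so that map tasks running on other cluster nodes can reach it.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://192.168.0.2/dbname",   // placeholder database
            "--username", "sqoopuser",                        // placeholder credentials
            "--password", "sqooppass",
            "--table", "employees",                           // placeholder table
            "--target-dir", "/user/hadoop/employees",         // placeholder HDFS directory
            "-m", "4"                                         // four parallel map tasks (the default)
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}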

Introduction to Sqoop Part 1

Introduction
To use Hadoop for analytics, we first need to load the data into the Hadoop cluster; later we can process that data with tools such as MapReduce, Hive, or Pig. Sqoop is used to import data from an RDBMS into the Hadoop Distributed File System (HDFS), and to export data from HDFS back to an RDBMS. Loading GBs and TBs of data into HDFS from production databases, or accessing it from MapReduce applications, is a challenging task. While doing so, we have to consider things like data consistency, the overhead of running these jobs on production systems, and whether the process will be efficient at all. Using batch scripts to load data is an inefficient way to go. Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities:
Imports individual tables or entire databases to files in HDFS
Generates Java classes to allow you to interact with your imported data
Provides the ability to…
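The post also mentions the export direction, so here is a small companion sketch (again the editor's own illustration, not from the post) that calls Sqoop's export tool through the same org.apache.sqoop.Sqoop.runTool entry point, pushing the files under an HDFS directory back into an RDBMS table. The host, database, table, directory, and credentials are placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopExportExample {
    public static void main(String[] args) {
        String[] exportArgs = {
            "export",
            "--connect", "jdbc:mysql://dbhost/dbname",      // placeholder host/database
            "--username", "sqoopuser",                      // placeholder credentials
            "--password", "sqooppass",
            "--table", "daily_summary",                     // placeholder target table
            "--export-dir", "/user/hadoop/daily_summary"    // HDFS directory holding the data to export
        };
        System.exit(Sqoop.runTool(exportArgs));
    }
}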

HDFS Block Concepts

File system blocks: a file system is used to control how data is stored and retrieved. Without a file system, information placed in a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins. A block is the smallest unit of data that can be stored on or retrieved from the disk. Filesystems deal with data stored in blocks, and filesystem blocks are normally a few kilobytes in size. Even if you store contents smaller than the block size, they still occupy a full block on disk. Blocks are transparent to the user who is performing filesystem operations like read and write.
Need for distributed filesystems
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network based, all the complications of network programming kick in, making distributed filesystems more complex than regular disk filesystems…
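Because blocks are what HDFS actually spreads across machines, the following sketch (the editor's own illustration, assuming a reachable HDFS and an existing file at the placeholder path /user/hadoop/sample.txt) uses Hadoop's FileSystem API to print the default block size and the datanodes that hold each block of a file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/sample.txt");   // placeholder path
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");

        // List every block of the file and the hosts that store a replica of it.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}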

Hadoop calculate maximum temperature explained

Analyzing the Data with Hadoop Using MapReduce
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job, because the MapReduce framework manages the parallel processing by itself. MapReduce divides the processing into two phases: the map phase and the reduce phase. Each phase takes its input in the form of key-value pairs, and both phases produce their output as key-value pairs. The output generated by the map phase is given to the reduce phase as input. It is the programmer's responsibility to specify two functions: the map function and the reduce function. Let's take an example where the input to the map phase is the NCDC weather data from the link below:
https://raw.githubusercontent.com/lmsamarawickrama/Hadoop-MapReduce/master/NCDC%20weather%20files/1901
Using the above data we need to calculate the maximum temperature per year. While writing the MapReduce code, we choose a text input format (which is a…
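To make the map side of that query concrete, here is a minimal sketch of a mapper that pulls the year and air temperature out of each fixed-width NCDC record. The character offsets used here (15-19 for the year, 87-92 for the temperature, 92 for the quality code) follow the classic NCDC layout and should be treated as assumptions to verify against your copy of the data; the reduce function then only has to keep the maximum value it sees for each year key.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: (line offset, record line) -> (year, air temperature)
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;   // NCDC marker for a missing reading

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);              // assumed year offset

        int airTemperature;
        if (line.charAt(87) == '+') {                      // parseInt does not accept a leading plus sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }

        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}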

Hadoop Map-Reduce Word Count Java Example

This Hadoop tutorial aims to give developers a great start in the world of Hadoop MapReduce programming by giving them hands-on experience in developing their first Hadoop-based WordCount application. The Hadoop MapReduce WordCount example is the standard example with which Hadoop developers begin their hands-on programming. This tutorial will help Hadoop developers learn how to implement the WordCount example code in MapReduce to count the number of occurrences of a given word in the input file.
Pre-requisites to follow this Hadoop WordCount Example Tutorial
Hadoop must be installed, or you should have a sandbox running on VirtualBox (or VMware). In case you have installed Hadoop on your machine, a single-node Hadoop cluster must be configured and running. Optional: an IDE (IntelliJ, Eclipse, or any other IDE).
Hadoop MapReduce Example - Word Count: how does it work?
The Hadoop WordCount operation occurs in 3 stages: the Mapper phase, the Shuffle phase, and the Reducer phase…
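To make those three stages concrete, here is a plain-Java simulation (the editor's own illustration, not Hadoop API code) of what happens to a tiny input such as "Hello World Hello": the Mapper step emits a (word, 1) pair per token, the Shuffle step groups the pairs by key, and the Reducer step sums each group.

import java.util.*;

public class WordCountStages {
    public static void main(String[] args) {
        String input = "Hello World Hello";

        // Mapper phase: emit a (word, 1) pair for every token.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String token : input.split("\\s+")) {
            mapped.add(new AbstractMap.SimpleEntry<>(token, 1));
        }

        // Shuffle phase: group the values of identical keys together.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reducer phase: sum the grouped values for each key.
        for (Map.Entry<String, List<Integer>> entry : shuffled.entrySet()) {
            int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + sum);   // prints Hello 2, World 1
        }
    }
}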