Posts

Sqoop In Depth

In an earlier post we covered why Sqoop is needed, what the Sqoop commands are, and which file formats Sqoop supports. Now let's dive into the functional side of Sqoop: how Sqoop imports/exports data from/to an RDBMS. Sqoop Import Sqoop's import tool runs a MapReduce job that connects to the MySQL database and reads the table. By default, this uses four map tasks in parallel, and each task writes its imported results to a different file. If a distributed Hadoop cluster is being used, localhost (in jdbc:mysql://localhost/dbname) should not be specified in the connect string, because map tasks not running on the same machine as the database will fail to connect. For example, say we have 3 nodes (machines) in a cluster with IPs 192.168.0.1, 192.168.0.2 and 192.168.0.3 (these IP addresses are just for illustration). Let's assume MySQL is installed on machine 192.168.0.2. After running sqoop import with the connection string "jdbc:mysql://localhost/dbname...
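As a quick sketch of what that looks like on the command line, the import below points the connect string at the node actually running MySQL (192.168.0.2) instead of localhost and makes the default parallelism of four map tasks explicit with -m 4. The database, table, credentials and target directory (dbname, employees, sqoop_user, /user/hadoop/employees) are assumed placeholders for illustration, not values from the post.

```
# Connect string names the MySQL host so map tasks on other nodes can reach it.
sqoop import \
  --connect jdbc:mysql://192.168.0.2/dbname \
  --username sqoop_user \
  --password sqoop_pass \
  --table employees \
  --target-dir /user/hadoop/employees \
  -m 4    # four parallel map tasks, each writing its own output file
```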

Introduction to Sqoop Part 1

Introduction To use Hadoop for analytics, the data first needs to be loaded into the Hadoop cluster. Later this data can be processed using traditional processing tools (e.g. MapReduce/Hive/Pig). Sqoop is used to import data from an RDBMS into the Hadoop Distributed File System (HDFS), and also to export data from HDFS back to the RDBMS. Loading GBs and TBs of data into HDFS from production databases, or accessing it from map-reduce applications, is a challenging task. While doing so, we have to consider things like data consistency, the overhead of running these jobs on production systems, and whether the whole process will be efficient in the end. Using batch scripts to load data is an inefficient way to go. Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities: imports individual tables or entire databases to files in HDFS; generates Java classes to allow you to interact with your imported data; provi...
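To make the two directions concrete, the pair of commands below sketches an import of a whole database into HDFS and an export of results back into an RDBMS table. The host, database, table and directory names (dbserver, dbname, orders, /user/hadoop/orders) are hypothetical placeholders, not values from the post.

```
# Import every table of a database into HDFS (one directory per table).
sqoop import-all-tables \
  --connect jdbc:mysql://dbserver/dbname \
  --username sqoop_user --password sqoop_pass

# Export processed results from HDFS back into an existing RDBMS table.
sqoop export \
  --connect jdbc:mysql://dbserver/dbname \
  --username sqoop_user --password sqoop_pass \
  --table orders \
  --export-dir /user/hadoop/orders
```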

HDFS Block Concepts

File system Blocks : A file system controls how data is stored and retrieved. Without a file system, information placed on a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins. A block is the smallest unit of data that can be stored on or retrieved from the disk, and filesystems deal with data in terms of blocks. Filesystem blocks are normally a few kilobytes in size. Even if you store content smaller than the block size, it still occupies a whole block on the disk. Blocks are transparent to the user performing filesystem operations such as read and write. Need for distributed filesystems: When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesyst...
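One simple way to see blocks in action on HDFS is to ask the FileSystem API for a file's block size and block locations. This is only a minimal sketch: the path /user/hadoop/sample.txt is an assumed example file, and the client is expected to pick up the cluster configuration from the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Assumed example path; replace with a file that exists in your HDFS.
    Path file = new Path("/user/hadoop/sample.txt");
    FileStatus status = fs.getFileStatus(file);

    // Block size the file was written with (128 MB by default in Hadoop 2.x).
    System.out.println("Block size: " + status.getBlockSize());

    // One BlockLocation per block, listing the datanodes that hold a replica.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}
```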

Hadoop calculate maximum temperature explained

Analyzing the Data with Hadoop Using Map Reduce To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job, because the MapReduce framework manages the parallel processing by itself. MapReduce divides the processing into 2 phases - the map phase and the reduce phase. Each phase takes its input in the form of key-value pairs, and both phases produce their output as key-value pairs. The output generated by the map phase is given to the reduce phase as its input. It is the programmer's responsibility to specify two functions: the map function and the reduce function. Let's take an example where the input to the map phase is the NCDC data from the link below - https://raw.githubusercontent.com/lmsamarawickrama/Hadoop-MapReduce/master/NCDC%20weather%20files/1901 Using the above data we need to calculate the maximum temperature per year. While writing a MapReduce code, we choose a text input...
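A sketch of the map and reduce functions for this query is shown below. It assumes the usual fixed-width NCDC record layout (year in characters 15-19, signed air temperature in characters 87-92, quality code in character 92) and treats 9999 as a missing reading; if your copy of the data differs, the offsets would need adjusting.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (year, temperature) for every valid reading in a line.
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {       // skip the explicit '+' sign before parsing
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

// Reduce phase: for each year, keep the maximum of all temperatures it received.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
```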

Hadoop Map-Reduce Word Count Java Example

This Hadoop tutorial aims to give developers a great start in the world of Hadoop MapReduce programming by giving them hands-on experience in developing their first Hadoop-based WordCount application. The Hadoop MapReduce WordCount example is the standard example with which Hadoop developers begin their hands-on programming. This tutorial will help Hadoop developers learn how to implement the WordCount example code in MapReduce to count the number of occurrences of a given word in the input file. Pre-requisites to follow this Hadoop WordCount Example Tutorial: Hadoop must be installed, or you should have a sandbox running in VirtualBox (or VMware). In case you have installed Hadoop on your machine, a single-node Hadoop cluster must be configured and running. Optional: an IDE must be installed (IntelliJ, Eclipse or any other IDE). Hadoop Map Reduce Example - Word Count – How does it work? The Hadoop WordCount operation occurs in 3 stages – Mapper Phase, Shuffle Phase, Reducer Ph...
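A minimal sketch of those three pieces (mapper, reducer and driver) is given below, along the lines of the standard Apache WordCount example; the class names and the use of args[0]/args[1] for the input and output paths are just conventions for this sketch.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper phase: split each line into words and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer phase: after the shuffle, sum the counts received for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires the mapper and reducer together and submits the job.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```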

Hadoop Introduction

Introduction to big data and hadoop A problem that led to hadoop -- Before getting into the technicalities in this Hadoop tutorial blog, let me begin with an interesting story about how Hadoop came into the picture and why it is so popular in the industry nowadays. It all started with two people, Doug Cutting and Mike Cafarella, who were in the process of building a search engine system that could index 1 billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. However, they soon realised that their architecture would not be capable of working with billions of pages on the web. They came across a paper, published in 2003, that described the architecture of Google's distributed file system, called GFS, which was being used in production at Google. Now, this paper on GFS proved to be exactly what they were looking for, and...