Posts

Sqoop In Depth

In an earlier post we covered why Sqoop is needed, what the Sqoop commands are, and which file formats Sqoop supports. Now let's dive into the functional side of Sqoop: how Sqoop imports/exports data from/to an RDBMS. Sqoop Import Sqoop's import tool runs a MapReduce job that connects to the MySQL database and reads the table. By default, this uses four map tasks in parallel, and each task writes its imported results to a different file. If a distributed Hadoop cluster is being used, localhost (in jdbc:mysql://localhost/dbname) should not be specified in the connect string, because map tasks not running on the same machine as the database will fail to connect. For example, say we have 3 nodes (machines) in a cluster with IPs 192.168.0.1, 192.168.0.2 and 192.168.0.3 (these IP addresses are just for illustration). Let's assume MySQL is installed on machine 192.168.0.2. After running sqoop import with the connection string "jdbc:mysql://localhost/dbname...
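As a quick sketch of what that looks like on the command line, the import below points the connect string at the node actually running MySQL (192.168.0.2) instead of localhost and makes the default parallelism of four map tasks explicit with -m 4. The database, table, credentials and target directory (dbname, employees, sqoop_user, /user/hadoop/employees) are assumed placeholders for illustration, not values from the post.

```
# Connect string names the MySQL host so map tasks on other nodes can reach it.
sqoop import \
  --connect jdbc:mysql://192.168.0.2/dbname \
  --username sqoop_user \
  --password sqoop_pass \
  --table employees \
  --target-dir /user/hadoop/employees \
  -m 4    # four parallel map tasks, each writing its own output file
```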

Introduction to Sqoop Part 1

Introduction To use Hadoop for analytics, the data first needs to be loaded into the Hadoop cluster. Later this data can be processed using traditional processing tools (e.g. MapReduce/Hive/Pig). Sqoop is used to import data from an RDBMS into the Hadoop Distributed File System (HDFS), and also to export data from HDFS back to the RDBMS. Loading GBs and TBs of data into HDFS from production databases, or accessing it from map-reduce applications, is a challenging task. While doing so, we have to consider things like data consistency, the overhead of running these jobs on production systems, and whether the whole process will be efficient in the end. Using batch scripts to load data is an inefficient way to go. Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool with the following capabilities: imports individual tables or entire databases to files in HDFS; generates Java classes to allow you to interact with your imported data; provi...
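To make the two directions concrete, the pair of commands below sketches an import of a whole database into HDFS and an export of results back into an RDBMS table. The host, database, table and directory names (dbserver, dbname, orders, /user/hadoop/orders) are hypothetical placeholders, not values from the post.

```
# Import every table of a database into HDFS (one directory per table).
sqoop import-all-tables \
  --connect jdbc:mysql://dbserver/dbname \
  --username sqoop_user --password sqoop_pass

# Export processed results from HDFS back into an existing RDBMS table.
sqoop export \
  --connect jdbc:mysql://dbserver/dbname \
  --username sqoop_user --password sqoop_pass \
  --table orders \
  --export-dir /user/hadoop/orders
```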

HDFS Block Concepts

File system Blocks : A file system controls how data is stored and retrieved. Without a file system, information placed on a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins. A block is the smallest unit of data that can be stored on or retrieved from the disk, and filesystems deal with data in terms of blocks. Filesystem blocks are normally a few kilobytes in size. Even if you store content smaller than the block size, it still occupies a whole block on the disk. Blocks are transparent to the user performing filesystem operations such as read and write. Need for distributed filesystems: When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesyst...
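One simple way to see blocks in action on HDFS is to ask the FileSystem API for a file's block size and block locations. This is only a minimal sketch: the path /user/hadoop/sample.txt is an assumed example file, and the client is expected to pick up the cluster configuration from the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Assumed example path; replace with a file that exists in your HDFS.
    Path file = new Path("/user/hadoop/sample.txt");
    FileStatus status = fs.getFileStatus(file);

    // Block size the file was written with (128 MB by default in Hadoop 2.x).
    System.out.println("Block size: " + status.getBlockSize());

    // One BlockLocation per block, listing the datanodes that hold a replica.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}
```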

Hadoop calculate maximum temperature explained

Analyzing the Data with Hadoop Using Map Reduce To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job, because the MapReduce framework manages the parallel processing by itself. MapReduce divides the processing into 2 phases - the map phase and the reduce phase. Each phase takes its input in the form of key-value pairs, and both phases produce their output as key-value pairs. The output generated by the map phase is given to the reduce phase as its input. It is the programmer's responsibility to specify two functions: the map function and the reduce function. Let's take an example where the input to the map phase is the NCDC data from the link below - https://raw.githubusercontent.com/lmsamarawickrama/Hadoop-MapReduce/master/NCDC%20weather%20files/1901 Using the above data we need to calculate the maximum temperature per year. While writing a MapReduce code, we choose a text input...
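A sketch of the map and reduce functions for this query is shown below. It assumes the usual fixed-width NCDC record layout (year in characters 15-19, signed air temperature in characters 87-92, quality code in character 92) and treats 9999 as a missing reading; if your copy of the data differs, the offsets would need adjusting.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (year, temperature) for every valid reading in a line.
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {       // skip the explicit '+' sign before parsing
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

// Reduce phase: for each year, keep the maximum of all temperatures it received.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
```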

Hadoop Map-Reduce Word Count Java Example

This Hadoop tutorial aims to give developers a great start in the world of Hadoop MapReduce programming by giving them hands-on experience in developing their first Hadoop-based WordCount application. The Hadoop MapReduce WordCount example is the standard example with which Hadoop developers begin their hands-on programming. This tutorial will help Hadoop developers learn how to implement the WordCount example code in MapReduce to count the number of occurrences of a given word in the input file. Pre-requisites to follow this Hadoop WordCount Example Tutorial: Hadoop must be installed, or you should have a sandbox running in VirtualBox (or VMware). In case you have installed Hadoop on your machine, a single-node Hadoop cluster must be configured and running. Optional: an IDE must be installed (IntelliJ, Eclipse or any other IDE). Hadoop Map Reduce Example - Word Count – How does it work? The Hadoop WordCount operation occurs in 3 stages – Mapper Phase, Shuffle Phase, Reducer Ph...
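A minimal sketch of those three pieces (mapper, reducer and driver) is given below, along the lines of the standard Apache WordCount example; the class names and the use of args[0]/args[1] for the input and output paths are just conventions for this sketch.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper phase: split each line into words and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer phase: after the shuffle, sum the counts received for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires the mapper and reducer together and submits the job.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```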

Hadoop Introduction

Introduction to big data and hadoop A problem that led to hadoop -- Before getting into the technicalities in this Hadoop tutorial blog, let me begin with an interesting story about how Hadoop came into the picture and why it is so popular in the industry nowadays. It all started with two people, Doug Cutting and Mike Cafarella, who were in the process of building a search engine system that could index 1 billion pages. After their research, they estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. However, they soon realised that their architecture would not be capable of working with billions of pages on the web. They came across a paper, published in 2003, that described the architecture of Google's distributed file system, called GFS, which was being used in production at Google. Now, this paper on GFS proved to be exactly what they were looking for, and...