Introduction to Sqoop Part 1
Introduction
To use Hadoop for analytics, we first need to load the data into a Hadoop cluster. Once loaded, the data can be processed with tools such as MapReduce, Hive, or Pig.
Sqoop is used to import data from an RDBMS into the Hadoop Distributed File System (HDFS), and to export data from HDFS back to an RDBMS. Loading gigabytes or terabytes of data into HDFS from production databases, or accessing it from MapReduce applications, is a challenging task. While doing so we have to consider data consistency, the overhead these jobs place on production systems, and whether the whole process is efficient at all. Writing ad hoc batch scripts to load data is an inefficient way to go.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data
Sqoop Connectors
Sqoop has an extension framework that makes it possible to import data from—and export data to—any external storage system that has bulk data transfer capabilities. A Sqoop connector is a modular component that uses this framework to enable Sqoop imports and exports. Sqoop ships with connectors for working with a range of popular databases, including MySQL, PostgreSQL, Oracle, SQL Server, DB2, and Netezza. There is also a generic JDBC connector for connecting to any database that supports Java’s JDBC protocol. Sqoop provides optimized MySQL, PostgreSQL, Oracle, and Netezza connectors that use database-specific APIs to perform bulk transfers more efficiently.
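For databases that do not have a specialized connector, the generic JDBC connector can be used by pointing Sqoop at the JDBC driver class explicitly. As a rough sketch (the driver class, host, database, and credentials below are placeholders, not taken from this post), such an import might look like this -
sqoop import --driver com.example.jdbc.Driver --connect jdbc:example://dbhost:3306/salesdb --table ORDERS --username dbuser -P
Here --driver names the JDBC driver class to load, and -P prompts for the database password instead of putting it on the command line.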
MySQL Example -
A generic Sqoop import command looks like this -
sqoop import --connect jdbc:mysql://localhost/db --table TABLENAME
Let's say we have a MySQL database "website", and inside that database we have a table USERS whose data we want to import into Hadoop. Below is the CREATE TABLE statement for USERS -
CREATE TABLE USERS (
user_id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(128) NOT NULL,
state CHAR(2),
email VARCHAR(128),
password CHAR(64)
);
Importing this table into HDFS could be done with the command
sqoop import --connect jdbc:mysql://localhost/website --table USERS -m 1
Sqoop’s import tool will run a MapReduce job that connects to the MySQL database and reads the table. By default, this will use four map tasks in parallel to speed up the import process. Each task will write its imported results to a different file, but all in a common directory. If you want to override the number of mappers to be run, you can do that by specifying the -m option. In the above example we have set the number of mappers to 1 (-m 1).
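Once the job finishes, you can inspect the imported files directly on HDFS. A minimal sketch, assuming the default target of a USERS sub-directory under your HDFS home directory and the default part-file naming -
hadoop fs -ls USERS
hadoop fs -cat USERS/part-m-00000 | head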
By default, Sqoop will generate comma-delimited text files for the imported data. If you want to change the delimiter, you can use --fields-terminated-by while importing.
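For example, a tab-delimited import of the same USERS table might look like this (a sketch reusing the table from above) -
sqoop import --connect jdbc:mysql://localhost/website --table USERS --fields-terminated-by '\t' -m 1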
To import the table straight into Hive instead, you can run -
sqoop --connect jdbc:mysql://localhost/website --table USERS --local --hive-import
The --local option instructs Sqoop to take advantage of a local MySQL connection, which performs very well. The --hive-import option means that after reading the data into HDFS, Sqoop will connect to the Hive metastore, create a table named USERS with the same columns and types (translated into their closest analogues in Hive), and load the data into the Hive warehouse directory on HDFS (instead of a sub-directory of your HDFS home directory).
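Once the Hive import completes, the table can be queried from Hive like any other table; for example -
hive -e "SELECT state, COUNT(*) FROM USERS GROUP BY state;"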
If MySQL is not installed on the same node but on some other machine, replace localhost in the connect string with the hostname or IP address of the database server.
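A remote import might then look like this (the IP address, username, and password prompt below are placeholders) -
sqoop import --connect jdbc:mysql://192.168.1.10/website --table USERS --username dbuser -P -m 1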
Different File Formats For Sqoop -
Sqoop is capable of importing into a few different file formats. Text files (the default) offer a human-readable representation of data, platform independence, and the simplest structure. However, they cannot hold binary fields (such as database columns of type VARBINARY).
Sqoop also supports SequenceFiles, Avro datafiles, and Parquet files. These binary formats provide the most precise representation possible of the imported data. They also allow data to be compressed while retaining MapReduce’s ability to process different sections of the same file in parallel.
Suppose you wanted to work with this data in MapReduce and weren’t concerned with Hive. When storing this table in HDFS, you might want to take advantage of compression, so you’d like to be able to store the data in SequenceFiles. In this case, you might want to import the data with the command:
sqoop --connect jdbc:mysql://localhost/website --table USERS --as-sequencefile
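Avro and Parquet output can be requested in the same way with the corresponding flags; a sketch using the sqoop import form -
sqoop import --connect jdbc:mysql://localhost/website --table USERS --as-avrodatafile -m 1
sqoop import --connect jdbc:mysql://localhost/website --table USERS --as-parquetfile -m 1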