HDFS Block Concepts

File system Blocks
A file system controls how data is stored and retrieved. Without a file system, information placed on a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins.
A block is the smallest unit of data that can be stored on or retrieved from the disk. Filesystems deal with data in blocks, and filesystem blocks are normally a few kilobytes in size. Even if a file's contents are smaller than the block size, it still occupies a full block on the disk. Blocks are transparent to the user performing filesystem operations such as read and write.
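As a small illustration of that last point, the space a file actually takes on a block-based local filesystem can be estimated by rounding its length up to the next block boundary. This is only a sketch; the 4 KB block size and 1000-byte file are assumptions for the example.

// Illustrative sketch: a file smaller than the block size still occupies a whole block.
// The 4 KB block size is an assumption for the example.
public class BlockRounding {
    public static void main(String[] args) {
        long blockSize = 4 * 1024;   // assumed 4 KB filesystem block
        long fileSize = 1000;        // a 1000-byte file
        long onDisk = ((fileSize + blockSize - 1) / blockSize) * blockSize;
        System.out.println(fileSize + " bytes of data occupy " + onDisk + " bytes on disk");
        // prints: 1000 bytes of data occupy 4096 bytes on disk
    }
}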

The need for distributed filesystems

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.


HDFS


HDFS stands for Hadoop Distributed Filesystem, and it is the distributed filesystem that ships with Hadoop. HDFS is used to store the big data that we need to process; you can even treat HDFS as a data warehouse.



HDFS design


HDFS is the backbone of Hadoop. Since Hadoop is designed to process large datasets, its filesystem should be able to store large amounts of data. A file on HDFS could be kilobytes, megabytes, or gigabytes in size. (Ideally a file that needs to be processed is at least gigabytes in size, but supporting files in the kilobyte or megabyte range may be needed too, so HDFS should be able to store those files as well.)
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let's understand what each of these terms means:

Very large files - files that are gigabytes or terabytes in size.
Streaming data access - HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times. The time to read the whole dataset is more important than the latency in reading the first record. (A short example of this access pattern follows this list.)
Commodity hardware - commonly available hardware; HDFS does not require expensive, specialized machines.
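To make the streaming, write-once/read-many idea concrete, the sketch below opens a file on HDFS through the Java FileSystem API and reads it sequentially from start to finish. It assumes the Hadoop client libraries are on the classpath, fs.defaultFS points at the cluster, and that the path /data/input.txt exists (the path is hypothetical).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up fs.defaultFS
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input.txt");    // hypothetical path

        // Read the whole file sequentially, front to back,
        // which is the access pattern HDFS is optimised for.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            int bytesRead;
            long total = 0;
            while ((bytesRead = in.read(buffer)) > 0) {
                total += bytesRead;
            }
            System.out.println("Read " + total + " bytes sequentially");
        }
    }
}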

Blocks

As we have seen earlier, a block is the minimum amount of data that a filesystem can read or write.
Normal filesystem blocks are typically a few kilobytes in size (1 KB, 4 KB, 8 KB, and so on), whereas disk blocks are normally 512 bytes. HDFS, too, has the concept of a block, but it is a much larger unit: 128 MB by default (in earlier Hadoop versions the default block size was 64 MB).
Similar to a regular filesystem, files in HDFS are broken into block-sized chunks, which are stored as independent units. However, unlike a regular filesystem, if a file (or its final chunk) is smaller than the HDFS block size, it does not occupy a full block's worth of underlying storage. For example, if you save a 150 MB file on HDFS and the block size is 128 MB, the file occupies two blocks: the first block stores 128 MB and the remaining 22 MB is stored as a second block. That second block physically takes only 22 MB on HDFS; it does not waste the other 106 MB of the block.
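The block layout of a file already on HDFS can be inspected programmatically. The sketch below uses the Java FileSystem API to print each block's offset, length, and the DataNodes holding it. The file path is hypothetical, and the client configuration is assumed to point at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/sample-150mb.dat");   // hypothetical 150 MB file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("File length : " + status.getLen() + " bytes");
        System.out.println("Block size  : " + status.getBlockSize() + " bytes");

        // One BlockLocation per block; a 150 MB file with a 128 MB block size
        // yields two entries, the second only about 22 MB long.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}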

Why Blocks?
Benefits of having blocks:
  1. A file can be larger than any single disk
  2. Simplified storage subsystem
  3. Blocks fit well with replication
Let's look at each benefit in detail:
 1. A file can be larger than any single disk
There is nothing that requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. It is even possible for a single file to fill the storage of the entire cluster.

 2. Simplified storage subsystem
Since blocks are a fixed size, it is easy to calculate how many can be stored on a given disk. It also simplifies metadata concerns: because blocks are just chunks of data to be stored, file metadata such as permissions does not need to be stored with the blocks, so another system can handle metadata separately. (In Hadoop, the NameNode stores all the metadata related to a file.)
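Because every block is the same fixed size, working out how many blocks a disk can hold is simple division. The calculation below is illustrative only; the 1 TB disk capacity is an assumption, and 128 MB is the default HDFS block size mentioned above.

// Illustrative only: how many 128 MB blocks fit on an assumed 1 TB disk.
public class BlocksPerDisk {
    public static void main(String[] args) {
        long diskCapacity = 1024L * 1024 * 1024 * 1024;   // assumed 1 TB disk
        long blockSize = 128L * 1024 * 1024;              // default HDFS block size
        System.out.println("Blocks per disk: " + diskCapacity / blockSize);
        // prints: Blocks per disk: 8192
    }
}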
3. Blocks fit well with replication

To cope with corrupted blocks and with disk and machine failure, each block is replicated to a small number of physically separate machines (three by default). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client. A block that is no longer available due to corruption or machine failure can be re-replicated from its remaining locations to other live machines to bring the replication factor back to the normal level. Since the block is the smallest unit we deal with, if a machine goes down we only need to copy the blocks that were present on that machine to other machines.
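The replication factor is normally set cluster-wide through the dfs.replication property (3 by default), but it can also be changed for an individual file from the Java API. The sketch below is a minimal example; the file path is hypothetical and a client configured against a running cluster is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/important.dat");   // hypothetical path

        // Ask the NameNode to keep 3 copies of every block of this file.
        // HDFS re-replicates blocks in the background until the target is met.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication change accepted: " + accepted);
        System.out.println("Target replication: "
                + fs.getFileStatus(file).getReplication());
    }
}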


Comments

  Q: Then what does cloud storage mean? It seems like a distributed file system, as you have explained.
    Where does the uploaded data actually go? Does it go to cluster-based networks in those datacenters?

  A: The simplest way to answer your question is to consider HDFS as a virtual file system. It acts as a single file system but internally stores the data on different nodes, which is transparent to the end user.

