300 Big Data Hadoop Interview Questions And Answers For Experienced And Freshers

Big Data Hadoop Interview Questions And Answers
Welcome to Interview Questions World. Here we are providing 300 Big Data Hadoop interview questions and answers for experienced candidates and freshers. Big Data and Hadoop are among the most in-demand technologies in the current job market. We hope this blog post helps you prepare for your next Big Data Hadoop interview.


 Big Data Hadoop Interview Questions



What is Big Data?

Big data refers to data sets so large that they exceed the processing capacity of conventional database systems and require special parallel processing mechanisms. The data is too big and grows rapidly, and it can be either structured or unstructured. To retrieve meaningful information from this data, we must choose an alternative way to process it.

Big data is data that exceeds the processing capacity of traditional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.


What is Hadoop?

Hadoop is an open source framework written in Java by the Apache Software Foundation. Hadoop is used to write software applications that need to process large-scale data (usually called 'Big Data'). It supports distributed processing of large data sets across clusters of commodity servers and processes data in a reliable, fault-tolerant manner.


Name some organizations that generate Big Data?

Facebook, Google.


Why do we need Hadoop?

For some big organizations, such as Facebook and Google, data grows very rapidly. Every day an enormous amount of data is dumped on their machines. Storing this data is not the major problem; analyzing it and retrieving meaningful information from it is the real challenge. On top of that, the data may be stored on multiple machines and at multiple locations, which multiplies the challenge. In this situation the need arises for a system that can process this huge amount of distributed data at relatively high speed.


What is NoSQL?

NoSQL is a whole new way of thinking about a database. NoSQL is not a relational database. The reality is that a relational database model may not be the best solution for all situations. The easiest way to think of NoSQL is as a database which does not adhere to the traditional relational database management system (RDBMS) structure. Sometimes you will also see it referred to as 'not only SQL'.


We have already SQL then Why NoSQL?

NoSQL offers high performance with high availability, a rich query language and easy scalability.

NoSQL is gaining momentum, and is supported by Hadoop, MongoDB and others. The NoSQL Database site is a good reference for someone looking for more information.


What is Hadoop and where did Hadoop come from?

By Mike Olson: The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textural and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun-off from that. Yahoo has played a key role developing Hadoop for enterprise applications.


What problems can Hadoop solve?

The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That’s exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.


What is the Difference between Hadoop and Apache Hadoop?

There is no difference. Hadoop, formally called Apache Hadoop, is an Apache Software Foundation project.


Does NoSQL follow the relational DB model?

No.


Why would NoSQL be better than using a SQL Database? And how much better is it?

It would be better when your site needs to scale so massively that the best RDBMS running on the best hardware you can afford, optimized as much as possible, simply can't keep up with the load. How much better it is depends on the specific use case (lots of update activity combined with lots of joins is very hard on "traditional" RDBMSs) - it could well be a factor of 1000 in extreme cases.


Name the modes in which Hadoop can run?

Hadoop can be run in one of three modes:

i. Standalone (or local) mode
ii. Pseudo-distributed mode
iii. Fully distributed mode


What do you understand by Standalone (or local) mode?

There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.


What is Pseudo-distributed mode?

The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.


What does /var/hadoop/pids do?

It stores the process IDs (PIDs) of the Hadoop daemons.


What is the full form of HDFS?

Hadoop Distributed File System


What is the idea behind HDFS?

HDFS is built around the idea that the most efficient approach to storing data for processing is to optimize it for a write-once, read-many access pattern.


Where does HDFS fail?

HDFS cannot support a large number of small files, because the filesystem metadata grows with every new file, and hence it is not able to scale to billions of files. This metadata is loaded into memory, and since memory is limited, so is the number of files supported.


What are the ways of backing up the filesystem metadata?

There are 2 ways of backing up the filesystem metadata, which maps different filenames to their data stored as different blocks on various data nodes:

Writing the filesystem metadata persistently onto a local disk as well as on a remote NFS mount.

Running a secondary namenode.


What is Namenode in Hadoop?

Namenode is the node which stores the filesystem metadata i.e. which file maps to what block locations and which blocks are stored on which datanode.


What is DataNode in Hadoop?

DataNode is the node which stores the actual data blocks of files in HDFS and serves read and write requests from clients. Each DataNode reports the blocks it holds to the NameNode.


What is Secondary NameNode?

The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of HDFS in the cluster. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well.


What is JobTracker in Hadoop?

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.


What are the functions of JobTracker in Hadoop?

Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they are running.

If a task fails, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.

There is only one JobTracker daemon per Hadoop cluster. It is typically run on a server as a master node of the cluster.


What is MapReduce in Hadoop?

Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on compute clusters of commodity hardware. It is a sub-project of the Apache Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks.



Hadoop Interview Questions And Answers




What is Hadoop?

Hadoop is a distributed computing platform written in Java. It incorporates features inspired by the Google File System and MapReduce.

What platform and Java version is required to run Hadoop?

Java 1.6.x or a higher version is good for Hadoop, preferably from Sun. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS/X and Solaris are also known to work.

What kind of Hardware is best for Hadoop?

Hadoop can run on dual processor/dual core machines with 4-8 GB RAM using ECC memory. The ideal hardware depends on the workflow needs.

What are the most common input formats defined in Hadoop?

These are the most common input formats defined in Hadoop:
TextInputFormat
KeyValueInputFormat
SequenceFileInputFormat
TextInputFormat is the default input format.
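As a rough sketch (not from the original post), the input format can be overridden on the Job object when the default is not suitable; note that in the Java API the key/value variant is actually named KeyValueTextInputFormat:

// Hypothetical driver fragment (new MapReduce API); 'conf' is assumed to be
// an existing org.apache.hadoop.conf.Configuration instance.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

Job job = Job.getInstance(conf, "kv-input-example");
// Each input line is split at the first tab into a Text key and a Text value.
job.setInputFormatClass(KeyValueTextInputFormat.class);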


What is InputSplit in Hadoop? Explain.

When a Hadoop job runs, it splits input files into chunks and assigns each split to a mapper for processing. Each such chunk is called an InputSplit.

How many InputSplits will the Hadoop framework make for files of 64K, 65MB and 127MB (with a 64MB block size)?

Hadoop will make 5 splits, as follows:
One split for the 64K file
Two splits for the 65MB file, and
Two splits for the 127MB file


What is the use of RecordReader in Hadoop?

An InputSplit is assigned a chunk of work but does not know how to access it. The RecordReader class is responsible for loading the data from its source and converting it into key-value pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

What is JobTracker in Hadoop?

JobTracker is a service within Hadoop which runs MapReduce jobs on the cluster.

What are the functionalities of JobTracker?

These are the main tasks of JobTracker:
To accept jobs from clients.
To communicate with the NameNode to determine the location of the data.
To locate TaskTracker nodes with available slots.
To submit the work to the chosen TaskTracker node and monitor the progress of each task.

Define TaskTracker?

TaskTracker is a node in the cluster that accepts tasks like MapReduce and Shuffle operations from a JobTracker.

What is Map/Reduce job in Hadoop?

Map/Reduce is a programming paradigm that allows massive scalability across thousands of servers.

MapReduce actually refers to two different and distinct tasks that Hadoop performs. In the first step, the map job takes a set of data and converts it into another set of data (key-value pairs). In the second step, the reduce job takes the output from the map as input and combines those data tuples into a smaller set of tuples.
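To make the two steps concrete, here is a minimal word-count sketch using the new org.apache.hadoop.mapreduce API (illustrative only; the class names are made up for this example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: convert each input line into (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce step: combine the (word, 1) tuples into (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}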

What is Hadoop Streaming?

Hadoop Streaming is a utility which allows you to create and run map/reduce jobs. It is a generic API that allows programs written in virtually any language to be used as the Hadoop mapper and reducer.

What is a combiner in Hadoop?

A Combiner is a mini-reduce process which operates only on data generated by a Mapper. When the Mapper emits data, the Combiner receives it as input and sends its output to the Reducer.
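As a hedged illustration (reusing the hypothetical WordCountMapper and WordCountReducer from the sketch above), the combiner is simply registered on the job; word-count reducers can usually double as combiners because summing is associative:

// Hypothetical driver fragment: the combiner runs on each mapper's local output,
// shrinking the data before it is shuffled to the reducers.
Job job = Job.getInstance(conf, "word-count");
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);  // mini-reduce on the map side
job.setReducerClass(WordCountReducer.class);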

Is it necessary to know java to learn Hadoop?

A background in any programming language like C, C++, PHP, Python or Java is really helpful, but if you know no Java at all, it is necessary to learn Java and also get basic knowledge of SQL.

How to debug Hadoop code?

There are many ways to debug Hadoop code, but the most popular methods are:

By using Counters (see the sketch below).

By using the web interface provided by the Hadoop framework.
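For example, a mapper can bump a custom counter whenever it skips a bad record; the totals then appear in the job's web UI and console output. This is only a sketch, with a made-up counter group and name:

// Inside a hypothetical Mapper.map() method:
String[] fields = value.toString().split(",");
if (fields.length < 3) {
    // Count malformed lines instead of failing the task.
    context.getCounter("DataQuality", "MALFORMED_RECORDS").increment(1);
    return;
}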

Is it possible to provide multiple inputs to Hadoop? If yes, explain.

Yes, it is possible. The input format class provides methods to add multiple directories as input to a Hadoop job.

What is the relation between job and task in Hadoop?

In Hadoop, a job is divided into multiple small parts known as tasks.

What is distributed cache in Hadoop?

Distributed cache is a facility provided by the MapReduce framework. It is used to cache files (text, archives, etc.) at the time of execution of the job. The framework copies the necessary files to the slave node before the execution of any task at that node.
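A minimal sketch of how this is typically wired up with the newer Job API (the file path and variable names here are illustrative, not from the original post):

// In the driver: ask the framework to ship a lookup file to every task node.
job.addCacheFile(new java.net.URI("/user/hadoop/lookup.txt"));

// In the Mapper/Reducer: the cached files are available locally during setup().
@Override
protected void setup(Context context) throws IOException {
    java.net.URI[] cached = context.getCacheFiles();
    // ... open cached[0] and load it into an in-memory structure such as a HashMap ...
}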

What commands are used to see all jobs running in the Hadoop cluster and kill a job in LINUX?

hadoop job -list

hadoop job -kill jobID

What is the functionality of JobTracker in Hadoop? How many instances of a JobTracker run on Hadoop cluster?

JobTracker is a giant service which is used to submit and track MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster, and it runs in its own JVM process.

Functionalities of JobTracker in Hadoop:

When client application submits jobs to the JobTracker, the JobTracker talks to the NameNode to find the location of the data.

It locates TaskTracker nodes with available slots for data.

It assigns the work to the chosen TaskTracker nodes.

The TaskTracker nodes are responsible for notifying the JobTracker when a task fails; the JobTracker then decides what to do. It may resubmit the task on another node, or it may mark that task as one to avoid.

How JobTracker assign tasks to the TaskTracker?

The TaskTracker periodically sends heartbeat messages to the JobTracker to assure it that it is alive. These messages also inform the JobTracker about the number of available slots, so the JobTracker knows where tasks can be scheduled.

Is it necessary to write jobs for Hadoop in Java language?

No. There are many ways to deal with non-Java code. Hadoop Streaming allows any shell command to be used as a map or reduce function.

How is Hadoop different from other parallel computing systems?

Hadoop provides a distributed file system which lets you store and handle massive amounts of data on a cluster of machines, handling data redundancy. The primary benefit is that, since data is stored on several nodes, it is better to process it in a distributed manner: each node can process the data stored on it instead of spending time moving it over the network.

On the contrary, in a relational database computing system you can query data in real time, but it is not efficient to store data in tables, records and columns when the data is huge.

Hadoop also provides a scheme to build a column-oriented database with Hadoop HBase, for run-time queries on rows.

What all modes Hadoop can be run in?

Hadoop can run in three modes:

Standalone Mode: The default mode of Hadoop; it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode no custom configuration is required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.

Pseudo-Distributed Mode (Single-Node Cluster): In this case you need configuration for all three files mentioned above. All daemons run on one node, and thus both the Master and Slave nodes are the same.

Fully Distributed Mode (Multi-Node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slaves.

What is distributed cache and what are its benefits?

Distributed Cache, in Hadoop, is a service provided by the MapReduce framework to cache files when needed. Once a file is cached for a specific job, Hadoop makes it available on each data node, both on disk and in memory, where the map and reduce tasks are executing. Later, you can easily access and read the cache file and populate any collection (like an array or hashmap) in your code.

Benefits of using distributed cache are:

It distributes simple, read-only text/data files and/or complex types like jars, archives and others. These archives are then un-archived at the slave node.

Distributed cache tracks the modification timestamps of the cache files, ensuring that the files are not modified while a job is executing.

What are the most common Input Formats in Hadoop?

There are three most common input formats in Hadoop:

Text Input Format: the default input format in Hadoop.

Key Value Input Format: used for plain text files where each line is split into a key and a value.

Sequence File Input Format: used for reading files in sequence (a binary key/value format).

Define DataNode and how does NameNode tackle DataNode failures?

DataNode stores data in HDFS; it is the node where the actual data resides in the file system. Each DataNode sends a heartbeat message to the NameNode to notify that it is alive. If the NameNode does not receive a message from a DataNode for 10 minutes, it considers it dead or out of place, and starts replication of the blocks that were hosted on that DataNode so that they are hosted on some other DataNode. A BlockReport contains the list of all blocks on a DataNode. The system then starts to replicate what was stored on the dead DataNode.

The NameNode manages the replication of data blocks from one DataNode to another. In this process, the replication data transfers directly between DataNodes, so that the data never passes through the NameNode.

What are the core methods of a Reducer?

The three core methods of a Reducer are:

setup(): this method is used for configuring various parameters like input data size and the distributed cache.
public void setup(Context context)

reduce(): the heart of the Reducer, called once per key with the associated list of values.
public void reduce(Key key, Iterable<Value> values, Context context)

cleanup(): this method is called to clean up temporary files, only once at the end of the task.
public void cleanup(Context context)
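Putting the three methods together, a skeleton Reducer might look like this (illustrative only; the generic types, property name and threshold logic are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int threshold;

    @Override
    protected void setup(Context context) {
        // Runs once per task: read job parameters, open side files, etc.
        threshold = context.getConfiguration().getInt("example.threshold", 0);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Runs once per key with all of that key's values.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        if (sum >= threshold) {
            context.write(key, new IntWritable(sum));
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once at the end of the task: close resources, delete temp files.
    }
}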

What is SequenceFile in Hadoop?

Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:

1. Uncompressed key/value records.

2. Record-compressed key/value records – only ‘values’ are compressed here.

3. Block-compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
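As a hedged example, a job can be asked to write its output as a block-compressed SequenceFile through the standard output-format classes (driver fragment only; 'job' is assumed to be an existing Job instance):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Write reducer output as a SequenceFile, compressed at the block level.
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);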

What is Job Tracker role in Hadoop?

Job Tracker’s primary functions are resource management (managing the Task Trackers), tracking resource availability and task life cycle management (tracking task progress and fault tolerance).

• It is a process that runs on a separate node, often not on a DataNode.

• Job Tracker communicates with the NameNode to identify the data location.

• It finds the best Task Tracker nodes to execute tasks on the given nodes.

• It monitors the individual Task Trackers and submits the overall job status back to the client.

• The Task Trackers, in turn, track the execution of the MapReduce work local to their slave nodes and report progress back to the Job Tracker.

What is the use of RecordReader in Hadoop?

Since Hadoop splits data into various blocks, the RecordReader is used to read the split data into a single record. For instance, if our input data is split like:

Row1: Welcome to

Row2: Intellipaat

it will be read as “Welcome to Intellipaat” using the RecordReader.

What happens if you try to run a Hadoop job with an output directory that is already present?

It will throw an exception saying that the output directory already exists. To run the MapReduce job, you need to ensure that the output directory does not already exist in HDFS.

To delete the directory before running the job, you can use the shell:

hadoop fs -rmr /path/to/your/output/

Or via the Java API: FileSystem.get(conf).delete(outputDir, true);
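A common pattern is to delete the output path in the driver before submitting the job; this is only a sketch, and the path handling is an assumption:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path outputDir = new Path("/path/to/your/output/");
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputDir)) {
    fs.delete(outputDir, true);   // 'true' = delete recursively
}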

How can you debug Hadoop code?

First, check the list of MapReduce jobs currently running. Next, check that there are no orphaned jobs running; if there are, you need to determine the location of the RM logs.

1. Run: ps -ef | grep -i ResourceManager

and look for the log directory in the displayed result. Find the job-id from the displayed list and check if there is any error message associated with that job.

2. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.

3. Now, log in to that node and run: ps -ef | grep -i NodeManager

4. Examine the Node Manager log. The majority of errors come from the user-level logs for each MapReduce job.

How to configure Replication Factor in HDFS?

hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all files placed in HDFS.
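For reference, the property in hdfs-site.xml looks like this (the value shown is just the usual default):

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication for newly created files.</description>
</property>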

You can also modify the replication factor on a per-file basis using the Hadoop FS shell:

[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file

Alternatively, you can change the replication factor of all the files under a directory:

[training@localhost ~]$ hadoop fs -setrep -w 3 -R /my/dir

How to compress mapper output but not the reducer output?

To achieve this, set the following properties in the job configuration:

conf.set("mapreduce.map.output.compress", "true");

conf.set("mapreduce.output.fileoutputformat.compress", "false");
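A slightly fuller, hedged version of the same idea, also picking a codec (Snappy is an assumption; any installed CompressionCodec works):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

Configuration conf = new Configuration();
// Compress the intermediate map output...
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
// ...but leave the final job output uncompressed.
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);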

What is the difference between Map Side join and Reduce Side Join?

A map-side join is performed before the data reaches the reduce phase, i.e. the join happens as the data is read by the map tasks, and it requires a strict structure for the inputs (sorted and identically partitioned). On the other hand, a reduce-side join (repartitioned join) is simpler than a map-side join since the input datasets need not be structured. However, it is less efficient, as it has to go through the sort and shuffle phases, which come with network overheads.

How can you transfer data from Hive to HDFS?

By writing the query:

hive> insert overwrite directory '/' select * from emp;

You can write your query for the data you want to import from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS path.

What companies use Hadoop, any idea?

Yahoo! (the biggest contributor to the creation of Hadoop) – the Yahoo search engine uses Hadoop; Facebook – developed Hive for analysis; also Amazon, Netflix, Adobe, eBay, Spotify and Twitter.



HDFS Interview Questions And Answers


1) Apache Hadoop framework is composed of which modules?

Hadoop Common, Hadoop YARN, Hadoop MapReduce and HDFS (Hadoop Distributed File System).

2) What does the term "Replication factor" denote?

The replication factor is the number of times each block of a file is replicated in HDFS.

3) What is the default replication factor in HDFS?

Three

4) Typical block size of an HDFS block?

64 MB by default (extendable to a custom-defined size such as 128 MB).

5) Explain HDFS Functionality in brief?

HDFS is a scalable, distributed storage file system used for storing large amounts of data in a replicated environment.

6) What is NameNode?

The NameNode is the bookkeeper of HDFS. It keeps track of the data files and how they get split into different file blocks, storage of various file blocks in respective nodes and overall health of the distributed file system. The administrative functions of the NameNode are highly memory and I/O intensive.

7) Explain the key functionalities of Secondary Name node?

Based on the intervals specified in the cluster configuration, the Secondary NameNode (SNN) communicates with the NameNode to take snapshots of the HDFS metadata. In a Hadoop cluster, the NameNode is the main single point of failure, and the SNN snapshots help minimize downtime and loss of data.

8) What is meant by Rack Awareness in Hadoop?

The NN (NameNode) stores the metadata about the storage location of the files, such as the rack, node and block. In Hadoop terminology, this is known as Rack Awareness.

9) Which component in Hadoop is responsible for Job scheduling and monitoring?

Job Tracker

10) Name the structure provided by the MR(Map Reduce)?

Dynamic schema

11) Explain the heartbeat mechanism in Hadoop?

At regular intervals, the NameNode gets acknowledgements from the various DataNodes regarding space allocation and free memory. Typically, a DataNode sends its heartbeat every three seconds.

12) Explain failover fence in Hadoop?

It is also known as decommissioning of DataNodes. When we want to remove DataNode machines from a cluster due to DataNode malfunction or load optimization issues, we decommission those DataNodes.

13) List all the daemons required to run the Hadoop cluster ?

NameNode
DataNode
JobTracker
TaskTracker

14) What is HDFS federation?

The process of maintaining multiple NameNodes in the Hadoop cluster environment, each managing a portion of the filesystem namespace, to improve scalability and provide better isolation and failure control over the cluster.

15) Assume that the Hadoop spawned 50 tasks for a job and one of the task failed. What will Hadoop do?

If a task fails, Hadoop will restart the task on some other TaskTracker. If the restarted task fails more than four times, Hadoop will kill the job. The maximum number of restarts allowed before a task is killed can be specified in the settings file.

16) In what format, MR process the data?

MR process data in Key-Value pairs.

17) How many input splits, the Hadoop framework will create for the scenario given below?

An MR system with an HDFS block size of 128 MB, having three files of size 64K, 300MB and 127MB, with FileInputFormat as the input format.

Hadoop will create five splits, as follows:
1 split for the 64K file
3 splits for the 300MB file
1 split for the 127MB file

18) Explain Speculative Execution?

If a task is running much more slowly than expected (for example, because its node is overloaded or failing), the JT launches a duplicate copy of the same task on another node and runs it in parallel; whichever copy finishes first is used and the other is killed. This phenomenon is known as Speculative Execution.

19) What is Hadoop Streaming?

Hadoop Streaming API allows programmers to use programs written in various programming languages as Hadoop mapper and reducer implementations.

20) What is Distributed Cache in Hadoop?

The MapReduce framework provides Distributed Cache functionality to cache the files (text, jars, archives, etc.) required by applications during job execution. Before starting any task of a job on a node, the framework copies the required files to that slave node.

What is HDFS?

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

What are the key features of HDFS?

HDFS is highly fault-tolerant, provides high throughput, is suitable for applications with large data sets, offers streaming access to file system data, and can be built out of commodity hardware.

What is Fault Tolerance?

Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also. So even if one or two of the systems collapse, the file is still available on the third system.

Replication causes data redundancy, then why is it pursued in HDFS?

HDFS works with commodity hardware (systems with average configurations) that has a high chance of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places.

Any data on HDFS gets stored at at least 3 different locations. So, even if one of them is corrupted and another is unavailable for some time for any reason, the data can still be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us to attain the Hadoop feature called fault tolerance.

Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?

Since there are 3 replicas, when we send the MapReduce programs, calculations will be done only on the original data. The master node knows which node exactly has that particular data. If one of the nodes is not responding, it is assumed to have failed. Only then will the required calculation be done on the second replica.

What is throughput? How does HDFS get a good throughput?

Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system. In HDFS, when we want to perform a task or an action, then the work is divided and shared among different systems. 

So all the systems will be executing the tasks assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this way, the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.

What is streaming access?

As HDFS works on the principle of 'Write Once, Read Many', the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.

What is a commodity hardware? Does commodity hardware include RAM?

Commodity hardware is an inexpensive system which is not of high quality or high availability. Hadoop can be installed on any average commodity hardware. We don’t need supercomputers or high-end hardware to work on Hadoop. Yes, commodity hardware includes RAM, because there will be some services running in RAM.

What is a Namenode?

Namenode is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and single point of failure in HDFS.

Is Namenode also a commodity?

No. Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. Namenode has to be a high-availability machine.

What is a metadata?

Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.

What is a Datanode?

Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.

Why do we use HDFS for applications having large data sets and not when there are lot of small files?

HDFS is more suitable for a large amount of data in a single file than for small amounts of data spread across multiple files. This is because the Namenode is a very expensive, high-performance system, so it is not prudent to fill its memory with the unnecessary amount of metadata that gets generated for many small files.

So, when there is a large amount of data in a single file, the Namenode occupies less space. Hence, for optimized performance, HDFS supports large data sets rather than multiple small files.

What is a daemon?

A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The equivalent of a daemon in Windows is a “service” and in DOS it is a “TSR”.

What is a job tracker?

Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task tracker. In a Hadoop cluster, there will be only one job tracker but many task trackers. 

It is the single point of failure for Hadoop and the MapReduce service: if the job tracker goes down, all the running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task is completed or not.

What is a task tracker?

Task tracker is also a daemon that runs on datanodes. Task Trackers manage the execution of individual tasks on slave node. When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks. 

While performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.

Is Namenode machine same as datanode machine as in terms of hardware?

It depends upon the cluster you are trying to create. The Hadoop VM can be there on the same machine or on another machine. For instance, in a single node cluster, there is only one machine, whereas in the development or in a testing environment, Namenode and datanodes are on different machines.

What is a heartbeat in HDFS?

A heartbeat is a signal indicating that a node is alive. A datanode sends a heartbeat to the Namenode, and a task tracker sends its heartbeat to the job tracker. If the Namenode or job tracker does not receive a heartbeat, they will decide that there is some problem with the datanode or that the task tracker is unable to perform the assigned task.

Are Namenode and job tracker on the same host?

No. In a practical environment, the Namenode and the job tracker run on separate hosts.

What is a ‘block’ in HDFS?

A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB as contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large as compared to disk blocks, particularly to minimize the cost of seeks.

If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?

No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed by an HDFS block and 14 MB will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.

What are the benefits of block transfer?

A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.

If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?

In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node will figure out what the actual amount of space required is, how many blocks are being used, and how much space is available, and it will allocate the blocks accordingly.

How indexing is done in HDFS?

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep storing the last part of the data, which indicates where the next part of the data will be. In fact, this is the basis of HDFS.

If a data Node is full how it’s identified?

When data is stored in datanode, then the metadata of that data will be stored in the Namenode. So Namenode will identify if the data node is full.

If datanodes increase, then do we need to upgrade Namenode?

While installing the Hadoop system, the Namenode is determined based on the size of the cluster. Most of the time we do not need to upgrade the Namenode, because it does not store the actual data but just the metadata, so such a requirement rarely arises.

Are job tracker and task trackers present in separate machines?

Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

When we send a data to a node, do we allow settling in time, before sending another data to that node?

Yes, we do.

Does hadoop always require digital data to process?

Yes. Hadoop always requires digital data to be processed.

On what basis Namenode will decide which datanode to write on?

As the Namenode has the metadata (information) related to all the data nodes, it knows which datanode is free.

Doesn’t Google have its very own version of DFS?

Yes, Google owns a DFS known as “Google File System (GFS)” developed by Google Inc. for its own use.

Who is a ‘user’ in HDFS?

A user is like you or me, who has some query or who needs some kind of data.

Is client the end user in HDFS?

No, Client is an application which runs on your machine, which is used to interact with the Namenode (job tracker) or datanode (task tracker).

What is the communication channel between client and namenode/datanode?

The client communicates with the Namenode and Datanodes over TCP using Hadoop’s RPC and data-transfer protocols; SSH is only used by the cluster start-up/shutdown scripts, not for client communication.

What is a rack?

Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.

On what basis data will be stored on a rack?

When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the client consults the Namenode and gets 3 datanodes for every block of the file which indicates where the block should be stored. 

While placing the datanodes, the key rule followed is “for every block of data, two copies will exist in one rack, and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.

Do we need to place 2nd and 3rd data in rack 2 only?

Yes, this is to avoid datanode failure.

What if rack 2 and datanode fails?

If both rack 2 and the datanode present in rack 1 fail, then there is no chance of getting the data from them. In order to avoid such situations, we need to replicate the data more times instead of replicating it only thrice. This can be done by changing the value of the replication factor, which is set to 3 by default.

What is a Secondary Namenode? Is it a substitute to the Namenode?

The Secondary Namenode periodically reads the filesystem metadata from the RAM of the Namenode and writes it to the hard disk or the file system. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes down.

What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?

In Gen 1 Hadoop, the Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as an Active and Passive Namenode structure: if the active Namenode fails, the passive Namenode takes over.

What is MapReduce?

Map Reduce is the ‘heart‘ of Hadoop that consists of two parts – ‘map’ and ‘reduce’. Maps and reduces are programs for processing data. ‘Map’ processes the data first to give some intermediate output which is further processed by ‘Reduce’ to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduction operations.

Can you explain how do ‘map’ and ‘reduce’ work?

The Namenode takes the input, divides it into parts and assigns them to data nodes. These datanodes process the tasks assigned to them, produce key-value pairs and return the intermediate output to the Reducer. The reducer collects these key-value pairs from all the datanodes, combines them and generates the final output.

What is ‘Key value pair’ in HDFS?

Key value pair is the intermediate data generated by maps and sent to reduces for generating the final output.

What is the difference between MapReduce engine and HDFS cluster?

HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. Map Reduce Engine is the programming module which is used to retrieve and analyze data.


Big Data Hadoop Interview Questions


1. What is BIG DATA?
2. Can you give some examples of Big Data?
3. Can you give a detailed overview about the Big Data being generated by Facebook?
4. According to IBM, what are the three characteristics of Big Data?
5. How Big is ‘Big Data’?
6. How analysis of Big Data is useful for organizations?
7. Who are ‘Data Scientists’?
8. What is Hadoop?
9. Why the name ‘Hadoop’?
10. Why do we need Hadoop?
11. What are some of the characteristics of Hadoop framework?
12. Give a brief overview of Hadoop history.
13. Give examples of some companies that are using Hadoop structure?
14. What is the basic difference between traditional RDBMS and Hadoop?
15. What is structured and unstructured data?
16. What are the core components of Hadoop?
17. What is HDFS?
18. What are the key features of HDFS?
19. What is Fault Tolerance?
20. Replication causes data redundancy then why is pursued in HDFS?
21. Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
22. What is throughput? How does HDFS get a good throughput?
23. What is streaming access?
24. What is a commodity hardware? Does commodity hardware include RAM?
25. What is a Namenode?
26. Is Namenode also a commodity?
27. What is a metadata?
28. What is a Datanode?
29. Why do we use HDFS for applications having large data sets and not when there are lot of small files?
30. What is a daemon?
31. What is a job tracker?
32. What is a task tracker?
33. Is Namenode machine same as datanode machine as in terms of hardware?
34. What is a heartbeat in HDFS?
35. Are Namenode and job tracker on the same host?
36. What is a ‘block’ in HDFS?
37. If a particular file is 50 mb, will the HDFS block still consume 64 mb as the default size?
38. What are the benefits of block transfer?
39. If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?
40. How indexing is done in HDFS?
41. If a data Node is full how it’s identified?
42. If datanodes increase, then do we need to upgrade Namenode?
43. Are job tracker and task trackers present in separate machines?
44. When we send a data to a node, do we allow settling in time, before sending another data to that node?
45. Does hadoop always require digital data to process?
46. On what basis Namenode will decide which datanode to write on?
47. Doesn’t Google have its very own version of DFS?
48. Who is a ‘user’ in HDFS?
49. Is client the end user in HDFS?
50. What is the communication channel between client and namenode/datanode?
51. What is a rack?
52. On what basis data will be stored on a rack?
53. Do we need to place 2nd and 3rd data in rack 2 only?
54. What if rack 2 and datanode fails?
55. What is a Secondary Namenode? Is it a substitute to the Namenode?
56. What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
57. What is MapReduce?
58. Can you explain how do ‘map’ and ‘reduce’ work?
59. What is ‘Key value pair’ in HDFS?
60. What is the difference between MapReduce engine and HDFS cluster?
61. Is map like a pointer?
62. Do we require two servers for the Namenode and the datanodes?
63. Why are the number of splits equal to the number of maps?
64. Is a job split into maps?
65. Which are the two types of ‘writes’ in HDFS?
66. Why ‘Reading‘ is done in parallel and ‘Writing‘ is not in HDFS?
67. Can Hadoop be compared to NOSQL database like Cassandra?
68. What is MapReduce?
69. What are ‘maps’ and ‘reduces’?
70. What are the four basic parameters of a mapper?
71. What are the four basic parameters of a reducer?
72. What do the master class and the output class do?
73. What is the input type/format in MapReduce by default?
74. Is it mandatory to set input and output type/format in MapReduce?
75. What does the text input format do?
76. What does job conf class do?
77. What does conf.setMapper Class do?
78. What do sorting and shuffling do?
79. What does a split do?
80. What does a MapReduce partitioner do?
81. How is Hadoop different from other data processing tools?
82. Can we rename the output file?
83. Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that?
84. What is Streaming?
85. What is a Combiner?
86. What is the difference between an HDFS Block and Input Split?
87. What happens in a textinputformat?
88. What do you know about keyvaluetextinputformat?
89. What do you know about Sequencefileinputformat?
90. What do you know about Nlineoutputformat?
91. Which are the three modes in which Hadoop can be run?
92. What are the features of Stand alone (local) mode?
93. What are the features of Pseudo mode?
94. Can we call VMs as pseudos?
95. What are the features of Fully Distributed mode?
96. Does Hadoop follows the UNIX pattern?
97. In which directory Hadoop is installed?
98. What are the port numbers of Namenode, job tracker and task tracker?
99. What is the Hadoop-core configuration?
100. What are the Hadoop configuration files at present?

Hadoop Interview Questions


What is shuffling in map reduce?
What is the difference between an HDFS Block and Input Split?
What are the mapfiles in Hadoop?
Why do we need a password-less SSH in Fully Distributed environment?
What is the use of .pagination class?
Why is Replication pursued in HDFS in spite of its data redundancy?
What are the core components of Hadoop?
What differentiates Hadoop from other parallel computing solutions?
What is the difference between Secondary Namenode, Checkpoint Namenode and Backup Node? (The Secondary Namenode is a poorly named component of Hadoop.)
What happens when a datanode fails ?
What are the Side Data Distribution Techniques?
Explain what Sequencefileinputformat is.
What is shuffling in mapreduce?
What is partitioning?
Explain what happens in textinputformat?
Can we change the file cached by Distributed Cache?
What if the job tracker machine is down?
Explain what conf.setMapperClass does?
Can we deploy job tracker other than name node?
What are the four modules that make up the Apache Hadoop framework?
How did you debug your Hadoop code?
Which modes can Hadoop be run in?
List a few features for each mode. What is Hadoop Streaming?
Where are Hadoop’s configuration files located?
What is a combiner in Hadoop?
What is the functionality of JobTracker in Hadoop?
How many instances of a JobTracker run on Hadoop cluster?
List Hadoop’s three configuration files.
Is it necessary to write jobs for Hadoop in Java language?
What are “slaves” and “masters” in Hadoop?
What commands are used to see all jobs running in the Hadoop cluster and kill a job in LINUX?
How many datanodes can run on a single Hadoop cluster?
What is identity mapper?
What is job tracker in Hadoop?
How many job tracker processes can run on a single Hadoop cluster?
What is Identity reducer?
What sorts of actions does the job tracker process perform?
What is Commodity Hardware?
How does job tracker schedule a job for the task tracker?
What are the main components of Job flow in YARN architecture ?
What does the mapred.job.tracker command do?
What are the main configuration parameters that the user needs to specify to run a MapReduce job?
What is “PID”?
What is “jps”?
Is there another way to check whether Namenode is working?
How would you restart Namenode?
What is Chain Reducer ?

If you know any answers for the above questions, write your answers in the below comments box.
