BigData / Hadoop Interview Questions
2. What is the difference between Block size & input split?
Ans: A block is a physical division of the data, whereas an input split is a logical division.
Ans: Each file occupies 2 blocks of data (block 1: 64 MB, block 2: 36 MB), and hence 100 files would occupy 200 blocks and need 200 map tasks.
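The arithmetic behind this answer can be sketched in plain Java. This is only an illustration, not Hadoop code: the class and method names are hypothetical, and it assumes one map task per block, which holds when the split size equals the block size.

```java
// Hypothetical helper for illustration; not part of Hadoop.
class BlockMath {
    // Number of blocks needed for one file, using ceiling division.
    static long blocksPerFile(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
    }

    // Total map tasks, assuming one map task per block (split size == block size).
    static long mapTasks(long fileCount, long fileSizeMb, long blockSizeMb) {
        return fileCount * blocksPerFile(fileSizeMb, blockSizeMb);
    }

    public static void main(String[] args) {
        System.out.println(blocksPerFile(100, 64));  // 2 (64 MB + 36 MB)
        System.out.println(mapTasks(100, 100, 64));  // 200
    }
}
```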
4. What is data locality optimization?
Ans: In Hadoop, execution is done near the data. A map task can be placed in three possible ways, and the JobTracker always prefers the first:
a. Same-node execution: the task runs on the TaskTracker of the datanode where the block of data is stored.
b. Off-node execution: if no TaskTracker slot is free on the datanode that holds the block, the task runs on the nearest node in the same rack, and the block is copied there over the network.
c. Off-rack execution: if no slot is free anywhere in the rack that holds the block, the task runs on a node in a different rack, and the block is moved across racks to it.
5. What is Speculative execution?
Ans: If one of the tasks of a MapReduce job is slow, it pulls down the overall performance of the job. Hence, the JobTracker continuously monitors each task for progress (reported via heartbeat signals). If a task does not make the expected progress within a given time interval, the JobTracker speculates that the task is stuck or slow and launches a duplicate of it on another node, preferably one that holds a different replica of the same block. This concept is called speculative execution. The important thing to note here is that the slow-running task is not killed right away. Both tasks run simultaneously, and only when one of them completes is the other killed.
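6. What are the different types of File permissions in HDFS?
Ans: HDFS uses read (r), write (w) and execute (x) permissions, set separately for the owner, the group and all other users. They appear in the first 10 characters of a listing: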
drwxrwxrwx user1 prog 10 Aug 16 15:02 myfolder
-rwxrwxrwx user1 prog 10 Aug 01 07:02 myfile.sas
Position 1: ‘d’ means folder, ‘-’ means file
Positions 2-4: Owner’s permissions on the file or folder
Positions 5-7: Group’s permissions on the file or folder
Positions 8-10: Permissions for all other users on the file or folder
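The position layout above can be sketched as a small parser. This is a hypothetical helper written for illustration, not an HDFS API; it treats anything other than ‘d’ in position 1 as a file.

```java
// Hypothetical helper for illustration; not an HDFS API.
class PermissionString {
    // Splits a 10-character mode string such as "drwxr-x---" into its four parts.
    static String[] parse(String mode) {
        if (mode.length() != 10) {
            throw new IllegalArgumentException("expected 10 characters: " + mode);
        }
        String type = mode.charAt(0) == 'd' ? "folder" : "file";
        return new String[] {
            type,
            mode.substring(1, 4),   // positions 2-4: owner
            mode.substring(4, 7),   // positions 5-7: group
            mode.substring(7, 10)   // positions 8-10: all other users
        };
    }

    public static void main(String[] args) {
        // prints: folder | rwx | r-x | ---
        System.out.println(String.join(" | ", parse("drwxr-x---")));
    }
}
```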
7. What is Rack-awareness?
Ans: HDFS avoids placing all replicas of a single block on the same rack. This concept is called rack awareness. If all the replicas were in one rack and that entire rack went down, there would be no way of recovering that block of data.
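The failure scenario above can be expressed as a simple check. This is a hypothetical sketch, not HDFS code: the real placement decision is made by the namenode, and the rack names here are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical check for illustration; not an HDFS API.
class RackCheck {
    // Returns true if the replicas of a block sit on at least two different racks,
    // so that losing any single rack still leaves one replica available.
    static boolean survivesRackFailure(List<String> replicaRacks) {
        Set<String> distinct = new HashSet<>(replicaRacks);
        return distinct.size() >= 2;
    }

    public static void main(String[] args) {
        System.out.println(survivesRackFailure(Arrays.asList("rack1", "rack1", "rack1"))); // false
        System.out.println(survivesRackFailure(Arrays.asList("rack1", "rack2", "rack2"))); // true
    }
}
```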
8. What are the different modes that Hadoop can run in? Where do we configure these modes?
Ans: Hadoop can be configured to run in one of the following modes.
a. Standalone or local mode (default mode)
b. Pseudo-distributed mode
c. Fully distributed mode
These modes are configured via core-site.xml, mapred-site.xml and hdfs-site.xml.
9. What are the available data-types in Hadoop?
Ans: To support serialization/deserialization and to allow keys to be compared with one another, Hadoop has its own data types. The following types implement WritableComparable:
Primitives: BooleanWritable, ByteWritable, ShortWritable, IntWritable, VIntWritable, FloatWritable, LongWritable
Others: NullWritable, Text, BytesWritable, MD5Hash
10. Explain the command ‘-getmerge’
Ans: hadoop fs -getmerge <directory> <merged file name>
This option gets all the files in the given HDFS directory and merges them into a single file on the local file system.
11. Explain the anatomy of a file read in HDFS
Ans: 1. Client opens the file (calls open() on DistributedFileSystem).
2. DFS calls the namenode to get the block locations.
3. DFS creates an FSDataInputStream and the client invokes read() on this object.
4. Using DFSDataInputStream (a subclass of FSDataInputStream), the read operation is done on the datanodes where the file blocks are present. Blocks are read in order.
5. Once all the blocks have been read, the client calls close() on the FSDataInputStream.
12. Explain the anatomy of a file write in HDFS
Ans: 1. Client creates a file (calls create() on DFS).
2. DFS calls the namenode (NN) to create the file. NN checks the client’s access permissions and whether the file already exists. If the file already exists, it throws an IOException.
3. DFS returns an FSDataOutputStream to write data into. FSDataOutputStream has a subclass DFSDataOutputStream which handles communication with the NN and the datanodes (DNs).
4. DFSDataOutputStream writes data in the form of packets (small units of data), and these packets are written to various DNs to form blocks of data. A pipeline is formed that consists of the list of DNs that a single block has to be replicated to.
5. When data has been written to all DNs in the pipeline, acknowledgements come back from the DNs in the pipeline in reverse order.
6. When the client has finished writing the data, it calls close() on the stream.
7. The client waits for the acknowledgements before contacting the namenode to signal that the file is complete.
1. What is Distributed Cache?
Ans: Distributed Cache is a mechanism by which ‘side data’ (extra read-only data needed by an MR program) is distributed to the nodes that run the program’s tasks.
2. What is ‘Sequence File’ format? Where do we use it?
Ans: SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format. It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile. SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively. There are 3 different SequenceFile formats:
a. Uncompressed key/value records
b. Record compressed key/value records – only ‘values’ are compressed here.
c. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
3. What are the different File Input Formats in MapReduce?
Ans: FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. The subclasses of FileInputFormat are: CombineFileInputFormat, TextInputFormat (default), KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat. SequenceFileInputFormat has a few subclasses of its own: SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat, SequenceFileInputFilter.
4. What is ‘Shuffling & sorting’ phase in MapReduce?
Ans: This phase occurs between the Map and Reduce phases. During this phase, all key/value pairs emitted by the various mappers are collected, sorted and grouped by key, and copied to the reducers.
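The effect of this phase on the data can be sketched in memory. This only illustrates the grouping and ordering that a reducer sees; it is not how Hadoop implements the shuffle, and the class name and sample keys are made up.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only; not how Hadoop implements the shuffle.
class ShuffleSketch {
    // Groups (key, value) pairs by key; TreeMap keeps the keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Pairs as they might be emitted by different mappers, in no particular order.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new AbstractMap.SimpleEntry<>("hbase", 1));
        mapOutput.add(new AbstractMap.SimpleEntry<>("hadoop", 1));
        mapOutput.add(new AbstractMap.SimpleEntry<>("hadoop", 1));
        System.out.println(shuffle(mapOutput)); // {hadoop=[1, 1], hbase=[1]}
    }
}
```

Each entry of the resulting map corresponds to one reduce() call: a key together with all the values emitted for it.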
5. How many instances of a ‘jobtracker’ run in a cluster?
Ans: Only one instance of the JobTracker runs in a cluster.
6. Can two different Mappers communicate with each other?
Ans: No, Mappers/Reducers run independently of each other.
7. How do you make sure that only one mapper runs your entire file?
Ans: Create a custom InputFormat (for example, by subclassing a FileInputFormat) and override isSplitable() to return false. Alternatively, a rather crude way is to set the block size greater than the size of the input file.
8. When will the reducer phase start in a MR program?
Ans: The reduce() calls start only after all mappers have finished execution, although reducers may begin copying map output earlier.
9. Explain various phases of a MapReduce program.
Sort & Shuffle phase:
10. What is a ‘Task instance’?
Ans: A task instance is the child JVM process that the TaskTracker launches to run a task. Running the task in a separate process ensures that a task failure does not take down the TaskTracker.
1. What is HBase?
HBase is a column-oriented, open-source, multidimensional, distributed database. It runs on top of HDFS.
2. Why do we use HBase?
HBase provides random reads and writes, and is used when thousands of operations per second are needed on a large data set.
3. List the main components of HBase.
4. How many operational commands are there in HBase?
There are five main commands in HBase.
5. How do you open a connection in HBase?
To open a connection through the Java API, the following code can be used:
Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");
6. When Should I Use HBase?
HBase isn't suitable for every problem. First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle. Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be ported to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port. Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
7. How does HBase achieve random read/write?
HBase stores data in HFiles that are indexed (sorted) by their key. Given a random key, the client can determine which region server to ask for the row. The region server can determine which region to retrieve the row from, and then do a binary search through the region to access the correct row. This is accomplished by keeping sufficient statistics: the number of blocks, block size, start key, and end key. For example, a table may contain 10 TB of data, but the table is broken up into regions of size 4 GB. Each region has a start/end key. The client can get the list of regions for a table and determine which region has the key it is looking for. Regions are broken up into blocks, so that the region server can do a binary search through its blocks. Blocks are essentially long lists of key, attribute, value, and version. If you know the starting key of each block, you can determine which file to access and at which byte offset (block) to start reading, which tells you where you are in the binary search.
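The block lookup described above can be sketched as a binary search over the sorted start keys of the blocks. This is a hypothetical illustration in plain Java; HBase's real block index is more involved, and the class name and sample keys are assumptions.

```java
import java.util.Arrays;

// Hypothetical sketch for illustration; not HBase code.
class BlockIndexSketch {
    // Given the sorted start keys of each block, returns the index of the block
    // whose key range could contain rowKey, or -1 if rowKey sorts before the first block.
    static int findBlock(String[] blockStartKeys, String rowKey) {
        int pos = Arrays.binarySearch(blockStartKeys, rowKey);
        if (pos >= 0) {
            return pos;             // rowKey is exactly the start key of a block
        }
        int insertionPoint = -pos - 1;
        return insertionPoint - 1;  // the block that starts just before rowKey
    }

    public static void main(String[] args) {
        String[] startKeys = {"apple", "kiwi", "peach"};
        System.out.println(findBlock(startKeys, "mango"));    // 1
        System.out.println(findBlock(startKeys, "aardvark")); // -1
    }
}
```

The same idea applies one level up: the client can search the sorted region start keys to find the region that holds a key.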