
November 14, 2018

Cloud Computing in simple terms - Overview of Cloud Computing

Cloud Computing in simple terms:

Welcome to Cloud Computing. The first question that comes to mind is: what is cloud computing?

I am pretty sure you are looking for the same answer, so we will jump straight into a simple, scenario-based explanation. There is plenty of information available from various sources such as NIST (National Institute of Standards and Technology), the 451 Group, analyst firms like Gartner, Wikipedia, etc.

In short, it is a very large pool of services and data that you access over the network. The software and data you use for your work do not live on your own computer; they sit on servers. This concept of using services hosted on other systems is called cloud computing.
There was a time when many clients wanted to access information from many terminals, but mainframe technology was too costly, so to save money people wanted something new. At that time it was just a hope, but today we have the cloud.

In simple terms, it is a name for services made available over the network, on demand, by optimized and highly scalable service providers; the name cloud computing was inspired by the cloud symbol often used to represent the network. The exact origin of the term is still debated, but its attributes are agreed upon.

Accessing the Cloud:
We can access the cloud using various devices, as shown in the diagram below:


July 25, 2018

Analytics - Hadoop L1


Question.
Which of the following are true about Hadoop?
Open Source
Distributed Processing Framework
Distributed Storage Framework
All of these

Answer: All of these

Question.
Which of the following are false about Hadoop?
Hadoop works in a master-slave fashion
Master and slave are both worker nodes
The user submits work to the master, which distributes it to the slaves
Slaves are the actual worker nodes

Answer: Master and slave are both worker nodes

Question.
What is metadata in Hadoop?
Data stored by user
Information about the data stored in datanodes
User information
None of these

Answer: Information about the data stored in datanodes

Question.
What is a Daemon?
Process or service that runs in background
Applications submitted by user
Web application running on web server
None of these

Answer: Process or service that runs in background

Question.
All of the following accurately describe Hadoop EXCEPT?
a. Batch processing
b. Open-source
c. Distributed computing
d. Real-time

Answer: Real-time

Question.
All of the following are core components of Hadoop EXCEPT?
a. Hive
b. HDFS
c. MapReduce
d. YARN

Answer: Hive

Question.
Hadoop is a framework that uses a variety of related tools. Common tools included in a typical implementation include:
a. MapReduce, HDFS, Spool
b. MapReduce, MySQL, Google Apps
c. Cloudera, HortonWorks, MapR
d. MapReduce, Hive, HBase

Answer: MapReduce, Hive, HBase

Question.
Which of the following can be used to create workflows when multiple MapReduce and Pig programs need to be executed?
a. Sqoop
b. Zookeeper
c. Oozie
d. HBase

Answer: Oozie
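A hedged usage sketch: once a workflow.xml and a job.properties file have been prepared (these file names follow the usual Oozie conventions and are not taken from this post), the workflow is submitted through the Oozie command line; the server URL shown is just an assumption for a local setup.

> oozie job -oozie http://localhost:11000/oozie -config job.properties -run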

Question.
Which of the following can be used to transfer bulk data between Hadoop and structured databases?
a. Sqoop
b. Hive
c. Pig
d. Spark

Answer: Sqoop

Question.
How many single points of failure does a High Availability HDFS architecture have?
a. 0
b. 1
c. 2
d. 3

Answer: 0

Question.
If a file of size 300MB needs to be stored in the HDFS (block size=64MB, replication factor=2), how many blocks are created for this file in the HDFS?
a. 10
b. 11
c. 12
d. 15

Answer: 10 (300MB / 64MB gives 5 blocks after rounding up, and each block is replicated twice, so 5 × 2 = 10 blocks are stored)

Question.
What is not a default value for a data block size in the HDFS?
a. 64MB
b. 128MB
c. 512MB
d. 256MB

Answer: 512MB

Question.
Which of the following architectures best describes the HDFS architecture?
a. High Availability
b. Master-Slave
c. Connected
d. Peer

Answer: Master-Slave

Question.
Which of the following is a master process in the HDFS architecture?
a. Datanode
b. JobTracker
c. Namenode
d. Secondary Namenode

Answer: Namenode

Question.
Which of the following is true about Hadoop?

Before storing data we need to specify the schema
We will lose data if one data node crashes
We can add any number of nodes to the cluster on the fly (n ~ 15000)
Data is first processed on the master and then on the slaves

Answer: We can add any number of nodes to the cluster on the fly (n ~ 15000)

Question.
Choose the correct statement:

The master assigns work to all the slaves
We cannot edit data once it is written in Hadoop
The client needs to interact with the master first, as it is the single place where all the metadata is available
All of these

Answer: All of these

Question.
Which of the following is an essential module of HDFS?
Node Manager
Resource Manager
DataNode
All of the above

Answer: DataNode

Question.
Which of the following is NOT a kind of metadata in the NameNode?

Block locations of files
List of files
File access control information
No. of file records

Answer: No. of file records

Question.
Which statement is true about DataNode?

It is the actual worker node that saves and stores metadata.
It is the slave node that saves and stores metadata.
It is the Master node that saves and stores actual data.
It is the slave node that saves and stores actual data.


Answer: It is the slave node that saves and stores actual data.

Question.
Is the Secondary NameNode the backup node?
TRUE
FALSE

Answer: FALSE

Question.
Which of the following is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks?

MapReduce
Hive
Pig
HDFS

Answer: MapReduce

Question.
The mapper's sorted output is input to the:
Reducer
Mapper
Shuffle
All of the mentioned

Answer: Reducer


Question.
Which of the following generates intermediate key-value pairs?
Reducer
Mapper
Combiner
Partitioner

Answer: Mapper

Question.
What is the major advantage of storing data in a block size of 128MB?
It saves disk seek time
It saves disk processing time
It saves disk access time
It saves disk latency time

Answer: It saves disk seek time

Question.
The role of the Partitioner in a MapReduce job is:

a) To partition input data into equal parts
b) Distribute data among available reducers
c) To partition data and send to each mapper
d) Distribute data among available mappers

Answer: Distribute data among available reducers

Question.
Which of the following is a single point of failure?
NameNode
Secondary NameNode
DataNode
None of the above

Answer: NameNode

Question.
Apache HBase is

a) Column family oriented NoSQL database
b) Relational Database
c) Document oriented NoSQL database
d) Not part of Hadoop eco system

Answer: Column family oriented NoSQL database

Question.
Which of the following is a table type in Hive?

a) Managed Table
b) Local Table
c) Persistent Table
d) Memory Table

Answer: Managed Table

Question.
Which of the following is a daemon process in Hadoop?

a) NameNode
b) JobNode
c) taskNode
d) mapreducer

Answer: NameNode

Question.
Information about locations of the blocks of a file is stored at ________

a) data nodes
b) name node
c) secondary name node
d) job tracker

Answer: name node

Question.
Apache Sqoop is used to

a) Move data from local file system to HDFS
b) Move data from streaming sources to HDFS
c) Move data from RDBMS to HDFS
d) Move data between Hadoop Clusters

Answer: Move data from RDBMS to HDFS

Question.
In a MapReduce program, the role of the combiner is:

a) To combine output from multiple map tasks
b) To combine output from multiple reduce tasks
c) To merge data and create a single output file
d) To aggregate the output of each map task

Answer: To aggregate the output of each map task

Question.
Hive External tables store data in

a) default Hive warehouse location in HDFS
b) default Hive warehouse location in Local file system
c) a custom location in HDFS
d) a custom location in local file system

Answer: a custom location in HDFS
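A hedged sketch of how such a table could be declared from the shell (the table name, columns, and HDFS path below are made up for illustration); the LOCATION clause is what points the external table at a custom HDFS directory:

> hive -e "CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/cloudera/training/sales';"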

Question.
MapReduce programming model is ________

a) Platform-dependent but not language-specific
b) Neither platform- nor language-specific
c) Platform-independent but language-specific
d) Platform-dependent and language-specific

Answer: Neither platform- nor language-specific

Question.
Hive generates results using

a) DAG of Map Reduce Jobs
b) sequential processing of files
c) MySQL query engine
d) List processing

Answer: DAG of Map Reduce Jobs

Question.
Clients access the blocks directly from ________ for read and write

a) data nodes
b) name node
c) secondary name node
d) primary node

Answer: data nodes

Question.
In Apache Pig, a Data Bag stores

a) Set of columns
b) set of columns with the same data type
c) set of columns with different data type
d) Set of tuples

Answer: Set of tuples

Question.
You can execute a Pig Script in local mode using the following command

a) pig -mode local
b) pig -x local
c) pig -run local
d) pig -f

Answer: pig -x local
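For example (the script name here is just a placeholder, not from this post), a Pig script on the local file system can be run without a Hadoop cluster:

> pig -x local wordcount.pig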

Question.
Default block size in HDFS is ____________

a) 128 KB
b) 64 KB
c) 32 MB
d) 128 MB

Answer: 128 MB

Question.
Apache Flume is used to

a) Move data from RDBMS to HDFS
b) Move data from HDFS to RDBMS
c) Move data from One HDFS Cluster to another
d) Move data from Streaming source to HDFS

Answer: Move data from Streaming source to HDFS

Question.
Default data field delimiter used by Hive is

a) Ctrl-a character
b) Tab
c) Ctrl-b character
d) Space

Answer: Ctrl-a character

Question.
What are the characteristics of Big Data?

a) volume, quality, variety
b) volume, velocity, variety
c) volume, quality, quantity
d) quantity and quality only

Answer: volume, velocity, variety

Question.
Which is optional in a MapReduce program?

a) Mapper
b) Reducer
c) Both are optional
d) Both are mandatory

Answer: Reducer

Question.
In Hive tables, each table partition's data is stored as:

a) files in separate folders
b) multiple files in same folder
c) a single file
d) multiple xml files

Answer: files in separate folders

Question.
What is the default storage class in Pig called?

a) TextStorage
b) DefaultStorage
c) PigStorage
d) BinaryStorage

Answer: PigStorage

July 05, 2018

Big Data - HDFS Overview


HDFS (Hadoop Distributed File System):
HDFS is a distributed file system specially designed to store huge data sets on a cluster (group) of commodity hardware using a streaming access pattern. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

Streaming access pattern -> "write once, read n number of times, but don't change the content of the file".
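As a minimal sketch of this pattern from the command line (the path reuses the example location that appears later in this post), a file is written once with -put and can then be read any number of times with -cat:

> hdfs dfs -mkdir -p /user/cloudera/training
> hdfs dfs -put userdata.txt /user/cloudera/training/          [write the file into HDFS once]
> hdfs dfs -cat /user/cloudera/training/userdata.txt           [read it as many times as needed]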

Hadoop daemons:

Hadoop 1.0

Master- NameNode, JobTracker, Secondary NameNode,
Slave-    DataNode, TaskTracker

Hadoop 2.0

Master - NameNode, Secondary NameNode, Resource Manager
Slave -    DataNode, Node manager.

IMP: To store data, the services we need are the NameNode (to create the metadata) and the DataNode (to store the actual data).
IMP: To process data: the JobTracker and the TaskTracker.

Single Point of failure:

Hadoop 1.0:- The NameNode is the single point of failure. Each cluster has a single NameNode and if that machine is not available, the whole cluster will not be available.

Hadoop 2.0:- HDFS comes with high availability, which overcomes the single point of failure by providing an option to run two NameNodes in the same cluster in an Active/Passive configuration with a hot standby.


NameNode and DataNode:

HDFS has the NameNode and the DataNode, also known as master and slave. The NameNode holds the metadata and performs operations like opening, closing, and renaming files and directories. The DataNode is responsible for serving requests from the client's system. It also performs block creation, deletion, and replication upon instruction from the NameNode. HDFS is built using the Java language, and because of Java's portability it can be deployed and run on a wide range of machines.

Data Replication:

HDFS is designed to store large volumes of data across the machines of a large cluster. It stores each file as a sequence of blocks, and the blocks of a file are replicated for fault tolerance. We can set the block size and the replication factor manually by editing the hdfs-site.xml file, which is usually found in the conf/ folder of the Hadoop installation directory. Once you are in the conf/ folder, look for the properties below to make any change.

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Block Replication</description>
</property>

hdfs-site.xml is used to configure HDFS, and any change made in this file becomes the default replication for all files placed into HDFS afterwards. We can also set the replication factor on a per-file basis using the shell command below.

> hdfs dfs -setrep -w 2 /user/cloudera/training/userdata.txt    [This shell command sets the replication factor to 2 for the file userdata.txt]

IMP: What is the default replication factor (default value in HDFS)? -> 3, which means each block stored in HDFS will have 1 original copy and 2 replicas.

Architecture:

Block size in HDFS

The default block size in Hadoop 2.0 is 128MB; earlier, it was 64MB in Hadoop 1.0. The block size is used to divide an input file into blocks and distribute those blocks across the cluster. For example: if a 170MB input file is put into HDFS with the Hadoop 2.0 default block size of 128MB, HDFS splits the file into 2 blocks. The first block holds 128MB and the second block holds only the remaining 42MB (the unused 86MB of the second block is not reserved; it remains available to the operating system).
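To see how a particular file was actually split, one option is the fsck command, shown here as a hedged sketch against the example file used elsewhere in this post; it lists the file's blocks and the DataNodes holding them:

> hdfs fsck /user/cloudera/training/userdata.txt -files -blocks -locations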

If we want to change the default block size in HDFS, we can do it via hdfs-site.xml. It will change the default block size for all files placed into HDFS afterwards.

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size</description>
</property>

IMP: Changing the block size in HDFS will not affect any files already stored in HDFS; it only affects files placed after the setting has taken effect.
IMP: Why a 128MB block size, why not 128KB? -> For each block in HDFS we need to create metadata in the NameNode, so if we went with a 128KB block size the NameNode would have to hold far more metadata entries for the same amount of data.

Metadata:
What do we mean by metadata? It is the additional information about our data, like the number of input splits, the replication factor, data block locations, file size, etc. The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to record every change that occurs in the file system.

There are 2 important files:
1. FsImage
2. EditLog
Metadata is kept on a highly configured system, the NameNode, which keeps an image of the entire file system through the FsImage and the EditLog transactions. The FsImage holds everything up to the last checkpoint (by default about an hour old), whereas the EditLog holds the latest changes made since that checkpoint.
When the NameNode starts up, it reads the FsImage and the EditLog from disk, applies all the transactions from the EditLog to the in-memory FsImage, flushes the updated FsImage back to disk, and truncates the old EditLog because its transactions have been applied to the FsImage. This process is called a checkpoint.
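As a hedged illustration (standard dfsadmin commands, to be run by an HDFS administrator on a test cluster rather than in production), the same merge of the EditLog into the FsImage can be forced manually:

> hdfs dfsadmin -safemode enter      [stop changes to the namespace]
> hdfs dfsadmin -saveNamespace       [merge the EditLog into a new FsImage on disk]
> hdfs dfsadmin -safemode leave      [resume normal operation]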
The DataNode stores HDFS data as files in its local file system and has no knowledge about the HDFS files themselves; each HDFS block is kept in a separate file in the local file system. The DataNode keeps sending a Heartbeat to the NameNode at a specific time interval. It also scans its local file system, generates a list of all the HDFS data blocks that correspond to these local files, and sends this report to the NameNode; this is known as the Blockreport.

Difference between Hadoop 1 and Hadoop 2:
The biggest difference between the two is the YARN technology. YARN stands for Yet Another Resource Negotiator. YARN has 2 daemons which take over the work previously handled by the JobTracker and the TaskTracker.
In Hadoop 2.0, the JobTracker and the TaskTracker have been replaced by the Resource Manager and the Node Manager.
Resource Manager: allocates resources to the various applications.
Node Manager: monitors the execution of the processes on its node.
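As a small hedged sketch (standard YARN commands, not specific to this post), once a Hadoop 2.0 cluster is up you can ask these daemons what they are doing from the command line:

> yarn node -list              [lists the Node Managers registered with the Resource Manager]
> yarn application -list       [lists the applications the Resource Manager is currently tracking]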

May 29, 2018

Big Data - Hadoop Overview


What is Big Data?

Big data is a collection of data of huge size, too large and complex for any of the relational/traditional data management tools to store or process efficiently.
Example: we have a laptop and the data doesn't fit on that laptop; data which exceeds the capacity of our storage is, in simple terms, big data.

Data management tools: Informatica, SAS Data Management, Tableau etc.

How it has impacted social media

500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly produced in the form of photo and video uploads, message exchanges, comments, etc.

Why Hadoop is needed:
Up to 2003, the total amount of data produced was about 5 billion gigabytes. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously. Although all this information is meaningful and can be useful when processed, most of it is being neglected.

If we want a one-liner for this, it would be 'the challenges involved in storing and processing a huge amount of data'. Traditional data management tools are not able to store or process this amount of data efficiently; hence Hadoop is needed.

Categories of Big Data:

Big data could be found in three forms as described below:
  • Structured: Any data that can be stored, accessed and processed in a fixed format. Example: database tables.
  • Unstructured: Data which is not in a fixed format. Example: Google search results, Word documents, PDFs.
  • Semi-structured: Semi-structured data can contain both forms of data. Example: XML data.
Characteristics of Big Data:

We will discuss these briefly, as one-liners. There are mainly 5, also called the V's of Big Data, and they are a common interview question.
  1. Volume: data is growing at an exponential rate, i.e. into petabytes and exabytes.
  2. Variety: in the early days, spreadsheets and databases were the only sources of data considered by most applications; now data also arrives as emails, photos, videos, surveillance feeds, PDFs, audio, etc.
  3. Velocity: how fast the data is generated and processed to meet demand. Today, yesterday's data is already considered old. Nowadays, social media is a major contributor to the velocity of growing data.
  4. Veracity: the accuracy, consistency, and completeness of the data. Volume is often the reason for a lack of quality and accuracy in the data.
  5. Value: finally, we all have access to big data, but turning it into value (profit) is what matters.
Sources of Data:

Big Problems with big data:

There may be lots of problems with big data from a handling and processing perspective, but below are 2 major ones you need to understand.
  • False data: Ex: Google Maps giving you directions and suggesting an alternate, “faster” route. You take it only to find that it’s a dirt road under construction. Sometimes BigData systems think that they got a shortcut, but in reality, it is not exactly what the user was looking for. 
  • Big complexity: The more data you have, sometimes the harder it can be to find true value from the data. Where to put it: the more data you have, the more complex the problem of managing it can become.
Handling Big Data

When "Big Data" emerged as a problem, Apache Hadoop evolved as its solution. Apache Hadoop is a framework that gives us various services and tools to store and process large amounts of data. It helps us analyze big data and make business decisions out of it, something which cannot be done efficiently and effectively using traditional systems.

Terms commonly used in Hadoop:

Here is a list of most important Hadoop terms you need to know and understand, before going into the Hadoop eco-system.
  • HDFS: An acronym for “Hadoop Distributed File System”, which breaks large application workloads into smaller data blocks in order to store huge amounts of data on a cluster of commodity hardware for faster processing.
  • MapReduce: A software framework for easily writing applications that process the huge amounts of data stored in HDFS.
  • HBase: HBase is called the Hadoop database because it is a NoSQL database that runs on top of Hadoop. It combines the scalability of Hadoop, by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and the deep analytic capabilities of MapReduce.
  • Hive: A data warehouse infrastructure built on top of Hadoop to provide data summarization, query, and analysis. It allows you to query the data using a SQL-like language called HiveQL (HQL).
  • HiveQL (HQL): A SQL-like query language for Hadoop, used to execute MapReduce jobs against data stored in HDFS.
  • Sqoop: A tool designed to transfer data between Hadoop and relational databases; a hedged example command is sketched just after this list.
  • Flume: A service for collecting, aggregating, and moving large amounts of log and event data into Hadoop.
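A hedged sketch of the kind of Sqoop import referred to above; the MySQL host, database, table, user, and target directory below are placeholders, not details from this post:

> sqoop import --connect jdbc:mysql://dbhost/salesdb --username dbuser -P --table customers --target-dir /user/cloudera/training/customers -m 1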
Tools used in Big Data


Note: I have started this for those who want to learn Hadoop and are thinking of adding it as an additional skill on their resume. The information here reflects less than a year of hands-on experience. Some salient topics will be uploaded as soon as possible, along with exercises.

Stay tuned.

The skill set would be: Java, Hadoop, HDFS, MapReduce, HBase, ZooKeeper, Hive, Pig, Sqoop, Flume, Oozie.