May 29, 2018

Big Data - Hadoop Overview


What is Big Data?

Big data is a collection of data of huge size: data so large and complex that none of the traditional/relational data management tools can store or process it efficiently.
Example: if we have a laptop and the data doesn't fit on that laptop, that is big data from the laptop's point of view, i.e. data which exceeds the capacity of the system meant to handle it.

Data management tools: Informatica, SAS Data Management, Tableau, etc.

How It Has Impacted Social Media

500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly produced in the form of photo and video uploads, message exchanges, comments, etc.

Why Hadoop Is Needed:
From the beginning of time until 2003, the amount of data produced by us was about 5 billion gigabytes. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously. Although all this information is meaningful and can be useful when processed, most of it is being neglected.

As a one-liner, the answer is: 'the challenges involved in storing and processing a huge amount of data'. Traditional data management tools are not able to store or process this amount of data efficiently; hence Hadoop is needed.

Categories of Big Data:

Big data could be found in three forms as described below:
  • Structured: Any data that can be stored, accessed, and processed in a fixed format. Example: database tables.
  • Unstructured: Data which is not in a fixed format. Example: Google search results, Word documents, PDFs.
  • Semi-structured: Data that can contain both of the above forms. Example: XML data.
Characteristics of Big Data:

We will discuss these briefly, as one-liners. There are mainly five, also called the V's of Big Data, and they are commonly asked about in interviews.
  1. Volume: The amount of data, which is growing at an exponential rate, i.e. into petabytes and exabytes.
  2. Variety: In the early days, spreadsheets and databases were the only sources of data considered by most applications; now data arrives as emails, photos, videos, surveillance feeds, PDFs, audio, etc.
  3. Velocity: How fast data is generated and must be processed to meet demand. Today, yesterday's data is already considered old. Nowadays, social media is a major contributor to the velocity of growing data.
  4. Veracity: The accuracy of data, including inconsistency and incompleteness. Volume is often the reason for a lack of quality and accuracy in data.
  5. Value: Finally, we all have access to big data, but turning it into value, i.e. profit or insight, is what matters.
Sources of Data:

Big Problems with Big Data:

There may be lots of problems with big data from a handling and processing perspective, but below are two major problems which you need to understand.
  • False data: E.g. Google Maps gives you directions and suggests an alternate, “faster” route. You take it only to find that it's a dirt road under construction. Sometimes big data systems think they have found a shortcut, but in reality it is not what the user was looking for.
  • Big complexity: The more data you have, the harder it can be to find true value in it, and the more complex the problem of storing and managing it becomes.
Handling Big Data

When "Big Data" emerged as a problem, Apache Hadoop developed as its solution. Apache Hadoop is a framework that gives us different services or tools to store and process large data. It helps to analyze big data and make business decisions, which can not be used efficiently and effectively using traditional systems.

Terms commonly used in Hadoop:

Here is a list of the most important Hadoop terms you need to know and understand before going into the Hadoop ecosystem.
  • HDFS: An acronym for “Hadoop Distributed File System”. It breaks large workloads into smaller data blocks and stores huge amounts of data across a cluster of commodity hardware for faster processing. (See the HDFS sketch after this list.)
  • MapReduce: A software framework for easily writing applications that process the huge amounts of data stored in HDFS. (See the word-count sketch after this list.)
  • HBase: Called the Hadoop database because it is a NoSQL database that runs on top of Hadoop. It combines the scalability of Hadoop, by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and the deep analytic capabilities of MapReduce.
  • Hive: A data warehouse infrastructure built on top of Hadoop to provide data summarization, query, and analysis. It allows you to query the data using a SQL-like language called HiveQL (HQL). (See the Hive example after this list.)
  • HiveQL (HQL): A SQL-like query language for Hadoop, used to run queries that execute as MapReduce jobs over data in HDFS.
  • Sqoop: A tool designed to transfer data between Hadoop and relational databases.
  • Flume: A service for collecting, aggregating, and moving large amounts of log and event data into Hadoop.
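
To make the HDFS entry concrete, here is a minimal sketch in Java using Hadoop's FileSystem API. It assumes a configured cluster (fs.defaultFS is picked up from core-site.xml on the classpath), and the path /user/demo/hello.txt is a hypothetical example; splitting the file into blocks and replicating them across DataNodes happens automatically behind these calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from core-site.xml; on a real cluster this
        // points at the NameNode rather than the local file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path, used for illustration only.
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) { // true = overwrite
            out.writeBytes("Hello, HDFS!\n");
        }
        System.out.println("Wrote " + fs.getFileStatus(path).getLen()
                + " bytes to " + path);
    }
}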
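
For MapReduce, the classic starting point is word count: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts per word. The sketch below follows that standard pattern; the input and output locations are assumed to be HDFS paths passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this into a jar and submit it with the hadoop jar command, passing the input and output HDFS paths as arguments; the output directory must not already exist, as Hadoop refuses to overwrite it.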
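
For Hive, here is a minimal sketch of running a HiveQL query from Java over JDBC. It assumes a HiveServer2 instance reachable at localhost:10000 with no authentication, the hive-jdbc driver on the classpath, and a hypothetical table named words; the query reads like ordinary SQL, but Hive compiles it into jobs that run on the cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; host, port, and database are assumptions
        // here -- adjust them for your cluster and its security setup.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // "words" is a hypothetical table used for illustration.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, count(*) AS freq FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
            }
        }
    }
}
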
Tools used in Big Data


Note: I have started this for those who want to learn Hadoop and are thinking of adding it as an additional skill to their resume. The information here is based on less than a year of experience. Some salient topics will be uploaded as soon as possible, with exercises.

Stay tuned.

The skill set covered will be: Java, Hadoop, HDFS, MapReduce, HBase, ZooKeeper, Hive, Pig, Sqoop, Flume, Oozie.