April 26, 2018

Important QnA on Data Warehouse - Mid Sem Exercise


Text Books
T1 Ponniah P, “Data Warehousing Fundamentals”, Wiley Student Edition, 2012
T2 Kimball R, “The Data Warehouse Toolkit”, 3e, John Wiley, 2013

Question.1
Match the following:

Answer:
Nonvolatile data : Read only
Dual data granularity : Details and summary
Dependent data mart : Data from main DW
Disparate data : Because of multiple sources
Decision support : Subject-oriented
Data staging : Workbench for data integration
Data mining : Knowledge discovery
Metadata : Roadmap for users
Operational systems : Application flavor
Internal data : Private spreadsheets

Question.2
Classify metadata for following as Data Acquisition, Data Storage, or Information Delivery metadata.

1) Data Archival Rules
2) Predefined queries and reports
3) Source system from which data come
4) Information about last successful update
5) Types of aggregations
6) Information about data transformation rules

Answer
1) Data Archival Rules : Data Storage
2) Predefined queries and reports : Information Delivery
3) Source system from which data come : Data Acquisition
4) Information about last successful update : Data Storage
5) Types of aggregations : Information Delivery
6) Information about data transformation rules : Data Acquisition

Question.3
Provide short answers to the following : 

a. How do you differentiate a Data Warehouse from an application database?
b. What is a conformed dimension?
c. Differentiate initial load, incremental load, and full refresh
d. How can MapReduce processing help DW?
e. What is the difference between WINDOW and GROUP BY clauses in SQL SELECT statements?

Answer
A. Differentiate a Data Warehouse from an application database:

Database:
  • Normally used for online transaction processing (OLTP), though it can be used for other purposes such as data warehousing. It records current data from user transactions.
  • Tables and joins are complex because they are normalized (in an RDBMS). This is done to reduce redundant data and save storage space.
  • Optimized for write operations.
Data Warehouse:
  • Normally used for Online Analytical Processing (OLAP). It reads historical data to support users' business decisions.
  • Optimized for read operations.
  • Tables and joins are simple since they are de-normalized. This is done to reduce the response time for analytical queries.
Answer
B. Conformed dimension:
  • A conformed dimension is a dimension that has exactly the same meaning and content when referred to from different fact tables.
  • It can refer to multiple tables in multiple data marts within a single organization.
  • For example, two dimension tables that are exactly the same except for the primary key are not considered to be conformed dimensions.
Answer
C. Differentiate between initial load, incremental load, and full refresh

Initial Load : the process of populating the data warehouse tables for the very first time.

Full Refresh : periodically wipes the contents of one or more tables and reloads them completely with fresh data.

Incremental Load : applies only the changes made in the source since the previous load, on a schedule prescribed for each period, so the warehouse picks up the latest records without reloading everything (see the sketch below).
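A minimal sketch of an incremental load, assuming hypothetical staging and target tables (STG_SALES holding only the new or changed rows, SALES_FACT as the target) and a DBMS that supports standard SQL MERGE:

-- Upsert the rows that changed since the previous load.
MERGE INTO SALES_FACT tgt
USING STG_SALES src
   ON (tgt.Sale_Id = src.Sale_Id)
WHEN MATCHED THEN
  UPDATE SET tgt.Sales = src.Sales
WHEN NOT MATCHED THEN
  INSERT (Sale_Id, Product, Region, Sales)
  VALUES (src.Sale_Id, src.Product, src.Region, src.Sales);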

Answer
D.
The MapReduce programming paradigm is very powerful and flexible; it has been widely used for parallel computing in data-intensive areas due to its good scalability.

How MapReduce processing helps a DW:

* MapReduce makes it easier to write a distributed computing program by providing
  • inter-process communication
  • fault tolerance
  • load balancing
  • and task scheduling.
* MapReduce integration allows fully parallel, distributed jobs against your data with locality awareness (batch processing).
* This reduces difficult programming work: the user only has to create a configuration file with a few lines of code to declare the dimension and fact objects and the necessary change functions (a hedged illustration follows).
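As an illustration, assuming a Hive-style SQL-on-Hadoop engine and a hypothetical sales_fact table, even a simple aggregate query compiles down to a MapReduce job:

-- Hive plans this GROUP BY as a MapReduce job:
-- map: scan splits of sales_fact and emit (region, sales) pairs;
-- shuffle: group the pairs by region; reduce: sum each group.
SELECT region, SUM(sales) AS total_sales
  FROM sales_fact
 GROUP BY region;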

Answer
E. Difference between WINDOW and GROUP BY clauses in SQL SELECT statements:
  • GROUP BY is an aggregation clause, whereas OVER() defines a window function.
  • When a GROUP BY clause is used, every column in the select list must either appear in the GROUP BY or be wrapped in an aggregate function.
  • The OVER() clause has no such restriction: aggregates and other functions computed over a window can appear alongside ordinary columns in the select list (see the sketch below).
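A minimal sketch of the contrast, reusing the SALES(Product, Region, Sales) table from Question 7:

-- GROUP BY collapses the rows: one output row per region.
SELECT Region, SUM(Sales) AS region_sales
  FROM SALES
 GROUP BY Region;

-- A window function keeps every row and adds the aggregate alongside it.
SELECT Product, Region, Sales,
       SUM(Sales) OVER (PARTITION BY Region) AS region_sales
  FROM SALES;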
Question.4
For the following statements, indicate true or false with proper justification:

A. It is a good practice to drop the indexes before the initial load.                    
B. It is equally important to include unstructured data along with structured data in a data warehouse.
C. Technical architecture consists of the vendor tools.
D. The importance of metadata is the same in a data warehouse as it is in an operational system.
E. Backing up the data warehouse is not necessary because you can recover data from the source systems.
F. Metadata standards facilitate metadata interchange among tools.

Answer:
A. True: Index entry creation during a mass load can be very time consuming. Therefore, to speed up the load, drop the indexes before loading and rebuild or recreate them when the load is completed (see the sketch below).
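A minimal sketch of the pattern, with hypothetical index and table names (DROP/CREATE INDEX syntax varies slightly by DBMS):

-- Drop the index so the bulk load skips per-row index maintenance.
DROP INDEX idx_sales_product;

-- ... perform the initial bulk load into SALES_FACT here ...

-- Rebuild the index once the load has completed.
CREATE INDEX idx_sales_product ON SALES_FACT (Product);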

B. True: Both types of data are vital in the modern digital enterprise, but they must be managed differently, because most business-relevant information originates in unstructured form.

C. False: The technical architecture is the overall framework and infrastructure of the data warehouse, not the vendor tools themselves. Vendor tools are merely the instruments used to implement the architecture; the architecture must be established first, and tools selected to fit it.

D. False: Metadata is even more important in a data warehouse than in an operational system. In an operational system, users get information through predefined screens and reports, whereas DW users seek information through ad-hoc queries and therefore depend on metadata as their roadmap to the data.

E. False: Information in a DW is accumulated over long periods and elaborately preprocessed; the source systems may no longer hold that historical data, so the warehouse must be backed up.

F. True: A vendor may suggest selecting parts of different metadata schemas and constructing a pick-and-mix schema of your own. This is known as the pragmatic approach, and it requires understanding possible data loss through truncation and mapping. In this situation, metadata standards facilitate the interchange of metadata between systems rather than prescribing a form of storage.

Question.5
While designing fact table it is being advised to use non-intelligent keys; but while designing summary table it is advised to use intelligent key. Why ?

Answer:
Fact tables hold long-lived data, so they use non-intelligent (surrogate) keys: meaningless keys insulate the warehouse from changes in operational source keys and support slowly changing dimensions. Summary tables, unlike fact tables, are re-created on a regular basis, possibly every time new data is loaded. Since the purpose of a summary table is to speed up queries, an intelligent key that carries its own meaning (for example a year-month value) lets queries filter on the key directly and avoids joins with dimension data, which take significant amounts of time (see the sketch below).
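A minimal sketch of the contrast, with hypothetical table and column names:

-- Fact table: non-intelligent surrogate keys referencing dimensions.
CREATE TABLE SALES_FACT (
  Date_Key    INT,            -- surrogate key into the date dimension
  Product_Key INT,            -- surrogate key into the product dimension
  Sales       DECIMAL(12,2)
);

-- Summary table, rebuilt at each load: the intelligent key YYYYMM
-- carries its own meaning, so no join to a date dimension is needed.
CREATE TABLE MONTHLY_SALES_SUMMARY (
  Month_Id    CHAR(6),        -- e.g. '201804'
  Product_Key INT,
  Total_Sales DECIMAL(14,2)
);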

Question. 6
List some challenges (at least 5) in ETL testing.

Answer:
ETL: stands for Extract-Transform-Load, the typical process of loading data from source systems into the actual data warehouse or other data integration projects.

ETL Testing: done to ensure that the data loaded from a source to the destination after business transformation is accurate.

ETL Testing Challenges
  1. Lack of a proper flow of business information.
  2. Existence of many ambiguous software requirements.
  3. Possible loss of data during the ETL process.
  4. Production sample data is not a true representation of all possible business processes.
  5. Apparent trouble acquiring and building test data.
  6. It requires SQL programming: this becomes a major issue for manual testers with limited SQL coding skills, making data testing very difficult.
Question.7
Given a SALES table with attributes (Product, Region, Sales), where each row represents the sales of a product in a region. Using window functions, write SQL queries that show

A. Sale of each Product in each region along with average sales over all regions, products in table
B. Sale of each product in each region along with average sales of the product among all regions.

Answer:
Window functions:
Window functions are similar to aggregate functions, but there is one important difference. When we use aggregate functions with the GROUP BY clause, we “lose” the individual rows: the function is performed on each group of rows as a whole. This is not the case with window functions: we can generate a result set that keeps attributes of each individual row together with the results of the window function.

A. Sale of each product in each region along with the average sales over all regions and products in the table. An empty OVER() makes the window the entire result set, so every row carries the grand average:

SELECT Product,
       Region,
       Sales,
       AVG(Sales) OVER () AS Average_Sales
  FROM SALES;

B. Sale of each product in each region along with the average sales of that product across all regions. PARTITION BY Product restricts each row's window to the rows of the same product:

SELECT Product,
       Region,
       Sales,
       AVG(Sales) OVER (PARTITION BY Product) AS Average_Sales
  FROM SALES;

Question.8
Consider a retail store data cube with dimensions Product, Date, and Store, and the measure attribute Sales
a. Draw the lattice of cuboids for this data cube without considering hierarchies.
b. Suppose that the dimensional attributes have the following hierarchies defined. How many cuboids would the data cube contain for this set of hierarchies?

Answer:

A.
Without hierarchies, a cube with the 3 dimensions Product, Date, and Store has 2^3 = 8 cuboids, one for each subset of the dimensions. The lattice runs from the apex cuboid (all) at the top, through the 1-D cuboids (Product), (Date), (Store) and the 2-D cuboids (Product, Date), (Product, Store), (Date, Store), down to the base cuboid (Product, Date, Store).

B. Note: the hierarchy definitions are not reproduced in the question above, so we cannot give a definite count. In general, if dimension i has Li hierarchy levels (excluding the virtual top level "all"), the cube contains (L1 + 1) x (L2 + 1) x (L3 + 1) cuboids. Kindly post in the comment box if you have the intended hierarchies and answer.

Question.9
Consider the following relation on Stock Trading.    

Record Id | Security Name       | Trading Volume | Closing Price | Exchange
----------|---------------------|----------------|---------------|---------
1         | SBI                 | 55075000       | 275.5         | NSE
2         | ICICI               | 23500000       | 285.25        | BSE
3         | Axis Bank           | 4567895        | 345.45        | BSE
4         | Bank of Baroda      | 32700          | 178           | BSE
5         | Yes Bank            | 670000         | 875.75        | BSE
6         | Kotak Mahindra Bank | 50750000       | 945.7         | NSE
7         | HDFC Bank           | 45730000       | 955.9         | BSE
8         | Allahabad Bank      | 45000          | 156.3         | BSE
9         | Karnataka Bank      | 5600           | 145.3         | BSE
10        | Syndicate Bank      | 746000         | 135.7         | NSE
11        | Canara Bank         | 34670          | 245.6         | NSE
12        | IndusInd Bank       | 44550000       | 456.8         | NSE

a. Construct a bitmap index for the attributes Exchange and Trading Volume for this table. 
b. Indicate how these two bitmap-indices can be used to answer the query: Select all those tuples which traded in NSE and that have a trading volume of more than 30 million shares.

Answer:

a.
Bitmap on Exchange

NSE: 1 0 0 0 0 1 0 0 0 1 1 1
BSE: 0 1 1 1 1 0 1 1 1 0 0 0

Bitmap on Trading Volume (one bitmap per distinct value; every volume in this table is unique, so each bitmap has exactly one bit set):

55075000: 1 0 0 0 0 0 0 0 0 0 0 0
23500000: 0 1 0 0 0 0 0 0 0 0 0 0
4567895 : 0 0 1 0 0 0 0 0 0 0 0 0
32700   : 0 0 0 1 0 0 0 0 0 0 0 0
670000  : 0 0 0 0 1 0 0 0 0 0 0 0
50750000: 0 0 0 0 0 1 0 0 0 0 0 0
45730000: 0 0 0 0 0 0 1 0 0 0 0 0
45000   : 0 0 0 0 0 0 0 1 0 0 0 0
5600    : 0 0 0 0 0 0 0 0 1 0 0 0
746000  : 0 0 0 0 0 0 0 0 0 1 0 0
34670   : 0 0 0 0 0 0 0 0 0 0 1 0
44550000: 0 0 0 0 0 0 0 0 0 0 0 1

b.
NSE bitmap:           (1 0 0 0 0 1 0 0 0 1 1 1)
Trading Volume > 30M: (1 0 0 0 0 1 1 0 0 0 0 1)

The volume bitmap is obtained by ORing the bitmaps of all values above 30 million (55075000, 50750000, 45730000, 44550000). To answer the query, we simply AND the two bitmaps together:

NSE: (1 0 0 0 0 1 0 0 0 1 1 1)
TdV: (1 0 0 0 0 1 1 0 0 0 0 1) AND
------------------------------------------
     (1 0 0 0 0 1 0 0 0 0 0 1)

The 1st, 6th, and 12th bits of the resulting bitmap are set to 1, so records 1 (SBI), 6 (Kotak Mahindra Bank), and 12 (IndusInd Bank) are the stocks traded on NSE with a trading volume of more than 30 million shares.
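For reference, a sketch of the equivalent SQL query that such bitmap indexes accelerate, assuming a hypothetical STOCK_TRADING table matching the relation above:

-- The optimizer can evaluate both predicates through the bitmap indexes
-- and AND the resulting bitmaps before touching the base table.
SELECT *
  FROM STOCK_TRADING
 WHERE Exchange = 'NSE'
   AND Trading_Volume > 30000000;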

Note: below is a worked example of the same question, included for understanding; study it and derive the solution in the same way.

Bitmap Index:
Bitmap indexes are used by DBMSs to accelerate decision-support queries. Their main advantage is that complex logical selection operations can be carried out very quickly: low-cost Boolean operations such as OR, AND, and NOT are applied to the bitmaps to minimize the search space before the primary source data is accessed.

A bitmap index consists of one bitmap for each value that an attribute can take.

We will illustrate our discussion with a simple example about stock trading. Let us see the following table on stocks traded at different stock exchanges and consider the last column (Exchange).
Example: (the table of 12 stocks and the exchanges where they trade, NASDAQ or NYSE, is not reproduced here)
We notice that stocks are traded at two different stock exchanges at NASDAQ and at NYSE. Indeed, the attribute Exchange has two distinct attribute values. In addition, we see that our stock examples include 12 different stocks, which are uniquely identified by their record ID given in the first column.
Simple bitmaps for the stocks and their corresponding exchanges:

NASDAQ: (1 0 0 0 0 1 0 0 0 1 1 1)
NYSE: (0 1 1 1 1 0 1 1 1 0 0 0)

We need two bitmaps because we have two distinct attribute values for the stock exchange. Each bitmap consists of 12 bit values since our example comprises 12 different stocks.

For instance, the last bit of the bitmap of NYSE is set to 0 because the last stock is not traded at NYSE. However, the second stock is traded at NYSE and, thus, the bit 2 is set to 1. In general, a bit is set to 1 if the stock is traded at the particular stock exchange, and it is set to 0 otherwise. Therefore, we have a straightforward way of describing the stock exchange by means of bitmaps.

Data Warehouse - Mid Sem Exercise

Q.1 Indicate if the following statements are true/false with a brief justification (1 x 5 marks)
a. In type 3 changes of slowly changing dimensions, attribute values are overwritten.
b. Fact table always contains explicit measures.
c. Data warehouses handle unstructured data seamlessly.
d. In federated DW architecture, information delivery appears centralized.
e. Data transformation rules are contained in operational metadata.
Ans. a)
False
Reason: In type 3 changes of slowly changing dimensions, a new column is added; usually only the current and the previous value of the attribute are kept in the database.
• Attribute values are overwritten in type 1 changes of slowly changing dimensions.

Ans. b)
False
Reason: Factless fact tables record events or coverage and consist only of foreign keys to the dimension tables, with no explicit measures. Hence a fact table does not always contain explicit measures.

Ans. c)
False
Reason: Traditional data warehouses rely on batch-processing Extract-Transform-Load jobs designed for structured data; unstructured data (text, documents, images) requires special handling and is not processed seamlessly.

Ans. d)
False
Reason: A federated data warehouse integrates key business measures and dimensions across pre-existing, disparate warehouses and marts.
• The iterative manner of the federated approach helps reduce implementation time and cost, but delivery and reporting remain spread across the federated parts and are not centralized.

Ans. e)
False
Reason: Operational metadata contains the information about the operational data sources that allows us to trace data back to its original source.
• The category of metadata that records all the data transformations taking place in the data staging area is called transformation metadata.
Q.2 Write a brief note on the following
    1. OLTP vs. OLAP
    2. ETL vs. Data Virtualization
    3. Semi-additive vs. Non-additive measures
    4. Periodic snapshot vs. Accumulating snapshot
    5. Time dimension
Ans.a) OLTP vs OLAP:

OLTP (Online Transaction Processing) systems support day-to-day operations with many short read/write transactions on current, normalized data, and are optimized for fast inserts and updates. OLAP (Online Analytical Processing) systems support decision making with complex analytical queries over large volumes of historical, de-normalized data, and are optimized for reads. (The comparison table is not reproduced here.)

Ans.b)  ETL vs. Data Virtualization: 

Extract, Transform, and Load (ETL) 
ETL tools are designed to make bulk copies of large data sets, with millions of rows, from large structured data sources.
They perform complex, multi-pass data transformations and cleansing operations and load the data into targeted data stores.
They create historical records of data, e.g. snapshots at a particular time, to analyze how the data set changes over time.

Data virtualization 
Acting as a virtual data source to augment an ETL process.
Federating multiple data warehouses.
Extending existing data warehouses with new data sources.
Insulating applications from changes in the underlying data sources (e.g., migrating data warehouses); a sketch of the federation idea follows.
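A minimal sketch of federation through virtualization, assuming two hypothetical warehouse schemas (dw_emea and dw_apac) exposed through one virtual view:

-- The view presents a single logical data source; no data is copied,
-- and queries are pushed down to the underlying warehouses at run time.
CREATE VIEW global_sales AS
SELECT Product, Region, Sales FROM dw_emea.sales_fact
UNION ALL
SELECT Product, Region, Sales FROM dw_apac.sales_fact;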

Ans.c)  Semi-additive vs. Non-additive measures

Semi-additive measures
- Facts that can be summed across some dimensions in the fact table but not others; typically they can be aggregated along every dimension and hierarchy except the time dimension.
- Example: a daily account balance can be summed across the customer dimension, but summing it across the time dimension is meaningless (see the sketch below).

Non-additive measures
- Facts that cannot be summed across any dimension in the fact table.
- Examples: measures expressed as percentages or ratios.
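A minimal sketch with a hypothetical ACCOUNT_BALANCE_FACT table (one row per customer per day):

-- Valid: the semi-additive balance summed across customers for one day.
SELECT Date_Key, SUM(Balance) AS bank_total
  FROM ACCOUNT_BALANCE_FACT
 GROUP BY Date_Key;

-- Summing across time is invalid; average the balance over the period instead.
SELECT Customer_Key, AVG(Balance) AS avg_balance
  FROM ACCOUNT_BALANCE_FACT
 GROUP BY Customer_Key;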

Ans.d)  Periodic snapshot vs. Accumulating snapshot

Periodic snapshot

- Captures the state of the measures at the end of a standard period, such as a day, a week, or a month, based on status events occurring during that period.
- These fact tables frequently comprise many facts, because any measurement event consistent with the fact table grain is permissible.
- They are uniformly dense in their foreign keys: even if no activity takes place during the period, a row is typically inserted into the fact table containing a zero or null for each fact.

Accumulating snapshot

- An accumulating snapshot summarizes the measurement events occurring at predictable steps between the beginning and the end of a process; a single fact row is inserted at the start and updated as each step completes.
- Pipeline or workflow processes, such as claim processing or order fulfillment, which have an initial starting point, standard intermediate steps, and a defined endpoint, are modeled with this type of fact table.
- The fact table carries a date foreign key for each critical milestone of the process, alongside the other, unchanging dimensions.

Ans.e) Time dimension

- The date dimension is one of the dimension tables most commonly related to the fact table; the most common and useful time dimension is the calendar date dimension, with the granularity of a single day.
- Although the calendar date dimension is the most important time dimension, we also need a calendar month dimension when the fact table's time grain is a month. In some environments we may need to build calendar week, quarter, or year dimensions as well, if there are fact tables at each of these grains.
- The calendar month dimension should be a separate physical table, created by physically eliminating selected rows and columns from the calendar day dimension.

Q.3) What is a staging area? What is the purpose of a staging area? Do we need it? (1+2+2 marks)
Ans)
  • Staging Area: a place on the data warehouse server where temporary tables hold data extracted from the source systems before it is loaded into the warehouse tables.
  • Do we need it: We need the staging area to hold the data and to perform data cleansing and merging before loading the data into the warehouse. In the absence of a staging area, the data load would have to go from the OLTP system to the OLAP system directly, which would severely hamper the performance of the OLTP system.
  • Purpose of the staging area (see the sketch below):
    • Flat mapping, i.e. landing all the OLTP data in it without applying any business rules.
    • Cleansing and validating the data before it enters the warehouse.
    • Pushing data into a staging table takes less time because no transformation business rules are applied to it.
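A minimal sketch of the staging pattern, with hypothetical table names:

-- 1. Land the raw extract without business rules (flat mapping).
INSERT INTO STG_CUSTOMER (Cust_Id, Name, City)
SELECT Cust_Id, Name, City
  FROM SRC_CUSTOMER;

-- 2. Cleanse and validate in the staging area, then load the warehouse table.
INSERT INTO DW_CUSTOMER (Customer_Key, Name, City)
SELECT Cust_Id, TRIM(Name), UPPER(City)
  FROM STG_CUSTOMER
 WHERE Name IS NOT NULL;   -- simple validation rule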
Q.4) A retail store chain operates at many locations and keeps track of individual customer information and their purchases. The company that operates the retail store chain is interested in understanding the effectiveness of promotions offered to its customers. Design a star schema to help with their analytical requirements.

Ans)


(Star schema diagram: Retail Store Star Schema; not reproduced here. A hedged DDL sketch is given below.)
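Since the diagram is not reproduced, a hedged DDL sketch of one reasonable design (hypothetical names): a Sales fact at the transaction-line grain surrounded by Date, Product, Store, Customer, and Promotion dimensions, with the Promotion dimension enabling the promotion-effectiveness analysis:

CREATE TABLE DATE_DIM      (Date_Key INT PRIMARY KEY, Full_Date DATE, Cal_Month INT, Cal_Year INT);
CREATE TABLE PRODUCT_DIM   (Product_Key INT PRIMARY KEY, Product_Name VARCHAR(100), Category VARCHAR(50));
CREATE TABLE STORE_DIM     (Store_Key INT PRIMARY KEY, Store_Name VARCHAR(100), City VARCHAR(50));
CREATE TABLE CUSTOMER_DIM  (Customer_Key INT PRIMARY KEY, Customer_Name VARCHAR(100), Segment VARCHAR(50));
CREATE TABLE PROMOTION_DIM (Promotion_Key INT PRIMARY KEY, Promo_Name VARCHAR(100), Promo_Type VARCHAR(50));

-- Fact table: one row per product per customer transaction line.
CREATE TABLE SALES_FACT (
  Date_Key      INT REFERENCES DATE_DIM,
  Product_Key   INT REFERENCES PRODUCT_DIM,
  Store_Key     INT REFERENCES STORE_DIM,
  Customer_Key  INT REFERENCES CUSTOMER_DIM,
  Promotion_Key INT REFERENCES PROMOTION_DIM,  -- links each sale to a promotion
  Quantity      INT,
  Sales_Amount  DECIMAL(12,2)
);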

Q.5) Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the two measures count and charge, where charge is the ticket price that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own rate.
(a) Draw a star schema diagram for the data warehouse.
(Star schema diagram omitted: a central sales fact table with the measures count and charge, joined to the date, spectator, location, and game dimension tables.)

(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to find the total charge paid by student spectators at Eden Gardens Stadium in 2016
                        
Ans. b) The specific OLAP operations to be performed are:
Roll-up on date from date_id to year.
Roll-up on game from game_id to all.
Roll-up on location from location_id to location_name.
Roll-up on spectator from spectator_id to status.
Dice with status = "student", location_name = "Eden Gardens Stadium", and year = 2016 (the equivalent SQL is sketched below).
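For reference, a hedged SQL equivalent of these operations, with hypothetical table and column names:

-- Total charge paid by student spectators at Eden Gardens Stadium in 2016.
SELECT SUM(f.charge) AS total_charge
  FROM sales_fact f
  JOIN spectator_dim s ON f.spectator_id = s.spectator_id
  JOIN location_dim  l ON f.location_id  = l.location_id
  JOIN date_dim      d ON f.date_id      = d.date_id
 WHERE s.status = 'student'
   AND l.location_name = 'Eden Gardens Stadium'
   AND d.cal_year = 2016;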

Data Warehouse - Important QnA 3


Question. Provide brief answers to the following:

a. A data warehouse is subject oriented. For the following companies, what will be the critical business subjects?

i. Manufacturing company
ii. Insurance company
iii. Bank
iv. University

b. Give an example of each to state the condition when to treat Customer as a fact and when as a dimension.
c. Why does Kimball advise construction of first-level data marts before embarking on consolidated data marts?
d. What are the key and cardinality constraints between fact and dimension tables?
e. Discuss a 'Conformed Dimension' with an example.
Ans. A: Business subjects for the above companies:

i. Manufacturing company: production, sales, inventory, shipments.
ii. Insurance company: policies, premiums, claims.
iii. Bank: customers, accounts, deposits, loans.
iv. University: students, admissions, courses, faculty.
Ans B. Example of treating Customer as a fact and as a dimension: this basically depends on your business requirements.
  • If you want the customer's data in a descriptive format, treat Customer as a dimension table. Its columns would then be, for example: Customer Key, Customer Name, Customer Code, Address, State.
  • If you want the customer's data in a summarized format, treat Customer as a fact table. Its columns would then be, for example: the keys of all the dimensions, plus measures such as Total Experience, Count of Customers, Proficiency.
Ans C. First-level data marts before embarking on consolidated data marts:
  • It is prudent to focus on first-level data marts as dimensional building blocks before tackling the task of consolidating. Kimball advises this because first-level marts minimize the risk of signing up for an implementation that is too ambitious, and because consolidation becomes difficult if the overall requirements are not considered at the beginning.
  • Once it is time to begin a data mart development project, it is recommended to start the actual implementation with first-level data marts. Most of the overall risk of failure comes from biting off too much of the extract-transformation-load (ETL) data staging design and development effort. Once we have fully enumerated the list of first-level data marts, we can identify the more complex multi-source marts as a second step. These data marts are referred to as consolidated data marts because they typically cross business processes.
Ans D. Constraints between fact and dimension tables: the following constraints differentiate the fact and dimension tables.
  • A dimension table usually contains surrogate keys, natural keys, and descriptive attributes. In contrast, a fact table contains foreign keys, measures/quantities, and degenerate dimensions.
  • Dimension tables deliver descriptive or contextual information for the measurements in a fact table; the fact tables, in turn, provide the measurements of a business process.
  • The dimension tables have to be loaded first. When loading the fact tables, one must look up the dimension tables, because the fact table contains foreign keys and measures, and those foreign keys are the primary keys of the dimension tables.
  • The primary key of a dimension table is a foreign key in the fact table. Each record in a dimension table can describe many records in the fact table, making the join cardinality of dimension tables to fact tables one-to-many.
Ans E. Conformed Dimension:
  • A conformed dimension has exactly the same meaning and content when referred to from different fact tables.
  • A conformed dimension can refer to multiple tables in multiple data marts within the same organization.
  • For two dimension tables to be considered conformed, they must either be identical or one must be a subset of the other.
  • For instance, two dimension tables that are precisely identical except for the primary key are not considered conformed dimensions.
  • The time dimension is a common conformed dimension in an organization or business.
Question. Mention the approaches of Top-down (Inmon) and Bottom-up (Kimball) for Enterprise Data Warehouse development. What are the advantages of each approach? 

Ans.
Top-down Approach (Inmon): Inmon defines a data warehouse as a centralized repository for the entire enterprise. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.

Bottom-up Approach (Kimball): In the bottom-up approach, the data warehouse is the conglomerate (group) of all data marts within the enterprise. Information is always stored in the dimensional model. Unlike the top-down approach, data marts contain both the atomic and the summary data that users may want or need in the future. Data is modeled in a star schema design to optimize usability and query performance.

Advantages:
  • Top-down (Inmon): an enterprise-wide, consistent view of the data; a single version of the truth; data marts stay synchronized because they are all sourced from the central warehouse.
  • Bottom-up (Kimball): faster and cheaper initial delivery; lower up-front risk; each data mart provides business value early, and the warehouse grows iteratively.

Question. A bank's management wants to analyze customer transactions based on the branch, customer, and account type. They monitor transactions using counts and amounts. Draw a star schema to satisfy this need.
In the same scenario, it was noticed that the customer table had a lot of redundant values for city, state, and country. Modify the above schema to handle the redundancy. Also mention any demerits, if any, of the modified design.


Ans. Star Schema for Customer Transactions:

(Star schema diagram omitted: a Transaction fact table with the measures transaction count and amount, joined to Branch, Customer, Account Type, and Date dimensions.)

Star Schema for Customer Transactions to reduce redundancy:

(Modified, snowflaked schema omitted: the Customer dimension is normalized by moving city, state, and country into separate linked tables. Demerit: the extra joins introduced by snowflaking make queries more complex and slower. A hedged DDL sketch follows.)
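A minimal DDL sketch of the snowflaked Customer dimension, with hypothetical names:

-- Geography normalized out of the customer dimension (snowflaking).
CREATE TABLE COUNTRY_DIM (Country_Key INT PRIMARY KEY, Country_Name VARCHAR(50));
CREATE TABLE STATE_DIM   (State_Key INT PRIMARY KEY, State_Name VARCHAR(50),
                          Country_Key INT REFERENCES COUNTRY_DIM);
CREATE TABLE CITY_DIM    (City_Key INT PRIMARY KEY, City_Name VARCHAR(50),
                          State_Key INT REFERENCES STATE_DIM);

CREATE TABLE CUSTOMER_DIM (
  Customer_Key  INT PRIMARY KEY,
  Customer_Name VARCHAR(100),
  City_Key      INT REFERENCES CITY_DIM   -- city/state/country now reached via joins
);

-- Demerit: a report by country now needs three extra joins
-- (CUSTOMER_DIM -> CITY_DIM -> STATE_DIM -> COUNTRY_DIM).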