April 26, 2018

Data Warehouse - Mid Sem Excercise

Q.1   Indicate if the following statements are true/false with a brief justification  (1 X 5 marks)
a.       In type 3 changes of slowly changing dimensions, attribute values are overwritten.
b.      Fact table always contains explicit measures.
c.       Data Warehouses handle unstructured data seamlessly.
d.      In Federated DW architecture, information delivery appears centralized.
e.       Data transformation rules are contained in operational metadata.
Ans. a)
False
Reason: In type 3 changes of slowly changing dimensions there is adding a new column. In this type usually, only the current and previous value of the dimension is kept in the database.
•Attribute values are overwritten in type 1 changes of slowly changing dimensions.

Ans. b)
True
Reason: A fact table corresponds to a physical observable event, and not to the demands of a particular report. In a star schema, data is stored as either facts or dimensional attributes,.

Ans. c)
True
Reason: Data warehouses have relied on batch-processing Extract-Transform-Load jobs.

Ans. d)
False
Reason: A federated data warehouse is used to integrate key business measures and dimensions.
•The iterative manner of federated data warehouse approach helps reduce the implementation time and cost but the delivery reporting is standard and not centralized.

Ans. e)
False
Reason: Operational metadata contain all of the information about the operational data sources that allow us to trace back to the original source.
•Category of metadata which contains information about all the data transformations taking place in the data staging area is called Transformation Metadata.
Q.2  Write a brief note on the following                                                                             
    1. OLTP vs. OLAP
    2. ETL vs. Data Virtualization
    3. Semi-additive vs. Non-additive measures
    4. Periodic snapshot vs. Accumulating snapshot
    5. Time dimension
Ans.a)   OLTP vs OLAP:


Ans.b)  ETL vs. Data Virtualization: 

Extract, Transform, and Load (ETL) 
ETL tools that are designed to make bulk copies of large data sets, with millions of rows from large structured data sources.
Perform complex, multi-pass data changes and cleaning operations and load data into targeted data stores.
Creating historical records of data, e.g. snapshots at a particular time, to analyze how the dataset changes over time.

Data virtualization 
Acting as a virtual data source to augment an ETL process.
Federating multiple data warehouses.
Extending existing data warehouses with new data sources.
Separating applications with changes in the underlying data sources (e.g., migrating data warehouses.)

Ans.c)  Semi-additive vs. Non-additive measures

Semi-Additive measures
-These facts are the specific classes of measures that can be collected in all the dimensions and their hierarchy except for the time dimensions.
-Facts which can be summarized for certain dimensions in the facts table, but not others.
-Example: Daily balance can be understood through the dimensions of customers, but not through time dimension.

Non-additive measures
−These facts are those specific classes of measures that can not be integrated into all / any dimension and their hierarchy.
−Non-additives are facts which can not be summed up for any dimensions present in the facts table.
- Examples: The factors in which the percentages are, the ratios were calculated.

Ans.d)  Periodic snapshot vs. Accumulating snapshot

Periodic snapshot

These fact tables frequently comprise many facts because any measurement event reliable with the fact table grain is allowable
It captures the state of the measures based on the occurrence of a status events occurring over a standard period, such as a day, a week, or a month.
These are uniformly dense in their foreign keys because even if no action takes place through the period, a row is naturally inserted in the fact table having a zero or null for each fact

Accumulating snapshot
−A accumulation snapshot summarizes the measurement events occurring at the estimated steps between a row start and end of the process in the fact table.
Pipeline or workflow processes, such as claim processing or order fulfillment, in which there is an initial starting point, standard intermediate stage, and defined endpoint, this type of model can be modeled with the fact table.

−Snapshot Facts table is a foreign key for other dimensions, and alternatively, there are unalterable dimensions

Ans.e)  Time dimension

−The Date dimension is one of these dimension tables related to the Fact.
−Although the calendar date dimension is the most important time dimension, we also need a calendar month dimension when the fact table’s time grain is a month. In some environments, we may need to build calendar week, quarter, or year dimensions as well if there are fact tables at each of these grains.
−The most common and useful time dimension is the calendar date dimension with the granularity of a single day.  
The calendar month dimension should be a separate physical table and should be created by physically eliminating selected rows and columns from the calendar day dimension.

Q.3)What is a staging area? What is the purpose of a staging area? Do we need it?(1+2+2 marks)
Ans)
  • Staging Area: It's a place where you hold temporary tables on the Data warehouse server. These tables are connected to fact tables.
  • Do we need it: We basically need the staging area to hold the data, and perform data cleansing and merging before loading the data into the warehouse. In the deficiency of a staging area, the data load will have to go from the OLTP system to the OLAP system directly, which in fact will rigorously hamper the performance of the OLTP system.
  • Purpose of Staging area
    • Do flat mapping i.e. Putting all the OLTP data in to it without applying any business rules. 
    • Used for cleaning and validating data using first logic.
    • Pushing data into the staging table will take fewer time because there are no business rules of alteration applied to it.
Q.4)A retail store chain operates at many locations and keeps track of individual customer information and their purchases. The company (that operates the retail store chain) is interested in the understanding effectiveness of promotions offered to their customers. Design a star schema to help with their analytical requirements.

Ans)


Retail Store Star Schema

Q.5) Suppose that a data warehouse consists of the four dimensions, date, spectator, location, and game, and the two measures, count and charge, where the charge is the ticket price that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors,with each category having its own rate
(a) Draw a star schema diagram for the data warehouse.
Star Schema

(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to find the total charge paid by student spectators at Eden Gardens Stadium in 2016
                        
Ans. b)The specific OLAP operations to be performed are:
Roll-up on date from date id to year.
Roll-up on a game from game id to all.
Roll-up on location from location id to location name.
Roll-up on spectator from spectator id to status.
Dice with status=“ students”, location name=“ Eden Gardens Stadium”, and year = 2016.

Data Warehouse - Important QnA 3


Question. Provide brief answers to the following:

Data warehouse is subject oriented. For following companies what will be critical business subjects? 

i. Manufacturing company   
ii. Insurance company
iii. Bank
iv. University

Give an example of each to state the condition when to treat Customer as Fact and when as a Dimension. Why does Kimball advise construction of first-level data marts before embarking on consolidated data marts? What are the key and cardinality constraints between fact and dimension tables.
Discuss a 'Conformed Dimension' with example.
Ans. A: Business subjects for the below companies:  
       

Ans B. Example of treating Customer as Fact and Dimension:  This is basically depend on your business requirements.
  • If you want to get the customer's data in a descriptive format so make it customer as a dimension table. In this it will get the column name as : Customer Key , Customer name, Customer Code, Address, State. 
  • If you want to get the customer's data as in a summarize format so make it customer as a fact table. In this it will have the column name as : All the Dimensions Key, Total Experience, Count of Customers, Proficiency.
Ans C.  First-level data marts before embarking on consolidated data marts: 
  • It is prudent to focus on the first-level data marts as dimensional building blocks before tackling the task of consolidating and Kimball advised the same because they minimize the risk of signing up for an implementation that is too ambitious as well as Difficult to integrate if the overall requirements are not considered in the beginning.
  • Once it is time to begin a data mart development project, it is recommended to start the actual implementation with first-level data marts. Most of the overall risk of failure comes from biting off too much of the extract transformation-load (ETL) data staging design and development effort. Once we’ve fully enumerated the list of first-level data marts, then we can identify more complex multi source marts as a second step. These data marts are referred as consolidated data marts because they typically cross business processes.
Ans D. Constraints between Fact and Dimension tables: Following are the constraints in order to differentiate the fact and dimension table.
  • A dimension table usually contains a collection of surrogate keys, natural keys, and properties. In the struggle for this, a fact table will have a foreign key, measure / quantity, and dissolved dimensions.
  • Dimension tables deliver expressive or circumstantial information for the measurement of a fact table. In the case of fact tables, they provide the measurements of an initiative.
  • The dimension table has to be loaded first. However when loading the fact tables, one must have a look at the dimension table. Reason being the fact table has foreign keys, measures, facts, and the foreign keys of fact table are the primary keys in the dimension table.
  • The primary key in the dimension table is the foreign key in the fact table. Each record in a dimension table can describe many records in the fact table, making the join cardinality of dimension tables to fact tables one-to-many.
Ans E.  Conformed Dimension:  
  • A conformed dimension has exactly an equivalent significance and content when being mentioned from different fact tables. 
  • A conformed dimension can refer to multiple tables in multiple data marts within the same organization. 
  • If two dimension tables are needed to be considered as conformed, they must either be same or one must be a subdivision/subset of another. 
  • For instance, two dimension tables that are precisely identical with the exception of the primary key are not considered conformed dimensions.
  • The time dimension is a communal conformed dimension in an association or business. 
Question. Mention the approaches of Top-down (Inmon) and Bottom-up (Kimball) for Enterprise Data Warehouse development. What are the advantages of each approach? 

Ans.
Top-down Approach(Inmon): Inmon defines a data warehouse as a centralized repository for the entire enterprise. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form

Bottom-up Approach(Kimball)In a bottom-up approach, Data warehouse is the conglomerate(Group) of all data marts within the enterprise. Information is always stored in the dimensional model. Unlike the top-down approach, Data marts contain both atomic and summary that users may want or need in the future. Data is modeled in a star schema design to optimize usability and query performance.

Advantages: 





Question. A bank management wants to analyze the customer transactions based on the branch, customer and account type. They monitor transactions using counts and amounts. Draw a star schema to satisfy their need.
In the same above scenario, it was noticed that the customer table was having a lot of redundant values in terms of the city, state and country. Modify the above schema to handle redundancy. Also mention any demerits, if any, with the same.


Ans. Star Schema for Customer Transactions: 



Star Schema for Customer Transactions to reduce redundancy: