Showing posts with label dwh. Show all posts
Showing posts with label dwh. Show all posts

August 01, 2018

Brief about Book Library Data Warehouse System

Topic: BOOK LIBRARY
Subject: DATA WAREHOUSE
Prepared by: Sumit

Q1) Identify the business processes of interest to senior management in the industry (domain) allocated to your group.
Answer)
Major libraries have large collections and circulation. Managing libraries electronically has resulted in the creation and management of large library databases, Library to the students and teachers who are cooperating in this e-learning environment.

Below are some of the business processes of interest to senior management:
  • Variety of Books: Need to better understand what books customers wanted and were willing to pay for. 
  • Fund the Books: Need to change its costs and cash flow so that the book library could continue to operate. 
  • Make Library Reliable: It has to be a library that has its customers to their wanted books on-time.
  • Book Borrowing
A crucial part of a library is the human intermediary the librarian. This intermediary connects the users to the information needed and can assist with advice about using the information retrieval systems and working with information.

Q2) List some questions that would be raised by senior management for improving the business process.
Answer)
There are many questions that can be asked by senior management for improving the above business process.
Some of the questions that will be asked are :
  • When the item was collected?
  • Which librarian registered it?
  • What is the item about?
  • Which branch library the item was registered at?
Q3) To address the above-mentioned questions; propose a DW design (schema diagram).
Answer)
In general for a DW Design basically four main characteristics are used:
Step 1: Identify the Business Process
Step 2: Declare the Grain
Step 3: Identify the Dimensions
Step 4: Identify the Facts

Our Book Library case, the following are steps:
  1. Business Process: Book borrowing is the business process.
  2. Declare the Grain: The second step is to declare the grain of the business process. In the book borrowing process, we declare a transaction issued in library automation system as the grain, which means an item is borrowed by a patron.
  3. Identify the Dimensions: The third step is to choose the dimensions. Dimensions represent how people describe and inspect the data from the process. Following are dimension table I will be using :
    • The Patron-Dimension describes the library patron’s characteristics. The attributes of Patron-Dimension include the name of the patron, gender, occupation, patron type, department, college, and so on.
    • The Item-Dimension describes every item belonging to the library, and its attributes indicating what relating to this item, including call number, title, author, subject, classification, language, location, MARC, collecting source, and so on. 
    • The Location-Dimension describes branch libraries supervised by the city library, and its attributes include the name of the branch library, named of the district it is located and the name of region library.
    • The Date-Dimension describes every hour of one day, and its attributes include hour, date, week, month and year. 
  4. Identify the Facts: The fourth step is to identify the facts. In the case of book borrowing, we identify the fact to measure the number of books borrowed. We declared a transaction that an item was borrowed by a patron as the grain in the prior step. Thus, the number of books borrowed here is equal to one.
  • The star schema is perhaps the simplest data warehouse schema.
  • It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. 
  • The center of the star consists of a large fact table and the points of the star are the dimension tables.
Star Schema for Library Book Borrowing:


Q4) List aggregations to improve the DW performance. Justify.
Answer)
  • Aggregates provide improvements in performance because of the significantly smaller number of records.
  • Aggregates allow quick access to Book Dimension data during reporting. Similar to database indexes, they serve to improve performance.
  • Aggregates are particularly useful in the following cases:
    • Executing and navigating in query data leads to delays if you have a group of queries
    • You want to speed up the execution and navigation of a specific query
    • You often use attributes in queries
    • You want to speed up reporting with specific hierarchies by adding a level of a specific hierarchy.
  • Aggregates are particularly useful in the following cases:
  • If the aggregate contains data that is to be evaluated by a query, the query data is read automatically from the aggregate.
  • Query: Total sales for books during the first week of December 2000 for location Mumbai.

Q5) List and justify any 5 metadata items that will be of interest to various stakeholders.
Answer)
  • Metadata means "data about data". 
  • Data that provides information about one or more aspects of metadata data is defined as; It is used to summarize the basic information about the data that can be tracked and can work with specific data.
  • Below are metadata items of various interest to stakeholders:
    • Purpose of the book
    • Time and date of issuing the book
    • Creator or author of the book
    • Location on a computer network where the book was issued.
    • Book quantity
    • Book quality
  • Below are metadata items of various interest to stakeholders:
Types of Meta Data:
  • Descriptive metadata is usually used for search and identification, such as searching and finding an object, such as title, author, topic, keyword, and publisher.
  • Administrative metadata provides information to help manage the source. Administrative metadata refers to the technical information, including file type, or when and how the file was created.
  • Structural metadata describes how components of an object are organized. An example of structural metadata will be how the pages are ordered to make chapters of a book.
Following are some key points that to be included in MetaData:

Definition of data warehouse − It includes the description of the structure of data warehouse. The description is defined by schema, view, hierarchy, derivative data definitions, and data mart locations and materials.

Operational Metadata − It includes currency of data and data lineage. The currency of the data means that the data is active, stored or pure, or not. The genealogy of the data means the history of the migrated data and the changes applied to it.

Business metadata − It has the data ownership information, business definition, and changing policies

May 06, 2018

Data Warehouse - MCQS

Question.
The data Warehouse is______.
A. read write only.
B. write only.
C. read only.
D. none.
ANSWER: A

Question.
__________ is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of
management decisions.
A. Web Mining.
B. Data Warehousing.
C. Data Mining.
D. Text Mining.

ANSWER: B

Question.
If you have to summarize data as per hierarchy in dimension attributes, which is appropriate SQL clause?

Select one:
SUM
HAVING
WHERE
ROLLUP

Answer: ROLLUP

Question.
Which of the following is Apache data warehouse?

Select one:
Hbase
Sqoop
HDFS
Hive

Answer: Hive

Question.
A Data Mart whose source is directly from transactional systems, legacy applications, or external data feeds

Select one:
Dependent Data Mart
Intra-entry Data Mart
Independent Data Mart
Inter-entry Data Mart

Answer: Independent Data Mart

Question.
In a star schema, the number of rows in a dimension table is

Select one:
1. Larger than the fact table
2. Smaller than the fact table
3. Same as in the fact table

4. None of the options


AnswerSmaller than the fact table

Question.
The difference between ROLLUP and CUBE in SQL is

Select one:
CUBE summarizes for all combinations while ROLLUP does along hierarchy
CUBE summarizes along a hierarchy while ROLLUP does for all combinations
CUBE and ROLLUP handle NULLs differently
CUBE and ROLLUP are the same

Answer: CUBE summarizes for all combinations while ROLLUP does along hierarchy

Question.
Data scrubbing is which of the following?

Select one:
A process to load the data in the data warehouse and to create the necessary indexes
A process to upgrade the quality of data before it is moved into a data warehouse
A process to aggregate data after it is moved into a data warehouse
A process to reject data from the data warehouse and to create the necessary indexes


Answer: A process to upgrade the quality of data before it is moved into a data warehouse

Question.
Which of the following is NOT a kind of data warehouse application?

Select one:
Data mining
Analytical processing
Information processing
Transaction processing

AnswerTransaction processing

Question.
Which one of the following is NOT a characteristic of metadata?

Select one:
Self-describing
Includes user data
Data about data
Helps user to build ad-hoc queries

Answer: Includes user data

Question.
A _______ is a table on disk that contains the result set of a query

Select one:
Materialized View
Pivot
View
Table

AnswerMaterialized View

Question.
Which one is NOT the Codd’s rule for OLAP?

Select one:
Restricted cross dimensional operation
Transparency
Consistent Reporting Performance
Intuitive Data Manipulation

Answer: Restricted cross dimensional operation

Question.
A bitmap index is most appropriate on which of these columns

Select one:
Salary
Gender
Date of Birth
Name

Answer: Gender

Question.
SQL Windowing is based on

Select one:
Partitions defined for the table
Indexes
Materialized Views
Partitions defined in query

Answer: Partitions defined in query

Question.
The SQL analytical function to divide records into equal sized groups is

Select one:
PERCENTILE
GROUP BY
NTILE
QUARTILE

Answer: NTILE

Question.
The operation of moving from coarser granularity data to a finer granularity  is called

Select one:
Drill down
Dice
Rollup
Slice


AnswerDrill down

Question.
Views in SQL are of the type

Select one:
Always Physical
Sometimes physical
Always Virtual
Always up-to-date

AnswerSometimes physical

Question.
MOLAP stores aggregate values in which of the following structure?

Select one:
Table
Index
View
Array

AnswerArray

Question.
Which of the following OLAP techniques offers best latency?

Select one:
ROLAP
MOLAP
DOLAP
HOLAP

Answer: ROLAP

Question.
Aggregate navigator is based on

Select one:
Indexes
Summary tables
Logical Views
Materialized Views

AnswerMaterialized Views

Question.
Summarization for a table is to be done using 4 attributes. How many SQL statements using GROUP BY are required to produce equivalent results?

Select one:
4
8
64
16

Answer: 16

Question.
Difference between Rank and Sort in SQL is

Select one:
Sort allows aggregates while Rank does not
Rank gives a separate sequence of numbers
Sort is based on attribute while Rank is not
They are same

AnswerRank gives a separate sequence of numbers

Question.
Dimensions D1 and D2 are not considered conforming if

Select one:
D1 columns are subset of D2
D1 identical to D2
D1 and D2 are present on the same server
D1 is rolled up subset of D2

AnswerD1 and D2 are present on the same server

Question.
In SQL, what is the difference between GROUP BY and PARTITION BY clauses ?

Select one:
They are same.
SELECT clause
Type of attributes
WHERE clause

AnswerWHERE clause

Question.
When were analytic functions introduced in SQL?

Select one:
2011
1999
1992
2003


Answer2003

Question.
What is the difference between Rank and Dense_Rank in SQL?

Select one:
Rank and Dense_Rank are the same
Dense_Rank uses consecutive numbers
Dense_Rank hides some results
Dense_Rank rejects duplicates


Answer: Dense_Rank uses consecutive numbers

Question.
The type of relationship in star schema from dimension to fact is

Select one:
Many-to-Many
One-to-Many
One-to-One
Many-to-one 


Answer: one-to-many 

Question.
A table can not be partitioned by

Select one:
Hash
List
Range
Rank

AnswerRank

Question.
A snowflake schema is which of the following type of table?

Select one:
1. Fact
2. Normalized Dimension
3. Factless Dimension
4. All of the above

Answer: All of the above

Question
Enterprise Data Warehouse is a _________________warehouse.

Select one:
1. Public
2. Non Centralized
3. Centralized
4. Private

Answer: Centralized

Question
Which of the following is not a data component of a data warehouse?

Select one:
1. Component Key.
2. Metadata
3. Current detail data.

4. Lightly summarized data.

Answer: Component Key

Question
Data warehouses commonly use a _______________key to uniquely identify an entity for various purposes.

Select one:
1. Composite
2. Primary
3. Super
4. Surrogate

Answer: Surrogate

April 26, 2018

Data Warehouse - Mid Sem Excercise

Q.1   Indicate if the following statements are true/false with a brief justification  (1 X 5 marks)
a.       In type 3 changes of slowly changing dimensions, attribute values are overwritten.
b.      Fact table always contains explicit measures.
c.       Data Warehouses handle unstructured data seamlessly.
d.      In Federated DW architecture, information delivery appears centralized.
e.       Data transformation rules are contained in operational metadata.
Ans. a)
False
Reason: In type 3 changes of slowly changing dimensions there is adding a new column. In this type usually, only the current and previous value of the dimension is kept in the database.
•Attribute values are overwritten in type 1 changes of slowly changing dimensions.

Ans. b)
True
Reason: A fact table corresponds to a physical observable event, and not to the demands of a particular report. In a star schema, data is stored as either facts or dimensional attributes,.

Ans. c)
True
Reason: Data warehouses have relied on batch-processing Extract-Transform-Load jobs.

Ans. d)
False
Reason: A federated data warehouse is used to integrate key business measures and dimensions.
•The iterative manner of federated data warehouse approach helps reduce the implementation time and cost but the delivery reporting is standard and not centralized.

Ans. e)
False
Reason: Operational metadata contain all of the information about the operational data sources that allow us to trace back to the original source.
•Category of metadata which contains information about all the data transformations taking place in the data staging area is called Transformation Metadata.
Q.2  Write a brief note on the following                                                                             
    1. OLTP vs. OLAP
    2. ETL vs. Data Virtualization
    3. Semi-additive vs. Non-additive measures
    4. Periodic snapshot vs. Accumulating snapshot
    5. Time dimension
Ans.a)   OLTP vs OLAP:


Ans.b)  ETL vs. Data Virtualization: 

Extract, Transform, and Load (ETL) 
ETL tools that are designed to make bulk copies of large data sets, with millions of rows from large structured data sources.
Perform complex, multi-pass data changes and cleaning operations and load data into targeted data stores.
Creating historical records of data, e.g. snapshots at a particular time, to analyze how the dataset changes over time.

Data virtualization 
Acting as a virtual data source to augment an ETL process.
Federating multiple data warehouses.
Extending existing data warehouses with new data sources.
Separating applications with changes in the underlying data sources (e.g., migrating data warehouses.)

Ans.c)  Semi-additive vs. Non-additive measures

Semi-Additive measures
-These facts are the specific classes of measures that can be collected in all the dimensions and their hierarchy except for the time dimensions.
-Facts which can be summarized for certain dimensions in the facts table, but not others.
-Example: Daily balance can be understood through the dimensions of customers, but not through time dimension.

Non-additive measures
−These facts are those specific classes of measures that can not be integrated into all / any dimension and their hierarchy.
−Non-additives are facts which can not be summed up for any dimensions present in the facts table.
- Examples: The factors in which the percentages are, the ratios were calculated.

Ans.d)  Periodic snapshot vs. Accumulating snapshot

Periodic snapshot

These fact tables frequently comprise many facts because any measurement event reliable with the fact table grain is allowable
It captures the state of the measures based on the occurrence of a status events occurring over a standard period, such as a day, a week, or a month.
These are uniformly dense in their foreign keys because even if no action takes place through the period, a row is naturally inserted in the fact table having a zero or null for each fact

Accumulating snapshot
−A accumulation snapshot summarizes the measurement events occurring at the estimated steps between a row start and end of the process in the fact table.
Pipeline or workflow processes, such as claim processing or order fulfillment, in which there is an initial starting point, standard intermediate stage, and defined endpoint, this type of model can be modeled with the fact table.

−Snapshot Facts table is a foreign key for other dimensions, and alternatively, there are unalterable dimensions

Ans.e)  Time dimension

−The Date dimension is one of these dimension tables related to the Fact.
−Although the calendar date dimension is the most important time dimension, we also need a calendar month dimension when the fact table’s time grain is a month. In some environments, we may need to build calendar week, quarter, or year dimensions as well if there are fact tables at each of these grains.
−The most common and useful time dimension is the calendar date dimension with the granularity of a single day.  
The calendar month dimension should be a separate physical table and should be created by physically eliminating selected rows and columns from the calendar day dimension.

Q.3)What is a staging area? What is the purpose of a staging area? Do we need it?(1+2+2 marks)
Ans)
  • Staging Area: It's a place where you hold temporary tables on the Data warehouse server. These tables are connected to fact tables.
  • Do we need it: We basically need the staging area to hold the data, and perform data cleansing and merging before loading the data into the warehouse. In the deficiency of a staging area, the data load will have to go from the OLTP system to the OLAP system directly, which in fact will rigorously hamper the performance of the OLTP system.
  • Purpose of Staging area
    • Do flat mapping i.e. Putting all the OLTP data in to it without applying any business rules. 
    • Used for cleaning and validating data using first logic.
    • Pushing data into the staging table will take fewer time because there are no business rules of alteration applied to it.
Q.4)A retail store chain operates at many locations and keeps track of individual customer information and their purchases. The company (that operates the retail store chain) is interested in the understanding effectiveness of promotions offered to their customers. Design a star schema to help with their analytical requirements.

Ans)


Retail Store Star Schema

Q.5) Suppose that a data warehouse consists of the four dimensions, date, spectator, location, and game, and the two measures, count and charge, where the charge is the ticket price that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors,with each category having its own rate
(a) Draw a star schema diagram for the data warehouse.
Star Schema

(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to find the total charge paid by student spectators at Eden Gardens Stadium in 2016
                        
Ans. b)The specific OLAP operations to be performed are:
Roll-up on date from date id to year.
Roll-up on a game from game id to all.
Roll-up on location from location id to location name.
Roll-up on spectator from spectator id to status.
Dice with status=“ students”, location name=“ Eden Gardens Stadium”, and year = 2016.

Data Warehouse - Important QnA 3


Question. Provide brief answers to the following:

Data warehouse is subject oriented. For following companies what will be critical business subjects? 

i. Manufacturing company   
ii. Insurance company
iii. Bank
iv. University

Give an example of each to state the condition when to treat Customer as Fact and when as a Dimension. Why does Kimball advise construction of first-level data marts before embarking on consolidated data marts? What are the key and cardinality constraints between fact and dimension tables.
Discuss a 'Conformed Dimension' with example.
Ans. A: Business subjects for the below companies:  
       

Ans B. Example of treating Customer as Fact and Dimension:  This is basically depend on your business requirements.
  • If you want to get the customer's data in a descriptive format so make it customer as a dimension table. In this it will get the column name as : Customer Key , Customer name, Customer Code, Address, State. 
  • If you want to get the customer's data as in a summarize format so make it customer as a fact table. In this it will have the column name as : All the Dimensions Key, Total Experience, Count of Customers, Proficiency.
Ans C.  First-level data marts before embarking on consolidated data marts: 
  • It is prudent to focus on the first-level data marts as dimensional building blocks before tackling the task of consolidating and Kimball advised the same because they minimize the risk of signing up for an implementation that is too ambitious as well as Difficult to integrate if the overall requirements are not considered in the beginning.
  • Once it is time to begin a data mart development project, it is recommended to start the actual implementation with first-level data marts. Most of the overall risk of failure comes from biting off too much of the extract transformation-load (ETL) data staging design and development effort. Once we’ve fully enumerated the list of first-level data marts, then we can identify more complex multi source marts as a second step. These data marts are referred as consolidated data marts because they typically cross business processes.
Ans D. Constraints between Fact and Dimension tables: Following are the constraints in order to differentiate the fact and dimension table.
  • A dimension table usually contains a collection of surrogate keys, natural keys, and properties. In the struggle for this, a fact table will have a foreign key, measure / quantity, and dissolved dimensions.
  • Dimension tables deliver expressive or circumstantial information for the measurement of a fact table. In the case of fact tables, they provide the measurements of an initiative.
  • The dimension table has to be loaded first. However when loading the fact tables, one must have a look at the dimension table. Reason being the fact table has foreign keys, measures, facts, and the foreign keys of fact table are the primary keys in the dimension table.
  • The primary key in the dimension table is the foreign key in the fact table. Each record in a dimension table can describe many records in the fact table, making the join cardinality of dimension tables to fact tables one-to-many.
Ans E.  Conformed Dimension:  
  • A conformed dimension has exactly an equivalent significance and content when being mentioned from different fact tables. 
  • A conformed dimension can refer to multiple tables in multiple data marts within the same organization. 
  • If two dimension tables are needed to be considered as conformed, they must either be same or one must be a subdivision/subset of another. 
  • For instance, two dimension tables that are precisely identical with the exception of the primary key are not considered conformed dimensions.
  • The time dimension is a communal conformed dimension in an association or business. 
Question. Mention the approaches of Top-down (Inmon) and Bottom-up (Kimball) for Enterprise Data Warehouse development. What are the advantages of each approach? 

Ans.
Top-down Approach(Inmon): Inmon defines a data warehouse as a centralized repository for the entire enterprise. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form

Bottom-up Approach(Kimball)In a bottom-up approach, Data warehouse is the conglomerate(Group) of all data marts within the enterprise. Information is always stored in the dimensional model. Unlike the top-down approach, Data marts contain both atomic and summary that users may want or need in the future. Data is modeled in a star schema design to optimize usability and query performance.

Advantages: 





Question. A bank management wants to analyze the customer transactions based on the branch, customer and account type. They monitor transactions using counts and amounts. Draw a star schema to satisfy their need.
In the same above scenario, it was noticed that the customer table was having a lot of redundant values in terms of the city, state and country. Modify the above schema to handle redundancy. Also mention any demerits, if any, with the same.


Ans. Star Schema for Customer Transactions: 



Star Schema for Customer Transactions to reduce redundancy: