November 12, 2018

Data Mining - Mid Sem Solutions


Question:
Give an example for each of the following preprocessing activities:
a. Incomplete
b. Inconsistent

Answer:
Data Preprocessing: a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors, so preprocessing is needed to resolve such issues.
"Preprocessing is needed to improve data quality"

A. Incomplete: Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
E.g. many tuples have no recorded value for several attributes:
Occupation = “ ” (missing data)

B. Inconsistent: Containing discrepancies in codes or names.
E.g.
Age = “42”, Birthday = “03/07/2010”
Ratings that were “1, 2, 3” are now “A, B, C”
Discrepancies between duplicate records


Question:
Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio).
a. Number of patients in hospital
b. ISBN numbers for books

Answer:
A. Number of patients in hospital - Discrete, quantitative, ratio
B. ISBN numbers for books - Discrete, qualitative, nominal

For your reference:

Attribute: A data field, representing a characteristic or feature of a data object.

Types:
  • Nominal
  • Binary
  • Ordinal
  • Numeric: quantitative
    • Interval-scaled
    • Ratio-scaled
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes

Binary
Nominal attribute with only 2 states (0 and 1)

Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to the most important outcome (e.g., HIV positive)

Ordinal
Values have a meaningful order (ranking) but magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings

Examples for exercise:
Brightness as measured by a light meter.
Answer: Continuous, quantitative, ratio

Angles as measured in degrees between 0° and 360°.
Answer: Continuous, quantitative, ratio

Bronze, Silver, and Gold medals as awarded at the Olympics.
Answer: Discrete, qualitative, ordinal

Ability to pass light in terms of the following values: opaque, translucent, transparent.
Answer: Discrete, qualitative, ordinal

Military rank.
Answer: Discrete, qualitative, ordinal

Density of a substance in grams per cubic centimeter.
Answer: Continuous, quantitative, ratio

Question:
Suppose that the data for analysis includes the attribute price: 4, 8, 15, 21, 21, 24, 25, 28, and 34. Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps.

Answer:
The following steps are required to smooth the above data using smoothing by bin means with a bin depth of 3.

Step 1: Sort the data. [Which is already sorted]

Step 2: Partition the data into equi-depth bins of depth 3.
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Step 3: Calculate the arithmetic mean of each bin. [Sum of values / count of values]

Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
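A minimal Python sketch of this smoothing-by-bin-means procedure (variable names are illustrative):

# Smoothing by bin means with equi-depth bins (depth = 3)
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
depth = 3

prices.sort()  # Step 1: sort the data (already sorted here)

smoothed = []
for i in range(0, len(prices), depth):          # Step 2: partition into bins of depth 3
    bin_values = prices[i:i + depth]
    mean = sum(bin_values) / len(bin_values)    # Step 3: arithmetic mean of the bin
    smoothed.extend([round(mean)] * len(bin_values))  # Step 4: replace values by the mean

print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]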

For Reference: Can be asked as

Using Equi width:
width=(max−min)/N
Given as N=3

Step1: Sort the data.

Step2: width=(34-4)/3 = 10

Step3: So the bin intervals are: (1,10), (11,20), (21,30), (31,40)

Step4: Assign each number to its interval:
Bin1 -  4, 8
Bin2 -  15
Bin3 -  21, 21, 24, 25,28
Bin4 -  34
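A matching sketch for the equi-width partitioning (the interval boundaries follow the post's (1,10), (11,20), ... convention):

# Equi-width binning: width = (max - min) / N
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
N = 3
width = (max(prices) - min(prices)) / N   # (34 - 4) / 3 = 10

intervals = [(1, 10), (11, 20), (21, 30), (31, 40)]
for low, high in intervals:
    members = [v for v in prices if low <= v <= high]
    print((low, high), members)
# (1, 10) [4, 8]   (11, 20) [15]   (21, 30) [21, 21, 24, 25, 28]   (31, 40) [34]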

Using Equi depth: After sorting the data, follow the step below:

Sorted: 4, 8, 15, 21, 21, 24, 25, 28, 34

Step: Partition the data into equi-depth bins of depth 3.
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Question:
Consider the one-dimensional data set shown on the below table

X:   0.6   3.2   4.5   4.6   4.9   5.2   5.6   5.8   7.1   9.5
Y:    -     -     +     +     +     -     -     +     -     -

Classify the data point x = 5.0 according to its 3- and 9-nearest neighbors (using majority vote).

Answer:
We first find the difference of each data point X from x = 5.0; refer to the table below.

x     X     Difference |x - X|   Y
5.0   0.6    4.4                 -
5.0   3.2    1.8                 -
5.0   4.5    0.5                 +
5.0   4.6    0.4                 +
5.0   4.9    0.1                 +
5.0   5.2    0.2                 -
5.0   5.6    0.6                 -
5.0   5.8    0.8                 +
5.0   7.1    2.1                 -
5.0   9.5    4.5                 -

As asked,
Using the 3-nearest-neighbors method, the 3 closest points to x = 5.0 are those with the smallest difference: 4.9, 5.2, 4.6
Classes -> +, -, +
Using majority vote, the 3-nearest-neighbor classification is: +

Using the 9-nearest-neighbors method, the 9 closest points to x = 5.0 are: 4.9, 5.2, 4.6, 4.5, 5.6, 5.8, 3.2, 7.1, 0.6
Classes -> +, -, +, +, -, +, -, -, -
Using majority vote (4 '+' vs 5 '-'), the 9-nearest-neighbor classification is: -
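A minimal Python sketch of this majority-vote k-NN classification (names are illustrative):

from collections import Counter

# One-dimensional training data: (X value, class label)
data = [(0.6, '-'), (3.2, '-'), (4.5, '+'), (4.6, '+'), (4.9, '+'),
        (5.2, '-'), (5.6, '-'), (5.8, '+'), (7.1, '-'), (9.5, '-')]

def knn_classify(x, k):
    # Take the k points with the smallest absolute distance from x
    neighbors = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]  # majority vote

print(knn_classify(5.0, 3))  # '+'  (neighbors 4.9, 5.2, 4.6 -> +, -, +)
print(knn_classify(5.0, 9))  # '-'  (4 '+' votes vs 5 '-' votes)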

Question:
Consider the following data set for a binary class problem.
A   B   Class Label
T   F   +
T   T   +
T   T   +
T   F   -
T   T   +
F   F   -
F   F   -
F   F   -
T   T   -
T   F   -

Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

Answer:

First, tabulate the class counts for each split from the table above:
A = T: 4 (+), 3 (-)    A = F: 0 (+), 3 (-)
B = T: 3 (+), 1 (-)    B = F: 1 (+), 5 (-)

Using Gini Index method:
The overall Gini index before splitting is: G_original = 1 − (4/10)² − (6/10)² = 0.48

The Gini index of each child after splitting on A, and the resulting gain, are:
G(A=T) = 1 − (4/7)² − (3/7)² = 0.4898
G(A=F) = 1 − (3/3)² − (0/3)² = 0
∆ = G_original − (7/10)·G(A=T) − (3/10)·G(A=F) = 0.1371

The Gini index of each child after splitting on B, and the resulting gain, are:
G(B=T) = 1 − (1/4)² − (3/4)² = 0.3750
G(B=F) = 1 − (1/6)² − (5/6)² = 0.2778
∆ = G_original − (4/10)·G(B=T) − (6/10)·G(B=F) = 0.1633

Therefore, attribute B will be chosen to split the node. 
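The computation above can be checked with a minimal Python sketch (function names are illustrative):

def gini(counts):
    # Gini index of a node given its class counts, e.g. [4, 6]
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

g_original = gini([4, 6])                               # 0.48
g_split_a = (7/10) * gini([4, 3]) + (3/10) * gini([0, 3])
g_split_b = (4/10) * gini([3, 1]) + (6/10) * gini([1, 5])
print(round(g_original - g_split_a, 4))  # 0.1371 -> gain for A
print(round(g_original - g_split_b, 4))  # 0.1633 -> gain for B (larger, so split on B)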

For Reference: Can be asked

--> Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
Answer:
Using Information Gain method:
The contingency tables after splitting on attributes A and B are:
A = T: 4 (+), 3 (-)    A = F: 0 (+), 3 (-)
B = T: 3 (+), 1 (-)    B = F: 1 (+), 5 (-)

The overall entropy before splitting is:
E_original = −0.4·log₂(0.4) − 0.6·log₂(0.6) = 0.9710

The entropy of each child after splitting on A, and the resulting information gain, are:
E(A=T) = −(4/7)·log₂(4/7) − (3/7)·log₂(3/7) = 0.9852
E(A=F) = −(3/3)·log₂(3/3) − (0/3)·log₂(0/3) = 0   [taking 0·log 0 = 0]
∆ = E_original − (7/10)·E(A=T) − (3/10)·E(A=F) = 0.2813

The entropy of each child after splitting on B, and the resulting information gain, are:
E(B=T) = −(3/4)·log₂(3/4) − (1/4)·log₂(1/4) = 0.8113
E(B=F) = −(1/6)·log₂(1/6) − (5/6)·log₂(5/6) = 0.6500
∆ = E_original − (4/10)·E(B=T) − (6/10)·E(B=F) = 0.2565

Therefore, attribute A will be chosen to split the node.
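A matching sketch for the information-gain computation (log base 2, with 0·log 0 taken as 0):

from math import log2

def entropy(counts):
    # Entropy of a node given its class counts; 0 * log(0) is treated as 0
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

e_original = entropy([4, 6])                            # 0.9710
gain_a = e_original - (7/10) * entropy([4, 3]) - (3/10) * entropy([3, 0])
gain_b = e_original - (4/10) * entropy([3, 1]) - (6/10) * entropy([1, 5])
print(round(gain_a, 4))  # 0.2813 -> information gain for A (larger, so split on A)
print(round(gain_b, 4))  # 0.2564 (the 0.2565 above comes from rounded intermediate values)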

Question: [10 Marks]
Derive Rules for the following data shown in table 1 using indirect method (Based on decision tree) and assign the class using the derived rules for the data given in table 2.
Table 1 (the training data) appeared as an image in the original post and is not reproduced here; it is a "play / don't play" weather data set with attributes Outlook, Temp, Humidity, Windy and the class label.

Table 2:
Outlook   Temp   Humidity   Windy   Class
Rain      65     80         FALSE   ?
Sunny     90     75         TRUE    ?

Answer:
Given:
  • The data set has five attributes.
  • There is a special attribute: Class is the class label.
  • The attributes Temp and Humidity are numerical.
  • The other attributes are categorical, that is, they cannot be ordered.
  • So, based on the Table 1 data set, we want to derive a set of rules that determine, from the values of Outlook, Temp, Humidity and Windy, whether or not to play.
We first tabulate the class counts for candidate splits as below and look for a pure node:


Outlook   Sunny   Overcast   Rain
Play        2        4         3
Don't       3        0         2



Windy   TRUE   FALSE
Play      3      6
Don't     4      1

As we can see, a pure node is obtained when splitting on Outlook (Overcast: 4 Play, 0 Don't), so Outlook is taken as the first splitting attribute. [A pure node is one whose records all belong to a single class.]
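A minimal sketch of this pure-node check (the records here are illustrative, since Table 1 itself is only an image in the original post):

from collections import defaultdict

# Illustrative records; the real Table 1 has five attributes
records = [
    {'outlook': 'overcast', 'windy': False, 'class': 'Play'},
    {'outlook': 'sunny',    'windy': True,  'class': "Don't"},
    {'outlook': 'overcast', 'windy': True,  'class': 'Play'},
]

def class_counts(records, attr):
    # Tally Play / Don't for each value of one attribute
    counts = defaultdict(lambda: {'Play': 0, "Don't": 0})
    for r in records:
        counts[r[attr]][r['class']] += 1
    return dict(counts)

# A value is a pure node when one of its class counts is 0,
# e.g. outlook = 'overcast' -> {'Play': 2, "Don't": 0}
print(class_counts(records, 'outlook'))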
In the second step, we only need to resolve the Sunny and Rain branches.
Note: we define ranges (bins) for Temperature and Humidity.


Outlook = Sunny
Temp    61..70   71..80   81..90
Play       1        1        0
Don't      0        2        1


Outlook = Sunny
Humidity   61..70   71..80   81..90   91..100
Play          2        0        0        0
Don't         0        0        2        1


Outlook = Sunny
Windy   TRUE   FALSE
Play      1      1
Don't     2      1

As we can see, pure nodes are obtained for attribute Humidity, so it is taken as the second splitting attribute under Sunny. [The 71..80 range contains no records, so it is not considered in the tree.]

When Outlook = Rain

Outlook = Rain
Temp    61..70   71..80   81..90
Play       2        1        1
Don't      1        1        0

Outlook = Rain
Humidity   61..70   71..80   81..90   91..100
Play          0        3        0        1
Don't         1        1        0        0

Outlook = Rain
Windy   TRUE   FALSE
Play      0      4
Don't     2      0

As we can see, pure nodes are obtained for attribute Windy (TRUE → Don't, FALSE → Play), so it is taken as the splitting attribute under Rain.


Below are the rules derived using the indirect method (based on the decision tree):
R1. IF (outlook=sunny) and (humidity=61..70) THEN (class=Play).
R2. IF (outlook=sunny) and (humidity=81..100) THEN (class=Don't).
R3. IF (outlook=overcast) THEN (class=Play).
R4. IF (outlook=rain) and (windy=true) THEN (class=Don't).
R5. IF (outlook=rain) and (windy=false) THEN (class=Play).
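A minimal sketch of rules R1–R5 as code (Temp does not appear in any rule; the uncovered 71..80 humidity range is folded into the Don't branch here, matching the class assigned in Table 2 below):

def classify(outlook, humidity, windy):
    if outlook == 'overcast':
        return 'Play'                                   # R3
    if outlook == 'sunny':
        return 'Play' if humidity <= 70 else "Don't"    # R1 / R2
    if outlook == 'rain':
        return "Don't" if windy else 'Play'             # R4 / R5

print(classify('rain', 80, False))   # Play
print(classify('sunny', 75, True))   # Don't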

Assigning classes in Table 2:

Outlook   Temp   Humidity   Windy   Class
Rain      65     80         FALSE   Play
Sunny     90     75         TRUE    Don't

Referenced Question:
Draw a decision tree for the given scenario. (The scenario appeared as an image in the original post and is not reproduced here.)


Answer:
Note: This is the same kind of question as the one solved above; you just need to compute the pure-node tables for Temperature and Humidity as well.

