WASEIAN: Data Mining


February 15, 2019

Data Mining - Comprehensive Paper Solution

Note: This is a reference paper for you to go through. We are currently facing some issues with the website. If you have any other important questions or answers, let us know.
Share them by email: 1trickyworld1@gmail.com

Question:
For the following vectors x and y, calculate the cosine similarity and the Euclidean distance:
x = (4, 4, 4, 4), y = (2, 2, 2, 2)

Solution:

Cosine
x ● y = 4*2 + 4*2 + 4*2 + 4*2 = 32
||x|| = sqrt(4*4 + 4*4 + 4*4 + 4*4) = sqrt (64) = 8
||y|| = sqrt(2*2 + 2*2 + 2*2 + 2*2) = sqrt (16) = 4
cos(x,y) = (x ● y) / (||x||*||y||) = (32)/ (8*4)
cos(x,y) = 1

Euclidean
d(x, y) = sqrt((4-2)^2 + (4-2)^2 + (4-2)^2 + (4-2)^2) = sqrt(4 + 4 + 4 + 4) = sqrt(16)
Euclidean distance = 4
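
The hand calculation above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper names dot, cosine_similarity and euclidean_distance are our own and not part of the original paper.

```python
import math

def dot(a, b):
    # Sum of element-wise products of two equal-length vectors.
    return sum(p * q for p, q in zip(a, b))

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

x = (4, 4, 4, 4)
y = (2, 2, 2, 2)
print(cosine_similarity(x, y))
print(euclidean_distance(x, y))
```

The cosine is 1 because y is a scaled copy of x (same direction), while the Euclidean distance is non-zero because the magnitudes differ.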

Question:
Consider the one-dimensional data set shown in the table below.

X	0.6	3.2	4.5	4.6	4.9	5.2	5.6	5.8	7.1	9.5
Y	-	-	+	+	+	-	-	+	-	-

Classify the data point x = 5.0 according to its 3 and 9 nearest neighbors (using majority vote).

Answer:

First, compute the absolute difference between each data point X and the query point x = 5.0, as shown in the table below.

x	X	|x − X|	Y
5.0	0.6	4.4	−
5.0	3.2	1.8	−
5.0	4.5	0.5	+
5.0	4.6	0.4	+
5.0	4.9	0.1	+
5.0	5.2	0.2	−
5.0	5.6	0.6	−
5.0	5.8	0.8	+
5.0	7.1	2.1	−
5.0	9.5	4.5	−

As asked:

Using the 3-nearest-neighbors method, the 3 closest points to x = 5.0 are the ones with the smallest differences -> 4.9, 5.2, 4.6

Classes -> + − +

Using Majority Vote, 3-nearest neighbor: +

Using the 9-nearest-neighbors method, the 9 closest points to x = 5.0 are the ones with the smallest differences (only 9.5, at distance 4.5, is left out) -> 4.9, 5.2, 4.6, 4.5, 5.6, 5.8, 3.2, 7.1, 0.6

Classes -> + − + + − + − − −

Using Majority Vote, 9-nearest neighbor: −
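
The same vote can be reproduced in Python. The sketch below is only an illustration: the function name knn_classify is our own, and it assumes plain absolute difference as the distance and no special handling of ties (none occur for this data).

```python
from collections import Counter

X = [0.6, 3.2, 4.5, 4.6, 4.9, 5.2, 5.6, 5.8, 7.1, 9.5]
Y = ["-", "-", "+", "+", "+", "-", "-", "+", "-", "-"]

def knn_classify(query, k):
    # Rank the labelled points by absolute distance to the query point.
    ranked = sorted(zip(X, Y), key=lambda pair: abs(pair[0] - query))
    # Count the class labels of the k closest points and return the majority.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify(5.0, 3))
print(knn_classify(5.0, 9))
```

Note how the predicted class flips from + to − as k grows, because the larger neighborhood pulls in more of the distant − points.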

Question:
Suppose a group of 12 sales price records has been sorted as follows:
5; 10; 11; 13; 15; 35; 50; 55; 72; 90; 204; 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency partitioning
(b) equal-width partitioning
(c) clustering

Answer:
(a) equal-frequency (equidepth) partitioning:
Partition the data into equidepth bins of depth 4 (12 values / 3 bins = 4 values per bin):
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 90, 204, 215
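
As a quick check, equal-frequency partitioning of sorted data is just slicing into consecutive chunks. A minimal sketch, assuming the number of values divides evenly by the number of bins (the function name equal_frequency_bins is ours):

```python
def equal_frequency_bins(sorted_values, n_bins):
    # Each bin receives the same number of consecutive values.
    depth = len(sorted_values) // n_bins
    return [sorted_values[i * depth:(i + 1) * depth] for i in range(n_bins)]

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215]
for number, contents in enumerate(equal_frequency_bins(prices, 3), start=1):
    print(f"Bin {number}: {contents}")
```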

(b) equal-width partitioning:
Partitioning the data into 3 equal-width bins requires a width of (215 − 5)/3 = 70.
Starting from the minimum value 5, the intervals are [5, 75), [75, 145) and [145, 215].
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2: 90
Bin 3: 204, 215
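
A short Python sketch of equal-width partitioning follows. It rests on two assumptions of ours: the bins start at the minimum value (5), and the maximum value (215) is placed in the last bin so that exactly three bins of width 70 are produced. The function name equal_width_bins is hypothetical.

```python
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # Clamp the index so the maximum value lands in the last bin
        # instead of opening an extra one.
        index = min(int((v - lo) // width), n_bins - 1)
        bins[index].append(v)
    return width, bins

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215]
width, bins = equal_width_bins(prices, 3)
print(width)
for number, contents in enumerate(bins, start=1):
    print(f"Bin {number}: {contents}")
```

With these assumptions the three intervals are [5, 75), [75, 145) and [145, 215], so most of the records fall into the first bin, which shows how sensitive equal-width binning is to skewed data.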

(c) clustering:
Using K-means clustering to partition the data into three bins (the exact result can depend on the initial centroids), we get
Bin 1: 5, 10, 11, 13, 15, 35
Bin 2: 50, 55, 72, 90
Bin 3: 204, 215
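
The clustering answer can be reproduced with a small one-dimensional K-means (Lloyd's algorithm). This is a sketch, not the method used in the original paper: the starting centroids 10, 55 and 204 are an assumption of ours, and other starting points may converge to a different partition.

```python
def kmeans_1d(values, centroids, max_iter=100):
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return clusters

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215]
for number, contents in enumerate(kmeans_1d(prices, [10, 55, 204]), start=1):
    print(f"Bin {number}: {contents}")
```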

Data Mining - MCQS 2

Waseian Data Mining , quiz

Question
This clustering approach initially assumes that each data instance represents a single cluster.

Select one:
a. expectation maximization
b. K-Means clustering
c. agglomerative clustering
d. conceptual clustering

The correct answer is: agglomerative clustering
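
To make the idea concrete, here is a minimal sketch of agglomerative clustering on one-dimensional data. It assumes single linkage (cluster distance = smallest gap between members); the function name and the sample values are hypothetical and chosen only for illustration.

```python
def agglomerative_1d(values, n_clusters):
    # Start with every data instance as its own cluster.
    clusters = [[v] for v in sorted(values)]
    while len(clusters) > n_clusters:
        # In sorted 1-D data the closest pair of clusters is always adjacent,
        # so it is enough to compare the gaps between neighbours.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        # Merge the two closest clusters into one.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

print(agglomerative_1d([1, 2, 10, 11, 30], 3))
```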

Question
The correlation coefficient for two real-valued attributes is –0.85. What does this value tell you?

Select one:
a. The attributes are not linearly related.
b. As the value of one attribute decreases the value of the second attribute increases.
c. As the value of one attribute increases the value of the second attribute also increases.
d. The attributes show a linear relationship

The correct answer is: As the value of one attribute decreases the value of the second attribute increases.
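
A negative coefficient can be seen on a small made-up sample. The sketch below computes the Pearson correlation by hand; the data values are hypothetical and are not meant to give exactly –0.85, only a clearly negative result.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (numerator of the coefficient).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    # Square roots of the summed squared deviations (denominator).
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]     # hypothetical attribute 1
ys = [10, 8, 9, 4, 3]    # hypothetical attribute 2, mostly falling as xs rises
r = pearson(xs, ys)
print(round(r, 3))
```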

Question
Time Complexity of k-means is given by

Select one:
a. O(mn)
b. O(tkn)
c. O(kn)
d. O(t^2kn)

The correct answer is: O(tkn)

Question
Given a rule of the form IF X THEN Y, rule confidence is defined as the conditional probability that

Select one:
a. Y is false when X is known to be false.
b. Y is true when X is known to be true.
c. X is true when Y is known to be true
d. X is false when Y is known to be false.

The correct answer is: Y is true when X is known to be true.
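
In terms of counts, confidence(X -> Y) = (number of records containing both X and Y) / (number of records containing X). A minimal sketch with a hypothetical set of transactions (the items and the function name are our own):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def confidence(antecedent, consequent, records):
    # Records in which X (the antecedent) is true.
    with_x = [r for r in records if antecedent <= r]
    if not with_x:
        raise ValueError("antecedent never occurs, confidence is undefined")
    # Of those, the records in which Y (the consequent) is also true.
    with_both = [r for r in with_x if consequent <= r]
    return len(with_both) / len(with_x)

# IF bread THEN milk: bread occurs in 3 records, 2 of which also contain milk.
print(confidence({"bread"}, {"milk"}, transactions))
```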

Data Mining - Mid Sem Solutions

Waseian Data Mining , DM

Question:
Give an example for each of the following data quality problems addressed by preprocessing:
a. Incomplete
b. Inconsistent

Answer:
Data preprocessing: a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Preprocessing is needed to resolve such issues.
"Preprocessing is needed to improve data quality"

A. Incomplete: Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
E.g. Many tuples have no recorded value for several attributes,
Occupation = “ ” (missing data)

B. Inconsistent: Containing discrepancies in codes or names.
E.g.
Age = “42”, Birthday = “03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
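
Both kinds of problem can be detected with simple checks. The sketch below is only an illustration under our own assumptions: the record layout and function name are hypothetical, the birthday 03/07/2010 is read as 7 March 2010, and the reference date is taken to be 15 February 2019.

```python
from datetime import date

def find_quality_issues(record, today):
    """Return a list of data quality problems found in one record."""
    issues = []
    # Incomplete: an attribute with no recorded value.
    for field, value in record.items():
        if value is None or str(value).strip() == "":
            issues.append(f"incomplete: {field} has no value")
    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if isinstance(age, int) and isinstance(birthday, date):
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            issues.append(f"inconsistent: age {age} but birthday implies {implied_age}")
    return issues

record = {"occupation": " ", "age": 42, "birthday": date(2010, 3, 7)}
issues = find_quality_issues(record, today=date(2019, 2, 15))
for issue in issues:
    print(issue)
```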

Data Mining - MCQS

Waseian Data Mining , DM

Question

Which of the following activities is NOT a data mining task?

Select one:

a. Monitoring the heart rate of a patient for abnormalities

b. Monitoring and predicting failures in a hydropower plant

c. Predicting the future stock price of a company using historical records

d. Extracting the frequencies of a sound wave

The correct answer is: Extracting the frequencies of a sound wave

Question
Which of the following is not a data mining task?

Select one:
a. Feature Subset Detection
b. Association Rule Discovery
c. Regression
d. Sequential Pattern Discovery

The correct answer is: Feature Subset Detection

Question

Value set {poor, average, good, excellent} is an example of

Select one:

a. Nominal attribute

b. Numeric attribute

c. Continuous attribute

d. Ordinal attribute

The correct answer is: Ordinal attribute

Question
Which data mining task can be used for predicting wind velocities as a function of temperature, humidity, air pressure, etc.?

Select one:
a. Cluster Analysis
b. Regression
c. Classification
d. Sequential pattern discovery

The correct answer is: Regression

Question
Identify the example of sequence data

Select one:
a. weather forecast
b. data matrix
c. market basket data
d. genomic data

The correct answer is: genomic data

Question

In a data mining task where it is not clear what type of patterns could be interesting, the data mining system should

Select one:

a. handle different granularities of data and patterns

b. perform all possible data mining tasks

c. allow interaction with the user to guide the mining process

d. perform both descriptive and predictive tasks

The correct answer is: allow interaction with the user to guide the mining process

Question

Removing duplicate records is a data mining process called________

Select one:

a. data isolation

b. recovery

c. data pruning

d. data cleaning

The correct answer is: data cleaning

Question

Various visualization techniques are used in ___________ step of KDD

Select one:

a. selection

b. interpretation

c. transformation

d. data mining

The correct answer is: interpretation

Question

Which of the following is not a Visualization Method?

Select one:

a. Hierarchical visualization technique

b. Tuple based visualization Technique

c. Icon based visualization techniques

d. Pixel oriented visualization technique

The correct answer is: Tuple based visualization Technique

Question
Value set {brown, black, blue, green, red} is an example of

Select one:
a. Continuous attribute
b. Ordinal attribute
c. Numeric attribute
d. Nominal attribute

The correct answer is: Nominal attribute

Question
Which of the following is NOT a data quality related issue?

Select one:
a. Attribute value range
b. Outlier records
c. Missing values
d. Duplicate records

The correct answer is: Attribute value range

Question
To detect fraudulent usage of credit cards, the following data mining task should be used

Select one:
a. Outlier analysis
b. prediction
c. association analysis
d. feature selection

The correct answer is: Outlier analysis

Question

Which of the following is NOT an example of ordinal attributes?

Select one:

a. Ordered numbers

b. Military ranks

c. Zip codes

d. Movie ratings

The correct answer is: Zip codes

Question

Which of the following is not a data pre-processing method?

Select one:

a. Data Cleaning

b. Data Visualization

c. Data Discretization

d. Data Reduction

The correct answer is: Data Visualization

Question

Incorrect or invalid data is known as _________

Select one:

a. Outlier

b. Missing data

c. Changing data

d. Noisy data

The correct answer is: Noisy data

Question
Which of the following is an Entity identification problem?

Select one:
a. One person with different email address
b. One person's name written in different way
c. Title for person
d. One person with multiple phone numbers

The correct answer is: One person's name written in different way

Question

Data Visualization in mining cannot be done using

Select one:

a. Graphs

b. Information Graphics

c. Charts

d. Photos

The correct answer is: Photos

Question

Nominal and ordinal attributes can be collectively referred to as_________ attributes

Select one:

a. perfect

b. consistent

c. qualitative

d. optimized

The correct answer is: qualitative

Question

The number of itemsets of cardinality 4 from the item list {A, B, C, D, E} is

Select one:

a. 20

b. 2

c. 10

d. 5

The correct answer is: 5
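
The count is the binomial coefficient C(5, 4) = 5, since choosing 4 items is the same as choosing the 1 item to leave out. A short check using Python's standard library:

```python
from itertools import combinations
from math import comb

items = ["A", "B", "C", "D", "E"]
# Enumerate every itemset that contains exactly 4 of the 5 items.
itemsets = list(combinations(items, 4))
for itemset in itemsets:
    print(itemset)
print(len(itemsets), comb(len(items), 4))
```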

Question

Identify the example of Nominal attribute

Select one:

a. Salary

b. Temperature

c. Gender

d. Mass

The correct answer is: Gender

Question

Which of the following are descriptive data mining activities?

Select one:

a. Clustering

b. Deviation detection

c. Regression

d. Classification

The correct answer is: Clustering

Question

Which statement is not TRUE regarding a data mining task?

Select one:

a. Deviation detection is a predictive data mining task

b. Classification is a predictive data mining task

c. Clustering is a descriptive data mining task

d. Regression is a descriptive data mining task

The correct answer is: Regression is a descriptive data mining task

Question

Correlation analysis is used for

Select one:

a. identifying redundant attributes

b. eliminating noise

c. handling missing values

d. handling different data formats

The correct answer is: identifying redundant attributes

Question

In Binning, we first sort data and partition into (equal-frequency) bins and then which of the following is not a valid step

Select one:

a. smooth by bin boundaries

b. smooth by bin median

c. smooth by bin values

d. smooth by bin means

The correct answer is: smooth by bin values

Question

Which of the following is NOT data mining efficiency/scalability issue?

Select one:

a. The running time of a data mining algorithm

b. Incremental execution

c. Data partitioning

d. Easy to use user interface

The correct answer is: Easy to use user interface

Question

Synonym for data mining is

Select one:

a. Data Warehouse

b. Knowledge discovery in database

c. Business intelligence

d. OLAP

The correct answer is: Knowledge discovery in database

Question

Data scrubbing can be defined as

Select one:

a. Check field overloading

b. Delete redundant tuples

c. Use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections

d. Analyzing data to discover rules and relationship to detect violators

The correct answer is: Use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections

Question

Dimensionality reduction reduces the data set size by removing _________

Select one:

a. irrelevant attributes

b. composite attributes

c. derived attributes

d. relevant attributes

The correct answer is: irrelevant attributes

Question

In an asymmetric attribute

Select one:

a. Range of values is important

b. No value is considered important over other values

c. Only non-zero value is important

d. All values are equals

The correct answer is: Only non-zero value is important

Question

Which of the following is not a data mining task?

Select one:

a. Feature Subset Detection

b. Regression

c. Sequential Pattern Discovery

d. Association Rule Discovery

The correct answer is: Feature Subset Detection

Question

Which of the following is NOT an example of data quality related issue?

Select one:

a. Using a field for different purposes

b. Contradicting values

c. Noise

d. Multiple date formats

The correct answer is: Multiple date formats

Question

Similarity is a numerical measure whose value is

Select one:

a. Higher when objects are more alike

b. Lower when objects are more alike

c. Increases with Minkowski distance

d. Higher when objects are not alike

The correct answer is: Higher when objects are more alike

Question

The dissimilarity between two data objects is

Select one:

a. Lower when objects are more alike

b. Higher when objects are more alike

c. Lower when objects are not alike

d. Applies only categorical attributes

The correct answer is: Lower when objects are more alike

Question

The important characteristics of structured data are

Select one:

a. Resolution, Distribution, Dimensionality ,Objects

b. Sparsity, Centroid, Distribution , Dimensionality

c. Dimensionality, Sparsity, Resolution, Distribution

d. Sparsity, Resolution, Distribution, Tuples

The correct answer is: Dimensionality, Sparsity, Resolution, Distribution

Question

Which of the following statements is not TRUE for a Tag Cloud?

Select one:

a. Tag cloud is a visualization of statistics of user-generated tags

b. Tag cloud can be used for numeric data only

c. The importance of a tag is indicated by font size or color

d. Tags may be listed alphabetically in a tag cloud

The correct answer is: Tag cloud can be used for numeric data only

Question

Which of the following data mining tasks is known as Market Basket Analysis?

Select one:

a. Classification

b. Regression

c. Association Analysis

d. Outlier Analysis

The correct answer is: Association Analysis

Question

Which of the following is not a Data discretization Method?

Select one:

a. Histogram analysis

b. Cluster Analysis

c. Data compression

d. Binning

The correct answer is: Data compression

Question

Which of the following activities is a data mining task?

Select one:

a. Monitoring the heart rate of a patient for abnormalities

b. Dividing the customers of a company according to their profitability

c. Extracting the frequencies of a sound wave

d. Predicting the outcomes of tossing a (fair) pair of dice

The correct answer is: Monitoring the heart rate of a patient for abnormalities

Question

Sorted data (attribute values ) for price are: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34. Identify which is NOT a bin smoothed by boundaries?

Select one:

a. Bin 2: 21, 21, 25, 25

b. Bin 1: 4, 4, 4, 15

c. Bin 1: 4, 4, 15, 15

d. Bin 3: 26, 26, 26, 34

The correct answer is: Bin 1: 4, 4, 15, 15
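
The smoothed bins can be worked out programmatically. This sketch assumes three equal-frequency bins of four values each and replaces every value with the closer of its bin's minimum and maximum; sending ties to the lower boundary is our own assumption (no ties occur in this data), and the function name is hypothetical.

```python
def smooth_by_boundaries(sorted_bin):
    lo, hi = sorted_bin[0], sorted_bin[-1]
    # Replace each value with the nearer bin boundary (ties go to the lower one).
    return [lo if v - lo <= hi - v else hi for v in sorted_bin]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
smoothed = [smooth_by_boundaries(prices[i:i + depth])
            for i in range(0, len(prices), depth)]
for number, contents in enumerate(smoothed, start=1):
    print(f"Bin {number}: {contents}")
```

In Bin 1, both 8 and 9 are closer to 4 than to 15, which is why 4, 4, 15, 15 is not a valid result.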

Question
The difference between supervised learning and unsupervised learning is given by

Select one:
a. unlike unsupervised learning, supervised learning needs labeled data
b. unlike unsupervised learning, supervised learning can be used to detect outliers
c. unlike supervised learning, unsupervised learning can form new classes
d. there is no difference

The correct answer is: unlike unsupervised learning, supervised learning needs labeled data

Question

The Data Sets are made up of

Select one:

a. Data Objects

b. Attributes

c. Dimensions

d. Database

The correct answer is: Data Objects

February 15, 2019

Data Mining - Comprehensive Paper Solution

January 02, 2019

Data Mining - MCQS 2

November 12, 2018

Data Mining - Mid Sem Solutions

October 23, 2018

Data Mining - MCQS
