Showing posts with label Data Mining. Show all posts

February 15, 2019

Data Mining - Comprehensive Paper Solution



Note: This is only a reference paper for you to go through. We are currently facing some issues with the website. If you have any other important questions or answers, let us know.
Share them by email: 1trickyworld1@gmail.com


Question:
For the following vectors x and y, calculate the cosine similarity and the Euclidean distance:
x = (4,4,4,4), y = (2,2,2,2)

Solution:

Cosine
x ● y = 4*2 + 4*2 + 4*2 + 4*2 = 32
||x|| = sqrt(4*4 + 4*4 + 4*4 + 4*4) = sqrt(64) = 8
||y|| = sqrt(2*2 + 2*2 + 2*2 + 2*2) = sqrt(16) = 4
cos(x,y) = (x ● y) / (||x|| * ||y||) = 32 / (8*4)
cos(x,y) = 1

Euclidean
d(x, y) = sqrt((4-2)^2 + (4-2)^2 + (4-2)^2 + (4-2)^2) = sqrt(4 + 4 + 4 + 4) = sqrt(16) = 4
Euclidean distance = 4
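
The hand calculation above can be checked with a short Python sketch. The function names cosine_similarity and euclidean_distance are illustrative choices, not part of the original question, and only the standard library is used.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    """Straight-line distance between vectors x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = (4, 4, 4, 4)
y = (2, 2, 2, 2)
cos_xy = cosine_similarity(x, y)
dist_xy = euclidean_distance(x, y)
print(cos_xy)   # 1.0
print(dist_xy)  # 4.0
```

The cosine is 1 because y is a positive multiple of x (same direction), while the Euclidean distance is still non-zero because the magnitudes differ.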

Question:
Consider the one-dimensional data set shown in the table below.

X     Y
0.6   -
3.2   -
4.5   +
4.6   +
4.9   +
5.2   -
5.6   -
5.8   +
7.1   -
9.5   -

Classify the data point x = 5.0 according to its 3 and 9 nearest neighbors (using majority vote).

Answer:
We first find the difference between each data point X and x = 5.0, as shown in the table below.

x     X     Difference |x - X|   Y
5.0   0.6   4.4                  -
5.0   3.2   1.8                  -
5.0   4.5   0.5                  +
5.0   4.6   0.4                  +
5.0   4.9   0.1                  +
5.0   5.2   0.2                  -
5.0   5.6   0.6                  -
5.0   5.8   0.8                  +
5.0   7.1   2.1                  -
5.0   9.5   4.5                  -

As asked:
Using the 3-nearest-neighbors method, the 3 closest points to x = 5.0 are the ones with the smallest differences -> 4.9, 5.2, 4.6
Classes -> +, -, +
Using majority vote, 3-nearest neighbor: +

Using the 9-nearest-neighbors method, the 9 closest points to x = 5.0 are the ones with the smallest differences -> 4.9, 5.2, 4.6, 4.5, 5.6, 5.8, 3.2, 7.1, 0.6
Classes -> +, -, +, +, -, +, -, -, - (four + and five -)
Using majority vote, 9-nearest neighbor: -
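
The same vote can be reproduced with a minimal Python sketch. The helper name knn_classify is an illustrative assumption. The data comes from the table in the question, and distance is the absolute difference on the single attribute.

```python
from collections import Counter

# Training data from the table in the question: (X value, class label).
data = [(0.6, "-"), (3.2, "-"), (4.5, "+"), (4.6, "+"), (4.9, "+"),
        (5.2, "-"), (5.6, "-"), (5.8, "+"), (7.1, "-"), (9.5, "-")]

def knn_classify(query, k, points=data):
    """Return the majority label among the k points closest to `query`."""
    nearest = sorted(points, key=lambda p: abs(p[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

label_3 = knn_classify(5.0, 3)
label_9 = knn_classify(5.0, 9)
print(label_3)  # +
print(label_9)  # -
```

With k = 9 only the farthest point (9.5) is left out, so the vote is close to the overall class distribution, which is why the prediction flips to the negative class.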

Question:
Suppose a group of 12 sales price records has been sorted as follows:
5; 10; 11; 13; 15; 35; 50; 55; 72; 90; 204; 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency partitioning
(b) equal-width partitioning
(c) clustering

Answer:
(a) equal-frequency (equidepth) partitioning:
Partition the data into equidepth bins of depth 4 (12 values / 3 bins = 4 values per bin):
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 90, 204, 215

(b) equal-width partitioning:
Partitioning the data into 3 equal-width bins requires a width of (215−5)/3 = 70.
This gives the intervals [5, 75), [75, 145), and [145, 215].
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2: 90
Bin 3: 204, 215
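
Parts (a) and (b) can be sketched in Python as follows. The function names are illustrative. The equal-width version assumes intervals that start at the minimum value, are closed on the left, and place the maximum value in the last bin.

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215]

def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins bins holding the same number of values."""
    values = sorted(values)
    depth = len(values) // n_bins  # assumes len(values) is divisible by n_bins
    return [values[i * depth:(i + 1) * depth] for i in range(n_bins)]

def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum lies on the last boundary, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins

freq_bins = equal_frequency_bins(prices, 3)
width_bins = equal_width_bins(prices, 3)
print(freq_bins)   # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 90, 204, 215]]
print(width_bins)  # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [90], [204, 215]]
```

Equal-width binning is sensitive to skew: the two large prices stretch the range, so most values land in the first bin.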

(c) clustering:
Using K-means clustering to partition the data into three bins we get
Bin 1: 5, 10, 11, 13, 15, 35
Bin 2: 50, 55, 72, 90
Bin 3: 204, 215
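
K-means results depend on the starting centroids. As a rough sketch, the Python below runs Lloyd's algorithm on the 12 prices from initial centroids 5, 55, and 215. That starting point is an assumption chosen for illustration and is not given in the question; the function name kmeans_1d is also illustrative.

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215]

def kmeans_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm on one-dimensional data from the given start centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its old centroid).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return clusters, centroids

# Initial centroids are an assumption made for this example.
clusters, centers = kmeans_1d(prices, [5, 55, 215])
print(clusters)  # [[5, 10, 11, 13, 15, 35], [50, 55, 72, 90], [204, 215]]
```

From this start the algorithm converges to the three bins listed above. A different initialization may converge to a different partition.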

January 02, 2019

Data Mining - MCQS 2


Question
This clustering approach initially assumes that each data instance represents a single cluster.

Select one:
a. expectation maximization
b. K-Means clustering
c. agglomerative clustering
d. conceptual clustering

The correct answer is: agglomerative clustering

Question
The correlation coefficient for two real-valued attributes is –0.85. What does this value tell you?

Select one:
a. The attributes are not linearly related.
b. As the value of one attribute decreases the value of the second attribute increases.
c. As the value of one attribute increases the value of the second attribute also increases.
d. The attributes show a linear relationship

The correct answer is: As the value of one attribute decreases the value of the second attribute increases.

Question
Time Complexity of k-means is given by

Select one:
a. O(mn)
b. O(tkn)
c. O(kn)
d. O(t^2kn)

The correct answer is: O(tkn)

Question
Given a rule of the form IF X THEN Y, rule confidence is defined as the conditional probability that

Select one:
a. Y is false when X is known to be false.
b. Y is true when X is known to be true.
c. X is true when Y is known to be true
d. X is false when Y is known to be false.

The correct answer is: Y is true when X is known to be true.
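
As a small illustration of this definition, the sketch below estimates the confidence of a rule IF X THEN Y as the fraction of transactions containing X that also contain Y. The transactions and the function name confidence are hypothetical, made up only for this example.

```python
# Hypothetical transactions, made up for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as a ratio of transaction counts."""
    has_x = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_x if consequent <= t]
    # Assumes at least one transaction contains the antecedent.
    return len(has_both) / len(has_x)

# Rule: IF bread THEN milk. Two of the three bread transactions contain milk.
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)
print(conf_bread_milk)
```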

November 12, 2018

Data Mining - Mid Sem Solutions


Question:
Give an example for each of the following data preprocessing issues:
a. Incomplete
b. Inconsistent

Answer:
Data Preprocessing: A data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Preprocessing is needed to resolve such issues.
"Preprocessing is needed to improve data quality"

A. Incomplete: Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
E.g. Many tuples have no recorded value for several attributes,
Occupation = “ ” (missing data)

B. Inconsistent: Containing discrepancies in codes or names.
E.g.
Age = “42”, Birthday = “03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records

October 23, 2018

Data Mining - MCQS



Question

Which of the following activities is NOT a data mining task?
Select one:
a. Monitoring the heart rate of a patient for abnormalities
b. Monitoring and predicting failures in a hydropower plant
c. Predicting the future stock price of a company using historical records
d. Extracting the frequencies of a sound wave
The correct answer is: Extracting the frequencies of a sound wave

Question
Which of the following is not a data mining task?

Select one:
a. Feature Subset Detection
b. Association Rule Discovery
c. Regression
d. Sequential Pattern Discovery

The correct answer is: Feature Subset Detection
Question
Value set {poor, average, good, excellent} is an example of
Select one:
a. Nominal attribute
b. Numeric attribute
c. Continuous attribute
d. Ordinal attribute
The correct answer is: Ordinal attribute

Question
Which data mining task can be used for predicting wind velocities as a function of temperature, humidity, air pressure, etc.?

Select one:
a. Cluster Analysis
b. Regression
c. Classification
d. Sequential pattern discovery

The correct answer is: Regression

Question
Identify the example of sequence data

Select one:
a. weather forecast
b. data matrix
c. market basket data
d. genomic data

The correct answer is: genomic data

Question
In a data mining task where it is not clear what type of patterns could be interesting, the data mining system should
Select one:
a. handle different granularities of data and patterns
b. perform all possible data mining tasks
c. allow interaction with the user to guide the mining process 
d. perform both descriptive and predictive tasks
The correct answer is: allow interaction with the user to guide the mining process
Question 
Removing duplicate records is a data mining process called________
Select one:
a. data isolation
b. recovery
c. data pruning
d. data cleaning 
The correct answer is: data cleaning
Question 
Various visualization techniques are used in ___________ step of KDD
Select one:
a. selection
b. interpretation 
c. transformation
d. data mining
The correct answer is: interpretation
Question 
Which of the following is not a Visualization Method?
Select one:
a. Hierarchical visualization technique
b. Tuple based visualization Technique
c. Icon based visualization techniques
d. Pixel oriented visualization technique 
The correct answer is: Tuple based visualization Technique

Question
Data set {brown, black, blue, green, red} is an example of

Select one:
a. Continuous attribute
b. Ordinal attribute
c. Numeric attribute
d. Nominal attribute

The correct answer is: Nominal attribute

Question
Which of the following is NOT a data quality related issue?

Select one:
a. Attribute value range
b. Outlier records
c. Missing values
d. Duplicate records

The correct answer is: Attribute value range

Question
To detect fraudulent usage of credit cards, the following data mining task should be used

Select one:
a. Outlier analysis
b. prediction
c. association analysis
d. feature selection

The correct answer is: Outlier analysis
Question 
Which of the following is NOT an example of ordinal attributes?
Select one:
a. Ordered numbers
b. Military ranks
c. Zip codes
d. Movie ratings 
The correct answer is: Zip codes
Question 
Which of the following is not a data pre-processing method?
Select one:
a. Data Cleaning
b. Data Visualization 
c. Data Discretization
d. Data Reduction
The correct answer is: Data Visualization
Question 
Incorrect or invalid data is known as _________
Select one:
a. Outlier
b. Missing data
c. Changing data
d. Noisy data 
The correct answer is: Noisy data

Question
Which of the following is an Entity identification problem?

Select one:
a. One person with different email address
b. One person's name written in different way
c. Title for person
d. One person with multiple phone numbers

The correct answer is: One person's name written in different way
Question 
Data Visualization in mining cannot be done using
Select one:
a. Graphs
b. Information Graphics
c. Charts
d. Photos 
The correct answer is: Photos
Question
Nominal and ordinal attributes can be collectively referred to as_________ attributes
Select one:
a. perfect
b. consistent
c. qualitative
d. optimized
The correct answer is: qualitative
Question 
The number of item sets of cardinality 4 from the items lists {A, B, C, D, E}
Select one:
a. 20
b. 2 
c. 10
d. 5
The correct answer is: 5
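
The count is C(5, 4) = 5, since choosing 4 items out of 5 is the same as choosing the 1 item to leave out. A quick Python check using the standard library (the variable names are illustrative):

```python
from itertools import combinations

items = ["A", "B", "C", "D", "E"]

# Enumerate every item set of cardinality 4.
itemsets_of_4 = list(combinations(items, 4))
for itemset in itemsets_of_4:
    print(itemset)

print(len(itemsets_of_4))  # 5
```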
Question 
Identify the example of Nominal attribute
Select one:
a. Salary
b. Temperature 
c. Gender
d. Mass
The correct answer is: Gender
Question 
Which of the following are descriptive data mining activities?
Select one:
a. Clustering
b. Deviation detection 
c. Regression
d. Classification
The correct answer is: Clustering

Question 
Which statement is not TRUE regarding a data mining task?
Select one:
a. Deviation detection is a predictive data mining task
b. Classification is a predictive data mining task 
c. Clustering is a descriptive data mining task
d. Regression is a descriptive data mining task
The correct answer is: Regression is a descriptive data mining task
Question 
Correlation analysis is used for
Select one:
a. identifying redundant attributes
b. eliminating noise
c. handling missing values
d. handling different data formats 
The correct answer is: identifying redundant attributes
Question 
In Binning, we first sort data and partition into (equal-frequency) bins and then which of the following is not a valid step
Select one:
a. smooth by bin boundaries
b. smooth by bin median 
c. smooth by bin values
d. smooth by bin means
The correct answer is: smooth by bin values

Question 
Which of the following is NOT a data mining efficiency/scalability issue?
Select one:
a. The running time of a data mining algorithm
b. Incremental execution
c. Data partitioning 
d. Easy to use user interface
The correct answer is: Easy to use user interface
Question 
Synonym for data mining is
Select one:
a. Data Warehouse
b. Knowledge discovery in database
c. Business intelligence
d. OLAP 
The correct answer is: Knowledge discovery in database
Question 
Data scrubbing can be defined as
Select one:
a. Check field overloading
b. Delete redundant tuples
c. Use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
d. Analyzing data to discover rules and relationship to detect violators 
The correct answer is: Use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
Question 
Dimensionality reduction reduces the data set size by removing _________
Select one:
a. irrelevant attributes 
b. composite attributes
c. derived attributes
d. relevant attributes
The correct answer is: irrelevant attributes
Question 
In an asymmetric attribute
Select one:
a. Range of values is important
b. No value is considered important over other values
c. Only non-zero value is important
d. All values are equals 
The correct answer is: Only non-zero value is important
Question 
Which of the following is not a data mining task?
Select one:
a. Feature Subset Detection 
b. Regression
c. Sequential Pattern Discovery
d. Association Rule Discovery
The correct answer is: Feature Subset Detection
Question 
Which of the following is NOT an example of a data quality related issue?
Select one:
a. Using a field for different purposes 
b. Contradicting values
c. Noise
d. Multiple date formats 
The correct answer is: Multiple date formats
Question 
Similarity is a numerical measure whose value is
Select one:
a. Higher when objects are more alike
b. Lower when objects are more alike
c. Increases with Minkowski distance
d. Higher when objects are not alike 
The correct answer is: Higher when objects are more alike
Question 
The dissimilarity between two data objects is
Select one:
a. Lower when objects are more alike
b. Higher when objects are more alike
c. Lower when objects are not alike
d. Applies only to categorical attributes 
The correct answer is: Lower when objects are more alike
Question 
The important characteristics of structured data are

Select one:
a. Resolution, Distribution, Dimensionality ,Objects
b. Sparsity, Centroid, Distribution , Dimensionality
c. Dimensionality, Sparsity, Resolution, Distribution 
d. Sparsity, Resolution, Distribution, Tuples

The correct answer is: Dimensionality, Sparsity, Resolution, Distribution

Question 
Which of the following statements is not TRUE for a Tag Cloud?

Select one:
a. Tag cloud is a visualization of statistics of user-generated tags 
b. Tag cloud can be used for numeric data only
c. The importance of a tag is indicated by font size or color
d. Tags may be listed alphabetically in a tag cloud

The correct answer is: Tag cloud can be used for numeric data only

Question 
Which of the following data mining tasks is known as Market Basket Analysis?

Select one:
a. Classification
b. Regression
c. Association Analysis 
d. Outlier Analysis

The correct answer is: Association Analysis

Question 
Which of the following is not a Data discretization Method?

Select one:
a. Histogram analysis
b. Cluster Analysis
c. Data compression
d. Binning

The correct answer is: Data compression

Question 
Which of the following activities is a data mining task?

Select one:
a. Monitoring the heart rate of a patient for abnormalities
b. Dividing the customers of a company according to their profitability 
c. Extracting the frequencies of a sound wave
d. Predicting the outcomes of tossing a (fair) pair of dice

The correct answer is: Monitoring the heart rate of a patient for abnormalities

Question 
Sorted data (attribute values ) for price are: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34. Identify which is NOT a bin smoothed by boundaries?

Select one:
a. Bin 2: 21, 21, 25, 25
b. Bin 1: 4, 4, 4, 15 
c. Bin 1: 4, 4, 15, 15
d. Bin 3: 26, 26, 26, 34

The correct answer is: Bin 1: 4, 4, 15, 15
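
To see why, the sketch below applies smoothing by bin boundaries, where each value is replaced by the closer of its bin's minimum and maximum. It assumes equal-frequency bins of depth 4, as the answer options imply. The function name and the rule that ties go to the lower boundary are illustrative assumptions.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def smooth_by_boundaries(bin_values):
    """Replace each value with the nearer of the bin's min and max."""
    lo, hi = min(bin_values), max(bin_values)
    # Ties go to the lower boundary; that tie rule is an assumption.
    return [lo if v - lo <= hi - v else hi for v in bin_values]

# Equal-frequency bins of depth 4 (assumed from the answer options).
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]
smoothed = [smooth_by_boundaries(b) for b in bins]
print(smoothed)
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

In Bin 1 the value 9 is 5 away from 4 and 6 away from 15, so it is smoothed to 4, which rules out the option 4, 4, 15, 15.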

Question
The difference between supervised learning and unsupervised learning is given by

Select one:
a. unlike unsupervised learning, supervised learning needs labeled data
b. unlike unsupervised learning, supervised learning can be used to detect outliers
c. unlike supervised learning, unsupervised learning can form new classes
d. there is no difference

The correct answer is: unlike unsupervised learning, supervised learning needs labeled data

Question 
The Data Sets are made up of

Select one:
a. Data Objects
b. Attributes 
c. Dimensions
d. Database

The correct answer is: Data Objects