Data Mining - Comprehensive Paper Solution

Note: This is just a reference paper for you to go through; we are currently facing some issues with the website. If you have any other important questions or answers, let us know.
Share them by email: 1trickyworld1@gmail.com

Question:
For the following vectors x and y, calculate the cosine similarity and Euclidean distance measures:
x = (4, 4, 4, 4), y = (2, 2, 2, 2)

Solution:

Cosine
x · y = 4*2 + 4*2 + 4*2 + 4*2 = 32
||x|| = sqrt(4*4 + 4*4 + 4*4 + 4*4) = sqrt(64) = 8
||y|| = sqrt(2*2 + 2*2 + 2*2 + 2*2) = sqrt(16) = 4
cos(x, y) = (x · y) / (||x|| * ||y||) = 32 / (8 * 4)
cos(x, y) = 1

Euclidean
d(x, y) = sqrt((4-2)^2 + (4-2)^2 + (4-2)^2 + (4-2)^2) = sqrt(4 + 4 + 4 + 4) = sqrt(16)
Euclidean distance = 4
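
The two calculations above can be checked with a short Python sketch. The function names cosine_similarity and euclidean_distance are illustrative choices, not part of the original question.

```python
import math

def cosine_similarity(x, y):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = (4, 4, 4, 4)
y = (2, 2, 2, 2)
print(cosine_similarity(x, y))   # 1.0
print(euclidean_distance(x, y))  # 4.0
```

A cosine of 1 means the vectors point in the same direction (x = 2y), even though they are 4 units apart.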

Question:
Consider the one-dimensional data set shown in the table below.

X	0.6	3.2	4.5	4.6	4.9	5.2	5.6	5.8	7.1	9.5
Y	-	-	+	+	+	-	-	+	-	-

Classify the data point x = 5.0 according to its 3 and 9 nearest neighbors (using majority vote).

Answer:

First, find the absolute difference between each data point X and x = 5.0, as shown in the table below.

x	X	Difference |x − X|	Y
5.0	0.6	4.4	−
5.0	3.2	1.8	−
5.0	4.5	0.5	+
5.0	4.6	0.4	+
5.0	4.9	0.1	+
5.0	5.2	0.2	−
5.0	5.6	0.6	−
5.0	5.8	0.8	+
5.0	7.1	2.1	−
5.0	9.5	4.5	−

Using the 3-nearest-neighbors method, the 3 closest points to x = 5.0 are those with the smallest differences -> 4.9, 5.2, 4.6

Classes -> + − +

Using majority vote (two + against one −), 3-nearest neighbor: +

Using the 9-nearest-neighbors method, the 9 closest points to x = 5.0 are all points except 9.5, which has the largest difference -> 4.9, 5.2, 4.6, 4.5, 5.6, 5.8, 3.2, 7.1, 0.6

Classes -> + − + + − + − − −

Using majority vote (four + against five −), 9-nearest neighbor: −
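
The same vote can be reproduced with a minimal Python sketch. The helper name knn_classify is an illustrative assumption; the data are the ten points from the question.

```python
from collections import Counter

X = [0.6, 3.2, 4.5, 4.6, 4.9, 5.2, 5.6, 5.8, 7.1, 9.5]
Y = ["-", "-", "+", "+", "+", "-", "-", "+", "-", "-"]

def knn_classify(query, k):
    # Rank the training points by absolute distance to the query point.
    ranked = sorted(zip(X, Y), key=lambda point: abs(point[0] - query))
    # Count the class labels of the k closest points and take the majority.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify(5.0, 3))  # +
print(knn_classify(5.0, 9))  # -
```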

Question:
Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency partitioning
(b) equal-width partitioning
(c) clustering

Answer:
(a) equal-frequency (equidepth) partitioning:
Partition the data into equidepth bins of depth 4 (12 values / 3 bins = 4 values per bin):
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 90, 204, 215

(b) equal-width partitioning:
Partitioning the data into 3 equi-width bins requires the width to be (215 − 5)/3 = 70.
Starting from the minimum value, we get the intervals [5, 75), [75, 145), [145, 215].
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2: 90
Bin 3: 204, 215

(c) clustering:
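Using K-means clustering with k = 3 to partition the data into three bins, we get the following stable assignment (bin means of about 14.8, 66.75, and 209.5, with every value closest to the mean of its own bin):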
Bin 1: 5, 10, 11, 13, 15, 35
Bin 2: 50, 55, 72, 90
Bin 3: 204, 215
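
The three partitioning methods can be sketched in Python as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, the equal-width bins are left-closed intervals starting at the minimum value (with the maximum placed in the last bin), and the K-means starting centroids 5, 50, 215 are an arbitrary choice that is not given in the question.

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215]

def equal_frequency_bins(values, n_bins):
    values = sorted(values)
    depth = len(values) // n_bins  # assumes the count divides evenly
    return [values[i * depth:(i + 1) * depth] for i in range(n_bins)]

def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value goes in the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins

def kmeans_1d(values, centroids, max_iter=100):
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute the means (assumes no cluster becomes empty).
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters

print(equal_frequency_bins(prices, 3))
# [[5, 10, 11, 13], [15, 35, 50, 55], [72, 90, 204, 215]]
print(equal_width_bins(prices, 3))
# [[5, 10, 11, 13, 15, 35, 50, 55, 72], [90], [204, 215]]
print(kmeans_1d(prices, [5, 50, 215]))
# [[5, 10, 11, 13, 15, 35], [50, 55, 72, 90], [204, 215]]
```

K-means results depend on the starting centroids in general, so a different initialization may need checking separately.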

Question:

a. How do you evaluate a classifier when there is a class imbalance?

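Answer: When the classes are roughly balanced, accuracy and error rate are adequate. With class imbalance, accuracy is misleading, because a classifier that always predicts the majority class still scores highly. In that case we need sensitivity (the true positive rate) and specificity (the true negative rate).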
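
A small Python sketch shows why accuracy alone is not enough. The counts are hypothetical, chosen only for illustration: 1000 samples with 10 positives, and a classifier that labels everything negative. The function name evaluate is also an assumption.

```python
def evaluate(tp, fn, tn, fp):
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical confusion matrix: every sample is predicted negative.
acc, sens, spec = evaluate(tp=0, fn=10, tn=990, fp=0)
print(acc, sens, spec)  # 0.99 0.0 1.0
```

The accuracy is 99%, yet the sensitivity of 0 shows that not a single positive case was found.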

b. Assume that a search for ‘computer programming’ returned 100 web pages. Give 5-6 factors that the search engine would have used to determine the order in which the result pages are listed.

Answer:

Frequency of the search terms in the page; where the terms occur (title, tags, paragraphs, etc.); number of users visiting the page; number of links to the page; geographical location; the user’s search and click history; the domain’s importance; whether the search terms appear in the domain name; etc.

c. How do clustering tendency and cluster validity help in data mining?

Answer:

Clustering makes sense only if the data is non-random, that is, if it has some inherent grouping structure. Clustering tendency measures such as the Hopkins statistic help to check this before clustering is applied. Cluster validity helps in evaluating the resulting clusters using unsupervised, supervised, or relative measures.

Question:

1. A database has five transactions. Let min sup = 60% and min conf = 80%. (5+2 marks)

TID	items bought
T100	Bread, Butter, Beans, Potato, Jam, Milk
T200	Bread, Butter, Shampoo, Potato, Jam, Milk
T300	Beans, Soap, Butter, Bread
T400	Beans, Onion, Apple, Butter, Milk
T500	Apple, Banana, Jam, Bread, Butter

(a) Find all frequent itemsets using the FP-growth algorithm.

(b) List all of the strong association rules (with support s and confidence c) matching the following metarule: buys(X, item1) ^ buys(X, item2) => buys(X, item3) [s, c]

Solution:
Note: To solve this question, first work through the reference question below for practice; then you can solve this one by yourself. The reference question uses the Apriori method rather than FP-growth, but both methods produce the same set of frequent itemsets.

Reference question:

A database has five transactions. Let min sup = 60% and min conf = 75%.

TID	items bought
T100	M, O, N, N, K, E, Y, Y
T200	D, D, O, N, K, E, Y
T300	M, M, A, K, E, E
T400	M, U, C, C, Y, C, E, O
T500	C, O, O, K, I, I, E

(a) Find all frequent itemsets using Apriori method.

Solution:
The database is scanned once to generate the frequent 1-itemsets. To do this, I use absolute support, where duplicate items are counted only once per TID. The total number of TIDs is 5, so the minimum support of 60% is equivalent to a support count of 3 (3/5). Thus, itemsets with a support count of 1 or 2 are eliminated.

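Table 1a. 1-itemset results, raw

Item	Support count
A	1
C	2
D	1
E	5
I	1
K	4
M	3
N	2
O	4
U	1
Y	3

Table 1b. 1-itemset results, consolidated

Item	Support count
E	5
K	4
M	3
O	4
Y	3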

Now the database is scanned a second time to generate the frequent 2-itemsets. The 5 frequent items give 5!/(2!3!) = 10 possible combinations. Using absolute support, each combination is counted once per TID, and combinations with a support count below 3 are eliminated.

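Table 2a. 2-itemset results, raw

Itemset	Support count
{E, K}	4
{E, M}	3
{E, O}	4
{E, Y}	3
{K, M}	2
{K, O}	3
{K, Y}	2
{M, O}	2
{M, Y}	2
{O, Y}	3

Table 2b. 2-itemset results, consolidated

Itemset	Support count
{E, K}	4
{E, M}	3
{E, O}	4
{E, Y}	3
{K, O}	3
{O, Y}	3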

I proceed to scan the database again to generate the frequent 3-itemsets. The frequent sets {E, K}, {K, O}, and {E, O} make {E, K, O} a candidate. Likewise, {E, O}, {E, Y}, and {O, Y} make {E, O, Y} a candidate.

Table 3a. 3-itemset results

Itemset	Support count
{E, K, O}	3
{E, O, Y}	3

Frequent 4-itemsets cannot be generated: the only candidate, {E, K, O, Y}, would require {K, O, Y} and {E, K, Y} to be frequent, and they are not. So, all frequent itemsets have been found.
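
The level-wise search described above can be sketched in Python. This is a simplified Apriori-style illustration, not an FP-growth implementation; the names support_count and frequent_itemsets are hypothetical, and the transactions are written as strings of single-letter items.

```python
from itertools import combinations

transactions = ["MONNKEYY", "DDONKEY", "MMAKEE", "MUCCYCEO", "COOKIIE"]
# Converting to sets counts duplicate items only once per TID.
baskets = [set(t) for t in transactions]
min_count = 3  # 60% of 5 transactions

def support_count(itemset):
    return sum(1 for basket in baskets if set(itemset) <= basket)

def frequent_itemsets():
    frequent = {}
    # Level 1 candidates: every distinct item, as a 1-tuple.
    current = [(item,) for item in sorted(set().union(*baskets))]
    while current:
        counts = {c: support_count(c) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join and prune: keep a (k+1)-candidate only if all of its
        # k-item subsets were frequent at this level.
        k = len(current[0]) + 1
        survivors = sorted({item for c in level for item in c})
        current = [c for c in combinations(survivors, k)
                   if all(s in level for s in combinations(c, k - 1))]
    return frequent

result = frequent_itemsets()
for itemset, count in result.items():
    print(itemset, count)
```

The run ends with 13 frequent itemsets: five 1-itemsets, six 2-itemsets, and the two 3-itemsets ('E', 'K', 'O') and ('E', 'O', 'Y'), each with a support count of 3.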

(b) List all of the strong association rules (with support s = 60% and confidence c = 75%) matching the following metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g., “A”, “B”, etc.): buys(X, item1) ^ buys(X, item2) => buys(X, item3) [s, c]

Solution:

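The largest frequent itemsets are {E, K, O} and {E, O, Y}. Each 3-itemset yields 3!/(2!1!) = 3 rules with 2 items on the left and 1 item on the right, so there are 2 × 3 = 6 possible association rules matching the metarule. Each rule has support 3/5 = 60%, because both 3-itemsets occur in 3 of the 5 transactions.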

Association rules from {E, K, O}:

R1. E ∧ K -> O: confidence = #{E, K, O} / #{E, K} = 3 / 4 = 75%. Therefore, R1 is a strong association rule.

R2. E ∧ O -> K: confidence = #{E, K, O} / #{E, O} = 3 / 4 = 75%. Therefore, R2 is a strong association rule.

R3. K ∧ O -> E: confidence = #{E, K, O} / #{K, O} = 3 / 3 = 100%. Therefore, R3 is a strong association rule.

Association rules from {E, O, Y}:

R4. E ∧ O -> Y: confidence = #{E, O, Y} / #{E, O} = 3 / 4 = 75%. Therefore, R4 is a strong association rule.

R5. E ∧ Y -> O: confidence = #{E, O, Y} / #{E, Y} = 3 / 3 = 100%. Therefore, R5 is a strong association rule.

R6. O ∧ Y -> E: confidence = #{E, O, Y} / #{O, Y} = 3 / 3 = 100%. Therefore, R6 is a strong association rule.

In this case, all 6 association rules are strong, meaning that customers who purchase any two of the products E, K, O are likely to purchase the remaining one, and customers who purchase any two of the products E, O, Y are likely to purchase the remaining one.
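
The support and confidence of the six rules can be recomputed with a short Python sketch. The helper names count and rules_from are illustrative assumptions; the transactions are those of the reference question.

```python
from itertools import combinations

baskets = [set(t) for t in ["MONNKEYY", "DDONKEY", "MMAKEE", "MUCCYCEO", "COOKIIE"]]

def count(items):
    # Number of transactions that contain every item in `items`.
    return sum(1 for basket in baskets if set(items) <= basket)

def rules_from(itemset, min_conf=0.75):
    rules = []
    for lhs in combinations(sorted(itemset), 2):
        rhs = (set(itemset) - set(lhs)).pop()
        support = count(itemset) / len(baskets)
        confidence = count(itemset) / count(lhs)
        if confidence >= min_conf:
            rules.append((lhs, rhs, support, confidence))
    return rules

for itemset in ("EKO", "EOY"):
    for lhs, rhs, s, c in rules_from(itemset):
        print(f"{lhs[0]} ^ {lhs[1]} -> {rhs}  [s = {s:.0%}, c = {c:.0%}]")
```

All six rules pass the 75% confidence threshold, three at exactly 75% and three at 100%.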

Question: Give appropriate solutions for the following. (3+3 = 6 marks)

a. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are 70, 20, 16, 16, 52, 15, 20, 21, 22, 25, 22, 30, 25, 46, 25, 33, 36, 35, 40, 35, 35, 33, 35, 45, 25, 19, 13. Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps.

Answer:

Step 1: Sort the data. 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70

Step 2: Partition the data into equal-frequency bins of depth 3.
Bin 1: 13, 15, 16
Bin 2: 16, 19, 20
Bin 3: 20, 21, 22
Bin 4: 22, 25, 25
Bin 5: 25, 25, 30
Bin 6: 33, 33, 35
Bin 7: 35, 35, 35
Bin 8: 36, 40, 45
Bin 9: 46, 52, 70

Step 3: Calculate the arithmetic mean of each bin (rounded to two decimal places), for example Bin 1: (13 + 15 + 16)/3 = 14.67.

Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14.67, 14.67, 14.67
Bin 2: 18.33, 18.33, 18.33
Bin 3: 21, 21, 21
Bin 4: 24, 24, 24
Bin 5: 26.67, 26.67, 26.67
Bin 6: 33.67, 33.67, 33.67
Bin 7: 35, 35, 35
Bin 8: 40.33, 40.33, 40.33
Bin 9: 56, 56, 56
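
The four steps can be sketched in Python as follows. The function name smooth_by_bin_means is an illustrative assumption, and the means are rounded to two decimal places here.

```python
ages = [70, 20, 16, 16, 52, 15, 20, 21, 22, 25, 22, 30, 25, 46, 25,
        33, 36, 35, 40, 35, 35, 33, 35, 45, 25, 19, 13]

def smooth_by_bin_means(values, depth):
    values = sorted(values)                      # Step 1: sort
    smoothed = []
    for start in range(0, len(values), depth):   # Step 2: equal-frequency bins
        bin_values = values[start:start + depth]
        mean = sum(bin_values) / len(bin_values)  # Step 3: bin mean
        # Step 4: replace every value in the bin by the bin mean.
        smoothed.append([round(mean, 2)] * len(bin_values))
    return smoothed

bins = smooth_by_bin_means(ages, 3)
for number, values in enumerate(bins, start=1):
    print(f"Bin {number}: {values}")
```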

b. Outliers are often discarded as noise. However, one person’s garbage could be another’s treasure. For example, exceptions in credit card transactions can help us detect the fraudulent use of credit cards. Taking fraudulence detection as an example, propose two methods that can be used to detect outliers and discuss which one is more reliable.

Answer:

-> Using clustering techniques: After clustering, the different clusters represent the different kinds of data (transactions). The outliers are those data points that do not fall into any cluster. Among the various kinds of clustering methods, density-based clustering may be the most effective.

-> Using prediction (or regression) techniques: Construct a probability (regression) model based on all of the data. If the predicted value for a data point differs greatly from the given value, then the given value may be considered an outlier.

February 15, 2019
