CSE521
Data Mining
Spring 2023
Course Information
NEWS
TEST 2 - Thursday, April 13, in class
Test covers Lectures 12-16
FINAL PRESENTATIONS - APRIL 24 - MAY 4
Consult Blackboard for DETAILS
FINAL PRESENTATIONS Description mailed to students and posted
TEST 1 and PROJECT Grades Consultation Week - March 27 - 31
DETAILS posted on Blackboard and mailed to students
TEST 1 Solutions POSTED
Time: Tuesday, Thursday 11:30 am - 12:50 pm
Place: Melville Library, Room E4320
Professor: Anita Wasilewska
208 New CS Building
Phone: 632-8458
e-mail: anita at cs.stonybrook.edu
Professor Office Hours:
Tuesday, Thursday 5:00 pm - 6:00 pm and by appointment
In person: 208 New CS Building, and by email
I read emails DAILY and respond within a day or two
Teaching Assistants
ALL GRADES are listed on BLACKBOARD
Contact TAs if you need more information or need to talk about grading
We have very good TAs - please e-mail them or go to see them anytime you need help
TA: posted on Blackboard
e-mail:
Office Hours:
Office Location: 2126 Old CS Building
Course Book
DATA MINING Concepts and Techniques
Jiawei Han, Micheline Kamber
Morgan Kaufmann Publishers, 2003, 2011
Second or Third Edition
General Course Description:
Data Mining (DM), also called Knowledge Discovery in Databases (KDD), is a multidisciplinary field.
It brings together research and ideas from database technology, machine learning, neural networks, statistics, pattern recognition, knowledge based systems, information retrieval, high-performance computing, and data visualization.
Its main focus is the automated extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories.
The course will closely follow the book and is designed to give a broad yet in-depth overview of the Data Mining field and to examine the most recognized techniques in more rigorous detail.
Grading General Principles and Workload
TESTING
ALL TESTS are personal, IN CLASS tests
PROJECT and FINAL PRESENTATIONS are to be conducted in TEAMS of 4-5 students
All members of the Team receive the same grade
TEAMS FORMATION
Please e-mail the TA (tba) the names, IDs, and e-mails of your Team members, denoting the designated Team Leader
The TA will assign a Team Number to each team and email it to each Team Leader to be used for future correspondence.
CONTACT the TA if you do not HAVE a team partner. The TA will help you to FORM A TEAM
Course Structure and Content
The course is divided into six parts. Course Lecture slides are written by me, except when other sources are indicated
We list here Chapter numbers from the 2nd edition, followed by the respective Chapter numbers from the 3rd edition in parentheses
In particular we will cover all or part of the following subjects
PART 1: Introduction; Data Preprocessing, Data Warehouse
Book chapters 1 - 3 (1 - 4) and Lectures 1 - 3
PART 2: Classification: Decision Tree Induction and Neural Networks
Book chapter 6 (8 - 9) and Lectures 4 - 11
TEST 1
CLASSIFICATION PROJECT
PART 3: Association Analysis: Apriori Algorithm, Classification by Association
Book chapters 5, 6 (6, 7) and Lectures 12 - 14
PART 4: Other Classification Models: Genetic Algorithms, Bayesian Classification
Book chapter 6 (9) and Lectures 15, 16
TEST 2
PART 5: Cluster Analysis
Book chapter 7 (10, 11) and Lectures 17, 18
PART 6: Other DM Areas, Foundations of Data Mining
Book chapters 9 - 10 (13) and Lectures 19 - 23
FINAL PRESENTATIONS
FINAL PRESENTATIONS
Attention
Project and Final Presentations are to be conducted in Teams
Teams consist of 4-5 students and must be the SAME for all assignments.
All members of the Team receive the same grade
Tests and Assignments - PRELIMINARY Schedule
TEST 1 - THURSDAY, MARCH 9
Spring Break March 13 - 19
Project - due Tuesday, March 23
TEST 2 - THURSDAY, APRIL 13
Final Presentation - APRIL 24 - MAY 4
We will use my own Lecture Notes and I will also post the original Book Slides as a reference
We will follow the BOOK very closely and in particular we will cover a part or all of the following chapters and subjects.
Chapter numbers below are from the 2nd edition. The respective Chapter numbers in the 3rd edition are given in parentheses, as in the Course Structure section
The order does not need to be sequential
Chapter 1 (1) Introduction. General overview: what is Data Mining, which data, what kinds of patterns can be mined
Chapter 2 (2,3) Data Preprocessing: data cleaning, data integration and transformation, data reduction, discretization and concept hierarchy generation
Chapter 3 (4,5) Data Warehouse and OLAP technology for Data Mining
Chapter 5 (6,7) Mining Association Rules in transactional databases and Apriori Algorithm
Chapter 6 (8,9) Classification and prediction
1. Decision Tree Induction ID3, C4.5
2. Neural Networks
3. Bayesian Classification
4. Classification based on Concepts from Association rule mining
5. Genetic algorithms
Chapter 7 (10,11,12) Cluster Analysis. A Categorization of major Clustering methods
Chapter 10 Text Mining
Chapter 11 (13) Other DM Areas and Foundations of DM
Grading Components
During the semester you have to complete the following
1. TEST 1 - 70 pts
2. TEST 2 - 70 pts
3. Project - 30 pts
4. Final Presentation - 30 pts
NONE of the grades will be CURVED
During the semester you can earn 200 pts
The grade will be determined in the following way: # of earned points / 2 = % grade
The % grade is translated into a letter grade in the standard way, i.e.
100 - 90% is A range, 89 - 80% is B range, 79 - 70% is C range, 69 - 60% is D range, and F is below 60%.
See course SYLLABUS for details
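As an illustration only, the grading rule above can be sketched in a few lines of Python. The point maxima come from the Grading Components list; the function names, the dictionary keys, and the treatment of fractional percentages (cutoffs applied at exactly 90, 80, 70, and 60 with no rounding) are assumptions made for this sketch, not course policy. The course SYLLABUS remains the authoritative source.

```python
# Maximum points per component, taken from the Grading Components list.
# The dictionary keys are hypothetical names chosen for this sketch.
MAX_POINTS = {"test1": 70, "test2": 70, "project": 30, "presentation": 30}


def percent_grade(earned):
    """Return the % grade: total earned points divided by 2."""
    for name, points in earned.items():
        if name not in MAX_POINTS:
            raise KeyError(f"unknown component: {name}")
        if not 0 <= points <= MAX_POINTS[name]:
            raise ValueError(f"{name}: {points} is outside 0..{MAX_POINTS[name]}")
    return sum(earned.values()) / 2


def letter_range(percent):
    """Map a % grade to its letter range (A, B, C, D, or F).

    Assumption: cutoffs are applied at exactly 90, 80, 70, and 60
    without rounding; plus/minus grades within a range are not modeled.
    """
    if percent >= 90:
        return "A"
    if percent >= 80:
        return "B"
    if percent >= 70:
        return "C"
    if percent >= 60:
        return "D"
    return "F"


# Example with made-up scores: 63 + 56 + 27 + 28 = 174 points.
scores = {"test1": 63, "test2": 56, "project": 27, "presentation": 28}
pct = percent_grade(scores)
print(pct, letter_range(pct))
```

Because the four components add up to 200 points, dividing the total by 2 gives a percentage directly, so no separate weighting step is needed.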
Records of students' grades are kept on Blackboard
Contact TAs for information and questions about grading
PROJECT and FINAL PRESENTATIONS
PROJECT Description
Project Data: play around with the project data and familiarize yourself with it
bakarydata.xls
Final Presentations description HERE
Downloads
TEST 1 Solutions
SYLLABUS
Syllabus Slides
PROJECT Description
FINAL PRESENTATIONS Description
TEST1 Review
TEST2 Review
Lectures
L1. Chapter 1 (1): Introduction
L2. Chapter 2 (2,3): Preprocessing
L2a. Chapter 2 (2,3): Short Preprocessing
L2b. BOOK 3rd Edition Chapter 1 Overview
L3. Chapter 3 (4,5): Data Warehouse
L4. Chapter 6 (8,9): Classification Introduction
L5. Chapter 6: Classification Testing
L6. Example: Data Preparation and Metaclassifiers
Paper: A model Proteins SSP Metaclassifiers
L7. Chapter 6: Decision Trees Introduction
L8. Chapter 6: Decision Trees Full Algorithm
L9. Chapter 6: Neural Networks
L10. Modular Neural Network
L11. Image Classification and Convolutional NN
L12. Chapter 5 (6,7): Association Analysis
L13. Association Analysis Review
L14. Classification by Association
L15. Chapter 6: Genetic Algorithms
L16. Genetic Algorithms Examples
L17. Chapter 7 (10): Basics of Cluster Analysis
L18. Chapter 7 (11,12): Cluster Analysis
L19. Deep Learning
L20. Text Data Mining
L21. NLP - Natural Language Processing
L22. Frequent Patterns Mining Basic - Book 3rd Edition chapter 6
L23. Frequent Patterns Mining Advanced - Book 3rd Edition chapter 7
Lectures-Presentations
Here are some Lectures-Presentations for the FINAL PRESENTATIONS - YOU CAN also USE only YOUR Own Sources
Bayes 1
Bayes 2
Genetic Algorithms Applications
Image Classification
NLP Models
Natural Language Processing
Opinion Mining
Clustering 1
Clustering 2
Regression 1
Regression 2
Regression 3
Text Mining 1
Text Mining 2
Web Mining 1
Web Mining 2
Data Mining Book Slides
Here are some book slides - more to be posted
Book Chapter 2
Book Chapter 5
Book Chapter 6
Book Chapter 7
SOME DATASETS
Datasets for data mining and knowledge discovery
Datasets for data mining competitions
University of California Irvine KDD Archive
World Bank datasets
Academic Integrity Statement
Each student must pursue his or her academic goals
honestly and be personally accountable for all
submitted work. Representing another person's work as
your own is always wrong. Any suspected instance of
academic dishonesty will be reported to the Academic
Judiciary. For more comprehensive information on
academic integrity, including categories of academic
dishonesty, please refer to the Academic Judiciary Website
Stony Brook University Syllabus Statements -
included in the course SYLLABUS