Home Page for COMPSCI 590D: Algorithms for Data Science
Office: CS 232. Office phone: (413) 577-2510. E-mail: barna@cs.umass.edu.
Sainyam’s Office Hour: Tue 2-3 pm, Lederle Graduate Research Tower -–EDLAB – LGRT room 223
Course Overview:
-
Clustering
-
Estimating Statistical Properties of Data
-
Near Neighbor Search
-
Algorithms over Massive Graphs and Social Networks
-
Learning Algorithms
-
Randomized Algorithms
Text Book: We will use reference materials from the following books. Both can be downloaded for free.
-
Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman.
-
Foundations of Data Science, a book in preparation, by John Hopcroft and Ravi Kannan
Prerequisities. CMPSCI 311 and CMPSCI 240 or equivalent courses are required with grade of B or better in both the courses. Students require proper background in algorithm design and basic probability, and will not be admitted in the course without satisfying the prerequisities.
Requirements and Grading.
There will be 3-4 homeworks/programming assignments which will be done in a group of two or three. It will count towards 20% of the grade. Class participation will consist of in-class problem solving and coding exercises and will count towards 30% of the grade. There will be one midterm and one final with respective weights of 20% and 30%.
Homeworks:
Late Homework Policy: No late submission is allowed unless there are compelling reasons and pre-approved by the instructor.
Previous Year’s lecture notes.
Data streaming & Reservoir Sampling lecture-2, Bloom Filter lecture-3, Counting Distinct Items lecture-4, Count-Min Sketch lecture-5, Frequency Moment Estimation lecture-6,
Finding similar items, Document Similarity lecture-7, Locality sensitive hashing lecture-8,
Clustering lecture-9, k-means++ lecture-10, Correlation clustering lecture-11
MapReduce Introduction lecture-12, MapReduce Algorithms lecture-13, notes
Lower bounds lecture-14
Class Schedule
Lecture |
Topic |
References/Announcements |
Sept 8th |
Class overview. Reservoir Sampling | Lecture-1 |
Sept 13th |
Probability Review. Expectation, Variance, Markov Inequality, Chebyshev’s Inequality.
[Mini-homework-1 posted] |
Lecture-2 |
Sept 15th |
The Chernoff Bound | Lecture-2 |
Sept 20th |
Sampling
[Homework 1 posted] |
Lecture-3 |
Sept 22nd |
Discussion Session, In-class implementation of some of the sampling algorithms | Updated-Lecture-3 |
Sept 27th |
More Chernoff bound, Boosting by median, Extensions of reservoir sampling | Lecture-4 |
Sept 29th |
No class due to instructor travel | |
Oct 4th |
Hashing + review | Bloom Filter |
Oct 6th |
Bloom Filter,
[Homework 2 posted date: Oct 8th] |
lecture-5 |
Oct 11th |
No Class-Monday Class schedule followed | L |
Oct 13th |
Midterm [A sample exam to give you an idea. Some topics may not have been covered yet.] | midterm will be in-class |
Oct 18th |
Data stream processing
Finding frequent items |
Count-Min Sketch |
Oct 20th |
Count-Min sketch | |
Oct 25 |
Frequency Moment Estimation | lecture note |
Oct 27 |
Discussion of homework, frequency moment estimation, boosting by median | |
Nov 1st |
Finding similar items
[Mini-homework 2 posted] |
Lecture note |
Nov 3rd |
Locality sensitive hashing
[Part of class discussion: In-class implementation of min-wise hashing.] |
Lecture note |
Nov 8th |
Locality Sensitive Hashing | Lecture note |
Nov 10th |
Clustering and Unsupervised Learning Algorithms | Lecture note |
Nov 15th |
Correlation Clustering
[Homework-3] |
clustering-updated |
Nov 17th |
Clustering via crowdsourcing
[Extra Review Class 4-5 pm @CS142] |
|
Nov 20-27th |
Thanksgiving Recess | |
Nov 29th |
Dynamic Programming | lecture note |
Dec 1st |
MapReduce Algorithms
[Mini-homework-3] |
Lecture |
Dec 6th |
MapReduce Algorithms, Lower bounds | mapreduce-notes |
Dec 8th |
Miscellaneous | |
Dec 13th |
In-class Exam |