Home Page for COMPSCI 590D: Algorithms for Data Science
Office: CS 232. Office phone: (413) 5772510. Email: barna@cs.umass.edu.
Sainyam’s Office Hour: Tue 23 pm, Lederle Graduate Research Tower –EDLAB – LGRT room 223
Course Overview:

Clustering

Estimating Statistical Properties of Data

Near Neighbor Search

Algorithms over Massive Graphs and Social Networks

Learning Algorithms

Randomized Algorithms
Text Book: We will use reference materials from the following books. Both can be downloaded for free.

Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman.

Foundations of Data Science, a book in preparation, by John Hopcroft and Ravi Kannan
Prerequisities. CMPSCI 311 and CMPSCI 240 or equivalent courses are required with grade of B or better in both the courses. Students require proper background in algorithm design and basic probability, and will not be admitted in the course without satisfying the prerequisities.
Requirements and Grading.
There will be 34 homeworks/programming assignments which will be done in a group of two or three. It will count towards 20% of the grade. Class participation will consist of inclass problem solving and coding exercises and will count towards 30% of the grade. There will be one midterm and one final with respective weights of 20% and 30%.
Homeworks:
Late Homework Policy: No late submission is allowed unless there are compelling reasons and preapproved by the instructor.
Previous Year’s lecture notes.
Data streaming & Reservoir Sampling lecture2, Bloom Filter lecture3, Counting Distinct Items lecture4, CountMin Sketch lecture5, Frequency Moment Estimation lecture6,
Finding similar items, Document Similarity lecture7, Locality sensitive hashing lecture8,
Clustering lecture9, kmeans++ lecture10, Correlation clustering lecture11
MapReduce Introduction lecture12, MapReduce Algorithms lecture13, notes
Lower bounds lecture14
Class Schedule
Lecture 
Topic 
References/Announcements 
Sept 8th 
Class overview. Reservoir Sampling  Lecture1 
Sept 13th 
Probability Review. Expectation, Variance, Markov Inequality, Chebyshev’s Inequality.
[Minihomework1 posted] 
Lecture2 
Sept 15th 
The Chernoff Bound  Lecture2 
Sept 20th 
Sampling
[Homework 1 posted] 
Lecture3 
Sept 22nd 
Discussion Session, Inclass implementation of some of the sampling algorithms  UpdatedLecture3 
Sept 27th 
More Chernoff bound, Boosting by median, Extensions of reservoir sampling  Lecture4 
Sept 29th 
No class due to instructor travel  
Oct 4th 
Hashing + review  Bloom Filter 
Oct 6th 
Bloom Filter,
[Homework 2 posted date: Oct 8th] 
lecture5 
Oct 11th 
No ClassMonday Class schedule followed  L 
Oct 13th 
Midterm [A sample exam to give you an idea. Some topics may not have been covered yet.]  midterm will be inclass 
Oct 18th 
Data stream processing
Finding frequent items 
CountMin Sketch 
Oct 20th 
CountMin sketch  
Oct 25 
Frequency Moment Estimation  lecture note 
Oct 27 
Discussion of homework, frequency moment estimation, boosting by median  
Nov 1st 
Finding similar items
[Minihomework 2 posted] 
Lecture note 
Nov 3rd 
Locality sensitive hashing
[Part of class discussion: Inclass implementation of minwise hashing.] 
Lecture note 
Nov 8th 
Locality Sensitive Hashing  Lecture note 
Nov 10th 
Clustering and Unsupervised Learning Algorithms  Lecture note 
Nov 15th 
Correlation Clustering
[Homework3] 
clusteringupdated 
Nov 17th 
Clustering via crowdsourcing
[Extra Review Class 45 pm @CS142] 

Nov 2027th 
Thanksgiving Recess  
Nov 29th 
Dynamic Programming  lecture note 
Dec 1st 
MapReduce Algorithms
[Minihomework3] 
Lecture 
Dec 6th 
MapReduce Algorithms, Lower bounds  mapreducenotes 
Dec 8th 
Miscellaneous  
Dec 13th 
Inclass Exam 
Some studentwritten mostly unedited class notes from a previous course. If you find any typos/errors, please report to the instructor.
 DatastreamIntroduction: Introduction, computing frequent item deterministically, lower bound, basic concentration inequality.
 CountDistinct: Chernoff’s bound, Universal Hash Function, Counting Distinct Element
 CountMin Sketch
 Estimating F_2
 Dimensionality Reduction: Linear sketch as dimensionality reduction technique, JL Lemma, Near Neighbor Search
 Locality Sensitivity Hashing
 Sampling: Reservoir, Priority
 Clustering
 Intro to Semistreaming
 Graph Sparsifiers
 Sketching_Graphs
 Some MapReduce Algorithms