Algorithms for Data Science-Fall 2016

Home Page for COMPSCI 590D: Algorithms for Data Science

Instructor: Barna Saha
Office: CS 232. Office phone: (413) 577-2510. E-mail:
Instructor Office Hours: Thur 11:30 am–12:30 pm, CS232
Teaching Assistant: Sainyam Galhotra.

Sainyam’s Office Hour: Tue 2–3 pm, Lederle Graduate Research Tower (LGRT), EDLAB, room 223

Class Time: TuTh 10:00 am–11:15 am. Holdsworth 202
Piazza Link: We will use Piazza for all class related discussions. Sign up here.

Course Overview:

Big Data brings us to interesting times and promises to revolutionize our society, from business to government, from healthcare to academia. As we move through this digitized age of exploding data, there is an increasing demand for unified toolkits for data processing and analysis. The main goal of this course is to rigorously study the mathematical foundations of big data processing, develop algorithms, and learn how to analyze them. Specific topics to be covered include (subject to change):
  1. Clustering

  2. Estimating Statistical Properties of Data

  3. Near Neighbor Search

  4. Algorithms over Massive Graphs and Social Networks

  5. Learning Algorithms

  6. Randomized Algorithms

Textbook: We will use reference materials from the following books; both can be downloaded for free.

Prerequisites. CMPSCI 311 and CMPSCI 240, or equivalent courses, are required with a grade of B or better in both. Students need a solid background in algorithm design and basic probability, and will not be admitted to the course without satisfying the prerequisites.

Requirements and Grading. 

There will be 3–4 homework/programming assignments, done in groups of two or three, which together count for 20% of the grade. Class participation will consist of in-class problem solving and coding exercises and will count for 30% of the grade. There will be one midterm and one final, with respective weights of 20% and 30%.


Late Homework Policy: No late submission is allowed unless there are compelling reasons and the extension is pre-approved by the instructor.

Previous Year’s lecture notes. 

Data streaming & Reservoir Sampling lecture-2, Bloom Filter lecture-3, Counting Distinct Items lecture-4, Count-Min Sketch lecture-5, Frequency Moment Estimation lecture-6,

Finding similar items, Document Similarity lecture-7, Locality sensitive hashing lecture-8,

Clustering lecture-9, k-means++ lecture-10, Correlation clustering lecture-11

MapReduce Introduction lecture-12, MapReduce Algorithms lecture-13, notes

Lower bounds lecture-14

Class Schedule




Sept 8th

Class overview. Reservoir Sampling  Lecture-1
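As a warm-up for Lecture 1, here is a minimal sketch of basic reservoir sampling (Algorithm R); the function name and interface are illustrative, not the course's reference code:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k/(i+1), keeping the sample uniform.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

A single pass suffices, and only O(k) memory is used regardless of stream length.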


Sept 13th

 Probability Review. Expectation, Variance, Markov Inequality, Chebyshev’s Inequality.

[Mini-homework-1 posted]


Sept 15th

 The Chernoff Bound  Lecture-2
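For quick reference, one standard multiplicative form of the bound (one of several equivalent statements):

```latex
% For independent X_i \in \{0,1\}, X = \sum_i X_i, \mu = \mathbb{E}[X], and 0 < \delta < 1:
\Pr[X \ge (1+\delta)\mu] \le \exp\!\left(-\frac{\delta^2 \mu}{3}\right),
\qquad
\Pr[X \le (1-\delta)\mu] \le \exp\!\left(-\frac{\delta^2 \mu}{2}\right).
```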

Sept 20th


[Homework 1 posted]


Sept 22nd

 Discussion Session, In-class implementation of some of the sampling algorithms Updated-Lecture-3

Sept 27th

 More Chernoff bound, Boosting by median, Extensions of reservoir sampling Lecture-4
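The boosting-by-median idea above can be sketched in a few lines: split independent samples into groups, average each group, and return the median of the group means, which concentrates sharply by a Chernoff bound. The function below is an illustrative sketch, not the course's reference code:

```python
from statistics import median

def median_of_means(samples, groups):
    """Boost a weak mean estimator: group averages are each right with
    probability > 1/2; their median is right with high probability."""
    n = len(samples) // groups
    means = [sum(samples[i * n:(i + 1) * n]) / n for i in range(groups)]
    return median(means)
```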

Sept 29th

 No class due to instructor travel

Oct 4th

Hashing + review Bloom Filter

Oct 6th

Bloom Filter,

[Homework 2 posted date: Oct 8th]


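A compact illustrative sketch of a Bloom filter; the salted SHA-256 hashing here merely stands in for k independent hash functions and is not the construction analyzed in lecture:

```python
import hashlib

class BloomFilter:
    """m-bit array with k hash positions per item: no false negatives,
    false positives possible."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash (illustrative only).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```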

Oct 11th

No class (Monday class schedule followed)

Oct 13th

In-class midterm [A sample exam is available to give you an idea; some topics may not have been covered yet.]

Oct 18th

Data stream processing 

Finding frequent items

  Count-Min Sketch

lecture note

Oct 20th

 Count-Min sketch 
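A minimal sketch of the Count-Min data structure: d rows of w counters, one hash per row, with the minimum over rows as the estimate. The per-row salting via Python's built-in hash is illustrative; the analysis in lecture assumes pairwise-independent hash functions:

```python
import random

class CountMinSketch:
    """Estimates item frequencies; may overcount, never undercounts."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # One salt per row, standing in for independent hash functions.
        self.salts = [rng.randrange(1 << 30) for _ in range(d)]

    def _pos(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def update(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._pos(r, item)] += count

    def estimate(self, item):
        # Each row overestimates due to collisions; the minimum is tightest.
        return min(self.table[r][self._pos(r, item)] for r in range(self.d))
```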

Oct 25th

Frequency Moment Estimation lecture note
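The AMS estimator for the second frequency moment can be sketched in a few lines: assign each distinct item a random sign, sum the signs over the stream, and square. The dictionary of signs below stands in for a 4-wise independent hash family, so this is an illustrative sketch rather than the space-efficient version from lecture:

```python
import random

def ams_f2_estimate(stream, seed=0):
    """One AMS estimator for F_2 = sum_i f_i^2:
    Z = sum over the stream of s(item), with random signs s in {-1, +1};
    then E[Z^2] = F_2."""
    rng = random.Random(seed)
    signs = {}
    z = 0
    for item in stream:
        if item not in signs:
            signs[item] = rng.choice([-1, 1])
        z += signs[item]
    return z * z
```

In practice one averages several independent copies and takes a median of the averages to boost accuracy.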

Oct 27th

Discussion of homework, frequency moment estimation, boosting by median

Nov 1st

 Finding similar items

[Mini-homework 2 posted]

Lecture note

Nov 3rd

Locality sensitive hashing

[Part of class discussion: In-class implementation of min-wise hashing.]

Lecture note
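For the in-class exercise, a minimal min-wise hashing sketch may help as a starting point; the salted use of Python's built-in hash simulates independent hash functions and is illustrative only:

```python
import random

def minhash_signature(items, num_hashes=100, seed=0):
    """MinHash signature of a set: for each simulated hash function,
    keep the minimum hash value over the set."""
    rng = random.Random(seed)
    salts = [rng.randrange(1 << 30) for _ in range(num_hashes)]
    return [min(hash((s, x)) for x in items) for s in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching coordinates approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```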

Nov 8th

Locality Sensitive Hashing Lecture note

Nov 10th

Clustering and Unsupervised Learning Algorithms Lecture note 
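A 1-D illustrative sketch of the k-means++ seeding step covered in the clustering lectures: each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function and interface are assumptions for illustration:

```python
import random

def kmeanspp_seeds(points, k, seed=0):
    """D^2 seeding for k-means++: sample centers proportional to
    squared distance from the current center set."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance of each point to its nearest chosen center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```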

Nov 15th

Correlation Clustering



Nov 17th

Clustering via crowdsourcing

[Extra Review Class 4-5 pm @CS142]

Nov 20-27th

Thanksgiving Recess  

Nov 29th

Dynamic Programming lecture note

Dec 1st

MapReduce Algorithms



Dec 6th

 MapReduce Algorithms, Lower bounds  mapreduce-notes
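To fix the map/shuffle/reduce structure before the lecture, here is a toy single-machine word-count sketch; real MapReduce distributes the shuffle and reduce across machines, which this deliberately omits:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Mapper: emit a (word, 1) pair for each word in the document.
    return [(w, 1) for w in doc.split()]

def reduce_phase(pairs):
    # Shuffle groups pairs by key; the reducer sums counts per key.
    groups = defaultdict(int)
    for key, val in pairs:
        groups[key] += val
    return dict(groups)

docs = ["a b a", "b c"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
```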


Dec 8th


Dec 13th

 In-class Exam  

Some student-written, mostly unedited class notes from a previous course. If you find any typos or errors, please report them to the instructor.

  1. Data-stream-Introduction: Introduction, computing frequent item deterministically, lower bound, basic concentration inequality.
  2. Count-Distinct: Chernoff’s bound, Universal Hash Function, Counting Distinct Element
  3. Count-Min Sketch
  4. Estimating F_2
  5. Dimensionality Reduction: Linear sketch as dimensionality reduction technique, JL Lemma, Near Neighbor Search
  6. Locality Sensitivity Hashing
  7. Sampling: Reservoir, Priority
  8. Clustering
  9. Intro to Semistreaming
  10. Graph Sparsifiers
  11. Sketching_Graphs
  12. Some MapReduce Algorithms