COMPSCI-590DAlgorithms for Data Science

Home Page for COMPSCI 590D: Algorithms for Data Science

Instructor: Barna Saha
Office: CS 322. Office phone: (413) 577-2510. E-mail:
Instructor Office Hours: Thur 4:00 pm–5:00 pm, CS322
Teaching Assistant: My Phan.
My Phan’s Office Hour: Wed 10-11 am, Lederle Graduate Research Tower  –  LGRT  –  T220
Class Time: TuTh 2:30 pm–3:45 pm. Marston Hall room 211.
Piazza Link: We will use Piazza for all class related discussions. Sign up here.

Course Overview:

Big Data brings us to interesting times and promises to revolutionize our society from business to government, from healthcare to academia. As we walk through this digitized age of exploded data, there is an increasing demand to develop unified toolkits for data processing and analysis. In this course our main goal is to rigorously study the mathematical foundation of big data processing, develop algorithms and learn how to analyze them. Specific Topics to be covered include:
  1. Clustering

  2. Estimating Statistical Properties of Data

  3. Near Neighbor Search

  4. Algorithms over Massive Graphs and Social Networks

  5. Learning Algorithms

  6. Randomized Algorithms

Text Book: We will use reference materials from the following books. Both can be downloaded for free.

Prerequisities. CMPSCI 311 and CMPSCI 240 or equivalent courses are required with grade of B or better in both the courses. Students require proper background in algorithm design and basic probability, and will not be admitted in the course without satisfying the prerequisities.

Requirements and Grading. 

There will be 3-4 homeworks/programming assignments which will be done in a group of three or less. It will count towards 20% of the grade. There will be 1-2 coding assignment which will count towards 10% of the grade. Class participation which includes discussions in piazza will count towards 10%. There will be one take-home final exam that will count towards 30% of the grade. No collaboration outside the group (if your group is of size 3. Everyone is allowed to discuss with at most 2 others.) is allowed in doing the homeworks/programming assignments. Consulting materials in the Web is not allowed. No collaboration is allowed in the exams.

Each group will give a 30 minutes presentation on a relevant research paper selected by consultation with the instructor. This will count towards 30% of the grade.


Late Homework Policy: No late submission is allowed unless there are compelling reasons and preapproved by the instructor.

Class Schedule




Jan 19th

Class overview. Different Models of Computation. Counting Distinct Items. Lecture-1

Jan 21st

Data Streaming Mining: Motivating Examples, Sampling: Reservoir Sampling, Priority Sampling  Lecture-2

Chapter 4 (upto Section 4.3) of Leskovec et al.’s book.

Previous Class Note from CSCI8980

Lecture note on Priority sampling for those interested.

Jan 26th

 Filtering Data Streams. Bloom Filter Lecture-3

 Chapter 4 (Section 4.3) Leskovec et al.

A survey on Bloom Filter

Jan 28th

 Count Distinct Element, Flajolet Martin Sketch,  Lecture-4

Chapter 4 (Section 4.4) Leskovec et al.

Feb 2nd

Concentration Inequalities, Chernoff-Hoeffding bound, Boosting by Median.  Chernoff bound note (pg 1-2)

For various other useful versions of the Chernoff Bound see this

Feb 4th

No class due to Instructor travel. 

Assignment 1 posted

Feb 9th

Count-min sketch Lecture-5

Lecture note on Count-Min sketch

Note from another course

Feb 11th

Review. Frequency norm estimation Chapter 4 (Section 4.5) Leskovec et al. 

Feb 16th

 No class (Monday class schedule)


Feb 18th

Frequency norm estimation. Reflection on streaming algorithms


L Lecture note


Feb 23rd

Finding Similar Items, Document Similarity, Locality Sensitive Hashing

Assignment 1 due. 

 Chapter 3 (Section 3.1-3.4) of Leskovec et al.


Feb 25th


Locality Sensitive Hashing

 Assignment 2 posted (Feb 27th)

 Chapter 3 (Section 3.5-3.8) of Leskovec et al.


Slide from Alex Andoni’s Course on Algorithmic Techniques for Massive Data

Mar 1st

 New directions on Locality sensitive hashing

Mar 3rd

  Clustering Algorithms

Chapter 7  (Section 7.1-7.2) of Leskovec et al.Chapter 2 (Section 2.1-2.2) of Leskovec et al.

clustering Slides

clustering (1) from Jeff Ullman’s course at Stanford.

Mar 8th

  K-Center, K-Median, K-Means++  Reference-1



For analysis see this paper.

Mar 10th

  Analysis of k-means++, Correlation Clustering, Crowdsourcing.

 Assignment 2 due (Mar 14th, Extended to Mar 21st)



Mar 15th

Spring Break  

Mar 17th

Spring Break   

Mar 22nd

Correlation Clustering, Crowdsourcing

Coding Assignment Posted 


 Correlation Clustering

Factor 3-approximation for correlation clustering

Firmani, Saha, Srivastava, “Online Entity Resolution with an Oracle”, VLDB 2016


Mar 24th


Extra credit homework added.

Mar 29th

Introduction to MapReduce

Chapter 2 of Leskovec et al.



Mar 31st

Paper presentation  

Apr 5th

Paper presentation


Apr 7th

Graph Algorithms on MapReduce

Community Detection

Coding Assignment Due.



Chapter 10 (Section 10.1, 10.5) of Leskovec et al.

Apr 12th

Lower Bounds: Communication Complexity, Conditional Lower Bounds  simple-lower-bounds

Apr 14th


Assignment 3 posted (April 16th)


Apr 19th

Final Exam (take home)   

Apr 21st

Perceptron Algorithm, Support Vector Machines  Hopcroft and Kannan, pg 202-208

Apr 26th

Non-linear separators, Kernels, Boosting

Assignment 3 due (May 2nd)

 Hopcroft and Kannan, pg 209-216

Some student-written mostly unedited class notes from a previous course. If you find any typos/errors, please report to the instructor.

  1. Data-stream-Introduction: Introduction, computing frequent item deterministically, lower bound, basic concentration inequality.
  2. Count-Distinct: Chernoff’s bound, Universal Hash Function, Counting Distinct Element
  3. Count-Min Sketch
  4. Estimating F_2
  5. Dimensionality Reduction: Linear sketch as dimensionality reduction technique, JL Lemma, Near Neighbor Search
  6. Locality Sensitivity Hashing
  7. Sampling: Reservoir, Priority
  8. Clustering
  9. Intro to Semistreaming
  10. Graph Sparsifiers
  11. Sketching_Graphs
  12. Some MapReduce Algorithms

A Tentative List of Papers for Class Presentation: 

The goal is to present Theoretical Computer Science papers related to Data Science. Please send me your top-3 choices and I will do the assignments. I will also post the guidelines to assist you in reading these papers and for preparing your presentations.

  1. Priority sampling for estimation of arbitrary subset sums N Duffield, C Lund, M Thorup – Journal of the ACM (JACM), 2007

  2. Chazelle, Bernard; Kilian, Joe; Rubinfeld, Ronitt; Tal, Ayellet (2004), “The Bloomier filter: an efficient data structure for static support lookup tables”, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 30–39

  3. Kane, D. M.; Nelson, J.; Woodruff, D. P. (2010). “An optimal algorithm for the distinct elements problem” (PDF). Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data – PODS ’10. p. 41
  4. A. Andoni, A. Nikolov, K. Onak, G. Yaroslavtsev Parallel Algorithms for Geometric Graph Problems. 46th ACM Symposium on the Theory of Computing (STOC) 2014.
  5. Barna Saha The Dyck Language Edit Distance Problem in Near-linear Time, 55th IEEE Symposium on Foundations of Computer Science (FOCS) 2014.
  6. Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and juntas,  In the 53rd Annual IEEE Symposium on the Foundations of Computer Science, FOCS 2012
  7. Kook Jin Ahn, Sudipto Guha, Andrew McGregor: Analyzing graph structure via linear measurements. 23rd ACM-SIAM Symposium on Discrete Algorithms SODA 2012: 459-467
  8. Cynthia Dwork: Differential Privacy in New Settings.21st ACM-SIAM Symposium on Discrete Algorithms SODA 2010: 174-183
  9. U Feige, P Raghavan, D Peleg, E Upfal:  Computing with noisy information SIAM Journal on Computing, 1994

  10. Emmanuel Abbe, Colin Sandon: Community Detection in General Stochastic Block models: Fundamental Limits and Efficient Algorithms for Recovery. FOCS 2015: 670-688
  11. Alexandr Andoni, Rina Panigrahy, Gregory Valiant, Li Zhang: Learning Polynomials with Neural Networks. ICML 2014: 1908-1916
  12. Sanjeev Arora, Rong Ge, Ankur Moitra: Learning Topic Models – Going beyond SVD. FOCS 2012: 1-10
  13. Aditya Bhaskara, Moses Charikar, Ankur Moitra, Aravindan Vijayaraghavan: Smoothed analysis of tensor decompositions.STOC 2014: 594-603

Group Information for Class Presentation:

  1. Nick Braga, Jared Rand, Shujian Liu: Community Detection

    “Arora et al., Finding overlapping communities in social networks: toward a rigorous approach”, EC 2012.

  2. Wei Hong: Streaming”Clustering data streams: Theory and practice” by Sudipto Guha , Adam Meyerson , Nina Mishra , Rajeev Motwani, IEEE TKDE
  3. Albert Williams, Rufina Chettiar, Ted Leone: Learning Algorithms, Basics of Neural Network.
  4. Haw-Shiuan Chang, Yanjie Li, Daoxu Ye: Community Detection (paper given)

    Dense Subgraphs on Dynamic Networks“, Sarma, Atish Das and Lall, Ashwin and Nanongkai, Danupon and Trehan, Amitabh, DISC, 2012.

  5. Zhou Xu, Zhanlong Qiu, Farahnaz Maroof: Social Network (paper given)“Maximizing the Spread of Influence through a Social Network”, Kempe, Kleinberg, Tardos, KDD 2003.
  6. Anushree Ghosh, Rose John, Seema Guggari: Learning Algorithms
  7. Jie Song, Buqin wang and Swetal Bhatt: Map Reduce “A Model of Computation for MapReduceHoward Karloff, Siddharth Suri and Sergei Vassilvitskii, SODA 2010
  8. Sainyam Galhotra, Manish Motwani, Aditya Tiwari: Near Neighbor   “Rasmus Pagh. Locality-sensitive Hashing without False Negatives, SODA 2016″
  9. Nabanita De: Fast Algorithms (paper given) “Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and juntas,  In the 53rd Annual IEEE Symposium on the Foundations of Computer Science, FOCS 2012
  10. Freddy Ngyuen, Tri Hoang Ngyuen, Xu Qianxin: Streaming Algorithms

    “Streaming algorithms via precision sampling” by Andoni, Krauthgamer and Onak, FOCS 2011.

  11. Danile Lipeles, Jeremy Alanano: Crowdsourcing (paper given)

    “Comprehensive and Reliable Crowd Assessment Algorithms” by Joglekar, Garcia-Molina, and Parameswaran.