Home Page for COMPSCI 590D: Algorithms for Data Science
Office: CS 322. Office phone: (413) 5772510. Email: barna@cs.umass.edu.
Course Overview:

Clustering

Estimating Statistical Properties of Data

Near Neighbor Search

Algorithms over Massive Graphs and Social Networks

Learning Algorithms

Randomized Algorithms
Text Book: We will use reference materials from the following books. Both can be downloaded for free.

Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman.

Foundations of Data Science, a book in preparation, by John Hopcroft and Ravi Kannan
Prerequisities. CMPSCI 311 and CMPSCI 240 or equivalent courses are required with grade of B or better in both the courses. Students require proper background in algorithm design and basic probability, and will not be admitted in the course without satisfying the prerequisities.
Requirements and Grading.
There will be 34 homeworks/programming assignments which will be done in a group of three or less. It will count towards 20% of the grade. There will be 12 coding assignment which will count towards 10% of the grade. Class participation which includes discussions in piazza will count towards 10%. There will be one takehome final exam that will count towards 30% of the grade. No collaboration outside the group (if your group is of size 3. Everyone is allowed to discuss with at most 2 others.) is allowed in doing the homeworks/programming assignments. Consulting materials in the Web is not allowed. No collaboration is allowed in the exams.
Each group will give a 30 minutes presentation on a relevant research paper selected by consultation with the instructor. This will count towards 30% of the grade.
Homeworks:
Late Homework Policy: No late submission is allowed unless there are compelling reasons and preapproved by the instructor.
Class Schedule
Lecture 
Topic 
References/Announcements 
Jan 19th 
Class overview. Different Models of Computation. Counting Distinct Items.  Lecture1 
Jan 21st 
Data Streaming Mining: Motivating Examples, Sampling: Reservoir Sampling, Priority Sampling  Lecture2
Chapter 4 (upto Section 4.3) of Leskovec et al.’s book. Previous Class Note from CSCI8980 Lecture note on Priority sampling for those interested. 
Jan 26th 
Filtering Data Streams. Bloom Filter  Lecture3
Chapter 4 (Section 4.3) Leskovec et al. 
Jan 28th 
Count Distinct Element, Flajolet Martin Sketch,  Lecture4
Chapter 4 (Section 4.4) Leskovec et al. 
Feb 2nd 
Concentration Inequalities, ChernoffHoeffding bound, Boosting by Median.  Chernoff bound note (pg 12)
For various other useful versions of the Chernoff Bound see this 
Feb 4th 
No class due to Instructor travel.
Assignment 1 posted 

Feb 9th 
Countmin sketch  Lecture5
Lecture note on CountMin sketch Note from another course 
Feb 11th 
Review. Frequency norm estimation  Chapter 4 (Section 4.5) Leskovec et al. 
Feb 16th 
No class (Monday class schedule) 

Feb 18th 
Frequency norm estimation. Reflection on streaming algorithms

L Lecture note 
Feb 23rd 
Finding Similar Items, Document Similarity, Locality Sensitive Hashing
Assignment 1 due. 
Chapter 3 (Section 3.13.4) of Leskovec et al. 
Feb 25th 
Locality Sensitive Hashing Assignment 2 posted (Feb 27th) 
Chapter 3 (Section 3.53.8) of Leskovec et al.
Slide from Alex Andoni’s Course on Algorithmic Techniques for Massive Data 
Mar 1st 
New directions on Locality sensitive hashing  
Mar 3rd 
Clustering Algorithms 
Chapter 7 (Section 7.17.2) of Leskovec et al.Chapter 2 (Section 2.12.2) of Leskovec et al. clustering Slides clustering (1) from Jeff Ullman’s course at Stanford. 
Mar 8th 
KCenter, KMedian, KMeans++  Reference1
For analysis see this paper. 
Mar 10th 
Analysis of kmeans++, Correlation Clustering, Crowdsourcing.
Assignment 2 due (Mar 14th, Extended to Mar 21st)

Kmeans++ 
Mar 15th 
Spring Break  
Mar 17th 
Spring Break  
Mar 22nd 
Correlation Clustering, Crowdsourcing
Coding Assignment Posted

Correlation Clustering
Factor 3approximation for correlation clustering Firmani, Saha, Srivastava, “Online Entity Resolution with an Oracle”, VLDB 2016 
Mar 24th 
Review.
Extra credit homework added. 

Mar 29th 
Introduction to MapReduce  
Mar 31st 
Paper presentation  
Apr 5th 
Paper presentation 

Apr 7th 
Graph Algorithms on MapReduce
Community Detection Coding Assignment Due. 
Chapter 10 (Section 10.1, 10.5) of Leskovec et al. 
Apr 12th 
Lower Bounds: Communication Complexity, Conditional Lower Bounds  simplelowerbounds 
Apr 14th 
Review
Assignment 3 posted (April 16th) 

Apr 19th 
Final Exam (take home)  
Apr 21st 
Perceptron Algorithm, Support Vector Machines  Hopcroft and Kannan, pg 202208 
Apr 26th 
Nonlinear separators, Kernels, Boosting
Assignment 3 due (May 2nd) 
Hopcroft and Kannan, pg 209216 
Some studentwritten mostly unedited class notes from a previous course. If you find any typos/errors, please report to the instructor.
 DatastreamIntroduction: Introduction, computing frequent item deterministically, lower bound, basic concentration inequality.
 CountDistinct: Chernoff’s bound, Universal Hash Function, Counting Distinct Element
 CountMin Sketch
 Estimating F_2
 Dimensionality Reduction: Linear sketch as dimensionality reduction technique, JL Lemma, Near Neighbor Search
 Locality Sensitivity Hashing
 Sampling: Reservoir, Priority
 Clustering
 Intro to Semistreaming
 Graph Sparsifiers
 Sketching_Graphs
 Some MapReduce Algorithms
A Tentative List of Papers for Class Presentation:
The goal is to present Theoretical Computer Science papers related to Data Science. Please send me your top3 choices and I will do the assignments. I will also post the guidelines to assist you in reading these papers and for preparing your presentations.

Priority sampling for estimation of arbitrary subset sums N Duffield, C Lund, M Thorup – Journal of the ACM (JACM), 2007

Chazelle, Bernard; Kilian, Joe; Rubinfeld, Ronitt; Tal, Ayellet (2004), “The Bloomier filter: an efficient data structure for static support lookup tables”, Proceedings of the Fifteenth Annual ACMSIAM Symposium on Discrete Algorithms (PDF), pp. 30–39
 Kane, D. M.; Nelson, J.; Woodruff, D. P. (2010). “An optimal algorithm for the distinct elements problem” (PDF). Proceedings of the twentyninth ACM SIGMODSIGACTSIGART symposium on Principles of database systems of data – PODS ’10. p. 41
 A. Andoni, A. Nikolov, K. Onak, G. Yaroslavtsev Parallel Algorithms for Geometric Graph Problems. 46th ACM Symposium on the Theory of Computing (STOC) 2014.
 Barna Saha The Dyck Language Edit Distance Problem in Nearlinear Time, 55th IEEE Symposium on Foundations of Computer Science (FOCS) 2014.
 Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and juntas, In the 53rd Annual IEEE Symposium on the Foundations of Computer Science, FOCS 2012
 Kook Jin Ahn, Sudipto Guha, Andrew McGregor: Analyzing graph structure via linear measurements. 23rd ACMSIAM Symposium on Discrete Algorithms SODA 2012: 459467
 Cynthia Dwork: Differential Privacy in New Settings.21st ACMSIAM Symposium on Discrete Algorithms SODA 2010: 174183

U Feige, P Raghavan, D Peleg, E Upfal: Computing with noisy information SIAM Journal on Computing, 1994
 Emmanuel Abbe, Colin Sandon: Community Detection in General Stochastic Block models: Fundamental Limits and Efficient Algorithms for Recovery. FOCS 2015: 670688
 Alexandr Andoni, Rina Panigrahy, Gregory Valiant, Li Zhang: Learning Polynomials with Neural Networks. ICML 2014: 19081916
 Sanjeev Arora, Rong Ge, Ankur Moitra: Learning Topic Models – Going beyond SVD. FOCS 2012: 110
 Aditya Bhaskara, Moses Charikar, Ankur Moitra, Aravindan Vijayaraghavan: Smoothed analysis of tensor decompositions.STOC 2014: 594603
Group Information for Class Presentation:
 Nick Braga, Jared Rand, Shujian Liu: Community Detection
“Arora et al., Finding overlapping communities in social networks: toward a rigorous approach”, EC 2012.
 Wei Hong: Streaming”Clustering data streams: Theory and practice” by Sudipto Guha , Adam Meyerson , Nina Mishra , Rajeev Motwani, IEEE TKDE
 Albert Williams, Rufina Chettiar, Ted Leone: Learning Algorithms, Basics of Neural Network.
 HawShiuan Chang, Yanjie Li, Daoxu Ye: Community Detection (paper given)
“Dense Subgraphs on Dynamic Networks“, Sarma, Atish Das and Lall, Ashwin and Nanongkai, Danupon and Trehan, Amitabh, DISC, 2012.
 Zhou Xu, Zhanlong Qiu, Farahnaz Maroof: Social Network (paper given)“Maximizing the Spread of Influence through a Social Network”, Kempe, Kleinberg, Tardos, KDD 2003.
 Anushree Ghosh, Rose John, Seema Guggari: Learning Algorithms
 Jie Song, Buqin wang and Swetal Bhatt: Map Reduce “A Model of Computation for MapReduce” Howard Karloff, Siddharth Suri and Sergei Vassilvitskii, SODA 2010
 Sainyam Galhotra, Manish Motwani, Aditya Tiwari: Near Neighbor “Rasmus Pagh. Localitysensitive Hashing without False Negatives, SODA 2016″
 Nabanita De: Fast Algorithms (paper given) “Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and juntas, In the 53rd Annual IEEE Symposium on the Foundations of Computer Science, FOCS 2012“
 Freddy Ngyuen, Tri Hoang Ngyuen, Xu Qianxin: Streaming Algorithms
“Streaming algorithms via precision sampling” by Andoni, Krauthgamer and Onak, FOCS 2011.
 Danile Lipeles, Jeremy Alanano: Crowdsourcing (paper given)
“Comprehensive and Reliable Crowd Assessment Algorithms” by Joglekar, GarciaMolina, and Parameswaran.