Home Page for COMPSCI 590D: Algorithms for Data Science
Office: CS 322. Office phone: (413) 577-2510. E-mail: barna@cs.umass.edu.
Course Overview:
-
Clustering
-
Estimating Statistical Properties of Data
-
Near Neighbor Search
-
Algorithms over Massive Graphs and Social Networks
-
Learning Algorithms
-
Randomized Algorithms
Text Book: We will use reference materials from the following books. Both can be downloaded for free.
-
Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman.
-
Foundations of Data Science, a book in preparation, by John Hopcroft and Ravi Kannan
Prerequisities. CMPSCI 311 and CMPSCI 240 or equivalent courses are required with grade of B or better in both the courses. Students require proper background in algorithm design and basic probability, and will not be admitted in the course without satisfying the prerequisities.
Requirements and Grading.
There will be 3-4 homeworks/programming assignments which will be done in a group of three or less. It will count towards 20% of the grade. There will be 1-2 coding assignment which will count towards 10% of the grade. Class participation which includes discussions in piazza will count towards 10%. There will be one take-home final exam that will count towards 30% of the grade. No collaboration outside the group (if your group is of size 3. Everyone is allowed to discuss with at most 2 others.) is allowed in doing the homeworks/programming assignments. Consulting materials in the Web is not allowed. No collaboration is allowed in the exams.
Each group will give a 30 minutes presentation on a relevant research paper selected by consultation with the instructor. This will count towards 30% of the grade.
Homeworks:
Late Homework Policy: No late submission is allowed unless there are compelling reasons and preapproved by the instructor.
Class Schedule
Lecture |
Topic |
References/Announcements |
Jan 19th |
Class overview. Different Models of Computation. Counting Distinct Items. | Lecture-1 |
Jan 21st |
Data Streaming Mining: Motivating Examples, Sampling: Reservoir Sampling, Priority Sampling | Lecture-2
Chapter 4 (upto Section 4.3) of Leskovec et al.’s book. Previous Class Note from CSCI8980 Lecture note on Priority sampling for those interested. |
Jan 26th |
Filtering Data Streams. Bloom Filter | Lecture-3
Chapter 4 (Section 4.3) Leskovec et al. |
Jan 28th |
Count Distinct Element, Flajolet Martin Sketch, | Lecture-4
Chapter 4 (Section 4.4) Leskovec et al. |
Feb 2nd |
Concentration Inequalities, Chernoff-Hoeffding bound, Boosting by Median. | Chernoff bound note (pg 1-2)
For various other useful versions of the Chernoff Bound see this |
Feb 4th |
No class due to Instructor travel.
Assignment 1 posted |
|
Feb 9th |
Count-min sketch | Lecture-5
Lecture note on Count-Min sketch Note from another course |
Feb 11th |
Review. Frequency norm estimation | Chapter 4 (Section 4.5) Leskovec et al. |
Feb 16th |
No class (Monday class schedule) |
|
Feb 18th |
Frequency norm estimation. Reflection on streaming algorithms | L Lecture note |
Feb 23rd |
Finding Similar Items, Document Similarity, Locality Sensitive Hashing
Assignment 1 due. |
Chapter 3 (Section 3.1-3.4) of Leskovec et al. |
Feb 25th |
Locality Sensitive Hashing Assignment 2 posted (Feb 27th) |
Chapter 3 (Section 3.5-3.8) of Leskovec et al.
Slide from Alex Andoni’s Course on Algorithmic Techniques for Massive Data |
Mar 1st |
New directions on Locality sensitive hashing | |
Mar 3rd |
Clustering Algorithms |
Chapter 7 (Section 7.1-7.2) of Leskovec et al.Chapter 2 (Section 2.1-2.2) of Leskovec et al. clustering Slides clustering (1) from Jeff Ullman’s course at Stanford. |
Mar 8th |
K-Center, K-Median, K-Means++ | Reference-1
For analysis see this paper. |
Mar 10th |
Analysis of k-means++, Correlation Clustering, Crowdsourcing.
Assignment 2 due (Mar 14th, Extended to Mar 21st) |
K-means++ |
Mar 15th |
Spring Break | |
Mar 17th |
Spring Break | |
Mar 22nd |
Correlation Clustering, Crowdsourcing
Coding Assignment Posted |
Correlation Clustering
Factor 3-approximation for correlation clustering Firmani, Saha, Srivastava, “Online Entity Resolution with an Oracle”, VLDB 2016 |
Mar 24th |
Review.
Extra credit homework added. |
|
Mar 29th |
Introduction to MapReduce | |
Mar 31st |
Paper presentation | |
Apr 5th |
Paper presentation |
|
Apr 7th |
Graph Algorithms on MapReduce
Community Detection Coding Assignment Due. |
Chapter 10 (Section 10.1, 10.5) of Leskovec et al. |
Apr 12th |
Lower Bounds: Communication Complexity, Conditional Lower Bounds | simple-lower-bounds |
Apr 14th |
Review
Assignment 3 posted (April 16th) |
|
Apr 19th |
Final Exam (take home) | |
Apr 21st |
Perceptron Algorithm, Support Vector Machines | Hopcroft and Kannan, pg 202-208 |
Apr 26th |
Non-linear separators, Kernels, Boosting
Assignment 3 due (May 2nd) |
Hopcroft and Kannan, pg 209-216 |
A Tentative List of Papers for Class Presentation:
The goal is to present Theoretical Computer Science papers related to Data Science. Please send me your top-3 choices and I will do the assignments. I will also post the guidelines to assist you in reading these papers and for preparing your presentations.
-
Priority sampling for estimation of arbitrary subset sums N Duffield, C Lund, M Thorup – Journal of the ACM (JACM), 2007
-
Chazelle, Bernard; Kilian, Joe; Rubinfeld, Ronitt; Tal, Ayellet (2004), “The Bloomier filter: an efficient data structure for static support lookup tables”, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 30–39
- Kane, D. M.; Nelson, J.; Woodruff, D. P. (2010). “An optimal algorithm for the distinct elements problem” (PDF). Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data – PODS ’10. p. 41
- A. Andoni, A. Nikolov, K. Onak, G. Yaroslavtsev Parallel Algorithms for Geometric Graph Problems. 46th ACM Symposium on the Theory of Computing (STOC) 2014.
- Barna Saha The Dyck Language Edit Distance Problem in Near-linear Time, 55th IEEE Symposium on Foundations of Computer Science (FOCS) 2014.
- Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and juntas, In the 53rd Annual IEEE Symposium on the Foundations of Computer Science, FOCS 2012
- Kook Jin Ahn, Sudipto Guha, Andrew McGregor: Analyzing graph structure via linear measurements. 23rd ACM-SIAM Symposium on Discrete Algorithms SODA 2012: 459-467
- Cynthia Dwork: Differential Privacy in New Settings.21st ACM-SIAM Symposium on Discrete Algorithms SODA 2010: 174-183
-
U Feige, P Raghavan, D Peleg, E Upfal: Computing with noisy information SIAM Journal on Computing, 1994
- Emmanuel Abbe, Colin Sandon: Community Detection in General Stochastic Block models: Fundamental Limits and Efficient Algorithms for Recovery. FOCS 2015: 670-688
- Alexandr Andoni, Rina Panigrahy, Gregory Valiant, Li Zhang: Learning Polynomials with Neural Networks. ICML 2014: 1908-1916
- Sanjeev Arora, Rong Ge, Ankur Moitra: Learning Topic Models – Going beyond SVD. FOCS 2012: 1-10
- Aditya Bhaskara, Moses Charikar, Ankur Moitra, Aravindan Vijayaraghavan: Smoothed analysis of tensor decompositions.STOC 2014: 594-603
Group Information for Class Presentation:
- Nick Braga, Jared Rand, Shujian Liu: Community Detection
“Arora et al., Finding overlapping communities in social networks: toward a rigorous approach”, EC 2012.
- Wei Hong: Streaming”Clustering data streams: Theory and practice” by Sudipto Guha , Adam Meyerson , Nina Mishra , Rajeev Motwani, IEEE TKDE
- Albert Williams, Rufina Chettiar, Ted Leone: Learning Algorithms, Basics of Neural Network.
- Haw-Shiuan Chang, Yanjie Li, Daoxu Ye: Community Detection (paper given)
“Dense Subgraphs on Dynamic Networks“, Sarma, Atish Das and Lall, Ashwin and Nanongkai, Danupon and Trehan, Amitabh, DISC, 2012.
- Zhou Xu, Zhanlong Qiu, Farahnaz Maroof: Social Network (paper given)“Maximizing the Spread of Influence through a Social Network”, Kempe, Kleinberg, Tardos, KDD 2003.
- Anushree Ghosh, Rose John, Seema Guggari: Learning Algorithms
- Jie Song, Buqin wang and Swetal Bhatt: Map Reduce “A Model of Computation for MapReduce” Howard Karloff, Siddharth Suri and Sergei Vassilvitskii, SODA 2010
- Sainyam Galhotra, Manish Motwani, Aditya Tiwari: Near Neighbor “Rasmus Pagh. Locality-sensitive Hashing without False Negatives, SODA 2016″
- Nabanita De: Fast Algorithms (paper given) “Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and juntas, In the 53rd Annual IEEE Symposium on the Foundations of Computer Science, FOCS 2012“
- Freddy Ngyuen, Tri Hoang Ngyuen, Xu Qianxin: Streaming Algorithms
“Streaming algorithms via precision sampling” by Andoni, Krauthgamer and Onak, FOCS 2011.
- Danile Lipeles, Jeremy Alanano: Crowdsourcing (paper given)
“Comprehensive and Reliable Crowd Assessment Algorithms” by Joglekar, Garcia-Molina, and Parameswaran.