COMPSCI-590DAlgorithms for Data Science |

Home Page for COMPSCI 590D: Algorithms for Data Science

Instructor: Barna Saha
Office: CS 322. Office phone: (413) 577-2510. E-mail: barna@cs.umass.edu.

Instructor Office Hours: Thur 4:00 pm–5:00 pm, CS322

Teaching Assistant: My Phan. E-mail:myphan@cs.umass.edu

My Phan’s Office Hour: Wed 10-11 am, Lederle Graduate Research Tower – LGRT – T220

Class Time: TuTh 2:30 pm–3:45 pm. Marston Hall room 211.

Class Page: Algorithms for Data Science

Piazza Link: We will use Piazza for all class related discussions. Sign up here.

Course Overview:

Big Data brings us to interesting times and promises to revolutionize our society from business to government, from healthcare to academia. As we walk through this digitized age of exploded data, there is an increasing demand to develop unified toolkits for data processing and analysis. In this course our main goal is to rigorously study the mathematical foundation of big data processing, develop algorithms and learn how to analyze them. Specific Topics to be covered include:

Clustering
Estimating Statistical Properties of Data
Near Neighbor Search
Algorithms over Massive Graphs and Social Networks
Learning Algorithms
Randomized Algorithms

Text Book: We will use reference materials from the following books. Both can be downloaded for free.

Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman.
Foundations of Data Science, a book in preparation, by John Hopcroft and Ravi Kannan

Prerequisities. CMPSCI 311 and CMPSCI 240 or equivalent courses are required with grade of B or better in both the courses. Students require proper background in algorithm design and basic probability, and will not be admitted in the course without satisfying the prerequisities.

Requirements and Grading.

There will be 3-4 homeworks/programming assignments which will be done in a group of three or less. It will count towards 20% of the grade. There will be 1-2 coding assignment which will count towards 10% of the grade. Class participation which includes discussions in piazza will count towards 10%. There will be one take-home final exam that will count towards 30% of the grade. No collaboration outside the group (if your group is of size 3. Everyone is allowed to discuss with at most 2 others.) is allowed in doing the homeworks/programming assignments. Consulting materials in the Web is not allowed. No collaboration is allowed in the exams.

Each group will give a 30 minutes presentation on a relevant research paper selected by consultation with the instructor. This will count towards 30% of the grade.

Homeworks:

Late Homework Policy: No late submission is allowed unless there are compelling reasons and preapproved by the instructor.

Class Schedule

Lecture	Topic	References/Announcements
Jan 19th	Class overview. Different Models of Computation. Counting Distinct Items.	Lecture-1
Jan 21st	Data Streaming Mining: Motivating Examples, Sampling: Reservoir Sampling, Priority Sampling	Lecture-2 Chapter 4 (upto Section 4.3) of Leskovec et al.’s book. Previous Class Note from CSCI8980 Lecture note on Priority sampling for those interested.
Jan 26th	Filtering Data Streams. Bloom Filter	Lecture-3 Chapter 4 (Section 4.3) Leskovec et al. A survey on Bloom Filter
Jan 28th	Count Distinct Element, Flajolet Martin Sketch,	Lecture-4 Chapter 4 (Section 4.4) Leskovec et al.
Feb 2nd	Concentration Inequalities, Chernoff-Hoeffding bound, Boosting by Median.	Chernoff bound note (pg 1-2) For various other useful versions of the Chernoff Bound see this
Feb 4th	No class due to Instructor travel. Assignment 1 posted
Feb 9th	Count-min sketch	Lecture-5 Lecture note on Count-Min sketch Note from another course
Feb 11th	Review. Frequency norm estimation	Chapter 4 (Section 4.5) Leskovec et al.
Feb 16th	No class (Monday class schedule)
Feb 18th	Frequency norm estimation. Reflection on streaming algorithms	L Lecture note Lecture-6
Feb 23rd	Finding Similar Items, Document Similarity, Locality Sensitive Hashing Assignment 1 due.	Chapter 3 (Section 3.1-3.4) of Leskovec et al. Lecture-7
Feb 25th	Locality Sensitive Hashing Assignment 2 posted (Feb 27th)	Chapter 3 (Section 3.5-3.8) of Leskovec et al. Slide Slide from Alex Andoni’s Course on Algorithmic Techniques for Massive Data
Mar 1st	New directions on Locality sensitive hashing
Mar 3rd	Clustering Algorithms	Chapter 7 (Section 7.1-7.2) of Leskovec et al.Chapter 2 (Section 2.1-2.2) of Leskovec et al. clustering Slides clustering (1) from Jeff Ullman’s course at Stanford.
Mar 8th	K-Center, K-Median, K-Means++	Reference-1 Reference-2 lecture-k-means++ For analysis see this paper.
Mar 10th	Analysis of k-means++, Correlation Clustering, Crowdsourcing. Assignment 2 due (Mar 14th, Extended to Mar 21st)	K-means++
Mar 15th	Spring Break
Mar 17th	Spring Break
Mar 22nd	Correlation Clustering, Crowdsourcing Coding Assignment Posted	Correlation Clustering Factor 3-approximation for correlation clustering Firmani, Saha, Srivastava, “Online Entity Resolution with an Oracle”, VLDB 2016 Slides
Mar 24th	Review. Extra credit homework added.
Mar 29th	Introduction to MapReduce	Chapter 2 of Leskovec et al. MapReduce-Intro MapReduce
Mar 31st	Paper presentation
Apr 5th	Paper presentation
Apr 7th	Graph Algorithms on MapReduce Community Detection Coding Assignment Due.	Reference-1 mapreduce-notes Chapter 10 (Section 10.1, 10.5) of Leskovec et al.
Apr 12th	Lower Bounds: Communication Complexity, Conditional Lower Bounds	simple-lower-bounds
Apr 14th	Review Assignment 3 posted (April 16th)
Apr 19th	Final Exam (take home)
Apr 21st	Perceptron Algorithm, Support Vector Machines	Hopcroft and Kannan, pg 202-208
Apr 26th	Non-linear separators, Kernels, Boosting Assignment 3 due (May 2nd)	Hopcroft and Kannan, pg 209-216

A Tentative List of Papers for Class Presentation:

The goal is to present Theoretical Computer Science papers related to Data Science. Please send me your top-3 choices and I will do the assignments. I will also post the guidelines to assist you in reading these papers and for preparing your presentations.

Priority sampling for estimation of arbitrary subset sums N Duffield, C Lund, M Thorup – Journal of the ACM (JACM), 2007
Chazelle, Bernard; Kilian, Joe; Rubinfeld, Ronitt; Tal, Ayellet (2004), “The Bloomier filter: an efficient data structure for static support lookup tables”, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (PDF), pp. 30–39
Kane, D. M.; Nelson, J.; Woodruff, D. P. (2010). “An optimal algorithm for the distinct elements problem” (PDF). Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data – PODS ’10. p. 41
A. Andoni, A. Nikolov, K. Onak, G. Yaroslavtsev Parallel Algorithms for Geometric Graph Problems. 46th ACM Symposium on the Theory of Computing (STOC) 2014.
Barna Saha The Dyck Language Edit Distance Problem in Near-linear Time, 55th IEEE Symposium on Foundations of Computer Science (FOCS) 2014.
Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and juntas, In the 53rd Annual IEEE Symposium on the Foundations of Computer Science, FOCS 2012
Kook Jin Ahn, Sudipto Guha, Andrew McGregor: Analyzing graph structure via linear measurements. 23rd ACM-SIAM Symposium on Discrete Algorithms SODA 2012: 459-467
Cynthia Dwork: Differential Privacy in New Settings.21st ACM-SIAM Symposium on Discrete Algorithms SODA 2010: 174-183
U Feige, P Raghavan, D Peleg, E Upfal: Computing with noisy information SIAM Journal on Computing, 1994
Emmanuel Abbe, Colin Sandon: Community Detection in General Stochastic Block models: Fundamental Limits and Efficient Algorithms for Recovery. FOCS 2015: 670-688
Alexandr Andoni, Rina Panigrahy, Gregory Valiant, Li Zhang: Learning Polynomials with Neural Networks. ICML 2014: 1908-1916
Sanjeev Arora, Rong Ge, Ankur Moitra: Learning Topic Models – Going beyond SVD. FOCS 2012: 1-10
Aditya Bhaskara, Moses Charikar, Ankur Moitra, Aravindan Vijayaraghavan: Smoothed analysis of tensor decompositions.STOC 2014: 594-603

Group Information for Class Presentation:

Nick Braga, Jared Rand, Shujian Liu: Community Detection
“Arora et al., Finding overlapping communities in social networks: toward a rigorous approach”, EC 2012.
Wei Hong: Streaming”Clustering data streams: Theory and practice” by Sudipto Guha , Adam Meyerson , Nina Mishra , Rajeev Motwani, IEEE TKDE
Albert Williams, Rufina Chettiar, Ted Leone: Learning Algorithms, Basics of Neural Network.
Haw-Shiuan Chang, Yanjie Li, Daoxu Ye: Community Detection (paper given)
“Dense Subgraphs on Dynamic Networks“, Sarma, Atish Das and Lall, Ashwin and Nanongkai, Danupon and Trehan, Amitabh, DISC, 2012.
Zhou Xu, Zhanlong Qiu, Farahnaz Maroof: Social Network (paper given)“Maximizing the Spread of Influence through a Social Network”, Kempe, Kleinberg, Tardos, KDD 2003.
Anushree Ghosh, Rose John, Seema Guggari: Learning Algorithms
Jie Song, Buqin wang and Swetal Bhatt: Map Reduce “A Model of Computation for MapReduce” Howard Karloff, Siddharth Suri and Sergei Vassilvitskii, SODA 2010
Sainyam Galhotra, Manish Motwani, Aditya Tiwari: Near Neighbor “Rasmus Pagh. Locality-sensitive Hashing without False Negatives, SODA 2016″
Nabanita De: Fast Algorithms (paper given) “Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and juntas, In the 53rd Annual IEEE Symposium on the Foundations of Computer Science, FOCS 2012“
Freddy Ngyuen, Tri Hoang Ngyuen, Xu Qianxin: Streaming Algorithms
“Streaming algorithms via precision sampling” by Andoni, Krauthgamer and Onak, FOCS 2011.
Danile Lipeles, Jeremy Alanano: Crowdsourcing (paper given)
“Comprehensive and Reliable Crowd Assessment Algorithms” by Joglekar, Garcia-Molina, and Parameswaran.

Home Page for COMPSCI 590D: Algorithms for Data Science

Course Overview:

Clustering

Estimating Statistical Properties of Data

Near Neighbor Search

Algorithms over Massive Graphs and Social Networks

Learning Algorithms

Randomized Algorithms

Text Book: We will use reference materials from the following books. Both can be downloaded for free.

Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman.

Foundations of Data Science, a book in preparation, by John Hopcroft and Ravi Kannan

Prerequisities. CMPSCI 311 and CMPSCI 240 or equivalent courses are required with grade of B or better in both the courses. Students require proper background in algorithm design and basic probability, and will not be admitted in the course without satisfying the prerequisities.

Requirements and Grading.

Each group will give a 30 minutes presentation on a relevant research paper selected by consultation with the instructor. This will count towards 30% of the grade.

Homeworks:

Late Homework Policy: No late submission is allowed unless there are compelling reasons and preapproved by the instructor.

Class Schedule

Lecture

Topic

References/Announcements

Jan 19th

Jan 21st

Jan 26th

Jan 28th

Feb 2nd

Feb 4th

Feb 9th

Feb 11th

Feb 16th

No class (Monday class schedule)

Feb 18th

Feb 23rd