Big Data Algorithms and Data Structures
- Instructor:
- Hu Fu
fuhu@mail.shufe.edu.cn
Office: 504 School of Information Management and Engineering
Office Hour: Wednesday 2:30-3:30pm.
- Teaching Assistant:
-
Qun Hu
email: 2019212804@163.sufe.edu.cn
Office Hour: Friday 2-3pm.
Location: The lab building 1109
- Lectures:
- Friday 6-8:35pm, 520 Center for Cloud Computing
- Syllabus:
-
This course is a theoretical introduction to basic data structures and algorithms used to deal with data of large scales.
Randomization plays a crucial role in these techniques, and a large of part of this course focuses on randomized algorithms and data structures.
Topics include a review of discrete probability theory, hashing, search trees, concentration inequalities, skip lists and SkipNets, dimensionality reduction, and streaming algorithms.
- Prerequisites:
-
Familiarity with basic data structures and algorithms will be assumed.
Discrete probability theory will be used throughout the course; a quick review will be provided at the beginning of the course.
- Texts:
-
There is no required textbook.
Supplementary readings are occasionally provided here. Optional readings will be marked as such.
- Course Work:
-
Grades are determined by Problem sets/written assignments (45%) + Project (20%) + Final (35%). The final is a take-home exam.
For the course project, students will form groups of up to 4 people and survey a topic related to big data data structures or algorithms.
The instructor will suggest candidate topics, but students are encouraged to explore topics of their own interests.
Each group will make an in-class presentation of ~20 minutes in the last lecture, and submit a survey.
A survey can be in either Chinese or English, with an expected length of two pages.
The projects will be evaluated based on the quality of presentation and the submitted survey.
- Suggested Project Topics
-
Here are some keywords that may help you find topics for the final presentation. Search online for resources on big data algorithms, and you will see plenty other topics and resources. See e.g. a list of papers compiled here by Chandra Chekuri.
- Fast Johnson-Lindenstrauss algorithms
- Locality sensitve Hashing
- Cuckoo hashing
- Core-sets
- MapReduce
- Compressed Sensing
- Graph sparsifiers
- Homework Policies:
-
- We will have roughly 4 written assignments. Students are encouraged to form groups of up to three people for each assignment. Each group needs to turn in only one solution, but every group member must be able to explain everything turned in.
- No late solutions will be accepted.
- Typesetting solutions using LaTex is encouraged.
Here is a LaTeX template for your reference.
-
For assignments and exams, unless stated otherwise, for all questions that ask to design an algorithm, you need to provide justification for, (that is, to prove) the correctness of your algorithm. When the question asks for a certain running time (e.g. polynomial time, or O(n^2)), you should analyze the running time of your algorithm.
-
Assignments and their solutions will be posted on the BB system.
You should turn in your solutions either on the BB sytem, or send them to Qun by email, or hand them in at the start of the lecture. Please make sure that your submission has all the names of your group members.
-
Some problems in the assignments are more challenging than the others. You are encouraged to discuss problems among yourselves, on BB, and/or come to office hours. You should start thinking about the problems early and not wait till the last day or two. Allow yourself time to think and to seek help.
-
If you do work with someone outside your group or use some outside source, you must acknowledge them in your write-up.
- Problem Sets
-
You may send your solutions by email to Qun, or drop them at the door of my office. If you work in a group, please make sure your solutions have the names and student numbers of everyone in the group.
- Schedule:
-
-
Sept 10: Introduction
-
Sept 17: Quicksort, Universal Hashing
-
Sept 24: Bloom filters, binary search trees, amortized analysis
-
Oct 8: Treap, Markov inequality
-
Oct 15: Chernoff bounds and their applications
-
Oct 22: Skip List, Johnson-Lindenstrauss
-
Oct 29: Finish JL, approximate nearest neighbor, start streaming algorithms
-
Nov 5: AMS, CountMin
-
Nov 12: Count-Sketch, Sketch for distinct elements, Heavy-hitters
-
Nov 19: Presentations.