Schedule

Part  Description                                     Dates                    Related Assignment
1     Introduction to Big Data                        Jan. 9                   CS451 - A0, CS431 - A0
2     MapReduce Algorithm Design                      Jan. 11, 16, 18          CS451 - A1, CS431 - A1
3     From MapReduce to Spark                         Jan. 23, 25, 30          CS451 - A2, CS431 - A2
4     Analyzing Text                                  Feb. 1, 6                CS451 - A3
5     Analyzing Graphs                                Feb. 8, 13, 15           CS451 - A4, CS431 - A3
      Reading Week!                                   Feb. 17-25               -
6     Data Mining and Machine Learning                Feb. 27, 29, Mar. 5, 7   CS451 - A5, CS431 - A4
7     Analyzing Relational Data                       Mar. 12, 14🎉, 19        CS451 - A6, CS431 - A5
8     Real-Time Analytics (Streaming)                 Mar. 21, 26              CS451 - A7, CS431 - A6
9     Mutable State (Bigtable / HBase)                Mar. 28, Apr. 2          -
10    Analyzing Graphs, Redux (Giraph, Spark GraphX)  Apr. 4                   -
(The party hat is because it's my birthday.)
Note that the following slides are from last term; I will tweak them as time permits. Some JavaScript on this page puts an "updated" note beside any file that changes.

Part 1: Introduction to Big Data

Topics

  • What's this course about?
  • Why big data?
  • Scaling models

Slides

Part 2: MapReduce Algorithm Design

Topics

  • MapReduce programming model
  • Cloud computing and datacenters
  • Hadoop API
  • Hadoop physical execution
  • MapReduce design patterns
  • Intermediate aggregation and combiners
  • Partitioning, grouping, and sorting
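
To make the programming model and the role of combiners concrete, here is a minimal single-process sketch of word count. It is plain Python, not the Hadoop API, and the names (mapper, combiner, reducer, run_word_count) are made up for illustration. Each inner list in the input stands for the lines handled by one hypothetical map task.

```python
from collections import defaultdict
from itertools import groupby


def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Combine: pre-sum the counts produced by a single map task."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def reducer(word, counts):
    """Reduce: sum all partial counts that arrived for one key."""
    return word, sum(counts)


def run_word_count(splits):
    """Simulate map -> combine -> shuffle/sort -> reduce in one process."""
    shuffled = []
    for split in splits:
        map_output = [pair for line in split for pair in mapper(line)]
        # The combiner shrinks the data that would cross the network.
        shuffled.extend(combiner(map_output))
    # Shuffle and sort: bring all values for the same key together.
    shuffled.sort(key=lambda kv: kv[0])
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(shuffled, key=lambda kv: kv[0])
    )


if __name__ == "__main__":
    print(run_word_count([["a b a", "b c"], ["a c c"]]))
```

The combiner is safe here only because addition is associative and commutative; the result must be the same whether or not it runs.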

Readings

  • Data-Intensive Text Processing with MapReduce
  • Hadoop: The Definitive Guide (4th Edition):
    • Chapter 1: Meet Hadoop
    • Chapter 2: MapReduce
    • Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
    • Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
    • Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
    • Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
    • Chapter 8: MapReduce Types and Formats
    • Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

Part 3: From MapReduce to Spark

Topics

  • Evolution of dataflow abstractions
  • MapReduce, Pig, Spark, etc.

Readings

  • Learning Spark (Optional):
    • Chapter 1: Introduction to Data Analysis with Spark
    • Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
    • Chapter 3: Programming with RDDs
    • Chapter 4: Working with Key/Value Pairs
    • Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Slides

Part 4: Analyzing Text

Topics

  • Language models and machine translation
  • Inverted indexing and search
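
As a rough illustration of inverted indexing and search (an in-memory sketch, not the MapReduce or Spark version), the snippet below builds postings lists of (doc_id, term_frequency) pairs and answers boolean AND queries. The function names and the whitespace tokenizer are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Return {term: [(doc_id, term_frequency), ...]}, postings sorted by doc_id.

    docs maps a document id to its text.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def search_and(index, query):
    """Boolean AND retrieval: ids of documents containing every query term."""
    postings = [
        {doc_id for doc_id, _ in index.get(term, [])}
        for term in query.lower().split()
    ]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    docs = {1: "big data big ideas", 2: "data mining", 3: "big graphs"}
    index = build_inverted_index(docs)
    print(index["big"])
    print(search_and(index, "big data"))
```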

Readings

Slides

Part 5: Analyzing Graphs

Topics

  • Graph representations
  • Parallel breadth-first search
  • PageRank and random walks
  • Issues and challenges with dataflow abstractions
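
For PageRank, a small iterative sketch over an adjacency-list representation may help. It is a single-machine version with a random-jump factor and even redistribution of mass from dangling nodes; the function name, the default damping of 0.85, and the fixed iteration count are assumptions for the example rather than anything prescribed by the course.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank.

    graph maps each node to a list of its out-neighbours; every neighbour
    must also appear as a key. Returns {node: rank}, with ranks summing to 1.
    """
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Mass sitting on dangling nodes (no out-links) is spread evenly.
        dangling = sum(rank[node] for node, outs in graph.items() if not outs)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in graph}
        # Each node divides its current rank among its out-neighbours.
        for node, outs in graph.items():
            for target in outs:
                new_rank[target] += damping * rank[node] / len(outs)
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
    for node, score in sorted(pagerank(graph).items()):
        print(node, round(score, 4))
```

Each loop iteration corresponds to one pass over the whole graph, which is why this algorithm is a common stress test for dataflow systems that must rewrite the graph on every pass.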

Readings

Slides

Part 6: Data Mining and Machine Learning

Topics

  • Supervised machine learning: binary classification
  • Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
  • Production machine learning pipelines
  • Hashing: minhash
  • Clustering: k-means
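
As one small example from the list above, here is a sketch of minhash for estimating Jaccard similarity between sets. It uses random affine hash functions modulo the Mersenne prime 2^61 - 1 over CRC32 fingerprints of the items; those choices, and all the names, are assumptions for illustration and only approximate true min-wise independent permutations.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32-bit fingerprint


def make_hash_functions(k, seed=0):
    """Draw k random affine hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, hash_functions):
    """Signature = minimum hash value per function; items must be non-empty."""
    fingerprints = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in fingerprints) for a, b in hash_functions]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hash_functions(256)
    a = {f"w{i}" for i in range(60)}
    b = {f"w{i}" for i in range(20, 80)}  # true Jaccard similarity is 0.5
    sig_a = minhash_signature(a, hashes)
    sig_b = minhash_signature(b, hashes)
    print(estimate_jaccard(sig_a, sig_b))
```

More hash functions give a tighter estimate at the cost of a longer signature.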

Readings

  • Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
  • Deisenroth et al., Mathematics for Machine Learning: Chapter 12, Classification with Support Vector Machines. (Optional supplemental reading)
  • Deisenroth et al., Mathematics for Machine Learning: Chapter 11, Density Estimation with Gaussian Mixture Models. (This book chapter serves as supplemental reading and goes into clustering with Gaussian mixture models in more detail than in lecture.)

Slides

Part 7: Analyzing Relational Data

Topics

  • OLTP vs. OLAP
  • Data warehousing and data lakes, ETL
  • SQL-on-Hadoop: relational data processing with MapReduce and Spark
  • Optimizations for relational processing: row vs. column stores, vectorized processing
  • Semistructured data and record reconstruction (Parquet)

Readings

Slides

Part 8: Real-Time Analytics

Topics

  • Stream processing semantics, issues, and frameworks
  • Introduction to Apache Spark Streaming
  • Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
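
To give a feel for the probabilistic data structures, here is a minimal Bloom filter sketch. The class name, the default sizes, and the use of salted SHA-256 digests as the hash family are assumptions for the example; a production filter would pack bits and use cheaper hashes.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True only if every position is set; unrelated items may collide.
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    for user in ["alice", "bob", "carol"]:
        seen.add(user)
    print("alice" in seen, "bob" in seen)
```

The filter never stores the items themselves, which is what makes it attractive for unbounded streams.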

Readings

Slides

Part 9: Mutable State

Topics

  • Bigtable/HBase: Log-structured merge trees
  • Distributed hash tables
  • Consistency, latency, and availability tradeoffs
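
The log-structured merge tree idea can be sketched in a few lines: writes go to an in-memory buffer, full buffers are flushed as immutable sorted runs, reads check newest data first, and compaction merges runs. This toy is not how Bigtable or HBase is implemented; the TinyLSM class, its method names, and the tiny flush threshold are all hypothetical.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge tree: a memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # in-memory buffer of the most recent writes
        self.runs = []      # sorted lists of (key, value), newest run first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new immutable sorted run."""
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # the newest run holding the key wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in reversed(self.runs):  # oldest first, newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


if __name__ == "__main__":
    db = TinyLSM()
    for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
        db.put(key, value)
    print(db.get("a"), len(db.runs))
    db.compact()
    print(db.runs)
```

Writes never modify a run in place, so reads may have to consult several runs until compaction catches up.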

Readings

Slides

Part 10: Analyzing Graphs, Redux

Topics

  • Bulk synchronous parallel: "think like a vertex" (Giraph)
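
A rough sketch of the "think like a vertex" style, using single-source shortest paths: in each superstep, every vertex that received messages takes the smallest one, updates its distance if it improved, and messages its neighbours; the computation halts when no messages are in flight. This is a single-process simulation, not the Giraph API, and it assumes non-negative edge weights. The function name and graph layout are made up for the example.

```python
import math


def bsp_shortest_paths(graph, source):
    """graph maps each vertex to {neighbour: edge_weight}; returns distances."""
    distance = {vertex: math.inf for vertex in graph}
    inbox = {source: [0]}  # messages delivered at the start of a superstep
    while inbox:           # halt once no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            # Per-vertex compute step: sees only its own state and messages.
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbour, weight in graph[vertex].items():
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox     # synchronization barrier between supersteps
    return distance


if __name__ == "__main__":
    graph = {
        "a": {"b": 1, "c": 4},
        "b": {"c": 2, "d": 5},
        "c": {"d": 1},
        "d": {},
        "e": {},
    }
    print(bsp_shortest_paths(graph, "a"))
```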

Readings

Slides
