Skip to Content

Spark Paper Notes

Resilient Distributed Datasets Note

Question of MapReduce

  • lack abstraction for leveraging distributed memory
    1. iterative algorithm
    2. interactive data mining

Resilient distributed datasets

  • readonly, partition collection of record
  • coarse-grained transformations
  • lineage
  • api

RDD vs. DSM

  • recovery is more light weight
  • backup tasks like Mapreduce for stragglers
  • data locality

Not suitable for RDD

  • fine-grained updates application

Spark programming interface

  • driver with cluster of =workers

Example

  • Logistic Regression
// parse text into point, then persist it to RAM
val points = spark.textFile(...)
  .map(parsePoint).persist()

// repeatedly run map and reduce on points to compute gradient
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
val gradient = points.map{ p =>
  p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
}.reduce((a,b) => a+b)
  w -= gradient
}
  • PageRank
// get links and persist
val links = spark.textFile(...).map(...).persist()
// initail RDD
var ranks = // RDD of (URL, rank) pairs

// repeatedly compute the new RDD from old RDD
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) =>
    links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
    .mapValues(sum => a/N + (1-a)*sum)
}

Representing RDDs

  • dependencies
    1. narrow dependencies
    2. wide dependencies
  • partitions
  • oerferredLocations
  • iterator
  • partitioner

Implementation detail

  • Job scheduling
    1. stages with per tasks
  • interpreter integration
    1. class shipping
    2. modified code generation
  • memory management
    1. LRU eviction
    2. priority
  • checkpointing
    1. for long-run linegae chains

Expressing exising programming models in spark

  • mapreduce
  • SQL
  • Iterative mapreduce
  • Batched Stream Processing
comments powered by Disqus