功夫阿拉克's Blog

A Summary of Spark Optimizations

1.groupByKey

Avoid groupByKey when performing an associative reductive operation.
For example, rewrite rdd.groupByKey().mapValues(_.sum) as rdd.reduceByKey(_ + _).
Use groupByKey sparingly: it has no map-side combiner, so every value is shipped across the shuffle.
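A minimal side-by-side sketch, assuming sc is an existing SparkContext and hypothetical toy data:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every value across the shuffle, then sums on the reducer side
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so far less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)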

2.reduceByKey

Avoid reduceByKey when the input and output value types are different.
For example, rewrite

rdd.map(kv => (kv._1, Set(kv._2))).reduceByKey(_ ++ _)

as

val zero = new collection.mutable.HashSet[String]()
rdd.aggregateByKey(zero)((set, v) => set += v, (set1, set2) => set1 ++= set2)
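A runnable version of the rewrite above, assuming sc is an existing SparkContext and hypothetical (category, item) pairs:

import scala.collection.mutable

val kv = sc.parallelize(Seq(("fruit", "apple"), ("fruit", "pear"), ("veg", "leek")))

// aggregateByKey mutates one set per key per partition instead of
// allocating a throwaway Set for every record like reduceByKey would
val distinct = kv.aggregateByKey(mutable.HashSet.empty[String])(
  (set, v) => set += v,         // fold a value into the partition-local set
  (set1, set2) => set1 ++= set2 // merge the per-partition sets
)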

3.flatMap-join-groupBy pattern

Avoid the flatMap-join-groupBy pattern.
When two datasets are already grouped by key and you want to join them and keep them grouped,
you can just use cogroup. This avoids all the overhead associated with unpacking and repacking the groups.
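A minimal cogroup sketch, assuming sc is an existing SparkContext and hypothetical click and purchase datasets keyed by user id:

val clicks = sc.parallelize(Seq((1, "home"), (1, "cart"), (2, "home")))
val purchases = sc.parallelize(Seq((1, 9.99), (3, 4.50)))

// One shuffle, groups kept intact:
// RDD[(Int, (Iterable[String], Iterable[Double]))]
val grouped = clicks.cogroup(purchases)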

4.shuffles

When more shuffles are better:
This trick is especially useful when the aggregation is already grouped by a key.
For example, consider an app that wants to count the occurrences of each word
in a corpus and pull the results into the driver as a map.
One approach, which can be accomplished with the aggregate action, is to compute a local map
at each partition and then merge the maps at the driver.
The alternative approach, which can be accomplished with aggregateByKey,
is to perform the count in a fully distributed way, and then simply collectAsMap the results to the driver.
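A sketch of both approaches, assuming sc is an existing SparkContext and a hypothetical words RDD:

import scala.collection.mutable

val words = sc.parallelize(Seq("spark", "shuffle", "spark"))

// Approach 1: aggregate builds a map per partition and merges every map in the driver
val merged = words.aggregate(mutable.Map.empty[String, Int])(
  (m, w) => { m(w) = m.getOrElse(w, 0) + 1; m },
  (m1, m2) => { m2.foreach { case (w, c) => m1(w) = m1.getOrElse(w, 0) + c }; m1 }
)

// Approach 2: an extra shuffle does the counting in a fully distributed way,
// so the driver only collects small, already-reduced per-key results
val counts = words.map((_, 1)).aggregateByKey(0)(_ + _, _ + _).collectAsMap()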

5.repartitionAndSortWithinPartitions

This is more efficient than calling repartition and then sorting within
each partition because it can push the sorting down into the shuffle machinery.
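A minimal sketch, assuming sc is an existing SparkContext:

import org.apache.spark.HashPartitioner

val records = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

// The sort happens inside the shuffle machinery itself,
// not as a separate pass after repartition()
val sorted = records.repartitionAndSortWithinPartitions(new HashPartitioner(2))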

6.flatMap

The flatMap function combines two operations: it maps first, then flattens.
Step 1: like map, apply the given function to each input element, producing a collection for each element.
Step 2: flatten all of the resulting collections into a single output collection.

flatMap treats a String as a sequence of its characters.
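A sketch of both steps and the string behavior, assuming sc is an existing SparkContext:

val lines = sc.parallelize(Seq("to be", "or not"))

// Step 1 maps each line to a collection; step 2 flattens the collections
val tokens = lines.flatMap(_.split(" "))   // "to", "be", "or", "not"

// Since a Scala String is a sequence of Chars, flatMapping the lines
// themselves explodes each line into characters
val chars = lines.flatMap(line => line)    // 't', 'o', ' ', 'b', ...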

7.foldByKey

When using the foldByKey operator, pay special attention to the mapping function and the choice of zeroValue: the zero value is folded in once per key per partition, so it must be a neutral element for the function (e.g. 0 for addition), or the result will be skewed.
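A sketch of the pitfall, assuming sc is an existing SparkContext:

val scores = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)), numSlices = 2)

// 0 is neutral for +, so this is a plain per-key sum
val summed = scores.foldByKey(0)(_ + _)

// A non-neutral zero value can be applied once per key per partition,
// so it may be added several times and skew the result
val skewed = scores.foldByKey(10)(_ + _)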

8.reduceByKeyLocally

This function combines the values for each key in an RDD[K, V] with the given function, and returns the result as a Map[K, V] to the driver rather than as an RDD[K, V].
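A minimal sketch, assuming sc is an existing SparkContext:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Returns Map("a" -> 4, "b" -> 2) directly in the driver, not an RDD,
// so the full result must fit in driver memory
val counts = pairs.reduceByKeyLocally(_ + _)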

9.foreach

Calling foreach on an RDD runs the function on the executors, not on the driver, so side effects such as mutating a driver-side variable or printing to the driver's stdout will not behave as expected.
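A sketch of the pitfall and the accumulator-based fix, assuming sc is an existing SparkContext:

val nums = sc.parallelize(1 to 5)

var total = 0
// Each task mutates its own serialized copy of `total` on an executor,
// so the driver-side variable stays 0
nums.foreach(n => total += n)
println(total)        // 0

// Accumulators are the supported way to aggregate back to the driver
val acc = sc.longAccumulator("sum")
nums.foreach(n => acc.add(n))
println(acc.value)    // 15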

10.persist()

If we also wanted to use lineLengths again later, we could add lineLengths.persist() before the reduce, which would keep lineLengths in memory after the first time it is computed.
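In context, a sketch assuming sc is an existing SparkContext and a hypothetical data.txt input file:

val lines = sc.textFile("data.txt")
val lineLengths = lines.map(_.length)
lineLengths.persist()                       // mark for caching before the first action

val totalLength = lineLengths.reduce(_ + _) // computes and caches lineLengths
val maxLength = lineLengths.reduce((a, b) => math.max(a, b)) // served from the cache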

11.checkpoint

Why an RDD may need a checkpoint:
1. The lineage in the DAG has grown too long, so recomputing it on failure would be too expensive (as in PageRank; see the sketch below).
2. Checkpointing at a shuffle dependency yields a bigger payoff.
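A minimal sketch of periodic checkpointing in an iterative job, assuming sc is an existing SparkContext and a hypothetical HDFS checkpoint directory:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")

var ranks = sc.parallelize(Seq((1L, 1.0), (2L, 1.0)))
for (i <- 1 to 20) {
  ranks = ranks.mapValues(_ * 0.85 + 0.15)   // lineage grows every iteration
  if (i % 10 == 0) {
    ranks.checkpoint()   // mark for checkpointing
    ranks.count()        // an action materializes it, truncating the lineage
  }
}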

12.HashPartitioner and RangePartitioner

Use HashPartitioner and RangePartitioner to reduce network communication overhead: once an RDD has been partitioned with a known partitioner (and persisted), key-oriented operations such as join can avoid re-shuffling it.
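A sketch of the shuffle savings, assuming sc is an existing SparkContext and hypothetical user/event datasets:

import org.apache.spark.HashPartitioner

val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(new HashPartitioner(4))
  .persist()   // keep the partitioned layout in memory

val events = sc.parallelize(Seq((1, "click"), (2, "view")))

// join sees users' partitioner, so only events is shuffled;
// RangePartitioner gives the same benefit when a sorted layout is wanted
val joined = users.join(events)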