by Rahul

Tips to prepare for the MapR Spark Certification

Yes, I finally did it :) After a couple of months of preparation, I am now a MapR Certified Spark Developer. I took the exam in November 2016. Spark certifications are offered by several organizations: Databricks, MapR, Hortonworks, Cloudera… Databricks and MapR are the popular ones. I like both providers, and I chose MapR because my organization uses the MapR distribution.

If you are planning to take the MCSD, read the Apache Spark documentation thoroughly and get hands-on experience too.

The major topics MCSD covers:

  • Load and inspect data.
  • Build an Apache Spark application.
  • Working with Pair RDDs.
  • Monitor Spark applications.
  • Working with DataFrames.
  • Spark Streaming.
  • Advanced Machine Learning programming.

Pointers to start preparation:

index  topic                                    time slot (minutes)  weight
1      Load and inspect data.                   30                   24%
2      Build an Apache Spark application.       15                   14%
3      Working with Pair RDDs.                  20                   17%
4      Monitor Spark applications.              15                   14%
5      Working with DataFrames.                 12                   10%
6      Spark Streaming.                         12                   10%
7      Advanced Machine Learning programming.   12                   10%
  • Most of the questions are given in Scala with code snippets, so practice Scala basics.

Useful tips:

  • In the first section the majority of questions are based on Pair RDDs; practice all the Pair RDD operations: groupByKey, reduceByKey, combineByKey:
    • groupByKey vs reduceByKey vs combineByKey: which one is faster? reduceByKey is faster (it combines values locally on each partition before shuffling).
    • Average by key: what value will the code below produce for a given dataset?
      val sumCount = data.combineByKey(
        value => (value, 1),
        (acc: (Double, Int), value) => (acc._1 + value, acc._2 + 1),
        (acc1: (Double, Int), acc2: (Double, Int)) =>
          (acc1._1 + acc2._1, acc1._2 + acc2._2))
      val averageByKey = { case (key, (sum, count)) => (key, sum / count) }
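The (sum, count) combine logic used for the average can be traced without a cluster. Below is a plain-Scala sketch of the same accumulation on a local collection (the sample `data` values are illustrative, and `foldLeft` per key stands in for Spark's create/merge/combine steps):

```scala
// Plain-Scala trace of the (sum, count) logic behind averageByKey.
val data = Seq(("a", 10.0), ("a", 20.0), ("b", 5.0))

// Accumulate (sum, count) per key, as the combiner functions would.
val sumCount: Map[String, (Double, Int)] =
  data.groupBy(_._1).map { case (key, pairs) =>
    key -> pairs.foldLeft((0.0, 0))((acc, p) => (acc._1 + p._2, acc._2 + 1))
  }

// Final map step: divide each key's sum by its count.
val averageByKey = { case (key, (sum, count)) => key -> (sum / count) }
// averageByKey("a") == 15.0, averageByKey("b") == 5.0
```

The key point the exam tests: the second argument merges one value into an accumulator, while the third merges two accumulators (one per partition).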
  • Difference between reduce & fold (fold takes a zero/initial value; reduce does not).
  • What does the rdd.collect method return? (an Array with all the elements of the RDD, materialized on the driver)
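The reduce vs fold difference is easy to see on plain Scala collections. Note that in Spark, fold's zero value is additionally applied once per partition, so it must be a neutral element:

```scala
val nums = List(1, 2, 3, 4)

// reduce: no initial value; throws UnsupportedOperationException on an empty list.
val reduced = nums.reduce(_ + _)                  // 10

// fold: takes an explicit zero element, so it is safe on empty collections.
val folded = nums.fold(0)(_ + _)                  // 10
val emptyFolded = List.empty[Int].fold(0)(_ + _)  // 0
```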
  • Practice accumulators and broadcast variables:

val accum = sc.longAccumulator("My Accumulator")
val rdd = sc.parallelize(1 to N, X)

question 1:
rdd.foreach(x => accum.add(1))
// result: N

question 2:
// (code snippet not given)
// result: X
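The code for question 2 is not given above; since the RDD has X partitions, a result of X suggests a per-partition increment such as foreachPartition (an assumption). Both counting behaviours can be simulated without Spark by modelling the RDD as a sequence of partitions (N, X and the partitioning below are illustrative):

```scala
val N = 10  // number of elements
val X = 4   // number of partitions

// Model sc.parallelize(1 to N, X): split the range into X roughly equal partitions.
val partitions: Seq[Seq[Int]] =
  (1 to N).grouped(math.ceil(N.toDouble / X).toInt).toSeq

// question 1: add 1 per element (like rdd.foreach) -> N
var perElement = 0L
partitions.foreach(_.foreach(_ => perElement += 1))

// question 2 (assumed): add 1 per partition (like rdd.foreachPartition) -> X
var perPartition = 0L
partitions.foreach(_ => perPartition += 1)
```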
  • Difference between supervised & unsupervised learning. Which of the given options is an unsupervised learning algorithm? Spark Wiki
    • Supervised:
      • Regression
        • linear regression
        • logistic regression
      • Classification
        • naive Bayes
        • SVM
        • random decision forest
    • Unsupervised:
      • Dimensionality reduction
        • PCA
        • SVD
  • Read the MLlib data types. What type of value does a LabeledPoint contain? Double.
  • Practice Spark Streaming sliding window and window operations; work out the correct implications of a sliding window.

val lines = ssc.socketTextStream(args(0), args(1).toInt, .........)
val words = lines.flatMap(_.split(" "))
val wordCounts = => (x, 1))
val runningCountStream = wordCounts.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y,
  (x: Int, y: Int) => x - y,
  windowSize, slidingInterval, 2,
  (x: (String, Int)) => x._2 != 0)
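The inverse-function form of reduceByKeyAndWindow avoids recomputing each whole window: on every slide, counts from the batch entering the window are added and counts from the batch leaving it are subtracted. A plain-Scala sketch of that arithmetic over per-batch counts for one key (the batch values are illustrative):

```scala
// Per-batch counts for one key across 5 batches; window = 3 batches, slide = 1.
val batchCounts = Seq(2, 5, 1, 4, 3)
val windowSize = 3

// Naive approach: recompute every window sum from scratch.
val naive = batchCounts.sliding(windowSize).map(_.sum).toSeq

// Incremental approach (what the inverse function enables):
// newSum = oldSum + entering batch - leaving batch
val first = batchCounts.take(windowSize).sum
val incremental = (windowSize until batchCounts.length).scanLeft(first) {
  (acc, i) => acc + batchCounts(i) - batchCounts(i - windowSize)
}
// Both produce the same window sums: 8, 10, 8
```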
  • How fault tolerance is achieved in Spark Streaming.
  • How back pressure is achieved in streaming.
  • Read how to tune a Spark job. Read here.
  • All the Spark UI questions were easy:
    • What is the meaning of stages, jobs and the job progress status (M+N/N) in the Spark UI? Read here
    • How many tasks/stages for a given code snippet?
      val data = sc.parallelize(....).map(s => (s, 1))
      data.cache()
      data.reduceByKey(_ + _)
      ….data.foreach..groupBy
  • What are the different ways to load data and create DataFrames?
    • jdbc
    • load HDFS files - Hadoop RDD.
    • text file - sc.textFile
    • sequence file - sc.sequenceFile
    • whole text files - sc.wholeTextFiles
    • object file - sc.objectFile
    • DataFrames:
      • Read:
        • sc.parallelize(..).map(s => Object).toDS
        • sc.parallelize(..).map(s => Object).toDF
        • sqlContext.load("….")
        •"", true).parquet("……")
      • Write: modes - error, append, overwrite, ignore
        •"….")
        • df.write.parquet("….")
        • df.write.format("parquet").save("….")
  • Read all the DataFrame operations: select, groupBy, filter …
    • df.filter("xyz_col > 100")
  • Difference between DataFrame show, limit & take.
  • The exam doesn't include GraphX.


Should you need any further assistance or have any queries, please comment!
Wish you all the best for your exam.