The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using
sqlContext.createDataFrame(RDD, CaseClass)
, but my DataFrame ends up empty. Here's my Scala code:

// sc is the SparkContext, while sqlContext is the SQLContext.

// Define the case class and raw data
case class Dog(name: String)
val data = Array( Dog("Rex"), Dog("Fido") )

// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)

// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)

// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])

// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I’m seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
Answer
All you need is:
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the JavaBeans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
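For illustration, here is a minimal self-contained sketch contrasting the two overloads, assuming Spark 1.x where SQLContext is the entry point. The DogBean class and the DogExample object are hypothetical names introduced for this sketch, not something from the original question:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Scala case class: the Scala API infers the schema via reflection
case class Dog(name: String)

// Hypothetical JavaBeans-style class (an assumption, not from the question):
// the two-argument overload discovers columns through getters/setters
class DogBean extends Serializable {
  private var name: String = _
  def getName: String = name
  def setName(n: String): Unit = { name = n }
}

object DogExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dogs").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Scala API: one argument, schema inferred from the case class
    val dogDF = sqlContext.createDataFrame(sc.parallelize(Seq(Dog("Rex"), Dog("Fido"))))
    dogDF.show()  // two rows with a single "name" column

    // Java API: pass the bean class explicitly as the second argument
    val beans = Seq("Rex", "Fido").map { n =>
      val b = new DogBean
      b.setName(n)
      b
    }
    val beanDF = sqlContext.createDataFrame(sc.parallelize(beans), classOf[DogBean])
    beanDF.show()  // also two rows, because DogBean follows the convention
  }
}

If the JavaBeans route is taken, the usual Spark caveat applies: the bean class must be serializable so instances can be shipped to executors, which is why DogBean extends Serializable in this sketch.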
Attribution
Source: Link, Question Author: sparkour, Answer Author: Vitalii Kotliarenko