How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from [Databricks][1] and apply it to the new Kafka connector and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark. Note: the topic is written into Kafka in JSON format. val ds1 = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", IP + ":9092") … Read more
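
A minimal sketch of the usual pattern (Spark 2.1+), assuming a hypothetical topic name `my_topic`, a placeholder broker address, and a made-up three-field schema: cast the Kafka `value` column to a string, then apply `from_json` with an explicit `StructType`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object KafkaJsonStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-from_json").getOrCreate()

    // Hypothetical schema of the JSON messages in the topic.
    val schema = StructType(Seq(
      StructField("device", StringType),
      StructField("timestamp", LongType),
      StructField("value", StringType)
    ))

    val ds1 = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // replace with your broker IP
      .option("subscribe", "my_topic")                      // hypothetical topic name
      .load()

    // The Kafka source exposes `value` as binary; cast it to a string,
    // then let from_json parse it against the schema defined above.
    val parsed = ds1
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), schema).as("data"))
      .select("data.*")

    val query = parsed.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```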

PySpark: how to resample frequencies

Imagine a Spark DataFrame consisting of value observations from variables. Each observation has a specific timestamp, and those timestamps differ between variables, because a timestamp is generated and recorded whenever the value of a variable changes. #Variable Time Value #852-YF-007 2016-05-10 00:00:00 0 #852-YF-007 2016-05-09 23:59:00 0 #852-YF-007 … Read more
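
One way to approach this kind of resampling is to bucket every observation into fixed time windows per variable and aggregate within each bucket. Below is a small Scala sketch of that idea (the question is PySpark; the API is the same), using made-up variable names, timestamps, and a 15-minute grid as assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, window}

object Resample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("resample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical observations: (variable, time, value), irregularly spaced per variable.
    val observations = Seq(
      ("852-YF-007", "2016-05-09 23:59:00", 0.0),
      ("852-YF-007", "2016-05-10 00:00:00", 0.0),
      ("852-YF-008", "2016-05-10 00:03:00", 1.5)
    ).toDF("variable", "time", "value")
      .withColumn("time", col("time").cast("timestamp"))

    // Bucket each observation into a fixed 15-minute window per variable and
    // average the values that fall into the same bucket; empty buckets produce no row.
    val resampled = observations
      .groupBy(col("variable"), window(col("time"), "15 minutes"))
      .agg(avg("value").alias("value"))
      .select(col("variable"), col("window.start").alias("time"), col("value"))
      .orderBy("variable", "time")

    resampled.show(truncate = false)
  }
}
```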

Enable case sensitivity for spark.sql globally

The option spark.sql.caseSensitive controls whether column names etc. should be case sensitive. It can be set, e.g., by spark_session.sql('set spark.sql.caseSensitive=true') and is false by default. It does not seem to be possible to enable it globally in $SPARK_HOME/conf/spark-defaults.conf with spark.sql.caseSensitive: True though. Is that intended, or is there some other file to set … Read more
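
Whether spark-defaults.conf picks up SQL settings depends on the Spark version, so as a hedge: one way to apply the flag for a whole application, rather than per SQL statement, is to pass it when the session is built. A minimal Scala sketch:

```scala
import org.apache.spark.sql.SparkSession

object CaseSensitiveSession {
  def main(args: Array[String]): Unit = {
    // Setting the flag at session-build time applies it to every query issued
    // through this session, without a per-session SET statement.
    val spark = SparkSession.builder()
      .appName("case-sensitive-sql")
      .master("local[*]")
      .config("spark.sql.caseSensitive", "true")
      .getOrCreate()

    // Verify what the session actually picked up.
    println(spark.conf.get("spark.sql.caseSensitive")) // expected: true
  }
}
```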

PySpark / Spark Window Function First/Last Issue

From my understanding, the first/last functions in Spark will retrieve the first/last row of each partition. I am not able to understand why the LAST function is giving incorrect results. This is my code. AgeWindow = Window.partitionBy('Dept').orderBy('Age') df1 = df1.withColumn('first(ID)', first('ID').over(AgeWindow))\ .withColumn('last(ID)', last('ID').over(AgeWindow)) df1.show() +---+----------+---+--------+---------+--------+ |Age| Dept| ID| Name|first(ID)|last(ID)| +---+----------+---+--------+---------+--------+ | 38| medicine| … Read more
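
The behaviour in question usually comes from the window frame: an ordered window defaults to a frame ending at the current row, so last() returns the current row's value rather than the partition's last. A Scala sketch (the question's code is PySpark) with made-up data, widening the frame to the whole partition:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, last}

object FirstLastWindow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("first-last").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data shaped like the question's DataFrame.
    val df = Seq(
      (38, "medicine", 1, "Alice"),
      (41, "medicine", 2, "Bob"),
      (29, "physics", 3, "Carol")
    ).toDF("Age", "Dept", "ID", "Name")

    // Making the frame span the whole partition gives the expected
    // first/last ID per department, regardless of the current row.
    val ageWindow = Window
      .partitionBy("Dept")
      .orderBy("Age")
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    df.withColumn("first(ID)", first("ID").over(ageWindow))
      .withColumn("last(ID)", last("ID").over(ageWindow))
      .show()
  }
}
```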

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here’s my Scala code: // sc is the SparkContext, while sqlContext is the SQLContext. // Define the case class … Read more
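A minimal sketch of the standard pattern, using the newer SparkSession API instead of the question's sqlContext and a hypothetical Person case class. Defining the case class at the top level (not inside a method) matters, because schema inference relies on reflection.

```scala
import org.apache.spark.sql.SparkSession

// Top-level case class: nested definitions can break reflection-based schema inference.
case class Person(name: String, age: Int)

object CaseClassRddToDf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("case-class-df").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical RDD of case-class instances.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 41)))

    // Either conversion infers the schema from the case class fields.
    val df1 = spark.createDataFrame(rdd)
    val df2 = rdd.toDF()

    df1.printSchema()
    df2.show()
  }
}
```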

Spark structured streaming kafka convert JSON without schema (infer schema)

I read that Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve the schema the same way Spark Streaming does: val dataFrame = spark.read.json(rdd.map(_.value())) dataFrame.printSchema Answer Here is one possible way to do this: Before you start streaming, get a small batch of the data from Kafka … Read more
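
A sketch of that approach (assuming Spark 2.2+, a hypothetical topic `my_topic`, and a placeholder broker): read the topic once in batch mode, infer the schema with spark.read.json on the string values, then reuse that schema with from_json in the stream.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}

object InferKafkaJsonSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("infer-kafka-schema").getOrCreate()
    import spark.implicits._

    // 1. Batch-read a sample of the topic (not a stream) and let spark.read.json
    //    infer the schema from the JSON strings.
    val sample = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
      .option("subscribe", "my_topic")                      // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .as[String]

    val inferredSchema = spark.read.json(sample).schema
    println(inferredSchema.treeString)

    // 2. Reuse the inferred schema for the streaming read of the same topic.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), inferredSchema).as("data"))
      .select("data.*")

    stream.writeStream.format("console").start().awaitTermination()
  }
}
```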

Spark MLlib – trainImplicit warning

I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB. And then the task size starts to increase. I tried to call repartition on the input RDD but the warnings are the same. All these warnings … Read more
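
The warning generally means the serialized payload shipped with each task is large; one common mitigation is to keep the ratings RDD in many smaller partitions before training, though whether it silences the warning depends on the data. A minimal Scala sketch with made-up ratings and arbitrary hyperparameters:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object TrainImplicitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("train-implicit").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical implicit-feedback ratings (user, product, confidence-like weight).
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 3.0),
      Rating(1, 20, 1.0),
      Rating(2, 10, 5.0)
    )).repartition(8) // more, smaller partitions keeps individual task payloads down

    // trainImplicit(ratings, rank, iterations, lambda, alpha)
    val model = ALS.trainImplicit(ratings, 10, 5, 0.01, 1.0)

    println(model.recommendProducts(1, 2).mkString(", "))
  }
}
```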

BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premises clusters, and our team recently moved to the Google Cloud Platform, allowing us to leverage the power of BigQuery and such. The thing is, I now often find myself writing processing steps in SQL more than in PySpark since it is: easier to reason … Read more