Notes on Spark Streaming app development

This post contains various notes from the second half of this year. It was a lot of learning trying to get a streaming model working and ready in production. We used Spark Structured Streaming, and wrote the code in Scala. Our model was stateful. Our source and sink were both Kafka.

[Read More]

The Spark tunable that gave me 8X speedup

There are many configuration tunables in Spark. However, if you have time for only one, set this one. It made a streaming application we run process data 8X faster. That’s 800% improvement, no code change needed!

[Read More]

Getting top-N elements in Spark

The documentation for pyspark top() function has this warning:

This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.

This piqued my interest: why would you need to bring all the data to the driver, if all you need is a few top elements?

The answer is: it does not load all the data into the driver’s memory.

[Read More]

Livy is out of memory

Spark jobs were failing. All of them. The data pipeline had stopped. This is a tale of high-pressure debugging.

[Read More]
spark  livy  azure