There are so many great ideas in engineering we can take home and apply to our own lives. Today I will talk about one of them: lazy evaluation.
Notes on Spark Streaming app development
This post collects various notes from the second half of this year. Getting a streaming model working and ready for production took a lot of learning. We used Spark Structured Streaming and wrote the code in Scala. Our model was stateful. Our source and sink were both Kafka.
The Spark tunable that gave me 8X speedup
There are many configuration tunables in Spark. However, if you have time for only one, set this one. It made a streaming application we run process data 8X faster. That’s a 700% improvement, no code change needed!
Getting top-N elements in Spark
The documentation for the PySpark top() function has this warning:
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
This piqued my interest: why would you need to bring all the data to the driver, if all you need is a few top elements?
The answer is: it does not load all the data into the driver’s memory.
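To see why a top-N does not need the full dataset in one place, here is a minimal sketch in plain Python. It is not PySpark's actual code: the partitions are simulated as ordinary lists, and the names top_per_partition, merge, and top_n are hypothetical helpers invented for this illustration. The idea is that each partition keeps only its own N largest elements, and the small partial results are then merged pairwise.

```python
import heapq
from functools import reduce


def top_per_partition(partition, n):
    # Keep only the n largest elements of a single partition.
    return heapq.nlargest(n, partition)


def merge(a, b, n):
    # Combine two partial results and keep the n largest overall.
    return heapq.nlargest(n, a + b)


def top_n(partitions, n):
    # Each partition is reduced to at most n elements before anything is combined,
    # so the final merge never sees more than 2 * n elements at a time.
    partial = [top_per_partition(p, n) for p in partitions]
    return reduce(lambda a, b: merge(a, b, n), partial, [])


# Three simulated partitions (assumed data, for illustration only).
partitions = [[5, 1, 9, 3], [7, 2, 8], [4, 6, 10]]
print(top_n(partitions, 3))  # [10, 9, 8]
```

Under this scheme, the amount of data reaching the final merge step grows with N and the number of partitions, not with the size of the dataset.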
Livy is out of memory
Spark jobs were failing. All of them. The data pipeline had stopped. This is a tale of high-pressure debugging.