There are so many great ideas in engineering we can take home and apply to our own lives. Today I will talk about one of them: lazy evaluation.[Read More]
Notes on Spark Streaming app development
This post contains various notes from the second half of this year. It was a lot of learning trying to get a streaming model working and ready in production. We used Spark Structured Streaming, and wrote the code in Scala. Our model was stateful. Our source and sink were both Kafka.[Read More]
The Spark tunable that gave me 8X speedup
There are many configuration tunables in Spark. However, if you have time for only one, set this one. It made a streaming application we run process data 8X faster. That’s 800% improvement, no code change needed![Read More]
Getting top-N elements in Spark
The documentation for pyspark
top() function has this warning:
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
This piqued my interest: why would you need to bring all the data to the driver, if all you need is a few top elements?
The answer is: it does not load all the data into the driver’s memory.[Read More]
Livy is out of memory
Spark jobs were failing. All of them. The data pipeline had stopped. This is a tale of high-pressure debugging.[Read More]