Spark

Lazy evaluation in real life

Posted on June 28, 2020 | 4 minutes

There are so many great ideas in engineering we can take home and apply to our own lives. Today I will talk about one of them: lazy evaluation.

[Read More]

spark functional-programming productivity

Notes on Spark Streaming app development

Posted on December 18, 2019 | 5 minutes

This post collects various notes from the second half of this year. We learned a lot while getting a streaming model working and ready for production. We used Spark Structured Streaming and wrote the code in Scala. Our model was stateful, and both our source and sink were Kafka.

[Read More]

spark streaming kafka

The Spark tunable that gave me 8X speedup

Posted on August 7, 2019 | 2 minutes

There are many configuration tunables in Spark. However, if you have time for only one, set this one. It made a streaming application we run process data 8X faster. That’s eight times the throughput, no code change needed!

[Read More]

spark performance

Getting top-N elements in Spark

Posted on May 11, 2019 | 2 minutes

The documentation for PySpark’s RDD.top() method has this warning:

This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.

This piqued my interest: why would you need to bring all the data to the driver, if all you need is a few top elements?

The answer is: it does not load all the data into the driver’s memory.
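To see why the full dataset never has to reach the driver, here is a minimal sketch of the idea in plain Python, with no Spark involved. The function name `top_n` and the list-of-lists stand-in for partitions are assumptions made for illustration; this is not Spark's actual code, only the general "take the top N per partition, then merge" pattern.

```python
import heapq
from functools import reduce


def top_n(partitions, n):
    """Return the n largest elements across all partitions, largest first."""
    # Step 1 (would run on the executors): each partition keeps
    # only its own n largest elements.
    local_tops = [heapq.nlargest(n, part) for part in partitions]

    # Step 2 (merge): combine two candidate lists and keep only n again,
    # so no intermediate result ever holds more than 2 * n elements.
    def merge(a, b):
        return heapq.nlargest(n, a + b)

    return reduce(merge, local_tops, [])


# Three hypothetical partitions of one dataset.
partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]
print(top_n(partitions, 2))  # prints [9, 8]
```

With this pattern, each partition contributes at most N elements, so what arrives at the driver grows with N and the number of partitions, not with the size of the data.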

[Read More]

spark

Livy is out of memory

Posted on March 11, 2018 | 3 minutes

Spark jobs were failing. All of them. The data pipeline had stopped. This is a tale of high-pressure debugging.

[Read More]

spark livy azure