Notes on Spark Streaming app development

This post collects various notes from the second half of this year. Getting a streaming model working and ready for production took a lot of learning. We used Spark Structured Streaming and wrote the code in Scala. Our model was stateful. Our source and sink were both Kafka.

Privacy in today's age with a SOCKS proxy

Say you are at a cafe, and you want to surf the Web. But the WiFi is not secure. Or say your company lets you bring your laptop, but what if its firewall has blocked your favorite website? Is there no hope, besides paying $15 to a VPN provider? There is, and it costs about $3.50 per month as of this writing.

The Spark tunable that gave me 8X speedup

There are many configuration tunables in Spark. However, if you have time for only one, set this one. It made one of our streaming applications process data 8X faster. That’s eight times the throughput, no code change needed!

Getting top-N elements in Spark

The documentation for the PySpark top() function has this warning:

This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.

This piqued my interest: why would you need to bring all the data to the driver, if all you need is a few top elements?

The answer is: it does not load all the data into the driver’s memory.
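To see why a top-N query does not need the whole dataset in one place, here is a minimal sketch in plain Python. It is not Spark's source code: the partitions are simulated as ordinary lists, and the function names `top_n_per_partition`, `merge_top_n` and `top_n` are made up for illustration. Each partition keeps only its own N largest elements, and the partial results are then merged pairwise, so at most N elements per partition ever reach the merging side.

```python
import heapq
from functools import reduce


def top_n_per_partition(partition, n):
    """Keep only the n largest elements of a single partition."""
    return heapq.nlargest(n, partition)


def merge_top_n(a, b, n):
    """Merge two partial results, again keeping only the n largest."""
    return heapq.nlargest(n, a + b)


def top_n(partitions, n):
    """Top-n over all partitions without collecting every element first."""
    partials = [top_n_per_partition(p, n) for p in partitions]
    return reduce(lambda a, b: merge_top_n(a, b, n), partials, [])


# Hypothetical data: three "partitions" standing in for a distributed dataset.
partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]
print(top_n(partitions, 2))  # [9, 8]
```

The warning still makes sense for a large N: the final result, and every partial result, holds up to N elements in memory.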

Livy is out of memory

Spark jobs were failing. All of them. The data pipeline had stopped. This is a tale of high-pressure debugging.

Accessing home computer from anywhere

Do you sometimes want to access your home computer from an outside network? Maybe you use another system, but you do not trust it and would prefer your home computer for some workflows?

This post outlines the steps to make such access possible.

The program that would not go away

This post is about a program hang. The hang was in the Python process that was running Ansible scripts. The problem was hard to debug and had me go back to a Unix textbook.

Correct way to create a directory in Python

Can you see the problem with this code? It comes from Ansible, v2.1.1.0.

if not os.path.exists(value):
    os.makedirs(value, 0o700)


It’s quite straightforward. It checks if a directory path exists. If it does not, then it creates the directory path, similar to mkdir -p. What could be wrong?
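One possible answer, assuming the concern is the gap between the check and the creation: another process can create the directory after os.path.exists() returns False, and os.makedirs() then raises an error. The sketch below shows a common way to avoid that gap in Python 3.2 and later. The helper name `ensure_dir` is hypothetical and is not taken from Ansible.

```python
import os
import tempfile


def ensure_dir(path, mode=0o700):
    """Create path and any missing parents; do nothing if it already exists.

    exist_ok=True lets os.makedirs tolerate an existing directory, so there
    is no separate existence check for another process to race against.
    It still raises FileExistsError if path exists but is not a directory.
    """
    os.makedirs(path, mode=mode, exist_ok=True)
    return path


# Calling it twice is safe: the second call finds the directory and returns.
base = tempfile.mkdtemp()
target = os.path.join(base, "a", "b")
ensure_dir(target)
ensure_dir(target)
print(os.path.isdir(target))  # True
```

On Python 2, which lacks exist_ok, the usual equivalent is to call os.makedirs() inside a try block and ignore the error only when its errno is EEXIST and the path is a directory.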