This post collects notes from the second half of this year. Getting a streaming model working and ready for production involved a lot of learning. We used Spark Structured Streaming and wrote the code in Scala. Our model was stateful, and both our source and sink were Kafka.
Privacy in today's age with a SOCKS proxy
Say you are at a cafe, and you want to surf the Web. But the WiFi is not secure. Or say your company lets you bring your laptop, but what if its firewall has blocked your favorite website? Is there no hope, besides paying $15 to a VPN provider?
There is, and it costs about $3.50 per month as of this writing.
The Spark tunable that gave me 8X speedup
There are many configuration tunables in Spark. However, if you have time for only one, set this one. It made a streaming application we run process data 8X faster. That’s a 700% improvement (eight times the throughput), no code change needed!
Getting top-N elements in Spark
The documentation for the PySpark top() function has this warning:
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
This piqued my interest: why would you need to bring all the data to the driver, if all you need is a few top elements?
The answer is: it does not load all the data into the driver’s memory.
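One way to see why only a few elements need to reach the driver is a plain-Python sketch of the general technique: reduce each partition to its own top N, then merge those small candidate lists. This is an illustration under assumptions, not PySpark's actual code; partitions are modeled as ordinary lists and `top_n` is a hypothetical name.

```python
import heapq
from functools import reduce

def top_n(partitions, n):
    """Return the n largest items across all partitions, largest first."""
    # Step 1 (would run on the executors): shrink each partition
    # to at most n candidates.
    per_partition = [heapq.nlargest(n, part) for part in partitions]
    # Step 2: merge the candidate lists pairwise, keeping only n items
    # after each merge, so no merged list ever exceeds 2 * n items.
    return reduce(lambda a, b: heapq.nlargest(n, a + b), per_partition, [])

# Three "partitions" standing in for a distributed dataset (made-up data).
partitions = [[5, 1, 9, 3], [7, 2, 8], [4, 6, 10]]
print(top_n(partitions, 3))  # [10, 9, 8]
```

With this approach the amount of data collected depends on N and the number of partitions, not on the size of the dataset.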
Livy is out of memory
Spark jobs were failing. All of them. The data pipeline had stopped. This is a tale of high-pressure debugging.
Accessing home computer from anywhere
Do you sometimes want to access your home computer from an outside network? Maybe you use another system, but you do not trust it and would prefer your home computer for some workflows?
This post outlines the steps to make such access possible.
The program that would not go away
This post is about a program hang. The hang was in the Python process that was running Ansible scripts. The problem was hard to debug and sent me back to my Unix textbook.
Correct way to create a directory in Python
Can you see the problem with this code? It comes from Ansible, v2.1.1.0.
    if not os.path.exists(value):
        os.makedirs(value, 0o700)
It’s quite straightforward. It checks if a directory path exists. If it does not, then it creates the directory path, similar to mkdir -p. What could be wrong?
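The excerpt stops before giving the answer, so the following is an assumption about where it is heading. One well-known weakness of the check-then-create pattern is a race: another process can create the directory after the `os.path.exists` check and before `os.makedirs` runs, which makes `os.makedirs` raise. A minimal sketch of a race-tolerant alternative on Python 3, where `ensure_dir` is a hypothetical helper name:

```python
import os
import tempfile

def ensure_dir(path, mode=0o700):
    """Create path (and any missing parents); do nothing if it already exists."""
    # exist_ok=True (Python 3.2+) suppresses the error raised when the
    # directory already exists, so no separate existence check is needed.
    os.makedirs(path, mode, exist_ok=True)

# Usage: calling it twice on the same path is safe.
base = tempfile.mkdtemp()
target = os.path.join(base, "a", "b")
ensure_dir(target)
ensure_dir(target)
print(os.path.isdir(target))
```

On Python 2, which has no `exist_ok` parameter, the usual equivalent is to call `os.makedirs` inside a `try` block and ignore an `OSError` whose `errno` is `errno.EEXIST`.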
Getting rid of unused virtual disks on XenServer
A continuous test server I’d set up had stopped working. The XenServer on which it was running had a 1TB disk, and it was full. What was going on?
Log rotation, no code change needed
This post shows you how to rotate old logs from your application. There is no change to application code. There is no specialized logging library or framework needed. It works for any language, on any standard Unix platform.