Technology


This section of the site showcases some blog posts and artifacts from software engineering and machine learning. I also throw in some advice for free where it makes sense.

I started as a software engineer in 2003 and moved to engineering management in 2023, so these articles are now in cold storage.

Machine Learning

I have some implementation notes if you’d like to use large language models (LLMs) at your company.

A behind-the-scenes look at an example machine-learning model that we could run on a global scale on Google Earth Engine. It worked because:

  • Google provided programming primitives for parallelism,
  • Google had the infrastructure to run our code in parallel, and
  • random forest as an algorithm is well suited to scale because you can grow the trees in parallel. (This doesn’t work with gradient-boosted trees, for example, where each tree depends on the ones before it.)
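To make the last point concrete, here is a minimal sketch in plain Python. It is not the Earth Engine code, and the names (`train_stump`, `train_forest`, `predict`) are made up for illustration. Each "tree" is a one-split stump trained on its own bootstrap sample, so the training tasks share nothing and can be handed to any parallel executor. A boosted ensemble could not be trained this way, because each boosted tree is fit to the errors of the trees before it.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_stump(task):
    """Fit one depth-1 tree (a stump) on its own bootstrap sample."""
    rows, labels, seed = task
    rng = random.Random(seed)  # per-tree seed keeps the result reproducible
    n = len(rows)
    sample = [rng.randrange(n) for _ in range(n)]  # draw n rows with replacement
    feature = rng.randrange(len(rows[0]))  # each stump looks at one random feature
    best = None
    for i in sample:
        threshold = rows[i][feature]
        # Errors if we predict 1 whenever the feature exceeds the threshold.
        errors = sum((rows[j][feature] > threshold) != (labels[j] == 1) for j in sample)
        above = 1
        if errors > n - errors:  # the flipped rule does better
            errors, above = n - errors, 0
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, above)
    return best[1:]  # (feature, threshold, label predicted above the threshold)


def train_forest(rows, labels, n_trees=9, workers=4):
    """Train all stumps concurrently; no stump needs another stump's output."""
    tasks = [(rows, labels, seed) for seed in range(n_trees)]
    # Threads keep the sketch simple. For CPU-bound training in CPython,
    # a process pool or a cluster is what gives a real speed-up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, tasks))


def predict(forest, row):
    """Majority vote across the stumps."""
    votes = sum(above if row[f] > t else 1 - above for f, t, above in forest)
    return 1 if 2 * votes > len(forest) else 0


# Toy data (an assumption for the demo): label is 1 when the single feature is >= 0.5.
rows = [[i / 20] for i in range(20)]
labels = [0] * 10 + [1] * 10
forest = train_forest(rows, labels)
print(predict(forest, [0.1]), predict(forest, [0.9]))
```

Because every task carries its own seed, running the stumps one after another gives the same forest as running them in parallel, which is the property that lets this kind of training spread across many machines.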

I worked with the Department of Hydrology, UC Berkeley, to develop global-scale machine-learning models for sustainable use of water. You can read our published journal article (alternative link). Source code and system design are on my GitHub. If you want to build something similar, you can read about how we built our model.

Hobby ML Projects

With a little bit of preprocessing, SAR data and image-processing neural networks can detect irrigation in California with an accuracy of 95% (on a balanced dataset). See this post for more detail. The source notebook is available as well.

Large hailstorms can damage solar panels, so it’s useful to quantify historical large hailstorm occurrences across the US to plan new solar panel installations.

Data Processing Systems

A tutorial on running Apache Beam and Flink locally. I personally think the Beam project is not set up for success.

Processing data quickly and at scale is hard, and this falls under the realm of “streaming” data processing. I have captured some notes from developing streaming apps with Spark Structured Streaming.

Advice: avoid Spark Streaming, Structured or otherwise. Use simple apps that subscribe to the relevant event streams directly and scale with something like Kubernetes. In other words, move the complexity away from streaming, perhaps into a feature store that can be easily looked up with a regular app.
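As a rough illustration of that shape (not a production design), the sketch below uses only the Python standard library. The `queue.Queue` stands in for whatever event stream you subscribe to, the `FEATURE_STORE` dict stands in for a real feature store, and the threads stand in for replicas you would scale with something like Kubernetes. All names and the event fields are hypothetical.

```python
import queue
import threading

# Hypothetical stand-in for a feature store: precomputed values keyed by user.
FEATURE_STORE = {
    "user-1": {"orders_last_7d": 3},
    "user-2": {"orders_last_7d": 0},
}


def handle(event, features):
    """Plain per-event logic: look up precomputed features and enrich the event."""
    profile = features.get(event["user"], {})
    return {**event, "repeat_buyer": profile.get("orders_last_7d", 0) > 0}


def worker(events, results, features):
    """One replica: pull events off the stream until told to stop."""
    while True:
        event = events.get()
        if event is None:  # sentinel value used here to shut the replica down
            break
        results.append(handle(event, features))


events = queue.Queue()  # stand-in for the subscribed event stream
results = []
replicas = [
    threading.Thread(target=worker, args=(events, results, FEATURE_STORE))
    for _ in range(2)
]
for replica in replicas:
    replica.start()

for user in ["user-1", "user-2", "user-3"]:
    events.put({"user": user, "action": "checkout"})
for _ in replicas:
    events.put(None)  # one sentinel per replica
for replica in replicas:
    replica.join()

print(sorted(results, key=lambda r: r["user"]))
```

The handler holds no state of its own. Anything that needs history (here, the order count) is assumed to be computed elsewhere and read with a simple lookup, which is what keeps each replica easy to reason about and to scale.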

It is important to tune your Spark app, or at least have a look at the UI and run-time profile to see what the bottlenecks are. Sometimes it’s fun to look inside Spark and understand what makes it tick.

Advice: go at least 1 or 2 levels deep from whatever abstraction you work at.

Debugging

Debugging is the bane of the programmer. It is that hard place where reality meets expectation. As you get older, you’ll learn enough art to minimize the time you spend debugging, but it never really goes away.

Often, I have had to debug because my mental model of the system was different from how it actually behaved. I’ve seen disks running full due to ghostly files, and programs that wouldn’t die. Apparently this happened to a library I was using as well.
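One common way to get "ghostly" files on a POSIX system, which may or may not be the cause in the post linked above, is deleting a file that a process still holds open. The name disappears from the directory, but the data keeps occupying disk space until the last open descriptor is closed. A small sketch of that behavior, assuming Linux or another POSIX system:

```python
import os
import tempfile

# Create a real file and write some data through a raw file descriptor.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)

# Remove the name. Directory listings no longer show the file...
os.unlink(path)
still_listed = os.path.exists(path)

# ...but the open descriptor still refers to the data on disk.
size_after_delete = os.fstat(fd).st_size

print(still_listed, size_after_delete)

# The space is only released once the last descriptor is closed.
os.close(fd)
```

This is why a disk can look full even though the visible files do not add up to its size: a long-running process may be holding deleted log files open.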

Sometimes the problem is due to a low-level, unhelpful error message. At other times, there is an unusual situation or an interaction of bad assumptions.

Some problems are particularly hard, such as those that occur once in a blue moon, or that happen only on your computer and nowhere else. A simple program that exercises the interesting bits can help a lot.

I’ve now worked in the industry long enough to encounter issues due to processor architecture or bugs in operating system code. Sometimes there’s no alternative to systematic experiments and reading the kernel sources.

Yet, my advice is that if you see a problem, always start by assuming the problem is in your code or in your understanding of the system. You should translate the current debugging problem into one of information: what information is currently missing that you should add or obtain in order to narrow down the cause and make progress?

Missing Manuals

A few articles on accessing your home computer from the Internet, browsing securely when on an open Wi-Fi network, and rotating logs the easy way.

Meta

Selection effect is all around us, if we care to consider it deeply enough.

I think graph data structures are the scalable way to absorb knowledge.

If lazy evaluation works for Spark, it can work for you too.

Digital Attic

Linux and Unix in the early 2000s as I knew it.

The following are some projects I did (2018-20) as part of the MIDS degree program at UC Berkeley that I’m allowed to make public: improving diversity in NYC schools, detecting fake news with ML, recognizing landmarks in Paris, and quantifying the effect of distraction.

(Last modified on February 19, 2024)