Machine Learning

This page showcases my work in data science and machine learning, both during my graduate studies at UC Berkeley and after.


I worked with the Department of Hydrology to develop machine-learning models for sustainable use of water.

Global Water Use for Irrigation

Globally, agriculture consumes 65-70% of freshwater. In this work, I predict irrigation globally for the years 2001-2015 and show how it has changed in this timeframe.

Our project uses satellite and climate data to fit a random forest model and classify the world into “high”, “low-to-medium” and “no” irrigation areas. Published in Advances in Water Resources, 2021.

Global Irrigation Extent, as per our ML Model

Choose your destination:

“I want to look at your results”:

We have the following interactive maps:

“I want to use your results”:

Maps are available:

“I want a quick summary of what you did”:

Watch our lightning talk (5 min) at the Google Geo for Good summit. Alternatively, you can read this blog post.

“I have the time, tell me more”:

You can read our published journal article (alternative link). You can also look at our model assessment map.

“I have a lot of time, I want to do this myself”:

Source code and system design are on my GitHub. If you would like to build something similar, you can read about how we built our model.

Detecting Irrigation with Radar Satellites

With a little bit of preprocessing, SAR data and image-processing neural networks can detect irrigation in California with an accuracy of 95% (on a balanced dataset).

SAR picture for a part of Central Valley, California. SAR values standardized on non-irrigated class distribution.

See this post for more detail. The source notebook is available as well.

Quantifying Large Hailstorms Across the US

Large hailstorms in the US: likelihood based on historical occurrence (1995-2020)

Large hailstorms can damage solar panels, so it’s useful to quantify historical large hailstorm occurrences across the US to plan new solar panel installations.

Blog Posts

Selection effect is all around us, if we care to consider it deeply enough.

Notes: Career in Data Science

I liked and made notes from Build a Career in Data Science as a set of mind maps. Click on the arrows in the nodes to go to linked maps.

Data Science Portfolio

Below are some projects I did as part of MIDS coursework at UC Berkeley.

Improving diversity in NYC schools

Which schools in New York need the most help to improve diversity in the SHSAT exam?

The SHSAT is a competitive exam for admission to specialized high schools in NYC. The students it admits are currently overwhelmingly White and Asian. We check which schools with Black and Latino populations hold potential but need help.

KNN, Random Forests, Logistic Regression, and Perceptron Neural Network.

Report (public)

Detecting fake news

Can we detect fake news using machine learning?

We try both classical techniques and recurrent neural networks. Within the classical realm, we use “AutoML”, which searches for the best-performing machine learning algorithm. Our best accuracy was 81%.

Report (public)

Recognizing landmarks in Paris

We train convolutional neural networks to recognize landmarks in Paris.

We use an NVIDIA GPU to train the network. We train with transfer learning (bottlenecks) as well as retraining pre-trained models.

MobileNet, Inception v3, AlexNet, VGG16.

The best performing model (MobileNet) had 95% accuracy.

Report, Presentation (public)

Can distractions affect your focus?

A statistical field experiment to estimate a causal effect, using a difference-in-differences, within-subject design. We recruited about 150 subjects on Amazon MTurk. We measured focus via response time and correctness, with both “relaxing” and “busy-work” distractions. We found a statistically significant effect (at the 95% confidence level) for one of our claims.
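As a rough illustration of the difference-in-differences idea (not the analysis code from the report), the sketch below compares the change in a treated condition with the change in a control condition. The response times, the variable names, and the helper `did_estimate` are all hypothetical. In a within-subject design, the "pre" and "post" measurements would come from the same people.

```python
from statistics import mean

def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences: the change in the treated condition
    minus the change in the control condition."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical response times in seconds (illustrative values only).
treat_pre = [10.0, 12.0, 11.0]
treat_post = [14.0, 15.0, 13.0]
control_pre = [10.0, 11.0, 12.0]
control_post = [11.0, 12.0, 13.0]

effect = did_estimate(treat_pre, treat_post, control_pre, control_post)
print(f"Estimated effect of the distraction: {effect:.2f} s")
```

Subtracting the control change removes drift that affects both conditions, such as subjects getting faster with practice.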

Report (public, dataset available upon request)


The following projects are private in accordance with Berkeley’s academic policy. However, I can make them available to non-students on an individual basis upon request.

Predicting success rate for online ads

Will this user click on this ad?

Online advertising is big business. Ad companies make money only when a user clicks on an ad, so they try hard to show ads on which a user is likely to click.

On a large, anonymized dataset, we run logistic regression. We get 76% accuracy, AUC 0.58.
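Accuracy and AUC measure different things, so they can diverge, for example when clicks are rare. The sketch below shows this on made-up labels and scores; whether class imbalance explains the figures above is an assumption, and the functions `accuracy` and `auc` are hypothetical helpers, not the project's code.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """Probability that a random positive outranks a random negative
    (ties count as half). Quadratic time, fine for a small demo."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical data: 8 non-clicks, 2 clicks, and scores that rank poorly.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45, 0.30, 0.20]

print("accuracy:", accuracy(labels, scores))
print("AUC:", auc(labels, scores))
```

Here every example is predicted as "no click", which is right most of the time, yet the scores do not rank clicks above non-clicks.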

Report (private, available upon request)

Effect of regulation on Internet access

Does government regulation improve Internet connectivity?

We explore whether regulation can lower costs, improve speed, or provide more people with the Internet.

Exploratory data analysis with the R statistical language.

Report (private, available upon request)

What is associated with crime?

We analyze whether factors like police density, chances of conviction, proportion of young men or minorities affect crime rates.

Linear regression with R.

Report (private, available upon request)

Classifying text into topics

We use word counts to build a text classifier. We try word sequences (n-grams) and word-frequency scores (TF-IDF) as features for a logistic regression classifier. We could classify text into newsgroups correctly 77% of the time.

Natural language processing: CountVectorizer, TfidfVectorizer. Classifier: logistic regression.
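To make the TF-IDF weighting concrete, here is a small hand-rolled sketch in plain Python. It is not the project's code: the function `tfidf` and the sample documents are hypothetical, and the formula (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization) is chosen to mirror scikit-learn's documented defaults for TfidfVectorizer, without its tokenization or n-gram options.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        }
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical mini-corpus.
docs = [
    "the rocket launch was delayed",
    "the hockey game was delayed",
    "the rocket engine failed",
]
vectors = tfidf(docs)
print(sorted(vectors[0].items(), key=lambda kv: -kv[1]))
```

Terms that appear in every document, such as "the", end up with lower weights than terms that single out one document, which is what makes them useful inputs to a classifier.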

Report (private, available upon request)

Recognizing handwritten digits

How accurately can we recognize digits written by hand? We use classical machine learning algorithms. We also check if blurring the pictures can improve accuracy.

k-nearest neighbor, naive Bayes, and Gaussian NB.

Report (private, available upon request)

Detecting poisonous mushrooms

Based on a variety of properties, we check if a mushroom is poisonous. We reduce the complexity of data and group mushrooms into poisonous and non-poisonous categories.

Principal component analysis (PCA), k-means clustering, Gaussian mixture models

Report (private, available upon request)

Filtering spam at scale

We run naive Bayes at scale with old-school Hadoop MapReduce on the Enron dataset.
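The counting step of naive Bayes fits the map and reduce pattern naturally. The sketch below simulates that pattern in a single Python process; it does not use Hadoop, and the names `mapper`, `reducer`, `train`, `classify`, the `DOC_KEY` marker, and the four sample messages are all hypothetical. Add-one (Laplace) smoothing is an assumption; the source does not say which smoothing was used.

```python
import math
from collections import defaultdict

DOC_KEY = "__doc__"  # hypothetical marker used to count documents per class

def mapper(label, text):
    """Emit ((label, word), 1) per word, plus one document count per message."""
    yield (label, DOC_KEY), 1
    for word in text.lower().split():
        yield (label, word), 1

def reducer(pairs):
    """Sum the values per key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def train(records):
    pairs = (pair for label, text in records for pair in mapper(label, text))
    return reducer(pairs)

def classify(text, counts, labels=("spam", "ham")):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    vocab = {word for (_, word) in counts if word != DOC_KEY}
    n_docs = sum(counts[(label, DOC_KEY)] for label in labels)
    scores = {}
    for label in labels:
        n_words = sum(c for (l, w), c in counts.items()
                      if l == label and w != DOC_KEY)
        score = math.log(counts[(label, DOC_KEY)] / n_docs)
        for word in text.lower().split():
            word_count = counts.get((label, word), 0)
            score += math.log((word_count + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages.
records = [
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon tomorrow"),
]
counts = train(records)
print(classify("free money", counts))
```

Because the reducer only sums counts, the same logic can be spread over many machines, with each mapper handling a slice of the messages.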

Notebook (private, available upon request)

Detecting synonyms on large data

We use Spark and document-similarity metrics to detect synonyms on Google’s n-gram corpus.

Notebook (private, available upon request)

What determines wine quality?

Is it acidity? SO2? Density? Alcohol? Something else?

We use OLS, ridge and lasso regressions to determine features that predict wine quality.

Notebook (private, available upon request)

How Google works

PageRank was the original Google search algorithm.

We implement the PageRank algorithm on a large Wikipedia link graph, with Spark.
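For readers unfamiliar with PageRank, the following is a minimal single-machine sketch of the power-iteration form of the algorithm. It is plain Python rather than Spark, and the function `pagerank`, the damping factor of 0.85, the iteration count, and the four-page toy graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict mapping each page to its outgoing links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each page passes an equal share of its rank to every page it links to.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

On a large graph, the inner loop becomes a join between ranks and links followed by a sum per target page, which is the part a framework like Spark distributes.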

Notebook (private, available upon request)

Analyzing movie-goer sentiment with neural networks

We implement a “bag-of-words” model, running on a neural network, to analyze sentiment on the Stanford Sentiment Treebank. It cannot beat a simple naive Bayes model.

Notebook (private, available upon request)

Can a computer learn a language?

We implement n-gram language models. We first build smoothed models with scikit-learn. We then ramp this up with a recurrent neural network in TensorFlow.
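A smoothed n-gram model can be sketched in a few lines. The example below is a bigram model with add-k smoothing in plain Python; the source does not say which smoothing method was used, so add-k is an assumption, and the functions `train_bigram` and `prob` and the three training sentences are hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, padding with <s> and </s> markers."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}  # words the model can predict (never <s>)
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, vocab

def prob(word, context, model, k=1.0):
    """P(word | context) with add-k smoothing over the vocabulary."""
    contexts, bigrams, vocab = model
    return (bigrams[(context, word)] + k) / (contexts[context] + k * len(vocab))

# Hypothetical training sentences.
model = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print("P(cat | the) =", prob("cat", "the", model))
print("P(dog | the) =", prob("dog", "the", model))
print("P(ran | dog) =", prob("ran", "dog", model))  # unseen bigram, still > 0
```

Smoothing matters because an unsmoothed model assigns zero probability to any word pair it has not seen, which makes whole sentences impossible.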

Notebooks (private, available upon request)

Tag sentences for grammar

We use the Viterbi algorithm to tag each word in a sentence with the part of speech in English grammar. We use a “hidden Markov model” (HMM).

Notebook (private, available upon request)

Could we have prevented the Challenger space shuttle accident?

Statistical models say yes: given the temperature at the time of launch, it was likely to fail.

Logistic regression, binomial and binary. Likelihood-ratio tests (LRTs). Profile and Wald intervals. Bootstrap.

R Notebook (private, available upon request)

Given how retailers place cereal boxes, can we predict where kids' cereals are?

Sugar and sodium were statistically significant, and sweeter cereal boxes were likely sitting on the bottom shelves.

Multinomial logistic regression. Odds ratios.

R Notebook (private, available upon request)

Forecasting e-commerce sales, given historical data

We develop a statistical time-series model.

Seasonal autoregressive integrated moving average (SARIMA). Our best model was ARIMA(0,1,0)(1,1,0)[4].
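Written out with the backshift operator $B$ (where $B y_t = y_{t-1}$), and reading the trailing 4 as the seasonal period (an assumption consistent with quarterly data), this model has one regular difference, one seasonal difference, and a single seasonal autoregressive term. The symbols $\Phi_1$ and $\varepsilon_t$ are standard notation, not taken from the notebook.

```latex
\[
  (1 - \Phi_1 B^{4})\,(1 - B)\,(1 - B^{4})\, y_t = \varepsilon_t
\]
% Equivalently, with w_t the doubly differenced series:
\[
  w_t = (y_t - y_{t-1}) - (y_{t-4} - y_{t-5}), \qquad
  w_t = \Phi_1\, w_{t-4} + \varepsilon_t
\]
```

In words: after removing the trend and the yearly pattern, each quarter's value depends only on the value from the same quarter one year earlier, plus noise.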

R Notebook (private, available upon request)

Do traffic laws really reduce deaths due to accidents?

Speed limits, seat belts, blood alcohol limits: do these reduce fatalities? (Yes.)

Panel data models with pooled data, fixed effects and random effects.

R Notebook (private, available upon request)