Data Science

This page showcases a portfolio of Data Science projects I’ve done.


Capstone project at UC Berkeley: Global Irrigation Map

We use satellite imagery, soil data, and climate data to map the worldwide extent of cropland irrigation and how it changed from 2001 to 2015.

I continued working on this after graduation, and a revised version is under review for publication.

In collaboration with the Department of Environmental Studies, UC Berkeley.

Version 2 | Version 1 | Code (all public)

Improving diversity in NYC schools

Which schools in New York need the most help to improve diversity in the SHSAT exam?

The SHSAT is a competitive entrance exam for specialized high schools in NYC. Admissions through it are currently overwhelmingly White and Asian. We identify schools with Black and Latino populations that show potential but need help.

KNN, Random Forests, Logistic Regression, and Perceptron Neural Network.

Report (public)

Detecting fake news

Can we detect fake news using machine learning?

We try both classical techniques and recurrent neural networks. Among the classical techniques, we use “AutoML”, which automatically searches for the best-performing machine learning algorithm. Our best accuracy was 81%.

Report (public)

Recognizing landmarks in Paris

We train convolutional neural networks to recognize landmarks in Paris.

We train the networks on an NVIDIA GPU. We try both transfer learning (bottlenecks) and retraining of pre-trained models.

MobileNet, Inception v3, AlexNet, VGG16.

The best performing model (MobileNet) had 95% accuracy.

Report Presentation (public)

Can distractions affect your focus?

Statistical field experiment to estimate a causal effect: we use a difference-in-differences, within-subject design. We recruited about 150 subjects on Amazon MTurk for this experiment. We measured focus via response time and correctness, with both “relaxing” and “busy-work” distractions. We found statistical significance (at the 95% confidence level) for one of our claims.
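
As an illustration of the difference-in-differences idea, here is a minimal sketch in plain Python. The function name `diff_in_diff` and all response times are made up for this example; they are not the experiment's data or code.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate of a treatment effect.

    Each argument is a list of outcomes (here: response times in seconds).
    The change in the control condition is subtracted from the change in the
    treated condition, which removes any trend shared by both.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical response times (seconds), for illustration only.
treated_before = [2.0, 2.2, 1.8, 2.0]
treated_after = [2.6, 2.8, 2.4, 2.6]
control_before = [2.1, 1.9, 2.0, 2.0]
control_after = [2.2, 2.0, 2.1, 2.1]

effect = diff_in_diff(treated_before, treated_after, control_before, control_after)
print(f"Estimated effect of the distraction: {effect:+.2f} s")
```

A full analysis would also need a significance test on this estimate; the sketch only shows how the point estimate is formed.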

Report (public, dataset available upon request)


The following projects are private in accordance with Berkeley’s academic policy. However, I can make them available to non-students on an individual basis upon request.

Predicting success rate for online ads

Will this user click on this ad?

Online advertising is big business. Ad companies make money only when a user clicks on an ad, so they try hard to show ads on which a user is likely to click.

We run logistic regression on a large, anonymized dataset and get 76% accuracy with an AUC of 0.58.
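
Accuracy and AUC measure different things, which is why a fairly high accuracy can sit next to an AUC close to 0.5. The sketch below computes both by hand on made-up scores. The data, the 0.5 threshold, and the assumption that clicks are rare are all illustrative and not taken from the project.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. This pairwise definition equals the area
    under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical, imbalanced data: 2 clicks out of 8 impressions.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.15, 0.25, 0.40, 0.22, 0.35]

print("accuracy:", accuracy(labels, scores))
print("AUC:", round(auc(labels, scores), 3))
```

Here every score is below the threshold, so the model predicts "no click" everywhere and is still right for all the non-clicks, while the AUC reflects how poorly clicks are ranked above non-clicks.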

Report (private, available upon request)

Effect of regulation on Internet access

Does government regulation improve Internet connectivity?

We explore whether regulation can lower costs, improve speed, or provide more people with the Internet.

Exploratory data analysis with R statistical language.

Report (private, available upon request)

What is associated with crime?

We analyze whether factors such as police density, chance of conviction, and the proportion of young men or minorities are associated with crime rates.

Linear regression with R.

Report (private, available upon request)

Classifying text into topics

We use word counts to build a text classifier. We try word sequences (n-grams) and word frequency scores (TF-IDF) as features for logistic regression. We classified text into newsgroups correctly 77% of the time.

Natural language processing: CountVectorizer, TfidfVectorizer. Classifier: logistic regression.
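
To show what n-gram and TF-IDF features look like, here is a small sketch in plain Python. It does not use scikit-learn, and its smoothed IDF formula is one common variant chosen for illustration; the helper names `ngrams` and `tfidf` and the two sentences are made up.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a text."""
    words = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


def tfidf(docs, n_max=2):
    """Weight each n-gram by its count times a smoothed inverse document frequency."""
    counts = [Counter(ngrams(doc, n_max)) for doc in docs]
    # Document frequency: in how many documents each n-gram appears.
    doc_freq = Counter(term for doc_counts in counts for term in doc_counts)
    n_docs = len(docs)
    return [
        {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in doc_counts.items()
        }
        for doc_counts in counts
    ]


# Hypothetical two-document corpus.
docs = ["the rocket launch was delayed", "the hockey game was delayed"]
weights = tfidf(docs)
print(sorted(weights[0].items(), key=lambda item: -item[1])[:3])
```

N-grams that appear in only one document ("rocket launch") get a higher weight than ones shared by both ("was delayed"), which is what makes them useful to a classifier.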

Report (private, available upon request)

Recognizing handwritten digits

How accurately can we recognize digits written by hand? We use classical machine learning algorithms. We also check if blurring the pictures can improve accuracy.

k-nearest neighbor, naive Bayes, and Gaussian NB.

Report (private, available upon request)

Detecting poisonous mushrooms

Based on a variety of properties, we check if a mushroom is poisonous. We reduce the complexity of data and group mushrooms into poisonous and non-poisonous categories.

Principal component analysis (PCA), k-means clustering, Gaussian mixture models
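
The grouping step can be sketched with a bare-bones k-means in plain Python. The 2-D points below are made up and stand in for features after dimensionality reduction; the deterministic initialization (first k points) is a simplification for this example, not the project's setup.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, iterations=10):
    """Basic k-means: alternate between assigning points and moving centroids."""
    centroids = list(points[:k])  # simplification: start from the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.2), (9.2, 8.8), (8.8, 9.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
```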

Report (private, available upon request)

Filtering spam at scale

We run naive Bayes at scale with old-school Hadoop MapReduce on the Enron dataset.
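
The map and reduce steps behind a naive Bayes spam filter can be simulated in plain Python, as in this sketch. It does not use Hadoop; the `mapper`, `reducer`, `run_job`, and `classify` names, the `DOC_KEY` marker, and the four training messages are all hypothetical.

```python
import math
from itertools import groupby

DOC_KEY = "__doc__"  # hypothetical marker used to count documents per label


def mapper(label, text):
    """Emit ((word, label), 1) for each word, plus one document count per message."""
    yield (DOC_KEY, label), 1
    for word in text.lower().split():
        yield (word, label), 1


def reducer(key, values):
    """Sum all counts emitted for the same key."""
    return key, sum(values)


def run_job(records):
    """Simulate map, shuffle/sort, and reduce on one machine."""
    pairs = [kv for label, text in records for kv in mapper(label, text)]
    pairs.sort(key=lambda kv: kv[0])
    grouped = groupby(pairs, key=lambda kv: kv[0])
    return dict(reducer(key, [v for _, v in group]) for key, group in grouped)


def classify(text, counts, labels=("spam", "ham")):
    """Pick the label with the highest log-probability, using add-one smoothing."""
    vocab = {word for (word, _label) in counts if word != DOC_KEY}
    n_docs = sum(counts[(DOC_KEY, label)] for label in labels)
    scores = {}
    for label in labels:
        total_words = sum(
            c for (word, lab), c in counts.items() if lab == label and word != DOC_KEY
        )
        score = math.log(counts[(DOC_KEY, label)] / n_docs)
        for word in text.lower().split():
            score += math.log((counts.get((word, label), 0) + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training messages.
records = [
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham", "meeting schedule today"),
    ("ham", "project meeting notes"),
]
counts = run_job(records)
print(classify("free money", counts), classify("meeting notes", counts))
```

The counting is the part that parallelizes: each mapper only sees its own messages, and the reducer only needs all values for one key.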

Notebook (private, available upon request)

Detecting synonyms on large data

We use Spark and document-similarity metrics to detect synonyms on Google’s n-gram corpus.
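
One way to read "similar documents" for words is to treat each word's co-occurring words as its document and compare those with cosine similarity. The sketch below does that in plain Python on a handful of made-up n-gram counts. It is an assumption about the general approach, not the Spark implementation.

```python
import math
from collections import defaultdict


def context_vectors(ngram_counts):
    """Map each word to counts of the other words it appears with in an n-gram."""
    vectors = defaultdict(lambda: defaultdict(int))
    for ngram, count in ngram_counts.items():
        words = ngram.split()
        for word in words:
            for other in words:
                if other != word:
                    vectors[word][other] += count
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical n-gram counts.
ngram_counts = {
    "a big house": 10,
    "a large house": 8,
    "a big dog": 5,
    "a large dog": 4,
    "the dog barked": 6,
}
vectors = context_vectors(ngram_counts)
print(round(cosine(vectors["big"], vectors["large"]), 3))
print(round(cosine(vectors["big"], vectors["barked"]), 3))
```

Words that occur in the same contexts ("big" and "large") score close to 1, while unrelated words score much lower.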

Notebook (private, available upon request)

What determines wine quality?

Is it acidity? SO2? Density? Alcohol? Something else?

We use OLS, ridge and lasso regressions to determine features that predict wine quality.

Notebook (private, available upon request)

How Google works

PageRank was the original Google search algorithm.

We implement the PageRank graph algorithm on a large Wikipedia dataset with Spark.
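
For reference, PageRank on a small graph fits in a few lines of plain Python. This is a single-machine sketch using power iteration, not the Spark version; the damping factor of 0.85, the iteration count, and the four-page graph are illustrative choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `links` maps each page to the list of pages it links to. Pages with no
    outgoing links (dangling pages) spread their rank evenly over all pages.
    """
    nodes = set(links) | {target for targets in links.values() for target in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        # Every page gets the teleport share plus an equal share of dangling rank.
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

On a large dataset the inner loop becomes a join between ranks and links followed by a sum per target page, which is the part a distributed framework takes over.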

Notebook (private, available upon request)

Analyzing movie-goer sentiment with neural networks

We implement a “bag-of-words” model, running on a neural network, to analyze sentiment on the Stanford Sentiment Treebank. It cannot beat a simple naive Bayes model.

Notebook (private, available upon request)

Can a computer learn a language?

We implement n-gram language models. We first build smoothed models with scikit-learn. We then scale this up with a recurrent neural network in TensorFlow.
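
A smoothed n-gram model can be sketched in plain Python. The example below is a bigram model with add-k smoothing; the class name `BigramModel`, the choice of k = 1, and the three training sentences are assumptions made for illustration, not the notebook's code.

```python
from collections import Counter


class BigramModel:
    """Bigram language model with add-k smoothing."""

    def __init__(self, sentences, k=1.0):
        self.k = k
        self.bigrams = Counter()   # counts of (previous word, word)
        self.contexts = Counter()  # how often each word appears as a previous word
        self.vocab = set()         # every token that can be predicted
        for sentence in sentences:
            tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.bigrams.update(zip(tokens, tokens[1:]))
            self.contexts.update(tokens[:-1])
            self.vocab.update(tokens[1:])

    def prob(self, prev, word):
        """P(word | prev), with k added to every count so unseen pairs are nonzero."""
        numerator = self.bigrams[(prev, word)] + self.k
        denominator = self.contexts[prev] + self.k * len(self.vocab)
        return numerator / denominator


# Hypothetical training sentences.
model = BigramModel(["the cat sat", "the dog sat", "the cat ran"])
print(round(model.prob("the", "cat"), 3))  # seen twice after "the"
print(round(model.prob("the", "ran"), 3))  # never seen after "the", still nonzero
```

Without smoothing, any sentence containing an unseen word pair would get probability zero; adding k to every count trades a little accuracy on seen pairs for robustness on unseen ones.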

Notebooks (private, available upon request)

Tag sentences for grammar

We use the Viterbi algorithm to tag each word in a sentence with the part of speech in English grammar. We use a “hidden Markov model” (HMM).
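
The Viterbi algorithm finds the most probable tag sequence by keeping, for each position and tag, only the best path that ends there. Here is a minimal sketch in plain Python; the three tags and all probabilities are made-up toy values, not estimates from a corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under an HMM."""
    # best[i][tag] = (probability of the best path ending in `tag` at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                ((best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p) for p in tags),
                key=lambda candidate: candidate[0],
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Hypothetical toy HMM.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit_p = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.5, "barks": 0.1, "walk": 0.4},
    "VERB": {"barks": 0.6, "dog": 0.1, "walk": 0.3},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
```

For long sentences, a practical implementation would sum log-probabilities instead of multiplying probabilities, to avoid numerical underflow.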

Notebook (private, available upon request)

Could we have prevented the Challenger space shuttle accident?

Statistical models say yes: given the temperature at the time of launch, a failure was likely.

Logistic regression, binomial and binary. Likelihood-ratio tests (LRTs). Profile and Wald intervals. Bootstrap.

R Notebook (private, available upon request)

Given how retailers place cereal boxes, can we predict where kids' cereals are?

Sugar and sodium were statistically significant predictors, and boxes of sweeter cereals were likely to be on the bottom shelves.

Multinomial logistic regression. Odds ratios.

R Notebook (private, available upon request)

Forecasting e-commerce sales, given historical data

We develop a statistical time-series model.

Seasonal autoregressive integrated moving average (SARIMA). Our best model was ARIMA(0,1,0)(1,1,0)[4], that is, with a seasonal period of 4.
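
Reading the model under the usual (p,d,q)(P,D,Q) convention with seasonal period 4, it applies one regular and one seasonal difference and then a single seasonal autoregressive term. The sketch below spells that out in plain Python. The coefficient `phi` and the series are made up, since the fitted values are not given here, and the function names are hypothetical.

```python
def difference(series, lag=1):
    """Difference a series at the given lag: y[t] - y[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def forecast_next(series, phi, period=4):
    """One-step forecast for ARIMA(0,1,0)(1,1,0) with the given seasonal period.

    With w[t] = (y[t] - y[t-1]) - (y[t-period] - y[t-period-1]), the model is
    w[t] = phi * w[t-period] + noise. The forecast undoes both differences.
    Needs at least 2 * period + 1 observations.
    """
    w = difference(difference(series, lag=period), lag=1)
    w_next = phi * w[-period]
    return w_next + series[-1] + series[-period] - series[-period - 1]


# Hypothetical quarterly-style series: linear trend plus a repeating pattern.
pattern = [10, 20, 15, 5]
series = [t + pattern[t % 4] for t in range(12)]

print(forecast_next(series, phi=-0.5))
```

Because this toy series is an exact trend plus an exact seasonal pattern, the doubly differenced series is all zeros and the forecast simply continues the pattern.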

R Notebook (private, available upon request)

Do traffic laws really reduce deaths due to accidents?

Speed limits, seat belts, blood alcohol limits: do these reduce fatalities? (Yes.)

Panel data models with pooled data, fixed effects and random effects.

R Notebook (private, available upon request)

Page under construction. More projects on the way!