I work on big, dirty, unstructured, observational data. Examples include, cell phone calls, military intelligence, text from scientific articles, knowledge bases like wikidata, newspaper reports, and raw natural images. Here below are some example projects that have emerged from that work.
How to be Curious Instead of Contrarian About COVID-19: Eight Data Science Lessons From ‘Coronavirus Perspective’ (Epstein 2020), March 30, 2020 [Link]
Crowd-sourced COVID-19 Dataset Tracking Involuntary Government Restrictions (TIGR), March 2020, [Github]
We show which natural language geo-referencing strategy you choose determines what downstream econometric result you’ll find. We develop a dataset of ten thousand events from the Mau Mau rebellion, drawn from twenty thousand pages of historical intelligence documents. We apply over a dozen geo-referencing strategies, and benchmark them against a known ground-truth in the form of exact military grid coordinates which were available for a subset of the reports.
“Understanding Civil War Violence through Military Intelligence: Mining Civilian Targeting Records from the Vietnam War” Chapter in C.A. Anderton and J. Brauer, eds., Economic Aspects of Genocides, Mass Atrocities, and Their Prevention. New York: Oxford University Press, 2016
I investigate a contemporary government database of civilians targeted during the Vietnam War. The data are detailed, with up to 45 attributes recorded for 73,712 individual civilian suspects. I employ an unsupervised machine learning approach of cleaning, variable selection, dimensionality reduction, and clustering. I find support for a simplifying typology of civilian targeting that distinguishes different kinds of suspects and different kinds targeting methods.
“Why Not Divide and Conquer? Targeted Bargaining and Violence in Civil War” Dissertation, 2012, Princeton University, Department of Politics
“MINING THE GAPS: A Text Mining-Based Meta-Analysis of the Current State of Research on Violent Extremism” (with Candace Rondeaux)
We apply topic modeling to a unique corpus of 3,000 expert curated articles on violent extremism
Awarded Best Scientific Paper in the Data for Development (D4D) competition at NetMob 2013, MIT, Cambridge, MA (1-3, May 2013); Conference Preprint
“High resolution population estimates from telecommunications data” with (David A Meyer, Megha Ram, David Rideout, and Dongjin Song) EPJ Data Science 2015, 4:4
Top 10 finalist of 652 projects in Telecom Italia Big Data Challenge 2014.
Population censuses are expensive and infrequent, but cell phone data are plentiful and real-time; how can we use one to estimate the other? We investigate the relationship between calling activity and demography at a very high 235 square meter resolution in Northern Italy.