About Me

I’m a computational social scientist and director of the Machine Learning for Social Science Lab (MSSL). I’m based at the Center for Peace and Security Studies (cPASS), Department of Political Science, University of California San Diego (UCSD). Previously I held postdoctoral positions in Political Science (UCSD), Mathematics (UCSD), and History (Columbia). Before that I studied at Princeton (MA/PhD) and the University of Texas (BA).

[Curriculum vitae], [Github], [Twitter Feed]


I work on big, dirty, unstructured, observational data. Examples include cell phone calls, military intelligence, text from scientific articles, knowledge bases like Wikidata, newspaper reports, and raw natural images. Below are some example projects that have emerged from that work.


The Data Science of COVID-19 Spread: Some Troubling Current and Future Trends (with Thomas Scherer, Erik Gartzke), Peace Economics, Peace Science and Public Policy, August 17, 2020

[Paper] [Paper Open Access] [Media: Wired] [Media: Slate] [Media: king5]

How to be Curious Instead of Contrarian About COVID-19: Eight Data Science Lessons From ‘Coronavirus Perspective’ (Epstein 2020), March 30, 2020

[Paper] [Media: LA Times]

Crowd-sourced COVID-19 Dataset Tracking Involuntary Government Restrictions (TIGR), March 2020


I post semi-regular reviews of COVID-19 papers and literature reviews on Twitter (30 and counting) in this [thread].

Computational Replications

[‘Substantial underestimation of SARS-CoV-2 infection in the United States’ (Wu et al. 2020)]


“Measuring the Landscape of Civil War” (with Kristen Harkness) Journal of Peace Research, February 15, 2018

[Github], [Ungated Paper], [Ungated Appendix], [Gated Paper]

We show that the natural language geo-referencing strategy you choose determines which downstream econometric result you find. We develop a dataset of ten thousand events from the Mau Mau rebellion, drawn from twenty thousand pages of historical intelligence documents. We apply over a dozen geo-referencing strategies and benchmark them against a known ground truth: exact military grid coordinates, which were available for a subset of the reports.

“Understanding Civil War Violence through Military Intelligence: Mining Civilian Targeting Records from the Vietnam War” Chapter in C.A. Anderton and J. Brauer, eds., Economic Aspects of Genocides, Mass Atrocities, and Their Prevention. New York: Oxford University Press, 2016

[Ungated arXiv preprint]

I investigate a contemporary government database of civilians targeted during the Vietnam War. The data are detailed, with up to 45 attributes recorded for 73,712 individual civilian suspects. I employ an unsupervised machine learning approach of cleaning, variable selection, dimensionality reduction, and clustering. I find support for a simplifying typology of civilian targeting that distinguishes different kinds of suspects and different kinds of targeting methods.

“Why Not Divide and Conquer? Targeted Bargaining and Violence in Civil War” Dissertation, 2012, Princeton University, Department of Politics

[Ungated Dissertation Print]

“MINING THE GAPS: A Text Mining-Based Meta-Analysis of the Current State of Research on Violent Extremism” (with Candace Rondeaux)

[Ungated PDF]

We apply topic modeling to a unique corpus of 3,000 expert-curated articles on violent extremism.


“Analyzing Social Divisions using Cell Phone Data” (with Orest Bucicovschi, Rex W. Douglass, David A. Meyer, Megha Ram, David Rideout, Dongjin Song)

[Ungated Conference Preprint]

Awarded Best Scientific Paper in the Data for Development (D4D) competition at NetMob 2013, MIT, Cambridge, MA (May 1-3, 2013).

“High resolution population estimates from telecommunications data” (with David A. Meyer, Megha Ram, David Rideout, and Dongjin Song) EPJ Data Science 2015, 4:4

[Ungated Paper at EPJ]

Top 10 finalist among 652 projects in the Telecom Italia Big Data Challenge 2014.

Population censuses are expensive and infrequent, but cell phone data are plentiful and real-time; how can we use one to estimate the other? We investigate the relationship between calling activity and demography at a very high 235 square meter resolution in Northern Italy.


I teach a brief course on Machine Learning for new members of my lab and elsewhere by invitation.