"Understanding Civil War Violence through Military Intelligence" (Douglass 2016)

Continue reading…

“High Resolution Population Estimates from Telecommunications Data” (Douglass et al. 2015)

Continue reading…

Five Fast and Replicable Spatial Tasks with PostGIS

There is a crisis of replicability in scientific results that involve spatial data because important calculations are often carried out by hand in proprietary software. Without code to serve as a paper trail, important steps in the analysis become difficult to check for errors, alter for testing, or even document adequately. Performing spatial tasks by hand is also tedious, which discourages researchers from running standard robustness checks, adding additional measures, or trying new things. One solution to both problems is to encourage the use of spatially aware languages and databases. This brief post outlines how to perform five common spatial tasks using one of the best tools available: PostGIS, the spatial extension to the popular and free PostgreSQL database.
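As a taste of the approach, here is a hedged sketch of one such task, a point-in-polygon join counting events per district, run from Python with psycopg2. The database credentials and the table and column names are placeholders, not anything from the post itself.

```python
# Sketch: count events per district with a PostGIS point-in-polygon join.
# Connection details and table/column names are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(dbname="spatial_db", user="analyst")  # placeholder credentials
cur = conn.cursor()

cur.execute("""
    SELECT d.district_id, COUNT(e.*) AS n_events
    FROM districts AS d
    LEFT JOIN events AS e
      ON ST_Contains(d.geom, e.geom)
    GROUP BY d.district_id
    ORDER BY n_events DESC;
""")
for district_id, n_events in cur.fetchall():
    print(district_id, n_events)

conn.close()
```

Because the join lives in a query rather than a point-and-click tool, it can be re-run, versioned, and checked like any other piece of code.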

Continue reading…

Spatial Hexagon Binning in PostGIS

Hexagon binning is an important tool for visualizing spatial relationships, but it is a pain to implement. Current solutions involve partial implementations for PostGIS and workarounds based on generating hexagons in QGIS with the MMQGIS plugin. In this brief post, I provide a SQL function for quickly generating hexagon polygon layers in PostGIS at any size, region of interest, and projection.

Continue reading…

Making GIS Products for Historical Periods

Developing GIS products for older historical periods can be tricky. The further back in time you go, the less comprehensive the map coverage, the less accurate the surveying, and the more obscure the projections. In a project with Kristen Harkness and others, we are recovering and mapping events during an insurgency that took place 60 years ago in colonial Kenya, the Mau Mau Rebellion. The episode is surprisingly data rich and relevant for modern counterinsurgency debates, but making sense of such an old case with modern econometric tools means solving very practical GIS problems, like how to develop a period-accurate map of roads and infrastructure. This post details a two-pronged approach we developed that should be widely applicable to other cases.

Continue reading…

Parsing XML to a Data Frame: Recovering the Worldwide Incidents Tracking System (WITS)

The Worldwide Incidents Tracking System (WITS) was a database of global terrorism events compiled by the National Counterterrorism Center (NCTC) until 2012. At the end it contained 68,939 records, each with a short synopsis of the event, and it is thus still of interest to conflict scholars. Unfortunately, the database is now defunct and getting a copy can be difficult. In this brief post, I show how to download the original XML copy available on the Internet Archive Wayback Machine and parse it to a flat CSV file.
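The outline of that workflow, as a rough sketch: pull the archived XML and flatten each incident element into a CSV row. The Wayback Machine URL and the XML tag names below are placeholders, not the actual WITS schema.

```python
# Sketch of the download-and-flatten step; the archive URL and the
# "Incident" tag are placeholders, not the real WITS layout.
import csv
import urllib.request
import xml.etree.ElementTree as ET

ARCHIVE_URL = "https://web.archive.org/web/.../wits.xml"  # placeholder

xml_bytes = urllib.request.urlopen(ARCHIVE_URL).read()
root = ET.fromstring(xml_bytes)

# One flat dict per incident: child tag name -> text.
rows = []
for incident in root.iter("Incident"):  # hypothetical record tag
    rows.append({child.tag: (child.text or "").strip() for child in incident})

fieldnames = sorted({key for row in rows for key in row})
with open("wits_flat.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```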

Continue reading…

Quickly Sampling Realistic Functions in R

In a follow-up to my post about generating lots of real-world random data in R, this brief post shows how to generate lots of realistic functions. By sampling from the PDF and CDF of real-world data, you can quickly generate all manner of continuous and step functions for further experimentation.
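The post itself works in R; the snippet below is only a rough Python analogue of the core idea, turning the empirical CDF of any observed sample into a reusable step or interpolated function.

```python
# Rough Python analogue: treat the empirical CDF of an observed sample as a
# reusable function. `sample` here is simulated stand-in data, not real data.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(size=500)  # stand-in for a real-world variable

xs = np.sort(sample)
cdf = np.arange(1, len(xs) + 1) / len(xs)

def ecdf_step(x):
    """Step-function ECDF evaluated at x."""
    return np.searchsorted(xs, x, side="right") / len(xs)

def ecdf_smooth(x):
    """Piecewise-linear interpolation of the same ECDF."""
    return np.interp(x, xs, cdf)

grid = np.linspace(xs.min(), xs.max(), 5)
print(ecdf_step(grid), ecdf_smooth(grid))
```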

Continue reading…

An Unspoken Job Perk of Data Science: Artistic Mistakes

Occasionally I end up making something that is unintentionally beautiful but that will never end up in a paper. I call these artistic mistakes, and below is a collection of some of the prettier ones. Most of them come from my experiments with machine vision, particularly transformations of map images that produced wildly unexpected results.

Continue reading…

Quickly Generating Lots of Realistic Random Data in R

In this brief post, I show a trick for quickly assembling arbitrarily large samples of real-world data by sampling from all of the data sets included in R packages.

Continue reading…

Comparing Bivariate Plots Under Different Assumptions

I’m a strong proponent of graphical comparisons before diving into models, but which exploratory plot to use depends heavily on the underlying distributions of the data and which signals you’re looking for. Below I compare four kinds of bivariate plots for continuous variables (binning into 10 quantiles with boxplots, a scatter plot with a lowess fit, hexbins with a greyscale color scheme, and a 2D density with a rainbow color scheme). I show that each has strengths and weaknesses in recovering five functional forms (piece-wise linear stair-step, piece-wise linear, exponential, quadratic, and sinusoidal) across three distributions of the covariates (uniform, normal, and mixed exponential and normal).
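For a flavor of the comparison, here is a minimal sketch of two of the four plot types, a scatter plot with a lowess fit and a greyscale hexbin, on one simulated quadratic relationship with uniform covariates; the simulated data stand in for the post’s actual setups.

```python
# Minimal sketch: scatter + lowess versus greyscale hexbin on one
# simulated quadratic relationship with uniform covariates.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 5000)
y = x**2 + rng.normal(scale=2, size=x.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(x, y, s=3, alpha=0.2)
fit = lowess(y, x, frac=0.2)               # sorted (x, yhat) pairs
ax1.plot(fit[:, 0], fit[:, 1], color="red")
ax1.set_title("Scatter + lowess")

hb = ax2.hexbin(x, y, gridsize=40, cmap="Greys")
fig.colorbar(hb, ax=ax2)
ax2.set_title("Hexbin (greyscale)")

plt.tight_layout()
plt.show()
```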

Continue reading…

Quick Land Cover Estimates from Satellite Imagery and OpenStreetMap

Land cover (land use) estimates assign points or regions on the earth’s surface to classes like forest, farmland, urban development, and so on. There are hundreds of land cover data sets and methods covering different regions, time periods, and special topics. In a paper under review, my coauthors and I test methods of estimating population density at very high resolutions (235×235 meters) using real-time telecommunications data. For that paper, I developed a custom land cover map of Milan and the surrounding area using the latest available satellite images from Landsat 8, training labels drawn from the community-curated OpenStreetMap database, and a random forest classifier. It was a great quick-and-dirty way to get a very recent land cover map for a specific use, and I outline the details below.
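A hedged sketch of just the classification step is below; it assumes you already have per-pixel Landsat band values and OSM-derived class labels, which are faked here with random arrays rather than taken from the paper.

```python
# Sketch of the classification step only. `bands` stands in for an
# (n_pixels, n_bands) array of Landsat 8 values and `labels` for
# OSM-derived classes on a labeled subset; both are faked here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
bands = rng.random((10_000, 7))       # stand-in band values
labels = rng.integers(0, 4, 10_000)   # stand-in land cover classes

X_train, X_test, y_train, y_test = train_test_split(
    bands, labels, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Predicting over the full raster gives one class per pixel, which can
# then be reshaped back onto the image grid to form the land cover map.
full_map = clf.predict(bands)
```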

Continue reading…

Training Neural Networks on the GPU: Installation and Configuration

Neural networks are fantastic tools for classification and regression, but they are slow to train because they depend on gradient descent across thousands or even millions of parameters. They are in fact a relatively old idea that has recently come back into vogue, in part because of speed increases in modern CPUs and particularly the large-scale parallelization available in GPUs. With open source software and commodity hardware, the cost of learning and building useful neural networks is now extremely low. This post describes how I built a dedicated rig for testing neural networks for only a few hundred dollars and had it running in less than a day. It also serves as a how-to guide for avoiding some pitfalls in configuration.

Continue reading…

Top 10 Finalist in the Telecom Italia Big Data Challenge for "(Dis)assembling Milan with Big Data"

Congratulations to my group at UCSD (David A. Meyer, Megha Ram, David Rideout, and Dongjin Song) for being selected as a top 10 finalist out of 652 teams in the Telecom Italia Big Data Challenge 2014. Check out the UCSD press release describing the project. My corner of the project focuses on using cell phone, text, and internet data to create super fine-grained estimates of urban population at the 235 meter by 235 meter grid square level. The problem has three challenging components: variable selection in extremely wide data, correctly estimating scale (in)variance, and interpolating down to smaller geographic scales.

Continue reading…

Fastest Random Forest: Sklearn?

I was tipped off by a GitHub thread that the development version of the Random Forest Classifier in sklearn (0.15-dev) had major speed improvements. I built a small benchmark using the MNIST handwriting data set and compared the training and prediction speeds of sklearn (0.14.1), sklearn (0.15-dev), and WiseRF (1.1). At least in this small test, the development version of sklearn is the king, besting WiseRF in both training and prediction.
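A timing harness in the same spirit might look like the sketch below; it uses scikit-learn’s bundled digits data as a lightweight stand-in for MNIST rather than the benchmark’s actual setup.

```python
# Small timing harness: train/predict wall-clock times for a random forest,
# using the bundled digits data as a lightweight stand-in for MNIST.
import time
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

t0 = time.time()
clf.fit(X_train, y_train)
train_time = time.time() - t0

t0 = time.time()
pred = clf.predict(X_test)
predict_time = time.time() - t0

print(f"train: {train_time:.2f}s  predict: {predict_time:.2f}s  "
      f"accuracy: {(pred == y_test).mean():.3f}")
```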

Continue reading…

Parsing XML Files to a Flat Data Frame

Markup languages like XML are really handy for structured data that can have multiple values for the same attribute, or attributes nested within other attributes in a hierarchical structure. For simple analysis, however, we just want a rectangular data frame with columns and rows, and we need to flatten all that structure. The following code does a very simple job of converting an XML file into a pandas data frame. It recursively parses every branch in the file, creating new columns and storing their values when information is found. It stores not just raw text as variables in the new dataset, but also all of the attributes stored in tags.
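A minimal version of such a recursive flattener might look like the sketch below; the file name and record tag in the usage comment are placeholders, and repeated sibling tags simply overwrite one another in this simplified form.

```python
# Minimal recursive flattener: every element's text and every attribute
# becomes a column, keyed by its path in the tree. Repeated sibling tags
# overwrite one another in this simplified version.
import pandas as pd
import xml.etree.ElementTree as ET

def flatten(element, prefix="", row=None):
    """Recursively walk one record, collecting text and attributes."""
    if row is None:
        row = {}
    path = f"{prefix}{element.tag}"
    if element.text and element.text.strip():
        row[path] = element.text.strip()
    for key, value in element.attrib.items():
        row[f"{path}@{key}"] = value
    for child in element:
        flatten(child, prefix=f"{path}/", row=row)
    return row

def xml_to_dataframe(path, record_tag):
    """One row per `record_tag` element in the file at `path`."""
    root = ET.parse(path).getroot()
    return pd.DataFrame([flatten(rec) for rec in root.iter(record_tag)])

# e.g. df = xml_to_dataframe("incidents.xml", "Incident")  # placeholder names
```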

Continue reading…

Cool Visualizations

“Inner turmoil” (The Economist)

Cool Visualizations

Mapped: British, Spanish and Dutch Shipping 1750-1800
“I recently stumbled upon a fascinating dataset which contains digitised information from the log books of ships (mostly from Britain, France, Spain and The Netherlands) sailing between 1750 and 1850. The creation of this dataset was completed as part of the Climatological Database for the World’s Oceans 1750-1850 (CLIWOC) project.”

The Racial Dot Map: One Dot Per Person
“This map is an American snapshot; it provides an accessible visualization of geographic distribution, population density, and racial diversity of the American people in every neighborhood in the entire country. The map displays 308,745,538 dots, one for each person residing in the United States at the location they were counted during the 2010 Census.”

How Al Jazeera English gathered data to map Syria’s rebellion
“Al Jazeera English yesterday published an interactive that maps the Syria rebellion. The interactive is based on data collected from 900 opposition fighter groups in the country, gathered through good old fashioned journalism.”

Fast Spatial Joins in Python with a Spatial Index

I often have to execute spatial joins between points and polygons, say between bombing events and the boundaries of the districts they took place in. Quantum GIS uses ftools to execute these kinds of spatial joins, but it failed on a relatively modest join of 40k points and 9k boundaries. I could do the join in PostGIS, but I don’t want the overhead of a full spatial database for some quick analysis. The whole operation took only a few minutes to write up and a few seconds to run in Python, using the package pyshp to load the shapefiles, rtree to build the spatial index, and shapely to do the final boundary check.
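A condensed sketch of that approach is below; the shapefile names are placeholders, and the code in the post itself may differ in the details.

```python
# Sketch: pyshp reads the shapefiles, rtree narrows candidates by bounding
# box, shapely does the exact point-in-polygon test. File names are placeholders.
import shapefile                      # pyshp
from rtree import index
from shapely.geometry import shape

points = [shape(s.__geo_interface__) for s in shapefile.Reader("events.shp").shapes()]
polys = [shape(s.__geo_interface__) for s in shapefile.Reader("districts.shp").shapes()]

# Build the spatial index over polygon bounding boxes.
idx = index.Index()
for i, poly in enumerate(polys):
    idx.insert(i, poly.bounds)

# For each point, run the exact test only on polygons whose boxes contain it.
matches = {}
for j, pt in enumerate(points):
    for i in idx.intersection(pt.bounds):
        if polys[i].contains(pt):
            matches[j] = i
            break
```

The bounding-box filter is what makes this fast: the expensive geometric test runs on a handful of candidates per point instead of all 9k boundaries.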

Continue reading…

Labeling Data for Image Segments Using Dropbox and Google Docs

For machine vision projects, I often need a quick and easy way to label images as training data. After a long false start with exporting images as cells in an Excel file, I found a rather elegant online solution using Dropbox to host image segments and Google spreadsheets for labeling by myself or research assistants.

Continue reading…

Extracting Data from Printed Tables in Historical Documents

A remarkable amount of data is hiding in historical records in handwritten forms, electronic printouts, or typed tables. This post describes methods I use for three types of difficult documents: consistently structured forms, inconsistently structured forms, and near machine-readable tables.

Continue reading…

Friendly News Coverage, May 2013: UCSD Team Wins Award for "Analyzing Social Divisions using Cell Phone Data"

My team in the Department of Mathematics at UCSD, headed by David Meyer and including Orest Bucicovschi, Megha Ram, David Rideout, and Dongjin Song, won the “Best Scientific Paper” award in the D4D Challenge for our work on mapping social cleavages in Côte d’Ivoire using cell phone traffic. Check out Orange Telecommunications’ discussion of our paper and award, and the report in the UCSD newspaper The Guardian.

Continue reading…

Mosaic Plots with Percentage Labels

Michael Friendly’s book “Visualizing Categorical Data” has many great examples of visually representing cross tabs. An R package that emerged from that book is the vcd package for making mosaic plots. Something I could not find an example of, however, was how to use its elaborate strucplot framework to overlay percentages on each tile. What I came up with was hand-rolling a structure object for the purpose, which yields something like the example in the post.
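The post works in R’s vcd package; the snippet below is only a rough Python analogue, using statsmodels’ mosaic plot with a labelizer that writes each cell’s percentage on its tile. The crosstab counts are made up for illustration.

```python
# Rough Python analogue: a mosaic plot whose tiles are labeled with
# their percentage of the total. Counts below are illustrative only.
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

counts = {
    ("informant", "actionable"): 40, ("informant", "not"): 25,
    ("patrol", "actionable"): 15,    ("patrol", "not"): 20,
}
total = sum(counts.values())

def labelizer(key):
    """Return the percentage label printed inside each tile."""
    return f"{100 * counts[key] / total:.0f}%"

mosaic(counts, labelizer=labelizer, gap=0.02)
plt.show()
```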

Continue reading…

APSA and Conferences

Henry Farrell on the Crooked Timber blog has an interesting account of how selections are made at conferences like APSA.

“The Political Economy of Academic Conferences”
http://crookedtimber.org/2005/06/07/the-political-economy-of-academic-conferences/

R Speed Gains

For those looking to get more speed out of their R code, check out this post on using C++ directly in R through the Rcpp package, and on compiling R code with the new compiler package coming out in R 2.13.0.

http://www.r-bloggers.com/the-new-r-compiler-package-in-r-2-13-0-some-first-experiments/

Archival Research on an Industrial Scale, Part 1

Political scientists and historians face at least four major problems in conducting archival research: time, resources, identifying the key information, and making that information available to others for replication purposes. Together these problems either put serious archival work out of the reach of graduate students and junior faculty, or they encourage brief, shallow trips where the exercise becomes “can I find a document that supports my claim?” Over the next several posts I am going to discuss one of the technological solutions I have developed, as well as some online resources which are often overlooked.

Continue reading…

Hand Vectorizing Political Boundaries from Complicated Maps

Anyone who has tried to vectorize a paper map has struggled with the fact that maps are not designed to be cleanly read by computers; they are designed to cram as much information as possible into the smallest space that the human eye can interpret. Using the free image tool GIMP, I have a quick and dirty way of removing the clutter and leaving only the political boundaries for vectorization. Take, for example, a political boundary map from the Vietnam War.

Continue reading…

Archival Research: Custom Zotero Translators

With more and more archival material being put up on the web, it is important to have a system for downloading and organizing that material for your research. I use Zotero for all of my citation management because 1) it automatically pulls cites and files from the web and 2) it can store them in the cloud so they follow you wherever you go. For specialized electronic archives, however, there may not be a ready-made Zotero translator available. This was the case for the amazing Virtual Vietnam Archive at Texas Tech, so I rolled my own translator using the directions at the links provided below. I’ve made the full code available for anyone in need of a quick fix now, and I’ll put together something more substantial for the main Zotero trunk when I get time.

Continue reading…

Google Books and Microsoft OneNote for Coding Data

An encouraging norm is emerging where scholars release, alongside their data, a large PDF of textual summaries and the specific quotes used in coding decisions. I’ve tinkered with different systems for doing this, including Word, Excel, and Access, but what I have recently discovered works best is, surprisingly, Microsoft OneNote. OneNote offers at least four advantages so far. First, it makes it very easy to organize raw information by case and then by variable using pages and subpages. Second, it makes it very easy to get information into OneNote from sources like Google Books: use Zotero to download the book citation automatically and drag and drop it into OneNote, then use OneNote’s screen capture option to quickly copy and paste the relevant page(s) out of Google Books. Third, OneNote will automatically OCR those book images for you, allowing you either to search for words later or to copy and paste the text directly into a Word document. Fourth, OneNote will export to a Word document or a PDF, splitting sections based on the case and variable headings you set up in your pages and subpages, which allows you to reorganize things easily before putting out a final product.

The Elevator Outline of a Dissertation

Outlining a dissertation or book project is difficult because it’s easy to get lost in the detail. I proposed the following outline to a colleague with the instruction to label each section separately and to adhere to the strict sentence limits.

  • The Puzzle (1 Sentence)
  • Which we should care about because (1 Sentence)
  • What is the closest existing theory we have to account for this behavior (1 Sentence)
  • Why does it get this wrong? (2 Sentences)
  • Which together suggest the following Research Question (1 Sentence)
  • Which we will explain by theorizing about the following outcomes (1 Sentence)
  • Which is a product of a strategic interaction between who and who (1 Sentence)
  • Of which I argue the following main factor explains the puzzle (1 Sentence)
  • A factor which is important to understand, and has applicability to these bigger areas in political science (1 sentence)
  • My theory produces the following observable implications (1 Sentence)
  • My theory stands in contrast to the following explanations (1 Sentence Each)
  • They generate alternative observable implications (1 Sentence Each)
  • They suggest the coding of the following across a large sample of cases (1 Sentence)
  • Of which the appropriate universe is (1 Sentence)
  • Where my identification strategy is (2 Sentences)
  • And this will be novel because the closest work has only done (1 Sentence)
  • Additionally, a case by case comparison is warranted to code in greater detail the outcomes for the following macro level predictions (1 sentence) and for the following expected micro level predictions (1 sentence each)
  • I have selected the following cases as representative from the relevant universe (1 Sentence)
  • And where there is variation on the main outcome of interest (1 Sentence)
  • And on the key explanatory factors (1 Sentence)
  • Of which my theory predicts we should see across these cases (1 Sentence)
  • And at the micro level we would expect to see (1 Sentence per case)
  • The finding will make a major contribution to the following research field [the main one] (1 sentence)
  • It will have the following important scope conditions (1 sentence)
  • Which will suggest the next research questions (1 sentence)

Data from Historical Maps: Extracting Backgrounds

As I’ve written here before, digitizing political maps is no easy task. One tough problem is digitizing background colors, which identify things like land cover. Consider a section from a Vietnam War-era military map of South Vietnam. There are three background regions: a dark green for forested areas, a slightly lighter green for cleared forest, and white for completely clear ground. On top of that are lots of details, including brown elevation lines, black grid lines and text, and so on.

How do we differentiate the background regions from the foreground regions? Define a background region as a semi-contiguous area with similar but not identical color.
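One way to operationalize that definition, as a hedged sketch rather than the post’s actual method: flag pixels within a color tolerance of a reference background color, then keep only the large connected blobs so scattered foreground detail drops out. The image path, reference color, tolerance, and size cutoff below are illustrative.

```python
# Sketch: pixels near a reference background color, filtered to large
# connected components. All numeric values here are illustrative.
import numpy as np
from scipy import ndimage
from PIL import Image

img = np.asarray(Image.open("map_tile.png").convert("RGB"), dtype=float)  # placeholder path
reference = np.array([70.0, 110.0, 70.0])   # e.g. the dark forest green
tolerance = 40.0                             # "similar but not identical"

distance = np.linalg.norm(img - reference, axis=-1)
candidate = distance < tolerance

# Connected-component labeling; discard components too small to be
# a background region rather than foreground clutter.
labeled, n = ndimage.label(candidate)
sizes = ndimage.sum(candidate, labeled, index=range(1, n + 1))
keep = np.isin(labeled, 1 + np.flatnonzero(sizes > 5000))
```

The resulting boolean mask can then be polygonized or used to blank out the foreground before vectorization.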

Continue reading…