There is a crisis of replicability in scientific results that involve spatial data because important calculations are often carried out by hand in proprietary software. Without code to serve as a paper trail, important steps in the analysis become difficult to check for errors, alter for testing, or even document adequately. Performing spatial tasks by hand is also tedious, which discourages researchers from performing standard robustness checks, adding measures, or trying new things. One solution to both problems is to encourage the use of spatially aware languages and database solutions. This brief post outlines how to perform five common spatial tasks using one of the best tools available, the PostGIS spatial extension to the popular and free PostgreSQL database.
Hexagon binning is an important tool for visualizing spatial relationships, but it’s a pain to implement. Current solutions involve partial implementations for PostGIS and workarounds based on generating hexagons in QGIS with the MMQGIS plugin. In this brief post, I provide a SQL function for quickly generating hexagon polygon layers in PostGIS for any cell size, region of interest, and projection.
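As a rough illustration of the geometry involved (this is not the SQL function from the post), a hexagon layer can be built by stepping column by column across a bounding box and shifting every other column by half a row. The sketch below is plain Python; the names `hexagon` and `hex_grid` and the choice of flat-topped cells are assumptions made for the example.

```python
import math

def hexagon(cx, cy, size):
    """Return the 6 vertices of a flat-topped hexagon centred on (cx, cy).

    `size` is the circumradius, which equals the edge length.
    """
    return [(cx + size * math.cos(math.radians(60 * i)),
             cy + size * math.sin(math.radians(60 * i)))
            for i in range(6)]

def hex_grid(xmin, ymin, xmax, ymax, size):
    """Cover a bounding box with flat-topped hexagons (hypothetical helper)."""
    dx = 1.5 * size            # horizontal spacing between column centres
    dy = math.sqrt(3) * size   # vertical spacing between row centres
    cells = []
    col = 0
    x = xmin
    while x <= xmax + size:
        # Odd columns are shifted up by half a row so the cells interlock.
        y = ymin + (dy / 2 if col % 2 else 0.0)
        while y <= ymax + dy / 2:
            cells.append(hexagon(x, y, size))
            y += dy
        x += dx
        col += 1
    return cells
```

The coordinates are in whatever units the bounding box uses, so in a projected coordinate system `size` would be in meters.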
Developing GIS products for older historical periods can be tricky. The further back in time you go, the less comprehensive the map coverage, the less accurate the surveying, and the more obscure the projections. In a project with Kristen Harkness and others, we are recovering and mapping events during an insurgency that took place 60 years ago in Colonial Kenya called the Mau Mau Rebellion. The episode is surprisingly data rich and relevant for modern counterinsurgency debates, but making sense of such an old case with modern econometric tools means solving very practical GIS problems, like how to develop a period-accurate map of roads and infrastructure. This post details a two-pronged approach we developed that should be widely applicable to other cases.
The Worldwide Incidents Tracking System (WITS) was a database of global terrorism events compiled by the National Counterterrorism Center (NCTC) until 2012. At the end it contained 68,939 records with a short synopsis of each event, and it is thus still of interest to conflict scholars. Unfortunately, it’s now defunct and getting a copy can be difficult. In this brief post, I show how to download the original XML copy available on the Internet Archive Wayback Machine and parse it into a flat CSV file.
In a follow-up to my post about generating lots of real-world random data in R, in this brief post I show how to generate lots of realistic functions. By sampling from the PDF and CDF of real-world data you can quickly generate all manner of continuous and step functions for further experimentation.
Occasionally I end up making something that is unintentionally beautiful but that will never end up in a paper. I call these artistic mistakes, and below is a collection of some of the prettier ones. Most of them come from my experiments with machine vision, particularly transformations of map images that produced wildly unexpected results.
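To make the idea concrete, here is a minimal sketch in Python (the post itself works in R) of turning a sample into a step function via its empirical CDF, plus the matching quantile function. The helper names `ecdf` and `quantile` are hypothetical, and this illustrates the general technique rather than the post's code.

```python
import bisect

def ecdf(data):
    """Return F, where F(x) is the share of observations <= x (a step function)."""
    xs = sorted(data)
    n = len(xs)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(xs, x) / n

    return F

def quantile(data):
    """Return Q, a step-function inverse of the empirical CDF for u in [0, 1)."""
    xs = sorted(data)
    n = len(xs)

    def Q(u):
        # Map u onto an index into the sorted sample, clamped to the last value.
        return xs[min(int(u * n), n - 1)]

    return Q
```

Feeding `Q` a uniform random number (for example from `random.random()`) draws a value with the same distribution as the original sample.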
In this brief post, I show a trick for quickly assembling arbitrarily large samples of real world data by sampling from all of the data sets included in R packages.
I’m a strong proponent of graphical comparisons before diving into models, but which exploratory plot to use depends heavily on the underlying distributions of the data and which signals you’re looking for. Below I compare four kinds of bivariate plots for continuous variables (binning into 10 quantiles with boxplots, a scatter plot with a lowess fit, hexbins with a greyscale color scheme, and 2D density with a rainbow color scheme). I show that each has strengths and weaknesses in recovering five functional forms (piecewise linear stair-step, piecewise linear, exponential, quadratic, and sinusoidal) under three distributions of the covariates (uniform, normal, and mixed exponential and normal).
Land cover (land use) estimates assign points or regions on the earth’s surface to classes like forested, farmland, urban development, etc. There are hundreds of land cover data sets and methods covering different regions, time periods, and special topics. In a paper under review, my coauthors and I test methods of estimating population density at very high resolutions (235×235 meters) using real-time telecommunications data. For that paper, I developed a custom land cover map of Milan and the surrounding area using the latest available satellite images from Landsat 8, training labels from the community-curated OpenStreetMap database, and a random forest classifier. It was a great quick and dirty way to get a very recent land cover map for a specific use, and I outline the details below.
Neural networks are fantastic tools for classification and regression, but they are slow to train because they depend on gradient descent across thousands or even millions of parameters. In fact, they are a relatively old idea that has recently come back into vogue, in part because of speed increases in modern CPUs and particularly the large-scale parallelization available in GPUs. With open source software and commodity hardware, the cost of learning and building useful neural networks is now extremely low. This post describes how I built a dedicated rig for testing neural networks for only a few hundred dollars and had it running in less than a day. It also serves as a how-to guide for avoiding some pitfalls in configuration.
Congratulations to my group at UCSD (David A. Meyer, Megha Ram, David Rideout and Dongjin Song) for being selected as a top 10 finalist out of 652 teams in the Telecom Italia Big Data Challenge 2014. Check out the UCSD press release describing the project. My corner of the project focuses on using cell phone, text, and internet data to create super fine-grained estimates of urban population at the 235 meter by 235 meter grid square level. The problem has three challenging components: variable selection in extremely wide data, correctly estimating scale (in)variance, and interpolating down to a smaller geographic scale.
I was tipped off by a GitHub thread that the development version of the Random Forest Classifier in Sklearn (15-dev) had major speed improvements. I built a small benchmark using the MNIST handwriting data set and compared the training and prediction speeds of Sklearn (14.1), Sklearn (15-dev), and WiseRF (1.1). At least in this small test, the development version of Sklearn is the king, besting WiseRF in both training and prediction.
Markup languages like XML are really handy for structured data that can have multiple values for the same attribute, or attributes that are nested within other attributes in a hierarchical structure. For simple analysis, however, we just want a rectangular data-frame with columns and rows, so we need to flatten all that structure. The following code does a very simple job of converting an XML file into a Pandas data-frame. It recursively parses every branch in the file, creating new columns and storing their values when information is found. It stores not just raw text as variables in the new dataset, but also all of the attributes stored in tags.
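The recursive flattening described above can be sketched with nothing but the standard library. This is a hedged illustration rather than the post's original code: the function names `flatten` and `xml_to_rows`, the `tag.child` and `tag@attribute` column naming, and the assumption that each child of the root element is one record are all choices made for this example.

```python
import xml.etree.ElementTree as ET

def flatten(elem, prefix="", row=None):
    """Recursively copy an element's attributes and text into one flat dict."""
    if row is None:
        row = {}
    path = f"{prefix}{elem.tag}"
    # Attributes stored in the tag become their own columns.
    for name, value in elem.attrib.items():
        row[f"{path}@{name}"] = value
    # Raw text becomes a column named after the path to the element.
    text = (elem.text or "").strip()
    if text:
        row[path] = text
    # Recurse into every branch. Note: repeated sibling tags overwrite each
    # other in this simple sketch; a fuller version would number them.
    for child in elem:
        flatten(child, prefix=f"{path}.", row=row)
    return row

def xml_to_rows(xml_string):
    """Treat each child of the root as one record (an assumption) and flatten it."""
    root = ET.fromstring(xml_string)
    return [flatten(record) for record in root]
```

Passing the resulting list of dicts to `pandas.DataFrame(rows)` gives the rectangular data-frame, with missing values wherever a record lacks a column.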
I often have to execute spatial joins between points and polygons, say, between bombing events and the boundaries of the district they took place in. Quantum GIS uses ftools to execute these kinds of spatial joins, but it failed on a relatively modest join of 40k points and 9k boundaries. I could do the join in PostGIS, but I don’t want the overhead of a full spatial database for some quick analysis. The whole operation took only a few minutes to write up and a few seconds to run in Python, using the package pyshp to load shapefiles, rtree to build the spatial index, and shapely to do the final boundary check.
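The shape of that join, a cheap bounding-box filter followed by an exact containment test, can be shown in dependency-free Python. To be clear about the substitutions: the linear scan over bounding boxes below stands in for the rtree index, and a ray-casting test stands in for shapely's boundary check, so this sketch shows the logic but not the speed. All function names are hypothetical.

```python
def bbox(polygon):
    """Bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def contains(polygon, x, y):
    """Ray casting: toggle `inside` each time a ray to the right crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def spatial_join(points, polygons):
    """Return, for each point, the name of the first polygon containing it (or None).

    `polygons` maps a name to a list of vertices.
    """
    boxes = {name: bbox(poly) for name, poly in polygons.items()}
    matches = []
    for x, y in points:
        match = None
        for name, (xmin, ymin, xmax, ymax) in boxes.items():
            # Cheap rejection first, exact test only for candidates.
            if xmin <= x <= xmax and ymin <= y <= ymax and contains(polygons[name], x, y):
                match = name
                break
        matches.append(match)
    return matches
```

With a real spatial index the candidate lookup no longer has to visit every polygon for every point, which is what makes tens of thousands of points practical.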
For machine vision projects, I often need a quick and easy way to label images as training data. After a long false start with exporting images as cells in an Excel file, I found a rather elegant online solution using Dropbox to host image segments and Google Spreadsheets for labeling by myself or research assistants.
A remarkable amount of data is hiding in historical records in handwritten forms, electronic printouts, or typed tables. This post describes methods I use for three types of difficult documents: consistently structured forms, inconsistently structured forms, and near machine-readable tables.
My team in the Department of Mathematics at UCSD, headed by David Meyer and including Orest Bucicovschi, Megha Ram, David Rideout, and Dongjin Song, won the “Best Scientific Paper” award in the D4D Challenge for our work on mapping social cleavages in Côte d’Ivoire using cell phone traffic. Check out Orange Telecommunication’s discussion of our paper and award and the report in the UCSD newspaper, The Guardian.
Michael Friendly’s book “Visualizing Categorical Data” has many great examples of visually representing cross tabs. An R package that emerged from that book is the vcd package for making mosaic plots. Something I could not find an example of, however, was how to use the elaborate strucplot framework to overlay percentages on each tile. What I came up with is hand rolling a structure object for the purpose. You end up with something like this example:
Henry Farrell on the Crooked Timber blog has an interesting account of how selections are made at conferences like APSA.
“The Political Economy of Academic Conferences”
For those looking to get more speed out of their R code, check out this post on using C++ directly in R through the Rcpp package, and compiling R code through the new compiler package which is coming out in R 2.13.0.
Political scientists and historians face at least four major problems in conducting archival research: time, resources, identifying the key information, and making that information available to others for replication purposes. Together these problems either put serious archival work out of the reach of graduate students and junior faculty or encourage brief, shallow trips where the exercise becomes “can I find a document that supports my claim?” Over the next several posts I am going to discuss one of the technological solutions I have developed as well as some online resources which are often overlooked.
Anyone who has tried to vectorize a paper map has struggled with the fact that maps are not designed to be cleanly read by computers; they are designed to cram as much information as possible into the smallest space that the human eye can interpret. Using a freeware image tool called GIMP, I have a quick and dirty way of removing the clutter and leaving only the political boundaries for vectorization. Take for example this political boundary map from the Vietnam War (click to zoom; warning, big download at 12 MB).
With more and more archival material being put up on the web, it is important to have a system for downloading and organizing that material for your research. I use Zotero for all of my citation management because 1) it automatically pulls cites and files from the web and 2) it can store them to the cloud so they follow you wherever you go. For specialized electronic archives, however, there may not be a ready-made Zotero translator available. This was the case for the amazing Vietnam Virtual Archive at Texas Tech, so I rolled my own translator using the directions at the links provided below. I’ve made the full code available for anyone in need of a quick fix now, and I’ll put together something more substantial for the main Zotero trunk when I get time.
An encouraging norm is emerging where scholars release, alongside their data, a large PDF of textual summaries and specific quotes used in the coding decisions. I’ve tinkered with different systems for doing this, including Word, Excel, and Access, but what I have recently discovered works best is, surprisingly, Microsoft OneNote. OneNote offers at least four advantages so far. First, it makes it very easy to organize raw information by case and then variable using pages and subpages. Second, it makes it very easy to get information into OneNote from sources like Google Books. Use Zotero to download the book citation automatically, and drag and drop it into OneNote. Then use OneNote’s screen capture option to quickly copy and paste the relevant page(s) out of Google Books. Third, OneNote will automatically OCR those book images for you, allowing you to either search for words later or copy and paste the text directly into a Word doc. Fourth, OneNote will export to a Word doc or a PDF, splitting sections based on the case and variable headings you set up in your pages and subpages, which allows you to reorganize things easily before putting out a final product.
Outlining a dissertation or book project is difficult because it’s easy to get lost in the detail. I proposed the following outline to a colleague with the instruction to label each section separately and to adhere to the strict sentence limits.
As I’ve written on here before, digitizing political maps is no easy task. One tough problem is digitizing background colors which identify things like land cover. Consider this section from a Vietnam War era military map of South Vietnam. There are three background regions: a dark green for forested area, a slightly lighter green for cleared forest, and a white area for completely clear. On top of that are lots of details, including brown elevation lines, black grid lines and text, etc.
How do we differentiate the background regions from the foreground regions? Define a background region as a semi-contiguous area with similar but not identical color.
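One way to make that definition operational is a flood fill with a color tolerance: start from a seed pixel and grow the region through neighbors whose color is close to the seed's. The sketch below is a simplified illustration in plain Python, not the method from the post. It assumes the image is a list of rows of (r, g, b) tuples, the names `grow_region` and `tol` are hypothetical, and it only captures strictly contiguous areas, so a region split by grid lines or text would need one seed per fragment.

```python
from collections import deque

def grow_region(image, seed, tol):
    """Return the set of (row, col) pixels connected to `seed` whose color is
    within `tol` of the seed color on every channel."""
    rows, cols = len(image), len(image[0])
    target = image[seed[0]][seed[1]]

    def similar(color):
        # "Similar but not identical": allow a small per-channel difference.
        return max(abs(a - b) for a, b in zip(color, target)) <= tol

    seen = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbors of the current pixel.
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and similar(image[nr][nc])):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen
```

Comparing every pixel to the seed color, rather than to its immediate neighbor, keeps the region from drifting gradually from dark green into light green.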
Copyright © 2016 Rex Douglass - All Rights Reserved