Congratulations to my group at UCSD (David A. Meyer, Megha Ram, David Rideout and Dongjin Song) for being selected as a top 10 finalist out of 652 teams in the Telecom Italia Big Data Challenge 2014. Check out the UCSD press release describing the project. My corner of the project focuses on using cell phone, text, and internet data to create super fine-grained estimates of urban population at the level of 235-meter by 235-meter grid squares. The problem has three challenging components: variable selection in extremely wide data, correctly estimating scale (in)variance, and interpolating down to a smaller geographic scale.
Data on population were available at the district level and only for some districts, leaving 69 observations. Meanwhile, call, text, and internet data were available by grid square, country of origin, and call time, which we aggregated up into tens of thousands of potential explanatory variables. Variable selection was performed with a combination of Lasso and random forest regression with cross-validation.
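To make the "few observations, many candidate variables" setting concrete, here is a minimal sketch of the Lasso half of that step, with the penalty chosen by cross-validation. It is not our actual pipeline: the data are synthetic (69 rows to mirror the observation count, 20 random features instead of tens of thousands), the helper names `lasso_fit` and `cv_lasso` are made up for illustration, the features are assumed to be centered so no intercept is fit, and the random forest part is left out entirely.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; this is the core of the Lasso update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, starting from b = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def mse(X, y, beta):
    preds = [sum(x * b for x, b in zip(row, beta)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

def cv_lasso(X, y, lams, k=3):
    """Return the penalty with the lowest mean held-out error over k folds."""
    n = len(X)
    folds = [list(range(f, n, k)) for f in range(k)]
    best_lam, best_err = None, float("inf")
    for lam in lams:
        err = 0.0
        for test_idx in folds:
            held = set(test_idx)
            X_tr = [X[i] for i in range(n) if i not in held]
            y_tr = [y[i] for i in range(n) if i not in held]
            beta = lasso_fit(X_tr, y_tr, lam)
            err += mse([X[i] for i in test_idx], [y[i] for i in test_idx], beta)
        err /= k
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# Synthetic stand-in data (an assumption, not project data):
# only features 0 and 1 actually drive the response.
rng = random.Random(0)
n_obs, n_feat = 69, 20
X = [[rng.gauss(0, 1) for _ in range(n_feat)] for _ in range(n_obs)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

lams = [0.01, 0.03, 0.1, 0.3]
best_lam = cv_lasso(X, y, lams)
beta = lasso_fit(X, y, best_lam)
selected = [j for j, b in enumerate(beta) if abs(b) > 1e-8]
print("chosen penalty:", best_lam)
print("selected feature indices:", selected)
```

The hand-rolled solver is only there to keep the sketch dependency-free; in practice a library implementation such as scikit-learn's `LassoCV` would be the more natural choice.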
Interpolation requires the relationship between call activity and population to either remain constant at all scales or vary in a predictable and monotonic way. We tested for scale invariance by aggregating districts into larger virtual districts, trying several different fits, and then measuring out-of-sample error on a validation set. For the most part, we found that population and some kinds of calling activity scale very predictably.
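The aggregation check can be sketched roughly as follows. Everything here is an illustrative assumption rather than what we ran: the districts are synthetic, population is generated as proportional to calls with multiplicative noise, the only fit tried is a log-log regression, and the functions `fit_loglog`, `aggregate`, and `validation_rmse` are hypothetical helpers. The idea is just that if the relationship is scale invariant, the fitted slope and the held-out error should stay roughly stable as districts are merged into bigger virtual districts.

```python
import math
import random

def fit_loglog(calls, pop):
    """OLS of log(pop) on log(calls); returns (intercept, slope)."""
    xs = [math.log(c) for c in calls]
    ys = [math.log(p) for p in pop]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def aggregate(calls, pop, size):
    """Sum consecutive groups of `size` districts into virtual districts."""
    starts = range(0, len(calls) - size + 1, size)
    return ([sum(calls[i:i + size]) for i in starts],
            [sum(pop[i:i + size]) for i in starts])

def validation_rmse(intercept, slope, calls, pop):
    """Root mean squared error of log(pop) predictions on held-out data."""
    errs = [(math.log(p) - (intercept + slope * math.log(c))) ** 2
            for c, p in zip(calls, pop)]
    return math.sqrt(sum(errs) / len(errs))

# Synthetic districts (assumption): population proportional to call volume,
# with multiplicative noise.
rng = random.Random(1)
calls = [rng.lognormvariate(8.0, 1.0) for _ in range(96)]
pop = [2.5 * c * math.exp(rng.gauss(0, 0.1)) for c in calls]

# First 64 districts for fitting, last 32 held out for validation.
train_calls, train_pop = calls[:64], pop[:64]
valid_calls, valid_pop = calls[64:], pop[64:]

results = {}
for size in (1, 2, 4):
    tr_c, tr_p = aggregate(train_calls, train_pop, size)
    va_c, va_p = aggregate(valid_calls, valid_pop, size)
    intercept, slope = fit_loglog(tr_c, tr_p)
    results[size] = (slope, validation_rmse(intercept, slope, va_c, va_p))
    print(f"group size {size}: slope={slope:.3f}, "
          f"validation RMSE={results[size][1]:.3f}")
```

A slope that drifts steadily as the group size grows would be a warning that the fit cannot safely be pushed down to scales smaller than the district.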
Using the subset of models and variables with the greatest accuracy, we interpolated downward out of sample to the smaller grid units. We ended up with a high-resolution population map of Milan available in almost real time.
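As a toy picture of what interpolating downward means, the sketch below fits a single through-the-origin rate at the district level and applies it to grid squares. The numbers, the square names, and the `fit_rate` helper are all invented for illustration, and a one-variable proportional model is a deliberate simplification of the models we actually selected.

```python
def fit_rate(district_calls, district_pop):
    """Least-squares slope through the origin: pop ~ rate * calls."""
    num = sum(c * p for c, p in zip(district_calls, district_pop))
    den = sum(c * c for c in district_calls)
    return num / den

# Invented district-level training data (not project data).
district_calls = [1200.0, 800.0, 1500.0, 400.0]
district_pop = [3000.0, 2100.0, 3600.0, 1000.0]
rate = fit_rate(district_calls, district_pop)

# Invented call counts for the grid squares inside one district.
grid_calls = {"sq_a": 30.0, "sq_b": 12.0, "sq_c": 0.0, "sq_d": 58.0}

# Apply the district-level fit at the finer scale.
grid_pop = {sq: rate * c for sq, c in grid_calls.items()}

for sq, est in grid_pop.items():
    print(f"{sq}: estimated population {est:.1f}")

# With a proportional model, the grid estimates add up to the estimate
# for the area they cover, which is why scale invariance matters here.
total_from_grid = sum(grid_pop.values())
total_from_area = rate * sum(grid_calls.values())
print(f"sum of squares: {total_from_grid:.1f}, whole area: {total_from_area:.1f}")
```

Because the call data arrive continuously, rerunning the last step on fresh counts is what makes a near real-time map possible.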