Quick Land Cover Estimates from Satellite Imagery and OpenStreetMap


OpenStreetmapCombinedLand cover (land use) estimates assign points or regions on the earth’s surface to classes like forested, farmland, urban development, etc. There are hundreds of land cover data sets and methods covering different regions, time periods, and special topics. In a paper under review, my coauthors and I test methods of estimating population density at very high resolutions (235×235 meters) using real-time telecommunications data. For that paper, I developed a custom land cover map of Milan and the surrounding area using the latest available satellite images from Landsat 8, training labels using the community curated OpenStreetMap database, and a random forest classifier. It was a great quick and dirty way to get a very recent land cover map for a specific use, and I outline the details below.

Step 1: Training Labels (OpenStreetMap)

OpenstreetMap consists of community managed and free vector layers which users develop and trace over proprietary satellite and road map raster tiles provided by a third parties. Behind the scenes, objects are stored with POSTGIS in a Postgres database.  You can download the full database as XML formatted .osm files.

We want simple shapefiles, and you can process those OSM files yourself into other formats. The website also has an export feature that will dump a bounding box to a shapefile, but I had difficulty using it on moderately large areas. There are third party companies that provide free extracts and alternative formats for cities and regions. I ultimately selected a shapefile dump of Italy provide by GeoFabrik.de because it was reasonably sized with standardized feature labels.

The download is organized into separate files by polygons (buildings, landuse, natural), vectors (railways, roads, waterways), and points (places, points). Each feature has columns providing additional information like place names, classifications like playground or lake, population for places, etc. For this very simple classification, I’m concerned with just five land cover classification types buildings, agriculture/nature, road/pavement, railroad, and water. For these classifications, I focus on just a few files (buildings, landuse, natural, railways, roads).

I start by clipping each of the relevant files to just the region of interest. In Qgis you can do this by creating a polygon in the shape of the area, and then using the clip one shapefile with another option. For each shapefile, I create a unique column to contain my intended final classification. I then sorted each file by the most relevant attribute and hand labeled them as one of each of the final five classes. The final intended product is a raster, so my labels were consistent integers (1,2,3,4,5) which are converted to pixel intensities after rasterizing each shapefile. After importing, cleaning, and merging the images in python, the result is five images at the target resolution labeling each pixel with a class.










As a final step, because there is some overlap between features across shapefiles each class had to be ordered from background to foreground. Here roads > rails > nature > buildings > water. The final layered labels look like the following.


Step 2: Unlabeled Data (Landsat 8 Satellite Imagery)

The labels provided by OpenStreetMap cover approximately 44% of the area of interest so additional information is needed to fill in the remainder. Fortunately, relatively high resolution and very recent imagery is available for free from Landsat 8 imagery.

USGs provides several ways of finding and downloading imagery. The entire region of interest fits within a single Landsat image, and I was able to find recent imagery with relatively little cloud cover. The imagery is provided as high resolution 30 meter (7,821 x 7,951 pixel) uncompressed .tif files broken down by 11 bands and a higher resolution 15 meter (15,641 x 15,901 pixel) chromatic band.

The three natural color image bands in r,g,b order are bands 4, 3, and 2. Using python, I extracted those three bands and combined them into a single color image. Next, I used the higher resolution chromatic image to pansharpen them into a natural color image at an interpolate 15m resolution. The final product is 15m resolution color image in the same shape and resolution as the rasters generated from the OpenStreetMap features.

#Pan Sharpening Demo
import skimage
import skimage.io
import skimage.transform

#Channels, 2 blue, 3 green, 4 red
def sharpen(root):
r=skimage.io.imread(path+root + "_B4.tif", as_grey=True) #Load Red
g=skimage.io.imread(path+root + "_B3.tif", as_grey=True) #Load Green
b=skimage.io.imread(path+root + "_B2.tif", as_grey=True) #Load Blue
rgb= np.dstack((r,g,b )) #Combine into correctly ordered stack
landsat_highres=root + "_B8.tif"
pan=skimage.io.imread(path+landsat_highres, as_grey=True)/65535 #Load chromatic, normalize to between 0 and 1
rgb_big=skimage.transform.resize(rgb, output_shape=(pan.shape[0],pan.shape[0],3), order=3, mode='constant', cval=0.0) #Resize the rgb composite to match the chromatic

hsv = skimage.color.rgb2hsv(rgb_big) #Conver the rgb to the HSV colorspace
hsv[...,2]=pan #Poor man's pan sharpening, replace the intensity channel with the observed chromatic image
rgb_pan = skimage.color.hsv2rgb(hsv) #Convert back to rgb space
return [rgb_pan]

landsat1=sharpen(root="LC81940282014072LGN00") #File name of your imagery


Step 3: Classification (Random Forest)

Now it’s time to assign land cover labels to the remaining 56% of pixels. The target region and resolution is 1567 x 1567 pixels. The unit of observation (the rows) of our data are individual pixels (n=2,455,489).

The features for each observation (the columns) are the pixel intensities in each band for a 3×3 window. For RGB that’s 3 channels x height of 3 x width of 3 = 27 features. Structures have sharp identifying edges so I further include an edge detector (difference of gausians) which produces three channels over the same dimensions adding another 27 features. That brings the total up to 54.

from scipy import ndimage as ndi
landsat1_dog = landsat1 - ndi.gaussian_filter(landsat1, 15) #Difference of gaussians

stack=np.dstack((landsat1, landsat1_dog))
stack_window=skimage.util.view_as_windows(np.pad(stack,pad_width=((buff,buff),(buff,buff),(0,0)), mode ='minimum'),
 window_shape=(edge,edge,depth), step=1); stack_window.shape #3 by 3 moving window around each pixel

stack_flat=stack_window.reshape(-1,edge*edge*depth); stack_flat.shape #Flatten to rows

There are two technical considerations to consider about how the labeled observations we have compare to the unlabeled observations. The first is that OpenStreetMap coverage for two of the classes roads/pavement and rails is so comprehensive that they are almost completely labeled already. Providing them as options to the classifier can only drive down the accuracy of predictions for building, nature, and water. Therefore I drop those two labels and all of the related rows, leaving 658,430 training examples.

Second, the distribution of the remaining three classes isn’t uniform across training examples (67% nature, 29% building, and 3% water). With a lopsided distribution, a classifier that simply chooses the majority category more often will have a lower error rate on the training data so we will have to provide weights for each example.



weights= proportions/sum(proportions[1:])

Finally, it’s time  to pass everything to the classifier. I use a Random Forest classifier ,  with 300 fully grown trees. It has an out of bag error of only 4.5% which is great for a first attempt. That could be further improved with more refined features, a larger sliding window, training data from outside of the region, or moving to a different classifier like a neural network.

import sklearn.ensemble
rf=sklearn.ensemble.RandomForestClassifier(n_estimators=300, criterion='gini',
max_depth=None, min_samples_split=2, min_samples_leaf=1,
max_features='auto', bootstrap=True,
oob_score=True, n_jobs=8, random_state=0, verbose=True,
min_density=None, compute_importances=None)

rf.fit(x_train,y_train, sample_weight=weights)

As a final step, I combine the model’s predictions with the original labels giving precedence to the latter.  I assign each label a unique color and out a final image shown below.

finalcolors=np.array([[0,0,0], #0 unlabeled
                     [32,66,239], #1 water
                     [0,102,0], #2 vegitation
                     [102,51,0], #3 buildings
                     [0,0,0], #4 rail
                     [128,128,128]])#5 road

outpath = "out/path/here"
skimage.io.imsave(outpath+"rf_predictions_andknown.png", finalcolors[labels_predicted_andknown])



1 comment to Quick Land Cover Estimates from Satellite Imagery and OpenStreetMap

Leave a Reply to “High Resolution Population Estimates from Telecommunications Data” (Douglass et al. 2015) | Rex Douglass Cancel reply




You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>