Land cover (land use) estimates assign points or regions on the earth’s surface to classes like forested, farmland, urban development, etc. There are hundreds of land cover data sets and methods covering different regions, time periods, and special topics. In a paper under review, my coauthors and I test methods of estimating population density at very high resolutions (235×235 meters) using real-time telecommunications data. For that paper, I developed a custom land cover map of Milan and the surrounding area using the latest available satellite images from Landsat 8, training labels using the community curated OpenStreetMap database, and a random forest classifier. It was a great quick and dirty way to get a very recent land cover map for a specific use, and I outline the details below.
Step 1: Training Labels (OpenStreetMap)
OpenstreetMap consists of community managed and free vector layers which users develop and trace over proprietary satellite and road map raster tiles provided by a third parties. Behind the scenes, objects are stored with POSTGIS in a Postgres database. You can download the full database as XML formatted .osm files.
We want simple shapefiles, and you can process those OSM files yourself into other formats. The website also has an export feature that will dump a bounding box to a shapefile, but I had difficulty using it on moderately large areas. There are third party companies that provide free extracts and alternative formats for cities and regions. I ultimately selected a shapefile dump of Italy provide by GeoFabrik.de because it was reasonably sized with standardized feature labels.
The download is organized into separate files by polygons (buildings, landuse, natural), vectors (railways, roads, waterways), and points (places, points). Each feature has columns providing additional information like place names, classifications like playground or lake, population for places, etc. For this very simple classification, I’m concerned with just five land cover classification types buildings, agriculture/nature, road/pavement, railroad, and water. For these classifications, I focus on just a few files (buildings, landuse, natural, railways, roads).
I start by clipping each of the relevant files to just the region of interest. In Qgis you can do this by creating a polygon in the shape of the area, and then using the clip one shapefile with another option. For each shapefile, I create a unique column to contain my intended final classification. I then sorted each file by the most relevant attribute and hand labeled them as one of each of the final five classes. The final intended product is a raster, so my labels were consistent integers (1,2,3,4,5) which are converted to pixel intensities after rasterizing each shapefile. After importing, cleaning, and merging the images in python, the result is five images at the target resolution labeling each pixel with a class.
As a final step, because there is some overlap between features across shapefiles each class had to be ordered from background to foreground. Here roads > rails > nature > buildings > water. The final layered labels look like the following.
Step 2: Unlabeled Data (Landsat 8 Satellite Imagery)
The labels provided by OpenStreetMap cover approximately 44% of the area of interest so additional information is needed to fill in the remainder. Fortunately, relatively high resolution and very recent imagery is available for free from Landsat 8 imagery.
USGs provides several ways of finding and downloading imagery. The entire region of interest fits within a single Landsat image, and I was able to find recent imagery with relatively little cloud cover. The imagery is provided as high resolution 30 meter (7,821 x 7,951 pixel) uncompressed .tif files broken down by 11 bands and a higher resolution 15 meter (15,641 x 15,901 pixel) chromatic band.
The three natural color image bands in r,g,b order are bands 4, 3, and 2. Using python, I extracted those three bands and combined them into a single color image. Next, I used the higher resolution chromatic image to pansharpen them into a natural color image at an interpolate 15m resolution. The final product is 15m resolution color image in the same shape and resolution as the rasters generated from the OpenStreetMap features.
#Pan Sharpening Demo import skimage import skimage.io import skimage.transform path="Path/To/Imagery/Here/" #Channels, 2 blue, 3 green, 4 red def sharpen(root): r=skimage.io.imread(path+root + "_B4.tif", as_grey=True) #Load Red g=skimage.io.imread(path+root + "_B3.tif", as_grey=True) #Load Green b=skimage.io.imread(path+root + "_B2.tif", as_grey=True) #Load Blue rgb= np.dstack((r,g,b )) #Combine into correctly ordered stack landsat_highres=root + "_B8.tif" pan=skimage.io.imread(path+landsat_highres, as_grey=True)/65535 #Load chromatic, normalize to between 0 and 1 rgb_big=skimage.transform.resize(rgb, output_shape=(pan.shape,pan.shape,3), order=3, mode='constant', cval=0.0) #Resize the rgb composite to match the chromatic hsv = skimage.color.rgb2hsv(rgb_big) #Conver the rgb to the HSV colorspace hsv[...,2]=pan #Poor man's pan sharpening, replace the intensity channel with the observed chromatic image rgb_pan = skimage.color.hsv2rgb(hsv) #Convert back to rgb space return [rgb_pan] landsat1=sharpen(root="LC81940282014072LGN00") #File name of your imagery
Step 3: Classification (Random Forest)
Now it’s time to assign land cover labels to the remaining 56% of pixels. The target region and resolution is 1567 x 1567 pixels. The unit of observation (the rows) of our data are individual pixels (n=2,455,489).
The features for each observation (the columns) are the pixel intensities in each band for a 3×3 window. For RGB that’s 3 channels x height of 3 x width of 3 = 27 features. Structures have sharp identifying edges so I further include an edge detector (difference of gausians) which produces three channels over the same dimensions adding another 27 features. That brings the total up to 54.
from scipy import ndimage as ndi landsat1_dog = landsat1 - ndi.gaussian_filter(landsat1, 15) #Difference of gaussians stack=np.dstack((landsat1, landsat1_dog)) depth=stack.shape edge=3 buff=floor(edge/2) stack_window=skimage.util.view_as_windows(np.pad(stack,pad_width=((buff,buff),(buff,buff),(0,0)), mode ='minimum'), window_shape=(edge,edge,depth), step=1); stack_window.shape #3 by 3 moving window around each pixel stack_flat=stack_window.reshape(-1,edge*edge*depth); stack_flat.shape #Flatten to rows
There are two technical considerations to consider about how the labeled observations we have compare to the unlabeled observations. The first is that OpenStreetMap coverage for two of the classes roads/pavement and rails is so comprehensive that they are almost completely labeled already. Providing them as options to the classifier can only drive down the accuracy of predictions for building, nature, and water. Therefore I drop those two labels and all of the related rows, leaving 658,430 training examples.
Second, the distribution of the remaining three classes isn’t uniform across training examples (67% nature, 29% building, and 3% water). With a lopsided distribution, a classifier that simply chooses the majority category more often will have a lower error rate on the training data so we will have to provide weights for each example.
x_test=stack_flat y_test=labels.flatten() x_train=x_test[y_test>0,...] y_train=y_test[y_test>0] proportions=1/(np.bincount(y_train)/y_train.shape) weights= proportions/sum(proportions[1:]) weights=proportions[y_train]
Finally, it’s time to pass everything to the classifier. I use a Random Forest classifier , with 300 fully grown trees. It has an out of bag error of only 4.5% which is great for a first attempt. That could be further improved with more refined features, a larger sliding window, training data from outside of the region, or moving to a different classifier like a neural network.
import sklearn.ensemble rf=sklearn.ensemble.RandomForestClassifier(n_estimators=300, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='auto', bootstrap=True, oob_score=True, n_jobs=8, random_state=0, verbose=True, min_density=None, compute_importances=None) rf.fit(x_train,y_train, sample_weight=weights) rf.oob_score_
As a final step, I combine the model’s predictions with the original labels giving precedence to the latter. I assign each label a unique color and out a final image shown below.
predictions=rf.predict(x_test) labels_predicted=predictions.reshape(labels.shape) labels_predicted_andknown[labels>0]=labels[labels>0] finalcolors=np.array([[0,0,0], #0 unlabeled [32,66,239], #1 water [0,102,0], #2 vegitation [102,51,0], #3 buildings [0,0,0], #4 rail [128,128,128]])#5 road outpath = "out/path/here" skimage.io.imsave(outpath+"rf_predictions_andknown.png", finalcolors[labels_predicted_andknown])