Using genetic algorithms to optimize k-Nearest Neighbors configurations for use with airborne laser scanning data

Formally Refereed

Abstract

The relatively small sampling intensities used by national forest inventories are often insufficient to produce the desired precision for estimates of population parameters unless the estimation process is augmented with auxiliary information, usually in the form of remotely sensed data. The k-Nearest Neighbors (k-NN) technique is a non-parametric, multivariate approach to prediction that has emerged as particularly popular for use with forest inventory and remotely sensed data and has been shown to contribute substantially to increasing precision. k-NN predictions are calculated as linear combinations of observations for sample units that are nearest in a space of auxiliary variables to the population unit for which a prediction is desired. Implementation of a nearest neighbors algorithm requires four choices: (i) a distance metric, (ii) specific auxiliary variables to be used with the distance metric, (iii) the number of nearest neighbors, and (iv) a scheme for weighting the nearest neighbors. Regardless of the choices for a distance metric and weighting scheme, emerging evidence suggests that optimization of the technique, including selection of an optimal subset of auxiliary variables, greatly enhances prediction. However, optimization can be computationally intensive and time-consuming. A promising approach that is gaining favor is based on genetic algorithms, a technique that uses search heuristics that mimic natural selection to solve optimization problems. The objective of the study was to compare optimized k-NN configurations with respect to inferences for mean volume per unit area using airborne laser scanning variables as auxiliary information. For two study areas, one in Norway and one in Minnesota, USA, the analyses focused on optimizing k-NN configurations that used the weighted Euclidean and canonical correlation distance metrics and two neighbor weighting schemes.
Novel features of the study include introduction of a neighbor weighting scheme that has not previously been used for forestry applications, simultaneous optimization of all four k-NN choices, and basing comparisons on confidence intervals, rather than intermediate products such as prediction accuracies. Two conclusions were primary: (1) optimized selection of feature variables produced greater precision than using all feature variables, and (2) computational intensity necessary to optimize the weighted Euclidean metric was considerably greater than for the canonical correlation analysis metric. Specific findings were that optimization produced pseudo-R² as large as 0.87 for the Norwegian dataset and as large as 0.89 for the Minnesota dataset. For the optimized canonical correlation distance metric, widths of approximate 95% confidence intervals as proportions of the estimated means were as small as 0.13 for the Norwegian dataset and as small as 0.15 for the Minnesota dataset.
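
To make the four k-NN choices and the genetic-algorithm search described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a weighted Euclidean distance, inverse-distance neighbor weighting, and leave-one-out RMSE as the fitness criterion; the study's own weighting schemes, optimization criterion, and canonical correlation metric are not reproduced here. Only the subset of auxiliary variables is optimized, with k held fixed, and the dataset must have at least two auxiliary variables. All function and parameter names (`knn_predict`, `loo_rmse`, `ga_select`, `t`, `p_mut`) are hypothetical.

```python
import numpy as np

def knn_predict(X_ref, y_ref, x_target, feature_weights, k, t=1.0):
    """Predict y for one target unit as a weighted mean of its k nearest reference units."""
    # Choices (i) and (ii): weighted Euclidean distance; a zero weight drops a variable.
    diff = X_ref - x_target
    d = np.sqrt((feature_weights * diff ** 2).sum(axis=1))
    # Choice (iii): keep the k nearest reference units.
    nearest = np.argsort(d)[:k]
    # Choice (iv): inverse-distance neighbor weights (assumed scheme); t = 0 gives equal weights.
    w = 1.0 / (d[nearest] + 1e-9) ** t
    w /= w.sum()
    return float(w @ y_ref[nearest])

def loo_rmse(X, y, mask, k):
    """Leave-one-out RMSE using only the variables flagged True in the boolean mask."""
    if not mask.any():
        return np.inf  # an empty variable subset cannot predict anything
    fw = mask.astype(float)
    idx = np.arange(len(y))
    errs = []
    for i in idx:
        keep = idx != i
        errs.append(knn_predict(X[keep], y[keep], X[i], fw, k) - y[i])
    return float(np.sqrt(np.mean(np.square(errs))))

def ga_select(X, y, k=3, pop_size=20, generations=15, p_mut=0.1, seed=0):
    """Search for a variable subset with a simple genetic algorithm."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    # Each individual is a boolean mask over the auxiliary variables.
    pop = rng.random((pop_size, n_feat)) < 0.5
    for _ in range(generations):
        fitness = np.array([loo_rmse(X, y, ind, k) for ind in pop])
        # Selection: the better half survives unchanged (elitism).
        parents = pop[np.argsort(fitness)[: pop_size // 2]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)               # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_feat) < p_mut           # bit-flip mutation
            children.append(np.where(flip, ~child, child))
        pop = np.vstack([parents, children])
    fitness = np.array([loo_rmse(X, y, ind, k) for ind in pop])
    return pop[np.argmin(fitness)], float(fitness.min())
```

Calling `ga_select(X, y)` on an array `X` of auxiliary variables (rows are sample units) and a vector `y` of observations returns a boolean mask of selected variables and the corresponding leave-one-out RMSE. Extending each individual to also encode k, continuous feature weights, and the weighting exponent would move the sketch closer to the simultaneous optimization the abstract describes.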

Keywords

Inference, Spatial estimation, National forest inventory

Citation

McRoberts, Ronald E.; Domke, Grant M.; Chen, Qi; Næsset, Erik; Gobakken, Terje. 2016. Using genetic algorithms to optimize k-Nearest Neighbors configurations for use with airborne laser scanning data. Remote Sensing of Environment. 184: 387-395. https://doi.org/10.1016/j.rse.2016.07.007.
https://www.fs.usda.gov/research/treesearch/55205