The selection of areas for case study research in socio-economic geography with the application of <i>k</i>-means clustering

Agata Warchalska-Troll; Tomasz Warchalski

doi:10.5604/01.3001.0015.7717

Agata Warchalska-Troll, Institute of Urban and Regional Development, Warsaw, Poland; Jagiellonian University in Krakow, Institute of Geography and Spatial Management, Poland. ORCID: https://orcid.org/0000-0003-1314-3206

Tomasz Warchalski, Independent researcher, Poland. ORCID: https://orcid.org/0000-0002-2894-2265

Wiadomości Statystyczne. The Polish Statistician, vol. 67, 2022, no. 2, pp. 1–20. Published online: 28 February 2022. DOI: 10.5604/01.3001.0015.7717

How to cite: Warchalska-Troll, A., Warchalski, T. (2022). The selection of areas for case study research in socio-economic geography with the application of k-means clustering. Wiadomości Statystyczne. The Polish Statistician, 67(2), 1–20. https://doi.org/10.5604/01.3001.0015.7717.

ABSTRACT

Grouping techniques known from statistics are rarely used by geographers to select a research area. The aim of the paper is to examine the potential of the k-means clustering (partitioning) method for selecting spatial units (here: gminas, i.e. the lowest-level administrative units in Poland) for case studies in socio-economic geography. We explored this topic by solving a practical problem: the optimal designation of gminas for in-depth research on the interaction between nature protection and local and regional development in the Polish Carpathians. Particular attention was devoted to determining an appropriate number of clusters by means of the elbow method and the pseudo-F statistic (the Calinski-Harabasz index). The data for the analysis were mostly provided by Statistics Poland and covered the period 1999–2012. The multi-stage procedure resulted in the selection of the following gminas: Cisna, Lipinki, Ochotnica Dolna, Sękowa, Szczawnica and Zawoja.
The example described in the paper demonstrates that the k-means technique, despite certain deficiencies, may prove useful for creating classifications and typologies leading to the selection of case study sites, as it is relatively time-efficient, intuitive and available in open-source software. At the same time, because the socio-economic characteristics of areas are complex, applying this method in socio-economic geography may require supporting the interpretation of the results with additional data sources and expert knowledge.
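
The general idea of choosing the number of clusters with the elbow method and the pseudo-F statistic can be sketched as follows. This is an illustration only, not the authors' code: it runs a plain Lloyd-style k-means with random restarts on synthetic two-dimensional points, and all names (`kmeans`, `pseudo_f`, `wss_by_k`, `f_by_k`) and the data are assumptions made for the example. With real indicators for gminas, the variables would usually need to be standardised first. The pseudo-F value for k clusters and n observations is computed as [BSS / (k − 1)] / [WSS / (n − k)], where WSS is the within-cluster sum of squares and BSS the between-cluster sum of squares.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, n_init=25, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (labels, centroids, wss) of the best restart."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        # Start from k distinct observations drawn at random.
        centroids = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                          for p in points]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:  # an emptied cluster keeps its old centroid
                    centroids[j] = centroid(members)
        wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        if best is None or wss < best[2]:
            best = (labels, centroids, wss)
    return best

def pseudo_f(points, k, wss):
    """Calinski-Harabasz pseudo-F: (BSS / (k - 1)) / (WSS / (n - k))."""
    n = len(points)
    grand_mean = centroid(points)
    tss = sum(sq_dist(p, grand_mean) for p in points)
    bss = tss - wss  # valid when centroids are the cluster means
    return (bss / (k - 1)) / (wss / (n - k))

# Synthetic data (an assumption for the example): three separated groups.
rng = random.Random(42)
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centres for _ in range(20)]

wss_by_k, f_by_k = {}, {}
for k in range(2, 7):
    _, _, wss = kmeans(points, k)
    wss_by_k[k] = wss
    f_by_k[k] = pseudo_f(points, k, wss)
    print(f"k={k}  WSS={wss:.1f}  pseudo-F={f_by_k[k]:.1f}")

# Elbow method: look for the k after which WSS stops dropping sharply.
# Pseudo-F: prefer the k with the highest value.
best_k = max(f_by_k, key=f_by_k.get)
print("k with the highest pseudo-F:", best_k)
```

In this sketch the elbow is read off the printed WSS values by eye, while the pseudo-F criterion is applied automatically. As the abstract notes, such a result is a starting point for interpretation rather than a final choice of case study sites.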

KEYWORDS

case study, k-means partitioning, elbow method, pseudo-F statistic, Calinski-Harabasz index

JEL

C38, O18, R58

REFERENCES

Babbie, E. (2007). Badania społeczne w praktyce. Wydawnictwo Naukowe PWN.

Bayisa, F. L., Adahl, M., Rydén, P., & Cronie, O. (2020). Large-scale modelling and forecasting of ambulance calls in northern Sweden using spatio-temporal log-Gaussian Cox processes. Spatial Statistics, 39, 1–22. https://doi.org/10.1016/j.spasta.2020.100471 .

Bole, D., Kozina, J., & Tiran, J. (2019). The variety of industrial towns in Slovenia: a typology of their economic performance. Bulletin of Geography. Socio-economic Series, 46(46), 71–83. http://doi.org/10.2478/bog-2019-0035 .

Brauksa, I. (2013). Use of Cluster Analysis in Exploring Economic Indicator Differences among Regions: The Case of Latvia. Journal of Economics, Business and Management, 1(1), 42–45. http://doi.org/10.7763/JOEBM.2013.V1.10 .

Caliński, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1), 1–27.

Crone, T. M. (2005). An alternative definition of economic regions in the United States based on similarities in state business cycles. The Review of Economics and Statistics, 87(4), 617–626. https://doi.org/10.1162/003465305775098224 .

Dawidowicz, D. (2020). Ocena sytuacji finansowej gmin z wykorzystaniem metody k-średnich. Wiadomości Statystyczne. The Polish Statistician, 65(7), 26–46. https://doi.org/10.5604/01.3001.0014.3284 .

ESRI. (n.d.). Grouping Analysis (Spatial Statistics). Retrieved June 24, 2021, from https://pro.arcgis.com/en/pro-app/2.8/tool-reference/spatial-statistics/grouping-analysis.htm .

Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster Analysis (5th edition). John Wiley & Sons. https://doi.org/10.1002/9780470977811

Gao, P., & Kupfer, J. A. (2018). Capitalizing on a wealth of spatial information: Improving biogeographic regionalization through the use of spatial clustering. Applied Geography, 99, 98–108. https://doi.org/10.1016/j.apgeog.2018.08.002 .

Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108. https://doi.org/10.2307/2346830 .

Kodinariya, T. M., & Makwana, P. R. (2013). Review on determining number of Cluster in K-Means Clustering. International Journal of Advance Research in Computer Science and Management Studies, 1(6), 90–95.

Kong, W., Wang, Y., Dai, H., Zhao, L., & Wang, C. (2021). Analysis of energy consumption structure based on K-means clustering algorithm. E3S Web of Conferences, 267, 1–5. https://doi.org/10.1051/e3sconf/202126701054 .

Kraszewska, B. (2016). Wykorzystanie analizy skupień w ocenie zróżnicowania zagrożenia ubóstwem w podregionach Polski. Wiadomości Statystyczne. The Polish Statistician, 61(5), 17–36. https://doi.org/10.5604/01.3001.0014.0993 .

Larose, D. T., & Larose, C. D. (2014). Discovering Knowledge in Data: An Introduction to Data Mining (2nd edition). John Wiley & Sons. https://doi.org/10.1002/9781118874059 .

Li, X., Wang, L., & Liu, S. (2016). Geographical Analysis of Community Resilience to Seismic Hazard in Southwest China. International Journal of Disaster Risk Science, 7(3), 257–276. https://doi.org/10.1007/s13753-016-0091-8 .

Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489 .

MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281–297). University of California Press. https://projecteuclid.org/ebooks/berkeley-symposium-on-mathematical-statistics-and-probability/Some-methods-for-classification-and-analysis-of-multivariate-observations/chapter/Some-methods-for-classification-and-analysis-of-multivariate-observations/bsmsp/1200512992 .

Malinowski, M. (2016). Potencjał ludzki a efektywność ekonomiczna przedsiębiorstw – wykorzystanie metod taksonomicznych w ujęciu regionalnym. Studia Regionalne i Lokalne, (2), 87–109. https://doi.org/10.7366/1509499526405 .

Migdał-Najman, K. (2011). Ocena jakości wyników grupowania – przegląd bibliografii. Przegląd Statystyczny, 58(3–4), 281–299.

Mikuš, R., Máliková, L., & Lauko, V. (2016). An introductory study of perceptual marginality in Slovakia. Bulletin of Geography. Socio-economic Series, (34), 47–62. http://dx.doi.org/10.1515/bog-2016-0034 .

Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159–179. https://doi.org/10.1007/BF02294245 .

Nicholson, D., Vanli, O. A., Jung, S., & Ozguven, E. E. (2019). A spatial regression and clustering method for developing place-specific social vulnerability indices using census and social media data. International Journal of Disaster Risk Reduction, 38, Article 101224. https://doi.org/10.1016/j.ijdrr.2019.101224 .

Novotná, M., Šlehoferová, M., & Matušková, A. (2016). Evaluation of spatial differentiation in the Pilsen region from a socio-economic perspective. Bulletin of Geography. Socio-economic Series, (34), 73–90. https://doi.org/10.1515/bog-2016-0036 .

Peeples, M. A. (2011). R Script for K-Means Cluster Analysis. Retrieved May 27, 2021, from http://www.mattpeeples.net/kmeans.html .

R Core Team. (n.d.). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Retrieved August 30, 2020, from https://www.R-project.org/ .

Steinhaus, H. (1957). Sur la division des corps matériels en parties. Bulletin L’Académie Polonaise des Sciences, 4(12), 801–804. http://www.laurent-duval.eu/Documents/Steinhaus_H_1956_j-bull-acad-polon-sci_division_cmp-k-means.pdf .

Stukalo, N., & Simakhova, A. (2018). Global parameters of social economy clustering. Problems and Perspectives in Management, 16(1), 36–47. https://doi.org/10.21511/ppm.16(1).2018.04 .

Taylor, L. (2016). Case Study Methodology. In N. Clifford, M. Cope, T. Gillespie & S. French (Eds.), Key Methods in Geography (3rd edition; pp. 581–595). SAGE Publications.

Thorndike, R. L. (1953). Who belongs in the family? Psychometrika, 18(4), 267–276. https://doi.org/10.1007/BF02289263 .

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(2), 411–423. https://doi.org/10.1111/1467-9868.00293 .

Warchalska-Troll, A. (2018). Natura 2000 sites in the Polish Carpathians vs local development: inevitable conflict? eco.mont Journal of Mountain Protected Areas Research and Management, 10(2), 50–58. https://doi.org/10.1553/eco.mont-10-2s50 .

Warchalska-Troll, A. (2019). Do Economic Opportunities Offered by National Parks Affect Social Perceptions of Parks? A Study from the Polish Carpathians. Mountain Research and Development, 39(1), 37–46. https://doi.org/10.1659/MRD-JOURNAL-D-18-00055.1 .

Zhang, Y., Moges, S., & Block, P. (2016). Optimal Cluster Analysis for Objective Regionalization of Seasonal Precipitation in Regions of High Spatial–Temporal Variability: Application to Western Ethiopia. Journal of Climate, 29(10), 3697–3717. https://doi.org/10.1175/JCLI-D-15-0582.1 .
