Adam Juszczak

© Adam Juszczak. Article available under the CC BY-SA 4.0 license




Web scraping is a technique for obtaining information from websites automatically. As online shopping has grown in popularity, it has become an abundant source of information on the prices of goods sold by retailers. Besides significantly reducing the cost of price research, scraped data usually improve the precision of inflation estimates and allow prices to be tracked in real time. For this reason, web scraping is a popular research tool both for statistical offices (Eurostat, the UK Office for National Statistics, Belgium's Statbel) and universities (e.g. the Billion Prices Project conducted at the Massachusetts Institute of Technology). However, using scraped data to calculate inflation raises many challenges at the stages of data collection, processing, and aggregation. The aim of the study is to compare various methods of calculating price indices for clothing and footwear on the basis of scraped data. Using data from one of the largest online stores selling clothing and footwear, covering February 2018–November 2019, the author compared the results of the Jevons chain index, the GEKS-J index, and the GEKS-J expanding and updating window methods. The calculations confirmed a high chain index drift and showed that the expanding and updating window methods give very similar results (with the exception of the FBEW method).
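To make the index names in the abstract concrete, the following is a minimal Python sketch of a bilateral Jevons index, a chained Jevons index, and a GEKS-J index computed over a single fixed window. It uses the standard textbook definitions and is not the author's implementation: the function names, the dictionary-per-period data layout, and the toy prices are all assumptions made for illustration, and the expanding and updating window splicing methods (including FBEW) are not reproduced here.

```python
import math


def jevons(prices_a, prices_b):
    """Bilateral Jevons index from period a to period b.

    Unweighted geometric mean of price relatives over the products
    observed in both periods (the matched sample).
    """
    matched = prices_a.keys() & prices_b.keys()
    if not matched:
        raise ValueError("no products are present in both periods")
    log_sum = sum(math.log(prices_b[k] / prices_a[k]) for k in matched)
    return math.exp(log_sum / len(matched))


def chained_jevons(periods):
    """Index levels obtained by multiplying period-on-period Jevons links.

    The first period is the base (level 1.0).
    """
    levels = [1.0]
    for prev, curr in zip(periods, periods[1:]):
        levels.append(levels[-1] * jevons(prev, curr))
    return levels


def geks_jevons(periods):
    """GEKS-J index levels over one fixed window.

    The index from period 0 to t is the geometric mean, over every
    link period l in the window, of J(0, l) * J(l, t).
    """
    n = len(periods)
    levels = []
    for t in range(n):
        log_sum = sum(
            math.log(jevons(periods[0], periods[l]) * jevons(periods[l], periods[t]))
            for l in range(n)
        )
        levels.append(math.exp(log_sum / n))
    return levels


# Hypothetical toy data: one dict of {product_id: price} per month.
# The coat drops out in the second month and a new dress appears in the
# third, mimicking the product churn typical of clothing assortments.
toy_periods = [
    {"shirt": 100.0, "shoes": 200.0, "coat": 300.0},
    {"shirt": 80.0, "shoes": 210.0},
    {"shirt": 90.0, "shoes": 200.0, "coat": 240.0, "dress": 150.0},
]

chain_levels = chained_jevons(toy_periods)
geks_levels = geks_jevons(toy_periods)
```

When every product is available in every period, the Jevons index is transitive and the chained and GEKS-J levels coincide. The two can diverge only when the matched sample changes between periods, which is the situation in which chain drift arises.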


inflation, web scraping, online shopping, GEKS-J


C43, C49



© 2019-2022 Copyright by Główny Urząd Statystyczny (Statistics Poland), some rights reserved. Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0)