Comparison Study of Distance Methods in Nonlinear Panel Data Clustering with K-Means Method
Keywords:
Coronavirus Disease, Calinski-Harabasz, Dynamic Time Warping, K-means
Abstract
Cluster analysis groups objects according to the similarity of their characteristics. This study applies cluster analysis to nonlinear panel data, using Indonesian Coronavirus Disease (COVID-19) data to group provinces by their number of active positive cases with the K-means method. The first stage is a simulation to identify the best distance method for nonlinear panel data. The distance methods compared are Euclidean, Manhattan, Maximum, Frechet, and Dynamic Time Warping (DTW). All distance methods were run on 36 scenarios drawn from four data-generating models; the Maximum distance performed best, with 20 highest-accuracy results, more than any other distance method. The Maximum distance was then applied to the real data. The optimal number of clusters was three, with a Calinski-Harabasz (CH) criterion value of 143,459. Cluster A has 30 members, Cluster B has three members, and Cluster C has one member, DKI Jakarta Province.
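To make the compared distances concrete, the following is a minimal Python sketch of four of them (Euclidean, Manhattan, Maximum, and a basic DTW) applied to two trajectories of equal length. It is an illustration only, not the authors' implementation: the function names and the two example series are hypothetical, the DTW variant shown uses an absolute-difference local cost with no window constraint, and the Frechet distance is omitted.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared pointwise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute pointwise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    """Largest absolute pointwise difference (Chebyshev distance)."""
    return max(abs(a - b) for a, b in zip(x, y))

def dtw(x, y):
    """Basic dynamic time warping cost with |a - b| as the local cost.

    cost[i][j] holds the cheapest alignment of x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y repeats
                cost[i][j - 1],      # y advances, x repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Hypothetical trajectories for two units observed at five time points.
series_a = [0.0, 1.0, 4.0, 9.0, 16.0]
series_b = [0.0, 2.0, 3.0, 9.0, 12.0]

for name, fn in [("euclidean", euclidean), ("manhattan", manhattan),
                 ("maximum", maximum), ("dtw", dtw)]:
    print(name, fn(series_a, series_b))
```

The Maximum distance depends only on the single time point at which two trajectories differ most, whereas the Euclidean and Manhattan distances accumulate differences over all time points and DTW additionally allows the time axes to be stretched before comparison.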