A Data Mining project which focuses on the comparison between different un-supervised clustering algorithms.
The project was part of the Data Mining and Data Warehouses class, where it was ranked first among the same type of projects.
The code of the project is located in the clustering.R file
Authors: Nikolas Petrou, Salome Nikolaishvili
- This simple work compares the K-Means algorithm versus the Single-Link Hierarchical Clustering algorithm
- Mostly emphasized on the quality of the results and not on the performance
- Consists of examples of geodata where the weaknesses of each algorithm were clearly shown
- Programming language: R
- Dataset: World Cities database
- Data for cities around the world
- Geographical coordinates
- Each different sample corresponds to a different city
- The last update of the dataset took place in November 2020
- Only kept samples with non-null features
- Only the geographical coordinates (Latitude and Longitude) were kept and used as features
- Choice of cities of different countries - the cities of each country represent the actual-real clusters
- The choice of cities and countries was done in a manner such that the different expected behaviour of the clustering methods would have been easily visualized and proved
- Selected countries:
- India
- Iran
- Mongolia
- 648 cities (data samples)
- Selected countries:
- Papua New Guinea
- China
- Philippines
- 906 cities (data samples)
- During the evaluation of the K-Means algorithm, the elbow method showed that "optimal" number of clusters is three (as it was previously known since the cities used belong to three different countries)
- The Lloyd K-Means algorithm was used
- For more accurate assumptions, the algorithm has run 20 times in total, for different initial data-points as centers
- Selected centers for model with the smallest SSE
- Accurate clustering
- As expected since:
- The clusters form spherical - elliptical shapes
- The actual clusters are uniformly distributed in the space
- Algorithm always converges (in iterations <30)
- Execution time: 0.013 secs (nstart = 20)
- Poor clustering results
- As expected since:
- The actual clusters have significantly different sizes
- Non-spherical actual clusters
- Different density of actual clusters
- Algorithm always converges (in iterations <30)
- Execution time: 0.017 secs (nstart = 20)
- Used Euclidean Distance as a comparison metric between the different data-points
- The hierarchical tree that the algorithm had produced, was pruned in a manner such that the total clusters would have been three in total
- Poor clustering results
- A bridge (connection between data-points that occurred in the same cluster) was created between two nearby points-cities
- Isolation of Port Blair in a different, non-accurate cluster
- As expected since:
- The Port Blair is isolated (outlier) compared to the other cities of India
- The nearby cities between India and Iran created a bridge during the process
- Execution time: 0.033 secs
- Accurate clustering
- As expected since: Distances between countries (real clusters) are greater than the distances of the respective pairs of cities in each different country. That is, the algorithm finds a bridge to connect the cities of the same countries to the same cluster
- Execution time: 0.042 secs