Skip to content

bellingcat/geoclustering

Repository files navigation

geoclustering

๐Ÿ“ command-line tool for clustering geolocations.

Features

  • Uses DBSCAN or OPTICS to perform clustering.
  • Outputs clustering results as json, txt and geojson.
  • Creates a kepler.gl visualization of clusters.

Clustering Method

A cluster is created when a certain number of points (defined with --size) each are within a given distance (defined with --distance) of at least one other point in the cluster.

Install

Install with pip:

# with kepler.gl visualization support
pip install geoclustering[full]

# only text-based output
pip install geoclustering

If the full install fails, you might need to install kepler.gl build dependencies:

# macos
brew install proj gdal

Usage

Usage: geoclustering [OPTIONS] FILENAME

  Tool to cluster geolocations. A cluster is created when a certain number of
  points (defined with --size) each are within a given distance (defined with
  --distance) of at least one other point in the cluster. Input is supplied as
  a csv file. At a minimum, each row needs to have a 'lat' and a 'lon' column.
  Other rows are reflected to the output.

Options:
  -d, --distance FLOAT            (in km) Max. distance between two points in
                                  a cluster.  [required]
  -s, --size INTEGER              Min. number of points in a cluster.
                                  [required]
  -o, --output PATH               Output directory for results. Default:
                                  ./output
  -a, --algorithm [dbscan|optics]
                                  Clustering algorithm to be used. `optics`
                                  produces tighter clusters but is slower.
                                  Default: dbscan
  --open                          Open the generated visualization in the
                                  default browser automatically.
  --debug                         Print debug output.
  --help                          Show this message and exit.

Input

Inputs are supplied as a .csv file. At a minimum, each row needs to have a lat and a `lon`` column. Other rows are reflected to the output.

id,name,lat,lon
1,Bonnibelle Mathwen,40.1324085,64.4911086
...

Output

If at least one cluster was found, the tool outputs a folder with output as json, geojson, txt, csv files. A kepler.gl html file is generated as well.

JSON

Encodes an array of clusters, each containing an array of points.

[
  {
    "cluster_id": 0,
    "points": [
      {
        "id": 9,
        "name": "Rosanna Foggo",
        "lat": -6.2074293,
        "lon": 106.8915948
      }
    ]
  }
]

GeoJSON

Encodes a single FeatureCollection, containing all points as Feature objects.

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.891595,
          -6.207429
        ]
      },
      "properties": {
        "id": 9,
        "name": "Rosanna Foggo",
        "cluster_id": 0
      }
    }
  ]
}

Text

Encodes cluster as blocks separated by a newline, where each line in a cluster block contains one point.

Cluster 0
id 9, name Rosanna Foggo, lat -6.2074293, lon 106.8915948

// ...

CSV

Encodes each event in one line with cluster_id information associated.

cluster_id,name,lat,lon
9,Rosanna Foggo,-6.2074293,106.8915948
...

kepler.gl

kepler.gl instance

Develop

It is assumed that you are using Python3.9+. It is encouraged to setup a virtualenv for development.

    # install dependencies & dev-dependencies
    # PIP
    pip install -e .[dev,full]
    # PIPENV
    pipenv install --dev -e .

    # install a git hook that runs the code formatter before each commit.
    pre-commit install

We use Black as our code formatter. If you don't want to use the pre-commit hook, you can run the formatter manually or via an editor plugin.