This service is used by the signals application to predict the category that a signal belongs to. It achieves this using existing data, either from an old system or from signals itself. See the "Input data" section below for more information on the required format.
The model is based on sklearn and is trained by removing things like stop words and special characters, and subsequently counting the stemmed versions of the remaining words. The outcome of that process is then transformed to tf-idf format. Finally, to form a statistical model, regression analysis is performed in the form of logistic regression.
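As a minimal sketch of what such a pipeline looks like, assuming Dutch input text (the example request below uses "afval"); the exact stop-word list, stemmer, and hyperparameters used by this project may differ:

```python
# Sketch of the training pipeline described above; not the project's
# actual code. Stop words and stemmer choice are assumptions.
import re

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

stemmer = SnowballStemmer("dutch")
STOP_WORDS = {"de", "het", "een", "en", "er"}  # illustrative only


def preprocess(text):
    # Drop special characters, remove stop words, and stem what remains.
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(stemmer.stem(w) for w in words if w not in STOP_WORDS)


pipeline = Pipeline([
    ("count", CountVectorizer(preprocessor=preprocess)),  # word counts
    ("tfidf", TfidfTransformer()),                        # counts -> tf-idf
    ("clf", LogisticRegression()),                        # logistic regression
])

# pipeline.fit(texts, labels) would train one such model per category level.
```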
To get a prediction from the model there is an API, built using Flask. See section "Running service" below for more information.
Navigate to the root directory, then pull the relevant images and build the services:

```
docker-compose build
```
The CSV input file must have at least the following columns:

| column | description |
|---|---|
| Text | message |
| Main | Main category slug |
| Sub | Sub category slug |

The columns must be in the order `Text,Main,Sub`; no header is required.
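For illustration, a couple of hypothetical rows (the message text and the second row's slugs are made up; use the slugs from your own category set):

```
Er ligt afval op straat,afval,huisafval
Lantaarnpaal kapot,wegen,straatverlichting
```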
To train the models, place the CSV file in the `input/` directory and run the following commands:

```
docker-compose run --rm train --filepath=/input/{name of csv file} --columns=Main
docker-compose run --rm train --filepath=/input/{name of csv file} --columns=Main,Sub
```
This will produce a set of files pickled using joblib, along with files that can be used to verify the accuracy of the model in the form of a confusion matrix. The files will be saved in the `output/` directory. In the example above this would result in:

```
/output/main_model.pkl
/output/main_labels.pkl
/output/main_dl.csv
/output/main-matrix.csv
/output/main-matrix.pdf
/output/main_sub_model.pkl
/output/main_sub_labels.pkl
/output/main_sub_dl.csv
/output/main_sub-matrix.csv
/output/main_sub-matrix.pdf
```
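For a quick sanity check of the trained artifacts, a minimal sketch using joblib (the exact objects stored in the pickles are project-specific; this assumes the model accepts raw text, as in the API example further below):

```python
# Load the pickled model and labels produced by the training step above.
import joblib

model = joblib.load("output/main_model.pkl")
labels = joblib.load("output/main_labels.pkl")

# Assumption: the pickled model is a full text pipeline.
print(model.predict(["afval op straat"]))  # predicted main category
```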
The service is a standalone API built on the Flask framework. In order to use the trained model, the service needs the pickle files. Copy the pickle files listed below into the `/models` directory, or into the directory you have configured through the `MODELS_DIRECTORY` environment variable.
| output/ | models/ | description |
|---|---|---|
| main_model.pkl | main_model.pkl | model for main category |
| main_sub_model.pkl | sub_model.pkl | model for sub category |
| main_slugs.pkl | main_slugs.pkl | slugs for main category |
| main_sub_slugs.pkl | sub_slugs.pkl | slugs for sub category |
In order for the API to produce useful results for the signals application, it is important to provide a base URL for the backend portion of the application. This can be achieved by setting the `SIGNALS_CATEGORY_URL` environment variable.
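For example (the value here is an assumption based on the URL shape in the example response below; adjust it to your own backend):

```
export SIGNALS_CATEGORY_URL=http://localhost:8000/signals/v1/public/terms/categories
```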
To start the Flask API, run:

```
docker-compose up -d web
```
To get a prediction, make a POST request to http://localhost:8140/signals_mltool/predict with a body similar to:

```json
{
    "text": "afval"
}
```

This should give a response with a body similar to:
```json
{
    "hoofdrubriek": [
        [
            "http://localhost:8000/signals/v1/public/terms/categories/afval"
        ],
        [
            0.7629584838555712
        ]
    ],
    "subrubriek": [
        [
            "http://localhost:8000/signals/v1/public/terms/categories/afval/sub_categories/huisafval"
        ],
        [
            0.56709391826473
        ]
    ]
}
```
As can be seen in the example response, the URLs are constructed from the base URL in the same way that the signals application constructs them.
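For a quick check from a script, a minimal sketch using the requests library (URL, port, and payload taken from the example above):

```python
# Call the prediction endpoint and unpack the response shown above.
import requests

response = requests.post(
    "http://localhost:8140/signals_mltool/predict",
    json={"text": "afval"},
)
response.raise_for_status()

prediction = response.json()
# Each key maps to a pair of lists: candidate category URLs and their
# probabilities, as in the example response.
print(prediction["hoofdrubriek"][0][0], prediction["hoofdrubriek"][1][0])
print(prediction["subrubriek"][0][0], prediction["subrubriek"][1][0])
```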
This project uses isort for import sorting. Before committing changes, run the following command to ensure correct import sorting:

```
poetry run isort .
```

Adding `--diff` and `--check-only` will not fix the issues found, but will instead show the fixes that would be applied:

```
poetry run isort --diff --check-only .
```
We also use flake8 for code linting. Before committing changes, run the following command to ensure code compliance with PEP 8 style guidelines and identify potential issues:

```
poetry run flake8 .
```
Remember to address any identified issues before committing your changes.