Merge pull request #18 from UKGovernmentBEIS/MEESALPHA-310-readme-docs
MEESALPHA-310 readme docs
scottywakefield authored Sep 25, 2024
2 parents e7faa49 + 300e185 commit ae0c97d
Showing 4 changed files with 104 additions and 8 deletions.
12 changes: 10 additions & 2 deletions README.md
@@ -1,3 +1,11 @@
# DESNZ-MEES
# DESNZ-MEES Alpha

## Getting started
## Warning
This is experimental code and not intended for commercial use.

## Repository Structure
This repo consists of two entirely independent code bases:
- [data analysis](./src/data-analysis/) - Python and Jupyter scripts to load and analyse data, the DMS equivalent of this repo.
- [data tool](./src/data-tool/) - ASP.NET app to display the data.

They share a single repository because we were only supplied with a single repo. As this is an alpha, getting the repo structure right is less important than proving the individual areas of technology, so the time was spent on the technology.
46 changes: 46 additions & 0 deletions src/data-analysis/README.md
@@ -0,0 +1,46 @@
# DESNZ-MEES Alpha - Data Analysis

## Warning
This is experimental code and not intended for commercial use.

## How it Works
This is a collection of Python and Jupyter scripts that can be run to automate the

- downloading (`XX_download_files.py`)
- unzipping (`XX_unzip-file.py`)
- analysis (`XX-analysis.ipynb`)
- conversion to parquet (`XX-create-parquets.ipynb`)

of data from a variety of sources (see Confluence or HLD for more info on each of the data sources) that together provide the required data to identify non-domestic, rented properties that are below the Minimum Energy Efficiency Standard (MEES).
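
As a rough illustration of the first two steps, the sketch below downloads a file and unpacks it using only the Python standard library. The function names `download_file` and `unzip_file` are hypothetical and are not taken from the repo's scripts, which may work differently.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, destination: Path) -> Path:
    """Stream the response body of `url` to `destination` (hypothetical helper)."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in chunks so multi-gigabyte files are never held in memory.
        shutil.copyfileobj(response, out)
    return destination


def unzip_file(zip_path: Path, target_dir: Path) -> list[Path]:
    """Extract every member of `zip_path` into `target_dir`.

    Returns the paths of the extracted files (directory entries are skipped).
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        names = archive.namelist()
    return [target_dir / name for name in names if not name.endswith("/")]
```

Streaming the download in chunks matters here because of the file sizes described under Local Running.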

The folder structure organises the data sources by their respective source organisation, for example `hmlr` contains the `ccod`, `ocod` and `poly-mock` data sources. Note that this rearrangement into sub-folders has broken some paths.

There is one more important file, `all-db-import.ipynb`, which loads each parquet file into a destination DB.

The intention of this code is not only to get a POC working but also to provide a head start for building this functionality in DMS on Databricks using `polars`. Therefore polars was used as the data analysis library (though pandas was used until I found out that DMS uses polars).

## Installation
### Pre-requisites
- Python runtime
- Jupyter Server

I found it much easier to install and run these on a Mac than on a Windows machine. Even using WSL on Windows it was still painful.

### Config
- Copy the dev.env file to an .env file. You will have to populate this file with any missing values for specific paths, keys, tokens etc.
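
A minimal sketch of how the copied `.env` file could be checked for values that still need populating. It assumes every placeholder follows the `"get from ..."` pattern seen on `OS_TOKEN` in `dev.env`; the helper names `read_env_file` and `find_unpopulated` are hypothetical and not part of the repo.

```python
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, because URL values contain "=" too.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def find_unpopulated(values: dict[str, str]) -> list[str]:
    """Return keys that are empty or still hold a 'get from ...' placeholder.

    Assumption: placeholders look like OS_TOKEN="get from os" in dev.env.
    """
    return sorted(
        key
        for key, value in values.items()
        if not value or value.lower().startswith("get from")
    )
```

Running `find_unpopulated(read_env_file(Path(".env")))` before the download scripts would list any keys that still need a real value.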

## Local Running
- Due to the size of the downloaded files you will need a lot of disk space. They currently use `60GB`(!!) on my local machine. Once you have the parquet files you could delete everything from the downloads dir, unless you need the files for further data matching activities, which I did. Please be careful to keep `.gitignore` updated so these huge files are not committed to the repo.
- Also due to the size of the files, some of the operations took a long time. Building the poly-mock data was one of the longest; it ran overnight.
## External Dependencies
- Each of the data imports will require an account created in order to gain a token for access.
## Technologies
- Python
- Polars
- Jupyter
## Known Issues
- no attempt has been made to get the correct file download per month. The most recent file at the time of writing is currently hardcoded.
- VOA data is meant to be augmented with updates; this was not done.
- use of subfolders for each script has broken some of the paths. Prepending with '../' should resolve the issue in most cases. At the same time you should start using the 'OUTPUT_FOLDER' and 'DOWNLOADS_FOLDER' env variables so that the scripts are more robust between environments.
- the `XX-create-parquets.ipynb` files can easily be converted to py files but this has not been done. It was helpful when running in Jupyter to debug what was going on.
- the resultant DB from `all-db-import.ipynb` has most of the columns set to `text` data type and no indexes or keys. Setting the types, indexes and keys will drastically improve query performance. This was done to a point but abandoned due to time constraints.
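
The broken-paths issue above could be addressed along these lines. This is a sketch only: `resolve_folder` is a hypothetical helper, and the default folder name passed in is an assumption rather than something defined by the repo. Relative values are resolved against an explicit base directory instead of the current working directory, so a script behaves the same whichever sub-folder it is run from.

```python
import os
from pathlib import Path


def resolve_folder(env_name: str, default: str, base_dir: Path) -> Path:
    """Return the folder named by the env variable `env_name`.

    Falls back to `default` when the variable is unset. Absolute values are
    returned unchanged; relative values are resolved against `base_dir`
    rather than the current working directory.
    """
    raw = Path(os.environ.get(env_name, default)).expanduser()
    return raw if raw.is_absolute() else (base_dir / raw).resolve()
```

In a script, `base_dir` could be derived from the script's own location, for example `Path(__file__).resolve().parent.parent` for a file one folder below the data-analysis root, and the result used as `resolve_folder("DOWNLOADS_FOLDER", "downloads", base_dir)`.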
2 changes: 2 additions & 0 deletions src/data-analysis/dev.env
@@ -6,6 +6,8 @@ EPC_BASE_URL="https://epc.opendatacommunities.org/api/v1/non-domestic/search"
OS_TOKEN="get from os"
OS_FILENAME="downloads/AB76GB_CSV.zip"
OS_BASE_URL="https://api.os.uk/downloads/v1/dataPackages/0040169528/versions/6571173/downloads?fileName=AB76GB_CSV.zip&key="
# absolute path to os files
OS_FILES_PATH="/path-to-repo/src/data-analysis/downloads/os_abp_csv"

VOA_FILENAME="downloads/euk-englandwales-ndr-2023-listentries-compiled-epoch-0008-baselin-csv.zip"
VOA_BASE_URL="https://voaratinglists.blob.core.windows.net/downloads/uk-englandwales-ndr-2023-listentries-compiled-epoch-0008-baseline-csv.zip"
52 changes: 46 additions & 6 deletions src/data-tool/README.md
@@ -1,8 +1,48 @@
set up asp.net user secrets
set up dev certificate:
https://github.com/dotnet/dotnet-docker/blob/main/samples/run-aspnetcore-https-development.md
# DESNZ-MEES Alpha - Data Tool

## Warning
This is experimental code and not intended for commercial use.

why basic auth??????
single user, no further ui work, 'ok' if forced through https.
defo only for POC, NOT FOR PROD CODE!
## How it Works
This is a basic ASP.NET Core web app with a couple of quirks:
- the db is defined and loaded by the scripts in the [data-analysis](../data-analysis/) folder. Use the `dotnet ef dbcontext scaffold` command to update the DbContext and related entities.
- all the data is read-only
- the app can be run locally as a pair of docker containers, one for the web app and one for the postgres db.
- GitHub Actions are used to ensure the app builds before a PR is merged and to deploy the app to an Azure App Service when `main` is updated. Azure uses the docker image to create a containerised App Service. Bear this in mind if you start changing ports. Azure also expects to be able to ping port 80 on the container as a health check.
- there are currently no tests

## Installation
### Pre-requisites
- a clone of this repo
- an IDE
- docker desktop or equivalent
- a PostgreSQL DB already loaded with the relevant data
- .NET User Secrets for the Web.csproj
- a [dev ssl certificate](https://github.com/dotnet/dotnet-docker/blob/main/samples/run-aspnetcore-https-development.md) for running via https locally, keep the password in the user secrets config below.
### Config
- The user secrets file for running the Web app locally should be updated to contain the following items:
```json
{
"DataConfiguration:DbConnectionString": "Server=db;Port=5432;Database=data_tool;User Id=postgres;Password=YOUR_DB_PASSWORD;",
"Kestrel:Certificates:Development:Password": "YOUR_CERT_PASSWORD",
"WebConfiguration:Username": "YOUR_WEB_USER",
"WebConfiguration:Password": "YOUR_WEB_PASSWORD"
}
```
- The config for the Azure App Service has already been defined and does not need to be modified.
## Local Running
- use `docker compose -f compose.yaml -p data-tool build` and then `docker compose -f compose.yaml -p data-tool up -d` to ensure any code changes are captured in the docker image used by the `compose.yaml` file. Change the Server name in the postgres connection string to match the server name defined there.
## External Dependencies
- the DB is created by the scripts in the [data-analysis](../data-analysis/) folder. Modify the Database in the postgres connection string accordingly.
- it uses GitHub Actions to deploy the docker image to an Azure Container Registry
- Azure is used as the host.
## Technologies
- .NET 8
- EF Core
- Docker
- GOV.UK Design System
## Known Issues
- why basic auth? There is a single user and no further UI work is planned, so it is 'ok' if forced through https. Definitely only for a POC, NOT FOR PROD CODE!
- no tests, though it has been written using .NET Core IOC so that tests can be added at a future date
- the LA parameter to the SQL query should really be selectable by the user
- GDS CSS and JS are hard-coded
