This section describes tooling and features included in Koku to assist developers contributing to Koku.
Koku makes use of environment variables to configure features and application capabilities.
In the repository, there is an .env.example
file with sample environment settings. The .env
file is used by Django's tools as well as Koku's scripting. It is the recommended way to configure your local development environment settings.
This section documents environment variables that may be of interest to developers.
DEVELOPMENT=(True|False)
- Enables development features. Not for
production use. DEVELOPMENT_IDENTITY=(JSON)
- A JSON Object
representing a User
When DEVELOPMENT
is not set, Koku expects to use an external service to handle authentication and access control for most API endpoints.
When DEVELOPMENT
is set, the development middleware is enabled. This allows passing in custom identity information into Koku for development and testing purposes using the DEVELOPMENT_IDENTITY
variable.
Example DEVELOPMENT_IDENTITY
object:
{
"identity": {
"account_number": "10001",
"type": "User",
"user": {
"username": "user_dev",
"email": "[email protected]",
"is_org_admin": False
"access": {
"aws.account": {
"read": ["1234567890AB", "234567890AB1"]
}
"gcp.account": {
"read": ["*"]
}
"gcp.project": {
"read": ["*"]
}
"azure.subscription_guid": {
"read": ["*"]
}
"openshift.cluster": {
"read": ["*"]
}
"openshift.project": {
"read": ["*"]
}
"openshift.node": {
"read": ["*"]
}
}
},
},
"entitlements": {"cost_management": {"is_entitled": True}},
}
Note: This example is pretty-printed for readability. When setting the enviroment variable, it should be collapsed to one single line.
Trino is an SQL query engine, but not an SQL database. What does that mean? Traditional databases typically consist of a query engine and a storage engine. Trino is just a query engine and does not store data. Inside of koku, our download workers will retrieve cost data from the cloud providers in the form of CSVs. During processing we convert these csv files to parquet files (Look for the following function in the codebase: convert_csv_to_parquet
).
After the files are converted they are then sent up to an S3 bucket. During runtime the Trino coordinator utilizes the existing HIVE metastore metadata and files residing in storage to query the files for data.
Trino Specific Env Vars
S3_BUCKET_NAME=koku-bucket
S3_ENDPOINT=http://kokuminio:9000
S3_ACCESS_KEY=kokuminioaccess
S3_SECRET=kokuminiosecret
TRINO_DATE_STEP=31
Additional Reads:
Koku utilize two different table types within Trino, external tables and managed tables. Understanding the difference between these two table types is essential to understanding how to migrate them.
An external table is a table that is stored externally. These tables point towards an external location in S3 or Mino (locally). The Hive metastore only contains the metadata schema so that it knows where to look during a query.
- External tables are identifiable in trino by using the
SHOW CREATE TABLE <table_name>
command. The external tables will have anexternal_location
like this:
SHOW CREATE TABLE aws_line_items;
...
WITH (
external_location = 's3a://koku-bucket/data/parquet/org1234567/AWS',
format = 'PARQUET',
partitioned_by = ARRAY['source','year','month']
)
Migration Considerations
- Since the data is externally located, dropping these tables will not result in data loss. Dropping the table will force the hive metastore to refresh the metadata schema picking up any changes.
Managed tables ARE NOT stored externally, and Hive assumes it owns the data for the managed table. The managed tables can be identified with the same SHOW CREATE TABLE <table_name>
command.
SHOW CREATE TABLE reporting_ocpawscostlineitem_project_daily_summary;
...
WITH (
format = 'PARQUET',
partitioned_by = ARRAY['aws_source','ocp_source','year','month','day']
)
Notice how an external_location
is not present in the return. That means the hive metastore assumes it owns that information.
Migration Considerations
- Changes to these tables should always come in the form of adding or dropping a column.
- Dropping these tables would result in DATA LOSS, therefore the
migrate_trino.py
script will not allow you to drop managed tables.
Our Trino migrations are run using the migrate_trino.py
script, and this script contains checks and balances regarding managed and external tables. Therefore, when new trino tables are added to the project, the corresponding list of external vs managed tables should be updated with the new table name.
MinIO is the file storage location we use locally. Since, we query the parquet files based off the S3 path, we use this tool locally to see the csv & parquet files are in their expected place. It will also allow you to download them.
Minio UI:
The username & password is set inside of the .env file
S3_ACCESS_KEY=kokuminioaccess
S3_SECRET=kokuminiosecret
This is the brain of the Trino installation and the node the client connects to when submitting statements for execution. The coordinator keeps track of the activity on each worker and coordinates the execution of a query.The coordinator creates a logical model of a query involving a series of stages, which is then translated into a series of connected tasks running on a cluster of Trino workers.
You can use the trino UI to see the queries run in trino:
You can also manually run queries inside of the trino container
docker exec -it trino trino --catalog hive --schema org1234567 --user admin --debug
SHOW CREATE TABLE <table_name>;
Additional Reads:
Connect to PostgreSQL
PGPASSWORD=postgres psql postgres -U postgres -h localhost -p 15432
Another alternative is to configure a connection service file.
Example ~/.pg_service.conf
:
[koku_db]
host=localhost
dbname=postgres
port=5432
user=postgres
password=postgres
You can then connect through setting the PGSERVICE
variable:
PGSERVICE=koku_db psql
Commonly used psql commands
- Set search Schema:
SET search_path TO org1234567
- find datatable:
\dt
or\dt reporting_ocpgcp*
- Tidy up display:
\x auto
- List columns for a table:
\d+ reporting_ocpusagelineitem_daily_summary
Commonly used SQL pieces
Search for Tag Key
WHERE pod_labels ? 'environment';
Extensions
We use PostgreSQL's uuid-ossp extension to generate uuids. However, this extension is not enabled by default when connecting through a terminal. If you ever need the extension locally, it can can be enabled with the following command:
ALTER EXTENSION "uuid-ossp" SET SCHEMA org1234567;
Example:
postgres=# SELECT uuid_generate_v4();
uuid_generate_v4
--------------------------------------
cde6c80d-ac41-42a2-b1db-bf941db2b7c8
We have a development script that will replace the Jinja variables inside of a template file.
Example
python3 dev/scripts/render_templated_sql.py -f /path/to/template.sql -b 11
The following code snippit will provide you with queries generated by Django that are able to be copy and pasted into PostgreSQL.
from django.db import connection
readable_queries = []
for dikt in connection.queries:
for key, item in dikt.items():
item = item.replace("\"", "")
item = item.replace("\\", "")
readable_queries.append({key: item})
LOG.info(readable_queries)
If you are debugging queries generated for the report endpoints, a good place to put this code is in the ReportView
class since it is the parent class for all the views.
docker exec -it koku_redis redis-cli
127.0.0.1:6379> select 1
OK
127.0.0.1:6379[1]> keys *
Additionally, here is a small example of how to test Redis cache changes.
The default credentials for Unleash are:
username: admin
password: unleash4all
- Add the variable to the necessary kustomize/patches yaml files.
- name: TAG_ENABLED_LIMIT
value: ${TAG_ENABLED_LIMIT}
- Then add the variable to the
parameters:
list in the base.yaml file.
- displayName: Enable Tags Limit
name: TAG_ENABLED_LIMIT
value: "200"
- Run the
make clowdapp
command to generate the mainclowdapp.yaml
file.
Example:
This is a small script to iterate over schema and validate any data in Trino also exists in Postgres summary tables.
Examples commands and results:
- This
python ./dev/scripts/data_integrity.py -g "gabi_url" -u "trino-url" -t "my-token" -c "my.cert" -k "my.key" -s "2023-11-07" -e "2023-11-08" -l
will return a list of all active Postgres Schema's
["schema1", "schema2", "schema3"]
- This
python ./dev/scripts/data_integrity.py -g "gabi_url" -u "trino-url" -t "my-token" -c "my.cert" -k "my.key" -s "2023-11-07" -e "2023-11-08"
will return the following snippet stating if there is complete data or any tables with missing data for particular days.
{'schema1': {'PG': {'reporting_awscostentrylineitem_daily_summary': 'All data complete for table: reporting_awscostentrylineitem_daily_summary'}, 'TRINO': 'PG data complete so skipping trino query'}},
{'schema2': {'PG': {"reporting_awscostentrylineitem_daily_summary": "Missing data in reporting_awscostentrylineitem_daily_summary for days: ['2023-10-26', '2023-10-27', '2023-10-28', '2023-10-29', '2023-10-30', '2023-10-31']"}, 'TRINO': {"aws_line_items": "All data complete for table: aws_line_items"}}, 'schema3': {'PG': {'reporting_awscostentrylineitem_daily_summary': 'All data complete for table: reporting_awscostentrylineitem_daily_summary'}, 'TRINO': 'PG data complete so skipping trino query'}}
- This
python ./dev/scripts/data_integrity.py -g "gabi_url" -u "trino-url" -t "my-token" -c "my.cert" -k "my.key" -s "2023-11-07" -e "2023-11-08" -x "schema1" "schema3"
will return similar to above but run for specific schema
'schema1': {'PG': {'reporting_awscostentrylineitem_daily_summary': 'All data complete for table: reporting_awscostentrylineitem_daily_summary'}, 'TRINO': 'PG data complete so skipping trino query'}},
'schema3': {'PG': {'reporting_awscostentrylineitem_daily_summary': 'All data complete for table: reporting_awscostentrylineitem_daily_summary'}, 'TRINO': 'PG data complete so skipping trino query'}}
Flower is a tool for monitoring Celery clusters. It provides detailed information about the status of workers and tasks.
Flower is installed with the dev dependencies but it is not run by default. To start Flower, run
celery -A koku flower
Open http://localhost:5555 to see Celery details.
See the Flower documentation for detailed usage information.