
Practice_ETL_QA_analyst

This project uses GitHub Actions CI/CD, Excel, SQL, Docker, and Python (pytest).

Scenario (user story)

As a data analyst, I want customer order data to be correctly loaded into the database so that I can analyze purchasing patterns.

Acceptance Criteria:

The Customer_ID must be unique.
The Product_ID must exist in the Products table.
The Quantity must not be negative.
The Order_Date must be valid and in the correct format (YYYY-MM-DD).
The Customer_Name must not be empty.

Transformation Logic (ETL):

  • Extract: Data is extracted from Excel files containing order details.
  • Transform:
    • Remove leading/trailing spaces from Customer_Name.
    • Ensure Order_Date is in the correct format (YYYY-MM-DD).
    • Handle missing or invalid Quantity values (e.g., replace negative values with zero).
  • Load: Data is loaded into the Orders table in SQLite.
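The Transform step can be sketched in pandas as below. This is a minimal illustration under assumed column names from the acceptance criteria; the repo's actual load_data.py may implement it differently.

```python
import pandas as pd

def transform_orders(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Remove leading/trailing spaces from Customer_Name.
    out["Customer_Name"] = out["Customer_Name"].str.strip()
    # Normalize Order_Date to YYYY-MM-DD; unparseable dates become NaN.
    out["Order_Date"] = pd.to_datetime(
        out["Order_Date"], errors="coerce"
    ).dt.strftime("%Y-%m-%d")
    # Replace missing or negative Quantity values with zero.
    out["Quantity"] = out["Quantity"].fillna(0).clip(lower=0)
    return out
```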

Test Plan for Data Quality Testing (unit test)

TC_001 – Validate Customer ID Uniqueness
  Steps: Execute the validate_customer_id_unique query, fetch the results into a DataFrame, and check for duplicate Customer_IDs.
  Expected Result: The DataFrame is empty, indicating no duplicates.
  Risk Level: Critical – affects data integrity
  Test Data: Customer_ID 1234 (used for two orders), Order_Date "2024-12-01", Product_ID 567, Quantity 2

TC_002 – Validate Correct Date Format
  Steps: Execute the validate_order_date_format query, fetch the results into a DataFrame, and validate that Order_Date is in the correct format (YYYY-MM-DD).
  Expected Result: The DataFrame contains no rows with invalid date formats.
  Risk Level: High – affects date parsing and reporting
  Test Data: Customer_ID 1234, Order_Date "12/01/2024" (invalid format), Product_ID 567, Quantity 2

TC_003 – Validate Missing Customer Name
  Steps: Execute the get_orders_with_missing_customer_name query, fetch the results into a DataFrame, and check for missing Customer_Name values.
  Expected Result: No missing customer names.
  Risk Level: High – affects order processing
  Test Data: Customer_ID 1234, Order_Date "2024-12-01", Product_ID 567, Quantity 2, Customer_Name NULL

TC_004 – Validate Negative Quantity Orders
  Steps: Execute the get_orders_with_negative_quantity query, fetch the results into a DataFrame, and check for negative Quantity values.
  Expected Result: No rows with negative quantities.
  Risk Level: High – affects business logic and financial calculations
  Test Data: Customer_ID 1234, Order_Date "2024-12-01", Product_ID 567, Quantity -5

TC_005 – Validate Order Date Range (December 2024 only)
  Steps: Fetch all Order_ID and Order_Date values from the Orders table, check each order's date format, and ensure the date falls within 2024-12-01 to 2024-12-31.
  Expected Result: Orders with an Order_Date outside the range, or with an invalid date format, are flagged.
  Risk Level: High – invalid or out-of-range dates can affect reporting and processing
  Test Data:
    Customer_ID 1234, Order_Date "2024-12-01", Product_ID 567, Quantity 10 (valid, in range)
    Customer_ID 5678, Order_Date "2024-11-01" (out of range)
    Customer_ID 91011, Order_Date "InvalidDate" (invalid format)

TC_006 – Validate Invalid Email Format
  Steps: Execute the get_invalid_email_customers query, fetch the results into a DataFrame, and check for invalid email formats.
  Expected Result: No rows with invalid emails.
  Risk Level: High – affects customer communication
  Test Data: Customer_ID 1234, Order_Date "2024-12-01", Product_ID 567, Quantity 2, Customer_Email "invalid_email"

TC_007 – Ensure Unique Product_ID in Order
  Steps: Execute the get_orders_with_duplicate_product_id query, fetch the results into a DataFrame, and check for duplicate Product_IDs in orders.
  Expected Result: The DataFrame is empty, indicating no duplicates.
  Risk Level: Critical – affects data integrity
  Test Data: Customer_ID 1234, Order_Date "2024-12-01", Product_ID 567 (duplicate), Quantity 2

TC_008 – Ensure Product_Name Cannot Be NULL
  Steps: Execute the get_orders_with_null_product_name query, fetch the results into a DataFrame, and check for NULL values in Product_Name.
  Expected Result: No rows with a NULL Product_Name.
  Risk Level: High – affects order completeness
  Test Data: Customer_ID 1234, Order_Date "2024-12-01", Product_ID 567, Quantity 2, Product_Name NULL

TC_009 – Validate Referential Integrity Between Orders and Products
  Steps: Execute the get_invalid_product_references query, fetch the results into a DataFrame, and check for Product_ID references that do not exist in Products.
  Expected Result: No rows with invalid Product_ID references.
  Risk Level: Critical – affects data integrity
  Test Data: Customer_ID 1234, Order_Date "2024-12-01", Product_ID 999 (non-existent), Quantity 2
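As a rough illustration, one of these cases (TC_004, negative Quantity) could look like the pytest sketch below, run against an in-memory SQLite database. The repo's real helper lives in sql/sqlite_db/db_queries.py and may have a different signature; this version is an assumption for illustration.

```python
import sqlite3
import pandas as pd

def get_orders_with_negative_quantity(conn):
    # Rows returned by this query are validation failures (TC_004).
    return pd.read_sql_query("SELECT * FROM Orders WHERE Quantity < 0", conn)

def test_no_negative_quantity():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Orders (Customer_ID INTEGER, Order_Date TEXT, "
        "Product_ID INTEGER, Quantity INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Orders VALUES (?, ?, ?, ?)",
        [(1234, "2024-12-01", 567, 2), (5678, "2024-12-02", 568, 3)],
    )
    df = get_orders_with_negative_quantity(conn)
    assert df.empty, f"Found orders with negative Quantity:\n{df}"
```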

To run a specific test case:

Run by exact function name:

pytest -s tests/test_data_unittest.py::test_invalid_product_id

This will run only the test_invalid_product_id test case in tests/test_data_unittest.py.

Running the Project Locally

1. Create and Activate a Virtual Environment

python3 -m venv venv

Activate the virtual environment:

source venv/bin/activate

2. Install Project Dependencies

pip install -r requirements.txt

3. Start the Docker Containers

docker-compose down
docker-compose up -d

4. Set Up the Database

python sql/sqlite_db/setup_db.py

5. Load Data into the Database

python tests/load_data.py

6. Run the Tests

pytest tests/test_data_unittest.py

Example of a failing result after running the tests:

date result

map result

Data Completeness Testing:

Objective: Ensure that all expected data is loaded into the target system without any loss.
Test Case: Compare the record counts between the source and target tables to verify completeness.
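A minimal sketch of this count comparison follows. The real tests/test_load_correct.py reads the Excel source and the SQLite target; here a small DataFrame and an in-memory database stand in for them, which is an assumption for illustration.

```python
import sqlite3
import pandas as pd

def record_counts_match(source_df: pd.DataFrame,
                        conn: sqlite3.Connection,
                        table: str = "Orders") -> bool:
    # Completeness check: source row count must equal target row count.
    target_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return len(source_df) == target_count

# Demo: load a small source frame into an in-memory target and compare counts.
source = pd.DataFrame({"Customer_ID": [1234, 5678], "Quantity": [2, 3]})
conn = sqlite3.connect(":memory:")
source.to_sql("Orders", conn, index=False)
```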

Run the test

pytest tests/test_load_correct.py

load result

SCD Type 2 testing

Proposed Workflow

1. Raw Source Database (etl.db)

  • Schema: Customer_ID, Customer_Name, Order_Date, Product_ID, Quantity, Email.

2. Dimension Database (etl_dm.db)

  • Schema: EID (Primary Key), Customer_ID, Customer_Name, Order_Date, Product_ID, Quantity, Email, Start_Date, End_Date, Active.

3. Logic in create_dm.py

  • Check if data exists in the dimension database:
    • Non-existent Record: Insert new data with:
      • Start_Date = today
      • End_Date = 9999-12-31
      • Active = Y
    • Existing Record:
      • Compare Product_ID, Quantity, Order_Date:
      • If data is different:
        • Update the End_Date of the old record to today and set Active = N.
        • Insert the updated record as a new row with Start_Date = today and Active = Y.

4. Verification in Tests

  • Validate SCD Type 2 logic by filtering EID and comparing historical records.
  • Ensure schema alignment between etl.db and etl_dm.db.
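The workflow above can be sketched in Python with sqlite3. This is a hedged sketch of the SCD Type 2 logic, not the actual create_dm.py; the dimension table name "Customer_Dim" and the demo values are assumptions.

```python
import sqlite3
from datetime import date

def scd2_upsert(conn, row):
    """Apply SCD Type 2 logic for one source row (dict with Customer_ID,
    Customer_Name, Order_Date, Product_ID, Quantity, Email)."""
    today = date.today().isoformat()
    current = conn.execute(
        "SELECT EID, Product_ID, Quantity, Order_Date FROM Customer_Dim "
        "WHERE Customer_ID = ? AND Active = 'Y'",
        (row["Customer_ID"],),
    ).fetchone()
    if current is not None:
        eid, pid, qty, odate = current
        if (pid, qty, odate) == (row["Product_ID"], row["Quantity"], row["Order_Date"]):
            return  # no change: keep the active record as-is
        # Data differs: expire the old record (End_Date = today, Active = 'N').
        conn.execute(
            "UPDATE Customer_Dim SET End_Date = ?, Active = 'N' WHERE EID = ?",
            (today, eid),
        )
    # Insert the new (or first) version with an open-ended End_Date.
    conn.execute(
        "INSERT INTO Customer_Dim (Customer_ID, Customer_Name, Order_Date, "
        "Product_ID, Quantity, Email, Start_Date, End_Date, Active) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, '9999-12-31', 'Y')",
        (row["Customer_ID"], row["Customer_Name"], row["Order_Date"],
         row["Product_ID"], row["Quantity"], row["Email"], today),
    )

# Demo against an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer_Dim (EID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "Customer_ID, Customer_Name, Order_Date, Product_ID, Quantity, Email, "
    "Start_Date, End_Date, Active)"
)
jane = {"Customer_ID": 2, "Customer_Name": "Jane Smith", "Order_Date": "2024-12-01",
        "Product_ID": 567, "Quantity": 2, "Email": "jane@example.com"}
scd2_upsert(conn, jane)                     # first load: one active row
scd2_upsert(conn, dict(jane, Quantity=5))   # change: old row expired, new row active
```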

Step-by-step guide to run the SCD tests

1. Create the dimension database (first time only)

python sql/sqlite_db/setup_db.py

2. Load data from the source database into the target dimension database

python sql/sqlite_db/create_dm.py 

After the first load, you can see that the table contains bad data from the source, including a duplicate Customer_ID.
table first insert
You can run this file multiple times: it compares the source database with the target database using Customer_ID and Order_ID (the primary key from the source database) and records history changes by setting Start_Date, End_Date, and the Active status.

3. Run SCD test case file

pytest -s tests/test_scd.py

scd step
Then, update the quantity for the customer named "Jane Smith" and run create_dm.py again.

table update

scd result

Database comparison testing between environments

Example Test Cases for Validation:

  1. Record Count Validation – Verify that the number of records in the Employee and Finance tables matches between the QA and DEV environment databases.
Expected Output:
Employee table: 4 records.
Finance table: 4 records.
  2. Primary Key Validation – Ensure every Emp_ID in Finance refers to a valid ID in Employee.
Expected Output:
Orphaned record: Emp_ID = 5.
  3. Column-Wise Validation – Compare each column's data in the tables against the source files.
Expected Output:
Mismatch found in the Employee table:
{'ID': 3, 'Column': 'Department', 'Dev_Value': 'HR', 'QA_Value': 'Human Resources'}
Mismatch found in the Finance table:
{'ID': 3, 'Column': 'Salary', 'Dev_Value': 50000, 'QA_Value': 52000}
  4. Data Type Validation – Check that all columns conform to their defined types.
Expected Output:
Data types are correct for all rows.
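Test case 3 (column-wise validation) can be sketched as below. The repo compares two MySQL databases (DEV vs QA); two in-memory SQLite connections stand in for them here, and the Employee schema and demo values are assumptions taken from the expected output above.

```python
import sqlite3

def column_mismatches(dev_conn, qa_conn, table):
    """Compare every column of every shared row (keyed on the first column)
    between the DEV and QA copies of a table."""
    cols = [c[1] for c in dev_conn.execute(f"PRAGMA table_info({table})")]
    dev = {row[0]: row for row in dev_conn.execute(f"SELECT * FROM {table}")}
    qa = {row[0]: row for row in qa_conn.execute(f"SELECT * FROM {table}")}
    mismatches = []
    for rid in sorted(dev.keys() & qa.keys()):
        for col, d, q in zip(cols, dev[rid], qa[rid]):
            if d != q:
                mismatches.append({"ID": rid, "Column": col,
                                   "Dev_Value": d, "QA_Value": q})
    return mismatches

# Demo reproducing the Department mismatch from the expected output:
dev, qa = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for conn in (dev, qa):
    conn.execute("CREATE TABLE Employee (ID INTEGER PRIMARY KEY, Name TEXT, Department TEXT)")
dev.execute("INSERT INTO Employee VALUES (3, 'Alice', 'HR')")
qa.execute("INSERT INTO Employee VALUES (3, 'Alice', 'Human Resources')")
```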

Steps to create the environment databases (Dev, QA)

1. Build the Docker containers for each environment from docker-compose

docker-compose -f docker-compose.yml up -d --build 

2. Set up the database for each environment

python env/init_env_db.py

You need to run this file twice, changing MYSQL_CONFIG (e.g., MYSQL_CONFIG = MYSQL_CONFIG_DEV) to point at the specific environment each time.

3. Insert data into the databases of both the dev and qa environments

python env/populate_db.py

We can see the salary mismatch between the dev and qa databases in these screenshots:

dev finance

qa finance

4. Run the validation tests between the two databases

pytest -s tests/test_db_validation.py

db result1

db_result2

Explanation

  • /tests/load_data.py: transformation script; loads raw data from Excel into the SQLite database.
  • /tests/test_load_correct.py: verifies that the record count in the raw Excel data equals the total row count in the database.
  • /tests/test_data_unittest.py: verifies the ETL test cases against the database loaded from Excel (primary keys, NULLs, formats, etc.).
  • /tests/test_scd.py: checks that history records stay correct whenever data is edited in Excel and loaded into etl.db; etl_dm.db is created automatically so that history changes can be verified via Start_Date, End_Date, and the Active status.
  • /tests/test_db_validation.py: simulates data-warehouse testing across two environments (dev, qa). After loading, data may be mismatched: primary keys can be invalid, and total row counts, every column value, and column data types should match between environments. MySQL is used to simulate the two environments, while the unit-test files use SQLite.
  • /sql/sqlite_db/setup_db.py: creates the SQLite database (for unit testing) following the structure of the Excel file (each Excel sheet name is a table name); the result is sql/sqlite/etl.db.
  • /sql/sqlite_db/setup_dm_db.py: creates the SQLite database for SCD testing; the result is sql/sqlite/etl_dm.db.
  • /sql/sqlite_db/create_dm.py: compares history data and automatically inserts/updates rows in etl_dm.db.
  • /sql/sqlite_db/db_queries.py: all query functions used in test_data_unittest.py.
  • /sql/sqlite_db/dbm_queries.py: all query functions used in test_scd.py.
  • /env/init_env_db.py: creates the tables in the MySQL databases running in Docker for both the 'dev' and 'qa' environments.
  • /env/populate_db.py: inserts data into both environment databases for data-movement testing (related to test_db_validation.py).
