This project uses Excel, SQL, Docker, and Python (pytest).
Ensure that customer order data is correctly loaded into the database so that I can analyze purchasing patterns.
- Customer_ID must be unique.
- Product_ID must exist in the Products table.
- Quantity must not be negative.
- Order_Date must be valid and in the correct format (YYYY-MM-DD).
- Customer_Name must not be empty.
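The rules above can be sketched as a single pandas check. This is a hedged illustration, not the project's actual validation code; the function name `find_violations` is an assumption, and the real project runs equivalent checks as SQL queries:

```python
import pandas as pd

def find_violations(orders: pd.DataFrame, products: pd.DataFrame) -> dict:
    """Return the rows violating each rule; empty DataFrames mean the data is clean."""
    return {
        # Customer_ID must be unique across orders.
        "duplicate_customer_id": orders[orders["Customer_ID"].duplicated(keep=False)],
        # Product_ID must exist in the Products table.
        "unknown_product_id": orders[~orders["Product_ID"].isin(products["Product_ID"])],
        # Quantity must not be negative.
        "negative_quantity": orders[orders["Quantity"] < 0],
        # Order_Date must parse as YYYY-MM-DD.
        "bad_order_date": orders[
            pd.to_datetime(orders["Order_Date"], format="%Y-%m-%d", errors="coerce").isna()
        ],
        # Customer_Name must not be empty or missing.
        "empty_customer_name": orders[
            orders["Customer_Name"].str.strip().eq("") | orders["Customer_Name"].isna()
        ],
    }
```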
- Extract: Data is extracted from Excel files containing order details.
- Transform:
  - Remove leading/trailing spaces from Customer_Name.
  - Ensure Order_Date is in the correct format (YYYY-MM-DD).
  - Handle missing or invalid values for Quantity (e.g., replace negative values with zero).
- Load: Data is loaded into the Orders table in SQLite.
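The Transform and Load steps above can be sketched roughly as follows, assuming the Excel sheet has already been read into a DataFrame (e.g., with `pandas.read_excel`). The function names are illustrative, not the project's actual helpers:

```python
import sqlite3
import pandas as pd

def transform(orders: pd.DataFrame) -> pd.DataFrame:
    out = orders.copy()
    # Remove leading/trailing spaces from Customer_Name.
    out["Customer_Name"] = out["Customer_Name"].str.strip()
    # Normalize Order_Date to YYYY-MM-DD; unparseable dates become NaN.
    out["Order_Date"] = pd.to_datetime(out["Order_Date"], errors="coerce").dt.strftime("%Y-%m-%d")
    # Replace missing or negative quantities with zero.
    out["Quantity"] = out["Quantity"].fillna(0).clip(lower=0).astype(int)
    return out

def load(orders: pd.DataFrame, db_path: str = ":memory:") -> sqlite3.Connection:
    # Load the cleaned rows into the Orders table in SQLite.
    conn = sqlite3.connect(db_path)
    orders.to_sql("Orders", conn, if_exists="replace", index=False)
    return conn
```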
| Test Case ID | Test Case Description | Steps to Execute | Expected Result | Risk Level | Test Data |
|---|---|---|---|---|---|
| TC_001 | Validate Customer ID Uniqueness | Execute the validate_customer_id_unique query; fetch the results into a DataFrame; check for duplicate Customer_IDs. | The DataFrame is empty, indicating no duplicates. | Critical – Affects data integrity | Customer_ID: 1234 (used for two orders), Order_Date: "2024-12-01", Product_ID: 567, Quantity: 2 |
| TC_002 | Validate Correct Date Format | Execute the validate_order_date_format query; fetch the results into a DataFrame; check that every Order_Date follows the YYYY-MM-DD format. | The DataFrame has no rows with invalid date formats. | High – Affects date parsing and reporting | Customer_ID: 1234, Order_Date: "12/01/2024" (invalid format), Product_ID: 567, Quantity: 2 |
| TC_003 | Validate Missing Customer Name | Execute the get_orders_with_missing_customer_name query; fetch the results into a DataFrame; check for missing Customer_Name values. | There are no missing customer names. | High – Affects order processing | Customer_ID: 1234, Order_Date: "2024-12-01", Product_ID: 567, Quantity: 2, Customer_Name: NULL |
| TC_004 | Validate Negative Quantity Orders | Execute the get_orders_with_negative_quantity query; fetch the results into a DataFrame; check for negative Quantity values. | The DataFrame has no rows with negative quantities. | High – Affects business logic and financial calculations | Customer_ID: 1234, Order_Date: "2024-12-01", Product_ID: 567, Quantity: -5 |
| TC_005 | Validate Order Date Range (December 2024 only) | Fetch all Order_ID and Order_Date values from the Orders table; check each order's date format and ensure it falls within 2024-12-01 to 2024-12-31; identify invalid or out-of-range dates. | Orders with Order_Date outside the range 2024-12-01 to 2024-12-31, or with invalid date formats, are flagged. | High – Invalid or out-of-range dates can affect reporting and processing | Customer_ID: 1234, Order_Date: "2024-12-01" (valid); Customer_ID: 5678, Order_Date: "2024-11-01" (out of range); Customer_ID: 91011, Order_Date: "InvalidDate" (invalid format) |
| TC_006 | Validate Invalid Email Format | Execute the get_invalid_email_customers query; fetch the results into a DataFrame; check for invalid email formats. | The DataFrame has no rows with invalid emails. | High – Affects customer communication | Customer_ID: 1234, Order_Date: "2024-12-01", Product_ID: 567, Quantity: 2, Customer_Email: "invalid_email" |
| TC_007 | Ensure Unique Product_ID in Order | Execute the get_orders_with_duplicate_product_id query; fetch the results into a DataFrame; check for duplicate Product_IDs in orders. | The DataFrame is empty, indicating no duplicates. | Critical – Affects data integrity | Customer_ID: 1234, Order_Date: "2024-12-01", Product_ID: 567 (duplicate), Quantity: 2 |
| TC_008 | Ensure Product_Name Cannot Be NULL | Execute the get_orders_with_null_product_name query; fetch the results into a DataFrame; check for NULL values in Product_Name. | The DataFrame has no rows with a NULL Product_Name. | High – Affects order completeness | Customer_ID: 1234, Order_Date: "2024-12-01", Product_ID: 567, Quantity: 2, Product_Name: NULL |
| TC_009 | Validate Referential Integrity Between Orders and Products | Execute the get_invalid_product_references query; fetch the results into a DataFrame; check for Product_ID references that do not exist in Products. | The DataFrame has no rows with invalid Product_ID references. | Critical – Affects data integrity | Customer_ID: 1234, Order_Date: "2024-12-01", Product_ID: 999 (non-existing), Quantity: 2 |
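Each test above follows the same pattern: run a validation query, fetch the result into a DataFrame, and assert that it is empty. A minimal sketch of that pattern for TC_001, with an assumed stand-in schema (the project's real query lives in db_queries.py and may differ):

```python
import sqlite3
import pandas as pd

def validate_customer_id_unique(conn):
    """Return Customer_IDs that appear on more than one order (should be none)."""
    return pd.read_sql_query(
        "SELECT Customer_ID, COUNT(*) AS n FROM Orders "
        "GROUP BY Customer_ID HAVING n > 1",
        conn,
    )

def test_customer_id_uniqueness():
    # Minimal in-memory stand-in for the project's etl.db.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Orders (Customer_ID INTEGER, Product_ID INTEGER, Quantity INTEGER)")
    conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [(1234, 567, 2), (5678, 568, 1)])
    duplicates = validate_customer_id_unique(conn)
    # Pass condition from the table: the DataFrame must be empty.
    assert duplicates.empty, f"Duplicate Customer_IDs found:\n{duplicates}"
```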
Run a single test by its exact function name:

```shell
pytest -s tests/test_data_unittest.py::test_invalid_product_id
```

This runs only the test_invalid_product_id test case in tests/test_data_unittest.py.
1. Create and Activate a Virtual Environment

```shell
python3 -m venv venv
source venv/bin/activate
```

2. Install Project Dependencies

```shell
pip install -r requirements.txt
```

3. Restart the Docker Services

```shell
docker-compose down
docker-compose up -d
```

4. Set Up the Database

```shell
python sql/sqlite_db/setup_db.py
```

5. Load Data into the Database

```shell
python tests/load_data.py
```

6. Run the Tests

```shell
pytest tests/test_data_unittest.py
```
Objective: Ensure that all expected data is loaded into the target system without any loss.
Test Case: Compare the record counts between the source and target tables to verify completeness.
Run the test:

```shell
pytest tests/test_load_correct.py
```
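The completeness check boils down to comparing row counts. A minimal sketch, assuming the source data is already in a DataFrame and the target is the SQLite Orders table (the helper name is illustrative):

```python
import sqlite3
import pandas as pd

def record_counts_match(source_df: pd.DataFrame, conn: sqlite3.Connection,
                        table: str = "Orders") -> bool:
    """Completeness check: the target table must hold exactly as many rows as the source."""
    target_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return len(source_df) == target_count
```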
1. Raw Source Database (etl.db)
- Schema: Customer_ID, Customer_Name, Order_Date, Product_ID, Quantity, Email.
2. Dimension Database (etl_dm.db)
- Schema: EID (Primary Key), Customer_ID, Customer_Name, Order_Date, Product_ID, Quantity, Email, Start_Date, End_Date, Active.
3. Logic in create_dm.py
- Check if the data exists in the dimension database:
  - Non-existent record: Insert new data with:
    - Start_Date = today
    - End_Date = 9999-12-31
    - Active = Y
  - Existing record: Compare Product_ID, Quantity, Order_Date:
    - If the data is different:
      - Update the End_Date of the old record to today and set Active = N.
      - Insert the updated record as a new row with Start_Date = today and Active = Y.
4. Verification in Tests
- Validate SCD Type 2 logic by filtering EID and comparing historical records.
- Ensure schema alignment between etl.db and etl_dm.db.
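The SCD Type 2 logic described above can be sketched as follows. This is a simplified stand-in for create_dm.py, not its actual code: the dimension table name Dim_Orders is an assumption, and matching is shown on Customer_ID only for brevity:

```python
import sqlite3
from datetime import date

FAR_FUTURE = "9999-12-31"

def upsert_scd2(conn, row):
    """row: dict with Customer_ID, Customer_Name, Order_Date, Product_ID, Quantity, Email."""
    today = date.today().isoformat()
    current = conn.execute(
        "SELECT EID, Product_ID, Quantity, Order_Date FROM Dim_Orders "
        "WHERE Customer_ID = ? AND Active = 'Y'",
        (row["Customer_ID"],),
    ).fetchone()
    if current is not None:
        if (current[1], current[2], current[3]) == (row["Product_ID"], row["Quantity"], row["Order_Date"]):
            return  # No change: keep the active record as-is.
        # Data differs: close out the old record, then insert the new version below.
        conn.execute("UPDATE Dim_Orders SET End_Date = ?, Active = 'N' WHERE EID = ?",
                     (today, current[0]))
    conn.execute(
        "INSERT INTO Dim_Orders (Customer_ID, Customer_Name, Order_Date, Product_ID, "
        "Quantity, Email, Start_Date, End_Date, Active) VALUES (?, ?, ?, ?, ?, ?, ?, ?, 'Y')",
        (row["Customer_ID"], row["Customer_Name"], row["Order_Date"], row["Product_ID"],
         row["Quantity"], row["Email"], today, FAR_FUTURE),
    )
    conn.commit()
```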
1. Create the Dimension Database (first time only)

```shell
python sql/sqlite_db/setup_dm_db.py
```

2. Load Data from the Source DB into the Target Dimension DB

```shell
python sql/sqlite_db/create_dm.py
```

On the first load, the dimension table mirrors the bad source data, including duplicate Customer_IDs.
This script can be run multiple times: it compares the source db with the target db on Customer_ID and Order_ID (the primary key in the source db), and records the history of data changes by setting Start_Date, End_Date, and the Active status.
3. Run the SCD Test Case File

```shell
pytest -s tests/test_scd.py
```

Then update the Quantity for customer "Jane Smith" in the source data and run create_dm.py again to see the history change.
Example Test Cases for Validation:
- Record Count Validation: Verify that the number of records in the Employee and Finance tables matches between the QA and DEV environment databases.
Expected Output:
Employee table: 4 records.
Finance table: 4 records.
- Primary Key Validation: Ensure every Emp_ID in Finance refers to a valid ID in Employee.
Expected Output:
Orphaned record: Emp_ID = 5.
- Column-Wise Validation Compare each column's data in the tables to the source files.
Expected Output:
Mismatch found in the Employee table:
{'ID': 3, 'Column': 'Department', 'Dev_Value': 'HR', 'QA_Value': 'Human Resources'}
Mismatch found in the Finance table:
{'ID': 3, 'Column': 'Salary', 'Dev_Value': 50000, 'QA_Value': 52000}
- Data Type Validation Check that all columns conform to their defined types.
Expected Output:
Data types are correct for all rows.
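The column-wise validation can be sketched as below. The project runs this against the two MySQL env databases; SQLite is used here only to keep the sketch self-contained, and the helper name and the assumption that the key is the first column are illustrative:

```python
import sqlite3

def compare_table(dev_conn, qa_conn, table, key="ID"):
    """Return one mismatch dict per differing cell, keyed on the shared primary key."""
    cols = [r[1] for r in dev_conn.execute(f"PRAGMA table_info({table})")]
    # Assumes the key is the first column of the table.
    dev_rows = {r[0]: r for r in dev_conn.execute(f"SELECT * FROM {table} ORDER BY {key}")}
    qa_rows = {r[0]: r for r in qa_conn.execute(f"SELECT * FROM {table} ORDER BY {key}")}
    mismatches = []
    for rid, dev_row in dev_rows.items():
        qa_row = qa_rows.get(rid)
        if qa_row is None:
            continue  # Missing rows are caught by record count validation instead.
        for col, d, q in zip(cols, dev_row, qa_row):
            if d != q:
                mismatches.append({"ID": rid, "Column": col, "Dev_Value": d, "QA_Value": q})
    return mismatches
```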
1. Build the Docker Containers for Each Env from docker-compose

```shell
docker-compose -f docker-compose.yml up -d --build
```

2. Set Up the Database for Each Env

```shell
python env/init_env_db.py
```

You need to run this script twice, changing MYSQL_CONFIG = MYSQL_CONFIG_DEV to the config for the specific env before each run.
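The manual switch described above amounts to reassigning one module-level variable. A hypothetical sketch of the config dicts in init_env_db.py; the host, port, and database values here are illustrative assumptions, not the project's real settings:

```python
# Hypothetical per-env connection configs; values are illustrative only.
MYSQL_CONFIG_DEV = {"host": "localhost", "port": 3306, "database": "dev_db"}
MYSQL_CONFIG_QA = {"host": "localhost", "port": 3307, "database": "qa_db"}

# Point MYSQL_CONFIG at the target env before each run,
# e.g. switch to MYSQL_CONFIG_QA for the second run.
MYSQL_CONFIG = MYSQL_CONFIG_DEV
```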
3. Insert Data into the Databases of Both the dev and qa Envs

```shell
python env/populate_db.py
```
We can see the salary mismatch between the dev and qa databases in these pictures.
4. Run the Test Validation Between the Two Databases

```shell
pytest -s tests/test_db_validation.py
```
- /tests/load_data.py: transformation script; loads raw data from Excel into the SQLite database.
- /tests/test_load_correct.py: counts records to verify that the number of raw data rows in Excel equals the total rows in the database.
- /tests/test_data_unittest.py: verifies the ETL test cases against the database loaded from Excel (primary keys, NULLs, formats, etc.).
- /tests/test_scd.py: checks that history records are correct whenever data is edited in Excel and reloaded into etl.db; etl_dm.db is created automatically so history changes can be verified via Start_Date, End_Date, and Active status.
- /tests/test_db_validation.py: simulates a data warehouse testing situation that needs two envs (dev, qa); after loading, there may be mismatched data, incorrect primary keys, or differing row counts, and the data in every table column and the column data types should match between envs. The databases used to simulate the two envs are MySQL, while the unit test files use SQLite.
- /sql/sqlite_db/setup_db.py: creates the SQLite db (for unit testing) following the structure of the Excel file (each Excel sheet name is a table name); the result is sql/sqlite/etl.db.
- /sql/sqlite_db/setup_dm_db.py: creates the SQLite db for SCD testing; the result is sql/sqlite/etl_dm.db.
- /sql/sqlite_db/create_dm.py: compares history data and automatically inserts/updates the tables in etl_dm.db.
- /sql/sqlite_db/db_queries.py: all query functions used in test_data_unittest.py.
- /sql/sqlite_db/dbm_queries.py: all query functions used in test_scd.py.
- /env/init_env_db.py: creates the tables in the MySQL databases running in Docker for both the 'dev' and 'qa' envs.
- /env/populate_db.py: inserts data into both env databases for data movement testing (related to test_db_validation.py).