add code and update readme
thanasornsawan committed Dec 19, 2024
1 parent e0f39a2 commit 23e1d9f
Showing 12 changed files with 574 additions and 7 deletions.
54 changes: 54 additions & 0 deletions .github/workflows/ci_cd_pipeline.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
name: ETL CI/CD Pipeline

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout the code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9

      - name: Install Docker and Docker Compose
        run: |
          sudo apt-get update
          sudo apt-get install -y docker.io
          sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
          sudo chmod +x /usr/local/bin/docker-compose

      - name: Build and start Docker containers with docker-compose
        run: |
          docker-compose -f docker-compose.yml up -d
          sleep 10 # Wait for the DB to start properly (adjust if needed)

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run database setup
        run: python sql/sqlite_db/setup_db.py

      - name: Load data into database
        run: python tests/load_data.py

      - name: Run tests
        run: |
          pytest tests/test_etl.py
        continue-on-error: true

      - name: Clean up Docker containers
        run: |
          docker-compose -f docker-compose.yml down
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
__pycache__
etl.db
.venv
.pytest_cache
22 changes: 22 additions & 0 deletions Dockerfile
@@ -0,0 +1,22 @@
# Dockerfile

# Use a base Python image
FROM python:3.9-slim

# Set the working directory
WORKDIR /app

# Copy the requirements file into the container
COPY requirements.txt .

# Install the dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project files into the container
COPY . .

# Set environment variables (if necessary)
ENV DB_PATH="/opt/airflow/sqlite_db/etl.db"

# Command to run when the container starts
CMD ["pytest", "tests/test_etl.py"]
63 changes: 56 additions & 7 deletions README.md
@@ -21,10 +21,59 @@ Handle missing or invalid values for Quantity (e.g., replace negative values wit

## Test Plan

| **Test Case ID** | **Test Case Description** | **Steps to Execute** | **Expected Result** | **Business Rule Compliance** | **Risk Level** | **Test Data** |
|------------------|-----------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|------------------------------------------------------------------|----------------------------------|-----------------------------------------------------------|
| TC_01 | **Validate Customer_ID Uniqueness** | - Insert two orders with the same Customer_ID.<br>- Check if the system raises an error or rejects the second order. | **Failure**: The system should reject the second order with the same Customer_ID. | Duplicate Customer_ID violates uniqueness in the orders table. | **Critical** – Affects data integrity. | Customer_ID: 1234 (used for two orders)<br>Order_Date: "2024-12-01"<br>Product_ID: 567<br>Quantity: 2 |
| TC_02 | **Validate Correct Date Format** | - Insert an order with an invalid date format (e.g., `12/01/2024` for `Order_Date`).<br>- Attempt to save the order. | **Failure**: The system should reject the order due to incorrect date format. | The `Order_Date` must follow a standardized format. | **High** – Incorrect data can cause parsing issues and errors in reporting. | Customer_ID: 1234<br>Order_Date: "12/01/2024" (invalid format)<br>Product_ID: 567<br>Quantity: 2 |
| TC_03 | **Validate Missing Customer_Name** | - Insert an order with a missing `Customer_Name` value.<br>- Attempt to save the order. | **Failure**: The system should reject the order due to missing customer name. | The `Customer_Name` field is mandatory for all orders. | **High** – Missing customer information affects order processing and analysis. | Customer_ID: 1234<br>Order_Date: "2024-12-01"<br>Product_ID: 567<br>Quantity: 2 (Customer_Name: NULL) |
| TC_04 | **Validate Negative Quantity** | - Insert an order with a negative `Quantity` value.<br>- Attempt to save the order. | **Failure**: The system should reject the order due to invalid quantity. | `Quantity` must always be a positive number. | **High** – Negative quantity violates business logic and can affect financial calculations. | Customer_ID: 1234<br>Order_Date: "2024-12-01"<br>Product_ID: 567<br>Quantity: -5 |
| TC_05 | **Validate Missing Order Date** | - Insert an order with a missing `Order_Date` value.<br>- Attempt to save the order. | **Failure**: The system should reject the order due to missing order date. | `Order_Date` cannot be missing. | **Critical** – Missing order dates make the data unusable for time-based analysis. | Customer_ID: 1234<br>Customer_Name: "John Doe"<br>Product_ID: 567<br>Quantity: 2 (Order_Date: NULL) |
| Test Case ID | Test Case Description | Steps to Execute | Expected Result | Risk Level | Test Data |
|--------------|------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| TC_001 | **Validate Customer ID Uniqueness** | - Execute `validate_customer_id_unique` query.<br>- Fetch the results into a DataFrame.<br>- Check for any duplicate `Customer_ID`s. | **Failure**: The DataFrame should be empty, indicating no duplicates. | **Critical** – Affects data integrity | Customer_ID: 1234 (used for two orders)<br>Order_Date: "2024-12-01"<br>Product_ID: 567<br>Quantity: 2 |
| TC_002       | **Validate Correct Date Format**                     | - Execute `validate_order_date_format` query.<br>- Fetch the results into a DataFrame.<br>- Validate if the `Order_Date` is in the correct format (`YYYY-MM-DD`). | **Failure**: The DataFrame should have no invalid date formats. | **High** – Affects date parsing and reporting | Customer_ID: 1234<br>Order_Date: "12/01/2024" (invalid format)<br>Product_ID: 567<br>Quantity: 2 |
| TC_003 | **Validate Missing Customer Name** | - Execute `get_orders_with_missing_customer_name` query.<br>- Fetch the results into a DataFrame.<br>- Check for any missing `Customer_Name` values. | **Failure**: There should be no missing customer names. | **High** – Affects order processing | Customer_ID: 1234<br>Order_Date: "2024-12-01"<br>Product_ID: 567<br>Quantity: 2 (Customer_Name: NULL) |
| TC_004 | **Validate Negative Quantity Orders** | - Execute `get_orders_with_negative_quantity` query.<br>- Fetch the results into a DataFrame.<br>- Check for negative `Quantity` values. | **Failure**: The DataFrame should have no rows with negative quantities. | **High** – Affects business logic and financial calculations | Customer_ID: 1234<br>Order_Date: "2024-12-01"<br>Product_ID: 567<br>Quantity: -5 |
| TC_005       | **Validate Order Date Range (December 2024 only)**  | - Execute the query to fetch all `Order_ID` and `Order_Date` from the `Orders` table.<br>- Check each order's date format and ensure it's within the range `2024-12-01` to `2024-12-31`.<br>- Identify invalid or out-of-range dates. | **Failure**: Orders with `Order_Date` outside the range `2024-12-01` to `2024-12-31` should be flagged.<br>**Failure**: Orders with invalid date formats should be flagged. | **High** – Invalid or out-of-range dates can affect reporting and processing. | Customer_ID: 1234<br>Order_Date: "01/12/2024"<br>Product_ID: 567<br>Quantity: 10 (Valid date)<br>Customer_ID: 5678<br>Order_Date: "01/11/2024" (Out of range)<br>Customer_ID: 91011<br>Order_Date: "InvalidDate" (Invalid format) |
| TC_006 | **Validate Invalid Email Format** | - Execute `get_invalid_email_customers` query.<br>- Fetch the results into a DataFrame.<br>- Check for invalid email formats. | **Failure**: The DataFrame should have no rows with invalid emails. | **High** – Affects customer communication | Customer_ID: 1234<br>Order_Date: "2024-12-01"<br>Product_ID: 567<br>Quantity: 2<br>Customer_Email: "invalid_email" |
| TC_007 | **Ensure Unique Product_ID in Order** | - Execute `get_orders_with_duplicate_product_id` query.<br>- Fetch the results into a DataFrame.<br>- Check for duplicate `Product_ID`s in orders. | **Failure**: The DataFrame should be empty, indicating no duplicates. | **Critical** – Affects data integrity | Customer_ID: 1234<br>Order_Date: "2024-12-01"<br>Product_ID: 567 (duplicate)<br>Quantity: 2 |
| TC_008 | **Ensure Product_Name Cannot Be NULL** | - Execute `get_orders_with_null_product_name` query.<br>- Fetch the results into a DataFrame.<br>- Check for any `NULL` values in `Product_Name`. | **Failure**: The DataFrame should have no rows with NULL `Product_Name`. | **High** – Affects order completeness | Customer_ID: 1234<br>Order_Date: "2024-12-01"<br>Product_ID: 567<br>Quantity: 2<br>Product_Name: NULL |
| TC_009 | **Validate Referential Integrity Between Orders and Products** | - Execute `get_invalid_product_references` query.<br>- Fetch the results into a DataFrame.<br>- Check for any `Product_ID` references that do not exist in Products. | **Failure**: The DataFrame should have no rows indicating invalid `Product_ID` references. | **Critical** – Affects data integrity | Customer_ID: 1234<br>Order_Date: "2024-12-01"<br>Product_ID: 999 (non-existing)<br>Quantity: 2 |
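
The query-then-check pattern used throughout the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual test code: it uses only the standard-library `sqlite3` module with an in-memory database, returns plain tuples instead of a DataFrame, and the names `NEGATIVE_QUANTITY_SQL`, `build_sample_db`, and `find_violations` are hypothetical. The SQL mirrors the TC_004 query and the table layout mirrors `setup_db.py`.

```python
import sqlite3

# Hypothetical stand-in for get_orders_with_negative_quantity() in db_queries.py.
NEGATIVE_QUANTITY_SQL = """
    SELECT Order_ID, Customer_ID, Product_ID, Quantity
    FROM Orders
    WHERE Quantity < 0
"""


def build_sample_db():
    """Create an in-memory database with the Orders layout and two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE Orders (
            Order_ID INTEGER PRIMARY KEY,
            Customer_ID INTEGER,
            Customer_Name TEXT,
            Order_Date TEXT,
            Product_ID INTEGER,
            Quantity INTEGER,
            Email TEXT
        )
        """
    )
    conn.executemany(
        "INSERT INTO Orders VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (1, 1234, "John Doe", "2024-12-01", 567, 2, "john@example.com"),
            (2, 1234, "John Doe", "2024-12-02", 567, -5, "john@example.com"),
        ],
    )
    return conn


def find_violations(conn, sql):
    """Run a validation query and return the offending rows as a list of tuples."""
    return conn.execute(sql).fetchall()


conn = build_sample_db()
violations = find_violations(conn, NEGATIVE_QUANTITY_SQL)
print(violations)  # only the order with Quantity = -5 is returned
```

A test built this way passes when the returned list is empty and fails when any violating row comes back, which matches the "DataFrame should be empty" wording in the Expected Result column.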

## To run a specific test case:
**Run by exact function name:**
```sh
pytest -s tests/test_etl.py::test_invalid_product_id
```
This runs only the `test_invalid_product_id` test case in `tests/test_etl.py`. The `-s` flag disables pytest's output capturing, so `print` output appears in the terminal.

## Running the Project Locally

**1. Create and Activate a Virtual Environment**
```sh
python3 -m venv venv
```

Activate the virtual environment:
```sh
source venv/bin/activate
```

**2. Install Project Dependencies**
```sh
pip install -r requirements.txt
```

**3. Restart the Docker Containers**
```sh
docker-compose down
docker-compose up -d
```

**4. Set Up the Database**
```sh
python sql/sqlite_db/setup_db.py
```

**5. Load Data into the Database**
```sh
python tests/load_data.py
```

**6. Run the Tests**
```sh
pytest tests/test_etl.py
```
16 changes: 16 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,16 @@
version: '3.8'

services:
  sqlite_db:
    build: ./sql # Path where your Dockerfile is located
    container_name: sqlite_db
    volumes:
      - ./sql/sqlite_db:/opt/sqlite_db # Map the local folder to the container's folder
    ports:
      - "8081:8080" # Adjust if needed
    networks:
      - sqlite_network

networks:
  sqlite_network:
    driver: bridge
Binary file added orders_test_data.xlsx
Binary file not shown.
5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
# For setting up the environment
apache-airflow==2.5.0 # If using Airflow for orchestration
pandas==1.5.3 # For handling data manipulation
pytest==7.2.2 # For running tests
openpyxl==3.0.10 # For reading and writing Excel files (e.g., orders_test_data.xlsx)
7 changes: 7 additions & 0 deletions sql/Dockerfile
@@ -0,0 +1,7 @@
FROM nouchka/sqlite3:latest

# Set working directory to where the database will reside
WORKDIR /opt/sqlite_db

# Initialize or create the SQLite database
RUN sqlite3 /opt/sqlite_db/etl.db "CREATE TABLE IF NOT EXISTS Orders (Order_ID INTEGER PRIMARY KEY AUTOINCREMENT, Product_Name TEXT, Quantity INTEGER);"
102 changes: 102 additions & 0 deletions sql/sqlite_db/db_queries.py
@@ -0,0 +1,102 @@
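# Query to validate Customer_ID uniqueness.
# Note: rows are grouped by Customer_ID and Order_Date, so this returns
# customers with more than one order on the same date.
def validate_customer_id_unique():
    return """
    SELECT Customer_ID, Order_Date, COUNT(*) AS Order_Count
    FROM Orders
    GROUP BY Customer_ID, Order_Date
    HAVING COUNT(*) > 1
    """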

# Query to validate the date format: returns orders whose Order_Date is NULL
# or is not a valid calendar date in YYYY-MM-DD form.
# The GLOB pattern uses [0-9] classes so that only digits are accepted
# ('?' would match any character, letting values such as '2024-01-0a' through).
def validate_order_date_format():
    return """
    SELECT Order_ID, Order_Date
    FROM Orders
    WHERE Order_Date IS NULL
       OR NOT (Order_Date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
               AND LENGTH(Order_Date) = 10
               AND CAST(substr(Order_Date, 1, 4) AS INTEGER) > 0
               AND substr(Order_Date, 6, 2) BETWEEN '01' AND '12'
               AND CASE
                     WHEN substr(Order_Date, 6, 2) IN ('01', '03', '05', '07', '08', '10', '12') THEN substr(Order_Date, 9, 2) BETWEEN '01' AND '31'
                     WHEN substr(Order_Date, 6, 2) IN ('04', '06', '09', '11') THEN substr(Order_Date, 9, 2) BETWEEN '01' AND '30'
                     WHEN substr(Order_Date, 6, 2) = '02' THEN (
                       CASE
                         WHEN (CAST(substr(Order_Date, 1, 4) AS INTEGER) % 4 = 0
                               AND CAST(substr(Order_Date, 1, 4) AS INTEGER) % 100 != 0)
                              OR CAST(substr(Order_Date, 1, 4) AS INTEGER) % 400 = 0 THEN substr(Order_Date, 9, 2) BETWEEN '01' AND '29'
                         ELSE substr(Order_Date, 9, 2) BETWEEN '01' AND '28'
                       END
                     )
                     ELSE 0
                   END = 1
              );
    """

# Query to find orders with negative quantities
def get_orders_with_negative_quantity():
    return """
    SELECT Order_ID, Customer_ID, Product_ID, Quantity
    FROM Orders
    WHERE Quantity < 0
    """

# Query to find orders with missing Customer_Name
def get_orders_with_missing_customer_name():
    return """
    SELECT Order_ID, Customer_ID, Customer_Name, Product_ID, Quantity
    FROM Orders
    WHERE Customer_Name IS NULL
    """

# Query to ensure unique Product_ID (no duplicates allowed in Orders)
def get_orders_with_duplicate_product_id():
    return """
    SELECT Product_ID, COUNT(*)
    FROM Orders
    GROUP BY Product_ID
    HAVING COUNT(*) > 1
    """

# Query to ensure Product_Name cannot be NULL in Products
def get_orders_with_null_product_name():
    return """
    SELECT *
    FROM Products
    WHERE Product_Name IS NULL
    """

# Query to find orders with an invalid customer email
def get_invalid_email_customers():
    """
    Query to find customers with an invalid email format.
    Returns rows where the email does not match the expected pattern.
    Rows with a NULL Email are not returned, because NOT LIKE evaluates
    to NULL for them.
    """
    query = """
    SELECT *
    FROM Orders
    WHERE Email NOT LIKE '%_@__%.__%';
    """
    return query

def get_orders_with_invalid_date_range():
    """
    Query to find orders where the Order_Date is outside the range '2024-01-01' to '2024-12-31'.
    """
    query = """
    SELECT *
    FROM Orders
    WHERE Order_Date < '2024-01-01' OR Order_Date > '2024-12-31';
    """
    return query
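
Because `Order_Date` is stored as `TEXT`, the range predicate above compares strings lexicographically, which only lines up with chronological order for `YYYY-MM-DD` values. The sketch below illustrates that behaviour on an in-memory database. It is an illustration under assumptions, not the project's code: `OUT_OF_RANGE_SQL` is a hypothetical variant of the query that selects two columns instead of `*`, and the sample rows are made up.

```python
import sqlite3

# Hypothetical variant of the range query: same WHERE clause, fewer columns.
OUT_OF_RANGE_SQL = """
    SELECT Order_ID, Order_Date
    FROM Orders
    WHERE Order_Date < '2024-01-01' OR Order_Date > '2024-12-31'
    ORDER BY Order_ID
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (Order_ID INTEGER PRIMARY KEY, Order_Date TEXT)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [
        (1, "2024-12-01"),  # inside the range
        (2, "2023-11-30"),  # before the range
        (3, "01/12/2024"),  # wrong format: sorts before '2024-01-01' as text
        (4, None),          # NULL: both comparisons yield NULL, so not returned
        (5, "2024-02-30"),  # not a real date, but inside the range as text
    ],
)

flagged = conn.execute(OUT_OF_RANGE_SQL).fetchall()
print(flagged)
```

Rows 4 and 5 are not returned, so a range check like this one appears to rely on the separate format and NULL checks to catch missing or impossible dates.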

def get_invalid_product_references():
    """
    Returns the SQL query to check for invalid Product_ID references in the Orders table.
    """
    return """
    SELECT o.Order_ID, o.Product_ID
    FROM Orders o
    LEFT JOIN Products p ON o.Product_ID = p.Product_ID
    WHERE p.Product_ID IS NULL;
    """
39 changes: 39 additions & 0 deletions sql/sqlite_db/setup_db.py
@@ -0,0 +1,39 @@
import sqlite3

# Path to SQLite database
DB_PATH = 'sql/sqlite_db/etl.db'

# Establish a connection
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()

# Drop tables if they exist to ensure schema updates
cursor.execute('DROP TABLE IF EXISTS Orders;')
cursor.execute('DROP TABLE IF EXISTS Products;')

# Create the Orders table with the updated schema (including Email column)
cursor.execute('''
CREATE TABLE Orders (
Order_ID INTEGER PRIMARY KEY,
Customer_ID INTEGER,
Customer_Name TEXT,
Order_Date TEXT,
Product_ID INTEGER,
Quantity INTEGER,
Email TEXT
);
''')

# Create the Products table
cursor.execute('''
CREATE TABLE Products (
Product_ID INTEGER PRIMARY KEY,
Product_Name TEXT
);
''')

# Commit changes and close the connection
conn.commit()
conn.close()

print("Database and tables set up successfully.")
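
The script above hardcodes the database path, while the root `Dockerfile` sets a `DB_PATH` environment variable. One possible way to reconcile the two is sketched below. This is a sketch under assumptions rather than the project's implementation: the helper names `resolve_db_path`, `create_schema`, and `list_tables` are hypothetical, and reading `DB_PATH` from the environment is an assumption about the intended use of that variable. The schema is the one created by `setup_db.py`.

```python
import os
import sqlite3
from contextlib import closing

# Fallback path, taken from setup_db.py.
DEFAULT_DB_PATH = "sql/sqlite_db/etl.db"

# Same tables as setup_db.py, expressed as one script.
SCHEMA = """
DROP TABLE IF EXISTS Orders;
DROP TABLE IF EXISTS Products;

CREATE TABLE Orders (
    Order_ID INTEGER PRIMARY KEY,
    Customer_ID INTEGER,
    Customer_Name TEXT,
    Order_Date TEXT,
    Product_ID INTEGER,
    Quantity INTEGER,
    Email TEXT
);

CREATE TABLE Products (
    Product_ID INTEGER PRIMARY KEY,
    Product_Name TEXT
);
"""


def resolve_db_path():
    """Assumption: prefer the DB_PATH environment variable when it is set."""
    return os.environ.get("DB_PATH", DEFAULT_DB_PATH)


def create_schema(conn):
    """Drop and recreate the Orders and Products tables."""
    conn.executescript(SCHEMA)
    conn.commit()


def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demonstrated against an in-memory database so no file is written;
# a real run would pass resolve_db_path() to sqlite3.connect instead.
with closing(sqlite3.connect(":memory:")) as conn:
    create_schema(conn)
    tables = list_tables(conn)

print(tables)
```

Wrapping the connection in `closing` ensures it is closed even if schema creation raises, and keeping the schema in a function makes it reusable from test fixtures.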