This repository is a team project for the 2023 Fall Business Analytics course at Seoul National University of Science and Technology. The project analyzes the Seoul Public Bike (따릉이) program and develops strategies for deficit reduction. The analysis uses machine learning models, specifically LightGBM and XGBoost, to predict public bicycle usage patterns and optimize system efficiency.
- 2023_BA_Project
- Project Overview
- Objectives
- Execution Environment
- Model Execution
- Model Versions
- Create and Activate Conda Environment
- Install Required Packages
- Launch Jupyter Notebook
- 1. Load and Preprocess Data
- 2. Training and Model Creation
- 3. Ensemble Prediction
- 4. Deficit Reduction Strategies
- 5. Results Storage
- 6. Model Performance
- 7. Additional Notes and Considerations
- 8. Dataset
The Seoul Public Bike project has faced financial challenges, with deficits increasing over the years. The goal of this project is to leverage business analytics to understand the usage patterns, optimize system efficiency, and propose strategies for deficit reduction.
- Financial Analysis: Conduct a comprehensive analysis of the Seoul Public Bike project's financial status, identifying trends and understanding the factors contributing to deficits.
- Usage Pattern Analysis: Explore usage patterns of public bicycles, considering factors such as borrowed hour, borrowed day, and environmental conditions.
- Model Development: Implement machine learning models, including LightGBM and XGBoost, to predict bicycle usage and capture station-specific trends.
- Ensemble Model: Combine the strengths of LightGBM and XGBoost through ensemble modeling to enhance prediction accuracy.
- Deficit Reduction Strategies: Based on the analysis results, propose strategies to reduce the financial deficits associated with the public bicycle project.
- Python 3.11 or higher
- Conda (for managing the virtual environment)
- pandas==1.5.3
- numpy==1.24.3
- scikit-learn==1.3.0
- lightgbm==4.1.0
- xgboost==2.0.2
- Create Conda Environment:

  ```bash
  conda create --name myenv python=3.11
  ```

  Replace `myenv` with the desired environment name.

- Activate Conda Environment:

  ```bash
  conda activate myenv
  ```

- Install Required Packages:

  ```bash
  conda install --file requirements.txt
  ```

  This command installs the packages specified in `requirements.txt` within the Conda environment.
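If `conda install --file requirements.txt` cannot resolve one of the pinned versions from your configured channels, a common fallback is to install with pip inside the activated environment: `pip install -r requirements.txt`.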
```bash
jupyter notebook
```

Now open Jupyter Notebook and navigate to the `team7_ensemble_model.ipynb` notebook to run the code under the "1. Load and Preprocess Data" section.
Ensure that you are using Python 3.11 or a higher version and have activated your Conda environment before installing the required packages.
- Load the Data:

  ```python
  import pandas as pd

  # Load data
  data = pd.read_csv('merged_data.csv', encoding='utf-8')
  ```
- Select Relevant Features and Preprocess Data:

  ```python
  # Selected features ('강수량(mm)' is the precipitation column, in mm)
  selected_features = ['stn_id', 'borrowed_hour', 'borrowed_day', 'is_holiday',
                       'borrowed_num_nearby', '강수량(mm)', 'wind_chill',
                       'nearby_id', 'borrowed_date', 'borrowed_num']
  data = data[selected_features]

  # Label encoding for categorical features
  categorical_features = ['stn_id', 'nearby_id']
  for feature in categorical_features:
      data[feature] = pd.factorize(data[feature])[0]
  ```
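The training code below expects `X_train`, `X_test`, `y_train`, and `y_test`, which the notebook defines during preprocessing. As a minimal sketch, assuming `borrowed_num` is the prediction target and `borrowed_date` is held out of the model inputs (both assumptions, since the split is not shown here):

```python
from sklearn.model_selection import train_test_split

# Assumed target and feature selection; the notebook's actual split may differ
X = data.drop(columns=['borrowed_num', 'borrowed_date'])
y = data['borrowed_num']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```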
```python
import lightgbm as lgb

# LightGBM parameters for regression
lgb_params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 80,
    'learning_rate': 0.05,
    'feature_fraction': 1.0,
    'device': 'gpu'  # optional; remove or set to 'cpu' if no GPU is available
}

# Create training and test datasets
train_data_lgb = lgb.Dataset(X_train, label=y_train)
test_data_lgb = lgb.Dataset(X_test, label=y_test, reference=train_data_lgb)

# Train the LightGBM model with early stopping and evaluation logging every 100 rounds
lgb_model = lgb.train(
    lgb_params,
    train_data_lgb,
    num_boost_round=10000,
    valid_sets=[test_data_lgb, train_data_lgb],
    callbacks=[
        lgb.early_stopping(stopping_rounds=3),
        lgb.log_evaluation(period=100),
    ],
)
```
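To reproduce per-model figures like those reported in section 6, the trained booster can be scored on both splits; a minimal sketch (not the notebook's exact reporting code):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate the LightGBM model at its best iteration on both splits
for name, X_, y_ in [('Training', X_train, y_train), ('Test', X_test, y_test)]:
    pred = lgb_model.predict(X_, num_iteration=lgb_model.best_iteration)
    rmse = mean_squared_error(y_, pred, squared=False)
    print(f'LightGBM {name} RMSE: {rmse:.4f}, R-squared: {r2_score(y_, pred):.4f}')
```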
```python
import xgboost as xgb

# XGBoost parameters for regression
xgb_params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'booster': 'gbtree',
    'learning_rate': 0.1,
    'max_depth': 13,
    'subsample': 0.8,
    'device': 'gpu'  # optional; remove or set to 'cpu' if no GPU is available
}

# Create training and test datasets
train_data_xgb = xgb.DMatrix(X_train, label=y_train)
test_data_xgb = xgb.DMatrix(X_test, label=y_test)

# Train the XGBoost model with early stopping
xgb_model = xgb.train(
    xgb_params,
    train_data_xgb,
    num_boost_round=10000,
    evals=[(test_data_xgb, 'eval')],
    early_stopping_rounds=3,
    verbose_eval=100,
)
```
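To relate the model to station-level strategy, it can help to inspect which features the boosted trees rely on. XGBoost exposes gain-based importances on the trained booster; a short illustrative snippet (feature names are preserved here because the DMatrix was built from a pandas DataFrame):

```python
# Gain-based feature importance: average loss reduction contributed by each feature
importance = xgb_model.get_score(importance_type='gain')
for feature, gain in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{feature}: {gain:.2f}')
```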
```python
from sklearn.metrics import mean_squared_error, r2_score

# Combine predictions of both models for ensemble prediction
y_pred_ensemble = (lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration)
                   + xgb_model.predict(test_data_xgb)) / 2

# Evaluate the performance of the ensemble model
ensemble_rmse = mean_squared_error(y_test, y_pred_ensemble, squared=False)
ensemble_r2 = r2_score(y_test, y_pred_ensemble)
print(f'Ensemble Test RMSE: {ensemble_rmse}')
print(f'Ensemble Test R-squared: {ensemble_r2}')
```
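The ensemble above weights the two models equally. A weighted blend can sometimes edge out the plain average; a small illustrative sweep (for a rigorous choice, tune the weight on a separate validation split rather than the test set):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Sweep the LightGBM weight w from 0 (XGBoost only) to 1 (LightGBM only)
lgb_pred = lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration)
xgb_pred = xgb_model.predict(test_data_xgb)
for w in np.linspace(0.0, 1.0, 11):
    rmse = mean_squared_error(y_test, w * lgb_pred + (1 - w) * xgb_pred, squared=False)
    print(f'w={w:.1f} RMSE={rmse:.4f}')
```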
Based on the ensemble model results, analyze patterns and insights obtained from the predictions.
Considering the analysis, propose effective strategies to reduce the financial deficits associated with the Seoul Public Bike project.
Save the predictions to the `new_data_with_predictions.csv` file:

```python
# Save predictions
new_data.to_csv('new_data_with_predictions.csv', index=False, encoding='utf-8')
```
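`new_data` is assembled in the notebook and holds the ensemble predictions. If you are adapting the snippets in this README on their own, one hypothetical way to build it (the column name `predicted_borrowed_num` is illustrative, not the notebook's):

```python
# Hypothetical assembly of new_data from the test split and ensemble predictions
new_data = X_test.copy()
new_data['borrowed_num'] = y_test.values
new_data['predicted_borrowed_num'] = y_pred_ensemble
```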
After training and evaluating the LightGBM and XGBoost models and their ensemble, here are the key performance metrics:

LightGBM:

| Metric | Training Value | Test Value |
|---|---|---|
| RMSE | 1.9139 | 1.9659 |
| R-squared | 0.5621 | 0.5377 |

XGBoost:

| Metric | Training Value | Test Value |
|---|---|---|
| RMSE | 1.7220 | 1.9135 |
| R-squared | 0.6455 | 0.5620 |

Ensemble:

| Metric | Training Value | Test Value |
|---|---|---|
| RMSE | 1.5199 | 1.7171 |
| R-squared | 0.7128 | 0.6473 |
These metrics provide insights into how well the models are performing, and users can quickly assess the quality of predictions.
Note that the `device` parameter in both model configurations is optional: keep `'gpu'` only if a GPU-enabled build of the library is installed; otherwise remove it or set it to `'cpu'`. Adjust other configurations (paths, encodings, hyperparameters) to match your environment as needed.
For detailed information about the hyperparameter tuning process for XGBoost and LightGBM, including the configurations used and the insights gained, refer to the Hyperparameter Tuning section of the 23_BA_preprocessing repository.
For the preprocessing of Seoul Bike Rental Station Information, you can refer to the BA_Preprocessing repository. The preprocessing repository includes the following files:
- seoul_bicycle_master.json: Master data of Seoul Bike rental stations.
- master_preprocessing.ipynb: Jupyter Notebook that uses the Google API to fill in coordinates recorded as 0.0 in the master data.
- seoul_bicycle_maser_preprocessed.csv: File containing data processed using master_preprocessing.ipynb.
- master_info_with_nearby.ipynb: Jupyter Notebook for adding columns with the nearest rental station and its distance, based on seoul_bicycle_maser_preprocessed.csv (a sketch of this computation appears after this list).
- master_info_with_nearby.csv: File containing data with added information about nearby rental stations using master_info_with_nearby.ipynb.
- master_final.ipynb: Jupyter Notebook for processing rows where the district data has not been correctly recorded due to differences in address formatting.
- master_final.csv: File containing data where 'stn_gu' has been appropriately added to all data using master_final.ipynb.
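For reference, the nearest-station columns can be derived from the station coordinates with a haversine distance; a minimal sketch, assuming the preprocessed CSV is UTF-8 encoded, and not necessarily the notebook's exact method:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, [lat1, lng1, lat2, lng2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

stations = pd.read_csv('seoul_bicycle_maser_preprocessed.csv', encoding='utf-8')
ids = stations['stn_id'].to_numpy()
lat = stations['stn_lat'].to_numpy()
lng = stations['stn_lng'].to_numpy()

# Pairwise distances; mask the diagonal so a station never matches itself
dist = haversine_km(lat[:, None], lng[:, None], lat[None, :], lng[None, :])
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)
stations['nearby_id'] = ids[nearest]
stations['nearby_km'] = dist[np.arange(len(ids)), nearest]
```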
Columns in master_final.csv:
- stn_id: Represents the id of the rental station and is of object type.
- stn_addr: Represents the full address of the rental station and is of object type.
- stn_lat: Represents the latitude of the rental station and is of float64 type.
- stn_lng: Represents the longitude of the rental station and is of float64 type.
- nearby_id: Represents the id of the nearest rental station and is of object type.
- nearby_km: Represents the distance to the nearest rental station in km and is of float64 type.
- stn_gu: Represents the district ('gu') of the rental station, added because the weather data is classified by district; this is of object type.
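A quick way to sanity-check a local copy of `master_final.csv` against these column descriptions (assuming the file is UTF-8 encoded):

```python
import pandas as pd

# Expected dtypes as documented in the column list above
master = pd.read_csv('master_final.csv', encoding='utf-8')
expected = {'stn_id': 'object', 'stn_addr': 'object', 'stn_lat': 'float64',
            'stn_lng': 'float64', 'nearby_id': 'object', 'nearby_km': 'float64',
            'stn_gu': 'object'}
for col, dtype in expected.items():
    assert str(master[col].dtype) == dtype, f'{col} is {master[col].dtype}, expected {dtype}'
```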
To replicate the analysis and run the code, you'll need the dataset file `merged_data.csv`. You can download it using the following link:
Place the downloaded file in the project's root directory before running the Jupyter Notebook.
If you want to collect real-time Seoul Public Bike rental data for testing purposes, you can use the provided Jupyter Notebook:
따릉이 Real Data Collection Notebook
Follow the instructions in the notebook to collect real-time rental data. Note that this step is optional, and you can proceed with the analysis without real-time data collection.