This repository is a team project for the 2023 Fall Business Analytics course at Seoul National University of Science and Technology. The project analyzes the Seoul Public Bike (따릉이) program and develops strategies for deficit reduction. The analysis uses machine learning models, specifically LightGBM and XGBoost, to predict public bicycle usage patterns and optimize system efficiency.
- 2023_BA_Project
- Project Overview
- Objectives
- Execution Environment
- Model Execution
- Model Versions
- Create and Activate Conda Environment
- Install Required Packages
- Launch Jupyter Notebook
- 1. Load and Preprocess Data
- 2. Training and Model Creation
- 3. Ensemble Prediction
- 4. Deficit Reduction Strategies
- 5. Results Storage
- 6. Model Performance
- 7. Additional Notes and Considerations
- 8. Dataset
The Seoul Public Bike project has faced financial challenges, with deficits increasing over the years. The goal of this project is to leverage business analytics to understand the usage patterns, optimize system efficiency, and propose strategies for deficit reduction.
- Financial Analysis: Conduct a comprehensive analysis of the Seoul Public Bike project's financial status, identifying trends and understanding the factors contributing to deficits.
- Usage Pattern Analysis: Explore usage patterns of public bicycles, considering factors such as borrowed hour, borrowed day, and environmental conditions.
- Model Development: Implement machine learning models, including LightGBM and XGBoost, to predict bicycle usage and capture station-specific trends.
- Ensemble Model: Combine the strengths of LightGBM and XGBoost through ensemble modeling to enhance prediction accuracy.
- Deficit Reduction Strategies: Based on the analysis results, propose strategies to reduce the financial deficits associated with the public bicycle project.
- Python 3.11 or higher
- Conda (for managing the virtual environment)
- pandas==1.5.3
- numpy==1.24.3
- scikit-learn==1.3.0
- lightgbm==4.1.0
- xgboost==2.0.2
- Create Conda Environment:

  ```bash
  conda create --name myenv python=3.11
  ```

  Replace `myenv` with the desired environment name.

- Activate Conda Environment:

  ```bash
  conda activate myenv
  ```

- Install Required Packages:

  ```bash
  conda install --file requirements.txt
  ```

  This command installs the packages specified in `requirements.txt` within the Conda environment.
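If `conda install --file requirements.txt` cannot resolve one of the pinned versions from your configured channels, a common fallback is to install with pip inside the activated environment: `pip install -r requirements.txt`.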
```bash
jupyter notebook
```

Now open Jupyter Notebook and navigate to the `team7_ensemble_model.ipynb` notebook to run the code under the "1. Load and Preprocess Data" section.
Ensure that you are using Python 3.11 or a higher version and have activated your Conda environment before installing the required packages.
- Load the Data:

  ```python
  import pandas as pd

  # Load data
  data = pd.read_csv('merged_data.csv', encoding='utf-8')
  ```
- Select Relevant Features and Preprocess Data:

  ```python
  # Selected features ('강수량(mm)' is the precipitation column, in mm)
  selected_features = ['stn_id', 'borrowed_hour', 'borrowed_day', 'is_holiday',
                       'borrowed_num_nearby', '강수량(mm)', 'wind_chill',
                       'nearby_id', 'borrowed_date', 'borrowed_num']
  data = data[selected_features]

  # Label encoding for categorical features
  categorical_features = ['stn_id', 'nearby_id']
  for feature in categorical_features:
      data[feature] = pd.factorize(data[feature])[0]
  ```
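The training code below expects `X_train`, `X_test`, `y_train`, and `y_test`, which the notebook defines during preprocessing. As a minimal sketch, assuming `borrowed_num` is the prediction target and `borrowed_date` is held out of the model inputs (both assumptions, since the split is not shown here):

```python
from sklearn.model_selection import train_test_split

# Assumed target and feature selection; the notebook's actual split may differ
X = data.drop(columns=['borrowed_num', 'borrowed_date'])
y = data['borrowed_num']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```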
```python
import lightgbm as lgb

# LightGBM parameters for regression
lgb_params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 80,
    'learning_rate': 0.05,
    'feature_fraction': 1.0,
    'device': 'gpu'  # optional; remove or set to 'cpu' if no GPU is available
}

# Create training and test datasets
train_data_lgb = lgb.Dataset(X_train, label=y_train)
test_data_lgb = lgb.Dataset(X_test, label=y_test, reference=train_data_lgb)

# Train the LightGBM model with early stopping and evaluation logging every 100 rounds
lgb_model = lgb.train(
    lgb_params,
    train_data_lgb,
    num_boost_round=10000,
    valid_sets=[test_data_lgb, train_data_lgb],
    callbacks=[
        lgb.early_stopping(stopping_rounds=3),
        lgb.log_evaluation(period=100),
    ],
)
```
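To reproduce per-model figures like those reported in section 6, the trained booster can be scored on both splits; a minimal sketch (not the notebook's exact reporting code):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate the LightGBM model at its best iteration on both splits
for name, X_, y_ in [('Training', X_train, y_train), ('Test', X_test, y_test)]:
    pred = lgb_model.predict(X_, num_iteration=lgb_model.best_iteration)
    rmse = mean_squared_error(y_, pred, squared=False)
    print(f'LightGBM {name} RMSE: {rmse:.4f}, R-squared: {r2_score(y_, pred):.4f}')
```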
```python
import xgboost as xgb

# XGBoost parameters for regression
xgb_params = {
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'booster': 'gbtree',
    'learning_rate': 0.1,
    'max_depth': 13,
    'subsample': 0.8,
    'device': 'gpu'  # optional; remove or set to 'cpu' if no GPU is available
}

# Create training and test datasets
train_data_xgb = xgb.DMatrix(X_train, label=y_train)
test_data_xgb = xgb.DMatrix(X_test, label=y_test)

# Train the XGBoost model with early stopping
xgb_model = xgb.train(
    xgb_params,
    train_data_xgb,
    num_boost_round=10000,
    evals=[(test_data_xgb, 'eval')],
    early_stopping_rounds=3,
    verbose_eval=100,
)
```
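To relate the model to station-level strategy, it can help to inspect which features the boosted trees rely on. XGBoost exposes gain-based importances on the trained booster; a short illustrative snippet (feature names are preserved here because the DMatrix was built from a pandas DataFrame):

```python
# Gain-based feature importance: average loss reduction contributed by each feature
importance = xgb_model.get_score(importance_type='gain')
for feature, gain in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{feature}: {gain:.2f}')
```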
```python
from sklearn.metrics import mean_squared_error, r2_score

# Combine predictions of both models for ensemble prediction
y_pred_ensemble = (lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration)
                   + xgb_model.predict(test_data_xgb)) / 2

# Evaluate the performance of the ensemble model
ensemble_rmse = mean_squared_error(y_test, y_pred_ensemble, squared=False)
ensemble_r2 = r2_score(y_test, y_pred_ensemble)
print(f'Ensemble Test RMSE: {ensemble_rmse}')
print(f'Ensemble Test R-squared: {ensemble_r2}')
```
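The ensemble above weights the two models equally. A weighted blend can sometimes edge out the plain average; a small illustrative sweep (for a rigorous choice, tune the weight on a separate validation split rather than the test set):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Sweep the LightGBM weight w from 0 (XGBoost only) to 1 (LightGBM only)
lgb_pred = lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration)
xgb_pred = xgb_model.predict(test_data_xgb)
for w in np.linspace(0.0, 1.0, 11):
    rmse = mean_squared_error(y_test, w * lgb_pred + (1 - w) * xgb_pred, squared=False)
    print(f'w={w:.1f} RMSE={rmse:.4f}')
```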
Based on the ensemble model results, analyze patterns and insights obtained from the predictions.
Considering the analysis, propose effective strategies to reduce the financial deficits associated with the Seoul Public Bike project.
Save the predictions to the `new_data_with_predictions.csv` file:

```python
# Save predictions
new_data.to_csv('new_data_with_predictions.csv', index=False, encoding='utf-8')
```
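`new_data` is assembled in the notebook and holds the ensemble predictions. If you are adapting the snippets in this README on their own, one hypothetical way to build it (the column name `predicted_borrowed_num` is illustrative, not the notebook's):

```python
# Hypothetical assembly of new_data from the test split and ensemble predictions
new_data = X_test.copy()
new_data['borrowed_num'] = y_test.values
new_data['predicted_borrowed_num'] = y_pred_ensemble
```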
After training and evaluating the LightGBM and XGBoost models and their ensemble, here are the key performance metrics:

LightGBM:

| Metric | Training Value | Test Value |
|---|---|---|
| RMSE | 1.9139 | 1.9659 |
| R-squared | 0.5621 | 0.5377 |

XGBoost:

| Metric | Training Value | Test Value |
|---|---|---|
| RMSE | 1.7220 | 1.9135 |
| R-squared | 0.6455 | 0.5620 |

Ensemble:

| Metric | Training Value | Test Value |
|---|---|---|
| RMSE | 1.5199 | 1.7171 |
| R-squared | 0.7128 | 0.6473 |
These metrics provide insights into how well the models are performing, and users can quickly assess the quality of predictions.
Note that the `device` parameter in both model configurations is optional: keep `'gpu'` only if a GPU-enabled build of the library is installed; otherwise remove it or set it to `'cpu'`. Adjust other configurations (paths, encodings, hyperparameters) to match your environment as needed.
For detailed information about the hyperparameter tuning process for XGBoost and LightGBM, including the configurations used and the insights gained, refer to the Hyperparameter Tuning section of the 23_BA_preprocessing repository.
For the preprocessing of Seoul Bike Rental Station Information, you can refer to the BA_Preprocessing repository. The preprocessing repository includes the following files:
- seoul_bicycle_master.json: Master data of Seoul Bike rental stations.
- master_preprocessing.ipynb: Jupyter Notebook that uses the Google API to fill in coordinates recorded as 0.0 in the master data.
- seoul_bicycle_maser_preprocessed.csv: File containing data processed using master_preprocessing.ipynb.
- master_info_with_nearby.ipynb: Jupyter Notebook for adding columns with the nearest rental station and its distance, based on seoul_bicycle_maser_preprocessed.csv (a sketch of this computation appears after this list).
- master_info_with_nearby.csv: File containing data with added information about nearby rental stations using master_info_with_nearby.ipynb.
- master_final.ipynb: Jupyter Notebook for processing rows where the district data has not been correctly recorded due to differences in address formatting.
- master_final.csv: File containing data where 'stn_gu' has been appropriately added to all data using master_final.ipynb.
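For reference, the nearest-station columns can be derived from the station coordinates with a haversine distance; a minimal sketch, assuming the preprocessed CSV is UTF-8 encoded, and not necessarily the notebook's exact method:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, [lat1, lng1, lat2, lng2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

stations = pd.read_csv('seoul_bicycle_maser_preprocessed.csv', encoding='utf-8')
ids = stations['stn_id'].to_numpy()
lat = stations['stn_lat'].to_numpy()
lng = stations['stn_lng'].to_numpy()

# Pairwise distances; mask the diagonal so a station never matches itself
dist = haversine_km(lat[:, None], lng[:, None], lat[None, :], lng[None, :])
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)
stations['nearby_id'] = ids[nearest]
stations['nearby_km'] = dist[np.arange(len(ids)), nearest]
```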
Columns in master_final.csv:
- stn_id: Represents the id of the rental station and is of object type.
- stn_addr: Represents the full address of the rental station and is of object type.
- stn_lat: Represents the latitude of the rental station and is of float64 type.
- stn_lng: Represents the longitude of the rental station and is of float64 type.
- nearby_id: Represents the id of the nearest rental station and is of object type.
- nearby_km: Represents the distance to the nearest rental station in km and is of float64 type.
- stn_gu: Represents the district ('gu') of the rental station, added because the weather data is classified by district; this is of object type.
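A quick way to sanity-check a local copy of `master_final.csv` against these column descriptions (assuming the file is UTF-8 encoded):

```python
import pandas as pd

# Expected dtypes as documented in the column list above
master = pd.read_csv('master_final.csv', encoding='utf-8')
expected = {'stn_id': 'object', 'stn_addr': 'object', 'stn_lat': 'float64',
            'stn_lng': 'float64', 'nearby_id': 'object', 'nearby_km': 'float64',
            'stn_gu': 'object'}
for col, dtype in expected.items():
    assert str(master[col].dtype) == dtype, f'{col} is {master[col].dtype}, expected {dtype}'
```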
To replicate the analysis and run the code, you'll need the dataset file `merged_data.csv`. You can download it using the following link:
Place the downloaded file in the project's root directory before running the Jupyter Notebook.
If you want to collect real-time Seoul Public Bike rental data for testing purposes, you can use the provided Jupyter Notebook:
따릉이 Real Data Collection Notebook
Follow the instructions in the notebook to collect real-time rental data. Note that this step is optional, and you can proceed with the analysis without real-time data collection.