This repository contains a solution to the NT-M5-H1 Smoking Prediction Competition, which focuses on predicting smoking status using health-related data. This project employs machine learning models, primarily using Python and various data science libraries, to create accurate predictions based on provided datasets.
The goal of this project is to predict whether an individual is a smoker based on several health indicators. The dataset includes various attributes related to age, height, weight, cholesterol levels, blood pressure, and other health metrics.
- Data Analysis and Visualization: Initial data exploration and visualizations to understand feature distributions and relationships.
- Feature Engineering: Creation of new, relevant features to enhance model performance.
- Modeling: Implementation of multiple classifiers, including logistic regression, decision trees, ensemble methods, and more.
- Hyperparameter Tuning: Finding optimal model parameters using techniques like cross-validation and grid search.
- Evaluation: Assessing model performance using metrics such as ROC-AUC and accuracy.
NT-M5-H1-Smoking-Cmpetition/
├── notebooks/ # Jupyter notebooks for data analysis and modeling
├── data/ # Datasets used to train and evaluate the model
├── requirements.txt # List of required libraries and dependencies
├── smoking_model.joblib # Final model for prediction
└── README.md # Project documentation
To run this project locally, follow these steps:
- Clone this repository:
git clone https://github.com/UznetDev/NT-M5-H1-Smoking-Cmpetition.git
- Navigate to the project directory:
cd NT-M5-H1-Smoking-Cmpetition
- Install the required dependencies:
pip install -r requirements.txt
To load and use the model for prediction, follow these steps:
from joblib import load
# Load the model
model = load('smoking_model.joblib')
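# Define input data (X) for prediction
# predict expects a 2D input: one row per individual, one value per feature,
# in the same column order that was used for training
X_new = [[]] # Replace the inner list with actual input values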
# Predict smoking status
prediction = model.predict(X_new)
# Display the result
if prediction[0] == 1:
    print("This individual is likely a smoker.")
else:
    print("This individual is likely a non-smoker.")
The smoking_model.joblib file contains a pre-trained model created with scikit-learn. You can use the predict function to determine smoking status based on input health measurements.
Note: The example input (X_new) should be replaced with real data according to your project requirements.
The dataset consists of several features related to individuals' health, including:
- Basic Attributes: Age, height, weight, waist circumference, eyesight, and hearing levels.
- Blood Pressure: Systolic and diastolic blood pressure.
- Blood Metrics: Cholesterol, triglyceride, HDL, LDL, and hemoglobin levels.
- Additional Health Indicators: Serum creatinine, AST, ALT, Gtp, and urine protein.
- Target: A binary column indicating whether an individual is a smoker (1 for smoker, 0 for non-smoker).
The following new features have been engineered to enhance the model's predictive power:
- BMI (Body Mass Index)
- Waist-to-Height Ratio
- Cholesterol Ratio (Total Cholesterol/HDL)
- Liver Enzyme Ratio (ALT/AST)
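The four ratios above can be sketched with pandas as follows. This is a minimal illustration, not the repository's actual code: the column names (height_cm, weight_kg, waist_cm, cholesterol, hdl, alt, ast) and the helper add_engineered_features are assumptions and should be mapped to the real dataset headers.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four ratio features added.

    All column names here are hypothetical placeholders.
    """
    out = df.copy()
    height_m = out["height_cm"] / 100.0
    # BMI = weight (kg) / height (m) squared
    out["bmi"] = out["weight_kg"] / height_m**2
    # Waist-to-height ratio, both measured in centimeters
    out["waist_to_height"] = out["waist_cm"] / out["height_cm"]
    # Total cholesterol divided by HDL
    out["cholesterol_ratio"] = out["cholesterol"] / out["hdl"]
    # ALT divided by AST
    out["liver_enzyme_ratio"] = out["alt"] / out["ast"]
    return out


# One made-up individual, used only to show the call
sample = pd.DataFrame(
    {
        "height_cm": [170.0],
        "weight_kg": [70.0],
        "waist_cm": [85.0],
        "cholesterol": [200.0],
        "hdl": [50.0],
        "alt": [30.0],
        "ast": [25.0],
    }
)
features = add_engineered_features(sample)
print(features[["bmi", "waist_to_height", "cholesterol_ratio", "liver_enzyme_ratio"]])
```

Rows with a zero denominator (for example an AST of 0) would produce infinite values, so real data may need a cleaning step before these ratios are computed.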
The project explores multiple machine learning algorithms for classification:
- Linear Models: Logistic Regression, Ridge Classifier, SGD Classifier
- Naive Bayes Classifiers: GaussianNB, MultinomialNB, BernoulliNB
- Support Vector Machines: SVC, LinearSVC, NuSVC
- Decision Trees and Ensembles: Decision Tree, RandomForest, Gradient Boosting, AdaBoost, Extra Trees
- Discriminant Analysis: Linear Discriminant Analysis, Quadratic Discriminant Analysis
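One way to compare several of these model families is to score each with the same cross-validation folds. The sketch below is an assumption about how such a comparison could look, not the notebooks' actual procedure: it uses synthetic data from make_classification in place of the competition dataset and picks one representative model from four of the groups.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the health features and the smoker/non-smoker target
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# One representative per model family; scaling is added for the linear model
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "gaussian_nb": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "lda": LinearDiscriminantAnalysis(),
}

# Mean ROC-AUC over 5 folds for each candidate
scores = {}
for name, model in candidates.items():
    fold_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    scores[name] = fold_auc.mean()

# Print candidates from best to worst
for name, auc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean ROC-AUC = {auc:.3f}")
```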
To find the best parameters for the models, this project uses Optuna together with cross-validation, choosing the parameters that maximize the cross-validated ROC-AUC score.
The models are evaluated based on the following metric:
- ROC-AUC Score: Primary metric to assess model performance in terms of distinguishing between smokers and non-smokers.
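ROC-AUC is computed from predicted scores or probabilities rather than from hard 0/1 labels. The following sketch illustrates this on a held-out split; the synthetic data and the logistic regression model are placeholders, not the project's final model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Hold out 25% of the rows, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (smoker)
smoker_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, smoker_proba)
print(f"Held-out ROC-AUC: {auc:.3f}")
```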
- Exploratory Data Analysis (EDA): Start with EDA notebooks in the notebooks/ folder to understand the dataset and visualize patterns.
- Model Training: Run the scripts or notebooks to train the models and evaluate their performance.
- Hyperparameter Tuning: Use the scripts for hyperparameter tuning to improve model accuracy.
- Save Model: Save the trained model to 'model.joblib'.
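The final step can be sketched with joblib's dump and load. The example below trains a small placeholder model on synthetic data and writes it to a temporary directory so that it does not overwrite any real file; in the project, the path would be the model file in the repository root.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder model trained on synthetic data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    dump(model, model_path)     # serialize the fitted model to disk
    restored = load(model_path) # read it back as a new object

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Round trip preserved predictions:", same_predictions)
```

Files written by joblib are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.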
Contributions are welcome! If you'd like to improve this project, please fork the repository and make a pull request.
- Fork the repository.
- Create a new branch for your feature or bug fix:
git checkout -b feature-name
- Commit your changes:
git commit -m "Add a new feature"
- Push to your branch:
git push origin feature-name
- Open a pull request.
This project is licensed under the MIT License.
If you have any questions or suggestions, please contact:
- Email: [email protected]
- GitHub Issues: Issues section
- GitHub Profile: UznetDev
- Telegram: UZNet_Dev
- Linkedin: Abdurakhmon Niyozaliev
Thank you for your interest in this project. We hope it helps in your journey to understand and predict smoking habits using data science!