diff --git a/docs/ML/projects/Heart_Disease_Detection_Model.md b/docs/ML/projects/Heart_Disease_Detection_Model.md new file mode 100644 index 00000000..a25c7bff --- /dev/null +++ b/docs/ML/projects/Heart_Disease_Detection_Model.md @@ -0,0 +1,223 @@ +# Heart Disease Detection Model + +### AIM +The aim of this project is to develop a reliable and efficient machine learning-based system for the early detection and diagnosis of heart disease. By leveraging advanced algorithms, the system seeks to analyze patient data, identify significant patterns, and predict the likelihood of heart disease, thereby assisting healthcare professionals in making informed decisions. + + +### DATASET LINK +This project uses a publicly available heart disease dataset from [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease) + + +### NOTEBOOK LINK +This is notebook of the following project [Kaggle](https://www.kaggle.com/code/nandagopald2004/heart-disease-detection-using-ml) + + +### LIBRARIES NEEDED + + + - pandas + - numpy + - scikit-learn + - matplotlib + - seaborn + +--- + +### DESCRIPTION + +what is the requirement of the project?, +The project requires a dataset containing patient health records, including attributes like age, cholesterol levels, blood pressure, and medical history. Additionally, it needs machine learning tools and frameworks (e.g., Python, scikit-learn) for building and evaluating predictive models. + +why is it necessary?, +Early detection of heart disease is crucial to prevent severe complications and reduce mortality rates. A machine learning-based system provides accurate, fast, and cost-effective predictions, aiding timely medical intervention and improved patient outcomes. + +how is it beneficial and used?, +This system benefits healthcare by improving diagnostic accuracy and reducing reliance on invasive procedures. It can be used by doctors for decision support, by patients for risk assessment, and in hospitals for proactive healthcare management. + +how did you start approaching this project?, +The project begins by collecting and preprocessing a heart disease dataset, ensuring it is clean and ready for analysis. Next, machine learning models are selected, trained, and evaluated to identify the most accurate algorithm for predicting heart disease. + +Any additional resources used like blogs reading, books reading (mention the name of book along with the pages you have read)? +Kaggle kernels and documentation for additional dataset understanding. +Tutorials on machine learning regression techniques, particularly for Random Forest, SVR, and Decision Trees. + +### EXPLANATION + +#### DETAILS OF THE DIFFERENT FEATURES + + +Age: Patient's age in years. + +Sex: Gender of the patient (1 = male; 0 = female). + +Chest Pain Type (cp): Categorized as: + +0: Typical angina +1: Atypical angina +2: Non-anginal pain +3: Asymptomatic +Resting Blood Pressure (trestbps): Measured in mm Hg upon hospital admission. + +Serum Cholesterol (chol): Measured in mg/dL. + +Fasting Blood Sugar (fbs): Indicates if fasting blood sugar > 120 mg/dL (1 = true; 0 = false). + +Resting Electrocardiographic Results (restecg): + +0: Normal +1: Having ST-T wave abnormality (e.g., T wave inversions and/or ST elevation or depression > 0.05 mV) +2: Showing probable or definite left ventricular hypertrophy by Estes' criteria +Maximum Heart Rate Achieved (thalach): Peak heart rate during exercise. + +Exercise-Induced Angina (exang): Presence of angina induced by exercise (1 = yes; 0 = no). + +Oldpeak: ST depression induced by exercise relative to rest. + +Slope of the Peak Exercise ST Segment (slope): + +0: Upsloping +1: Flat +2: Downsloping +Number of Major Vessels Colored by Fluoroscopy (ca): Ranges from 0 to 3. + +Thalassemia (thal): + +1: Normal +2: Fixed defect +3: Reversible defect +Target: Diagnosis of heart disease (0 = no disease; 1 = disease). + +--- + +#### PROJECT WORKFLOW + +### 1.Problem Definition + +Identify the objective: To predict the presence or absence of heart disease based on patient data. +Define the outcome variable (target) and input features. + +### 2.Data Collection + +Gather a reliable dataset, such as the Cleveland Heart Disease dataset, which includes features relevant to heart disease prediction. + +### 3.Data Preprocessing + +Handle missing values: Fill or remove records with missing data. +Normalize/standardize data to ensure all features have comparable scales. +Encode categorical variables like sex, cp, and thal using techniques like one-hot encoding or label encoding. + +### 4.Exploratory Data Analysis (EDA) + +Visualize data distributions using histograms, boxplots, or density plots. +Identify relationships between features using correlation matrices and scatterplots. +Detect and handle outliers to improve model performance. + +### 5.Feature Selection + +Use statistical methods or feature importance metrics to identify the most relevant features for prediction. +Remove redundant or less significant features. + +### 6.Data Splitting + +Divide the dataset into training, validation, and testing sets (e.g., 70%-15%-15%). +Ensure a balanced distribution of the target variable in all splits. + +### 7.Model Selection + +Experiment with multiple machine learning algorithms such as Logistic Regression, Random Forest, Decision Trees, Support Vector Machines (SVM), and Neural Networks. +Select models based on the complexity and nature of the dataset. + +### 8.Model Training + +Train the chosen models using the training dataset. +Tune hyperparameters using grid search or random search techniques. + +### 9.Model Evaluation + +Assess models on validation and testing datasets using metrics such as: +Accuracy +Precision, Recall, and F1-score +Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC). +Compare models to identify the best-performing one. + +### 10.##Deployment and Prediction + +Save the trained model using frameworks like joblib or pickle. +Develop a user interface (UI) or API for end-users to input data and receive predictions. + +### 11.Iterative Improvement + +Continuously refine the model using new data or advanced algorithms. +Address feedback and optimize the system based on real-world performance. + + + +#### PROJECT TRADE-OFFS AND SOLUTIONS + + +=== "Trade Off 1" + - Accuracy vs. Interpretability + - Complex models like Random Forests or Neural Networks offer higher accuracy but are less interpretable compared to simpler models like Logistic Regression. +=== "Trade Off 2" + - Overfitting vs. Generalization + - Models with high complexity may overfit the training data, leading to poor generalization on unseen data. +--- + + + +### MODELS USED AND THEIR EVALUATION METRICS + + +| Model | Score | +|------------|----------| +| Logistic regression | 88% | +| K-Nearest Classifier | 68% | +| Random Forest Classifier | 86% | + +--- + +### CONCLUSION + +#### KEY LEARNINGS + + +1. Data Insights +Understanding Healthcare Data: Learned how medical attributes (e.g., age, cholesterol, chest pain type) influence heart disease risk. +Data Imbalance: Recognized the challenges posed by imbalanced datasets and explored techniques like SMOTE and class weighting to address them. +Importance of Preprocessing: Gained expertise in handling missing values, scaling data, and encoding categorical variables, which are crucial for model performance. + +2. Techniques Mastered +Exploratory Data Analysis (EDA): Applied visualization tools (e.g., histograms, boxplots, heatmaps) to uncover patterns and correlations in data. +Feature Engineering: Identified and prioritized key features using statistical methods and feature importance metrics. +Modeling: Implemented various machine learning algorithms, including Logistic Regression, Random Forest, Gradient Boosting, and Support Vector Machines. +Evaluation Metrics: Learned to evaluate models using metrics like Precision, Recall, F1-score, and ROC-AUC to optimize for healthcare-specific goals. +Hyperparameter Tuning: Used grid search and random search to optimize model parameters and improve performance. +Interpretability Tools: Utilized SHAP and feature importance analysis to explain model predictions. + +3. Skills Developed +Problem-Solving: Addressed trade-offs such as accuracy vs. interpretability, and overfitting vs. generalization. +Critical Thinking: Improved decision-making on model selection, preprocessing methods, and evaluation strategies. +Programming: Strengthened Python programming skills, including the use of libraries like scikit-learn, pandas, matplotlib, and TensorFlow. +Collaboration: Enhanced communication and teamwork when discussing medical insights and technical challenges with domain experts. +Time Management: Balanced experimentation with computational efficiency, focusing on techniques that maximized impact. +Ethical Considerations: Gained awareness of ethical issues like ensuring fairness in predictions and minimizing false negatives, which are critical in healthcare applications. + +4. Broader Understanding +Interdisciplinary Knowledge: Combined expertise from data science, healthcare, and statistics to create a meaningful application. +Real-World Challenges: Understood the complexities of translating machine learning models into practical tools for healthcare. +Continuous Learning: Learned that model development is iterative, requiring continuous refinement based on feedback and new data. + +#### USE CASES + + +=== "Application 1" + + **Clinical Decision Support Systems (CDSS)** + + - ML models can be integrated into Electronic Health Record (EHR) systems to assist doctors in diagnosing heart disease. The model can provide predictions based on patient data, helping clinicians make faster and more accurate decisions. +=== "Application 2" + + **Early Screening and Risk Assessment** + + - Patients can undergo routine screening using a heart disease detection system to assess their risk level. The system can predict whether a patient is at high, moderate, or low risk, prompting early interventions or lifestyle changes. +