Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heart Disease Detection using ML #108 #123

Merged
merged 7 commits into from
Jan 9, 2025
223 changes: 223 additions & 0 deletions docs/ML/projects/Heart_Disease_Detection_Model.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,223 @@
# Heart Disease Detection Model

### AIM
The aim of this project is to develop a reliable and efficient machine learning-based system for the early detection and diagnosis of heart disease. By leveraging advanced algorithms, the system seeks to analyze patient data, identify significant patterns, and predict the likelihood of heart disease, thereby assisting healthcare professionals in making informed decisions.


### DATASET LINK
This project uses a publicly available heart disease dataset from [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease)


### NOTEBOOK LINK
This is notebook of the following project [Kaggle](https://www.kaggle.com/code/nandagopald2004/heart-disease-detection-using-ml)


### LIBRARIES NEEDED


- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn

---

### DESCRIPTION

what is the requirement of the project?,
The project requires a dataset containing patient health records, including attributes like age, cholesterol levels, blood pressure, and medical history. Additionally, it needs machine learning tools and frameworks (e.g., Python, scikit-learn) for building and evaluating predictive models.

why is it necessary?,
Early detection of heart disease is crucial to prevent severe complications and reduce mortality rates. A machine learning-based system provides accurate, fast, and cost-effective predictions, aiding timely medical intervention and improved patient outcomes.

how is it beneficial and used?,
This system benefits healthcare by improving diagnostic accuracy and reducing reliance on invasive procedures. It can be used by doctors for decision support, by patients for risk assessment, and in hospitals for proactive healthcare management.

how did you start approaching this project?,
The project begins by collecting and preprocessing a heart disease dataset, ensuring it is clean and ready for analysis. Next, machine learning models are selected, trained, and evaluated to identify the most accurate algorithm for predicting heart disease.

Any additional resources used like blogs reading, books reading (mention the name of book along with the pages you have read)?
Kaggle kernels and documentation for additional dataset understanding.
Tutorials on machine learning regression techniques, particularly for Random Forest, SVR, and Decision Trees.

### EXPLANATION

#### DETAILS OF THE DIFFERENT FEATURES
<!-- Elaborate the features as mentioned in the issues, perfoming any googling to learn about the features -->
<!-- Describe the key features of the project, explaining each one in detail. -->
Age: Patient's age in years.

Sex: Gender of the patient (1 = male; 0 = female).

Chest Pain Type (cp): Categorized as:

0: Typical angina
1: Atypical angina
2: Non-anginal pain
3: Asymptomatic
Resting Blood Pressure (trestbps): Measured in mm Hg upon hospital admission.

Serum Cholesterol (chol): Measured in mg/dL.

Fasting Blood Sugar (fbs): Indicates if fasting blood sugar > 120 mg/dL (1 = true; 0 = false).

Resting Electrocardiographic Results (restecg):

0: Normal
1: Having ST-T wave abnormality (e.g., T wave inversions and/or ST elevation or depression > 0.05 mV)
2: Showing probable or definite left ventricular hypertrophy by Estes' criteria
Maximum Heart Rate Achieved (thalach): Peak heart rate during exercise.

Exercise-Induced Angina (exang): Presence of angina induced by exercise (1 = yes; 0 = no).

Oldpeak: ST depression induced by exercise relative to rest.

Slope of the Peak Exercise ST Segment (slope):

0: Upsloping
1: Flat
2: Downsloping
Number of Major Vessels Colored by Fluoroscopy (ca): Ranges from 0 to 3.

Thalassemia (thal):

1: Normal
2: Fixed defect
3: Reversible defect
Target: Diagnosis of heart disease (0 = no disease; 1 = disease).

---

#### PROJECT WORKFLOW

### 1.Problem Definition

Identify the objective: To predict the presence or absence of heart disease based on patient data.
Define the outcome variable (target) and input features.

### 2.Data Collection

Gather a reliable dataset, such as the Cleveland Heart Disease dataset, which includes features relevant to heart disease prediction.

### 3.Data Preprocessing

Handle missing values: Fill or remove records with missing data.
Normalize/standardize data to ensure all features have comparable scales.
Encode categorical variables like sex, cp, and thal using techniques like one-hot encoding or label encoding.

### 4.Exploratory Data Analysis (EDA)

Visualize data distributions using histograms, boxplots, or density plots.
Identify relationships between features using correlation matrices and scatterplots.
Detect and handle outliers to improve model performance.

### 5.Feature Selection

Use statistical methods or feature importance metrics to identify the most relevant features for prediction.
Remove redundant or less significant features.

### 6.Data Splitting

Divide the dataset into training, validation, and testing sets (e.g., 70%-15%-15%).
Ensure a balanced distribution of the target variable in all splits.

### 7.Model Selection

Experiment with multiple machine learning algorithms such as Logistic Regression, Random Forest, Decision Trees, Support Vector Machines (SVM), and Neural Networks.
Select models based on the complexity and nature of the dataset.

### 8.Model Training

Train the chosen models using the training dataset.
Tune hyperparameters using grid search or random search techniques.

### 9.Model Evaluation

Assess models on validation and testing datasets using metrics such as:
Accuracy
Precision, Recall, and F1-score
Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC).
Compare models to identify the best-performing one.

### 10.##Deployment and Prediction

Save the trained model using frameworks like joblib or pickle.
Develop a user interface (UI) or API for end-users to input data and receive predictions.

### 11.Iterative Improvement

Continuously refine the model using new data or advanced algorithms.
Address feedback and optimize the system based on real-world performance.



#### PROJECT TRADE-OFFS AND SOLUTIONS


=== "Trade Off 1"
- Accuracy vs. Interpretability
- Complex models like Random Forests or Neural Networks offer higher accuracy but are less interpretable compared to simpler models like Logistic Regression.
=== "Trade Off 2"
- Overfitting vs. Generalization
- Models with high complexity may overfit the training data, leading to poor generalization on unseen data.
---



### MODELS USED AND THEIR EVALUATION METRICS
<!-- Summarize the models used and their evaluation metrics in a table. -->

| Model | Score |
|------------|----------|
| Logistic regression | 88% |
| K-Nearest Classifier | 68% |
| Random Forest Classifier | 86% |

---

### CONCLUSION

#### KEY LEARNINGS


1. Data Insights
Understanding Healthcare Data: Learned how medical attributes (e.g., age, cholesterol, chest pain type) influence heart disease risk.
Data Imbalance: Recognized the challenges posed by imbalanced datasets and explored techniques like SMOTE and class weighting to address them.
Importance of Preprocessing: Gained expertise in handling missing values, scaling data, and encoding categorical variables, which are crucial for model performance.

2. Techniques Mastered
Exploratory Data Analysis (EDA): Applied visualization tools (e.g., histograms, boxplots, heatmaps) to uncover patterns and correlations in data.
Feature Engineering: Identified and prioritized key features using statistical methods and feature importance metrics.
Modeling: Implemented various machine learning algorithms, including Logistic Regression, Random Forest, Gradient Boosting, and Support Vector Machines.
Evaluation Metrics: Learned to evaluate models using metrics like Precision, Recall, F1-score, and ROC-AUC to optimize for healthcare-specific goals.
Hyperparameter Tuning: Used grid search and random search to optimize model parameters and improve performance.
Interpretability Tools: Utilized SHAP and feature importance analysis to explain model predictions.

3. Skills Developed
Problem-Solving: Addressed trade-offs such as accuracy vs. interpretability, and overfitting vs. generalization.
Critical Thinking: Improved decision-making on model selection, preprocessing methods, and evaluation strategies.
Programming: Strengthened Python programming skills, including the use of libraries like scikit-learn, pandas, matplotlib, and TensorFlow.
Collaboration: Enhanced communication and teamwork when discussing medical insights and technical challenges with domain experts.
Time Management: Balanced experimentation with computational efficiency, focusing on techniques that maximized impact.
Ethical Considerations: Gained awareness of ethical issues like ensuring fairness in predictions and minimizing false negatives, which are critical in healthcare applications.

4. Broader Understanding
Interdisciplinary Knowledge: Combined expertise from data science, healthcare, and statistics to create a meaningful application.
Real-World Challenges: Understood the complexities of translating machine learning models into practical tools for healthcare.
Continuous Learning: Learned that model development is iterative, requiring continuous refinement based on feedback and new data.

#### USE CASES


=== "Application 1"

**Clinical Decision Support Systems (CDSS)**

- ML models can be integrated into Electronic Health Record (EHR) systems to assist doctors in diagnosing heart disease. The model can provide predictions based on patient data, helping clinicians make faster and more accurate decisions.
=== "Application 2"

**Early Screening and Risk Assessment**

- Patients can undergo routine screening using a heart disease detection system to assess their risk level. The system can predict whether a patient is at high, moderate, or low risk, prompting early interventions or lifestyle changes.