- Instructor: Jose Marcial Portilla
- Platform: Udemy
- Duration: 25 hours, 165 videos
- Level: Beginner to Intermediate
- Topics: Python (NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn), SQL, Data Analysis, Machine Learning, Big Data, PySpark, AWS EC2
This intensive bootcamp provides a comprehensive grounding in SQL, data visualization, data analysis, machine learning, and big data tools like Spark. It progresses from introductory Python and data manipulation with libraries like NumPy and Pandas to advanced machine learning and big data applications. The course is designed to build foundational skills, solve real-world data problems, and develop hands-on experience with popular data science tools.
Sections 1-4: Setup and Introduction
- Environment setup: Installing Anaconda, setting up Jupyter, and a Python crash course.
- Objective: Set up a robust data analysis environment and gain an introduction to Python basics.
Section 5: NumPy Essentials
- Topics: Arrays, indexing, operations, and exercises to develop core NumPy skills.
- Objective: Handle numerical data efficiently and prepare for data manipulation with Pandas.
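The array, indexing, and operations topics above can be illustrated with a minimal sketch. The array contents and variable names here are made up for illustration and are not taken from the course exercises.

```python
import numpy as np

# A small 2-D array: 3 rows, 4 columns, holding the values 0..11
arr = np.arange(12).reshape(3, 4)

# Indexing and slicing
first_row = arr[0]          # the whole first row
sub_block = arr[:2, 1:3]    # rows 0-1, columns 1-2
evens = arr[arr % 2 == 0]   # a boolean mask keeps only the even values

# Vectorized operations and broadcasting
doubled = arr * 2               # element-wise, no explicit loop
col_means = arr.mean(axis=0)    # one mean per column
centered = arr - col_means      # the column means are broadcast across rows

print(sub_block)
print(col_means)
```

Boolean masks and broadcasting are the two ideas that carry over most directly to Pandas, where column filters and column arithmetic follow the same pattern.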
Sections 6-7: Data Manipulation with Pandas
- Topics: DataFrames, handling missing data, groupby, merging, concatenation, operations, input/output, and exercises.
- Objective: Learn the full scope of data wrangling techniques needed for real-world data analysis.
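As a rough sketch of how missing-data handling, groupby, and merging fit together, consider the toy example below. The `sales` and `regions` tables and their column names are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in the "units" column
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [10, np.nan, 7, 5],
})
regions = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})

# Missing data: here the gap is filled with 0; dropping the row or
# filling with a mean are alternatives, depending on the analysis.
sales["units"] = sales["units"].fillna(0)

# Groupby: total units per store
totals = sales.groupby("store", as_index=False)["units"].sum()

# Merge: attach the region of each store (SQL-style left join on "store")
merged = totals.merge(regions, on="store", how="left")

print(merged)
```

Filling with 0 is only one possible choice; whether a missing value should count as zero, be imputed, or be dropped is a judgment about the data rather than about Pandas.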
Sections 8-12: Data Visualization Techniques
- Matplotlib: Basic plotting and exercises.
- Seaborn: In-depth exploration of distribution plots, regression plots, matrix plots, and more.
- Pandas Built-in: Quick data visualizations using Pandas.
- Plotly & Cufflinks: Interactive visualizations for dashboards or websites.
- Geographical Plotting: Using maps for data visualization with examples (e.g., 2014 World Power Consumption, 2012 Election Data).
Section 13: Data Analysis Capstone Project
- 911 Calls Project: Exploring call distribution by type and creating heatmaps.
- Finance Project: Analyzing stock prices and exploring financial crisis impacts.
- Objective: Practice exploratory data analysis (EDA) with complex datasets and tackle real-life questions.
Sections 14-17: Machine Learning Basics
- Introduction to Machine Learning: Supervised learning, classification, and regression metrics.
- Linear Regression: Theory, exercises, and a project on eCommerce customer behavior.
- Cross-validation: Techniques and bias-variance trade-off.
- Logistic Regression: Theory, exercises, and an advertising data prediction project.
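The workflow behind the linear regression and cross-validation bullets can be sketched on synthetic data. The data-generating formula below (coefficients 3 and -2 plus Gaussian noise) is an assumption made for this example, not the course's eCommerce dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (assumed for illustration): y = 3*x0 - 2*x1 + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# Hold out 30% of the rows to estimate error on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# 5-fold cross-validation: five train/validate splits instead of one,
# which gives a less split-dependent estimate of R^2
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print("coefficients:", model.coef_)
print("test RMSE:", rmse)
print("mean CV R^2:", cv_r2.mean())
```

A single train/test split can look better or worse by chance; averaging over folds reduces that dependence on one particular split, which is why cross-validation is commonly used when comparing models.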
Sections 18-20: Advanced Machine Learning Techniques
- K Nearest Neighbors (KNN): Theory, EDA, standardization, optimal K-values, and a KNN project.
- Decision Trees & Random Forests: Project on loan repayment prediction with LendingClub data.
- Support Vector Machines (SVM): SVM theory, application with the Iris dataset, and evaluation.
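Standardization and the search for a good K in KNN can be sketched as follows. Two assumptions apply: the Iris dataset is used only because it ships with scikit-learn, and K is chosen by 5-fold cross-validated accuracy, which is one common approach and not necessarily the procedure followed in the course.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# KNN relies on distances, so features are standardized first.
# Putting the scaler in a pipeline means it is refit on each training fold,
# so no information leaks from the validation fold.
scores = {}
for k in range(1, 16, 2):  # odd values of K from 1 to 15
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best K:", best_k, "mean CV accuracy:", scores[best_k])
```

Plotting `scores` against K gives the accuracy-based counterpart of an error-rate-versus-K curve.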
Sections 21-24: Unsupervised Learning
- K-Means Clustering: Clustering universities and evaluating results.
- Principal Component Analysis (PCA): Dimensionality reduction on the Breast Cancer dataset.
- Natural Language Processing (NLP): Classifying Yelp reviews with an NLP pipeline.
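For the PCA item above, a minimal sketch of dimensionality reduction might look like the following. It assumes the copy of the Breast Cancer Wisconsin dataset bundled with scikit-learn, and the choice of two components is for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# PCA is sensitive to feature scale, so each feature is standardized first
X_scaled = StandardScaler().fit_transform(data.data)

# Project the 30 original features onto the 2 directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_
print("shape before:", X_scaled.shape, "after:", X_2d.shape)
print("variance explained by each component:", explained)
```

The two resulting columns can be scatter-plotted and colored by `data.target` to see how well the classes separate in the reduced space.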
Section 25: Neural Networks and Deep Learning
- Topics: Introduction to artificial neural networks (ANNs), activation functions, cost functions, backpropagation, TensorFlow, and Keras.
- Objective: Build, train, and evaluate neural networks using TensorFlow and Keras.
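To make the terms activation function, cost function, and backpropagation concrete, here is a single-neuron sketch written in plain NumPy. It deliberately does not use TensorFlow or Keras: it is a simplified stand-in on synthetic data, with every name and hyperparameter chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Activation function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cost function: average negative log-likelihood of the labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

# Synthetic, linearly separable data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.5          # learning rate
losses = []

for _ in range(200):
    p = sigmoid(X @ w + b)                  # forward pass
    losses.append(binary_cross_entropy(y, p))
    grad_z = (p - y) / len(y)               # dCost/dz for sigmoid + cross-entropy
    w -= lr * (X.T @ grad_z)                # chain rule back to the weights
    b -= lr * grad_z.sum()                  # and to the bias

accuracy = float(np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1)))
print("first loss:", losses[0], "last loss:", losses[-1], "accuracy:", accuracy)
```

A multi-layer network repeats the same forward and backward steps layer by layer; libraries such as Keras automate the gradient computation that is written out by hand here.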
Section 26: Big Data and Spark
- Topics: Big Data overview, AWS EC2 setup, SSH with Mac, PySpark, RDD transformations, and actions.
- Objective: Gain hands-on experience with big data technologies and distributed computing.
Key projects:
- 911 Calls Project: Analyzing and visualizing emergency call data.
- Finance Data Project: EDA on stock prices during the financial crisis.
- Linear Regression Project: Advising an eCommerce company on mobile app vs. website focus.
- Logistic Regression Project: Predicting ad-clicking behavior.
- Random Forest Project: Loan repayment prediction with LendingClub data.
- NLP Project: Sentiment analysis on Yelp reviews.
Learning outcomes:
- Set up an environment for data analysis and machine learning.
- Master data wrangling with Pandas and data visualization with Matplotlib, Seaborn, and Plotly.
- Gain foundational machine learning skills with Scikit-learn, including supervised and unsupervised algorithms.
- Develop practical experience with neural networks using TensorFlow and Keras.
- Understand big data basics and work with Spark.
Tools and technologies:
- Python libraries: NumPy, Pandas, Matplotlib, Seaborn, Plotly, Scikit-learn, TensorFlow, Keras
- Big Data: PySpark, AWS EC2, SSH setup
- Additional: Plotly Cufflinks for interactive visualizations
Throughout the bootcamp, the course tackled real-world challenges:
- Data Quality: Handling missing values and inconsistent formats in large datasets.
- Computational Limitations: Using PySpark and AWS EC2 instances for large-scale data.
- Model Complexity: Using cross-validation and hyperparameter tuning to improve accuracy on unseen data and reduce overfitting.
- Big Data Challenges: Acquiring skills for distributed data processing with Spark.
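The combination of cross-validation and hyperparameter tuning mentioned in the list above can be sketched with a grid search. The choice of an SVM on the Iris dataset and the specific grid of `C` and `gamma` values are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Keep a test set aside; it plays no part in choosing hyperparameters
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# Candidate values (illustrative): every combination is scored by 5-fold CV
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)

# After the search, the best model is refit on all of the training data
test_accuracy = grid.score(X_test, y_test)

print("best parameters:", grid.best_params_)
print("held-out test accuracy:", test_accuracy)
```

Scoring the final model on data that was never used during the search is what guards against an overly optimistic estimate, since the search itself can overfit to the validation folds.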