[New Hack] 072-DataScienceInFabric (#883)
* Creating directory xxx-DataScience_In_MicrosoftFabric/Coach/Solutions/Notebooks

* Committing 7 items from workspace 9aafe34c-0112-488c-ac8a-787e353a55e1

* Committing 6 items from workspace 9aafe34c-0112-488c-ac8a-787e353a55e1

* Committing 4 items from workspace 9aafe34c-0112-488c-ac8a-787e353a55e1

* Add files via upload

* Rename 03-Train-Register-HeartFailurePredictionModel  (1).ipynb to 03-Train-Register-HeartFailurePredictionModel.ipynb

* Rename 02-data-analysis-preprocess (4).ipynb to 02-data-analysis-preprocess.ipynb

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-00.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Challenge-04.md

* Update Solution-04.md

* Update Solution-04.md

* Add files via upload

* Update Solution-04.md

* Update Solution-04.md

* Update Solution-04.md

* Add files via upload

* Update Solution-04.md

* Update Solution-04.md

* Update Solution-04.md

* Add files via upload

* Update Solution-04.md

* Update Solution-04.md

* Update Solution-04.md

* Add files via upload

* Update Solution-04.md

* Add files via upload

* Update Solution-04.md

* Update Solution-04.md

* Add files via upload

* Delete xxx-DataScience_In_MicrosoftFabric/Screenshot_26-6-2024_03016_ml.azure.com.jpeg

* Add files via upload

* Update Solution-04.md

* Update README.md

* Add files via upload

* Rename 01-Ingest-Heart-Failure-Dataset-to-Lakehouse (1).ipynb to 01-Ingest-Heart-Failure-Dataset-to-Lakehouse.ipynb

* Rename 03-Train-Register-HeartFailurePredictionModel (1).ipynb to 03-Train-Register-HeartFailurePredictionModel.ipynb

* Add files via upload

* Create a

* Rename xxx-DataScience_In_MicrosoftFabric/Coach/Screenshot_26-6-2024_01752_ml.azure.com.jpeg to xxx-DataScience_In_MicrosoftFabric/Coach/Photos/Screenshot_26-6-2024_01752_ml.azure.com.jpeg

* Rename xxx-DataScience_In_MicrosoftFabric/Coach/Screenshot_26-6-2024_03016_ml.azure.com.jpeg to xxx-DataScience_In_MicrosoftFabric/Coach/Photos/Screenshot_26-6-2024_03016_ml.azure.com.jpeg

* Rename xxx-DataScience_In_MicrosoftFabric/Coach/image-10.png to xxx-DataScience_In_MicrosoftFabric/Coach/Photos/image-10.png

* Rename xxx-DataScience_In_MicrosoftFabric/Coach/image-11.png to xxx-DataScience_In_MicrosoftFabric/Coach/Photos/image-11.png

* Rename xxx-DataScience_In_MicrosoftFabric/Coach/image-12.png to xxx-DataScience_In_MicrosoftFabric/Coach/Photos/image-12.png

* Rename xxx-DataScience_In_MicrosoftFabric/Coach/image-13.png to xxx-DataScience_In_MicrosoftFabric/Coach/Photos/image-13.png

* Rename xxx-DataScience_In_MicrosoftFabric/Coach/image-14.png to xxx-DataScience_In_MicrosoftFabric/Coach/Photos/image-14.png

* Rename xxx-DataScience_In_MicrosoftFabric/Coach/image-15.png to xxx-DataScience_In_MicrosoftFabric/Coach/Photos/image-15.png

* Update Solution-04.md

* Update Solution-01.md

* Update Solution-01.md

* Update Solution-00.md

* Update Solution-00.md

* Update Solution-00.md

* Update Challenge-00.md

* Update Challenge-00.md

* Update Solution-00.md

* Update Solution-01.md

* Update Solution-01.md

* Update Challenge-01.md

* Update Challenge-00.md

* Update Challenge-00.md

* Update Challenge-05.md

* Update Challenge-06.md

* Update Challenge-06.md

* Update Challenge-06.md

* Update Solution-06.md

* Update Solution-06.md

* Update Solution-00.md

* Update Challenge-00.md

* Update Solution-00.md

* Update Solution-00.md

* Update Challenge-00.md

* Update Solution-00.md

* Update Solution-01.md

* Update Challenge-01.md

* Update Challenge-01.md

* Update Challenge-01.md

* Update Challenge-01.md

* Update Challenge-01.md

* Update Challenge-01.md

* Update Challenge-01.md

* Update Challenge-01.md

* Update Solution-03.md

* Update Solution-03.md

* Update Solution-03.md

* Update Solution-03.md

* Update Solution-03.md

* Update Solution-01.md

* Update Solution-01.md

* Update Solution-03.md

* Update Solution-01.md

* Update Challenge-03.md

* Update Solution-04.md

* Update Challenge-04.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Solution-05.md

* Update Solution-05.md

* Update Solution-05.md

* Update Solution-05.md

* Update Solution-05.md

* Update Solution-05.md

* Update Solution-05.md

* Update Challenge-02.md

* Update Solution-01.md

* Update Solution-01.md

* Update Challenge-02.md

* Update Solution-01.md

* Update Challenge-02.md

* Update Challenge-02.md

* Update Challenge-02.md

* Update Solution-03.md

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Solutions/Notebooks/01-Ingest-Heart-Failure-Dataset-to-Lakehouse.ipynb

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Solutions/Notebooks/02-data-analysis-preprocess.ipynb

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Solutions/Notebooks/03-Train-Register-HeartFailurePredictionModel.ipynb

* Add files via upload

* Update Solution-02.md

* Update Solution-02.md

* Update Solution-02.md

* Update Solution-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Solution-06.md

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Photos/a

* Add files via upload

* Rename Screenshot 2024-07-17 153448.png to postman-body.png

* Rename Screenshot 2024-07-17 153526.png to postman-token.png

* Rename Screenshot 2024-07-17 153501.png to postman-header.png

* Update Solution-06.md

* Update Challenge-06.md

* Update Challenge-06.md

* Update Challenge-06.md

* Update Challenge-06.md

* Add files via upload

* Create a

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Notebooks/a

* Create a

* Add files via upload

* Add files via upload

* Delete xxx-DataScience_In_MicrosoftFabric/Student/Resources/01-Ingest-Heart-Failure-Dataset-to-Lakehouse.ipynb

* Delete xxx-DataScience_In_MicrosoftFabric/Student/Resources/03-Train-Register-HeartFailurePredictionModel.ipynb

* Delete xxx-DataScience_In_MicrosoftFabric/Student/Resources/04-Perform-HeartFailure-Batch-Predictions.ipynb

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/CoachResources.zip

* Add files via upload

* Add files via upload

* Create a

* Add files via upload

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Notebooks/a

* Delete xxx-DataScience_In_MicrosoftFabric/Student/Notebooks/a

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Add files via upload

* Delete xxx-DataScience_In_MicrosoftFabric/Student/Resources/heart_failure_clinical_records_dataset.csv

* Delete xxx-DataScience_In_MicrosoftFabric/Student/StudentResources.zip

* Add files via upload

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/CoachResources.zip

* Add files via upload

* Delete xxx-DataScience_In_MicrosoftFabric/Student/StudentResources.zip

* Add files via upload

* Update README.md

* Update README.md

* Delete xxx-DataScience_In_MicrosoftFabric/Student/Challenge-07.md

* Delete xxx-DataScience_In_MicrosoftFabric/Student/Challenge-08.md

* Update Challenge-00.md

* Update Challenge-01.md

* Update Challenge-02.md

* Update Challenge-02.md

* Update Challenge-02.md

* Update Challenge-02.md

* Update Challenge-02.md

* Update Challenge-02.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-05.md

* Update Challenge-06.md

* Update Challenge-06.md

* Update Challenge-06.md

* Update Challenge-06.md

* Update Challenge-06.md

* Update Challenge-06.md

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Lectures.pptx

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update Solution-05.md

* Update Solution-05.md

* Update Solution-05.md

* Update Solution-06.md

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Solution-07.md

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Solution-08.md

* Delete xxx-DataScience_In_MicrosoftFabric/Coach/Solutions directory

* Renaming top level folder from xxx to 072

* Moving student resources

* Moved coach solutions

* Update student setup instructions

* Updating coach setup instructions

* Moved Images folder to root, updated references

* Update licensing requirements across both guides

* Add wordlist and fix spellcheck errors

* Second spellchecker fix

* Adding changes from Cameron's test run

* Update README.md

Grammar updates

* Update Solution-00.md

Fixing Links

* Update Solution-02.md

Grammar update

* Update Solution-03.md

Grammar update

* Update Solution-04.md

grammar/spelling update

* Update Challenge-00.md

Fixing links

* Update Challenge-04.md

grammar update

* Resolved issues uncovered in review by WTH team

* Fix typo in Solution-02.md

fixing typo

* Remove GitHub reference in solution guide

* Update script csv link

---------

Co-authored-by: Pardeep Kumar Singla <[email protected]>
Co-authored-by: Leandro Santana <[email protected]>
Co-authored-by: Andy Huang <[email protected]>
4 people authored Jan 7, 2025
1 parent 2f4795d commit 0dd244c
Showing 39 changed files with 1,897 additions and 0 deletions.
8 changes: 8 additions & 0 deletions 072-DataScienceInFabric/.wordlist.txt
@@ -0,0 +1,8 @@
dataframe
DataFrame
DataFrame's
MLFlow
Leandro
interpretability
auc
repurpose
85 changes: 85 additions & 0 deletions 072-DataScienceInFabric/Coach/README.md
@@ -0,0 +1,85 @@
# What The Hack - Data Science In Microsoft Fabric

## Introduction

Welcome to the coach's guide for the Data Science In Microsoft Fabric What The Hack. Here you will find links to specific guidance for coaches for each of the challenges.

**NOTE:** If you are a Hackathon participant, this is the answer guide. Don't cheat yourself by looking at these during the hack! Go learn something. :)

## Coach's Guides

- Challenge 00: **[Prerequisites - Ready, Set, GO!](./Solution-00.md)**
- Configure your Fabric workspace and gather your data
- Challenge 01: **[Bring your data to the OneLake](./Solution-01.md)**
- Creating a shortcut to the available data
- Challenge 02: **[Data preparation with Data Wrangler](./Solution-02.md)**
- Clean and transform the data into a useful format while leveraging Data Wrangler
- Challenge 03: **[Train and register the model](./Solution-03.md)**
  - Train a machine learning model with MLFlow, with the help of Copilot
- Challenge 04: **[Generate batch predictions](./Solution-04.md)**
- Score a static dataset with the model
- Challenge 05: **[Visualize predictions with a PowerBI report](./Solution-05.md)**
- Visualize generated predictions by using a PowerBI report
- Challenge 06: **[(Optional) Deploy the model to an AzureML real-time endpoint](./Solution-06.md)**
- Deploy the trained model to an AzureML endpoint for inference

## Coach Prerequisites

This hack has pre-reqs that a coach is responsible for understanding and/or setting up BEFORE hosting an event. Please review the [What The Hack Hosting Guide](https://aka.ms/wthhost) for information on how to host a hack event.

The guide covers the common preparation steps a coach needs to do before any What The Hack event, including how to properly configure Microsoft Teams.

### Student Resources

Always refer students to the [What The Hack website](https://aka.ms/wth) for the student guide: [https://aka.ms/wth](https://aka.ms/wth)

**NOTE:** Students should **not** be given a link to the What The Hack repo before or during a hack. The student guide does **NOT** have any links to the Coach's guide or the What The Hack repo on GitHub.


## Azure and Fabric Requirements

This hack requires students to have access to Azure and Fabric. Share these requirements with the stakeholder in the organization who will provide the licenses used by the students.

### Fabric and PowerBI licensing requirements

Each student will need access to Microsoft Fabric and a license to create PowerBI reports for this hack. The following options satisfy these licensing requirements:

1. **Recommended if available**: Individual [Fabric free trials](https://learn.microsoft.com/en-us/fabric/get-started/fabric-trial#start-the-fabric-capacity-trial). This grants users the ability to create the required Fabric items as well as the PowerBI report. **If previously used, the Fabric free trial may be unavailable.**
2. Fabric Capacity plus a PowerBI Pro/Premium per-user license. Each user needs their own PowerBI license, but capacities can be shared and scaled up as needed. If running the hack on an individual basis, an F4 capacity is adequate, and an F8 capacity leaves a generous compute margin. **Alternatively, users can activate a [PowerBI free trial](https://learn.microsoft.com/en-us/power-bi/fundamentals/service-self-service-signup-for-power-bi) if available.** The PowerBI trial may be available even if the Fabric one is not.


### Azure licensing requirements

Two challenges require access to Azure:

- Challenge 1: Students are required to navigate an Azure ADLS Gen2 account through the Azure Portal to learn how to set up a Fabric shortcut to an existing file. This challenge requires each student to have contributor permissions on the resource, but a single storage account/directory/file can be shared among all students, since they will not modify it but rather just access and connect to it.

- Challenge 6: Students are required to have Azure AI Developer access to an Azure Machine Learning resource. Each student will need to register their own model and create their own real-time endpoint, which is why it is **recommended to deploy an Azure ML workspace per student**.

Given these requirements, each student could have their own Azure subscription, or they could share access to a single subscription.

These Azure resources can be deployed on an individual per-student basis using the `deployhack.sh` script included in the student resources folder.

## Suggested Hack Agenda

You may schedule this hack in any format, as long as the challenges are completed sequentially.

Time estimates for each challenge:
- Challenge 00: 15 minutes
- Challenge 01: 30 minutes
- Challenge 02: 30 minutes
- Challenge 03: 45 minutes
- Challenge 04: 30 minutes
- Challenge 05: 30 minutes
- Challenge 06: 45 minutes

## Repository Contents

- `./Coach`
- Coach's Guide and related files
- `./Coach/Solutions`
- Solution files with completed example answers to challenges
- `./Student`
- Student's Challenge Guide
- `./Student/Resources`
- Student resource files, also available as a download link in Student Challenge 0
62 changes: 62 additions & 0 deletions 072-DataScienceInFabric/Coach/Solution-00.md
@@ -0,0 +1,62 @@
# Challenge 00 - Prerequisites - Ready, Set, GO! - Coach's Guide

**[Home](./README.md)** - [Next Solution >](./Solution-01.md)

## Introduction

Thank you for participating in the Data Science in Microsoft Fabric What The Hack. Before you can hack, you will need to set up some prerequisites.

## Common Prerequisites

We have compiled a list of common tools and software that will come in handy to complete most What The Hack Azure-based hacks!

You might not need all of them for the hack you are participating in. However, if you work with Azure on a regular basis, these are all things you should consider having in your toolbox.

<!-- If you are editing this template manually, be aware that these links are only designed to work if this Markdown file is in the /xxx-HackName/Student/ folder of your hack. -->

- [Azure Subscription](../../000-HowToHack/WTH-Common-Prerequisites.md#azure-subscription)
- [Postman](https://www.postman.com/downloads/)
- [Managing Cloud Resources](../../000-HowToHack/WTH-Common-Prerequisites.md#managing-cloud-resources)
- [Azure Portal](../../000-HowToHack/WTH-Common-Prerequisites.md#azure-portal)
- [Azure Cloud Shell](../../000-HowToHack/WTH-Common-Prerequisites.md#azure-cloud-shell)
- [Azure CLI (optional)](../../000-HowToHack/WTH-Common-Prerequisites.md#azure-cli)
- [Note for Windows Users](../../000-HowToHack/WTH-Common-Prerequisites.md#note-for-windows-users)
- [Azure PowerShell CmdLets](../../000-HowToHack/WTH-Common-Prerequisites.md#azure-powershell-cmdlets)

- [Azure Storage Explorer (optional)](../../000-HowToHack/WTH-Common-Prerequisites.md#azure-storage-explorer)

Additionally, please refer to the [Coach Hack introduction](./README.md) for more information about licensing requirements and options.

## Description

Now that you have the common prerequisites installed on your workstation, there are prerequisites specific to this hack.

In Challenge 0 of the student guide, students are instructed to download the Resources folder [here](https://aka.ms/FabricdsWTHResources). This folder contains the notebooks students will be working with, as well as a shell script they will use to deploy the needed Azure resources.

The [coach solution notebooks](./Solutions/) are the completed versions of the student notebooks. The solutions can be used as a guide or uploaded to Fabric to complete each Challenge.


**NOTE:** The resources.zip file also includes the heart.csv file. You can upload this data directly to the Fabric Lakehouse if you want to go through this hack without an Azure subscription. However, this skips half of Challenge 1 and the important concept of using shortcuts in Fabric. If you are going to set up the Azure resources and use the shortcut, ignore the heart.csv file.

To begin setting up your Azure subscription for this hack, you will run a bash script that deploys and configures the required resources. You can find this script as the `HackSetup.sh` file in the resources folder.
- Download the setup file to your computer
- Go to the Azure portal and click on the cloud shell button on the top navigation bar, to the right of the Copilot button.
- **NOTE**: This script is designed for the Azure CLI in the Cloud Shell. It might fail to deploy if you attempt to run it from a local terminal.
- Once the cloud shell connects, make sure you are using a Bash shell. If you are not, click the button in the top-right corner of the cloud shell to switch to Bash.
- Click the Manage Files button on the shell's navigation bar and select Upload. Select the setup file from your computer.
- Run the `sh HackSetup.sh` command in your cloud shell.
- Follow the prompts in the shell.

After setting up your Azure resources, head to [Microsoft Fabric](https://fabric.microsoft.com/).
- Create a new workspace by clicking on 'Workspaces' in the vertical menu on the left side of the screen. Use the 'New Workspace' button at the bottom of the list.
- Once you are inside your new workspace, select the Data Science experience using the button on the bottom left corner of the screen.
- At the top of the Data Science experience menu, check that you are still in the new workspace and select 'Import Notebook' from the top row of options.
- Follow the prompts to upload the 4 notebook (`.ipynb`) files contained within the resources folder.


## Success Criteria

To complete this challenge successfully, you should be able to:

- Verify that you have a storage account with the heart.csv data in a container
- Verify that you have a Fabric workspace where your 4 notebooks are available
- (Optional) Verify that your Azure ML workspace has deployed correctly (if completing Challenge 06)
45 changes: 45 additions & 0 deletions 072-DataScienceInFabric/Coach/Solution-01.md
@@ -0,0 +1,45 @@
# Challenge 01 - Bring your data to the OneLake - Coach's Guide

[< Previous Solution](./Solution-00.md) - **[Home](./README.md)** - [Next Solution >](./Solution-02.md)

## Notes & Guidance

In this challenge, hack participants must create a shortcut to the folder deployed in their Azure subscription in Challenge 0. This allows them to use the data in Fabric without the need for replication. Once the shortcut is created, participants will open Notebook 1 to load the csv file into a delta table for further modification in Notebook 2.

### Sections

1. Create a Lakehouse (non-notebook)
2. Create a Shortcut (non-notebook)
3. Read the .csv file into a dataframe in the notebook (Notebook 1)
4. Write the dataframe to the lakehouse as a delta table (Notebook 1)

### Student step-by-step instructions (creating a shortcut)
- Creating a Lakehouse:
  - Participants must create a lakehouse in the Fabric workspace they previously set up. In Fabric, navigate to the workspace.
  - At the top left of the screen, select New and then More options.
  - In the Data Engineering section, select Lakehouse. Give the lakehouse a unique name and click Create.

- Creating a Shortcut:
  - In the Lakehouse navigator, use the left-hand side menu and click the three dots (...) next to Files. Click "New shortcut".
  - In the shortcut wizard, select "Azure Data Lake Storage Gen2".
  - Go to your Azure portal. The URL can be found in the **Settings > Endpoints** side menu of the Storage Account. In this menu, you will see a variety of endpoint resource IDs and URLs. Find and copy the **Data Lake Storage URL** (it has the form `https://<storage-account-name>.dfs.core.windows.net/`) and enter it into the wizard in Fabric.
  - Create a new connection, give it a name, and select "Account Key" as the authentication kind.
  - Go back to your Azure portal. The Account Key can be found in the **Security + networking > Access keys** side menu of the Storage Account. Show and copy one of the keys and enter it into the wizard in Fabric.
  - Click Next to access the file explorer and wait for the screen to load.
  - In the side menu, expand the file-system folder. Select the check mark next to the "files" folder.
  - Click Next to move to the next screen, then click Create to create the shortcut.
  - Verify that your shortcut is showing under the **Files** folder of the Lakehouse navigator. If the shortcut is not present initially, you might need to click the three dots (...) and select Refresh.

### Overview of student directions (running Notebook 1)
- This section of the challenge is notebook-based. All the instructions and links required for participants to successfully complete this section can be found in Notebook 1 in the `student/resources.zip/notebooks` folder.
- To run the notebook, go to your Fabric workspace and select Notebook 1. Ensure that it is correctly attached to the lakehouse. You might need to connect the lakehouse you previously created using the file explorer menu on the left-hand side.
- The students must follow the instructions, leverage the documentation, and complete the code cells sequentially; a sketch of the completed core cells is shown below.
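
For reference, a minimal sketch of what the completed core cells might look like (`spark` is the session Fabric notebooks provide; the shortcut path and table name here are illustrative, so match them to your own lakehouse and Notebook 1):

```python
# Read heart.csv from the shortcut into a Spark DataFrame.
# "Files/files/heart.csv" assumes the shortcut folder is named "files"; adjust to match yours.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("Files/files/heart.csv")
)

# Write the DataFrame to the lakehouse as a Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("heart_failure_data")
```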

### Coaches' guidance
- This challenge has two main sections: creating a shortcut and loading the files into delta tables. The first section must be completed before working on Notebook 1.
- The full version of Notebook 1, with all code cells filled in, can be found for reference in the `coach/solutions` folder.
- The aim of this challenge, as noted in the student guide, is to understand lakehouses, shortcuts and the delta format.
- To assist students, coaches can clear up doubts regarding the Python syntax or how to get started with notebooks, but students should focus on learning how to set up shortcuts, navigate the Fabric UI and read/write to the delta lake.

## Success Criteria
- The heart.csv data is now saved as a delta table in the lakehouse
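
A quick way to verify this from a notebook cell (assuming the table was saved as `heart_failure_data`; substitute the name used in Notebook 1):

```python
# Confirm the Delta table exists and inspect its shape and schema.
df = spark.read.table("heart_failure_data")
print(f"Row count: {df.count()}")
df.printSchema()
```
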
62 changes: 62 additions & 0 deletions 072-DataScienceInFabric/Coach/Solution-02.md
@@ -0,0 +1,62 @@
# Challenge 02 - Data preparation with Data Wrangler - Coach's Guide

[< Previous Solution](./Solution-01.md) - **[Home](./README.md)** - [Next Solution >](./Solution-03.md)

## Notes & Guidance

In this challenge, hack participants must use Data Wrangler to prepare the heart dataset for model training. The focus is on transforming and preparing the data for the next challenges. Participants have the flexibility to either write code in a notebook or leverage Data Wrangler's intuitive interface to streamline the preprocessing tasks.

### Sections

1. Read the .csv file into a pandas dataframe in the notebook (Notebook 2)
2. Launch Data Wrangler and interact with the data cleaning operations (Notebook 2)
3. Apply the operations using the generated Python code (Notebook 2)
4. Perform feature engineering using Spark (Notebook 2)
5. Write the dataframe to the lakehouse as a delta table (Notebook 2)

### Student step-by-step instructions
- Launching Data Wrangler:
  - Participants must create a pandas dataframe in the Fabric notebook by completing the first cell in Notebook 2.
  - Once the cell has executed, under the notebook ribbon's Home tab, select Launch Data Wrangler. You'll see a list of active pandas DataFrames available for editing.
  - Select the DataFrame you just created to open it in Data Wrangler: from the pandas DataFrame list, select `df`.


- Data Cleaning Operations (Data Wrangler)
  - *Removing Unnecessary Columns*
    - On the **Operations** panel, expand **Schema** and select **Drop columns**.
    - Select `RowNumber`. This column will appear in red in the preview to show it will be changed by the generated code (in this case, dropped).
    - Select **Apply**; a new step is created in the **Cleaning steps** panel on the bottom left.

  - *Dropping Missing Values*
    - On the **Operations** panel, select **Find and replace**, and then select **Drop missing values**.
    - Select the `RestingBP`, `Cholesterol` and `FastingBS` columns. These are the columns flagged as having missing values in the panel on the right-hand side of the screen.
    - Select **Apply**; a new step is created in the **Cleaning steps** panel on the bottom left.

  - *Dropping Duplicate Rows*
    - On the **Operations** panel, select **Find and replace**, and then select **Drop duplicate rows**.
    - Select **Apply**; a new step is created in the **Cleaning steps** panel on the bottom left. (A pandas equivalent of these three cleaning steps is sketched below.)
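
The code Data Wrangler generates for these three steps is roughly equivalent to the pandas sketch below (the wrapper function name is illustrative; the generated code may differ in its details):

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop the unnecessary RowNumber column.
    df = df.drop(columns=["RowNumber"])
    # Drop rows with missing values in the affected columns.
    df = df.dropna(subset=["RestingBP", "Cholesterol", "FastingBS"])
    # Drop duplicate rows.
    df = df.drop_duplicates()
    return df

df_clean = clean_data(df)
```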

- Feature Engineering (Notebook)
  - This part is notebook-based. Participants will work in cells 09, 10, and 11 to transform categorical values into numerical labels.
  - You can also explore how to one-hot encode the categorical columns with Data Wrangler. However, this will not create labels in your existing columns, but rather a new column for each category with True and False values. Using this alternative format might require some modification to the code in the model training process. Please discuss this possibility with hack attendees to raise awareness of this Data Wrangler feature; both approaches are sketched below.
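
A hedged sketch of the two approaches (`df_clean` is the cleaned DataFrame from the previous step; the categorical column names are illustrative, so use the ones in the notebook):

```python
import pandas as pd

categorical_cols = ["Sex", "ChestPainType", "ExerciseAngina"]  # illustrative names

# Option 1: map each category to a numerical label (the approach used in cells 09-11).
df_labels = df_clean.copy()
for col in categorical_cols:
    df_labels[col] = df_labels[col].astype("category").cat.codes

# Option 2: one-hot encode instead, producing a True/False column per category.
# Downstream training code may need adjusting if you choose this route.
df_onehot = pd.get_dummies(df_clean, columns=categorical_cols)
```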

### Overview of student directions (running Notebook 2)
- This section of the challenge is notebook-based. All the instructions and links required for participants to successfully complete this section can be found in Notebook 2 in the `student/resources.zip/notebooks` folder.
- To run the notebook, go to your Fabric workspace and select Notebook 2. Ensure that it is correctly attached to the lakehouse. You might need to connect the lakehouse you previously created using the file explorer menu on the left-hand side.
- The students must follow the instructions, leverage the documentation, and complete the code cells sequentially; a sketch of the final write-back step is shown below.
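
For the final write-back step (Section 5 above), a minimal sketch of saving the prepared pandas DataFrame to the lakehouse (the table name `heart_failure_prepared` and the `df_labels` variable from the previous sketch are illustrative):

```python
# Convert the prepared pandas DataFrame to a Spark DataFrame and save it as a Delta table.
spark_df = spark.createDataFrame(df_labels)
spark_df.write.format("delta").mode("overwrite").saveAsTable("heart_failure_prepared")
```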

### Coaches' guidance

- This challenge has three main sections: Data Wrangler operations, feature engineering, and saving the processed data to a delta table.
- The full version of Notebook 2, with all code cells filled in, can be found for reference in the `coach/solutions` folder of this GitHub repository.
- The aim of this challenge, as noted in the student guide, is to understand data preparation using Data Wrangler and Fabric notebooks.
- To assist students, coaches can clear up doubts regarding Python syntax or how to get started with notebooks, but students should focus on learning how to operate Data Wrangler, navigate the Fabric UI, code in notebooks, and read/write to the delta lake.


## Success Criteria
- The heart dataset is fully shaped, cleaned, and prepared for model training.
- No duplicate rows or unnecessary columns.
- No missing values.
- No categorical values.
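
These criteria can be checked quickly in a notebook cell, for example (a sketch; `df` here stands for the final prepared pandas DataFrame):

```python
# No missing values, no duplicate rows, and no remaining categorical (object) columns.
assert df.isna().sum().sum() == 0, "missing values remain"
assert df.duplicated().sum() == 0, "duplicate rows remain"
assert (df.dtypes != "object").all(), "categorical columns remain"
```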

