Machine learning is a data driven process. Sharing key datasets publicly allows
smart minds around the world to perform novel analyses and generate new insights.
In this lab, we will use the 1000 Genomes Project
dataset - a collection of DNA sequence variations from over 1000 individuals.
We will apply unsupervised learning via the Amazon-provided K-Means algorithm to
group the geographic location of sequences based on their variant information.
- In your notebook instance, click the New button on the right and select Folder.
- Click the checkbox next to your new folder, click the Rename button above in the menu bar, and give the folder a name such as 'genomics-clustering'.
- Click the folder to enter it.
- To upload the notebook, click the Upload button on the right. Then in the file selection popup, select the file 'genome-kmeans-py3.ipynb' from the notebooks subdirectory in the folder on your computer where you downloaded this GitHub repository. Click the blue Upload button that appears to the right of the notebook's file name.
- You are now ready to begin the notebook: click the notebook's file name to open it.
- In the
bucket = '<your_s3_bucket_name_here>'
code line, paste the name of the S3 bucket you created in Module 1 to replace<your_s3_bucket_name_here>
. The code line should now read similar tobucket = 'smworkshop-john-smith'
. Do NOT paste the entire path (s3://.......), just the bucket name.