Contributors
- Raul Eulogio
UPDATE (3/1/2017): Here are the accompanying repos for the workshop!
- adultConsusIncome
- Contributors:
- mushroom
- Contributors:
- emailClassification
- Contributors:
- wineQuality
- Contributors:
Goal:
As we continue our journey through the world of data analysis and data science, we have been addressing the issue of learning Python for data analysis. This skill is important for our futures, but unfortunately has been absent from our curriculum, so we have decided to change that.
Obviously we can't teach you everything about Python in one three-hour coding session, but we can help you become proficient in data analysis, specifically with the modules we believe are important for anyone entering the field, many of which are comparable to the tools we use in our statistics classes with R.
Another important component of this workshop is learning how to iterate through a project. Working on your main project is great for team building and for learning data analysis, but we have found that iterating on a project (i.e. what we've been referring to as baby projects) within a set time frame is the best way to learn how to hack a project together.
Process:
For this workshop we will start you off by determining which data set you will use for your analysis. These data sets will require your team to use critical and necessary methods of data analysis in Python. This is the data wrangling aspect of data science: an integral, but often slept-on, part of data analysis, because many of us are given clean data sets that we take for granted (until today).
We will form teams of 4 (or you can work with the team you already have), and once those teams are formed we will hand every team a data set, along with a set of instructions that will guide you through the process of data analysis.
Team Sections:
For the teams, we will section off by level of experience with Python and data analysis. We will create three levels of experience, where each level has certain end goals with respect to your abilities as a hacker and as a team.
Here is how we'll section off:
- Beginner
- Intermediate
- Well-Versed
Here are some important packages that we feel will be necessary in your analysis (a quick import sketch follows the list):
numpy
pandas
sklearn
tensorflow
matplotlib
seaborn
urlopen (from the standard library's urllib.request module)
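To make the list concrete, here is a minimal sketch of how these packages are typically imported (the aliases like np and pd are just common convention, not a requirement):

```python
# Conventional imports for the packages listed above
# (assumes they are already installed, e.g. via pip)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from urllib.request import urlopen           # for fetching raw data files
from sklearn.model_selection import train_test_split
```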
At the start of the project, your team will need to brainstorm (by employing a stand-up) about the process, including the code and the teamwork necessary.
Many of you are still getting adjusted to Python, so you will have to work as a team and discuss each other's strengths and weaknesses.
My advice is to have your team write the code on one person's computer, while the other teammates use theirs to research Python packages and algorithms or work on the README files. Every teammate should be working on something at all times; we came here to learn, so any idleness is a waste of your time and mine (obviously, teams should take breaks when appropriate).
For this step you will create a GitHub repository with the following requirements:
- README detailing
- List of contributors
- Description of data set
- Along with any process taken to retrieve the data set
- Link to the source of data set
- The problem you are trying to solve (i.e. the instructions in the file for your given data set)
- Programming language along with relevant modules
- Setting up a virtual environment, as well as having a requirements.txt file (see the sketch after this list)
- An images folder holding any important exploratory analysis images from your analysis
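If you haven't set up a virtual environment before, the steps look roughly like this (a sketch assuming Python 3 on a Unix-like shell; the package list is just an example):

```
# Create and activate a virtual environment, then pin your dependencies
python3 -m venv venv
source venv/bin/activate        # on Windows: venv\Scripts\activate
pip install numpy pandas scikit-learn matplotlib seaborn
pip freeze > requirements.txt   # records versions so others can reproduce
```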
For this step you will work as a team to address the problem presented to you; teamwork is key here. We want you working together on how to approach the data wrangling process so that you can start the data analysis. This process has the following requirements (a rough script template follows the list):
- A script documenting the process of data wrangling (this file will be separate from your data analysis file)
- Can be a simple .py script or a Jupyter Notebook, your preference
- Must be well commented to show us what tf you did to go from the raw file to the file you will use for your analysis
- Detail the important steps you took, and format the script in a clear and concise way that makes it readable and reproducible
- Include a summary at the end detailing some of the issues you addressed (remember to always include the resources you used; sometimes it helps to include a resource in the actual script)
- Upload the raw file (as a csv, txt, or data file) to the GitHub repository, along with a description in the README file explaining the process taken during your data wrangling
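As a rough template (not a requirement), a wrangling script might look like the sketch below; the URL, column names, and file names are placeholders for whatever your assigned data set uses:

```python
# wrangling.py -- a minimal sketch of a data wrangling script
import pandas as pd
from urllib.request import urlopen

RAW_URL = "https://example.com/raw_data.csv"  # placeholder for your data set's source

# Fetch the raw file and load it into a DataFrame
raw = pd.read_csv(urlopen(RAW_URL), header=None)

# Give the columns meaningful names (replace with your data set's fields)
raw.columns = ["feature_1", "feature_2", "target"]

# Basic cleaning: drop duplicate rows and rows with missing values
clean = raw.drop_duplicates().dropna()

# Save the prepped file so the analysis script can pick it up
clean.to_csv("data_clean.csv", index=False)
```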
For this step, once you are done with the data wrangling aspect, we want you to output the cleaned-up file(s) and begin a separate script detailing the data analysis component of your project.
- We would like an explanation of the process taken, so some research is necessary on your part to know how to explain the method/algorithm your team will be using (see the sketch below)
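Again, only as a rough sketch: an analysis script might start like this, with logistic regression standing in for whatever method your team chooses, and the file and column names carried over as placeholders from the wrangling sketch above:

```python
# analysis.py -- a minimal sketch of an analysis script, picking up the
# cleaned file from the wrangling step (the model choice is just an example)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the file produced by the wrangling script
df = pd.read_csv("data_clean.csv")
X = df.drop("target", axis=1)   # "target" is a placeholder column name
y = df["target"]

# Hold out a test set so the evaluation is honest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a simple baseline model and report its accuracy
model = LogisticRegression()
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```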
For the final step, everything should be presentable (i.e. code is cleaned up, the README file is lit af, images are in the right directory, and files are named to give people context). You should end up with a GitHub repo that looks similar to those on the inertia7 GitHub account.
Obviously we can't go over everyone's repo to ensure the code runs, but we can do a brief overview of everyone's GitHub repository. This is why GitHub repos are important: they give context when someone can't run your code but wants to see your results, and they act as a medium of presentation until you start posting projects on inertia7 :)
So good luck, and happy coding!