#Algorithms, Summer 2016

##LEDE Program, Columbia University, Graduate School of Journalism

###Instructor:

Richard Dunks: rad2184 [at] columbia [dot] edu

####Room Number: Pulitzer Hall 601B

####Course Dates: 12 July - 25 August 2016



###Course Overview

This course presents an overview of algorithms as they relate to journalistic tradecraft, with particular emphasis on algorithms that relate to the discovery, cleaning, and analysis of data. This course intends to provide literacy in the common types of data algorithms, while providing practice in the design, development, and testing of algorithms to support news reporting and analysis, including the basic concepts of algorithm reverse engineering in support of investigative news reporting. The emphasis in this class will be on practical applications and critical awareness of the impact algorithms have in modern life.

######back to top

###Learning Objectives

  • You will understand the basic structure and operation of algorithms
  • You will be familiar with basic descriptive statistics
  • You will understand the primary types of data science algorithms, including techniques of supervised and unsupervised machine learning
  • You will be practiced in implementing basic algorithms in Python
  • You will be able to meaningfully explain and critique the use and operation of algorithms as tools of public policy and business
  • You will understand how algorithms are applied in the newsroom

######back to top

###Course Requirements

All students will be expected to have a laptop during both lectures and lab time. Time will be set aside to help install, configure, and run the programs necessary for all assignments, projects, and exercises. Where possible, all programs will be free and open-source. All assigned work using services hosted online can be run using free accounts.

######back to top

###Course Readings

The required readings for this course consist of book chapters, newspaper articles, and short blog posts. They are intended to give you a foundation in the critical skills ahead of class lectures. All required readings are available online or will be made available to you electronically. Recommended readings will also be provided, as appropriate, for those who want a more in-depth discussion of the topics covered in class. Readings assigned in class are to be completed before the next class.

######back to top

###Assignments

This course consists of programming and critical response assignments intended to reinforce learning and provide you with practical applications of the material covered in class. Completion of these assignments is critical to achieving the learning objectives of this course. Assignments are intended to be completed during lab time or for homework. Generally, assignments will be due before the start of the next class, unless otherwise stated. For example, assignments given on Tuesday will be due before class on Thursday. Time will be set aside in class to review assignments and provide feedback to you on your work.

  • Programming assignments will be submitted via GitHub. Please follow the tutorial for submitting assignments on GitHub. The exercises should be standalone for each assignment, not a combination of all assignments. This allows them to be tested and scored separately.
  • Programming assignments should be created and submitted in their own branch. See the tutorial for specific instructions on how to create a branch.
  • Programming assignments not following the naming convention <lastname>_<firstname>_<class_num>_<assignment_num>.ipynb will not be counted as completed.
  • Response questions should be clear, concise, and use the elements of good grammar. This is an opportunity to develop your ability to explain algorithms to your audience. You will receive further direction on how to submit these assignments.
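
The naming convention above can be checked before you submit. The following is a minimal sketch, not part of the course materials: the helper names `notebook_filename` and `follows_convention` are hypothetical, and the pattern assumes that names contain only letters, hyphens, or apostrophes and that the class and assignment numbers are plain digits.

```python
import re

# Hypothetical helper, not part of the course repository.
def notebook_filename(lastname, firstname, class_num, assignment_num):
    """Build <lastname>_<firstname>_<class_num>_<assignment_num>.ipynb."""
    return f"{lastname}_{firstname}_{class_num}_{assignment_num}.ipynb"

# Assumption: names are letters, hyphens, or apostrophes; numbers are digits only.
FILENAME_PATTERN = re.compile(r"^[A-Za-z'-]+_[A-Za-z'-]+_\d+_\d+\.ipynb$")

def follows_convention(filename):
    """Return True if the filename matches the assumed pattern."""
    return FILENAME_PATTERN.match(filename) is not None

# Example with made-up values: class 1, assignment 2.
example = notebook_filename("Dunks", "Richard", 1, 2)
print(example, follows_convention(example))
```

Whether the names should be lowercase is not stated in the convention, so the sketch accepts either case.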

######back to top

###Class Format

Class runs from 10am to 1pm Tuesday and Thursday. Lab time will be from 2pm to 5pm Tuesday and Thursday. The class will be broken up into two blocks of approximately 85 minutes each, with a 10-minute break between each block. Class will be a mix of lecture and practical exercise work, emphasizing the application of skills covered in the lecture portion of the class. Lab time is intended for the completion of exercises, but may also include guided learning sessions as necessary to ensure comprehension of the course material.

######back to top

###Course Policies

  • Attendance and Tardiness: I expect you to attend every class, arriving on time and staying for the entire duration of class. Absences will only be excused for circumstances coordinated in advance and you are responsible for making up any missed work.
  • Participation: I expect you to be fully engaged while you’re in class. This means asking questions when necessary, engaging in class discussions, participating in class exercises, and completing all assigned work. Learning will occur in this class only when you actively use the tools, techniques, and skills described in the lectures. I will provide you ample time and resources to accomplish the goals of this course and expect you to take full advantage of what’s offered.
  • Late Assignments: All assignments are to be submitted before the start of class. Assignments posted by the end of the day after class will be marked down 10%, and assignments posted by the end of the second day after class will be marked down 20%. No assignments will be accepted for a grade more than three days after class.
  • Office Hours: I won’t be holding regular office hours, but am available via email and Slack to answer whatever questions you may have about the material and to arrange a time to meet. Please feel free to also reach out to the Teaching Assistants as necessary for support and guidance with the exercises, particularly during lab time.

######back to top


###Resources

####Technical

####(Some) Open Data Sources

####Visualizations

####Data Journalism and Critiques

####Suggested Reading

Conway, Drew and John Myles White. Machine Learning for Hackers. O'Reilly Media, Inc., 2012.

Knuth, Donald E. The Art of Computer Programming. Addison-Wesley Professional, 2011.

MacCormick, John. Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers. Princeton University Press, 2011.

McCallum, Q Ethan. Bad Data Handbook. O'Reilly Media, Inc., 2012.

McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.

O'Neil, Cathy and Rachel Schutt. Doing Data Science: Straight Talk from the Front Line. O'Reilly Media, Inc., 2013.

Russell, Matthew A. Mining the Social Web. O'Reilly Media, Inc., 2013.

Sedgewick, Robert and Kevin Wayne. Algorithms. Addison-Wesley Professional, 2011.

Steiner, Christopher. Automate This: How Algorithms Came to Rule Our World. Penguin Group, 2012.

######back to top


###Course Outline (Subject to change)

####Note: Readings and homework are due the next class unless otherwise noted

####Week 1: Introduction to Algorithms/Python Review

#####Class 1: Overview of algorithms and data structures in Python

######Topics

  • Course policies and expectations
  • What is an algorithm?
  • Review data structures in Python
  • Overview of GitHub in class

######Readings

######Homework

  0. Fork the repository. Clone the repository onto your local machine (git clone). Commit a brief biography of yourself as assignment 0 (text file). Do a pull request to submit.

  1. Write a function that takes in a list of numbers and outputs the mean of the numbers using the formula for mean. Do this without any built-in functions like sum(), len(), and, of course, mean().
  2. Create your own version of the Mayoral Excuse Machine in Python that takes in a name and location, selects an excuse at random, and prints it (“Sorry, Richard, I was late to City Hall to meet you, I had a very rough night and woke up sluggish”). Use the “excuses.csv” in the GitHub repository. Extra credit if you print the link to the story as well.
  3. Modify the code (in the repository) that prints every prime number between 1 and 100 to only print every other prime number. Extra credit if you can modify the code to speed it up.
  4. The code in Exercise4.ipynb is meant to search for New York Times articles on gay marriage and look at the mean and median word count, but the code has some problems. Follow the instructions in the notebook to fix the code and submit your fixed code.
  5. React to today's class in a short paragraph and email it to me. Include what you learned today and what topics you look forward to in the class. Include any additional information you feel I should know about yourself and your experience in this program so far.
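
As a rough illustration of the approach exercise 1 asks for, here is one possible sketch of a mean computed with a running total and a running count instead of sum() and len(). The function name and the choice to raise an error on an empty list are assumptions, not requirements from the assignment.

```python
def mean(numbers):
    """Return the arithmetic mean without using sum(), len(), or a library mean()."""
    total = 0
    count = 0
    for value in numbers:
        total += value   # accumulate the sum by hand
        count += 1       # count the items by hand
    if count == 0:
        # Assumption: an empty list is treated as an error rather than returning 0.
        raise ValueError("mean of an empty list is undefined")
    return total / count

print(mean([1, 2, 3, 4]))
```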

#####Class 2: Python data structures and control statements

######Topics

  • Git in detail
  • Review of Sets in Python
  • Control flow in Python

######Readings

######Homework

  • None

####Week 2: Analysis of Algorithms/Introduction to Statistics

#####Class 3: Branching in GitHub/Analysis of Algorithms

######Topics

  • Branching with Git
  • Control flow in Python
  • Designing algorithms with pseudocode
  • Estimating algorithm complexity

######Readings

######Homework

  1. Implement in Python the sorting algorithm you came up with in pseudocode. Test it with lists of 10, 100, and 1000 random numbers, time each run with %time, and submit a comparison of the results in code comments.
  2. Implement in Python the search algorithm you came up with in pseudocode. Test it with lists of 10, 100, and 1000 random numbers (sorted with your sorting algorithm), time each run with %time, and submit a comparison of the results in code comments.
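
The timing comparison can be set up along the following lines. This is only a sketch under stated assumptions: %time is an IPython magic that works inside a notebook, so the sketch uses time.perf_counter from the standard library as a stand-in; the helper name `time_sort` is hypothetical; and Python's built-in sorted is a placeholder for your own sorting function, which is assumed to return a new sorted list.

```python
import random
import time

# Hypothetical helper: times one call of sort_function on a random list.
def time_sort(sort_function, size, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    data = [rng.randint(0, 1_000_000) for _ in range(size)]
    start = time.perf_counter()
    result = sort_function(data)
    elapsed = time.perf_counter() - start
    # Check the output against Python's own sort before trusting the timing.
    if result != sorted(data):
        raise ValueError("sort_function did not return a sorted list")
    return elapsed

# Placeholder: swap `sorted` for your own implementation.
for size in (10, 100, 1000):
    print(size, time_sort(sorted, size))
```

Timings for lists this small vary from run to run, so repeating each measurement a few times gives a steadier comparison.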

#####Class 4: Statistics Review

######Topics

  • Descriptive statistics
  • Exploratory data analysis
  • Descriptive statistics in Python with pandas
  • Statistical correlation

######Readings

######Homework

  1. Perform a basic statistical analysis of how long DOT 311 complaints (the table is called dot_311) stay open (subtract the created date from the closed date). Connect to the database to get the data and do the analysis. Submit the code through GitHub and type up your results in your PR.
  2. Using the 2013_NYC_CD_MedianIncome_Recycle.xlsx file, calculate the correlation between the recycling rate and the median income. Discuss your findings in your PR.
  3. Using the heights_weights_genders.csv file, analyze the difference between the height-weight correlation in women and in men.
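
For reference, the correlation asked for in exercises 2 and 3 is usually the Pearson correlation coefficient. The sketch below computes it in plain Python so the formula is visible. The function name is hypothetical and the sample numbers are made up for illustration; they are not taken from the course data files. In practice pandas provides this through Series.corr.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, not course data.
heights = [60, 62, 65, 68, 70]
weights = [110, 120, 135, 150, 160]
print(pearson_correlation(heights, weights))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates a weak one. Computing it separately for each group is one way to compare women and men.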

####Week 3: Supervised Learning - Linear Regression and Decision Trees

#####Class 5: Linear Regression

######Topics

  • Coefficient of determination
  • Statistical significance
  • Linear regression

######Readings

  • TBA

######Homework

  • TBA

#####Class 6: Decision Trees

######Topics

  • Introduction to Machine Learning
  • Decision trees
  • Training, test, and validation
  • Supervised Learning

######Readings

  • TBA

######Homework

  • TBA

####Week 4: Supervised Learning - Random Forest and Naive Bayes

#####Class 7:

######Topics

  • Feature engineering
  • Cross validation
  • Boosting
  • Bagging

######Readings

  • TBA

######Homework

  • TBA

#####Class 8: Naive Bayes

######Topics

  • Conditional probabilities
  • Naive Bayes

######Readings

  • TBA

######Homework

  • TBA

####Week 5: Supervised Learning - Random Forests, kNN, and Neural Networks

#####Class 9: Random Forest and Ensemble Methods

######Topics

  • Ensemble Methods
  • Random Forests

######Readings

  • TBA

######Homework

  • TBA

#####Class 10: kNN and Neural Networks

######Topics

  • Calculating distance
  • k-Nearest Neighbor (kNN)
  • Neural Networks

######Readings

  • TBA

######Homework

  • TBA

####Week 6: Unsupervised Learning - Clustering and Natural Language Processing

#####Class 11: Clustering

######Topics

  • Unsupervised learning
  • Clustering

######Readings

  • TBA

######Homework

  • TBA

#####Class 12: Natural Language Processing

######Topics

  • Working with text
  • Topic modeling
  • Recurrent neural networks

######Readings

  • TBA

######Homework

  1. Working in groups or as individuals, create a short (5 min) presentation based around an article or other piece of research that you find significant and deliver it to the class

####Week 7: Algorithms in Everyday Life/Open Lab

#####Class 13: Ethics of Algorithms

######Topics

  • Ethics of algorithms

######Readings

  • TBA

######Homework

  • TBA

#####Class 14: Advanced Topics/Open Lab

######Topics

  • TBA

######back to top