Computational-Social-Science-Labs

This repo contains all of the materials for Sociology 273, Computational Social Science Parts A/B, designed as part of Berkeley's Computational Social Science Training Program. The course is a rigorous, yearlong introduction to computational social science. The target audience is PhD students in their second year and beyond who have completed their home departments' introductory statistics courses. We cover topics spanning reproducibility and collaboration, machine learning, natural language processing, and causal inference. The course has a strong applied focus, with emphasis placed on doing computational social science. It makes extensive use of simulations, functional programming, and visualizations to illustrate statistical concepts and to demonstrate how "computational social science" is a framework for thinking about how to analyze big data. By the end of the course, students will be well acquainted with some of the latest research and advances in computational social science, and will begin working on their own projects.

Most modules contain both a "student" version and a "solutions" version. These are substantially the same; the difference is that the student versions leave some lines of code partially blank for in-class challenges. Each project is designed for groups of 3-4 students who use GitHub to collaborate and version-control their code. Several popular data science libraries are used frequently, including sklearn, numpy, pandas, spaCy, gensim, tidyverse, tidymodels, and SuperLearner. For the most part, the latest versions of these libraries should work, with exceptions noted in the notebooks as necessary.
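Because a few notebooks note version exceptions, it can help to see which versions of the Python libraries listed above are installed before starting a module. The following is a minimal sketch, not part of the course materials: the function name `installed_versions` and the `COURSE_PACKAGES` list are hypothetical, and the list assumes the usual PyPI distribution names (for example, `scikit-learn` for sklearn). The R packages are not covered here.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the Python libraries named above.
COURSE_PACKAGES = ["scikit-learn", "numpy", "pandas", "spacy", "gensim"]


def installed_versions(packages=COURSE_PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running the script inside the activated Anaconda environment prints one line per library, which can then be compared against any version notes in a given notebook.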

Table of Contents

  1. Setup Anaconda Installation
  2. Reproducible Data Science and Introduction
    • a. Command Line Intro:
      • Introduction to using a command line interface (CLI) to interact with a computer
      • Basics of navigating the file directory, editing text, and running shell/Python scripts
    • b. GitHub Intro:
      • Introduction to git, version control, and GitHub.
      • Best practices for using version control to track code changes and collaborate with others without running into conflicts, and for using GitHub to showcase a portfolio and find open source software/code
    • c. Statistics Refresher
    • d. Project 1:
      • Use command line and GitHub to create a group repo and practice with version control and branching.
      • Create a personal website using GitHub Pages.
  3. Fundamentals of Machine Learning
    • a. Math Review:
      • Matrix multiplication
      • Derivatives
      • Integrals
      • numpy/scipy
    • b. Bias-Variance Tradeoff and Data Splitting:
      • Introduction to train/validation/test splits and cross-validation for machine learning
      • Bias-variance tradeoff
      • Confusion matrices
    • c. Regression:
      • Ordinary Least Squares
      • Regularization via Ridge/LASSO
      • Coefficient plots
      • Hyperparameter tuning
    • d. Project 2:
      • Predict county-level diabetes rates
      • Exploratory data analysis, data cleaning and preparation, hyperparameter tuning, feature selection, model validation
  4. Supervised Machine Learning
    • a. Classification:
      • Imbalanced class labels
      • Logistic regression, decision tree classifier, support vector machine
      • Hyperparameter tuning
      • Metrics (accuracy, recall, precision, AUC-ROC)
    • b. Trees and Ensembles:
      • Decision tree, random forest, AdaBoost
      • Variable importance plot
    • c. Neural Networks:
      • Multi-layer perceptron
      • Keras and TensorFlow
      • Convolutional neural network
    • d. Project 3:
      • Predict health code violations in Chicago restaurants.
      • Data preprocessing, classification models, interpretable and explainable machine learning, prediction policy problems
  5. Unsupervised Machine Learning and AutoML
    • a. Clustering and PCA:
      • Principal components analysis
      • Clustering (k-means, spectral, etc.)
      • Unsupervised learning outputs as inputs to supervised learning
    • b. TPOT:
      • TPOT genetic programming to automatically search for a machine learning pipeline covering preprocessing, unsupervised learning, and classification/regression
    • c. Project 4:
      • Unsupervised learning and neural network classification on National Health and Nutrition Examination Survey (NHANES)
      • Difference between dimensionality reduction and clustering
      • Combining dimensionality reduction and clustering
      • Deep learning with one hidden layer
  6. Natural Language Processing
    • a. Text Preprocessing:
      • Tokenization
      • Stop words
      • Entity recognition
      • Lemmatization
      • Bag of words/term frequency-inverse document frequency
      • Naive Bayes
      • spaCy
    • b. Exploratory Data Analysis and Unsupervised Methods:
      • Word clouds
      • Sentiment polarity
      • Topic modeling
    • c. Text Feature Engineering and Classification:
      • N-grams
      • Word counts
      • Topic model proportions as input to classification
      • Combining text and non-text features
    • d. word2vec:
      • Word embeddings
      • t-SNE
      • doc2vec
      • Document average word embeddings
      • Pre-trained embeddings using gensim
    • e. Project 5:
      • Investigate asymmetric polarization and moderation/extremism in U.S. Congress tweets.
      • Text preprocessing, exploratory data analysis, text feature engineering, classification
  7. Causal Inference
    • a. R Refresher:
      • Introduction to R
      • dplyr, tidyr, ggplot2, purrr
    • b. Randomized Experiments:
      • Average Treatment Effect (ATE)
      • Individual-level Treatment Effect (ITE)
      • Average Treatment Effect on the Treated (ATT)
      • Heterogeneous Treatment Effects
      • Randomization Designs (complete, cluster, block)
      • Statistical tests of difference
    • c. Matching:
      • Propensity score matching
      • Full/optimal/greedy matching
      • Mahalanobis distance
      • Double robust estimators
    • d. Project 6:
      • Replicate studies examining the effect of college attendance on political participation.
      • Preprocessing, matching after a randomized study to improve covariate balance, simulations to examine how different matching configurations affect ATE estimates
    • e. Regression Discontinuity:
      • Regression discontinuity
      • Running variable
      • McCrary density test
      • Sharp discontinuity
      • Bandwidth selection via Imbens-Kalyanaraman and cross-validation
    • f. Instrumental Variables:
      • Directed Acyclic Graphs (DAGs)
      • Exclusion restriction
      • Colliders
      • Two-Stage Least Squares (2SLS)
    • g. Diff-in-Diffs and Synthetic Control:
      • Difference-in-differences method
      • Parallel trends assumption
      • Synthetic control
      • Augmented synthetic control with Ridge regularization
      • Staggered adoption synthetic control
    • h. Project 7:
      • Diff-in-diffs and synthetic control to analyze the effect of Affordable Care Act (ACA) Medicaid expansion among adopting states over time.
    • i. Sensitivity Analysis:
      • Manski bounds
      • Rosenbaum sensitivity analysis
      • E-values
    • j. SuperLearner and Longitudinal Targeted Maximum Likelihood Estimation (LTMLE):
      • Ensemble machine learning for causal inference
      • Parallelization in R
      • Targeted learning
      • Double robust estimators
      • Time-dependent confounding
    • k. Project 8:
      • Effect of blood pressure medication on heart disease using SuperLearner, TMLE, and LTMLE.