Big_Data_Analytics.ipynb.orig

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "rn-_KJ3eqxVy"
   },
   "source": [
    "# The Woof Factor in Zürich\n",
    "---\n",
    "\n",
    "Through use of data freely available from *Stadt Zürich Open Data*, we analyse dogs registered in Zurich and combine this data with other information available about the districts (kreis) that make up Zurich. A model is developed in which we ...\n",
    "\n",
    "*Project Group 23:* Corey Bothwell, Nicoletta Farabullini, Jacob Gelling, Andris Prokofjevs, Qasim Warraich"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HYonn-cUpjzv"
   },
   "source": [
    "# Libraries\n",
    "---\n",
    "\n",
    "The required packages must first be installed. On Linux, this requires the system to have Curl (for communicating with the Yandex Translate API) and GDAL (for generating the map visualisations) installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 625
    },
    "colab_type": "code",
    "id": "8uvzQkZ1pf5R",
    "outputId": "0102d1c9-c4a0-4188-cef8-abae10a9b2d0",
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "install.packages(\"readr\")\n",
    "install.packages(\"data.table\")\n",
    "install.packages(\"RYandexTranslate\")\n",
    "install.packages(\"plyr\")\n",
    "install.packages(\"dplyr\")\n",
    "install.packages(\"stringr\")\n",
    "install.packages(\"class\")\n",
    "install.packages(\"e1071\")\n",
    "install.packages(\"ggplot2\")\n",
    "install.packages(\"reshape2\")\n",
    "install.packages(\"leaflet\")\n",
    "install.packages(\"rgdal\")\n",
    "install.packages(\"stringi\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Libraries used throughout the analysis are imported here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(readr)\n",
    "library(data.table)\n",
    "library(RYandexTranslate)\n",
    "library(plyr)\n",
    "library(dplyr)\n",
    "library(stringr)\n",
    "library(class)\n",
    "library(e1071)\n",
    "library(ggplot2)\n",
    "library(reshape2)\n",
    "library(leaflet)\n",
    "library(rgdal)\n",
    "library(stringi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "fxWlXtTDj5EE"
   },
   "source": [
    "# Data Cleaning\n",
    "---\n",
    "\n",
    "This section imports the raw data from CSV files into R data tables, then merges and cleans the data as appropriate.\n",
    "\n",
    "Some column content is also translated automatically from German to English, as doing this by hand would be too laborious."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "NDSSGT4R4aca"
   },
   "source": [
    "Running the data cleaning each time the notebook is run during development can be slow due to the Yandex Translate API. Instead, an R data image of the already cleaned data, provided with the notebook file, can be loaded by running the following block."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "DYmD8dDM4QoG"
   },
   "outputs": [],
   "source": [
    "load(\"dogs.Rdata\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "5BqJj3pYm1jr"
   },
   "source": [
    "## Import Dog Data\n",
    "\n",
    "The dog registration data for 2020, for which the source is available at [Stadt Zürich Open Data](https://data.stadt-zuerich.ch/dataset/sid_stapo_hundebestand/resource/a05e2101-7997-4bb5-bed8-c5a61cfffdcf), is imported and converted to an R data table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 510
    },
    "colab_type": "code",
    "id": "ynxU6g3GnNLJ",
    "outputId": "85b79030-11ba-41ef-83f2-2b51ad2d62db"
   },
   "outputs": [],
   "source": [
    "dogs2020 <- data.table(read_csv(\"data_sources/20200306_hundehalter.csv\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data cleaning is now performed. A dog with a district value of 8 is removed, as this stadtquartier does not exist and the value is likely a typographical error.\n",
    "\n",
    "Columns we are not interested in and incomplete rows are also removed, although for the 2020 dog dataset we find that 0 rows are omitted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One dog has district \"8\", likely a typographical error, so we remove it.\n",
    "dogs2020 <- dogs2020[(dogs2020$STADTQUARTIER!=8), ]\n",
    "\n",
    "# Remove unnecessary columns\n",
    "dogs2020[, c(\"RASSE1_MISCHLING\", \"RASSE2\", \"RASSE2_MISCHLING\"):=NULL]\n",
    "\n",
    "# If a row has an NA entry in any of its cells, remove the entire row\n",
    "dogs2020 <- na.omit(dogs2020)\n",
    "# 0 rows omitted; the code is left in place in case the underlying data changes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As our team is international, we translate the column names from German to English."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename columns\n",
    "setnames(dogs2020,\n",
    "         old = c(\"HALTER_ID\", \"ALTER\", \"GESCHLECHT\", \"STADTQUARTIER\", \"STADTKREIS\",   \"RASSE1\", \"RASSENTYP\",  \"GEBURTSJAHR_HUND\", \"GESCHLECHT_HUND\", \"HUNDEFARBE\"),\n",
    "         new = c(\"OWNER_ID\",  \"AGE\",   \"SEX\",        \"DISTRICT\",      \"DISTRICT_BIG\", \"BREED\",  \"BREED_TYPE\", \"YOB_DOG\",          \"SEX_DOG\",         \"COLOR_DOG\")\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because stadtquartier is a more granular district segmentation than stadtkreis, and not all datasets can be merged by stadtkreis, we merge on stadtquartier instead. The data is hard to interpret on a map without human-readable district names, so we extract these names from the wealth dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import district names\n",
    "district_names <- data.table(read_csv(\"data_sources/wir100od1004.csv\"))\n",
    "# Keep one row per unique (district number, district name) pair\n",
    "district_names <- unique(district_names[, .(QuarSort, QuarLang)])\n",
    "\n",
    "# Rename columns\n",
    "setnames(district_names,\n",
    "         old = c(\"QuarSort\", \"QuarLang\"),\n",
    "         new = c(\"DISTRICT\", \"DISTRICT_NAME\")\n",
    "        )\n",
    "\n",
    "dogs2020 <- merge(dogs2020, district_names , by = \"DISTRICT\", all.x = T)\n",
    "\n",
    "# Remove unused variable\n",
    "rm(district_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "qMHV7jrUm-E9"
   },
   "source": [
    "## Import Wealth Data\n",
    "\n",
    "Next the wealth data, for which the source is available at [Stadt Zürich Open Data](https://data.stadt-zuerich.ch/dataset/fd_median_vermoegen_quartier_od1004), is imported and converted to an R data table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 221
    },
    "colab_type": "code",
    "id": "mx1asHDGnDuh",
    "outputId": "ba02bb01-58c5-4962-fa24-240d9a2304f1"
   },
   "outputs": [],
   "source": [
    "wealth <- data.table(read_csv(\"data_sources/wir100od1004.csv\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The latest data from this source is for 2017, so we select only this year, store the median wealth in a new column and aggregate the data by district."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We don't have data for 2020; the most recent data is for 2017\n",
    "wealth <- wealth[SteuerJahr == 2017,]\n",
    "\n",
    "# Create a new column to store the median (50th percentile) wealth\n",
    "wealth[, wealth50 := SteuerVermoegen_p50]\n",
    "\n",
    "# Halve the family values (SteuerTarifSort == 1) in this column for normalization\n",
    "wealth$wealth50[wealth$SteuerTarifSort == 1] <- wealth$wealth50[wealth$SteuerTarifSort == 1]/2\n",
    "\n",
    "# Aggregate over family status:\n",
    "# take the mean of wealth50 (ignoring NA), grouped by district (QuarSort)\n",
    "wealth_merge <- wealth[,mean(wealth50, na.rm = T), by=QuarSort]\n",
    "\n",
    "# Keep only districts that have dogs in them\n",
    "wealth_merge <- wealth_merge[wealth_merge$QuarSort %in% dogs2020$DISTRICT]"
   ]
  },
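  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the normalise-then-aggregate step above, the following cell applies the same idea to a small made-up table. The table `toy`, its values and the column name `p50` are hypothetical and chosen only for illustration; they are not the real tax data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(data.table)\n",
    "\n",
    "# Hypothetical toy data: two districts, each with a non-family (0) and a family (1) row\n",
    "toy <- data.table(\n",
    "  QuarSort        = c(11, 11, 12, 12),\n",
    "  SteuerTarifSort = c(0, 1, 0, 1),\n",
    "  p50             = c(40, 100, 60, 140)\n",
    ")\n",
    "\n",
    "# Halve the family values so that they are comparable with the other rows\n",
    "toy[SteuerTarifSort == 1, p50 := p50 / 2]\n",
    "\n",
    "# Mean of the normalised medians per district\n",
    "toy_merge <- toy[, .(MEAN_P50 = mean(p50, na.rm = TRUE)), by = QuarSort]\n",
    "print(toy_merge)"
   ]
  },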
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We again translate the column names from German to English and then merge the data with the dogs data table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename columns\n",
    "setnames(wealth_merge,\n",
    "         old = c(\"V1\",           \"QuarSort\"),\n",
    "         new = c(\"WEALTH_T_CHF\", \"DISTRICT\")\n",
    "        )\n",
    "\n",
    "# Merge wealth into dogs dataset\n",
    "dogs2020 <- merge(dogs2020, wealth_merge, by = \"DISTRICT\", all.x = T)\n",
    "# ATTENTION! FYI: not all districts have wealth values! \n",
    "\n",
    "# Remove unused variables\n",
    "rm(wealth, wealth_merge)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "e9zuqYq6nW-J"
   },
   "source": [
    "## Import Income Data\n",
    "\n",
    "Next the income data, for which the source is available at [Stadt Zürich Open Data](https://data.stadt-zuerich.ch/dataset/fd_median_einkommen_quartier_od1003), is imported and converted to an R data table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 221
    },
    "colab_type": "code",
    "id": "KGQ9qrWsnZoK",
    "outputId": "d36d841c-4300-4d53-b62c-d56ffa5acd7a"
   },
   "outputs": [],
   "source": [
    "income <- data.table(read_csv(\"data_sources/wir100od1003.csv\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, the latest data from this source is for 2017, so we select only this year, store the median income in a new column and aggregate the data by district."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We don't have data for 2020; the most recent data is for 2017\n",
    "income <- income[SteuerJahr == 2017,]\n",
    "\n",
    "# Create a new column to store the median (50th percentile) income\n",
    "income[, incomep50 := SteuerEInkommen_p50]\n",
    "\n",
    "# Halve the family values (SteuerTarifSort == 1) in this column for normalization\n",
    "income$incomep50[income$SteuerTarifSort == 1] <- income$incomep50[income$SteuerTarifSort == 1]/2\n",
    "\n",
    "# Aggregate over family status:\n",
    "# take the mean of incomep50 (ignoring NA), grouped by district (QuarSort)\n",
    "income_merge <- income[,mean(incomep50, na.rm = T), by=QuarSort]\n",
    "\n",
    "# Keep only districts that have dogs in them\n",
    "income_merge <- income_merge[income_merge$QuarSort %in% dogs2020$DISTRICT]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We again translate the column names from German to English and then merge the data with the dogs data table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename columns\n",
    "setnames(income_merge,\n",
    "         old = c(\"V1\",           \"QuarSort\"),\n",
    "         new = c(\"INCOME_T_CHF\", \"DISTRICT\")\n",
    "        )\n",
    "\n",
    "# Merge income into dogs dataset\n",
    "dogs2020 <- merge(dogs2020, income_merge, by = \"DISTRICT\", all.x = T)\n",
    "# ATTENTION! FYI: not all districts have income values! \n",
    "\n",
    "# Remove unused variables\n",
    "rm(income, income_merge)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "I24YVJ_VncAY"
   },
   "source": [
    "## Import Education Data\n",
    "\n",
    "The import of education data is straightforward, requiring only a long to wide reshape of the data. The data source is available at [Stadt Zürich Open Data](https://data.stadt-zuerich.ch/dataset/bfs_bev_bildungsstand_statquartier_seit1970_od1012)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 187
    },
    "colab_type": "code",
    "id": "4NssSsYpneo0",
    "outputId": "105009b9-96f9-4f5a-b71c-18c1cd01ad35"
   },
   "outputs": [],
   "source": [
    "education <- data.table(read_csv(\"data_sources/bil101od1012 (2).csv\"))\n",
    "\n",
    "# Long to wide education reshape\n",
    "education <- dcast(education, RaumSort ~ Bildungsstand, value.var = \"AntBev\")\n",
    "\n",
    "# Rename columns\n",
    "setnames(education,\n",
    "         old = c(\"RaumSort\", \"Obligatorische Schule\",   \"Sekundarstufe II\",     \"Tertiärstufe\"),\n",
    "         new = c(\"DISTRICT\", \"BASIC_SCHOOL_PERCENTAGE\", \"GYMNASIUM_PERCENTAGE\", \"UNIVERSITY_PERCENTAGE\")\n",
    "        )\n",
    "\n",
    "# Merge education into dogs dataset\n",
    "dogs2020 <- merge(dogs2020, education, by = \"DISTRICT\", all.x = T)\n",
    "\n",
    "# Remove unused variable\n",
    "rm(education)"
   ]
  },
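  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the long-to-wide reshape used above, here is a minimal sketch on a made-up table. The table `toy_long` and its values (including the category labels `A` and `B`) are hypothetical; only the column roles mirror the education data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(data.table)\n",
    "\n",
    "# Hypothetical long table: one row per (district, education level)\n",
    "toy_long <- data.table(\n",
    "  RaumSort      = c(11, 11, 12, 12),\n",
    "  Bildungsstand = c(\"A\", \"B\", \"A\", \"B\"),\n",
    "  AntBev        = c(30, 70, 45, 55)\n",
    ")\n",
    "\n",
    "# Wide table: one row per district, one column per education level\n",
    "toy_wide <- data.table::dcast(toy_long, RaumSort ~ Bildungsstand, value.var = \"AntBev\")\n",
    "print(toy_wide)"
   ]
  },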
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "aHUmAB4pnjK5"
   },
   "source": [
    "## Import Home Type Data\n",
    "\n",
    "Home type data, such as the number of residential buildings in a district, is now imported. The data source is available at [Stadt Zürich Open Data](https://data.stadt-zuerich.ch/dataset/bau_best_geb_whg_bev_gebaeudeart_quartier_seit2008/resource/3850add1-264c-4993-98cd-d8a9ba87ee25). 2019 is the latest data available. Each building type is summed and a long to wide reshape of the data is performed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 374
    },
    "colab_type": "code",
    "id": "C80_N4ZgnmDm",
    "outputId": "45d43f21-1696-4bca-f310-6e27c77ed11b"
   },
   "outputs": [],
   "source": [
    "home_type <- data.table(read_csv(\"data_sources/bau_best_geb_whg_bev_gebaeudeart_quartier_seit2008.csv\")) \n",
    "\n",
    "# We don't have data for 2020; the most recent data is for 2019\n",
    "home_type <- home_type[Jahr == 2019,]\n",
    "\n",
    "# Sum the number of buildings of each type per district\n",
    "home_type <- home_type[, sum(AnzGeb), by = list(QuarSort,GbdArtPubName)]\n",
    "\n",
    "# Rename columns\n",
    "setnames(home_type,\n",
    "         old = c(\"QuarSort\", \"GbdArtPubName\", \"V1\"), \n",
    "         new = c(\"DISTRICT\", \"Hometype\",      \"Number_homes\")\n",
    "        )\n",
    "\n",
    "# Long to wide hometype reshape\n",
    "home_type <- data.table::dcast(home_type, DISTRICT ~ Hometype, value.var = \"Number_homes\")\n",
    "\n",
    "# Translate hometype columns\n",
    "setnames(home_type,\n",
    "         old = c(\"Produktions- und Lagergebäude\", \"Mehrfamilienhäuser\", \"Einfamilienhäuser\",   \"Infrastrukturgebäude\",     \"Kleingebäude\",    \"Kommerzielle Gebäude\", \"Spezielle Wohngebäude\"), \n",
    "         new = c(\"FACTORIES_AND_WAREHOUSES\",      \"APARTMENTS\",         \"SINGLE_FAMILY_HOMES\", \"INFRASTRUCTURE_BUILDINGS\", \"SMALL_BUILDINGS\", \"COMMERCIAL_BUILDINGS\", \"SPECIAL_ACCOMODATION\")\n",
    "        )\n",
    "\n",
    "# Remove unnecessary column\n",
    "home_type[, Unbekannt:=NULL]\n",
    "\n",
    "# Merge home types into dogs dataset\n",
    "dogs2020 <- merge(dogs2020, home_type, by = \"DISTRICT\", all.x = T)\n",
    "\n",
    "# Remove unused variable\n",
    "rm(home_type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "55CKOQRFnrCr"
   },
   "source": [
    "## Import Population Data\n",
    "\n",
    "Next population data is imported. The data source is available at [Stadt Zürich Open Data](https://www.stadt-zuerich.ch/prd/de/index/statistik/themen/bevoelkerung/bevoelkerungsentwicklung/kreise-und-quartiere.html#daten). The source CSV file is a little messy, so some additional data cleaning is required. When importing, the first 8 lines are skipped so that the parser reaches the CSV header. As one of the columns is unlabelled, a warning is produced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "colab_type": "code",
    "id": "RLQ9mG4vntd5",
    "outputId": "30696c37-6fa7-47ea-d850-ef27c42ab3b4"
   },
   "outputs": [],
   "source": [
    "# A warning is shown due to the formatting of the source CSV header; it can be ignored\n",
    "pop_per_district <- data.table(read_csv(\"data_sources/2019-Table_1.csv\", col_names = TRUE, skip = 8))\n",
    "\n",
    "# Set column names\n",
    "setnames(pop_per_district, old = \"X1\", new = \"DISTRICT_NAME\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data also contains hidden characters acting as whitespace, which the standard regex pattern ```\\\\s``` does not always catch. For this reason we use a regex substitution to remove every character that is not a digit or a decimal point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Strip every character except digits and the decimal point; this removes all whitespace, not only the conventional kind\n",
    "pop_per_district[,2:5] <- data.table(apply(pop_per_district[,2:5], 2, function(x) gsub('[^0-9.]', '', x)))"
   ]
  },
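  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The effect of this substitution can be sketched on a few made-up strings. The vector `raw` below is hypothetical; its first element contains a non-breaking space (U+00A0) as an example of a hidden whitespace character, which may or may not be what the real file contains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical raw values: non-breaking space, ordinary space, decimal point\n",
    "raw <- c(\"12\\u00a0345\", \"7 890\", \"31.5\")\n",
    "\n",
    "# Keep only digits and the decimal point\n",
    "clean <- gsub(\"[^0-9.]\", \"\", raw)\n",
    "\n",
    "# The cleaned strings can now be cast to numeric\n",
    "clean_num <- as.numeric(clean)\n",
    "print(clean_num)"
   ]
  },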
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, we continue by merging the population data into the dogs data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Join by district name (perfect match with no NAs)\n",
    "dogs2020 <- merge(dogs2020, pop_per_district, by = \"DISTRICT_NAME\", all.x = T)\n",
    "\n",
    "# Remove unused variable\n",
    "rm(pop_per_district)\n",
    "\n",
    "# Rename columns\n",
    "setnames(dogs2020,\n",
    "         old = c(\"Total\",            \"Schweizer/-innen\", \"Ausländer/-innen\",   \"Anteil ausländische\\nBevölkerung (%)\"),\n",
    "         new = c(\"TOTAL_POPULATION\", \"SWISS_POPULATION\", \"FOREIGN_POPULATION\", \"FOREIGN_POPULATION_PERCENTAGE\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The columns are then cast as numeric data types for use in mathematical operations later in the analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cast columns as numeric\n",
    "dogs2020$TOTAL_POPULATION <- as.numeric(dogs2020$TOTAL_POPULATION)\n",
    "dogs2020$SWISS_POPULATION <- as.numeric(dogs2020$SWISS_POPULATION)\n",
    "dogs2020$FOREIGN_POPULATION <- as.numeric(dogs2020$FOREIGN_POPULATION)\n",
    "dogs2020$FOREIGN_POPULATION_PERCENTAGE <- as.numeric(dogs2020$FOREIGN_POPULATION_PERCENTAGE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "nvMuam6BnzHX"
   },
   "source": [
    "## Translate Dataframes\n",
    "Using the Yandex Translate API, values in columns are automatically translated from German to English."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3UqXvple8DvM"
   },
   "source": [
    "First, the `translate` function of the RYandexTranslate package is patched, as per the [solution found on GitHub](https://github.com/mukul13/RYandexTranslate/issues/2)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "7Wt8fMK68L8c"
   },
   "outputs": [],
   "source": [
    "translate = function (api_key, text = \"\", lang = \"\") \n",
    "{\n",
    "  url = \"https://translate.yandex.net/api/v1.5/tr.json/translate?\"\n",
    "  url = paste(url, \"key=\", api_key, sep = \"\")\n",
    "  if (text != \"\") {\n",
    "    url = paste(url, \"&text=\", text, sep = \"\")\n",
    "  }\n",
    "  if (lang != \"\") {\n",
    "    url = paste(url, \"&lang=\", lang, sep = \"\")\n",
    "  }\n",
    "  url = gsub(pattern = \" \", replacement = \"%20\", x = url)\n",
    "  d = RCurl::getURL(url, ssl.verifyhost = 0L, ssl.verifypeer = 0L, .encoding = \"UTF-8\")\n",
    "  d = jsonlite::fromJSON(d)\n",
    "  d$code = NULL # Remove unused variables\n",
    "  d\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "rsjgeXUK8Yhq"
   },
   "source": [
    "As the Yandex Translate API requires an API key, we define this here. It is connected to Andris' personal account, so please do not disclose it elsewhere."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Andris' personal Yandex Translate API key\n",
    "# PLEASE DO NOT DISCLOSE ELSEWHERE\n",
    "# Read from an environment variable rather than hard-coding the secret in the notebook.\n",
    "# The variable name YANDEX_API_KEY is an assumption; set it before running this cell.\n",
    "api_key <- Sys.getenv(\"YANDEX_API_KEY\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we define the list of columns whose contents we want to translate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "column_list <- c(\"BREED\", \"COLOR_DOG\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then loop through the columns we wish to translate and pass their values to the Yandex Translate API. As translation can be slow, we pass only the unique values of each column, so that each distinct value is translated once rather than once per row. The output is saved to the dogs data table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "colab_type": "code",
    "id": "NQfpAlpIn1xI",
    "outputId": "1eff4904-573a-4253-ce95-6ed13b2af689"
   },
   "outputs": [],
   "source": [
    "# Loop through columns in column_list\n",
    "for (column_names in column_list) {\n",
    "  # Get all unique values in the corresponding column\n",
    "  unique_values <- unique(dogs2020[, get(column_names)])\n",
    "  \n",
    "  # Loop through unique values to be translated\n",
    "  for (unique_num in seq_along(unique_values)) {\n",
    "    \n",
    "    # Print progress for debugging\n",
    "    print(column_names)\n",
    "    print(paste(unique_num, \"out of\", length(unique_values)))\n",
    "    \n",
    "    # data.table syntax: DT[i, j]\n",
    "    dogs2020[\n",
    "      # i: select all rows where the value equals the current unique value.\n",
    "      # get() is needed in order to use the dynamic column_names.\n",
    "      dogs2020[,get(column_names)] == unique_values[unique_num],\n",
    "      # j: replace the old value with the translated one.\n",
    "      # Wrapping column_names in parentheses makes := use its value as the column name.\n",
    "      # The right-hand side calls the Yandex Translate function with api_key and the text to be translated.\n",
    "      (column_names) := translate(api_key, text=stringi::stri_unescape_unicode(unique_values[unique_num]),\n",
    "      # Specify language pair and translation direction.\n",
    "      lang=\"de-en\"\n",
    "      # Extract only the output text from the response returned by Yandex.\n",
    "      )$text]\n",
    "  \n",
    "  }\n",
    "}"
   ]
  },
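  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The translate-once-per-unique-value idea can also be sketched with a lookup table and a join. This is an alternative illustration, not the method used in this notebook: `fake_translate` is a hypothetical offline stand-in for the real API call, and the table `toy` is made up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(data.table)\n",
    "\n",
    "# Hypothetical stand-in for the API call, so that the sketch runs offline\n",
    "fake_translate <- function(text) toupper(text)\n",
    "\n",
    "# Made-up data with a repeated value\n",
    "toy <- data.table(COLOR = c(\"schwarz\", \"weiss\", \"schwarz\", \"braun\"))\n",
    "\n",
    "# Translate each distinct value exactly once\n",
    "lookup <- data.table(COLOR = unique(toy$COLOR))\n",
    "lookup[, COLOR_EN := vapply(COLOR, fake_translate, character(1), USE.NAMES = FALSE)]\n",
    "\n",
    "# Update join: write the translation to every matching row\n",
    "toy[lookup, on = \"COLOR\", COLOR_EN := i.COLOR_EN]\n",
    "print(toy)"
   ]
  },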
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We manually recode the values of the SEX and SEX_DOG columns to match the English-language convention for sex abbreviations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dogs2020[SEX == \"w\", SEX := \"f\"]\n",
    "dogs2020[SEX_DOG == \"w\", SEX_DOG := \"f\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HfomTPUf4s2X"
   },
   "source": [
    "## Save RData Image\n",
    "An R data image is saved here so that instead of running the entire data cleaning code block again in future runs, the image can be loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "l0-a80wa43XP"
   },
   "outputs": [],
   "source": [
    "save.image(\"dogs.Rdata\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XlVYqq67tNF1"
   },
   "source": [
    "## Backup Output\n",
    "Save a backup of the dogs data set in another variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "KdoFnYpZtVpH"
   },
   "outputs": [],
   "source": [
    "dogs2020_backup = copy(dogs2020)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "d1-vswSaoGvZ"
   },
   "source": [
    "# Investigation\n",
    "---\n",
    "This section of code explores predicting dog breed type."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "5qnDRGFdpwjR"
   },
   "source": [
    "## Model Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 51
    },
    "colab_type": "code",
    "id": "VET_xDP0p0JH",
    "outputId": "10dce23d-7930-47b9-afe8-6677b2181bcc"
   },
   "outputs": [],
   "source": [
    "# Recode colours to form larger groups:\n",
    "# two-coloured dogs are recoded as single-coloured, taking the first colour as the main one.\n",
    "a <- str_split(dogs2020$COLOR_DOG, \"/\")\n",
    "# Keep only the first colour of each dog\n",
    "new_colors <- vapply(a, function(parts) parts[1], character(1))\n",
    "dogs2020$COLOR_DOG <- new_colors\n",
    "\n",
    "# Recode text attributes as numeric so that knn can work with them\n",
    "dogs2020$DISTRICT_NAME <- as.numeric(as.factor(dogs2020$DISTRICT_NAME))\n",
    "dogs2020$AGE <- as.numeric(as.factor(dogs2020$AGE))\n",
    "dogs2020$SEX <- as.numeric(as.factor(dogs2020$SEX))\n",
    "dogs2020$SEX_DOG <- as.numeric(as.factor(dogs2020$SEX_DOG))\n",
    "dogs2020$COLOR_DOG <- as.numeric(as.factor(dogs2020$COLOR_DOG))\n",
    "dogs2020$YOB_DOG <- as.numeric(as.factor(dogs2020$YOB_DOG))\n",
    "\n",
    "# Save the labels so that we can later see which numbers represent which breed types\n",
    "breedtype_labels <- as.factor(dogs2020$BREED_TYPE)\n",
    "dogs2020$BREED_TYPE <- as.numeric(as.factor(dogs2020$BREED_TYPE))\n",
    "\n",
    "# Save the labels so that we can later see which numbers represent which breeds\n",
    "breed_labels <- as.factor(dogs2020$BREED)\n",
    "dogs2020$BREED <- as.numeric(as.factor(dogs2020$BREED))"
   ]
  },
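  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The saved factor labels make it possible to map the numeric codes back to the original text. A minimal sketch, using a made-up vector `breed` (hypothetical values, not taken from the dataset):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical breed names\n",
    "breed <- c(\"Labrador\", \"Chihuahua\", \"Labrador\", \"Pug\")\n",
    "\n",
    "# Keep the factor so that its levels are available later\n",
    "breed_factor <- as.factor(breed)\n",
    "\n",
    "# Numeric codes are the positions of each value in the sorted levels\n",
    "breed_code <- as.numeric(breed_factor)\n",
    "\n",
    "# Indexing the levels by code recovers the original names\n",
    "recovered <- levels(breed_factor)[breed_code]\n",
    "print(data.frame(breed, breed_code, recovered))"
   ]
  },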
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "JlCqPfaRp_Xa"
   },
   "source": [
    "## Naive Approach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "colab_type": "code",
    "id": "ZQqHvoD5qCKc",
    "outputId": "0c234121-6626-4423-af11-999789995ed8"
   },
   "outputs": [],
   "source": [
    "# Split the data into a training group and a test group\n",
    "dat.d <- sample(1:nrow(dogs2020),size=nrow(dogs2020)*0.7,replace = FALSE) # random selection of 70% of the data\n",
    "\n",
    "# We tried different input settings in an attempt to find the optimal one\n",
    "\n",
    "# train.dogs <- dogs2020[dat.d, c(\"DISTRICT_NAME\", \"YOB_DOG\", \"AGE\", \"SEX\", \"SEX_DOG\", \"COLOR_DOG\")] # 70% training data\n",
    "# test.dogs <- dogs2020[-dat.d,c(\"DISTRICT_NAME\", \"YOB_DOG\", \"AGE\", \"SEX\", \"SEX_DOG\", \"COLOR_DOG\")] # remaining 30% test data\n",
    "\n",
    "# train.dogs <- dogs2020[dat.d, c(\"DISTRICT_NAME\", \"YOB_DOG\", \"AGE\", \"SEX\", \"SEX_DOG\")] # 70% training data\n",
    "# test.dogs <- dogs2020[-dat.d,c(\"DISTRICT_NAME\", \"YOB_DOG\", \"AGE\", \"SEX\", \"SEX_DOG\")] # remaining 30% test data\n",
    "\n",
    "# train.dogs <- dogs2020[dat.d, c(\"COLOR_DOG\")] # 70% training data\n",
    "# test.dogs <- dogs2020[-dat.d,c(\"COLOR_DOG\")] # remaining 30% test data\n",
    "\n",
    "# this is the only setting where we are not \"cheating\" by using dog associated attributes\n",
    "train.dogs <- dogs2020[dat.d, c(\"DISTRICT_NAME\", \"AGE\", \"SEX\")] # 70% training data\n",
    "test.dogs <- dogs2020[-dat.d,c(\"DISTRICT_NAME\", \"AGE\", \"SEX\")] # remaining 30% test data\n",
    "\n",
    "train.dogs_labels <- dogs2020[dat.d,BREED]\n",
    "test.dogs_labels <-dogs2020[-dat.d,BREED]\n",
    "\n",
    "# train.dogs_labels <- dogs2020[dat.d,group]\n",
    "# test.dogs_labels <-dogs2020[-dat.d,group]\n",
    "\n",
    "\n",
    "# Approximate k value: square root of the number of rows, truncated to an integer\n",
    "k_value <- floor(sqrt(nrow(dogs2020)))\n",
    "\n",
    "knn.test <- knn(train=train.dogs, test=test.dogs, cl=train.dogs_labels, k=k_value)\n",
    "\n",
    "ACC.test <- 100 * sum(test.dogs_labels == knn.test)/NROW(test.dogs_labels)\n",
    "\n",
    "table(knn.test ,test.dogs_labels)\n",
    "\n",
    "ACC.test\n",
    "\n",
    "# accuracy of direct prediction is 9.22%"
   ]
  },
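  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For readers unfamiliar with the procedure, the same split, fit and accuracy steps can be sketched on R's built-in `iris` dataset. This is only an illustration under our own assumptions: the seed, the variable names and the choice of taking k from the training set size are not part of the analysis above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(class)\n",
    "\n",
    "# Fix the seed so that the random split is reproducible (assumed value)\n",
    "set.seed(42)\n",
    "\n",
    "# Random selection of 70% of the rows for training\n",
    "idx <- sample(seq_len(nrow(iris)), size = floor(0.7 * nrow(iris)))\n",
    "train_x <- iris[idx, 1:4]\n",
    "test_x  <- iris[-idx, 1:4]\n",
    "train_y <- iris$Species[idx]\n",
    "test_y  <- iris$Species[-idx]\n",
    "\n",
    "# Rule-of-thumb k: square root of the number of training rows\n",
    "k <- floor(sqrt(nrow(train_x)))\n",
    "\n",
    "# Classify the test rows and compute the accuracy in percent\n",
    "pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)\n",
    "acc <- 100 * mean(pred == test_y)\n",
    "print(acc)"
   ]
  },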
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZPZgLDmuqL94"
   },
   "source": [
    "## Predicting Breed Type First\n",
    "\n",
    "We now try splitting the dataset by breed type and predicting within each group, as smaller chunks may lead to higher accuracy."
   ]
  },
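  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run knn on breed separately within each predicted breed type.\n",
    "# Assumes train.dogs and test.dogs already contain a PREDICTED_BREED_TYPE column.\n",
    "results <- 0\n",
    "for (i in seq_along(unique(test.dogs_labels)))\n",
    "{\n",
    "  # Skip breed types that do not occur in the training data\n",
    "  if (!any(train.dogs$PREDICTED_BREED_TYPE == i)) {\n",
    "    next\n",
    "  }\n",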
    "  \n",
    "  train.dogs_labels <- dogs2020[dat.d,BREED]\n",
    "  test.dogs_labels <- dogs2020[-dat.d,BREED]\n",
    "  \n",
    "  knn.test <- knn(train=train.dogs[train.dogs$PREDICTED_BREED_TYPE == i], \n",
    "                  test=test.dogs[test.dogs$PREDICTED_BREED_TYPE == i], \n",
    "                  cl=train.dogs_labels[train.dogs$PREDICTED_BREED_TYPE == i], k=k_value)\n",
    "  \n",
    "  ACC.test <- 100 *\n",
    "    sum(test.dogs_labels[test.dogs$PREDICTED_BREED_TYPE == i] == knn.test)/\n",
    "    NROW(test.dogs_labels[test.dogs$PREDICTED_BREED_TYPE == i])\n",
    "  \n",
    "  table(knn.test ,test.dogs_labels[test.dogs$PREDICTED_BREED_TYPE == i])\n",
    "  \n",
    "  results[i] <- ACC.test\n",
    "  \n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This doesn't really help.\n",
    "# weighted accuracy is (2305 * 9.32754880694143/100 + 6.38297872340426/100 * 47) / 2352\n",
    "(length(test.dogs_labels[test.dogs$PREDICTED_BREED_TYPE == 1]) * results[1]/100 \n",
    "  + length(test.dogs_labels[test.dogs$PREDICTED_BREED_TYPE == 3]) * results[3]/100 ) / \n",
    "  (length(test.dogs_labels[test.dogs$PREDICTED_BREED_TYPE == 1]) + length(test.dogs_labels[test.dogs$PREDICTED_BREED_TYPE == 3]))\n",
    "# = 0.09268707, which is not substantially better than the approach without breed type grouping.\n",
    "\n",
    "\n",
    "# It would be reasonable to discard every breed that occurs 100 times or fewer.\n",
    "# Because we have no further attributes at dog owner level and no finer-grained data,\n",
    "# the model merely infers the distribution of dogs across the attributes we use.\n",
    "# To improve accuracy, it therefore makes sense to reduce the number of possible \"answers\".\n",
    "\n",
    "\n",
    "# To test this, we have to calculate how many people own which breeds:\n",
    "# count how many owners every breed has.\n",
    "breed_filter <- data.table(aggregate(OWNER_ID ~ BREED, data = dogs2020, FUN = function(x){NROW(x)}))\n",
    "\n",
    "# Rename the count column\n",
    "setnames(breed_filter, old = c(\"OWNER_ID\")\n",
    "         , new = c(\"NUMBER_OWNERS\"))\n",
    "\n",
    "# Sort by number of owners (descending) and keep breeds with more than 100 owners\n",
    "breed_filter <- breed_filter[order(-NUMBER_OWNERS)]\n",
    "breed_filter <- breed_filter[NUMBER_OWNERS > 100]\n",
    "\n",
    "# Keep only the desired breeds in the dataset\n",
    "dogs2020 <- dogs2020[BREED %in% breed_filter[, BREED]]\n",
    "\n",
    "# 3685 out of 7839 rows are removed (4154 left)\n",
    "# 15 out of 313 breeds are left\n",
    "\n",
    "# Now we rerun an exact copy of the code from the first try\n",
    "\n",
    "dat.d <- sample(1:nrow(dogs2020),size=nrow(dogs2020)*0.7,replace = FALSE) #random selection of 70% data.\n",
    "\n",
    "# this is the only setting where we are not \"cheating\" by using dog associated attributes\n",
    "train.dogs <- dogs2020[dat.d, c(\"DISTRICT_NAME\", \"AGE\", \"SEX\")] # 70% training data\n",
    "test.dogs <- dogs2020[-dat.d,c(\"DISTRICT_NAME\", \"AGE\", \"SEX\")] # remaining 30% test data\n",
    "\n",
    "train.dogs_labels <- dogs2020[dat.d,BREED]\n",
    "test.dogs_labels <-dogs2020[-dat.d,BREED]\n",
    "\n",
    "#aprox k value\n",
    "k_value <-  sqrt(nrow(dogs2020))\n",
    "\n",
    "knn.test <- knn(train=train.dogs, test=test.dogs, cl=train.dogs_labels, k=k_value)\n",
    "\n",
    "ACC.test <- 100 * sum(test.dogs_labels == knn.test)/NROW(test.dogs_labels)\n",
    "\n",
    "table(knn.test ,test.dogs_labels)\n",
    "\n",
    "ACC.test\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# accuracy of direct prediction is 17,64%\n",
    "# result is better, but still follows the distribution. and is not significantly improven\n",
    "\n",
    "# now in an attempt to get more accurate results we split it into chunks, so that breeds\n",
    "# that come up more often, dont get over prefferenced by knn. (same idea as with breed_type but\n",
    "# chunking is based on the number of owners of particular breed)\n",
    "\n",
    "# calculating values for grouping\n",
    "owners_sum <- sum(breed_filter$NUMBER_OWNERS)\n",
    "number_groups <- 6\n",
    "groupsize <- round(owners_sum / number_groups)\n",
    "\n",
    "# adding grouping variable\n",
    "breed_filter[,group:=0]\n",
    "\n",
    "# in any situation first value is group 1\n",
    "breed_filter$group[1] <- 1\n",
    "\n",
    "# assigning groups\n",
    "for (i in 2:nrow(breed_filter)) {\n",
    "  if (sum(breed_filter[group == breed_filter$group[i-1], NUMBER_OWNERS]) > groupsize){\n",
    "    breed_filter$group[i] <- breed_filter$group[i-1]+1\n",
    "  } else {\n",
    "    breed_filter$group[i] <- breed_filter$group[i-1]\n",
    "  }\n",
    "}\n",
    "\n",
    "# transfer grouping to original dataset\n",
    "dogs2020 <- merge(dogs2020, breed_filter[,c(\"BREED\", \"group\")], by = \"BREED\", all.x = T)\n",
    "\n",
    "# rewrite train and test datsets.\n",
    "# this is the only setting where we are not \"cheating\" by using dog associated attributes\n",
    "train.dogs <- dogs2020[dat.d, c(\"DISTRICT_NAME\", \"AGE\", \"SEX\", \"group\")] # 70% training data\n",
    "test.dogs <- dogs2020[-dat.d,c(\"DISTRICT_NAME\", \"AGE\", \"SEX\",\"group\")] # remaining 30% test data\n",
    "\n",
    "train.dogs_labels <- dogs2020[dat.d,BREED]\n",
    "test.dogs_labels <-dogs2020[-dat.d,BREED]\n",
    "\n",
    "\n",
    "\n",
    "# here we once more execute code very similar to the above part with breed_type\n",
    "\n",
    "rm(i)\n",
    "results <- 0\n",
    "for (i in 1:length(unique(breed_filter$group)))\n",
    "{\n",
    "  if (any(train.dogs$group == i) == F) {\n",
    "    next()\n",
    "  }\n",
    "  \n",
    "  train.dogs_labels <- dogs2020[dat.d,BREED]\n",
    "  test.dogs_labels <- dogs2020[-dat.d,BREED]\n",
    "  \n",
    "  knn.test <- knn(train=train.dogs[train.dogs$group == i], \n",
    "                  test=test.dogs[test.dogs$group == i], \n",
    "                  cl=train.dogs_labels[train.dogs$group == i], k=k_value)\n",
    "  \n",
    "  ACC.test <- 100 *\n",
    "    sum(test.dogs_labels[test.dogs$group == i] == knn.test)/\n",
    "    NROW(test.dogs_labels[test.dogs$group == i])\n",
    "  \n",
    "  table(knn.test ,test.dogs_labels[test.dogs$group == i])\n",
    "  \n",
    "  results[i] <- ACC.test\n",
    "  \n",
    "}\n",
    "\n",
    "# doesnt really help.\n",
    "# weighted accuracy is\n",
    "(length(test.dogs_labels[test.dogs$group == 1]) * results[1]/100 \n",
    "  + length(test.dogs_labels[test.dogs$group == 2]) * results[2]/100\n",
    "  + length(test.dogs_labels[test.dogs$group == 3]) * results[3]/100\n",
    "  + length(test.dogs_labels[test.dogs$group == 4]) * results[4]/100\n",
    "  + length(test.dogs_labels[test.dogs$group == 5]) * results[5]/100) / \n",
    "    (length(test.dogs_labels[test.dogs$group == 1]) \n",
    "  + length(test.dogs_labels[test.dogs$group == 2])\n",
    "  + length(test.dogs_labels[test.dogs$group == 3])\n",
    "  + length(test.dogs_labels[test.dogs$group == 4])\n",
    "  + length(test.dogs_labels[test.dogs$group == 5]))\n",
    "# = 0.4667201, which is substantially better than 9% we got in the beginning.\n",
    "# this is the best result we could get. However, the knn bascally still follows the distribution\n",
    "# of data. and because of that, it would be right to follow the occams razors principle and\n",
    "# use more simlpe naive bayes approach, that is made to use the data distributuion for\n",
    "# following predictions"
   ]
  },
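  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The concern that knn merely follows the class distribution can be checked against a majority-class baseline, and a weighted accuracy like the ones above can be computed with `weighted.mean()`. The sketch below is illustrative only: the label vectors and per-chunk numbers are made-up toy values, not results from `dogs2020`, and all variable names are hypothetical.\n",
    "\n",
    "```r\n",
    "# Toy labels (hypothetical), standing in for train.dogs_labels / test.dogs_labels\n",
    "train_labels <- c('Chihuahua', 'Labrador', 'Chihuahua', 'Pug', 'Chihuahua', 'Labrador')\n",
    "test_labels  <- c('Chihuahua', 'Pug', 'Chihuahua', 'Labrador')\n",
    "\n",
    "# Majority-class baseline: always predict the most frequent training breed\n",
    "majority_breed <- names(which.max(table(train_labels)))\n",
    "baseline_acc <- 100 * mean(test_labels == majority_breed)\n",
    "\n",
    "# Weighted accuracy over chunks: weight each chunk's accuracy by its test size\n",
    "chunk_acc  <- c(50, 25)  # per-chunk accuracy in percent (toy values)\n",
    "chunk_size <- c(30, 10)  # number of test rows per chunk (toy values)\n",
    "weighted_acc <- weighted.mean(chunk_acc, w = chunk_size)\n",
    "\n",
    "baseline_acc\n",
    "weighted_acc\n",
    "```\n",
    "\n",
    "If a model does not clearly beat `baseline_acc`, it is mostly reproducing the class distribution."
   ]
  },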
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "29F9YbZ-umjf"
   },
   "source": [
    "## Restore Backup\n",
    "(Delete later once investigation does not directly edit dogs2020)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "eCjIyjrSumyX"
   },
   "outputs": [],
   "source": [
    "dogs2020 = copy(dogs2020_backup)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "E3zqXtrFv-f6"
   },
   "source": [
    "# Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "6ryLgVKdwWwm"
   },
   "source": [
    "## Investigation 1\n",
    "Here we are looking at properties at a district level and trying to predict if roperties of those districts determine percentage of dog ownership."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "NUK0HexcweRb"
   },
   "source": [
    "### Subset and Aggregate Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 418
    },
    "colab_type": "code",
    "id": "xeso5_npwixl",
    "outputId": "2279d607-7141-4770-c2d6-e642020541f5"
   },
   "outputs": [],
   "source": [
    "# First we will need to Subset the Data and Take aggregates to get District Level Statistics\n",
    "# Relevant Columns w. Duplicates Removed\n",
    "district_dog <- unique(subset(dogs2020, select=c(\"OWNER_ID\", \"DISTRICT_NAME\", \"WEALTH_T_CHF\", \"INCOME_T_CHF\",\n",
    "                                                 \"BASIC_SCHOOL_PERCENTAGE\", \"GYMNASIUM_PERCENTAGE\", \"UNIVERSITY_PERCENTAGE\",\n",
    "                                                 \"SINGLE_FAMILY_HOMES\", \"INFRASTRUCTURE_BUILDINGS\", \"SMALL_BUILDINGS\", \"COMMERCIAL_BUILDINGS\",\n",
    "                                                 \"APARTMENTS\", \"FACTORIES_AND_WAREHOUSES\", \"SPECIAL_ACCOMODATION\", \"TOTAL_POPULATION\",\n",
    "                                                 \"FOREIGN_POPULATION_PERCENTAGE\")))\n",
    "\n",
    "# Aggregation - Count total unique owners and preserve select district data\n",
    "ddc <- aggregate(OWNER_ID ~ DISTRICT_NAME + WEALTH_T_CHF + INCOME_T_CHF + BASIC_SCHOOL_PERCENTAGE\n",
    "                 + GYMNASIUM_PERCENTAGE + UNIVERSITY_PERCENTAGE + SINGLE_FAMILY_HOMES\n",
    "                 + INFRASTRUCTURE_BUILDINGS + SMALL_BUILDINGS + COMMERCIAL_BUILDINGS\n",
    "                 + APARTMENTS + FACTORIES_AND_WAREHOUSES + SPECIAL_ACCOMODATION + TOTAL_POPULATION + FOREIGN_POPULATION_PERCENTAGE, \n",
    "                 data = district_dog, FUN = function(x){NROW(x)})\n",
    "\n",
    "# Rename Column(s) and Remove Redundant Data\n",
    "colnames(ddc)[16] <- \"TOTAL_UNIQUE_OWNERS\"\n",
    "rm(district_dog)\n",
    "\n",
    "# Compute Difference\n",
    "ddc$NON_DOG_OWNERS <- ddc$TOTAL_POPULATION - ddc$TOTAL_UNIQUE_OWNERS\n",
    "ddc$PERCENT_DOG_OWNERS <- ddc$TOTAL_UNIQUE_OWNERS / ddc$TOTAL_POPULATION\n",
    "\n",
    "# Compute Total Building Count for Percentages % of Total Buildings and Residential Buildings\n",
    "ddc$TOTAL_BUILDINGS <- ddc$SINGLE_FAMILY_HOMES + ddc$INFRASTRUCTURE_BUILDINGS + ddc$SMALL_BUILDINGS\n",
    "                       + ddc$COMMERCIAL_BUILDINGS + ddc$APARTMENTS + ddc$FACTORIES_AND_WAREHOUSES + ddc$SPECIAL_ACCOMODATION\n",
    "ddc$TOTAL_RESIDENCES <- ddc$SINGLE_FAMILY_HOMES + ddc$APARTMENTS + ddc$SPECIAL_ACCOMODATION\n",
    "\n",
    "ddc$SINGLE_FAMILY_HOMES_PERCENTAGE <- ddc$SINGLE_FAMILY_HOMES / ddc$TOTAL_BUILDINGS\n",
    "ddc$SINGLE_FAMILY_HOMES_RESIDENCE_PERCENTAGE <- ddc$SINGLE_FAMILY_HOMES / ddc$TOTAL_RESIDENCES\n",
    "\n",
    "ddc$APARTMENTS_PERCENTAGE <- ddc$APARTMENTS / ddc$TOTAL_BUILDINGS\n",
    "ddc$APARTMENTS_RESIDENCE_PERCENTAGE <- ddc$APARTMENTS / ddc$TOTAL_RESIDENCES\n",
    "\n",
    "ddc$SPECIAL_ACCOMODATION_PERCENTAGE <- ddc$SPECIAL_ACCOMODATION / ddc$TOTAL_BUILDINGS\n",
    "ddc$SPECIAL_ACCOMODATION_RESIDENCE_PERCENTAGE <- ddc$SPECIAL_ACCOMODATION / ddc$TOTAL_RESIDENCES\n",
    "\n",
    "# Summary Statistics\n",
    "summary(ddc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jry04_10wi-Z"
   },
   "source": [
    "### Linear Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 163
    },
    "colab_type": "code",
    "id": "2bBDkl_-wkhQ",
    "outputId": "8609a635-4952-44a4-86da-ab2bb85af55c"
   },
   "outputs": [],
   "source": [
    "# Compute Scatter Plots for Preliminary Investigation of Independent Variables\n",
    "scatter.smooth(x=ddc$TOTAL_POPULATION, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Population\")\n",
    "scatter.smooth(x=ddc$FOREIGN_POPULATION_PERCENTAGE, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Foreign Population %\")\n",
    "\n",
    "scatter.smooth(x=ddc$WEALTH_T_CHF, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Wealth\")\n",
    "scatter.smooth(x=ddc$INCOME_T_CHF, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Income\")\n",
    "\n",
    "scatter.smooth(x=ddc$SINGLE_FAMILY_HOMES_PERCENTAGE, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Single Fam. Home %\")\n",
    "scatter.smooth(x=ddc$SINGLE_FAMILY_HOMES_RESIDENCE_PERCENTAGE, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Single Fam. Home Res. %\")\n",
    "\n",
    "scatter.smooth(x=ddc$APARTMENTS_PERCENTAGE, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Apartments %\")\n",
    "scatter.smooth(x=ddc$APARTMENTS_RESIDENCE_PERCENTAGE, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Apartments Res. %\")\n",
    "\n",
    "scatter.smooth(x=ddc$SPECIAL_ACCOMODATION_PERCENTAGE, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Spec. Acc. %\")\n",
    "scatter.smooth(x=ddc$SPECIAL_ACCOMODATION_RESIDENCE_PERCENTAGE, y=ddc$PERCENT_DOG_OWNERS, main=\"Dog Ownership ~ Spec. Acc. Res. %\")\n",
    "\n",
    "# Based upon the plots we decide to investigate the quality of wealth and income on Dog Ownership %\n",
    "cor(ddc$PERCENT_DOG_OWNERS, ddc$WEALTH_T_CHF)\n",
    "cor(ddc$PERCENT_DOG_OWNERS, ddc$INCOME_T_CHF)\n",
    "\n",
    "# Two Independent Regressions\n",
    "linearModWealth <- lm(PERCENT_DOG_OWNERS ~ WEALTH_T_CHF, data=ddc)\n",
    "linearModIncome <- lm(PERCENT_DOG_OWNERS ~ INCOME_T_CHF, data=ddc)\n",
    "linearModCombi <- lm(PERCENT_DOG_OWNERS ~ WEALTH_T_CHF + INCOME_T_CHF, data=ddc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dMKLFyCbwtBo"
   },
   "source": [
    "#### Only income seems to have a reliable effect - Rsq: 0.4214, independent variable is statistically significant. Each additional 1000 CHF of income predicts an additional 0.0269% increase in dog ownership.\n",
    "\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Here you can find a summary of the linear model of income."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 129
    },
    "colab_type": "code",
    "id": "ZDS8trR2w9i8",
    "outputId": "08775d2a-0c9d-4407-9e4d-e6f7190a795f"
   },
   "outputs": [],
   "source": [
    "summary(linearModIncome)"
   ]
  },
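  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The quantities quoted above (slope, R² and significance) can also be read out of a fitted model programmatically rather than from the printed summary. A minimal sketch on made-up toy data follows; the data frame `toy`, its column names and its values are hypothetical and are not taken from `ddc`. The slope is expressed in units of the response per unit of the predictor, so its interpretation depends on whether the response is stored as a fraction or as a percentage.\n",
    "\n",
    "```r\n",
    "# Toy district-level data (hypothetical values)\n",
    "toy <- data.frame(\n",
    "  income_t_chf = c(40, 50, 60, 70, 80),\n",
    "  owner_share  = c(0.010, 0.012, 0.015, 0.016, 0.019)\n",
    ")\n",
    "\n",
    "fit <- lm(owner_share ~ income_t_chf, data = toy)\n",
    "fit_summary <- summary(fit)\n",
    "\n",
    "# Change in owner share per additional 1000 CHF of income\n",
    "slope <- coef(fit)[['income_t_chf']]\n",
    "# Share of the variance in the response explained by the model\n",
    "r_squared <- fit_summary$r.squared\n",
    "# p-value of the t-test on the slope\n",
    "p_value <- coef(fit_summary)['income_t_chf', 'Pr(>|t|)']\n",
    "\n",
    "c(slope = slope, r_squared = r_squared, p_value = p_value)\n",
    "```"
   ]
  },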
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "eubb1WX9xCjp"
   },
   "source": [
    "<br>\n",
    "\n",
    "# Investigation Part 3: Naive Bayes\n",
    "In this section we start again from square one and investigate the characteristics of owners that may determine their choice of dog breed. Using a Naive Bayes model we calculate frequency values for unique dog breeds against certain characteristics of the owners, specifically age and gender. We also plot these frequencies on a per-breed basis to better visualise the output of the model.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "VoJo7RGHxM2u"
   },
   "source": [
    "### Subset Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "QKiNNNVRxPcY"
   },
   "outputs": [],
   "source": [
    "# Subset our Data\n",
    "dog_owner_chars <- subset(dogs2020, select=c(\"BREED\", \"BREED_TYPE\", \"YOB_DOG\", \"SEX_DOG\", \"COLOR_DOG\", \n",
    "                                             \"OWNER_ID\", \"AGE\", \"SEX\", \"WEALTH_T_CHF\", \"INCOME_T_CHF\", \n",
    "                                             \"DISTRICT_NAME\"))\n",
    "\n",
    "# Remove Outlier Breeds (<20 Entries)\n",
    "dog_owner_chars <- ddply(dog_owner_chars, \"BREED\", function(d) {if(nrow(d)>19) d else NULL})\n",
    "# A helper data frame with Unique Breeds\n",
    "unique_common_breeds <- unique(subset(dog_owner_chars, select=c(\"BREED\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Nskij5cBxQwd"
   },
   "source": [
    "### Naive Bayes Model\n",
    "Using the naiveBayes() function from the e1071 library we create a dataframe to then loop over and create our frequency plots. We perform this twice, once for the frequency of breeds across age of owner and once for the frequency across genders of owners. \n",
    "\n"
   ]
  },
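  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a categorical predictor and `laplace = 0`, the tables read from `nb$tables` in the loops below hold the relative frequency of each predictor level within each breed. A minimal base-R sketch of that calculation, assuming `AGE` is stored as age groups; the toy data frame and its values are hypothetical, and `e1071` is not needed to run it:\n",
    "\n",
    "```r\n",
    "# Toy owner data (hypothetical values)\n",
    "toy <- data.frame(\n",
    "  BREED = c('Pug', 'Pug', 'Pug', 'Labrador', 'Labrador', 'Labrador', 'Labrador'),\n",
    "  AGE   = c('21-30', '21-30', '31-40', '31-40', '31-40', '31-40', '21-30')\n",
    ")\n",
    "\n",
    "# Count owners per (breed, age group), then normalise each breed's row to sum to 1\n",
    "counts <- table(Y = toy$BREED, AGE = toy$AGE)\n",
    "cond_freq <- prop.table(counts, margin = 1)\n",
    "\n",
    "# Long format with columns Y, AGE and Freq, the shape the plotting loops rely on\n",
    "cond_df <- as.data.frame(cond_freq)\n",
    "cond_df\n",
    "```\n",
    "\n",
    "Each row of `cond_freq` sums to 1, so the plotted values are shares within a breed rather than raw counts."
   ]
  },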
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Frequency vs Age"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 102
    },
    "colab_type": "code",
    "id": "oirXGldHxVDC",
    "outputId": "997efd97-e85e-40bb-dfa6-ce3efe213feb"
   },
   "outputs": [],
   "source": [
    "setDT(dog_owner_chars)\n",
    "\n",
    "# Needs to Be Replaced with Whole Data Set No? Not Just first 100\n",
    "dog_owner_chars <- subset(dog_owner_chars, select=c(\"BREED\", \"AGE\", \"SEX\"))\n",
    "\n",
    "# Naive Bayes Implementation\n",
    "nb <- naiveBayes(BREED ~ ., data=dog_owner_chars, laplace = 0, na.action = na.pass)\n",
    "\n",
    "# Convert nb into a data frame\n",
    "nb_df_age <- as.data.frame(nb$tables$AGE)\n",
    "\n",
    "for (i in 1:length(unique_common_breeds$BREED)) {\n",
    "  breed_name <- unique_common_breeds$BREED[i]\n",
    "  breed <- which(nb_df_age$Y == breed_name)\n",
    "  # create data frame for breed values\n",
    "  d_age <- data.frame(x = nb_df_age[breed,]$AGE, y = nb_df_age[breed,]$Freq)\n",
    "  # create plot\n",
    "  plot <- ggplot(d_age, aes(x = x, y = y, group = 1)) + geom_point() + geom_line() + \n",
    "    labs(x = \"Age group\", y = \"Frequency\", title = paste(breed_name, \"-- Frequency Across Age Groups\"))\n",
    "  print(plot)\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "### Frequency vs Gender"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Similar approach for sex\n",
    "nb_df_sex <- as.data.frame(nb$tables$SEX)\n",
    "\n",
    "for (i in 1:length(unique_common_breeds$BREED)) {\n",
    "  breed_name <- unique_common_breeds$BREED[i]\n",
    "  breed <- which(nb_df_sex$Y == breed_name)\n",
    "  # Create data frame for sex values\n",
    "  # geom_bar() in ggplot2 takes all of the values that one wants to plot in the bar plot and automatically calculates teh frquency. \n",
    "  # Here the frequency is already given. \n",
    "  # A workaround is to create a data frame that replicates the frequency value with respect the number given for frequency, such that the right amount of values is output\n",
    "  rep_freq_f <- cbind(\"x\" = rep(nb_df_sex[breed,]$Freq[1], nb_df_sex[breed,]$Freq[1]*100),\n",
    "                      \"y\" = rep(\"f\", nb_df_sex[breed,]$Freq[1]*100))\n",
    "  rep_freq_m <- cbind(\"x\" = rep(nb_df_sex[breed,]$Freq[2], nb_df_sex[breed,]$Freq[2]*100),\n",
    "                      \"y\" = rep(\"m\", nb_df_sex[breed,]$Freq[2]*100))\n",
    "  freq <- as.data.frame(rbind(rep_freq_f, rep_freq_m))\n",
    "  # Create plot\n",
    "  plot <- ggplot(freq, aes(x = y)) + geom_bar() + \n",
    "    labs(x = \"Sex\", y = \"Frequency\", title = paste(breed_name, \"-- Frequency Across Genders\")) \n",
    "  print(plot)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "N--2CUsM0CDE"
   },
   "source": [
    "<br>\n",
    "<br>\n",
    "\n",
    "# Visualisation Section\n",
    "\n",
    "### Maps and Charts\n",
    "Here we build a visualisation over the districts of Zürich. We are interested in providing a mental map for how wealth is distributed across the city. A lot of the analysis presented in this notebook hinges on economic analysis so we thought this would be a fitting visualisation to accompany the rest of the material present in the notebook.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "-Cjz7IpwEpLf"
   },
   "outputs": [],
   "source": [
    "######################################\n",
    "# Owner Age - Dog Breed Relationship #\n",
    "######################################\n",
    "\n",
    "# Generate breeds table for totals\n",
    "breeds <- table(breed=dogs2020$BREED, age=dogs2020$AGE)\n",
    "breeds <- cbind(breeds, total = rowSums(breeds)) %>%\n",
    "as.data.frame()\n",
    "\n",
    "# Use pie charts to visualise\n",
    "par(mfrow = c(2,5))\n",
    "pie(breeds$`11-20`, dogs2020$BREED, main=\"11-20\")\n",
    "pie(breeds$`21-30`, dogs2020$BREED, main=\"21-30\")\n",
    "pie(breeds$`31-40`, dogs2020$BREED, main=\"31-40\")\n",
    "pie(breeds$`41-50`, dogs2020$BREED, main=\"41-50\")\n",
    "pie(breeds$`51-60`, dogs2020$BREED, main=\"51-60\")\n",
    "pie(breeds$`61-70`, dogs2020$BREED, main=\"61-70\")\n",
    "pie(breeds$`71-80`, dogs2020$BREED, main=\"71-80\")\n",
    "pie(breeds$`81-90`, dogs2020$BREED, main=\"81-90\")\n",
    "pie(breeds$`91-100`, dogs2020$BREED, main=\"91-100\")\n",
    "pie(breeds$total, dogs2020$BREED, main=\"All ages\")\n",
    "\n",
    "# Delete generated table\n",
    "rm(breeds)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Here we group subdistricts together and produce metric for the average wealth per city district. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TFBq2zS1Ewp8"
   },
   "outputs": [],
   "source": [
    "\n",
    "zh_rg <- readOGR(\"./data_sources/stzh.adm_stadtkreise_v.json\")\n",
    "\n",
    "# group sub-districts together\n",
    "list_districts <- list(\n",
    "  District_1 <- c(\n",
    "    \"Rathaus\",\n",
    "    \"Hochschulen\",\n",
    "    \"Lindenhof\",\n",
    "    \"City\"\n",
    "  ),\n",
    "  District_2\t<- c(\n",
    "    \"Wollishofen\",\n",
    "    \"Leimbach\",\n",
    "    \"Enge\"\n",
    "  ),\n",
    "  District_3 <- c(\n",
    "    \"Alt-Wiedikon\",\n",
    "    \"Friesenberg\",\n",
    "    \"Sihlfeld\"\n",
    "  ),\n",
    "  District_4 <- c(\n",
    "    \"Werd\",\n",
    "    \"Langstrasse\",\n",
    "    \"Hard\"\n",
    "  ),\n",
    "  District_5 <- c(\n",
    "    \"Gewerbeschule\",\n",
    "    \"Escher Wyss\"\n",
    "  ),\n",
    "  District_6\t<- c(\n",
    "    \"Unterstrass\",\n",
    "    \"Oberstrass\",\n",
    "    \"Unterstrass\"\n",
    "  ),\n",
    "  District_7 <- c(\t\t\n",
    "    \"Fluntern\",\n",
    "    \"Hottingen\",\n",
    "    \"Hirslanden\",\n",
    "    \"Witikon\"\n",
    "  ),\n",
    "  District_8 <- c(\n",
    "    \"Seefeld\",\n",
    "    \"Mühlebach\",\n",
    "    \"Weinegg\"\n",
    "  ),\n",
    "  District_9 <- c(\t\n",
    "    \"Albisrieden\",\n",
    "    \"Altstetten\"\n",
    "  ),\n",
    "  District_10 <- c(\t\n",
    "    \"Höngg\",\n",
    "    \"Wipkingen\"\n",
    "  ),\n",
    "  District_11 <- c(\t\n",
    "    \"Affoltern\",\n",
    "    \"Oerlikon\",\n",
    "    \"Seebach\"\n",
    "  ),\n",
    "  District_12 <- c(\n",
    "    \"Saatlen\",\n",
    "    \"Schwamendingen-Mitte\",\n",
    "    \"Hirzenbach\"\n",
    "  )\n",
    ")\n",
    "avg_wealth <- unlist(lapply(seq_len(length(list_districts)), function (z) {\n",
    "  mean(unlist(lapply(list_districts[[z]], function (y) {\n",
    "    mean(unlist(lapply(y, function (x) {\n",
    "      dogs2020[which(x == dogs2020$DISTRICT_NAME),]$WEALTH_T_CHF\n",
    "    })))\n",
    "  })))\n",
    "}))\n",
    "\n"
   ]
  },
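  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The nested `lapply()` calls above take, for each district, the mean of its sub-district means, so every sub-district counts equally regardless of how many dogs are registered there. The same calculation can be sketched more compactly with a named list and `sapply()`. The toy data frame and the two-district subset below are hypothetical and only illustrate the pattern:\n",
    "\n",
    "```r\n",
    "# Toy rows (hypothetical values), standing in for dogs2020\n",
    "toy <- data.frame(\n",
    "  DISTRICT_NAME = c('Enge', 'Enge', 'Leimbach', 'Werd', 'Hard'),\n",
    "  WEALTH_T_CHF  = c(100, 100, 40, 30, 20)\n",
    ")\n",
    "\n",
    "# Named list mapping each district to its sub-districts (subset for illustration)\n",
    "districts <- list(\n",
    "  District_2 = c('Enge', 'Leimbach'),\n",
    "  District_4 = c('Werd', 'Hard')\n",
    ")\n",
    "\n",
    "# Mean of the per-sub-district means for each district\n",
    "avg_wealth_toy <- sapply(districts, function(subs) {\n",
    "  mean(sapply(subs, function(s) mean(toy$WEALTH_T_CHF[toy$DISTRICT_NAME == s])))\n",
    "})\n",
    "\n",
    "avg_wealth_toy\n",
    "```\n",
    "\n",
    "Using `=` inside `list()` names the elements, so the result carries the district names along with the values."
   ]
  },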
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### This block handles the colouring and generation of a leaflet map that will show a district wise wealth map of Zürich."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "qM9dsO-aE8e0"
   },
   "outputs": [],
   "source": [
    "bins <- c(0, 25, 30, 50, 60, 75, 100, 120, 150, 175)\n",
    "pal <- colorBin(\"Greens\", domain = avg_wealth, bins = bins)\n",
    "\n",
    "map <- leaflet(zh_rg) %>%\n",
    "  addPolygons(fillColor = ~pal(unlist(avg_wealth)), weight = 2, fillOpacity = 0.9, \n",
    "              opacity = 1) %>%\n",
    "  addTiles() %>% \n",
    "  addLegend(colors = pal(unlist(avg_wealth)), labels = zh_rg$kname, title = \"Zurich Districts\", opacity = 1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Leaflet map\n",
    "This map shows a district wise breakdown of wealth in the City of Zürich. The darker colours represent lower wealth where as the lighter shades represent wealthier districts. In other words, Kreis 12 is the poorest district, whereas Districts 1 and 2 are the wealthiest.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "map"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Big Data Analytics.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}