transform.qmd

---
title: "Thesis - Tranform Variables"
toc: true
number-sections: true
code-fold: true
warning: false
output: false
error: true
format: gfm
editor_options: 
  markdown: 
    wrap: sentence
---

# Load libraries

```{r}
library(tidyverse)
```

# Select and name all relevant variables

```{r}
df <- raw_df %>% 
  select(
    name = 1,
    muni_id,
    district = 3,
    type = 4,
    distance_ta = 5,
    pop = 13,
    jew_pct = 14,
    arab_pct = 16,
    muslim_pct = 17,
    christ_pct = 18,
    druze_pct = 19,
    age_0_4_pct = 22,
    age_5_9_pct = 23,
    age_10_14_pct = 24,
    age_15_19_pct = 25,
    age_20_29_pct = 26,
    age_30_44_pct = 27,
    age_45_59_pct = 28,
    age_60_64_pct = 29,
    age_65_plus_pct = 30,
    age_0_17_pct = 31,
    age_75_plus_pct = 32,
    dep_ratio = 33,
    immig_1990_pct = 41,
    unemp_allowance_pct = 87,
    income_wage = 116,
    wage_num = 122,
    min_wage_pct = 123,
    freelance_num = 124,
    income_freelance = 125,
    bagrut_pct = 154,
    bagrut_uni_pct = 155,
    high_educ_35_55_pct = 156,
    ses_c_2015 = 231,
    ses_i_2015 = 232,
    ses_r_2015 = 233,
    peri_c_2015 = 237,
    peri_i_2015 = 238,
    peri_r_2015 = 239,
    last_col(19):last_col(0)
  )
```

# Transform types and problematic values of variables

## Change relevant charachter variables to numeric

```{r}
df <- df %>% 
  mutate(
    across(
      c(
        distance_ta,
        jew_pct,
        arab_pct,
        muslim_pct,
        christ_pct,
        druze_pct,
        immig_1990_pct,
        starts_with("ses"),
        starts_with("peri")
      ),
      as.numeric
    )
  )

# Check which variables have NA's after the coercion to numeric variable
check <- df %>% 
  summarise(
    across(
      everything(),
      ~sum(is.na(.))
    )
  )
```

## Replace Na's with 0's

except for distance from Tel Aviv, which is NA for regional councils

```{r}
df <- df %>% 
  mutate(
    across(
      c(
        jew_pct,
        arab_pct,
        muslim_pct,
        christ_pct,
        druze_pct,
        immig_1990_pct,
        likud_pct,
        coal_pct,
        pot_votes
      ),
      replace_na, 0
    )
  )
```

## Create factor variables

```{r}
df <- df %>% 
  mutate(
    across(
      c(
        district,
        type
      ),
      as_factor
    )
  )
```

## Round relevant numeric variables

```{r}
df <- df %>% 
  mutate(
    across(
      c(
        distance_ta,
        dep_ratio,
        income_wage,
        income_freelance,
        ends_with("pct")
      ),
      round, 1
    )
  )
```

# Explore distribution of variables

## District

```{r}
df %>% 
  ggplot(aes(district)) +
  geom_bar() +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
```

## type

```{r}
df %>% 
  ggplot(aes(type)) +
  geom_bar() +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
```

## Population (pop)

```{r}
df %>% 
  ggplot(aes(pop)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
```

The population variable is right-skewed.
Let's exaimne if a log10 transformation would help.

```{r}
df %>% 
  ggplot(aes(log10(pop))) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
```

As seen, the population variable is right-skewed.
after a log transformation, the variable is more normally distributed.
## Sector (Jewish/Arab/Mixed)

```{r}
df %>% 
  ggplot(aes(jew_pct)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
```

As seen, the distribution of the jewish population has 3 modes, and therefore we will create a new categorical variable for sector.
Currently the variable is defined as "Arab" if there are more than 50% Arabs, "Mixed" if there are between 10%-50% Arabs, and "Jewish" otherwise.
I consider making a "special sector" variable of some sort, to also include Haredi and Druze.
Should it be another variable or added to the sector variable?

```{r}
df <- df %>% 
  mutate(
    sector = as_factor(case_when(
      arab_pct > 50 ~ "arab",
      arab_pct > 10 ~ "mixed",
      TRUE ~ "jewish"
    ))
  )

df %>% 
  ggplot(aes(sector)) +
  geom_bar() +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
```

## Age distribution

```{r}
df %>% 
  ggplot(aes(age_65_plus_pct)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()

df %>% 
  ggplot(aes(age_0_17_pct)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
```

It is not clear which one is better, but they are both somewhat normally distributed with some right-skewness.
Let's check if sector has an impact on the distribution.

```{r}
df %>% 
  pivot_longer(c(age_0_17_pct, age_65_plus_pct), names_to = "var", values_to = "value") %>% 
  ggplot(aes(value)) + 
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_grid(sector ~ var)
```

It seems like the distribution is not too much effected by sector.
No need to give it a special treatment.

## Dependance ratio

```{r}
df %>% 
  ggplot(aes(dep_ratio)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
```

The dependence ratio variable is right-skewed and it is possible to perform a log10 transformation, but as of now it does not seem to add more useful information to the age distribution variables, so no need for transformation.

## Percent of immigrants that came to Israel after 1990 from the population

```{r}
df %>% 
  ggplot(aes(immig_1990_pct)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
```

It seems that the variable is right-skewed, but it is probably due to the fact that Arab municipalities don't have lots of immigrants.
Let's check this hypothesis.

```{r}
df %>% 
  ggplot(aes(immig_1990_pct)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_wrap(~ sector, scales = "free_y")
```

First, it seems that indeed Arab municipalities don't have immigrants.
This makes it a good candidate for interaction.
Second, the variable is still right-skewed, even only for Jewish municipalities.
Let's try log10 transformation.

```{r}
df %>% 
  ggplot(aes(log10(immig_1990_pct))) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_wrap(~ sector, scales = "free_y")
```

Since a lot of Arab municipalities don't have immigrants at all, the variable is not a good candidate for using it in the model in itself.
We will use a log10 transformation but only with interaction with sector variable.

## Unemployment Allowance Percent

```{r}
df %>% 
  ggplot(aes(unemp_allowance_pct)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()

df %>% 
  arrange(desc(unemp_allowance_pct)) %>% 
  select(name, unemp_allowance_pct)
```

This does not seem like a good variable to determine true unemployment, for none of the top municipalities have Arab or Haredi municipalities.

## Income

```{r}
df %>% 
  pivot_longer(c(income_wage, income_freelance, min_wage_pct), names_to = "var", values_to = "value") %>% 
  ggplot(aes(value)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_wrap(~ var, scales = "free")
```

It seems that the percent earning under minimum wage is pretty normally distributed, but average income of wage workers is right-skewed.
The freelance income is less relevant for the much lower number of freelancers.
Let's try log10 transformation on wage income.

```{r}
df %>% 
  ggplot(aes(log10(income_wage))) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
```

This is much better, a log10 transformation is needed for wage income.

## Education

Let's begin with checking correlation between the three education variables

```{r}
df %>% 
  select(bagrut_pct, bagrut_uni_pct, high_educ_35_55_pct) %>% 
  cor()
```

It seems that the Bagrut variables are highly correlated, but less so with the higher education variable.
Let's check it visually.

```{r}
df %>% 
  ggplot(aes(bagrut_pct, bagrut_uni_pct)) +
  geom_point()

df %>% 
  ggplot(aes(bagrut_pct, high_educ_35_55_pct)) +
  geom_point()

df %>% 
  ggplot(aes(bagrut_uni_pct, high_educ_35_55_pct)) +
  geom_point()
```

The charts reaffirm this understanding.
LEt's examine the distribution of these variables.

```{r}
df %>% 
  pivot_longer(c(bagrut_pct, bagrut_uni_pct, high_educ_35_55_pct), names_to = "var", values_to = "value") %>% 
  ggplot(aes(value)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_wrap(~ var, scales = "free")
```

The bagrut variables are left-skewed while the higher education variable is more bimodal and somewhat evenly distributed.
We will opt to use the higher education variable.

## CBS clusters and indexes

```{r}
df %>% 
  pivot_longer(c(starts_with("ses"), starts_with("peri")), names_to = "var", values_to = "value") %>% 
  ggplot(aes(value)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() + 
  facet_wrap(~ var, scales = "free")
```

Obviously, the rank variable is evenly distributed by definition.
We will use the index variable as it is normally distributed and carries the most information with it.

Let's treat some missing values in the periphery data of 6 municipalities. They are municipalities that did not exist in 2004. We will replace their missing values with the 2015 periphery data.
```{r}
na_peri_muni <- c(
  "0483", # Bueina
  "0628", # Jat
  "0490", # Dir ElAssad
  "0534", # Osfia
  "69", # AlQasum
  "68" # Neve Midbar
)

replace_match_na <- function(x, id, na_vec, replace_with) {
  case_match(
    {{ id }},
    na_vec ~ {{ replace_with }},
    .default = x
  )
}

df <- df %>% 
  mutate(
    peri_2004_i = replace_match_na(peri_2004_i, muni_id, na_peri_muni, peri_i_2015),
    peri_2004_r = replace_match_na(peri_2004_r, muni_id, na_peri_muni, peri_r_2015),
    peri_2004_c = replace_match_na(peri_2004_c, muni_id, na_peri_muni, peri_c_2015)
  )
```


## Voting

```{r}
df %>% 
  pivot_longer(c(likud_pct, coal_pct), names_to = "var", values_to = "value") %>% 
  ggplot(aes(value)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_wrap(~ var, scales = "free")
```

It seems that both voting variables are left skewed.
this might be because of interaction with sector variable.
Let's visualize this.

```{r}
df %>% 
  pivot_longer(c(likud_pct, coal_pct), names_to = "var", values_to = "value") %>% 
  ggplot(aes(value)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_grid(var ~ sector, scales = "free")
```

It seems that indeed the voting variables are almost always very low in the Arab sector.
Also, this suggests that the "mixed" value of the sector variable should be omitted if we check interaction in the model to prevent overfitting.
This is because There are only 13 mixed municipalities.
Again, it is possible to create a "special sector" variable alongside Druze and Haredi.

## Budget

The budget variables would probably be right-skewed, as they are very much affected by the size of the population.

```{r}
df %>% 
  pivot_longer(starts_with("budget"), names_to = "var", values_to = "value") %>% 
  ggplot(aes(value)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_wrap(~ var)
```

This indeed seems like right-skewness.
Let's try to divide the budget by population.

```{r}
df %>% 
  pivot_longer(starts_with("budget"), names_to = "var", values_to = "value") %>% 
  ggplot(aes(value / pop)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_wrap(~ var)
```

After normalizing by population size the distribution is even more right-skewed.
Let's try a log10 transformation.

```{r}
df %>% 
  pivot_longer(starts_with("budget"), names_to = "var", values_to = "value") %>% 
  ggplot(aes(log10(1 + value / pop))) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  facet_wrap(~ var)
```

This is much better and the seemingly best transformation to use.
important to note that 1 was added to the calculation to prevent log of 0.
Nevertheless, we will still keep the original budget_approved variable for comparison.

# Transform variables

```{r}
df <- df %>% 
  mutate(
    pop_log10 = log10(pop),
    immig_1990_pct_log10 = case_when(
      immig_1990_pct == 0 ~ 0,
      TRUE ~ log10(immig_1990_pct)
    ),
    income_wage_log10 = log10(income_wage),
    budget_approved_capita_log10 = log10(1 + budget_approved / pop),
    sector = as_factor(case_when(
      arab_pct > 50 ~ "arab",
      TRUE ~ "jewish"
    )),
    pop_2015 = pop_2015 * 1000
  )
```