Excercise 3

# Excercise 3
### Martin Modrák
### 2021/08/24 (updated: 2022-01-26)

---

# A detective story

We have a trustworthy generator of data for linear regression 
and 3 suspicious implementations in Stan. Can you tell which is telling the truth?
Maybe all are wrong? Maybe all are correct?

---

class:small-code

# Generator

It is often useful to implement the generator code in the most simple and stupid
way to reduce the probability of errors. Keep all the fancy optimizations for your
Stan code!

We promise the code below is correct. `N` is the number of data points, `K` the number of predictors.

```r
single_sim_regression <- function(N, K) {
  x <- matrix(rnorm(n = N * K, mean = 0, sd = 1), 
              nrow = N, ncol = K)
  alpha <- rnorm(n = 1, mean = 0, sd = 1)
  beta <- rnorm(n = K, mean = 0, sd = 1)
  sigma <- abs(rnorm(n = 1, mean = 0, sd = 2))
  
  y <- array(NA_real_, N)
  for(n in 1:N) {
    mu <- alpha
    for(k in 1:K) {
      mu <- mu + x[n,k] * beta[k]
    }
    y[n] <- rnorm(n = 1, mean = mu, sd = sigma) 
  }
  
  list(
    variables = list(
      alpha = alpha,
      beta = beta,
      sigma = sigma),
    generated = list(
      N = N, K = K,
      x = x, y = y
    )
  )
  
}
```

---

# Task 1/2

Build the generator and run simulations.
Once again, 10 simulations should serve us well.

```r
generator_regression <- SBC_generator_function(
               single_sim_regression, N = 100, K = 2)
datasets_regression <- generate_datasets(
               generator_regression, 10)
```

---

# Suspect 1

Already saved in `regression_1.stan` if you ran the setup script, otherwise store it in this file yourself :-)

```stan
data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  vector[N] mu = rep_vector(alpha, N);
  for(i in 1:K) {
    for(j in 1:N) {
      mu[j] += beta[i] * x[j, i];
    }
  }
  y ~ normal(mu, sigma);  // likelihood
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);
}
```

---

# Suspect 2

Store in a file called  `regression_2.stan`.

```stan
data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  vector[N] mu;
  for(i in 1:N) {
    mu[i] = alpha;
    for(j in 1:K) {
      mu[i] += beta[j] * x[j, j];
    }
  }
  y ~ normal(mu, sigma);  // likelihood
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);
}
```

---

# Suspect 3

Store in a file called  `regression_3.stan`.

```stan
data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(transpose(beta) * transpose(x) + alpha, sigma);  // likelihood
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);
}
```

---

# Question 1/2

Make a guess - which of the suspects you believe implement linear regression correctly? (could be all three)
Which of the suspects will not give correct results? (could be all three).

Don't spend too much time investigating, if you don't see it - SBC is going to 
help us!

---

# Task 2/2 - Run SBC

Backend with `cmdstanr`
```r
  model_regression_1 <- cmdstan_model("regression_1.stan")
  backend_regression_1 <- SBC_backend_cmdstan_sample(
      model_regression_1, 
      iter_warmup = 400, iter_sampling = 500)

```

Backend with `rstan`

```r
  model_regression_1 <- stan_model("regression_1.stan")
  backend_regression_1 <- SBC_backend_rstan_sample(
      model_regression_1, 
       iter = 900, warmup = 400)
```

SBC + plots

```r
results_regression_1 <- 
  compute_SBC(datasets_regression, backend_regression_1)
plot_rank_hist(results_regression_1)
plot_ecdf(results_regression_1)
plot_ecdf_diff(results_regression_1)

```

(Repeat for all 3 suspects)

---

# Question 2/2

Which of the suspects were caught misleading us using SBC? 
Which variables are affected the most and in which direction?

---

# Bonus Task 1

Figure out what went wrong in the problematic model(s). Can you fix the problem(s)?

There is a hint on the next slide.

---

# Bonus Task 1 - Hint

Indexing and for loops are a frequent source of typos. Some letters just look quite similar.

---

# Bonus Task 2

Can you explain why the problematic model(s) produced the SBC plots they did?

# Bonus Task 3

Look at `results_regression_X$stats` for the problematic models. Would it be possible to use `z_score` to diagnose the problem faster?