Machine Learning for Economists

Economists often require two parts to a model in order to make effective policies: prediction and causations
- Many computer science algorithms can handle prediction, and some machine learning ones can handle causation
- Double machine learning handles both but is limited
Trade offs
- Prediction Accuracy vs. Intuitive Interpretation
  - Linear models are simple but easy to interpret; more complex models are harder to interpret but more accurate
- Good fit vs. Over/Under Fit
  - Overfitting is bad because it can’t predict new data accurately; underfitting is bad because it doesn’t fit training data accurately
- Simplicty (Parsimony) vs. Black-box
  - Fewer variables can be better as opposed to using all variables
- Bias vs. Variance
  - More complicated models will have less bias but lead to more variance (i.e. be more susceptible to noisy data)

Feature Reduction Techniques

Given p parameters and 1 <= k <= p, fit all pCk models that contain exactly k predictors
Pick the best model amongst these and call it $M_k$
Pick the best models from the set of best models $M_1, M_2, …, M_k$
Inefficient; requires lots of computation, especially for larger p
- Since the search space is so large, the best model might be the best due to chance

Regularization: A strategy for building models where the cost function increases in value when more features are used
- Strategies take the form of $\frac{1}{n} dev(\beta) + \lambda \Sigma _j c(\beta _j)$
- $dev$ is the squared difference between predictions and actual data
- c is the cost function that adds a penalty for nonzero coefficients
- OLS uses this strategy but does not include the cost function c
Types of cost functions
- Ridge: Places big penalties on large coefficients and smaller penalties on smaller ones ($\beta ^2$)
- Lasso: Penalty is equal to absolute value of coefficient ($\vert \beta \vert$)
- Elastic net: A linear combination of ridge and lasso
- Log penalty: Penalizes changes from 0 to small values heavily but not as much for changes in big coefficients ($log(1 + \vert \beta \vert)$)

Types of dynamic models
- $y_t = f(x_t, x_{t-1}, x_{t-3}, …)$
- $y_t = f(y_{t-1}, x_t)$
- $y_t = f(x_t) + e_t, e_t = f(e_{t-1})$