In order to calculate the OLS estimator, $X'X$ must be invertible (full rank)
In other words, $X$ must have full column rank $k$, which requires $n \geq k$
Gauss-Markov Theorem
Gauss-Markov Theorem: Under the assumptions of the classical regression model, the OLS estimator is the best (minimum variance) linear unbiased estimator (AKA BLUE)
Proof: Suppose $\hat{\beta} = Cy$. If this estimator is unbiased, then $E(\hat{\beta} \vert X) = \beta$:
$$E(Cy \vert X) = E(CX\beta + C\epsilon \vert X) = CX\beta + CE(\epsilon \vert X) = CX\beta = \beta$$
Thus, $CX = I$. Let $D = C - (X'X)^{-1}X'$, so that
$$DX = CX - (X'X)^{-1}X'X = I - I = 0$$
$$\begin{aligned} Var(\hat{\beta} \vert X) &= Var(Cy \vert X) \\ &= Var[(D + (X'X)^{-1}X')(X\beta + \epsilon) \vert X] \\ &= Var[D\epsilon + (X'X)^{-1}X'\epsilon \vert X] \quad \text{(since } \beta \text{ is a constant)} \\ &= \sigma^2 DD' + \sigma^2(X'X)^{-1} \quad \text{(the cross terms vanish because } DX = 0\text{)} \\ &= \sigma^2 DD' + Var(\hat{\beta}_{OLS} \vert X) \end{aligned}$$
Because $\sigma^2 DD'$ is symmetric positive semi-definite, this variance is minimized when $D = 0$. Thus $C = (X'X)^{-1}X'$, so $\hat{\beta} = (X'X)^{-1}X'y$
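A quick numerical check of this result (not from the notes; the simulated $X$, $\sigma^2$, and the particular $D$ below are illustrative assumptions): build an alternative linear unbiased estimator $Cy$ with $C = (X'X)^{-1}X' + D$ and $DX = 0$, and confirm its variance exceeds the OLS variance by the positive semi-definite matrix $\sigma^2 DD'$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma2 = 50, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

XtX_inv = np.linalg.inv(X.T @ X)
C_ols = XtX_inv @ X.T                      # OLS weights: (X'X)^{-1} X'

# Build D with DX = 0 by projecting arbitrary rows onto the null space of X'
M = np.eye(n) - X @ XtX_inv @ X.T          # annihilator matrix, MX = 0
D = rng.normal(size=(k, n)) @ M * 0.1
C_alt = C_ols + D                          # another linear unbiased estimator

var_ols = sigma2 * XtX_inv                 # sigma^2 (X'X)^{-1}
var_alt = sigma2 * C_alt @ C_alt.T         # sigma^2 C C'

# The gap should equal sigma^2 D D', which is positive semi-definite
gap = var_alt - var_ols
print(np.allclose(gap, sigma2 * D @ D.T))          # True
print(np.all(np.linalg.eigvalsh(gap) >= -1e-10))   # all eigenvalues >= 0, so PSD
```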
Omitted Variables Bias
True model: $y = X_1\beta_1 + X_2\beta_2 + \epsilon$. Estimated model: $y = X_1\beta_1 + \epsilon$
$$\begin{aligned} \hat{\beta} &= (X_1'X_1)^{-1}X_1'y \\ &= (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \epsilon) \\ &= \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\epsilon \end{aligned}$$
$$E(\hat{\beta} \vert X) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2, \text{ which has a bias term}$$
Bias is 0 if $\beta _2 = 0$ or if $X_1$ and $X_2$ are orthogonal ($X_1'X_2 = 0$)
The sign of the bias depends on the correlation between the included and omitted features: a negative correlation produces a negative bias (when $\beta_2 > 0$), and vice versa
Running a model with more variables than needed will not affect the unbiasedness of the estimator, since the true values of the extra coefficients are 0
However, including the extra variables increases the variance of the estimates
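A minimal simulation sketch of the bias formula above (the DGP, coefficients, and correlation between the regressors are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
beta1, beta2 = 2.0, 3.0

# x2 is correlated with x1 and omitted from the fitted model (illustrative DGP)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

X1 = x1.reshape(-1, 1)
X2 = x2.reshape(-1, 1)

# Short regression of y on x1 only
beta_hat = float(np.linalg.solve(X1.T @ X1, X1.T @ y)[0])

# Bias term (X1'X1)^{-1} X1'X2 beta2 from the derivation above
bias = float(np.linalg.solve(X1.T @ X1, X1.T @ X2)[0, 0]) * beta2

print(beta_hat)          # far from beta1 = 2.0
print(beta1 + bias)      # matches beta_hat up to the epsilon term
```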
Assumptions for Estimation
No micronumerosity (n must be greater than or equal to k)
Can solve this by adding more observations or removing parameters
No perfect multicollinearity (Columns must be linearly independent and span k dimensions)
Fixes: Drop collinear columns, get more data, use ridge regression
Statistical Properties/Assumptions
$ y = X \beta + \epsilon $ is the true model (data generating process)
There exist many possible models given a dataset; hard to know if one is actually true
$ E(\epsilon \vert X) = 0 $
If X is fixed, then this is a harmless assumption as long as X has a constant term
If X is random, then we are ruling out endogeneity
$Var(\epsilon \vert X) = \sigma ^2 I_n$
Homoskedasticity; constant error variance
No correlation between $\epsilon _i$ and $\epsilon _j$
$\sigma^2$ scales the variance of $\hat{\beta}$; it depends on how noisy the data/observations are
$(X'X)^{-1}$ affects the variance through the spread of the regressors; a larger range of $X$ means less variance, a smaller range means more variance
$$\begin{aligned} Var(\hat{\beta}_{OLS} \vert X) &= Var((X'X)^{-1}X'y \vert X) \\ &= Var((X'X)^{-1}X'(X\beta + \epsilon) \vert X) \\ &= Var((X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon \vert X) \\ &= Var(\beta + (X'X)^{-1}X'\epsilon \vert X) \\ &= Var((X'X)^{-1}X'\epsilon \vert X) \quad \text{(the variance of the constant } \beta \text{ is 0)} \\ &= (X'X)^{-1}X'Var(\epsilon \vert X)X(X'X)^{-1} \\ &= (X'X)^{-1}X'\sigma^2 X(X'X)^{-1} \\ &= \sigma^2(X'X)^{-1}X'X(X'X)^{-1} \\ &= \sigma^2(X'X)^{-1} \end{aligned}$$
DERIVATIONS WILL BE ON THE EXAM !
$E(e'e \vert X) = \sigma^2 [n-k]$
This means that $s^2 = \frac{e'e}{n-k}$ is an unbiased estimator of $\sigma^2$
$$e = y - X\hat{\beta}_{OLS}$$
$$\begin{aligned} E(e'e \vert X) &= E((y - X\hat{\beta}_{OLS})'(y - X\hat{\beta}_{OLS}) \vert X) \\ &= E((y - X(X'X)^{-1}X'y)'(y - X(X'X)^{-1}X'y) \vert X) \\ &= E(y'(I - X(X'X)^{-1}X')'(I - X(X'X)^{-1}X')y \vert X) \end{aligned}$$
An aside: Is $(I - X(X'X)^{-1}X')$ symmetric?
$$(I - X(X'X)^{-1}X')' = I - X(X'X)^{-1}X'; \text{ yes}$$
Is $(I - X(X'X)^{-1}X')$ idempotent?
$$\begin{aligned} (I - X(X'X)^{-1}X')(I - X(X'X)^{-1}X') &= I - X(X'X)^{-1}X' - X(X'X)^{-1}X' + X(X'X)^{-1}X'X(X'X)^{-1}X' \\ &= I - X(X'X)^{-1}X' - X(X'X)^{-1}X' + X(X'X)^{-1}X' \\ &= I - X(X'X)^{-1}X'; \text{ yes} \end{aligned}$$
Let $M = I - X(X'X)^{-1}X'$, so we need $E(y'M'My \vert X)$. Note $MX = 0$, so $My = MX\beta + M\epsilon = M\epsilon$:
$$E(y'M'My \vert X) = E(\epsilon'M'M\epsilon \vert X) = E(\epsilon'M\epsilon \vert X)$$
Aside: $Var(\epsilon \vert X) = E[(\epsilon - E(\epsilon))(\epsilon - E(\epsilon))' \vert X] = E(\epsilon\epsilon' \vert X)$
Trace operation: $tr(A) = \sum_{i=1}^n a_{ii}$ and $tr(ABC) = tr(CAB) = tr(BCA)$ (the tail matrix can be moved to the head)
$\epsilon'M\epsilon$ is a scalar, so:
$$\begin{aligned} E(\epsilon'M\epsilon \vert X) &= E(tr(\epsilon'M\epsilon) \vert X) = E(tr(\epsilon\epsilon'M) \vert X) = tr(E(\epsilon\epsilon'M \vert X)) \\ &= tr(\sigma^2 M) = \sigma^2 tr(I - X(X'X)^{-1}X') \\ &= \sigma^2[tr(I_n) - tr(X(X'X)^{-1}X')] = \sigma^2[n - tr(X'X(X'X)^{-1})] \\ &= \sigma^2[n - tr(I_k)] = \sigma^2[n - k] \end{aligned}$$
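A small Monte Carlo sketch of the two results above (simulated data; the particular $n$, $k$, $\beta$, and $\sigma^2$ are assumptions): the trace of $M$ equals $n-k$, the average of $s^2$ is close to $\sigma^2$, and the sampling covariance of $\hat{\beta}$ is close to $\sigma^2(X'X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma2 = 40, 4, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.arange(1, k + 1, dtype=float)
XtX_inv = np.linalg.inv(X.T @ X)

M = np.eye(n) - X @ XtX_inv @ X.T
print(np.trace(M), n - k)                    # trace of the annihilator is n - k

b_draws, s2_draws = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    b_draws.append(b)
    s2_draws.append(e @ e / (n - k))

print(np.mean(s2_draws), sigma2)             # E[s^2] ~= sigma^2
# Monte Carlo covariance of beta_hat vs. the analytic sigma^2 (X'X)^{-1}
print(np.max(np.abs(np.cov(np.array(b_draws).T) - sigma2 * XtX_inv)))   # small
```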
Hypothesis Testing
Single Parameter Testing
t distribution is calculated as follows: $t_{df}(0,1) \sim \frac{N(0, 1)}{\sqrt{\chi ^2_{df} / df}}$
$$\frac{(\hat{\beta}_j - \beta_j)/\sqrt{\sigma^2 v_{jj}}}{\sqrt{\dfrac{s^2(n-k)/\sigma^2}{n-k}}} \sim t_{n-k}$$
$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{s^2 v_{jj}}} \sim t_{n-k}$$
where $v_{jj}$ is the $j$-th diagonal element of $v = (X'X)^{-1}$ and $s^2$ is the sample variance
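A minimal sketch of this t statistic on simulated data (the DGP and coefficients are assumptions; scipy is used only for the t distribution's tail probabilities):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, 0.0])            # last coefficient is truly zero
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)                        # unbiased estimate of sigma^2

se = np.sqrt(s2 * np.diag(XtX_inv))         # sqrt(s^2 * v_jj)
t_stats = b / se                            # tests H0: beta_j = 0
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - k)
print(np.round(t_stats, 2), np.round(p_vals, 3))
```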
Multi Parameter Testing
$$\frac{\chi^2_p / p}{\chi^2_r / r} \sim F_{p, r}$$
$$\frac{(e_R'e_R - e'e)/q}{e'e/(n-k)} \sim \frac{\chi^2_q / q}{\chi^2_{n-k}/(n-k)} \sim F_{q, n-k}$$
where $e_R$ are the residuals from the restricted model and $q$ is the number of restrictions
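A minimal sketch of the F test, comparing restricted and unrestricted sums of squared residuals on simulated data (the DGP and the particular restrictions tested are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.8, 0.0, 0.0])       # last two coefficients are zero
y = X @ beta + rng.normal(size=n)

def ssr(X, y):
    """Sum of squared residuals from an OLS fit."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

q = 2                                        # restrictions: beta_3 = beta_4 = 0
ssr_u = ssr(X, y)                            # unrestricted: all k regressors
ssr_r = ssr(X[:, :k - q], y)                 # restricted: drop the last q columns

F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
p_val = stats.f.sf(F, q, n - k)
print(F, p_val)                              # large p-value here: restrictions hold
```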
Breaking Assumptions
Heteroskedasticity
Instead of $Var(\epsilon \vert X) = \sigma^2 I_n$, we now have $Var(\epsilon \vert X) = \Omega_{n \times n}$, where $\Omega$ is symmetric positive definite
$E(\hat{\beta}_{OLS} \vert X) = \beta$ still holds, since the expected value of the errors does not change
$$\begin{aligned} Var(\hat{\beta}_{OLS} \vert X) &= Var((X'X)^{-1}X'y \vert X) \\ &= Var((X'X)^{-1}X'(X\beta + \epsilon) \vert X) \\ &= Var((X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon \vert X) \\ &= Var(\beta + (X'X)^{-1}X'\epsilon \vert X) \\ &= Var((X'X)^{-1}X'\epsilon \vert X) \quad \text{(the variance of the constant } \beta \text{ is 0)} \\ &= (X'X)^{-1}X'Var(\epsilon \vert X)X(X'X)^{-1} \\ &= (X'X)^{-1}X'\Omega X(X'X)^{-1} \end{aligned}$$
$\hat{\beta}_{OLS}$ is inefficient under this assumption, but it is unbiased
The OLS model is equivalent to $y = X\beta + C'\eta$ where $E(\eta \vert X) = 0, Var(\eta \vert X) = I_n$
$C'C = \Omega$ where $C$ is the Cholesky factor; $C'$ is lower triangular and $C$ is upper triangular
$$E[y \vert X, \beta] = E[X\beta + C'\eta \vert X, \beta] = X\beta + C'E[\eta \vert X, \beta] = X\beta$$
This is the same expectation as the OLS model
$$Var[y \vert X, \beta] = Var[X\beta + C'\eta \vert X, \beta] = Var[C'\eta \vert X, \beta] = C'Var[\eta \vert X, \beta]C = C'I_nC = \Omega$$
This is the same variance as the OLS model
Multiplying both sides of the new model by $(C')^{-1}$ gives $(C')^{-1}y = (C')^{-1}X\beta + \eta$, which can be written as $y^* = X^*\beta + \eta$
This involves a data transformation, but running the usual OLS procedure on the transformed data now yields a BLUE estimator
New estimator: $\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1}y$
Derivation uses the fact that $C'C = \Omega \iff \Omega^{-1} = C^{-1}(C')^{-1}$
Since $\Omega$ cannot be found in practice, we estimate it and use the (F)easible GLS estimator $\hat{\beta}_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1} X'\hat{\Omega}^{-1}y$
Note that $E[\hat{\Omega}] = \Omega$ does not imply that $E[\hat{\Omega}^{-1}] = \Omega^{-1}$
However, if $\hat{\Omega}$ is consistent for $\Omega$, then $\hat{\Omega}^{-1}$ IS consistent for $\Omega^{-1}$ by Slutsky's Theorem
FGLS is biased in small samples, but as the sample size gets larger it approaches the GLS estimator and becomes efficient; thus, it is asymptotically efficient
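A minimal numpy sketch of the GLS estimator and the transformation argument above, with a known (assumed, diagonal) $\Omega$ and simulated data; note that numpy's `cholesky` returns the lower-triangular factor, which plays the role of $C'$ in the notation of these notes.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])

# Illustrative heteroskedastic Omega: diagonal with growing variances
Omega = np.diag(0.5 + np.linspace(0, 3, n))
L = np.linalg.cholesky(Omega)               # Omega = L L', L lower triangular (L = C')
eps = L @ rng.normal(size=n)                # errors with Var = Omega
y = X @ beta + eps

Omega_inv = np.linalg.inv(Omega)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Equivalent route: transform the data by (C')^{-1} = L^{-1} and run plain OLS
X_star = np.linalg.solve(L, X)              # L^{-1} X
y_star = np.linalg.solve(L, y)              # L^{-1} y
b_ols_star = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

print(np.allclose(b_gls, b_ols_star))       # True: same estimator either way
```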
Jensen’s Inequality
Jensen's inequality: $f(E[X]) \geq E[f(X)]$ if $f$ is concave, and $f(E[X]) \leq E[f(X)]$ if $f$ is convex
In the sample analogue, $E[X] \approx \overline{X}$ and $E[f(X)] \approx \overline{f(X)}$, so the inequality compares $f(\overline{X})$ and $\overline{f(X)}$
Because the matrix inverse is a nonlinear function, this is why $E[\hat{\Omega}] = \Omega$ does not give $E[\hat{\Omega}^{-1}] = \Omega^{-1}$, even though $\hat{\Omega} \rightarrow \Omega$ implies $\hat{\Omega}^{-1} \rightarrow \Omega^{-1}$ in the limit
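A tiny scalar illustration of the same point (the distribution is an arbitrary assumption): the mean of an inverse is not the inverse of the mean.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(1.0, 3.0, size=100_000)      # positive random variable

# f(x) = 1/x is convex on (0, inf), so E[1/X] >= 1/E[X] by Jensen's inequality
print(1 / np.mean(x))        # about 0.50
print(np.mean(1 / x))        # about 0.55, strictly larger
```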
Estimating Variance
Model: $y_i = X_i'\beta + \epsilon_i \rightarrow y = X\beta + \epsilon$
Since each variance $\sigma^2_i$ is nonnegative and may differ across observations, we can model it as $\ln(\sigma^2_i) = w_i'\gamma$, which guarantees $\sigma^2_i = \exp(w_i'\gamma) > 0$
Steps to estimate variance (or to test for heteroskedasticity)
Regress y on X using OLS
Construct the residuals $e = y - X \hat{\beta}_{OLS}$
Do one of two regressions to get $\gamma$
Regress $e_i^2$ on $w_i$
Regress $ln(e_i^2)$ on $w_i$
This can be done using $\hat{\gamma} = (\omega'\omega)^{-1}\omega'e^2$, where $\omega$ stacks the $w_i'$ rows and $e^2$ is the vector of squared residuals (or their logs, for the second version)
Construct $\hat{\Omega}$ using $\hat{\sigma}_i^2$ and apply FGLS
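A minimal sketch of these steps on simulated data (the DGP is an assumption, and for simplicity $w_i$ uses the same covariates as $x_i$; the intercept of the log-residual regression absorbs a constant offset, which only rescales $\hat{\Omega}$ and does not change $\hat{\beta}_{FGLS}$).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
W = np.column_stack([np.ones(n), x])          # variance covariates w_i (assumed)
gamma_true = np.array([-0.5, 1.0])
sigma2_i = np.exp(W @ gamma_true)             # ln(sigma_i^2) = w_i' gamma
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(size=n) * np.sqrt(sigma2_i)

# Steps 1-2: OLS and residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols

# Step 3: regress ln(e_i^2) on w_i to estimate gamma
gamma_hat = np.linalg.solve(W.T @ W, W.T @ np.log(e**2))

# Step 4: build Omega_hat (diagonal here) and apply FGLS
sigma2_hat = np.exp(W @ gamma_hat)
Omega_inv_hat = np.diag(1 / sigma2_hat)
b_fgls = np.linalg.solve(X.T @ Omega_inv_hat @ X, X.T @ Omega_inv_hat @ y)
print(b_ols, b_fgls)                          # both near beta; FGLS is more efficient
```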
Tests for Heteroskedasticity
Breusch-Pagan: Run an OLS, get the residuals, and run a second OLS of the squared residuals on $\omega_i$
$e_i^2 = \omega_i'\gamma + u_i$ can be rewritten as $\gamma_1 + \underline{\omega_{i2}\gamma_2 + \dots + \omega_{iq}\gamma_q} + u_i$; if the underlined part is nonzero, then we have heteroskedasticity
Can test this via a chi-square statistic with $(q-1)$ degrees of freedom
To choose which variables to include in $\omega_i$, take the Kronecker product of $x_i$ with itself and remove any duplicate elements
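A sketch of the Breusch-Pagan idea on simulated data, using the common $nR^2$ (LM) form of the statistic, which is chi-square with $q-1$ degrees of freedom under homoskedasticity; the DGP and the choice $\omega_i = (1, x_i)$ are assumptions, and the exact statistic may differ in detail from the in-class version.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.exp(0.5 * x)   # heteroskedastic errors

# Step 1: OLS residuals
e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)

# Step 2: auxiliary regression of e_i^2 on omega_i (here omega_i = (1, x_i))
W = np.column_stack([np.ones(n), x])
g = np.linalg.solve(W.T @ W, W.T @ e**2)
fitted = W @ g
r2 = 1 - np.sum((e**2 - fitted) ** 2) / np.sum((e**2 - np.mean(e**2)) ** 2)

# LM form: n * R^2 is asymptotically chi-square with q - 1 df under homoskedasticity
q = W.shape[1]
lm = n * r2
print(lm, stats.chi2.sf(lm, df=q - 1))   # small p-value -> reject homoskedasticity
```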
Goldfeld-Quandt: Split the data into two regions and check if the variances are different in these two regions
Uses an F-test with df $(n_1 - k, n_2 - k)$
Reject if F falls outside the acceptance region $(c_L, c_R)$
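A minimal Goldfeld-Quandt sketch on simulated data (the DGP and split point are assumptions; the common refinement of dropping the middle observations is omitted here for brevity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, k = 300, 2
x = np.sort(rng.normal(size=n))               # sort by the variable suspected of driving the variance
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.exp(0.5 * x)

def s2(Xs, ys):
    """Residual variance e'e/(n-k) from an OLS fit on a subsample."""
    b = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    e = ys - Xs @ b
    return e @ e / (len(ys) - Xs.shape[1])

half = n // 2
F = s2(X[half:], y[half:]) / s2(X[:half], y[:half])   # high-variance half over low-variance half
p_val = stats.f.sf(F, half - k, half - k)
print(F, p_val)    # large F, tiny p-value -> heteroskedasticity
```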
White's robust standard errors: Uses the fact that we can consistently estimate $Var(\hat{\beta}_{OLS} \vert X) = (X'X)^{-1}X'\Omega X(X'X)^{-1}$ (specifically the $X'\Omega X$ term) by substituting the squared residuals into $\Omega$
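A minimal sketch of the sandwich calculation with squared residuals plugged into $\Omega$ (simulated heteroskedastic data as an assumption):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.exp(0.5 * x)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

# Sandwich: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}, using squared residuals in place of Omega
meat = X.T @ (X * (e**2)[:, None])
V_white = XtX_inv @ meat @ XtX_inv
se_white = np.sqrt(np.diag(V_white))

s2 = e @ e / (n - X.shape[1])
se_conventional = np.sqrt(np.diag(s2 * XtX_inv))
print(se_conventional, se_white)   # the two differ when the errors are heteroskedastic
```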
Instrumental Variables and Two-Stage Least Squares (TSLS) Estimation
Assumption broken: $E(\epsilon \vert X) \neq 0$
The OLS estimator is biased, and it remains inconsistent even as the number of observations goes to infinity
$$\hat{\beta}_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y = ((Z\hat{\Pi})'(Z\hat{\Pi}))^{-1}\hat{X}'y = [(Z(Z'Z)^{-1}Z'X)'(Z(Z'Z)^{-1}Z'X)]^{-1}\hat{X}'y$$
Let $P_Z = Z(Z'Z)^{-1}Z'$, so $P_Z = P_Z'$, $P_Z = P_ZP_Z$, and $P_ZX = \hat{X}$. Then
$$\begin{aligned} \hat{\beta}_{IV} &= [(P_ZX)'(P_ZX)]^{-1}\hat{X}'y = [X'P_Z'P_ZX]^{-1}\hat{X}'y \\ &= [\hat{X}'\hat{X}]^{-1}\hat{X}'y = [\hat{X}'X]^{-1}\hat{X}'y = [X'\hat{X}]^{-1}\hat{X}'y \end{aligned}$$
$$\begin{aligned} \hat{\beta}_{IV} &= [\hat{X}'X]^{-1}\hat{X}'y = [\hat{X}'X]^{-1}\hat{X}'(X\beta + \varepsilon) \\ &= [\hat{X}'X]^{-1}\hat{X}'X\beta + [\hat{X}'X]^{-1}\hat{X}'\varepsilon = \beta + [\hat{X}'X]^{-1}\hat{X}'\varepsilon \end{aligned}$$
$$\underset{n \to \infty}{\mathrm{plim}}\,\hat{\beta}_{IV} = \underset{n \to \infty}{\mathrm{plim}}\left(\beta + [\hat{X}'X]^{-1}\hat{X}'\varepsilon\right) = \beta + \underset{n \to \infty}{\mathrm{plim}}\left(\left[\frac{\hat{X}'X}{n}\right]^{-1}\right)\cdot\underset{n \to \infty}{\mathrm{plim}}\left(\frac{\hat{X}'\varepsilon}{n}\right)$$
$$AsyVar(\hat{\beta}_{IV} \vert X) = \hat{\sigma}^2(\hat{X}'\hat{X})^{-1}, \quad \hat{\sigma}^2 = \frac{e'e}{n-k}, \text{ where } e = y - X\hat{\beta}_{IV}$$
Do NOT use $\hat{X}$ when forming these residuals, as it is not part of the DGP
$\underset{n \to \infty}{\mathrm{plim}}\left(\left[\frac{\hat{X}'X}{n}\right]^{-1}\right)$ will exist and be non-singular as long as $\hat{X}$ and $X$ are correlated, such that $\Pi \neq 0$, i.e. the IV is informative or relevant
$\underset{n \to \infty}{\mathrm{plim}}\,\frac{\hat{X}'\varepsilon}{n}$ will equal 0 as long as $\hat{X}$ and $\varepsilon$ are uncorrelated, which is implied when $Z$ is a valid instrument
Therefore, $\hat{\beta}_{IV}$ is a consistent estimator given an appropriate instrument
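A minimal two-stage least squares sketch (the endogenous DGP with a shared shock $u$ and the single instrument $z$ are illustrative assumptions): OLS is pulled away from the true slope, while the IV estimator built from $\hat{X} = P_ZX$ recovers it.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000
z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # common shock creating endogeneity
x = 0.8 * z + u + rng.normal(size=n)          # endogenous regressor (correlated with eps)
eps = u + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)     # biased/inconsistent under endogeneity

# First stage: X_hat = Z (Z'Z)^{-1} Z' X, then regress y on X_hat
Pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
X_hat = Z @ Pi_hat
b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)   # [X_hat' X]^{-1} X_hat' y

e = y - X @ b_iv                               # residuals use X, not X_hat
sigma2_hat = e @ e / (n - X.shape[1])
V_iv = sigma2_hat * np.linalg.inv(X_hat.T @ X_hat)
print(b_ols, b_iv, np.sqrt(np.diag(V_iv)))     # OLS slope drifts above 2; IV is near 2
```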
Additional Notes
Restrictions:
Since Z is n by l and X is n by k, l must be greater than or equal to k
In other words, there should be at least as many instruments as endogenous covariates
$rank(\Pi_{l \times k}) = k$; rank condition
Rules out irrelevant covariates, can’t be verified until regression is run
l must be much lower than n, as $\hat{X}$ should not be close to $X$
If $\hat{X}$ is too similar to $X$, then we find that $\hat{X}$ becomes correlated with $\varepsilon$
No weak instruments; weak instruments cause the bias to explode
One way to test this is to check that the first-stage F-statistic is ≥ 10
$\frac{(e_R'e_R - e'e) / q}{e'e/(n-k)} \sim F_{q, (n-k)}$, where we set the number of restrictions to the number of features, as $e_R'e_R = e'e$ if and only if there is no endogeneity ($E(\varepsilon \vert X) = 0$)
OLS will be better than IV if there is perfect exogeneity, as IV is more volatile, but if the RHS variables are correlated with the errors, then IV will be better
This technique is more general and works for different DGP setups; e.g. ordinal data
Example Derivation:
$$\ln f(y \vert \Theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$$
$$\frac{\partial \ln f(y \vert \Theta)}{\partial \beta} = -\frac{1}{2\sigma^2}\left(-X'y - X'y + (X'X + X'X)\beta\right) = 0 \rightarrow 2X'y = 2X'X\beta \rightarrow \hat{\beta}_{OLS} = (X'X)^{-1}X'y$$
$$\frac{\partial \ln f(y \vert \Theta)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0 \rightarrow -n\sigma^2 + (y - X\beta)'(y - X\beta) = 0 \rightarrow \hat{\sigma}^2_{MLE} = \frac{(y - X\beta)'(y - X\beta)}{n}$$
These results show that the ML estimator of $\beta$ coincides with OLS, $\hat{\beta}_{MLE} = (X'X)^{-1}X'y$ (which is unbiased), while $\hat{\sigma}^2_{MLE} = \frac{e'e}{n}$ is biased in finite samples, since $E(e'e \vert X) = \sigma^2[n-k]$; the unbiased estimator divides by $n-k$ instead
The MLE is asymptotically distributed as $\hat{\Theta}_{MLE} \sim N(\Theta_0, I(\Theta_0)^{-1})$, where $\Theta_0$ is the true $\Theta$ and $I(\Theta_0)$ is the Fisher information matrix
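A minimal numerical check of the derivation above (simulated data as an assumption): maximize the Gaussian log-likelihood with a generic optimizer and confirm it reproduces the closed-form $\hat{\beta}_{OLS}$ and $\hat{\sigma}^2_{MLE} = e'e/n$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(12)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.5, size=n)

def neg_loglik(theta):
    """Negative Gaussian log-likelihood; theta = (beta, log sigma^2)."""
    beta, log_s2 = theta[:k], theta[k]
    s2 = np.exp(log_s2)                       # parameterize by log(sigma^2) to keep it positive
    resid = y - X @ beta
    return 0.5 * (n * np.log(2 * np.pi) + n * log_s2 + resid @ resid / s2)

res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_mle, s2_mle = res.x[:k], np.exp(res.x[k])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b_ols
print(beta_mle, b_ols)                        # numerically identical
print(s2_mle, e @ e / n)                      # MLE variance divides by n, not n - k
```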