Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.985
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 1039.
Date: Sun, 24 Jul 2022 Prob (F-statistic): 2.98e-42
Time: 16:51:07 Log-Likelihood: 2.3957
No. Observations: 50 AIC: 3.209
Df Residuals: 46 BIC: 10.86
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.9269 0.082 60.112 0.000 4.762 5.092
x1 0.5189 0.013 41.053 0.000 0.493 0.544
x2 0.5977 0.050 12.029 0.000 0.498 0.698
x3 -0.0211 0.001 -19.025 0.000 -0.023 -0.019
==============================================================================
Omnibus: 1.405 Durbin-Watson: 1.916
Prob(Omnibus): 0.495 Jarque-Bera (JB): 1.136
Skew: 0.367 Prob(JB): 0.567
Kurtosis: 2.910 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.39904857 4.9307799 5.4164951 5.82361781 6.13132829 6.33398407
6.442047 6.48036438 6.48408661 6.49289211 6.54446826 6.66831944
6.88091958 7.18300551 7.55945594 7.9817758 8.41277792 8.8126928
9.14570259 9.38582586 9.52118632 9.55596401 9.50970856 9.41412724
9.30787552 9.23020613 9.2145199 9.28287814 9.44237799 9.68398642
9.98402429 10.3080568 10.61655102 10.87137102 11.04204353 11.11076457
11.07532464 10.9494717 10.76065242 10.54550279 10.34382893 10.19206703
10.11729701 10.13279337 10.2358442 10.40819676 10.61905638 10.83014586
11.00199438 11.10042242]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.08802943 10.91919511 10.61736061 10.23594536 9.84526812 9.51533048
9.29867809 9.21753556 9.25836517 9.37518137]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f92a7398970>
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2
[9]:
res.params
[9]:
Intercept 4.926917
x1 0.518931
np.sin(x1) 0.597741
I((x1 - 5) ** 2) -0.021115
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.088029
1 10.919195
2 10.617361
3 10.235945
4 9.845268
5 9.515330
6 9.298678
7 9.217536
8 9.258365
9 9.375181
dtype: float64