Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1039.
Date:                Sun, 24 Jul 2022   Prob (F-statistic):           2.98e-42
Time:                        16:51:07   Log-Likelihood:                 2.3957
No. Observations:                  50   AIC:                             3.209
Df Residuals:                      46   BIC:                             10.86
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9269      0.082     60.112      0.000       4.762       5.092
x1             0.5189      0.013     41.053      0.000       0.493       0.544
x2             0.5977      0.050     12.029      0.000       0.498       0.698
x3            -0.0211      0.001    -19.025      0.000      -0.023      -0.019
==============================================================================
Omnibus:                        1.405   Durbin-Watson:                   1.916
Prob(Omnibus):                  0.495   Jarque-Bera (JB):                1.136
Skew:                           0.367   Prob(JB):                        0.567
Kurtosis:                       2.910   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.39904857  4.9307799   5.4164951   5.82361781  6.13132829  6.33398407
  6.442047    6.48036438  6.48408661  6.49289211  6.54446826  6.66831944
  6.88091958  7.18300551  7.55945594  7.9817758   8.41277792  8.8126928
  9.14570259  9.38582586  9.52118632  9.55596401  9.50970856  9.41412724
  9.30787552  9.23020613  9.2145199   9.28287814  9.44237799  9.68398642
  9.98402429 10.3080568  10.61655102 10.87137102 11.04204353 11.11076457
 11.07532464 10.9494717  10.76065242 10.54550279 10.34382893 10.19206703
 10.11729701 10.13279337 10.2358442  10.40819676 10.61905638 10.83014586
 11.00199438 11.10042242]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.08802943 10.91919511 10.61736061 10.23594536  9.84526812  9.51533048
  9.29867809  9.21753556  9.25836517  9.37518137]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f92a7398970>
../../../_images/examples_notebooks_generated_predict_12_1.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           4.926917
x1                  0.518931
np.sin(x1)          0.597741
I((x1 - 5) ** 2)   -0.021115
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.088029
1    10.919195
2    10.617361
3    10.235945
4     9.845268
5     9.515330
6     9.298678
7     9.217536
8     9.258365
9     9.375181
dtype: float64