
Even when using the same parameters (such as the values of p, d, q), the results can be drastically different in R and Python. #957

Closed
kestlermai opened this issue Apr 29, 2024 · 3 comments

kestlermai commented Apr 29, 2024

R 4.2.1; forecast 8.22.0:

library(forecast)  # loaded per the versions stated above; formats the summary below
fit <- arima(train_data, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
summary(fit)

Series: train_data
ARIMA(0,1,1)(0,1,1)[12]

Coefficients:
          ma1    sma1
       -0.193  -0.791
s.e.    0.091   0.084

sigma^2 = 181: log likelihood = 37.83
AIC=-69.66 AICc=-69.45 BIC=-61.32

Python 3.11; statsmodels 0.14.1:

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(train_data['incidence'], order=(0, 1, 1), seasonal_order=(0, 1, 1, 12))
result = model.fit()
print(result.summary())

                                     SARIMAX Results
==========================================================================================
Dep. Variable:                          incidence   No. Observations:                  132
Model:             SARIMAX(0, 1, 1)x(0, 1, 1, 12)   Log Likelihood                 -99.484
Date:                            Mon, 29 Apr 2024   AIC                            204.969
Time:                                    23:46:06   BIC                            213.306
Sample:                                         0   HQIC                           208.354
                                            - 132
Covariance Type:                              opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1         -0.6900      0.048    -14.322      0.000      -0.784      -0.596
ma.S.L12      -0.8250      0.102     -8.081      0.000      -1.025      -0.625
sigma2         0.2766      0.019     14.838      0.000       0.240       0.313
===================================================================================
Ljung-Box (L1) (Q):                   0.73   Jarque-Bera (JB):               438.41
Prob(Q):                              0.39   Prob(JB):                         0.00
Heteroskedasticity (H):               1.21   Skew:                            -0.82
Prob(H) (two-sided):                  0.56   Kurtosis:                        12.26

Using the same parameters in two different software packages gives drastically different reported model fits.
For example, in R: log likelihood = 37.83, AIC = -69.66; while in Python: log likelihood = -99.484, AIC = 204.969.
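
Both AIC values are internally consistent with the reported log-likelihoods under the standard formula AIC = 2k - 2 log L, with k = 3 estimated parameters here (ma1, sma1, sigma^2), so the gap comes from the likelihood values themselves, not from the AIC arithmetic. A quick check in Python:

def aic(loglik, k=3):
    # Standard definition: AIC = 2k - 2*logL.
    return 2 * k - 2 * loglik

print(aic(37.83))    # R:      -69.66
print(aic(-99.484))  # Python: 204.968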

Can you help me?

robjhyndman (Owner) commented

  1. I don't know what objective function is used by statsmodels, but even if the docs say it is maximum likelihood, there are many variations. R uses a state space representation with a diffuse prior, as explained in the documentation for stats::arima(): https://rdrr.io/r/stats/arima.html. Other objective functions may yield different results. See https://robjhyndman.com/hyndsight/estimation/
  2. Whatever objective function is used, it will contain local optima, and there is no guarantee that the software finds the global optimum. See https://rjournal.github.io/articles/RN-2002-007/
  3. The AIC/BIC/etc. depend on the likelihood, so different likelihood functions lead to different information criteria. Even with the same likelihood function, some software implementations omit the constant in the calculation. See https://robjhyndman.com/hyndsight/lm_aic.html
  4. The best Python implementation of ARIMA models that I know of is provided by StatsForecast: https://nixtlaverse.nixtla.io/statsforecast/src/core/models.html#arima (see the sketch below).
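
For concreteness, here is a minimal sketch of fitting the same ARIMA(0,1,1)(0,1,1)[12] with StatsForecast's model-level API; the synthetic series is only a placeholder for the original train_data['incidence']:

import numpy as np
from statsforecast.models import ARIMA

# Placeholder standing in for the 132 monthly observations in the question.
rng = np.random.default_rng(0)
y = rng.normal(size=132).cumsum() + 10.0

# Same orders as above: (p,d,q) = (0,1,1) and seasonal (0,1,1) with period 12.
model = ARIMA(order=(0, 1, 1), season_length=12, seasonal_order=(0, 1, 1))
model.fit(y=y)
print(model.predict(h=12)['mean'])  # 12-step-ahead point forecasts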

kestlermai commented May 2, 2024

Thank you very much for your reply.
When I tried using StatsForecast to build the ARIMA model, the results still differed significantly from those obtained in R.
With the same parameters {order=(0, 1, 1), season_length=12, seasonal_order=(0, 1, 1)}, the MAPE is 4.922 in R and 14.463 in Python.
Could this be attributed to differences in the software algorithms?
Anyway, thank you very much for your help.

robjhyndman (Owner) commented

A MAPE difference that large suggests something's gone wrong in the Python model.
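
One mismatch worth ruling out first is the metric itself: score both forecast vectors with a single MAPE implementation on the same hold-out sample. A minimal sketch, assuming both forecast vectors are at hand (the arrays below are placeholders, not the real data):

import numpy as np

def mape(actual, forecast):
    # Mean absolute percentage error, in percent.
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

# Placeholder hold-out sample and the two packages' forecasts.
actual = np.array([10.0, 12.0, 11.0])
print(mape(actual, np.array([9.8, 12.3, 10.9])))  # forecasts from R
print(mape(actual, np.array([8.0, 14.0, 12.5])))  # forecasts from Python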
