josef-pkt edited this page Mar 21, 2012 · 3 revisions

Gsoc-Ideas

For general information see here

Ideas for Google Summer of Code 2012 Projects

There are still many areas where the coverage in statsmodels is lacking. So, if a student has a strong preference for a particular topic, it may well be possible to cover it.

The basic idea is: pick your favorite chapters in an econometrics or statistics book, an R package, a Stata topic, or any other package for statistical analysis, see what is missing in statsmodels, and decide what would be most useful to have available first.

Of course, support for a topic will also depend on the availability of a mentor with sufficient expertise to advise.

The following are some ideas. If you are interested in one of the topics, we can also help with additional information.

Support for formulas and categorical data

Convenient support for categorical explanatory variables is still largely lacking in statsmodels. This project can build on the existing formula implementations by Jonathan and by Nathaniel, and on the start of the integration in the statsmodels account on GitHub. The topic is pretty complex, and I would recommend it only to someone familiar with the formula framework in R.
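To make the categorical-variable part concrete, here is a minimal sketch of treatment (reference-level) coding in plain NumPy. It is only an illustration of the kind of design-matrix column expansion a formula framework has to automate; the helper name treatment_dummies is hypothetical and is not part of statsmodels or of the formula implementations mentioned above.

```python
import numpy as np

def treatment_dummies(labels):
    """Hypothetical helper: expand a 1-D array of category labels into
    dummy columns, dropping the first (sorted) level as the reference."""
    labels = np.asarray(labels).ravel()
    # levels are the sorted unique labels; codes index into levels
    levels, codes = np.unique(labels, return_inverse=True)
    full = np.eye(len(levels))[codes.ravel()]   # one column per level
    return levels, full[:, 1:]                  # drop the reference level

labels = np.array(["b", "a", "c", "a"])
levels, dummies = treatment_dummies(labels)

# Design matrix with an intercept plus the dummy columns
X = np.column_stack([np.ones(len(labels)), dummies])
```

With an intercept in the model, dropping one level avoids perfect collinearity; the coefficients on the remaining columns are then differences relative to the reference level "a".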

Extend linear models to non-linear models

The models in linear_model, robust_linear_model and generalized_linear_model could all accept a user-supplied non-linear function y = f(x, parameters) in place of the current linear specification y = X*beta. Technically, this can mostly follow the pattern of the current linear versions, but it requires getting familiar with all three models.
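As a rough illustration of the least-squares case, the following sketch fits y = f(x, parameters) with scipy.optimize.least_squares and then computes standard errors the same way the linear model does, with the Jacobian of f taking the place of X. The exponential mean function, the simulated data and all variable names are assumptions made for the example, not an existing statsmodels interface.

```python
import numpy as np
from scipy.optimize import least_squares

def f(x, params):
    """Example non-linear mean function (an assumption for this sketch)."""
    a, b = params
    return a * np.exp(b * x)

# Simulated data with known parameters
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
true_params = np.array([2.0, 1.5])
y = f(x, true_params) + 0.01 * rng.standard_normal(x.size)

def residuals(params):
    return y - f(x, params)

result = least_squares(residuals, x0=np.array([1.0, 1.0]))
params_hat = result.x

# Covariance as in OLS, with the Jacobian J replacing X:
# cov = sigma^2 * inv(J'J), sigma^2 = SSR / (nobs - k)
nobs, k = result.jac.shape
sigma2 = result.fun @ result.fun / (nobs - k)
cov_params = sigma2 * np.linalg.inv(result.jac.T @ result.jac)
bse = np.sqrt(np.diag(cov_params))
```

The robust and generalized linear versions would replace the squared-error objective with their own loss or likelihood, while keeping the same idea of linearizing f around the estimate.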

Instrumental Variables and GMM

Generic GMM is mostly implemented in the sandbox, but some pieces are missing. Except for the two-stage least squares case, no specific models that use GMM are implemented. The possible application areas are wide; one possibility that has been popular in recent years would be support for weak instruments.
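For readers new to the topic, two-stage least squares can be sketched in a few lines of NumPy. This is a textbook illustration on simulated data, not the sandbox code; the data-generating process and all names are assumptions chosen so that OLS is visibly biased while 2SLS recovers the true slope.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 5000

z = rng.standard_normal((n, 2))           # two instruments
u = rng.standard_normal(n)                # structural error
v = 0.8 * u + rng.standard_normal(n)      # first-stage error, correlated with u
x = z @ np.array([1.0, -1.0]) + v         # endogenous regressor
y = 1.0 + 2.0 * x + u                     # true slope is 2.0

X = np.column_stack([np.ones(n), x])      # regressors (with constant)
Z = np.column_stack([np.ones(n), z])      # instruments (with constant)

# First stage: project X on the instruments
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

# Second stage: regress y on the fitted values
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Plain OLS for comparison; inconsistent because x is correlated with u
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Note that correct standard errors for 2SLS use residuals computed with the original X, not X_hat, which is one of the details a proper model class has to handle.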

Panel data and mixed effects models

These are models with an additional random component that can be implemented from either a statistics or an econometrics viewpoint. The topic is large, so some selection has to be made.

Panel data and GMM, or mixed effects models and GEE

Similar ideas, but the implementation differs between a statistics and an econometrics viewpoint. Estimation and inference are based on moment conditions or estimating equations that use the panel or longitudinal structure of the data.

Time Series Analysis: non-linear models

A wide range of models that statsmodels lacks completely. Examples would be threshold models, Markov switching models, ...

Time Series Analysis: Factor models, Factor VAR

Mainly Stock and Watson and the work that followed. It would also be interesting to link this up with some of the variable selection procedures in sklearn, similar to Bai and Ng.

Time Series Analysis: VECM, Cointegration

Extend the current vector_ar models to include the VECM representation and its estimation, together with the corresponding cointegration estimation.

Time Series Analysis: Bayesian Dynamic Linear Models

Adapt and integrate Wes's DLM code (JP: I don't know what the status is.)

Time Series Analysis: GARCH

Large parts of univariate GARCH are written and in the sandbox, but they need cleanup, enhancements and verification.
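One simple form of verification is to simulate from a known GARCH(1,1) process and check that the likelihood behaves as expected. The sketch below is written from the textbook definition, sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1} with Gaussian innovations; it is not the sandbox implementation, and the function names and parameter values are assumptions for the example.

```python
import numpy as np

def simulate_garch11(n, omega, alpha, beta, rng, burn=500):
    """Simulate eps_t = sqrt(sigma2_t) * z_t from a Gaussian GARCH(1,1)."""
    z = rng.standard_normal(n + burn)
    eps = np.empty(n + burn)
    sigma2 = omega / (1.0 - alpha - beta)   # start at the unconditional variance
    for t in range(n + burn):
        eps[t] = np.sqrt(sigma2) * z[t]
        sigma2 = omega + alpha * eps[t] ** 2 + beta * sigma2
    return eps[burn:]                        # drop the burn-in period

def garch11_negloglike(params, eps):
    """Negative Gaussian log-likelihood of a GARCH(1,1) for the series eps."""
    omega, alpha, beta = params
    sigma2 = eps.var()                       # simple choice for the initial variance
    nll = 0.0
    for e in eps:
        nll += 0.5 * (np.log(2.0 * np.pi) + np.log(sigma2) + e ** 2 / sigma2)
        sigma2 = omega + alpha * e ** 2 + beta * sigma2
    return nll

rng = np.random.default_rng(2012)
true_params = (0.1, 0.1, 0.8)                # unconditional variance is 1.0
eps = simulate_garch11(20000, *true_params, rng=rng)

nll_true = garch11_negloglike(true_params, eps)
# Constant-variance benchmark: alpha = beta = 0, omega = sample variance
nll_const = garch11_negloglike((eps.var(), 0.0, 0.0), eps)
```

On data simulated this way, the negative log-likelihood at the true parameters should be lower than that of the constant-variance benchmark, and the sample variance should be close to omega / (1 - alpha - beta).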

Expand graphics support with matplotlib

statsmodels includes some matplotlib plots, but compared to other statistical packages there are still gaps. One idea would be to implement graphics with coverage similar to that of other statistical packages, in a user-friendly way.

Other (fill in the details)

survival, duration

two stage models (e.g. Heckman sample selection)

system of equations

multivariate models (several endogenous/response variables)

extension to discrete models

non-parametric estimation

....