Question

I have already read this question and the associated answer.

I have removed any 'all zero' columns, as recommended in the answer. I have 3,169 columns remaining.

datavals_no_con = datavals.loc[:, (datavals != datavals.iloc[0]).any()]

I checked whether any constant columns had somehow been missed:

varcon = np.asarray([np.var(datavals_no_con[col]) for col in datavals_no_con.columns])
print(np.where(varcon == 0.))  # empty array

I also checked the minimum column variance, which was 4.306x10^(-7).

This was generated by a column that has no zero entries.

When I run this:

model = VAR(datavals_no_con)

results = model.fit(2)

I still get:

Traceback (most recent call last):
  File "vector_autoregression.py", line 163, in <module>
    results = model.fit(2)
  File "/user/anaconda2/lib/python2.7/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 438, in fit
    return self._estimate_var(lags, trend=trend)
  File "/user/anaconda2/lib/python2.7/site-packages/statsmodels/tsa/vector_ar/var_model.py", line 457, in _estimate_var
    z = util.get_var_endog(y, lags, trend=trend, has_constant='raise')
  File "/user/anaconda2/lib/python2.7/site-packages/statsmodels/tsa/vector_ar/util.py", line 32, in get_var_endog
    has_constant=has_constant)
  File "/user/anaconda2/lib/python2.7/site-packages/statsmodels/tsa/tsatools.py", line 102, in add_trend
    raise ValueError("x already contains a constant")
ValueError: x already contains a constant

How can I resolve this?

EDIT: It occurred to me that the problem is that x contains a constant column, not necessarily an all-zero one. So the answer suggested in the previous question was not entirely sufficient.

To test whether any of my columns contained 'all the same value' (e.g. a column of all 0.5), I tried this:

ptplist = []
for col in datavals_no_con.columns:
    ptplist.append(np.ptp(datavals_no_con[col], axis=0))

ptparray = np.asarray(ptplist)
print(np.any(ptparray == 0.))  # False

So none of my columns are constant, unless I'm still missing something.

EDIT 2: I have found the root cause of the problem.

Suppose, for the sake of argument, that my input matrix (my set of endogenous variables) is the 5x5 identity matrix, and that my lag value is 2 (that is, I'm fitting a VAR(2) model: y_{t+1} = A + B_1 y_{t} + B_2 y_{t-1} + error):

y = np.eye(5)

1 0 0 0 0 (row 1)
0 1 0 0 0 (row 2)
0 0 1 0 0 (row 3)
0 0 0 1 0 (row 4)
0 0 0 0 1 (row 5)

In the get_var_endog function in /statsmodels/tsa/vector_ar/util.py, with lags=2, the y matrix is rearranged into lagged rows like this:

[row 2, row 1] (i.e. concatenate these two)
[row 3, row 2]
[row 4, row 3]

This new matrix can have zero columns in places where my original data matrix did not, and that is exactly what was happening. Following my example, the np.array Z in get_var_endog looks like this:

0 1 0 0 0 1 0 0 0 0
0 0 1 0 0 0 1 0 0 0
0 0 0 1 0 0 0 1 0 0

So now columns 0, 4, 8, and 9 are completely 0, which throws the ValueError.
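This is easy to reproduce with plain NumPy. The sketch below mimics the lag-stacking (without the trend column that statsmodels would prepend) and locates the all-zero columns; the variable names are my own, not statsmodels internals:

```python
import numpy as np

y = np.eye(5)   # toy endogenous data: 5 observations, 5 variables
lags = 2
nobs = y.shape[0]

# Mimic the lag stacking: row t of Z is [y_{t-1}, y_{t-2}] flattened.
Z = np.array([np.concatenate([y[t - 1], y[t - 2]])
              for t in range(lags, nobs)])

zero_cols = np.where(~Z.any(axis=0))[0]
print(Z.shape)    # (3, 10)
print(zero_cols)  # [0 4 8 9]
```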

Two possible approaches to a solution come to mind:

1) Remove the zero columns from the Z matrix.

2) Edit the original data set so that these zero columns never occur in the first place (much harder: at that point the Z matrix doesn't exist yet, so how can you know which columns to remove? A catch-22).

I chose option 1, but now I'm dealing with shape issues downstream: after the least-squares fit, the parameter matrix no longer matches the shape of the original data set, because the columns I removed from Z have no corresponding parameters.

This looks like it should be a fairly common problem: with high-dimensional sparse data, lagging will often produce zero columns. Does anyone have a more robust solution than what I've proposed?
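For what it's worth, one way to make option 1 workable without shape mismatches is to do the regression by hand: drop the zero columns from Z, fit each equation by least squares, then re-insert zero rows so the coefficient matrix keeps the shape implied by the original data. A rough NumPy-only sketch (the function name and details are mine, assuming a plain per-equation OLS fit is acceptable; `lstsq` also returns a minimum-norm solution if collinearity remains):

```python
import numpy as np

def var_fit_drop_zero_cols(y, lags):
    """Least-squares VAR(lags) fit that drops all-zero lagged regressors,
    then re-inserts zero coefficients so shapes match the original data.
    A sketch, not statsmodels' implementation."""
    nobs, neqs = y.shape
    # Response: observations from `lags` onward.
    Y = y[lags:]
    # Design: intercept column + stacked lags [y_{t-1}, ..., y_{t-lags}].
    Z = np.column_stack(
        [np.ones(nobs - lags)]
        + [y[lags - k:nobs - k] for k in range(1, lags + 1)]
    )
    keep = Z.any(axis=0)   # mask out all-zero columns
    keep[0] = True         # defensively keep the intercept
    beta_kept, *_ = np.linalg.lstsq(Z[:, keep], Y, rcond=None)
    # Re-insert zero rows for the dropped regressors.
    beta = np.zeros((Z.shape[1], neqs))
    beta[keep] = beta_kept
    return beta            # shape: (1 + lags*neqs, neqs)

beta = var_fit_drop_zero_cols(np.eye(5), 2)
print(beta.shape)  # (11, 5)
```

The dropped regressors (here Z columns 1, 5, 9, and 10, i.e. the zero columns of the lag block shifted by the intercept) come back as all-zero rows of `beta`, so downstream code that expects one coefficient per original regressor still works.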

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange