Question

I am working on an algorithm to perform linear regression with one or more independent variables.

That is, if I have m real-world observations and, say, two independent variables a and b:

C + D*a1 + E*b1 = y1

C + D*a2 + E*b2 = y2

...

C + D*am + E*bm = ym

I would like to use the least-squares solution to find the best-fitting line.

I will be using matrix notation, so

X * Beta = y,

where X is the m-by-3 matrix whose i-th row is [1, ai, bi] and y = [y1, ..., ym],

and Beta is the vector [C, D, E]; these values define the best-fit line.

Question: What is the best way to solve this? Should I compute the inverse of X^T X (and evaluate Beta = (X^T X)^-1 * X^T * y directly),

or should I use an LU factorization/decomposition of the matrix? And what is the performance of each on a large amount of data (i.e. a big value of m, possibly on the order of 10^8)?

EDIT

If the answer is to use a Cholesky decomposition or QR decomposition, are there any implementation hints or simple libraries to use? I am coding in C/C++.


Solution 2

Your X^T X matrix is symmetric and positive definite (as long as X has full column rank), so it has a Cholesky decomposition. I'd look into this decomposition before LU; it is faster, needing about half the floating-point operations: http://en.wikipedia.org/wiki/Cholesky_decomposition
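
As a concrete starting point, here is a minimal sketch using the Eigen library (my choice of library, not something the question mandates). It forms the normal equations X^T X * Beta = X^T y and solves them with Eigen's Cholesky-based LDLT solver:

    // Minimal sketch with Eigen (http://eigen.tuxfamily.org), a header-only library.
    // X is the m-by-3 design matrix with rows [1, a_i, b_i]; y holds the y_i.
    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        const int m = 6;                        // tiny example; m may be ~10^8
        Eigen::MatrixXd X(m, 3);
        Eigen::VectorXd y(m);
        for (int i = 0; i < m; ++i) {
            double a = i, b = i * i;            // synthetic independent variables
            X(i, 0) = 1.0;                      // intercept column -> C
            X(i, 1) = a;                        // coefficient D
            X(i, 2) = b;                        // coefficient E
            y(i) = 3.0 + 0.5 * a - 1.5 * b;     // noise-free data, so the fit is exact
        }
        // Normal equations X^T X Beta = X^T y, solved via Cholesky.
        // ldlt() is the pivoted variant, a bit more robust than llt().
        Eigen::VectorXd beta = (X.transpose() * X).ldlt().solve(X.transpose() * y);
        std::cout << "Beta = [C, D, E] = " << beta.transpose() << "\n";
    }

Note that for m on the order of 10^8 you would not store X at all: stream over the data and accumulate the 3-by-3 matrix X^T X and the 3-vector X^T y on the fly, then solve the tiny system.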

Other tips

Two straightforward approaches spring to mind for solving a dense overdetermined system Ax=b:

  1. Form the normal equations A^T A x = A^T b, then Cholesky-factorise A^T A = L L^T, then do two triangular solves (one forward, one backward). This usually gets you an answer accurate to about sqrt(machine epsilon).

  2. Compute the QR factorisation A = Q*R, where Q's columns are orthonormal and R is square and upper-triangular, using something like Householder reflections. Then solve Rx = Q^T b for x by back-substitution; a short sketch follows this list. This usually gets you an answer accurate to about machine epsilon, roughly twice as many correct digits as the Cholesky approach, but it takes about twice as long.
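
For option 2, a hedged sketch, again using Eigen (my assumption, not the only choice; LAPACK's dgels routine is the classic alternative):

    // QR-based least squares with Eigen; A and b built as in the earlier sketch.
    #include <Eigen/Dense>

    Eigen::VectorXd solveByQR(const Eigen::MatrixXd& A, const Eigen::VectorXd& b) {
        // Householder QR with column pivoting: rank-revealing and numerically
        // safer than the normal equations, at roughly twice the cost.
        return A.colPivHouseholderQr().solve(b);
    }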

For sparse systems, I'd usually prefer the Cholesky method because it takes better advantage of sparsity.
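
If the system is sparse, here is a sketch of the same normal-equations idea with Eigen's sparse Cholesky solver (solveSparseLS is a hypothetical helper name, and A is assumed to be assembled already as an Eigen::SparseMatrix):

    // Sparse normal equations solved with Eigen's SimplicialLLT (sparse Cholesky).
    #include <Eigen/Sparse>

    Eigen::VectorXd solveSparseLS(const Eigen::SparseMatrix<double>& A,
                                  const Eigen::VectorXd& b) {
        Eigen::SparseMatrix<double> AtA = A.transpose() * A;  // stays sparse
        Eigen::SimplicialLLT<Eigen::SparseMatrix<double>> chol(AtA);
        // Real code should check chol.info() == Eigen::Success before using the result.
        return chol.solve(A.transpose() * b);
    }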
