python c extension for standard deviation

https://stackoverflow.com/questions/3401521

25-09-2019

Question

I'm writing a C extension to calculate the standard deviation. Performance is important because it will run over large data sets. I'm having a hard time figuring out how to get the value of a PyObject once I get the item from a list. This is my first time writing a C extension for Python, and any help is appreciated. Apparently I don't know how to use the code sample button correctly :(

This is what I have so far:

    #include <Python.h>

    static PyObject*
    func(PyObject *self, PyObject *args)
    {
      PyObject *list, *item;
      Py_ssize_t i, len;
      if (!PyArg_UnpackTuple(args, "func", 1, 1, &list)){
        return NULL;
      }
      printf("hello world\n");
      Py_INCREF(list);
      len = PyList_GET_SIZE(list);
      for (i=0;i<len;i++){
        item = PyList_GET_ITEM(list, i);
        PyObject_Print(item,stdout,0);
      }
      return list;
    }

    static char func_doc[] = "This function calculates standard deviation.";

    static PyMethodDef std_methods[] = {
      {"func", func, METH_VARARGS, func_doc},
      {NULL, NULL}
    };

    PyMODINIT_FUNC
    initstd(void)
    {
      Py_InitModule3("std", std_methods, "This is a sample docstring.");
    }

Solution

You may be reinventing the wheel. Several scientific computing libraries for Python, such as SciPy and NumPy, are largely wrappers around C code and already implement functions such as standard deviation.

OTHER TIPS

Once you have item, you can get its float value with PyNumber_Float:

    PyObject* floatitem = PyNumber_Float(item);

Now you need to check and exit on error (if(!floatitem) return 0 -- or a goto to a location where you decref anything you may have incref'd in the previous part of your code, e.g. in your case list). If no error, PyFloat_AsDouble gives you the required double value for use in the rest of your C-coded loop:

    double ditem = PyFloat_AsDouble(floatitem);

after which you can decref floatitem and go your merry way. Don't worry overmuch about conversion overhead in PyNumber_Float -- there won't be any if you're passed a list of floats in the first place;-). If you do still worry (would rather give an error if somebody does pass a non-float requiring conversion) you can use PyFloat_Check if you insist (but I'd suggest at least special-casing int and long items unless you want truly perplexed and unhappy users;-). In a similar vein, I'd also strongly recommend studying and using PySequence_Fast and friends, rather than astonishing users by specifically requiring lists rather than other types of sequences!-).

Just to mention that there is almost certainly a better way than writing a C extension.

The first option is to use NumPy. In your comment on the other answer, you mention that converting the list to an array is expensive. That may be true if the standard deviation is the only thing you compute from the data, which is highly unlikely.

Barring that, I would go for Cython. Here is a comparison of Cython and NumPy. Cython underperforms NumPy in this case, but more importantly the code implemented for csum can trivially be changed to compute the standard deviation.

Have you considered using Cython to write your extension? It is perfect for this type of thing.

This method will be limited by the number of items in the list.

Another design would keep a running total and let you add points until you overflowed the double.

If you want simple statistics over large datasets, you can randomly sample a subset of the data and take the average and standard deviation of that. That will have a "standard error" of approximation, and the more samples you take, the smaller that will be. If you don't need high precision of the statistics, you don't need to read all the data.

Licensed under: CC-BY-SA with attribution
