Have you considered using a chip matched filter to perform your convolution?
http://en.wikipedia.org/wiki/Matched_filter
They are almost trivially easy to implement: each chip / bit period can be implemented as an add/subtract delay line (use a circular buffer).
A simple one for a square wave of unknown sequence but known frequency (it will also work with other waveforms, though less optimally) can be implemented something like this:
#include <array>

// Circular buffer: a fixed-length delay line.
// Defined first because matchedFilter uses it.
template <int length>
class CircularBuffer {
public:
    // constructor: start at position 0 with a zeroed buffer
    CircularBuffer() : element_pos(0) {
        buffer.fill(0);
    }
    // Store new_element and return the element it overwrites,
    // i.e. the value inserted `length` calls earlier.
    int insert(int new_element) {
        int temp = buffer[element_pos];
        buffer[element_pos] = new_element;
        element_pos += 1;
        if (element_pos == length) {
            element_pos = 0;
        }
        return temp;
    }
private:
    std::array<int, length> buffer;
    int element_pos;
};

// Filter class
template <int samples_per_bit>
class matchedFilter {
public:
    // constructor
    matchedFilter() : acc(0) {}
    int filterInput(int next_sample) {
        // Add the new sample and subtract the one leaving the delay line,
        // so acc holds the sum of the last samples_per_bit samples.
        acc += next_sample;
        acc -= sample_buffer.insert(next_sample);
        // Subtract the sum from one bit period earlier, so the output is
        // the difference between the current and the previous bit period.
        return acc - result_buffer.insert(acc);
    }
private:
    int acc;
    CircularBuffer<samples_per_bit> sample_buffer;
    CircularBuffer<samples_per_bit> result_buffer;
};
As you can see, this is relatively trivial resource-wise. If there is a specific waveform you're after, you can cascade these together to give a longer correlation.
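One way to read "cascade" is sketched below: keep a single running sum over one chip period, pass it through a chain of one-chip delay lines, and combine the delayed sums using the signs of the known chip pattern. This is only an illustration under that assumption. The names `DelayLine`, `ChipCorrelator` and `correlatePattern` are made up for this sketch, and the pattern is assumed to be a list of +1 / -1 chips.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Fixed-length delay line: push() stores a value and returns
// the value that was pushed N calls earlier (0 until it fills).
template <std::size_t N>
class DelayLine {
public:
    int push(int value) {
        const int oldest = buf_[pos_];
        buf_[pos_] = value;
        pos_ = (pos_ + 1) % N;
        return oldest;
    }
private:
    std::array<int, N> buf_{};
    std::size_t pos_ = 0;
};

// Hypothetical cascade: correlates the input against a known pattern
// of Chips chips (+1 / -1), each SamplesPerChip samples long.
template <std::size_t SamplesPerChip, std::size_t Chips>
class ChipCorrelator {
    static_assert(Chips >= 1, "need at least one chip");
public:
    explicit ChipCorrelator(const std::array<int, Chips>& pattern)
        : pattern_(pattern) {}

    int filterInput(int sample) {
        // Stage 1: running sum over the most recent chip period
        // (add the new sample, subtract the one leaving the delay line).
        acc_ += sample - samples_.push(sample);

        // Stage 2: walk the chain of one-chip delays. On iteration k,
        // `delayed` is the chip sum from k chip periods ago, which lines
        // up with pattern chip (Chips - 1 - k).
        int delayed = acc_;
        int out = 0;
        for (std::size_t k = 0; k < Chips; ++k) {
            out += pattern_[Chips - 1 - k] * delayed;
            if (k + 1 < Chips) {
                delayed = delays_[k].push(delayed);
            }
        }
        return out;
    }
private:
    std::array<int, Chips> pattern_;
    int acc_ = 0;
    DelayLine<SamplesPerChip> samples_;
    std::array<DelayLine<SamplesPerChip>, Chips - 1> delays_{};
};

// Test helper: feeds the pattern (expanded to +/-1 samples) followed by
// trailing zeros, and returns every filter output in order.
template <std::size_t SamplesPerChip, std::size_t Chips>
std::vector<int> correlatePattern(const std::array<int, Chips>& pattern,
                                  std::size_t trailing_zeros) {
    ChipCorrelator<SamplesPerChip, Chips> filter(pattern);
    std::vector<int> out;
    for (int chip : pattern) {
        for (std::size_t s = 0; s < SamplesPerChip; ++s) {
            out.push_back(filter.filterInput(chip));
        }
    }
    for (std::size_t s = 0; s < trailing_zeros; ++s) {
        out.push_back(filter.filterInput(0));
    }
    return out;
}
```

For example, `correlatePattern<4, 3>({1, 1, -1}, 8)` feeds a clean copy of the pattern at 4 samples per chip; the output reaches its maximum of 12 (samples per chip times number of chips) on the last sample of the pattern, which is where you would place a detection threshold. The cost is still one add/subtract per sample plus one multiply-by-sign per chip.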