Question

I've got an array of char* in a file. The company I work for stores data in flat files.. Sometimes the data is sorted, but sometimes it's not. I'd like to sort the data in the files.

Now I could write the code to do this, from scratch. Is there an easier way?

Of course an in-place sort would be the best option. I'm working on large files and have little RAM. But I'll consider all options.

All strings are the same length.

This is some sample data:

the data is of fixed length
the Data is of fixed length
thIS data is of fixed lengt

This would represent three records of length 28. The app knows the length. Each record ends with CRLF (\r\n), though it shouldn't matter for this sort.

Was it helpful?

Solution

template<size_t length> int less(const char* left, const char* right) {
    return memcmp(left, right, length) < 0;
}

std::sort(array, array + array_length, less<buffer_length>);

OTHER TIPS

Use the GNU sort program (externally) if you can't fit the data into RAM: it will sort arbitrary sized files and the larger the file, the smaller the additional cost of creating the process.

You can use the algorithms in the STL on arrays native datatypes, not just on STL containers. The other suggestion to use std::sort won't work as posted however, because strcmp returns a value that evaluates to true for all comparisons when the strings aren't the same, not just if the left hand side is less than the right hand side -- which is what std::sort wants; a binary predicate returning true of the left hand side is less than the right hand side.

This works:

struct string_lt : public std::binary_function<bool, char, char>
{
    bool operator()(const char* lhs, const char* rhs)
    {
        int ret = strcmp(lhs, rhs);
        return ret < 0;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    char* strings [] = {"Hello", "World", "Alpha", "Beta", "Omega"};
    size_t numStrings = sizeof(strings)/sizeof(strings[0]);

    std::sort(&strings[0], &strings[numStrings], string_lt());

    return 0;
}

boost::bind can do it:

// ascending
std::sort(c, c + size,  boost::bind(std::strcmp, _1, _2) < 0); 

// descending
std::sort(c, c + size,  boost::bind(std::strcmp, _1, _2) > 0); 

Edit: The strings are not null-terminated:

// ascending
std::sort(c, c + array_size,  boost::bind(std::memcmp, _1, _2, size) < 0); 

// descending
std::sort(c, c + array_size,  boost::bind(std::memcmp, _1, _2, size) > 0); 

Probably the easiest way is to used the old stdlib.h function qsort. This should work:

qsort( array, num_elements, sizeof( char* ), strcmp )

Please note that this is standard C and only works reliable with English text.

If you have a list of String objects, then other things are possible in C++.

If you are on Linux and writing a gtk or Qt application then I would propose that you have a look at these libraries beforehand.

If the files are large and do not fit in RAM, you can use bin/bucket sort to split the data into smaller files and finally aggregate the pieces in a result file. Other responses show you how to sort each individual bucket file.

The canonical way to sort an array of character strings in C, and therefore an available but not necessarily recommended way to do so in C++, uses a level of indirection to strcmp():

static int qsort_strcmp(const void *v1, const void *v2)
{
    const char *s1 = *(char * const *)v1;
    const char *s2 = *(char * const *)v2;
    return(strcmp(s1, s2));
}

static void somefunc(void)   // Or omit the parameter altogether in C++
{
    char **array = ...assignment...
    size_t num_in_array = ...number of char pointers in array...
    ...
    qsort(array, num_in_array, sizeof(char *), qsort_strcmp);
    ...more code...
}

A few things come to mind:

  1. If your data is too big to fit into memory, you may want to just build up an index in-memory of file offsets, then memory-mapping the file to access the strings (depends on your OS).
  2. In-place is going to require a lot of memory copies. If you can, use a shell sort. Then once you know the final order, it's much easier to reorder the strings in-place in linear time.
  3. If the strings are all the same length, you really want a radix sort. If you're not familiar with a radix sort, here's the basic idea: Comparison-based sorting (which is what std::sort, qsort, and any other general-purpose sorting) always requires O(N log N) time. Radix sorting compares a single digit at a time (starting at str[0] and ending at str[K-1] for a K-lenth string), and overall can require only O(N) time to execute.

Consult the Internetfor a much better detailed description of radix sorting algorithms than I can provide. Aside from what I've said, I would avoid all of the other solutions that use standard libarary sorting facilities. They just aren't designed your particular problem, unfortunately.

You probably want to look into memory mapped files (see http://en.wikipedia.org/wiki/Memory-mapped_file), mmap() function (http://en.wikipedia.org/wiki/Mmap) on POSIX-complaint OSes. You'll essentially get a pointer to contiguous memory representing the file's contents.

The good side is that the OS will take care of loading parts of the file into memory and unloading them again, as needed.

One downside is that you'll need to resolve to some form of file locking to avoid corruption if more than one process is likely to access the file.

Another downside is that this doesn't guarantee good performance - to do that, you'll need a sorting algorithm that tries to avoid constantly loading and unloading pages (unless of course you have enough memory to load the entire file into memory).

Hope this has given you some ideas!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top