Question

I apologize if this has been asked before, but I couldn't find any posts with the wording I chose. I'm preparing for interviews, and I've been reading up on external sorting. For instance, if you want to sort several hard disks' worth of 32-bit integers, you could do a counting sort, using 64-bit counters to count the 32-bit integers; then, for every possible 32-bit integer value, you would have a counter representing it. You can also use external merge sort for similar tasks, taking O(n log n) time instead of the linear time of the counting approach. However, I've been thinking about a case that is probably very common, but I can't think of the best way to handle it: adding new data to a collection of sorted files, possibly spread across many hard disks.

If the data were in memory, one could use a heap (priority queue) to accomplish an insert in O(log n) time. However, we can't build a heap out of hard disk space. With a sorted list, you would use an O(log n) binary search to find the data's place, then bump the rest of the data backward or forward, or you might not have to shift anything at all, depending on the implementation of the container (arrays, linked lists, etc.). In the hard disk world, though, reads and writes are much more expensive than in RAM, so inserting data somewhere and then shifting (rewriting) the rest of the data seems prohibitively expensive. Are there any techniques for this that any of you could recommend? I'd be happy to read up on it myself; I just couldn't find the right way to word my question to find any information. Thank you!


Solution

I'd say read through the file of your sorted data, read the file of new values you want to sort and add to it, bump the counters, and simply overwrite the sorted data file with the newly calculated one. Sequential reading is far cheaper on modern disk systems than random reading, and you need a counter position for every int you find anyway, so a single sequential pass over the entire volume will be less time-consuming than ~32 reads of a single sector per number in the file to be added.
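To make that concrete, here is a minimal sketch of the idea, assuming the sorted data is already stored as a file of 2^32 little-endian 64-bit counters and the new values arrive as raw little-endian unsigned 32-bit ints. The file names, chunk sizes, and on-disk format are my own illustration, not something specified in the answer.

```python
# Sketch: merge a batch of new 32-bit values into an existing counter file.
# Assumes counts.bin holds 2^32 little-endian 64-bit counters and new.bin
# holds raw little-endian unsigned 32-bit ints (illustrative formats).
import struct

COUNTER = 8               # bytes per 64-bit counter
CHUNK_COUNTERS = 1 << 20  # counters handled per sequential chunk

def merge_into_counters(counts_path="counts.bin", new_path="new.bin",
                        out_path="counts.new.bin"):
    # 1. Tally the new values in memory with one pass over the smaller file.
    tally = {}
    with open(new_path, "rb") as f:
        while chunk := f.read(4 * 65536):
            for (v,) in struct.iter_unpack("<I", chunk):
                tally[v] = tally.get(v, 0) + 1
    items = sorted(tally.items())

    # 2. Stream the counter file sequentially, bump the affected counters,
    #    and write the result out: one sequential read plus one sequential
    #    write, instead of a random seek per inserted value.
    i = 0
    base = 0
    with open(counts_path, "rb") as src, open(out_path, "wb") as dst:
        while chunk := src.read(COUNTER * CHUNK_COUNTERS):
            n = len(chunk) // COUNTER
            counters = list(struct.unpack("<%dQ" % n, chunk))
            # Apply every tallied value that falls inside this chunk.
            while i < len(items) and items[i][0] < base + n:
                value, count = items[i]
                counters[value - base] += count
                i += 1
            dst.write(struct.pack("<%dQ" % n, *counters))
            base += n
```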

Also, I'd say sorting 32-bit ints is best done with the result kept in the form of counters, especially at an extra-large scale like "several hard disks": you can expect at least one hit in almost every bucket of the 32-bit space, so storing 2^32 64-bit counters can be smaller than storing, say, 2^33 32-bit zeroes, then 2^32 ones, and so on.
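For a rough sense of scale (my arithmetic, not part of the original answer): 2^32 counters at 8 bytes each come to 32 GiB regardless of how much data you have, while storing the raw 4-byte values takes more than 32 GiB as soon as you hold more than 2^33 (about 8.6 billion) of them, which is exactly the "several hard disks" regime described here.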

Other tips

If you look up 'external sorting' here (or elsewhere) you will find discussions of what you are describing. external-sorting is a tag here as well.

In the hard disk world, though, reads and writes are much more expensive than in RAM, so inserting data somewhere then shifting (rewriting) the rest of the data seems prohibitively expensive.

External sorting is for cases where you do not have sufficient memory (or enough "per process", in most cases) to do it internally. It's not uncommon to have data sets too large to hold in memory at once, so you accept the higher runtime cost of I/O-bound sorting.

Even if you have space in memory to hold the file, if you insert a set of numbers whose smallest element is k, you are going to have to rewrite all the numbers in the file that are larger than k. There is no way around this: they all have to shift by at least one position.

If you are looking to exploit the fact that most of the array is already sorted, and you have the space in memory to do so, then sorting the inserted elements and merging them with the portion of the on-disk list that is larger than the smallest insertion is a good, fast way to do this. For example:

DISK:

1 2 3 4 5 6 8 10 11 12

Insertions: 9 7 13

Sort the insertions:

7 9 13

Find the subset of the sorted list on disk that applies: 8 10 11 12

Merge the elements in (as in mergesort):

7 8 9 10 11 12 13

Copy them back to disk:

1 2 3 4 5 6 7 8 9 10 11 12 13
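A minimal sketch of this tail-merge, assuming the on-disk list is a file of sorted little-endian unsigned 32-bit ints and the batch of insertions fits in memory; the file format and function names are my own illustration.

```python
# Sketch: insert a small sorted batch by rewriting only the tail of the file.
import heapq
import os
import struct

INT = 4  # bytes per stored value

def read_at(f, index):
    # Read the value at a given index in the sorted file.
    f.seek(index * INT)
    return struct.unpack("<I", f.read(INT))[0]

def insert_batch(path, insertions):
    insertions = sorted(insertions)
    with open(path, "r+b") as f:
        n = os.path.getsize(path) // INT

        # Binary search for the first on-disk element >= smallest insertion;
        # everything before that position is left untouched.
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            if read_at(f, mid) < insertions[0]:
                lo = mid + 1
            else:
                hi = mid
        start = lo

        # Read the affected tail, merge it with the insertions (as in
        # mergesort), and write the merged run back starting at `start`.
        f.seek(start * INT)
        tail = [v for (v,) in struct.iter_unpack("<I", f.read())]
        merged = list(heapq.merge(tail, insertions))
        f.seek(start * INT)
        f.write(struct.pack("<%dI" % len(merged), *merged))
```

With the example above (1..6, 8, 10, 11, 12 on disk and insertions 9, 7, 13), `start` lands on the 8, and only the run 8 10 11 12 is rewritten as 7 8 9 10 11 12 13.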

If, on the other hand, the memory you have available is much smaller than the total size of the list, other techniques might be advisable. For example:

1 2 3 4 ... 1000 1002 1003 ... 999,998 1,000,000 ...

as your list on disk and

1001, 999,999

as your insertions. In this situation, you will want to go through each on-disk element, calculate the number of elements in the insertion list that are smaller than it, and shift it by that amount. In this simple example the naive count is very fast: you can see that 1,000,000 needs to jump two places. If the number of insertions is comparatively large, you can sort your insertions and then binary search them for each element of your larger array to find how many insertions precede it, which tells you how many positions each item has to move. So, the corresponding shift values for the list above would be:

0 0 0 0 ... 0 1 1 ... 1 2

Hopefully you can see a fairly obvious way to decide when to write each of your insertion elements out to disk, as sketched below.
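Here is a sketch of that idea under the same assumed file layout as before: stream the big sorted file once, and for each element binary search the small, sorted in-memory insertion list to see how many insertions precede it, writing any not-yet-emitted insertions out just before it. The names and file formats are illustrative assumptions.

```python
# Sketch: limited-memory insert via per-element shift counts (binary search
# over the sorted insertion list), streaming the big file once.
import bisect
import struct

INT = 4  # bytes per stored value

def merge_with_shift_counts(src_path, insertions, dst_path):
    insertions = sorted(insertions)
    emitted = 0  # how many insertions have already been written out
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(INT * 65536):
            for (x,) in struct.iter_unpack("<I", chunk):
                # Number of insertions smaller than x == how far x shifts;
                # any not-yet-written ones among them go out before x.
                shift = bisect.bisect_left(insertions, x)
                while emitted < shift:
                    dst.write(struct.pack("<I", insertions[emitted]))
                    emitted += 1
                dst.write(struct.pack("<I", x))
        # Insertions larger than every on-disk element land at the end.
        for v in insertions[emitted:]:
            dst.write(struct.pack("<I", v))
```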

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow