Question

I have an existing table (VendorMaster) with 18 columns and around 200,000 rows in it. Now I have built a new table (NewVendorMaster) with different logic to get the same data under the same schema.

Schema of VendorMaster and NewVendorMaster
V1       V2    V3     V4     V5      V6       V7      V8.....   V16
data1                data2          data3   data4   data5      dataN
type3  type16        data3 data4
type3        type17        data14           data18             type20
data4        type17        data14           type45             type20

Now I want to compare those two tables. As there are 18 columns and 200,000 records, should I go for a column-wise comparison, or concatenate all the data into one column and then compare for a faster result?

A join on these two tables seems to take hours to compare.

Also, would creating an index on both tables help make the comparison faster?


Solution

I would advise against concatenating the fields before comparing; instead, I suggest creating a checksum value and working from there. I've done something similar and approached it like this: first I added an extra field to the table that holds the BINARY_CHECKSUM() of the key fields (you can use a computed column for this!) and put an index on just that crc field, but INCLUDE() all the actual key fields. The generated values are not going to be unique, but they will be close enough to work. Mind that it might take a while for the index to be created, and depending on the size it could take up quite a bit of space too.
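To make this concrete, here is a minimal T-SQL sketch; it assumes V1 through V4 are the key fields (the names come from the schema in the question), so adjust the column lists to your real keys:

    -- Computed checksum column on each table (PERSISTED so it is stored and indexable).
    ALTER TABLE VendorMaster
        ADD crc AS BINARY_CHECKSUM(V1, V2, V3, V4) PERSISTED;

    ALTER TABLE NewVendorMaster
        ADD crc AS BINARY_CHECKSUM(V1, V2, V3, V4) PERSISTED;

    -- Index on just the crc field, with the actual key fields INCLUDE()d
    -- so the follow-up comparison can be answered from the index.
    CREATE INDEX IX_VendorMaster_crc
        ON VendorMaster (crc) INCLUDE (V1, V2, V3, V4);

    CREATE INDEX IX_NewVendorMaster_crc
        ON NewVendorMaster (crc) INCLUDE (V1, V2, V3, V4);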

When JOINing both tables, join on the crc plus all the key fields (**). SQL will be quite good at matching just the integer to find the right row(s), and comparing the other fields can then be done from the included columns.

(**: don't rely on just the crc value; BINARY_CHECKSUM() is fast and easy to use, but it will probably have collisions!)
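As a sketch with the same assumed key columns as above, the comparison join could look like this; the crc does the fast matching and the key fields rule out collisions:

    -- Matching rows (crc first, then the key fields to guard against collisions).
    SELECT vm.V1, vm.V2, vm.V3, vm.V4
    FROM   VendorMaster vm
    JOIN   NewVendorMaster nvm
           ON  nvm.crc = vm.crc
           AND nvm.V1  = vm.V1
           AND nvm.V2  = vm.V2
           AND nvm.V3  = vm.V3
           AND nvm.V4  = vm.V4;

    -- Rows present in VendorMaster but missing from NewVendorMaster.
    SELECT vm.*
    FROM   VendorMaster vm
    LEFT JOIN NewVendorMaster nvm
           ON  nvm.crc = vm.crc
           AND nvm.V1  = vm.V1
           AND nvm.V2  = vm.V2
           AND nvm.V3  = vm.V3
           AND nvm.V4  = vm.V4
    WHERE  nvm.crc IS NULL;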

Other tips

This is not a complete answer, but it may help save some time

I was doing a very similar thing - normalising a 2.5 million row, one-dimensional database into half a dozen tables.

What I ended up doing was using C to write short programs to load data in, step through it, and then insert new records into the new database tables, because the clever SQL took all night and never actually generated a result.

Using C I could write debug output to see where things were happening; better still, when things went wrong I could see what was going wrong; and best of all, when the job was half done, it was half done - aborting the program halfway through at least left some good new data in the new table.

So my main point is this: don't try to be clever and construct that one magic SQL query that does all the work. Break it into steps and program it. It runs massively faster.
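Purely as an illustration of the stepwise idea (sketched in T-SQL here rather than C), something like the following processes one slice at a time; the id surrogate key, the batch size, and the Mismatches table are all hypothetical names for this sketch:

    DECLARE @start INT = 0, @batch INT = 10000, @max INT;
    SELECT @max = MAX(id) FROM VendorMaster;

    WHILE @start <= @max
    BEGIN
        -- Compare one slice; anything already written to Mismatches survives
        -- even if the run is aborted halfway through.
        INSERT INTO Mismatches (id)
        SELECT vm.id
        FROM   VendorMaster vm
        LEFT JOIN NewVendorMaster nvm ON nvm.id = vm.id
        WHERE  vm.id BETWEEN @start AND @start + @batch - 1
          AND  nvm.id IS NULL;

        -- Progress output, much like the debug output from the C programs.
        RAISERROR('Finished batch starting at %d', 0, 1, @start) WITH NOWAIT;

        SET @start += @batch;
    END;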

And you know far better how the data is arranged than SQL ever will.

(Adding indexes is always a good idea on tables that don't change much, and they really speed up searches. You can always remove them later on.)

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow