Question

I have a database procedure which processes some XML and normalizes it into multiple tables. I'm not a huge database guy and don't know much about database locks, so I'm mostly just wondering if anything I am doing is completely wrong.

To start with, there is a results table, containing something similar to the following:

resultId | computerId (int) | rawData (xml)
---------|------------------|--------------
1        | 1                | <installedSoftware><software name="Google Chrome" version="1.0" /><software name="Mozilla Firefox" version="3.0" /></installedSoftware>
2        | 2                | <installedSoftware><software name="Internet Explorer" version="6" /><software name="Google Chrome" version="1.0" /></installedSoftware>

My stored procedure looks something like this:

CREATE TABLE #ResultsToProcess
(
   resultId INT
)

-- Only trying to process 1000 results at a time. If I try to do too many at once, I sometimes get timeouts. Instead I do small chunks.
INSERT INTO #ResultsToProcess
SELECT
  TOP 1000
  resultId
FROM
  results

CREATE TABLE #TempSoftware
(
   computerId INT,
   softwareName NVARCHAR(MAX),
   softwareVersion NVARCHAR(MAX)
)

INSERT INTO #TempSoftware
SELECT DISTINCT
  results.computerId,
  T.N.value('@name', 'NVARCHAR(MAX)') AS softwareName,
  T.N.value('@version', 'NVARCHAR(MAX)') AS softwareVersion
FROM
  results
  INNER JOIN #ResultsToProcess ON results.resultId = #ResultsToProcess.resultId
  CROSS APPLY results.rawData.nodes('/installedSoftware[1]/software') AS T(N)

-- May need to do some additional processing on the temporary data before actually using it.

-- To reduce duplicate data, we insert into a full list of software. There is an index based on softwareName and softwareVersion. The software table has an auto-increment int primary key.
INSERT INTO software(softwareName, softwareVersion)
SELECT DISTINCT softwareName, softwareVersion
FROM #TempSoftware
WHERE NOT EXISTS(SELECT 1 FROM software
                 WHERE software.softwareName = #TempSoftware.softwareName
                   AND software.softwareVersion = #TempSoftware.softwareVersion)

-- Finally we will link any software to the computer. However in this case, the temp table does not have any indexes. Would it be worthwhile to add some?
INSERT INTO computer_software(computerId, softwareId)
SELECT
  #TempSoftware.computerId,
  software.softwareId
FROM #TempSoftware
INNER JOIN software
  ON #TempSoftware.softwareName = software.softwareName
 AND #TempSoftware.softwareVersion = software.softwareVersion

In addition to this, the procedure also processes other computer-based attributes, all coming from the same results.rawData column.

My questions regarding this code would be:

  1. During this processing, other entries could be constantly added to the results table. Selecting from the XML nodes to create my temporary table takes a bit of time, and I'm worried that anything trying to insert into the table while this is happening could be forced to wait. By capturing the resultId values at the beginning of the procedure, I try to create a fixed scope of data that the procedure will work on at one time.

  2. During this processing time, other tables can be queried from (such as to find out what software exists on a computer). As I only do one short bulk insert into these tables, I'm assuming there should be no problems there.

  3. The #TempSoftware table has no indexes and I do a join on two NVARCHAR(MAX) columns. Is it worthwhile to create indexes on this table? Or would the overhead of creating the index be worse than the join?

  4. Am I doing anything stupid here, that I should be smacked for?

Thanks for any suggestions. Again, I'm not a big database guy. I'm assuming that doing all the processing directly in the database is better than pulling the raw data back into C#, processing it there, and re-inserting it into the database.


Solution

My approach would be to add an additional (datetime) column flagging rows that you have already copied into a temporary working table. Process the temp working table, then come back and check for new rows. This should be quite lightweight, and if need be it can run in its own transaction.
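A minimal sketch of that flagging approach (the `processedAt` column name is an assumption; it would need to be added to the results table as a nullable DATETIME defaulting to NULL):

```sql
-- Claim a batch of unprocessed rows and capture their ids in one atomic
-- statement, so concurrent inserts can't slip into the batch mid-claim.
DECLARE @now DATETIME = GETDATE();

UPDATE TOP (1000) results
SET processedAt = @now
OUTPUT inserted.resultId INTO #ResultsToProcess
WHERE processedAt IS NULL;

-- The rest of the procedure works only on #ResultsToProcess; rows inserted
-- while it runs keep processedAt = NULL and are picked up on the next pass.
```

The OUTPUT clause avoids a separate SELECT between flagging and collecting the ids, which is where a race with concurrent writers could otherwise occur.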

Usually no indexes = bad, relevant indexes = good. You should always have a clustered index, though it can be tightly packed or padded depending on the randomness of your data; an 80% fill factor is a good starting point. While you will get some overhead copying the rows, SQL Server is very good at creating indexes, so insert first, then create the index — that is the best order.
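For the temp table in the question, that insert-then-index order would look something like the sketch below. Note one assumption baked in: NVARCHAR(MAX) columns cannot be index keys in SQL Server, so the sketch bounds the column widths (the specific lengths are illustrative).

```sql
-- Bounded widths so the columns can serve as index keys
-- (NVARCHAR(MAX) is not allowed as an index key column).
CREATE TABLE #TempSoftware
(
   computerId INT,
   softwareName NVARCHAR(255),
   softwareVersion NVARCHAR(100)
);

-- 1) Load the data first, with no index to maintain row-by-row...
--    INSERT INTO #TempSoftware SELECT ... (as in the question)

-- 2) ...then build the index once over the whole set, just before the joins.
CREATE CLUSTERED INDEX IX_TempSoftware
ON #TempSoftware (softwareName, softwareVersion);
```

Bounding the name/version columns also helps the joins against the software table, since comparing MAX types forces less efficient comparisons.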

If you think this sort of problem will be recurring, you may want to look at SSIS (http://en.wikipedia.org/wiki/SQL_Server_Integration_Services) -- don't expect the MS documentation to be as helpful as a good Google search; MS doesn't change its spots in this regard. SSIS is worth getting to grips with at some point.

Licensed under: CC-BY-SA with attribution