Why does BULK INSERT insert data in random order?

https://dba.stackexchange.com/questions/262605

26-02-2021

Question

I have a csv file with 350,000 rows. I would like to insert the data into a temporary table in the same order as it appears in the csv file. I am trying to BULK INSERT using:

BULK INSERT ##AllRows
    FROM @FilePath
    WITH
    (
    FIELDTERMINATOR = '\n'
    )

Unfortunately, BULK INSERT inserts the data in random order. On every execution my header ends up in a different row. I ran it on SQL Server 2016. Is it possible that older versions of SQL Server handled ordering differently?

The FIRSTROW option does not recognize the header as the first row of the file. The file does not have any column we can order by. The header is always the first row of the file.

It could be a coincidence, but even with FIRSTROW = 2 my header can still end up in the table; I checked. It looks like the more rows the csv file contains, the more likely it is that the rows are inserted in random order.

Solution

The rows from the file are read in order, and added to the table in the same order.

The issue occurs when you read rows from the table. Without an ORDER BY clause on your SELECT, SQL Server is free to return rows from the table in whatever order is convenient.

Details

The question doesn't provide a definition for the table ##AllRows, but it seems certain the table is a heap (a table without a clustered index). SQL Server reads pages from a heap using Index Allocation Map (IAM) structures. This means data tends to be returned in file and page id order within each IAM chain, which will generally not reflect the order in which data was inserted. This is the underlying cause of the behaviour you are seeing.

Solutions

You need a column to indicate the order of rows in the file, then order by that column when you write your query. Unfortunately, SQL Server does not provide a built-in way to add this 'sequence' column during import.

There are a couple of common workarounds:

1. Pre-process the source file outside SQL Server to add a sequence number to each row. This is the most reliable method.
2. Assign a sequence number during the import.
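
As a rough illustration of the first workaround, here is a minimal Python sketch that writes a copy of the file with a 1-based line number prefixed to every row. The function name `add_sequence_numbers`, the tab delimiter, and the UTF-8 encoding are assumptions made for this example, not part of the original answer; adjust them to match your file.

```python
def add_sequence_numbers(src_path, dst_path, delimiter="\t", encoding="utf-8"):
    """Copy src_path to dst_path, prefixing each line with its 1-based line number.

    Returns the number of lines written. The delimiter and encoding defaults
    are assumptions for this sketch.
    """
    count = 0
    # newline="" leaves the original line terminators (\n or \r\n) untouched,
    # so the row terminator of the numbered file matches the source file.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for count, line in enumerate(src, start=1):
            # The sequence number becomes a new first field on every row.
            dst.write(f"{count}{delimiter}{line}")
    return count
```

Calling `add_sequence_numbers("input.csv", "input_numbered.csv")` (hypothetical file names) produces a file whose first field records the original line position. The import table then needs an extra integer column for that field, and queries can use ORDER BY on it to return rows in file order.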

The second method carries some risk because there is no documented guarantee that this will work reliably in all circumstances. Nevertheless, people have been using this idea successfully for a long time. The general idea is:

1. Add an IDENTITY column to the import table.
2. Create a view over the import table, omitting the IDENTITY column.
3. BULK INSERT into the view.

This will not work with a global temporary table, because a view cannot reference that type of table. You would need to use a regular table (perhaps in tempdb) instead.

Example 1

I used the following script to successfully import a csv file containing the Complete Works of Shakespeare.

The first step is to create a table in tempdb with an extra IDENTITY column:

USE tempdb;
GO
CREATE TABLE dbo.Test
(
    id integer IDENTITY PRIMARY KEY,
    line nvarchar(4000) NOT NULL
);

Now we create a view over that table, omitting the IDENTITY column:

CREATE VIEW dbo.ImportTest
WITH SCHEMABINDING
AS
SELECT
    T.line
FROM dbo.Test AS T;

Finally, we bulk insert into the view:

BULK INSERT dbo.ImportTest
FROM 'C:\Temp\shakespeare.txt'
WITH 
(
    CODEPAGE = '65001',
    DATAFILETYPE = 'char',
    ROWTERMINATOR = '\n'
);

We can now see the first few lines in file order using a SELECT with ORDER BY:

SELECT TOP (20)
    T.id,
    T.line
FROM dbo.Test AS T 
ORDER BY
    T.id ASC;

The result shows the text in the right order.

Example 2

It is also possible to use OPENROWSET with a format file. Using the same example csv file, I was able to import the data using the following format file (saved as shakespeare.xml):

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\r\n"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="line" xsi:type="SQLNVARCHAR" NULLABLE="NO"/>
 </ROW>
</BCPFORMAT>

and:

INSERT dbo.Test
    WITH (TABLOCK)
    (
        line
    )
SELECT
    ORO.line
FROM OPENROWSET
(
    BULK 'C:\Temp\shakespeare.txt',
    FORMATFILE = 'C:\Temp\shakespeare.xml',
    CODEPAGE = '65001'
) AS ORO;

Notice that this method does not require a view, so you could target a global temporary table. The target table still needs the extra IDENTITY column.

Other tips

Guide BULK INSERT through your csv file by using its options, which are documented here:

https://docs.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15

Specifically:

FIRSTROW = first_row -> the first line to import from the file (counting starts at 1)

KEEPIDENTITY -> Specifies that identity value or values in the imported data file are to be used for the identity column. If KEEPIDENTITY is not specified, the identity values for this column are verified but not imported and SQL Server automatically assigns unique values based on the seed and increment values specified during table creation. If the data file does not contain values for the identity column in the table or view, use a format file to specify that the identity column in the table or view is to be skipped when importing data; SQL Server automatically assigns unique values for the column. For more information, see DBCC CHECKIDENT (Transact-SQL).

ORDER ( { column [ ASC | DESC ] } [ ,... n ] ) -> Specifies how the data in the data file is sorted. Bulk import performance is improved if the data being imported is sorted according to the clustered index on the table, if any. If the data file is sorted in a different order, that is other than the order of a clustered index key or if there is no clustered index on the table, the ORDER clause is ignored. The column names supplied must be valid column names in the destination table. By default, the bulk insert operation assumes the data file is unordered. For optimized bulk import, SQL Server also validates that the imported data is sorted.

n Is a placeholder that indicates that multiple columns can be specified.

Edit: Sorry to read that it didn't outright help you - but I found this near the end of said page:

Importing Data from a CSV file

Beginning with SQL Server 2017 (14.x) CTP 1.1, BULK INSERT supports the CSV format, as does Azure SQL Database. Before SQL Server 2017 (14.x) CTP 1.1, comma-separated value (CSV) files are not supported by SQL Server bulk-import operations. However, in some cases, a CSV file can be used as the data file for a bulk import of data into SQL Server. For information about the requirements for importing data from a CSV data file, see Prepare Data for Bulk Export or Import (SQL Server).

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange