Question

I found a strange issue, I'm wondering if anyone knows either why this happens or if it is a known bug (yes I Googled).

I have a database where I have to import CSV files using Access. I'm making some changes to the project to allow UNICODE characters to be into the database. There is some preprocessing involved in VBA by reading and writing using ADODB.Stream.

This ADODB.Stream is the only method of processing UNICODE files that I found to be of a useful level. However, I'm writing UTF-8 files with it and obviously, as it should, it writes a BOM (Byte Order Mark) at the start of the file: .

So far so good Microsoft, way to support standards! Now when this file is saved I want to import it using an ImportExportSpecification. I specifically specify Microsoft's Codepage 65001 which is what the rest of the world calls utf-8. Then try to import and no surprise it works.. Until your file starts with a double quote, then Access complains that the record 2 (2??) could not be parsed.

The unfortunate thing is that I cannot control the source of these files and I'm not involved in the actual imports, it's automated.. So now I either have to remove the BOM, which I don't really want to do.. because I need to read and write the entire file from to disk or in memory and these files are insanely huge.. Or remove the double quote, which is actually necessary for the headers of a file (that I can't control either).

TL;DR: It appears that if your files starts with " you break Access's ImportExport module. Is it a bug, a "feature" or why the .. does it happen?

Ok, so I will remove the BOM, so Christmas doesn't have to start without me. But it's not a solution, it's a band aid. I would very much like to know how to fix this, it's itching me.. :)

Merry Christmas!

Was it helpful?

Solution 2

As Guntram Blohm pointed out:

Maybe the access import module reads the BOM, decides on UTF-16, and has a fallback to UTF-8 if the next 2 bytes are not a valid codepoint. This seems to be the case, it may be a bug, it may be that UTF-8 with BOM was not implemented completely, I don't know.. If you do feel free to drop a comment, I'm still curious.

Anyway, there are 3 ways you can "solve" the issue:

  1. Remove the BOM (rewriting the entire file, could be done in memory rather than reading & writing from disk) see here: http://axlr8r.blogspot.nl/2011/05/how-to-export-data-into-utf-8-without.html

  2. Remove the double quote or place an extra character after the BOM if you can.

  3. Save the file as UTF-16 which Access seems to have better support for. Compared to an ANSI file a UTF-16 file will be twice the size. Whereas UTF-8 would only add extra bytes per character when special characters are met. In my case the file is in an intermediary state, I'm not keeping it, after importing I delete the file. So I chose to save the file as UTF-16 and import it with Codepage 1200 (which is UTF-16 for the rest of the world).

OTHER TIPS

It appears that if your files starts with " you break Access's ImportExport module.

No, that is not true, at least for reasonably current versions of Access. Given the CSV file "bomTest.csv":

bomTest.png

and the Import Specification

ImportSpec.png

the following statement imports the file correctly in Access 2010

DoCmd.TransferText _
        TransferType:=acImportDelim, _
        SpecificationName:="BomTest Import Specification", _
        TableName:="bomTest", _
        FileName:="C:\__tmp\bomTest.csv", _
        HasFieldNames:=True

Edit re: comment

Having a BOM in your UTF-8 CSV file when using the ImportExportSpecification method is not necessarily a show-stopper, either. Using the exact same sample CSV file as above, the following code imports it without incident:

Public Sub bomImportTest()
    Dim ies As ImportExportSpecification, ies_XML As String
    On Error Resume Next
    Set ies = CurrentProject.ImportExportSpecifications("zzzTemp")
    If Err.Number = 0 Then
        ies.Delete
    End If
    Set ies = Nothing
    On Error GoTo 0
    ies_XML = ""
    ies_XML = ies_XML & "<?xml version=""1.0"" encoding=""utf-8"" ?>" & vbCrLf
    ies_XML = ies_XML & "<ImportExportSpecification Path=""C:\Users\Gord\Desktop\bomTest.csv"" xmlns=""urn:www.microsoft.com/office/access/imexspec"">" & vbCrLf
    ies_XML = ies_XML & "      <ImportText TextFormat=""Delimited"" FirstRowHasNames=""true"" FieldDelimiter="","" TextDelimiter=""{DoubleQuote}"" CodePage=""65001"" Destination=""BomTest"" >" & vbCrLf
    ies_XML = ies_XML & "           <NumberFormat DecimalSymbol=""."" />" & vbCrLf
    ies_XML = ies_XML & "             <Columns PrimaryKey="""">" & vbCrLf
    ies_XML = ies_XML & "                        <Column Name=""Col1"" FieldName=""column 1"" Indexed=""NO"" SkipColumn=""false"" DataType=""Text"" Width=""14"" />" & vbCrLf
    ies_XML = ies_XML & "                        <Column Name=""Col2"" FieldName=""column 2"" Indexed=""NO"" SkipColumn=""false"" DataType=""Text"" Width=""14"" />" & vbCrLf
    ies_XML = ies_XML & "             </Columns>" & vbCrLf
    ies_XML = ies_XML & "   </ImportText>" & vbCrLf
    ies_XML = ies_XML & "</ImportExportSpecification>" & vbCrLf
    Set ies = CurrentProject.ImportExportSpecifications.Add("zzzTemp", ies_XML)
    ies.Execute
    Debug.Print "Import complete."
End Sub
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top