Encoding problems with dBase III .dbf files on different machines

https://stackoverflow.com/questions/3845760

27-09-2019

Question

I'm using C# and .NET 3.5, trying to import some data from old dbf files using ODBC with the Microsoft dBase Driver.

The dbf files are in dBase III format and use ibm850 encoding for strings.

When I run my program on my machine, all string data read from OdbcDataReader comes out already converted to Unicode (a .NET string, so UTF-16). I save it as UTF-8 and everything is OK. But when I run the same program on an XP box, some characters aren't converted correctly, 'Õ' for example. There may be others too. Characters like 'Ä', 'Ö' and 'Ü' are OK. Maybe ODBC or the driver uses the machine's culture settings or something similar, and that messes everything up.

Is it possible to read strings from the database as binary, maybe with functions like CONVERT or CAST? Or where could I find a reference for the SQL functions and syntax that work with this dBase driver or other drivers? I searched around and couldn't find anything. I feel so blind when using ODBC and SQL.

Right now I'm using a temporary hack that replaces all σ's with Õ's.

Thanks!

Example code:

System.Data.Odbc.OdbcConnection oConn = new System.Data.Odbc.OdbcConnection();
oConn.ConnectionString = @"Driver={Microsoft dBase Driver (*.dbf)};DriverID=277;Dbq=" + dbPath + ";";
oConn.Open();

System.Data.Odbc.OdbcCommand oCmd = oConn.CreateCommand();
oCmd.CommandText = @"SELECT name FROM " + dbPath + "TABLE.DBF";

System.Data.Odbc.OdbcDataReader reader = oCmd.ExecuteReader();
reader.Read();

byte[] buf = Encoding.UTF8.GetBytes(reader.GetString(0));
// Dispose the writer so the bytes are flushed and the file handle is released.
using (BinaryWriter writer = new BinaryWriter(File.Open(@"C:\DBF\Test.txt", FileMode.Create)))
{
    writer.Write(buf);
}

Result:

E5 in dbf (Õ in 850)

Test.txt on pc1: C3 95 (Õ in UTF-8)

Test.txt on pc2: CF 83 (σ in UTF-8)

Solution

If you are still having a problem with these files, I may be able to help you.

What is in the "codepage byte" aka "language driver id" (LDID) at offset 29 (decimal) in the file?
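Checking that byte takes only a few lines. The following is a minimal Python sketch, not the reader described in this answer: the function names and the tiny lookup table are illustrative assumptions, and the table covers only a handful of commonly documented LDID values. A value of 0 usually means no codepage was recorded, which is common for dBase III files.

```python
# Partial, illustrative mapping of LDID byte -> Python codec name.
# Real readers carry much longer tables; this one is an assumption for the sketch.
LDID_TO_CODEC = {
    0x01: "cp437",   # US MS-DOS
    0x02: "cp850",   # International MS-DOS
    0x03: "cp1252",  # Windows ANSI
    0xC8: "cp1250",  # Eastern European Windows
    0xC9: "cp1251",  # Russian Windows
}


def read_ldid(path):
    """Return the language driver ID byte stored at offset 29 of a DBF header."""
    with open(path, "rb") as f:
        header = f.read(32)  # the fixed part of a DBF header is 32 bytes
    if len(header) < 32:
        raise ValueError("file is too short to contain a DBF header")
    return header[29]


def codec_for(path):
    """Return a codec name for the file's LDID, or None if it is 0 or not in the table."""
    return LDID_TO_CODEC.get(read_ldid(path))
```

If `codec_for` returns None, the header gives no usable hint and the encoding has to be determined some other way.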

I have a Python-based DBF reader which can read just about any field data type and just about any codepage -- it has a long list compiled from various sources of mappings from codepage byte to codepage number. Options are (1) believe the LDID, deliver Unicode (2) ignore the LDID, deliver undecoded bytes (3) override the LDID, decode with a specific codepage into Unicode. The Unicode can of course be then encoded into UTF-8.
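To make the three options concrete, here is a small Python sketch of how such a switch could look. It is not the actual reader's API: `decode_field`, its `mode` values and the three-entry table are hypothetical names chosen for illustration.

```python
# Partial, illustrative LDID table (an assumption for this sketch).
LDID_TO_CODEC = {0x01: "cp437", 0x02: "cp850", 0x03: "cp1252"}


def decode_field(raw, ldid, mode="believe", override=None):
    """Decode the raw bytes of one character field.

    mode="believe":  trust the LDID and return a Unicode string
    mode="raw":      ignore the LDID and return the bytes untouched
    mode="override": ignore the LDID and decode with the codec in `override`
    """
    if mode == "raw":
        return raw
    if mode == "override":
        if override is None:
            raise ValueError("override mode needs a codec name")
        return raw.decode(override)
    if mode == "believe":
        codec = LDID_TO_CODEC.get(ldid)
        if codec is None:
            raise LookupError("no codec known for LDID 0x%02X" % ldid)
        return raw.decode(codec)
    raise ValueError("unknown mode: %r" % mode)
```

For a file whose header claims codepage 437 but whose text was really written in 850, `decode_field(b"\xe5", 0x01, "override", "cp850")` gives 'Õ', and calling `.encode("utf-8")` on the result gives the bytes C3 95.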

The DBF reader also does a whole lot of reasonableness cross-checks which may help investigating why VFP thinks the file is corrupt.

How do you know that it's using IBM850? Another piece of Python code that I have is a prototype encoding detector. Unlike detectors such as 'chardet', which are derived from Mozilla code, it is not web-centric and can happily recognise most old DOS codepages; this may help.

An observation: the Greek lowercase letter sigma (σ) is 0xE5 in codepage 437, which was succeeded by codepage 850, so "pc2" seems a little outdated ... In other words, pc2 appears to be decoding the file as cp437 rather than cp850.
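The byte values from the question can be checked with Python's built-in codecs. This is only a quick verification sketch; `utf8_hex` is a helper name made up here.

```python
def utf8_hex(raw, codec):
    """Decode `raw` with `codec`, then show the UTF-8 bytes of the result as hex."""
    text = raw.decode(codec)
    return " ".join("%02X" % b for b in text.encode("utf-8"))


# The single byte stored in the dbf file, per the question.
RAW = b"\xe5"

as_850 = utf8_hex(RAW, "cp850")  # decoding as codepage 850 (what pc1 does)
as_437 = utf8_hex(RAW, "cp437")  # decoding as codepage 437 (what pc2 seems to do)
```

Here `as_850` is "C3 95" (Õ) and `as_437` is "CF 83" (σ), matching the two Test.txt results.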

If you think I can be of any help, feel free to e-mail me at insert_punctuation("sjmachin", "lexicon", "net")

OTHER TIPS

Try this code.

var oConn = new System.Data.Odbc.OdbcConnection();
oConn.ConnectionString = "Driver={Microsoft Visual FoxPro Driver};SourceType=DBF;SourceDB=" + dbPath;
oConn.Open();
var oCmd = oConn.CreateCommand();
oCmd.CommandText = @"SELECT name FROM " + dbPath + "TABLE.DBF";
var reader = oCmd.ExecuteReader();
reader.Read(); 
byte[] A = Encoding.GetEncoding(Encoding.Default.CodePage).GetBytes(reader.GetString(0));
string p = Encoding.Unicode.GetString((Encoding.Convert(Encoding.GetEncoding(850), Encoding.Unicode, A)));

When you read a dbf file, you have to take three encodings into account:

1. The encoding in which the database provider reads the file. It depends on the provider and the current operating system. Use this encoding to turn the string back into a byte array. For example, on my PC:

when I use the connection string "Data Source={0}; Provider=Microsoft.JET.OLEDB.4.0;Extended Properties=DBase IV;User ID=;Password=;", strings are read using code page 866 (Russian MS-DOS)
when I use the connection string "Data Source={0}; Provider=vfpoledb.1;Exclusive=No;Collating Sequence=Machine", strings are read using Encoding.Default (code page 1251)

2. The encoding in which the strings were written to the dbf file. It can be read from byte 29 of the dbf file, but in fact it doesn't matter how the file is marked; you just need to know which encoding was actually used. Use this as the source encoding during string conversion.

3. The encoding to which the string should be converted. This is usually UTF-8.

So string conversion should look like this:

byte[] bytes = Encoding.GetEncoding(codePage1).GetBytes(reader.GetString(0));

string result = Encoding.UTF8.GetString((Encoding.Convert(Encoding.GetEncoding(codePage2), Encoding.UTF8, bytes)));
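The same round trip can be sketched in a few lines of Python, which makes the three steps easy to test in isolation. The function name `repair` and its parameter names are illustrative assumptions. The approach works only if the provider's decoding was lossless; if it replaced unknown characters (for example with '?'), the original bytes cannot be recovered this way.

```python
def repair(text, provider_codec, actual_codec, target_codec="utf-8"):
    """Undo a wrong decode and return the text re-encoded in the target encoding.

    provider_codec: encoding the provider used when it read the file (encoding 1)
    actual_codec:   encoding the strings were really written in (encoding 2)
    target_codec:   encoding wanted for the output (encoding 3)
    """
    raw = text.encode(provider_codec)   # recover the original bytes from the file
    fixed = raw.decode(actual_codec)    # decode them with the correct codepage
    return fixed.encode(target_codec)   # encode for output


# The case from the question: the provider decoded 0xE5 as cp437 (σ),
# but the file was written in cp850, where 0xE5 is Õ.
result = repair("\u03c3", "cp437", "cp850")
```

Here `result` is the two bytes C3 95, the UTF-8 form of 'Õ'. If the misread string contains a character that does not exist in `provider_codec`, `encode` raises UnicodeEncodeError, which is a useful signal that the assumed provider codepage is wrong.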

Have you tried using the Visual FoxPro OLE DB provider ("VFPOleDb") instead?

Licensed under: CC-BY-SA with attribution
