Question

I am not even sure if I am framing this question right, but I will try - I have a bunch of huge text files generated from an Oracle export on a Linux system. Each file is about 30 GB in size, and I have about 50 of them.

The goal is to load this data into Azure SQL Data Warehouse. Given the size of the data, BCP is not the right approach, so I had to use Polybase.

After converting the files from ASCII to UTF-8 encoding, I ran into an issue while querying the external tables: Polybase doesn't play well with fixed-width text files in which a single record is broken across multiple lines.

The text file looks something like this:

101,102,103,104,105,106,107
108,108,109,110,111,112,113
114,115,116,117,118,119,120
121,122,123

--nothing here, just a blank line

201,202,203,204,205,206,207
208,209,210,211,212,213,214
215,216,217

Polybase tries to process the first line (101 through 107) as a complete row and errors out, complaining that there are not enough columns in the file to process.

Here is what I think is happening: because of the fixed width and the embedded line breaks, Polybase treats every line break as a row delimiter.

How do I convert the file so that it looks like the output below:

101,102,103,104,105,106,107,108,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123{CR}{LF}
201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217{CR}{LF}

EDIT: Here is the sample data from one of the files. I opened it in Git Bash on a Windows VM.

These files are supposed to have 167 columns, with , as the column separator. The problem is that, since each row spans multiple lines, it's difficult to process them through Polybase external tables.


Solution 3

I ended up using sed and tr to cleanse the files: each run of padding spaces is replaced with a marker string, every line feed is deleted, the marker characters are then translated back into line feeds, and the resulting blank lines are dropped.

zcat myfile.txt.gz | sed -r 's/[ ]+/vin/g'|tr -d '\n'|tr 'vinvin' '\n'|grep -v '^$' > myfile.txt

This took care of the formatting issues with the source files. Once the files were uploaded to Azure blob storage, the rest was easy: I created an external table pointing to the files on blob storage via Polybase and then created internal tables using CREATE TABLE dbo.internal AS SELECT * FROM blob.external. An Azure SQL Data Warehouse instance with 700 DWU capacity was able to load around 50 million rows from the external table in about 5 minutes.
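
For anyone repeating this, the Polybase setup on the warehouse side looks roughly like the sketch below. It is only an outline: the credential, data source, schema and table names are placeholders, the real external table declares all 167 columns, and you will need to adjust locations and secrets to your own storage account.

    -- Sketch only: placeholder names and secrets, not the exact objects used.
    CREATE MASTER KEY;   -- once per database; required for the scoped credential

    CREATE DATABASE SCOPED CREDENTIAL BlobCredential
    WITH IDENTITY = 'user',
         SECRET = '<storage account access key>';

    CREATE EXTERNAL DATA SOURCE AzureBlob
    WITH (
        TYPE = HADOOP,
        LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
        CREDENTIAL = BlobCredential
    );

    CREATE EXTERNAL FILE FORMAT CommaDelimitedUtf8
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', ENCODING = 'UTF8')
    );

    -- Assumes a schema named blob already exists (CREATE SCHEMA blob;)
    CREATE EXTERNAL TABLE blob.[external]
    (
        col1 INT,
        col2 INT,
        col3 INT
        -- ... the real table has 167 columns ...
    )
    WITH (
        LOCATION = '/cleansed/',
        DATA_SOURCE = AzureBlob,
        FILE_FORMAT = CommaDelimitedUtf8
    );

    -- CTAS materialises the data as an internal, distributed table
    CREATE TABLE dbo.internal
    WITH (DISTRIBUTION = ROUND_ROBIN)
    AS
    SELECT * FROM blob.[external];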

OTHER TIPS

Polybase is powerful but not really that sophisticated, so it will not be able to deal with this weird format. As I see it, you have three options:

  1. Correct the file format at source. Instead of this weird mix of fixed-width and delimited formats, use a standard file format such as .csv. The format you have, where the column delimiter is either a line break or a comma, strikes me as genuinely strange. Are there any tools that can read it easily? Is this a common format where you work?
  2. Import the file using one of the delimiters to get the data into rows, then shred those rows based on the other delimiter. I started trying this with your sample data but did not get very far. Do different rows really have different numbers of columns? In your sample data, row 1 has 24 columns and row 2 has 17 columns. Please provide a small sample file (e.g. via gist) that accurately represents your data.
  3. Write a highly customised import routine. The point of using standard data interchange formats like csv, tab-delimited, pipe-delimited, XML, JSON etc. is that you should not have to write a highly custom routine every time you want to import some data. However, this could be an option if you cannot change your file at source or import it in stages. I've been working with Azure Data Lake Analytics (ADLA) and U-SQL recently, and that might be able to do this.

Please try to answer my questions above and provide a sample file, and I will try to help.

According to a hex editor, your sample file has a single line feed (0A) at the end of each line and two line feeds as the delimiter between records:

hex view of sample file

A U-SQL custom extractor might be able to handle this file, but I wonder if we will run into issues with the full 30 GB file.

Instructions

  1. Set up an Azure Data Lake Analytics (ADLA) account if you don't already have one.
  2. Create a new U-SQL project in Visual Studio - you will need the ADLA tools.
  3. Add a U-SQL script and add the text below to the U-SQL code-behind file:

    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using Microsoft.Analytics.Interfaces;

    namespace Utilities
    {
        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class MyExtractor : IExtractor
        {
            // Encoding and delimiters used to split the input into rows
            private readonly Encoding _encoding;
            private readonly byte[] _row_delim;
            private readonly char _col_delim;

            public MyExtractor()
            {
                _encoding = Encoding.UTF8;
                _row_delim = _encoding.GetBytes("\n\n");
                _col_delim = '|';
            }

            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                string s = string.Empty;
                string x = string.Empty;

                // Split the input on the double line feed that separates records
                foreach (var current in input.Split(_row_delim))
                {
                    using (StreamReader streamReader = new StreamReader(current, this._encoding))
                    {
                        while ((s = streamReader.ReadLine()) != null)
                        {
                            // Strip any line feeds (ReadLine already removes them)
                            //s = s.Replace("\n", "");

                            // Concatenate the lines belonging to this record
                            x += s;
                        }

                        // Emit the reassembled record as a single string column
                        output.Set<string>(0, x);
                        yield return output.AsReadOnly();

                        // Reset for the next record
                        x = string.Empty;
                    }
                }
            }
        }
    }
    
  4. Process the file using the custom extractor:

    @input =
    EXTRACT col string
    FROM "/input/input42_2.txt"
    USING new Utilities.MyExtractor();
    
    
    // Output the file
    OUTPUT @input
    TO "/output/output.txt"
    USING Outputters.Tsv(quoting : false);
    

This produced a cleaned-up file that I was able to import using Polybase:

Polybase result

Good luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange