Question

I'm using PolyBase to import a Parquet file.

Over time, we are likely to add or remove named columns in the file.

When I add an additional column, I get the below error:

External file access failed due to internal error: 'File test.parquet: HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader: HadoopExecutionException: Column count mismatch. Source file has 16 columns, external table definition has 15 columns.'

This is because I added an additional column that wasn't in the external table definition.

As parquet contains a file schema and the external table knows the name of each column, is there a way it can be set to ignore the extra unused column?


Solution

PolyBase does not have this ability. The columns in the source file and in the external table definition must match for the load to work. Simply add or remove the column(s) in your external table definition so that it matches the file. Then, when you physically create your table in the database, e.g. via CTAS, include or leave out the additional column(s) in the SELECT list of your CTAS statement.
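As a rough sketch of that approach, assuming a file that originally had three columns and has gained a fourth (shortened from the 15 and 16 columns in the question): every name here (dbo.ext_test_parquet, dbo.test_parquet, MyDataSource, MyParquetFormat, col1 to col4), the data types, the location and the distribution options are placeholders of mine, not taken from the question, and I haven't run this against your setup.

```sql
-- Hypothetical names and types throughout; adjust to your own objects.
-- 1. Recreate the external table so it lists every column in the file,
--    including the newly added one.
DROP EXTERNAL TABLE dbo.ext_test_parquet;

CREATE EXTERNAL TABLE dbo.ext_test_parquet
(
    col1 INT,
    col2 NVARCHAR(100),
    col3 DATETIME2,
    col4 NVARCHAR(100)          -- the column that was added to the file
)
WITH
(
    LOCATION    = '/folder1/',      -- assumed folder within the data source
    DATA_SOURCE = MyDataSource,     -- assumed existing external data source
    FILE_FORMAT = MyParquetFormat   -- assumed existing Parquet file format
);

-- 2. Create the physical table with CTAS, selecting only the columns you
--    want. col4 is left out here, so the physical table never sees it.
CREATE TABLE dbo.test_parquet
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT col1, col2, col3
FROM dbo.ext_test_parquet;
```

The external table still has to change whenever the file changes, but anything downstream of the CTAS only sees the columns you chose to select.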

Alternatively, consider the new COPY INTO statement, which does let you specify a column list, e.g.

COPY INTO test_parquet ( col1, col2, col3, col4 )
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/*.parquet'
WITH (
    FILE_FORMAT = myFileFormat,
    CREDENTIAL = ( IDENTITY = 'Shared Access Signature', SECRET = '<Your_SAS_Token>' )
)

However, please note that i) this feature is currently in preview, ii) it is for copying into physical tables (i.e. not external tables), and iii) I haven't tested the column list specifically with Parquet.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange