Question

I am pulling in data from all sorts of sources with Sqoop, and I am noticing that many things can go wrong. Several times now, certain columns have contained delimiter characters, which produce extra unwanted rows, which in turn produce unwanted NULL values. The offending characters range from the usual suspects, like the Windows line ending \r\n, all the way to the Icelandic thorn (þ).

What is the best practice for dealing with these issues?

I have considered selecting everything by column and removing the evil character by use of a REPLACE-type method, but it feels like there should be a better way.

Solution

Newer versions of Sqoop provide the --hive-drop-import-delims and --hive-delims-replacement options.

See https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html
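As a rough sketch of how these two options might be used, the script below prints two Hive imports rather than running them. The JDBC URL, user name, and table name are hypothetical placeholders, and the small run helper is only a dry-run convenience added for illustration; it is not part of Sqoop.

```shell
#!/bin/sh
# Hypothetical connection details (assumptions); replace with your own.
JDBC_URL="jdbc:mysql://db.example.com/sales"
DB_USER="etl_user"
TABLE="customers"

# Dry-run helper (illustrative only): prints the command by default.
# Set DRY_RUN=0 to execute it, which requires sqoop on the PATH.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Variant 1: drop \n, \r and \001 from string fields during the Hive import.
run sqoop import \
  --connect "$JDBC_URL" \
  --username "$DB_USER" -P \
  --table "$TABLE" \
  --hive-import \
  --hive-drop-import-delims

# Variant 2: replace those characters with a string of your choice
# (a single space here) instead of dropping them.
run sqoop import \
  --connect "$JDBC_URL" \
  --username "$DB_USER" -P \
  --table "$TABLE" \
  --hive-import \
  --hive-delims-replacement ' '
```

Use one option or the other for a given import, depending on whether you would rather lose the characters or keep a visible placeholder in their place.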

These options handle \n, \r, and \001 in string fields during a Hive import: the first drops them, and the second replaces them with a string you supply. For any other character, your workaround with the REPLACE function is the way to go.
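The REPLACE workaround could look something like the sketch below, which uses a Sqoop free-form query. Everything specific here is an assumption for illustration: the table and column names (customers, id, notes), the MySQL-style REPLACE and CHAR functions, the HDFS target directory, and the choice to swap each unwanted character for a space. Other databases spell these functions differently.

```shell
#!/bin/sh
# Hypothetical connection details (assumptions); replace with your own.
JDBC_URL="jdbc:mysql://db.example.com/sales"
DB_USER="etl_user"

# MySQL-flavoured SQL (assumption). CHAR(13) is \r and CHAR(10) is \n.
# Each nested REPLACE swaps one unwanted character for a space.
# Sqoop requires the literal token $CONDITIONS in a free-form query,
# so the dollar sign is escaped inside the double-quoted string.
QUERY="SELECT id, REPLACE(REPLACE(REPLACE(notes, CHAR(13), ' '), CHAR(10), ' '), 'þ', ' ') AS notes FROM customers WHERE \$CONDITIONS"

# Print the command for review. Remove the leading 'echo' to run it,
# which requires sqoop on the PATH.
echo sqoop import \
  --connect "$JDBC_URL" \
  --username "$DB_USER" -P \
  --query "$QUERY" \
  --split-by id \
  --target-dir /tmp/sqoop/customers_clean \
  --hive-import \
  --hive-table customers
```

A free-form query needs --target-dir, plus either --split-by or a single mapper (-m 1), because Sqoop cannot infer how to partition the work from a table definition.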

Licensed under: CC-BY-SA with attribution