Question

So basically, I have a column containing multiple emails. Some of them are invalid and contain characters or carriage returns that are not allowed.

Below is how I find the invalid emails with a SELECT statement, but I have no clue how to fix them individually. For example, if a carriage return is found, I know I'd use a REPLACE statement, and the same goes for any special characters. But would that involve writing a separate query for each possible case?

Basically, what I'm asking for is the most efficient way to iterate through my table, replacing any characters in an email address that match one of those conditions.

select /*+ parallel(a,12) full(a) */
       a.row_id, a.par_row_id, a.attrib_01, a.created_by, a.last_upd_by
from s_contact_xm a
where a.type = 'Email'
and (a.attrib_01 IS NULL
or a.attrib_01 like '% %'
or a.attrib_01 like '%@%@%'
or a.attrib_01 like '%..%'
or a.attrib_01 like '%;%'
or a.attrib_01 like '%:%'
or a.attrib_01 not like '%@%'
or a.attrib_01 like '%/%'
or a.attrib_01 like '%\%'
or a.attrib_01 like '%|%'
or a.attrib_01 like '%@.%'
or a.attrib_01 like '%@'
or a.attrib_01 like '%.'
or a.attrib_01 like '%(%'
or a.attrib_01 like '%)%'
or a.attrib_01 like '%<%'
or a.attrib_01 like '%>%'
or a.attrib_01 like '%#%'
or a.attrib_01 like '%"%'
or a.attrib_01 like '%.@%'
or a.attrib_01 like '.%'
or INSTR(a.attrib_01, CHR(13)) > 0
or INSTR(a.attrib_01, CHR(10)) > 0)
and a.created_by = '1-XAAX5P';

Solution

The thing is, you've got several different categories of potential error. Some are fixable typos; some are unfixable typos; and some are just wrong. Now, is it possible to come up with some bulletproof rules for determining the category of any given error?

Perhaps.

For instance, you could replace every occurrence of '..' with '.'. Likewise, you could replace the carriage returns with null (that is, remove them). Those are fixable typos.

But if somebody has included " in an email address, there's no way you can be sure what they actually meant to type. Do you assume they meant 2 and didn't notice they were also pressing [shift], or do you replace it with null (i.e. remove it)? That is not a fixable typo (but you might decide a guess is good enough).

If the email address doesn't contain a @ then it's not a valid email address and there's no way to fix it.

So you probably need several separate UPDATE statements. Run one that uses TRANSLATE() on the strings where you're going to attempt a one-for-one character replacement. The same call handles the things you want to replace with null, such as those carriage returns.

translate(attrib_01, '()"'||chr(13), '902')

You'll need several passes to transform multi-character strings, e.g.

replace(attrib_01, '..', '.')  

Then you'll probably want to trim leading or trailing dots:

trim(both '.' from attrib_01 ) 
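Put together, the passes above might look like the following sketch. It assumes the table and columns from the question (s_contact_xm, attrib_01, type = 'Email'). Adding chr(10) to the TRANSLATE() call and the WHERE filters are my own assumptions about sensible guards, so adjust them to your rules and try it on a copy of the data first.

```sql
-- Pass 1: one-for-one fixes ( -> 9, ) -> 0, " -> 2, plus removal of CR/LF.
-- The filter skips rows that TRANSLATE would leave unchanged.
update s_contact_xm
set    attrib_01 = translate(attrib_01, '()"' || chr(13) || chr(10), '902')
where  type = 'Email'
and    attrib_01 <> translate(attrib_01, '()"' || chr(13) || chr(10), '902');

-- Pass 2: collapse doubled dots. Re-run until it reports 0 rows updated,
-- because '...' only shrinks to '..' in a single pass.
update s_contact_xm
set    attrib_01 = replace(attrib_01, '..', '.')
where  type = 'Email'
and    attrib_01 like '%..%';

-- Pass 3: strip leading and trailing dots.
update s_contact_xm
set    attrib_01 = trim(both '.' from attrib_01)
where  type = 'Email'
and    (attrib_01 like '.%' or attrib_01 like '%.');
```

Running the passes in this order means the dots exposed by removing other characters are still caught by the later passes.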

Finally, you'll need to report on all those addresses you cannot fix, such as values with no (or several) "strudels" (@ signs).
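A report of the unfixable rows could be as simple as the query below. This is a sketch that reuses the table and column names from the question; which conditions count as "unfixable" is an assumption you should tune.

```sql
-- Rows that no automatic clean-up can repair:
-- missing value, no @ at all, or more than one @.
select row_id, par_row_id, attrib_01
from   s_contact_xm
where  type = 'Email'
and    (attrib_01 is null
        or attrib_01 not like '%@%'
        or attrib_01 like '%@%@%');
```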

You may be able to compress some of these rules into fewer steps using REGEXP_REPLACE. The regular expressions will get extremely complicated. It will be easier to make things correct using the old skool Oracle replace functions. I suggest you only use regex if you really need the performance. Even then you will still need to make more than one pass through the data.
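As an illustration of what such a compression could look like, the sketch below folds the dot handling into two nested REGEXP_REPLACE calls. The sample value is made up, and the patterns are only one possible choice.

```sql
-- Hypothetical sample value. The inner call collapses runs of dots,
-- the outer call strips a single leading or trailing dot that remains.
select regexp_replace(
           regexp_replace('.joe..smith@example...com.', '\.{2,}', '.'),
           '^\.|\.$', '') as cleaned
from   dual;
```

This should return joe.smith@example.com. Unlike replace(attrib_01, '..', '.'), the '\.{2,}' pattern collapses a run of any length in one go, at the cost of being harder to read.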


"'()"' does this mean nulls and parenthesis? "

The Oracle documentation is comprehensive, free and online. You can read all about REPLACE(), TRANSLATE() and TRIM() there.

But I'll explain the TRANSLATE() call a bit more. This function substitutes each occurrence of a character from the second argument (the "from" string) with the character at the same position in the third argument (the "to" string). Any characters which lack a match are discarded. Hence ( is replaced with 9, ) is replaced with 0 and " is replaced with 2 (look at a QWERTY keyboard to understand why). chr(13) (carriage return) has no match and so is discarded (or replaced with NULL, if you prefer to think of it that way).
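To see the TRANSLATE() mapping in action, you can try it against DUAL. The input string here is a made-up example.

```sql
-- ( -> 9, ) -> 0, " -> 2; chr(13) has no counterpart and is removed.
select translate('joe()"@example.com' || chr(13),
                 '()"' || chr(13),
                 '902') as cleaned
from   dual;
```

This should return joe902@example.com, with the trailing carriage return gone.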


Thinking about it, you could deploy a CASE expression in the UPDATE's SET clause, to apply different REPLACE(), TRIM() and TRANSLATE() calls in one execution. It depends on how impenetrable you want your code to be :)
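A rough sketch of that CASE approach, again assuming the table and columns from the question. The branch order and conditions are my own choices, not a tested rule set.

```sql
update s_contact_xm
set    attrib_01 =
       case
           -- stray ( ) " or CR/LF: map or delete them character by character
           when attrib_01 <> translate(attrib_01, '()"' || chr(13) || chr(10), '902')
               then translate(attrib_01, '()"' || chr(13) || chr(10), '902')
           -- doubled dots
           when attrib_01 like '%..%'
               then replace(attrib_01, '..', '.')
           -- leading or trailing dot
           when attrib_01 like '.%' or attrib_01 like '%.'
               then trim(both '.' from attrib_01)
           -- nothing to fix: keep the value as it is
           else attrib_01
       end
where  type = 'Email';
```

Bear in mind that CASE returns only the first matching branch, so a row with more than one kind of problem gets one fix per execution. You would still re-run the statement until nothing changes.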

Other suggestions

You'll find many links on validating emails out there. This is not meant to be a copy/paste solution or to cover all cases for emails; it just shows the approach.

I'd use regexp_replace to remove anything that is NOT alphanumeric or in a list of additional acceptable characters (like @ or .).

Modify this for your rules. It shows the cleanup of a string with strange or non-printable chars:

select regexp_replace('A^b\c@de' || chr(9) || 'f.com', '[^[:alnum:]@.]','') from dual;

Abc@def.com

In an update statement:

update my_table
set email = regexp_replace(email, '[^[:alnum:]@.]','');

Full example (11gR2):

SQL> create table t1
(
email varchar2(100)
)
Table created.
SQL> insert into t1 values ('a^bc@#.com')
1 row created.
SQL> insert into t1 values ('a\*bc' || chr(10) || '.net')
1 row created.
SQL> commit
Commit complete.
SQL> select * from t1

EMAIL                                                                          
--------------------------------------------------------------------------------
a^bc@#.com                                                                     
a\*bc                                                                          
.net                                                                           


2 rows selected.

SQL> update t1 set email = regexp_replace(email, '[^[:alnum:]@.]','')
2 rows updated.

SQL> commit
Commit complete.
SQL> select * from t1

EMAIL                                                                           
--------------------------------------------------------------------------------
abc@.com                                                                       
abc.net                                                                         

2 rows selected.

Note that this doesn't enforce any strict email rules (abc@.com above is still not a valid address). It simply removes characters outside the accepted range, which is what the OP was asking for.
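If you want to list what is still suspicious after the clean-up, a loose shape check along these lines may help. The pattern is a deliberately simple assumption (alphanumerics and dots, one @, at least one dot in the domain), not a full email validator.

```sql
-- Rows in t1 that still don't look like local@domain.tld after the clean-up.
select email
from   t1
where  email is null
or     not regexp_like(email, '^[[:alnum:].]+@[[:alnum:]]+(\.[[:alnum:]]+)+$');
```

Against the two cleaned rows in the example, this should return both: abc@.com has nothing between the @ and the dot, and abc.net has no @ at all.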

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow