Question

I have a string and I want to find out whether it starts with \U. Here is an example:

myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'

I was trying this:

myStr.startswith('\\U')

but I get False.

How can I detect \U in a string?

The larger picture:

I have a list of strings. Most of them are normal English words, but a few are similar to what I have shown in myStr. How can I distinguish them?


Solution 2

Your string:

myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'

is not foreign-language text. It is five Unicode characters (emoji), whose code points are, in order: U+1F64C, U+1F60D, U+1F4A6, U+1F445, U+1F4AF.

If you want to get strings that only contain 'normal' characters, you can use something like this:

import re
if re.search(r'[^A-Za-z0-9\s]', myStr):
    # String contains 'weird' characters; handle it here.
    pass

Note that this will also trip on characters like é, which will sometimes be used in English on words with a French origin.
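
For the "larger picture" of sorting a whole list, the same pattern can be applied to each string in turn. This is only a sketch: the names NON_PLAIN, words, plain, and other are made up for illustration, and the sample list is invented. Keep in mind that the pattern also flags ordinary punctuation, so you may want to widen the character class.

```python
import re

# Same pattern as above: anything that is not an ASCII letter, digit, or whitespace.
NON_PLAIN = re.compile(r'[^A-Za-z0-9\s]')

# Hypothetical sample data, for illustration only.
words = ['hello', 'two words', '\U0001f64c\U0001f60d', 'caf\u00e9']

# Split the list into strings that match the pattern and strings that do not.
plain = [w for w in words if not NON_PLAIN.search(w)]
other = [w for w in words if NON_PLAIN.search(w)]

print(plain)
print(other)
```

As expected from the note above, 'café' lands in the second list along with the emoji string.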

OTHER TIPS

The original string does not contain the characters \U. It contains the Unicode escape sequence \U0001f64c, which the interpreter turns into a single Unicode character.

Therefore, it does not make sense to try to detect \U in the string you have given.

Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".

It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.

myStr.startswith('\U0001f64c')
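
A quick way to convince yourself that the escape sequences have already been converted is to inspect the string directly. This sketch assumes Python 3.3 or later, where each of these characters counts as one element of the string; the variable names other than myStr are just for illustration.

```python
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'

# Five characters, not fifty: each \UXXXXXXXX escape became one code point.
length = len(myStr)

# There is no backslash anywhere in the string, so '\\U' cannot match.
has_backslash = '\\' in myStr

# Comparing against the actual first character does work.
starts_with_first = myStr.startswith('\U0001f64c')

print(length, has_backslash, starts_with_first)
```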

Note that if you define the string with a literal \U, for example as a raw string like this, you can detect it just fine. Based on some experimentation, I believe Python 2.7.6 behaves this way by default for plain (non-unicode) string literals.

myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.

Update: The OP requested a way to convert from the Unicode string into the raw string above. I will show the solution in two steps.

First observe that we can view the raw hex for each character like this.

>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']

Next, we format it by using a format string.

myChars = [ord(x) for x in myStr]  # code point of each character
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(myChars)
output.startswith("\\U") # Returns True.

Note, of course, that since we are converting a Unicode string and formatting it this way deliberately, the result is guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
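
As an alternative to building the format string by hand, Python's built-in unicode_escape codec produces a similar escaped form. This is a sketch assuming Python 3; the name escaped is made up for illustration. Unlike the %08x approach, the codec leaves printable ASCII characters unchanged and uses the shorter \x and \u forms for code points below U+10000, so the two approaches only agree for strings like this one, where every character is above that range.

```python
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'

# Encode to the escaped byte form, then decode the bytes back to text.
escaped = myStr.encode('unicode_escape').decode('ascii')

print(escaped)
print(escaped.startswith('\\U'))
```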

Update2: If the OP is trying to differentiate between "normal English" strings and "Unicode Strings", the above approach will not work, because all characters have a corresponding Unicode representation.

However, one heuristic you might use to check whether a string looks like ASCII is to check whether any character's value falls outside the normal ASCII range. Assuming that you consider the normal ASCII range to be 32 through 127 (you can look at an ASCII table and decide what you want to include), you can do something like the following.

def isNormal(myStr):
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)

This can be done in one line, but I separated it to make it more readable.
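
For reference, here is roughly what the one-line form could look like, applied to a list as in the original question. The names is_normal, strings, normal, and unusual are illustrative, and the sample list is invented. Two caveats with this sketch: an empty string counts as normal, because all() of nothing is True, and tabs and newlines are rejected because they fall below 32.

```python
def is_normal(s):
    # One-line form of the check above: every code point must be in 32..127.
    return all(31 < ord(ch) < 128 for ch in s)

# Hypothetical sample data, for illustration only.
strings = ['hello', 'plain text', '\U0001f64c\U0001f60d', 'caf\u00e9']

normal = [s for s in strings if is_normal(s)]
unusual = [s for s in strings if not is_normal(s)]

print(normal)
print(unusual)
```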

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow