Question

I have the code below, which Liang Sun implemented:

#Created by Liang Sun in 2013
import re
import collections
import hashlib

class Simhash(object):
    def __init__(self, value):
        self.f = 64
        self.reg = ur'[\w\ufb50-\ufdff]'
        self.value = None

        if isinstance(value, Simhash):
            self.value = value.value
        elif isinstance(value, basestring):
            self.build_by_text(unicode(value))
        elif isinstance(value, collections.Iterable):
            self.build_by_features(value)
        elif isinstance(value, long):
            self.value = value
        else:
            raise Exception('Bad parameter')

    def _slide(self, content, width=2):
        return [content[i:i+width] for i in xrange(max(len(content)-width+1, 1))]

    def _tokenize(self, content):
        ans = []
        content = ''.join(re.findall(self.reg, content))
        ans = self._slide(content)
        return ans

    def build_by_text(self, content):
        features = self._tokenize(content)
        return self.build_by_features(features)

    def build_by_features(self, features):
        features = set(features) # remove duplicated features
        hashs = [int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16) for w in features]
        v = [0]*self.f
        for h in hashs:
            for i in xrange(self.f):
                mask = 1 << i
                v[i] += 1 if h & mask else -1
        ans = 0
        for i in xrange(self.f):
            if v[i] >= 0:
                ans |= 1 << i
        self.value = ans

    def distance(self, another):
        x = (self.value ^ another.value) & ((1 << self.f) - 1)
        ans = 0
        while x:
            ans += 1
            x &= x-1
        return ans

I want to use this code for Arabic text. I asked Liang Sun about this, and he said I should replace self.reg = ur'[\w\ufb50-\ufdff]' with the Arabic code point range. I searched and found the Arabic Unicode block on Wikipedia, but I don't know how to use it.

Any help is appreciated.


Solution

There is no single "Arabic code-point range". Instead, there are 7 blocks specific to Arabic, plus other blocks that Arabic text may use. See Arabic script in Unicode for a nice description of them.

If you want to match the Arabic characters available in ISO-8859-6, you only want part of one of those blocks, 0621-0652.

If you want to match the general-purpose Arabic letters, that's the blocks 0600-06FF (Arabic, present since Unicode 1.0), 0750-077F (Arabic Supplement), and 08A0-08FF (Arabic Extended-A).

If you want the contextual variants, you also need the two "presentation forms" blocks (although some of these are not actually used by Arabic, only by other languages that use the Arabic script—then again, you tagged your question Farsi…), FB50-FDFF and FE70-FEFF. The fact that your original code was matching FB50-FDFF implies that you need these.

Finally, as of Unicode 6.1, there are two additional ranges that you may or may not need, primarily useful for mathematics, 10E60-10E7F and 1EE00-1EEFF.
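If you ever do need those two ranges, note that they lie above U+FFFF, so the four-digit \uXXXX escape cannot express them. A minimal sketch, assuming Python 3 (where the re module accepts eight-digit \UXXXXXXXX escapes in a pattern); the name MATH_REG is only illustrative:

```python
import re

# Hypothetical pattern covering only the two supplementary-plane ranges.
# Code points above U+FFFF need the eight-digit \UXXXXXXXX escape.
MATH_REG = re.compile(r'[\U00010e60-\U00010e7f\U0001ee00-\U0001eeff]')

math_alef = '\U0001ee00'   # first code point of the 1EE00-1EEFF range
plain_alef = '\u0627'      # ordinary Arabic letter alef, in the 0600-06FF block

print(MATH_REG.match(math_alef) is not None)
print(MATH_REG.match(plain_alef) is not None)
```

On a Python 2 "narrow" build these characters are stored as surrogate pairs, so a character class like this may not behave as expected there.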

I'm going to guess that you need the first 5 blocks, but not the last two, so, instead of this:

ur'[\w\ufb50-\ufdff]'

… do this:

ur'[\w\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff\ufb50-\ufdff\ufe70-\ufeff]'
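As a rough illustration of what that character class keeps, here is a sketch in Python 3 syntax (the ur'' prefix does not exist there). It assumes re.ASCII is a fair stand-in for Python 2's default, where \w only matches [a-zA-Z0-9_]; the names ARABIC_RANGES, ARABIC_REG and keep_word_chars are made up for this example:

```python
import re

# The five blocks guessed at above, written as character-class ranges.
ARABIC_RANGES = (
    r'\u0600-\u06ff'   # Arabic
    r'\u0750-\u077f'   # Arabic Supplement
    r'\u08a0-\u08ff'   # Arabic Extended-A
    r'\ufb50-\ufdff'   # Arabic Presentation Forms-A
    r'\ufe70-\ufeff'   # Arabic Presentation Forms-B
)

# re.ASCII restricts \w to ASCII word characters (assumption: this mimics
# the Python 2 behaviour of the original code); explicit ranges still apply.
ARABIC_REG = re.compile('[\\w' + ARABIC_RANGES + ']', re.ASCII)

def keep_word_chars(text):
    """Drop everything that is neither an ASCII word char nor Arabic."""
    return ''.join(ARABIC_REG.findall(text))

# An Arabic word (written with escapes) followed by ASCII text and punctuation.
sample = '\u0645\u0631\u062d\u0628\u0627, world! 123'
cleaned = keep_word_chars(sample)
print(cleaned)
```

The comma, spaces and exclamation mark are dropped, while the Arabic letters, the ASCII letters and the digits survive.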

However, I'm not sure this really solves your problem. The original code was using re.findall with the presentation forms to break the text into tokens, maybe as a hacky way of splitting on end characters (which will only work on text encoded in a very particular, and obsolete, way…). Changing the pattern so that findall matches every Arabic character will give you a very different result.
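To see how much the result changes, here is a small sketch that feeds the same text through both patterns and then through the same two-character sliding window as _slide in the question. It is written for Python 3, with re.ASCII assumed as an approximation of Python 2's \w; OLD_REG, NEW_REG, slide and features are names invented for this example:

```python
import re

OLD_REG = re.compile(r'[\w\ufb50-\ufdff]', re.ASCII)
NEW_REG = re.compile(
    r'[\w\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff\ufb50-\ufdff\ufe70-\ufeff]',
    re.ASCII,
)

def slide(content, width=2):
    # Same windowing as Simhash._slide in the question.
    return [content[i:i + width]
            for i in range(max(len(content) - width + 1, 1))]

def features(text, reg):
    # Keep only the characters the pattern matches, then cut into bigrams.
    return slide(''.join(reg.findall(text)))

# Four ordinary Arabic letters (0600-06FF block) followed by " hi".
text = '\u0633\u0644\u0627\u0645 hi'

old_features = features(text, OLD_REG)   # Arabic letters are discarded
new_features = features(text, NEW_REG)   # Arabic letters are kept

print(len(old_features), len(new_features))
```

With the old pattern only the ASCII "hi" survives as a single feature; with the new one the Arabic letters contribute their own bigrams, so the resulting hash is built from a different feature set.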

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow