There is no single "Arabic code-point range"; instead, there are 7 blocks specific to Arabic, plus other blocks that Arabic may use. See Arabic script in Unicode for a nice description of them.
If you want to match the Arabic characters available in ISO-8859-6, you only want part of one of those blocks, 0621-0652.
If you want the main Arabic block plus its supplement and extended blocks (both of which were added long after Unicode 1.0), that's 0600-06FF, 0750-077F, and 08A0-08FF.
If you want the contextual variants, you also need the two "presentation forms" blocks, FB50-FDFF and FE70-FEFF. Some of these characters are not actually used by Arabic, only by other languages that use the Arabic script (then again, you tagged your question Farsi…). The fact that your original code was matching FB50-FDFF implies that you need these.
Finally, as of Unicode 6.1, there are two additional ranges that you may or may not need, primarily useful for mathematics, 10E60-10E7F and 1EE00-1EEFF.
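One practical wrinkle with these two ranges is that they sit above U+FFFF, so a four-digit `\u` escape cannot express them. The sketch below assumes Python 3.3 or later, where `str` is always full Unicode and `re` understands eight-digit `\U` escapes; on a narrow Python 2 build these characters are surrogate pairs and a character class like this will not work. The name `astral_arabic_re` is made up for illustration.

```python
import re

# Assumption: Python 3.3+, where the re module accepts \UXXXXXXXX escapes
# in str patterns. The two ranges are the ones quoted in the answer.
astral_arabic_re = re.compile(
    r'[\U00010E60-\U00010E7F'   # 10E60-10E7F
    r'\U0001EE00-\U0001EEFF]'   # 1EE00-1EEFF
)

# chr() accepts any code point in Python 3, including those above U+FFFF.
in_first_range = chr(0x10E60)
in_second_range = chr(0x1EE00)

print(bool(astral_arabic_re.fullmatch(in_first_range)))
print(bool(astral_arabic_re.fullmatch(in_second_range)))
print(bool(astral_arabic_re.fullmatch('a')))
```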
I'm going to guess that you need the first 5 blocks, but not the last two, so, instead of this:
ur'[\w\ufb50-\ufdff]'
… do this:
ur'[\w\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff\ufb50-\ufdff\ufe70-\ufeff]'
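If typing the ranges by hand feels error-prone, the same character class can be assembled from a list of blocks. This is only a sketch, and it assumes Python 3, where the `ur''` prefix no longer exists and the `re` module itself interprets `\uXXXX` escapes. The names `ARABIC_BLOCKS` and `build_char_class` are made up for illustration; the ranges are the five listed above.

```python
import re

# The five ranges from the answer, with the Unicode block names as comments.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]

def build_char_class(blocks):
    """Return a regex character class: \\w plus every (lo, hi) range given."""
    ranges = ''.join('\\u{:04x}-\\u{:04x}'.format(lo, hi) for lo, hi in blocks)
    return '[\\w' + ranges + ']'

ARABIC_PATTERN = build_char_class(ARABIC_BLOCKS)
arabic_re = re.compile(ARABIC_PATTERN)

print(ARABIC_PATTERN)
print(bool(arabic_re.fullmatch('\u060c')))  # U+060C ARABIC COMMA
```

Note that in Python 3 `\w` on a `str` pattern already matches Arabic letters, so the explicit ranges mostly add the non-letter characters in those blocks, such as the Arabic comma U+060C. In Python 2 without `re.UNICODE`, `\w` is ASCII-only and the ranges do all the work.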
However, I'm not sure this really solves your problem. The original code was using re.findall with the presentation forms to break the text into tokens, maybe as a hacky way of splitting on end characters (which will only work on text encoded in a very particular, and obsolete, way…). Changing it to match every run of Arabic characters with findall will give you a very different result.
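To see how much the tokenization can change, here is a small comparison. It is a sketch under assumptions: Python 3, a made-up two-word Farsi sample, and the guess that the difference in question is between using the character class on its own and using it with a `+` quantifier, which the question does not spell out.

```python
import re

# The widened character class from above, written for Python 3.
pattern = r'[\w\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff\ufb50-\ufdff\ufe70-\ufeff]'

# Hypothetical sample: two Farsi words ("salam donya") separated by a space,
# spelled with escapes so the source does not depend on editor RTL handling.
text = '\u0633\u0644\u0627\u0645 \u062f\u0646\u06cc\u0627'

# Without a quantifier, findall returns one match per character.
per_char = re.findall(pattern, text)

# With '+', findall returns one match per run of matching characters.
runs = re.findall(pattern + '+', text)

print(len(per_char))
print(len(runs))
```

Neither result splits on word-final forms the way the original presentation-forms trick may have, so it is worth checking which tokens the rest of your code expects.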