Question

I'm parsing XML data with Python. The XML file contains URLs, and as you know, URLs can't be handled directly with a regex because they contain characters that interfere with the parsing, such as '?', '$' and '@'. That's why I use the urllib.quote function from the urllib module. It works just fine except for one URL, and I can't explain why.

Before urllib.quote, the URL looks like this:

https://randomurl.fr/?oslc_cm.properties=FORM_item

After the function, it becomes:

https%3A//randomurl.fr/?oslc_cm.properties=FORM_item

So the ":" is quoted, but the "?" and the "=" remain as they are, which breaks the parsing. What I find strange is that this is the only URL for which it doesn't work: for the 30 other URLs that also contain "?", it turns "?" into "%3F" and "=" into "%3D". I tried moving the URL to a different place in the XML file, but it's still this precise URL that isn't quoted properly. However, I noticed that if I replace FORM_item with FORM_productCmt, which is the property of another existing URL, then it is quoted just fine. It seems pretty random to me and I can't figure out what's going on.
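
For reference, here is a quick check of what the quoting function does to this URL when called on its own. It is only a minimal sketch and assumes Python 3, where urllib.parse.quote is the counterpart of Python 2's urllib.quote. Called directly, it also encodes "?" and "=", which suggests the function itself is not the culprit.

```python
from urllib.parse import quote  # Python 3 counterpart of Python 2's urllib.quote

url = "https://randomurl.fr/?oslc_cm.properties=FORM_item"

# With the default safe="/", the slashes are kept as they are,
# while ":", "?" and "=" are all percent-encoded.
quoted = quote(url)
print(quoted)
```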

Does anybody see what is going wrong here?

EDIT

I can't escape the characters by hand because I'm fetching an XML file and parsing it. Here is the code I use to quote the URLs:

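import re
import urllib

def genElementList(self, xmldata):
    xmldata_encoded = xmldata
    # Collect every double-quoted string that starts with "http"
    p = re.compile(r'"(http.*?)"')
    urls = p.findall(xmldata)
    for url in urls:
        # Swap the raw URL for its percent-encoded form
        xmldata_encoded = str.replace(xmldata_encoded, url, urllib.quote(url))
        print xmldata_encoded + '\n'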

For each URL I can see that the function worked, except for one, and it is always the same one. I compared it with other URLs that are quoted correctly, and they are identical except for the part "properties=FORM_item", where another would have "properties=FORM_productCmt". That's why I don't understand how it can fail.
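
One way this exact symptom can arise is sketched below. The XML snippet is made up for illustration: it assumes the document also contains a shorter URL that is a prefix of the problematic one. The sketch uses Python 3's urllib.parse.quote in place of Python 2's urllib.quote, and the names sample_xml and encoded are hypothetical.

```python
import re
from urllib.parse import quote

# Hypothetical XML: the first URL is a prefix of the second one.
sample_xml = (
    '<a href="https://randomurl.fr/"/>'
    '<a href="https://randomurl.fr/?oslc_cm.properties=FORM_item"/>'
)

encoded = sample_xml
for url in re.findall(r'"(http.*?)"', sample_xml):
    # str.replace without a count rewrites EVERY occurrence of url,
    # including the one embedded at the start of the longer URL.
    encoded = encoded.replace(url, quote(url))

print(encoded)
```

After the first iteration the longer URL already starts with "https%3A//", so its original text no longer exists in the string and the second replace has nothing to match. The "?" and "=" are therefore never encoded.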


Solution

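Thanks, user2357112. You helped me see what the problem was: str.replace replaces every occurrence of the URL, including places where it is only a substring of another URL. I solved the substring issue by setting the count argument of str.replace to 1: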

p = re.compile(r'"(http.*?)"')
urls = p.findall(xmldata)
for url in urls:
    # count=1: replace only the first occurrence of this URL still present in the text
    xmldata_encoded = str.replace(xmldata_encoded, url, urllib.quote(url), 1)
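
As a possible alternative to search-and-replace over the whole string, each match can be quoted in place with re.sub and a callback. This is not what the answer above uses; it is only a sketch, written for Python 3 (urllib.parse.quote instead of Python 2's urllib.quote), and the names URL_PATTERN, quote_urls and sample_xml are made up for the example.

```python
import re
from urllib.parse import quote

URL_PATTERN = re.compile(r'"(http.*?)"')

def quote_urls(xmldata):
    """Percent-encode every double-quoted http(s) URL where it was matched."""
    # The callback handles one match at a time, so a URL that is a
    # substring of another URL is never rewritten by accident.
    return URL_PATTERN.sub(lambda m: '"' + quote(m.group(1)) + '"', xmldata)

# Hypothetical XML in which one URL is a prefix of the other.
sample_xml = (
    '<a href="https://randomurl.fr/"/>'
    '<a href="https://randomurl.fr/?oslc_cm.properties=FORM_item"/>'
)
result = quote_urls(sample_xml)
print(result)
```

Because the substitution works on match positions rather than on text equality, it does not depend on the order in which the URLs appear or on whether one URL is contained in another.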

Licensed under: CC-BY-SA with attribution