Question

I have an API server built with Python Flask, and I need a group of client machines to send data to it by making HTTP POST requests.

The data is HTML content, usually about 200 KB per page. (Note: I am not converting other data into HTML/XML format; the data itself is HTML that I have collected from the web.) I am trying to reduce the network load as much as I can by using serialization/deserialization and compression.

Instead of sending raw HTML across the network, is there a way to serialize the HTML object (a BeautifulSoup soup?) and deserialize it on the server side? Or should I compress the content first and then POST it to the API server, which would decompress the data once it receives it?

What I have done:

(1) I tried turning the raw HTML text into a soup object and then serializing it with pickle. However, it failed with an error about too many recursions. I also tried pickling the raw HTML string, but the result is almost the same size as the raw HTML string.

(2) I tried compressing the content with zlib beforehand, which brings it down to about 10% of its original size. However, is this a legitimate way to approach the problem?
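
The size difference described in (1) and (2) can be checked with the standard library alone. The sketch below uses a made-up, highly repetitive HTML string (`sample_html` is a placeholder, not real scraped data), so the exact ratio will differ from the roughly 10% reported above. pickle only serializes and does not compress, which is why its output is about as large as its input.

```python
import pickle
import zlib

# Hypothetical stand-in for a scraped page; real pages compress differently.
row = "<tr><td class='cell'>value</td><td class='cell'>another value</td></tr>\n"
sample_html = "<html><body><table>\n" + row * 2000 + "</table></body></html>"

raw = sample_html.encode("utf-8")
pickled = pickle.dumps(sample_html)   # serialization only, no compression
compressed = zlib.compress(raw, 9)    # DEFLATE at the highest compression level

print("raw bytes:       ", len(raw))
print("pickled bytes:   ", len(pickled))
print("compressed bytes:", len(compressed))

# Decompressing must give back exactly the original bytes.
restored = zlib.decompress(compressed)
```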

Any thoughts?


Solution

I was inspired by the comments and came up with a solution: compress the HTML content with zlib and POST it to the API server; on the Flask side, decompress the data and push it to MongoDB for storage.

Here is the part that might save someone a future headache.

Client Side:

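import urllib.request
import zlib

myinput = "http://www.exmaple.com/001"
myoutput = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" ... /html>'
result = {'myinput': myinput, 'myoutput': myoutput}
# zlib works on bytes, so encode the dict's repr before compressing.
data = zlib.compress(str(result).encode('utf-8'))
# Passing data makes this a POST; the path must match the Flask route.
urllib.request.urlopen("http://www.host.com/contribute", data)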

Server Side:

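import ast
import sys
import zlib
from flask import Flask, request

app = Flask(__name__)

@app.route('/contribute', methods=['POST'])
def contribute():
    try:
        # Read the raw body; it is zlib-compressed bytes, not form data.
        data = request.stream.read()
        # literal_eval parses the dict literal without executing code,
        # unlike eval, which would run anything a client sends.
        result = ast.literal_eval(zlib.decompress(data).decode('utf-8'))
        db.result.insert_one(result)  # db: a pymongo database handle created elsewhere
    except Exception:
        # Errors are only logged; the client still receives 'OK'.
        print(sys.exc_info())
    return 'OK'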
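
As an alternative to sending the text form of a Python dict, the payload could be encoded as JSON before compression. The following is only a sketch under that assumption, not the code from the answer above. `pack_record` and `unpack_record` are hypothetical helper names, and the HTTP and MongoDB parts are left out so the round trip can be run on its own.

```python
import json
import zlib


def pack_record(record):
    """Client side: dict -> JSON text -> UTF-8 bytes -> zlib-compressed bytes."""
    return zlib.compress(json.dumps(record).encode("utf-8"))


def unpack_record(payload):
    """Server side: reverse of pack_record; parses data without executing it."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


record = {
    "myinput": "http://www.exmaple.com/001",
    "myoutput": '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" ... /html>',
}
payload = pack_record(record)       # bytes to use as the POST body
received = unpack_record(payload)   # what the Flask view would store
```

In the Flask view, `unpack_record(request.stream.read())` would take the place of the line that parses the decompressed data.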

Result in mongodb:

{ 
"_id" : ObjectId("534e0d346a1b7a0e48ff9076"), 
"myoutput" : "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" ... /html>",  
"myinput" : "http://www.exmaple.com/001" 
}

(Note: as you may have noticed, the version shown by MongoDB has a backslash in front of special characters such as double quotes. I am not sure how to change it back.)
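
The backslashes are most likely only part of how the document is displayed: a JSON-style rendering has to escape double quotes inside a string, while the stored string itself is unchanged. A small illustration using Python's json module (an analogy to the shell output, not a query against MongoDB):

```python
import json

stored = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" ... /html>'

# Rendering the value as a JSON string literal escapes the inner quotes...
displayed = json.dumps(stored)
print(displayed)

# ...but the value itself contains no backslashes, and parsing the
# rendered form gives back the identical string.
print("\\" in stored)
print(json.loads(displayed) == stored)
```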

There has been some discussion elsewhere about retrieving binary data in Flask. If you read from request.stream directly, you don't have to mess with the request headers.

Thanks!

Licensed under: CC-BY-SA with attribution