Question

I have a URL that links to a JavaScript file, for example http://something.com/../x.js. I need to extract a variable from x.js.

Is it possible to do this using Python? At the moment I am using urllib2.urlopen(), but when I call .read() I get this lovely mess:

U�(��%y�d�<�!���P��&Y��iX���O�������<Xy�CH{]^7e� �K�\�͌h��,U(9\ni�A ��2dp}�9���t�<M�M,u�N��h�bʄ�uV�\��0�A1��Q�.)�A��XNc��$"SkD�y����5�)�B�t9�):�^6��`(���d��hH=9D5wwK'�E�j%�]U~��0U�~ʻ��)�pj��aA�?;n�px`�r�/8<?;�t��z�{��n��W
�s�������h8����i�߸#}���}&�M�K�y��h�z�6,�Xc��!:'D|�s��,�g$�Y��H�T^#`r����f����tB��7��X�%�.X\��M9V[Z�Yl�LZ[ZM�F���`D�=ޘ5�A�0�){Ce�L*�k���������5����"�A��Y�}���t��X�(�O�̓�[�{���T�V��?:�s�i���ڶ�8m��6b��d$��j}��u�D&RL�[0>~x�jچ7�

When I look at the DOM in the dev tools, the only thing in the body is a string wrapped in tags. In the regular view that string is a JSON element.


Solution

.read() should give you the same thing you see in the "view source" window of your browser, so something is wrong. It looks like the HTTP response might be gzip-compressed, but urllib2 doesn't decompress gzip. It also doesn't request gzipped data, so if this is the problem, the server is probably misconfigured (it is compressing a response the client never asked for), but I'm assuming that's out of your control.
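
You can confirm that theory by inspecting the response headers you already get from urllib2; a quick diagnostic sketch, assuming Python 2 (since urllib2 is in use):

import urllib2

response = urllib2.urlopen("http://something.com/../x.js")
# a value of 'gzip' here would confirm that the body is compressed
print response.info().get('Content-Encoding')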

I suggest using requests instead. requests automatically decompresses gzip-encoded responses, so it should solve this problem for you.

import requests
r = requests.get('https://something.com/x.js')
r.text   # unparsed json output, shouldn't be garbled
r.json() # parses json and returns a dictionary
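
Once r.json() succeeds, extracting the variable you're after is just a dictionary lookup. A minimal sketch, assuming the value sits under a top-level key (the name 'myVar' is a placeholder, since the real key isn't shown in the question):

data = r.json()
value = data['myVar']  # hypothetical key; replace with the actual variable name from x.js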

In general, requests is much easier to use than urllib2, so I suggest using it everywhere unless you absolutely must stick to the standard library.
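
If you do have to stay on the standard library, you can check the Content-Encoding header and decompress the body by hand. A minimal Python 2 sketch, assuming the response really is gzip-compressed as suspected above:

import gzip
import json
import urllib2
from StringIO import StringIO

response = urllib2.urlopen("http://something.com/../x.js")
body = response.read()

# urllib2 does not decompress for you, so check the header and unzip manually
if response.info().get('Content-Encoding') == 'gzip':
    body = gzip.GzipFile(fileobj=StringIO(body)).read()

data = json.loads(body)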

Other tips

import json
import urllib2

# fetch the raw response body and parse it as JSON
js = urllib2.urlopen("http://something.com/../x.js").read()
data = json.loads(js)