Why is a website's response in python's `urllib.request` different to a request sent directly from a web-browser?

StackOverflow https://stackoverflow.com/questions/21958330

  •  15-10-2022

Question

I have a program that takes a URL and gets a response from the server using urllib.request. It all works fine, but I tested it a little more and realised that when I put in a URL such as http://google.com into my browser, I got a different page (which had a doodle and a science fair promotion etc.) but with my program it was just plain Google with nothing special on it.

It is probably due to redirection, but if the request from my program goes through the same router and DNS, surely the output should be exactly the same?

Here is the code:

"""
This is a simple browsing widget that handles user requests, with the
added condition that all proxy settings are ignored. It outputs in the
default web browser.
"""

# This imports some necessary libraries.
import tkinter as tk
import webbrowser

from tempfile import NamedTemporaryFile
import urllib.request


def parse(data):
    """
    Removes junk from the data so it can be easily processed.
    :rtype : list
    :param data: The raw bytes of an HTML response body.
    """
    data = data.decode(encoding='UTF-8')  # This makes data workable.
    lines = data.splitlines(keepends=True)  # Keeps line endings for writing.
    return lines


class Browser(object):
    """This creates an object for getting a direct server response."""
    def __init__(self, master):
        """
        Sets up a direct browsing session and a GUI to manipulate it.
        :param master: Any Tk() window in which the GUI is displayable.
        """
        # This creates a frame within which widgets can be stored.
        frame = tk.Frame(master)
        frame.pack()

        # Here we create a handler that ignores proxies. An empty dict
        # disables them; None would auto-detect the system proxy settings.
        proxy_handler = urllib.request.ProxyHandler(proxies={})
        self.opener = urllib.request.build_opener(proxy_handler)

        # This sets up components for the GUI.
        tk.Label(frame, text='Full Path').grid(row=0)
        self.url = tk.Entry(frame)  # This takes the specified path.
        self.url.grid(row=0, column=1)
        tk.Button(frame, text='Go', command=self.browse).grid(row=0, column=2)

        # This binds the return key to calling the method self.browse.
        master.bind('<Return>', self.browse)

    def navigate(self, query):
        """
        Gets raw data from the queried server, ready to be processed.
        :rtype : bytes
        :param query: The request entered into 'self.url'.
        """
        # This contacts the domain and reads its response.
        response = self.opener.open(query)
        html = response.read()
        return html

    def browse(self, event=None):
        """
        Wraps all functionality together for data reading and writing.
        :param event: The argument from whatever calls the method.
        """
        # This retrieves the input given by the user.
        location = self.url.get()
        print('\nUser inputted:', location)
        # This attempts to access the server and gives any errors.
        try:
            raw_data = self.navigate(location)
        except Exception as e:
            print(e)
        # This executes assuming there are no errors.
        else:
            clean_data = parse(raw_data)
            # This writes a temporary HTML file.
            with NamedTemporaryFile(suffix='.html', delete=False) as cache:
                cache.writelines(line.encode('UTF-8') for line in clean_data)
            # Opening it after the block ensures the file is flushed and closed.
            webbrowser.open_new_tab(cache.name)
            print('Done.')


def main():
    """Using a main function means not doing everything globally."""
    # This creates a window that is always in the foreground.
    root = tk.Tk()
    root.wm_attributes('-topmost', 1)
    root.title('DirectQuery')

    # This starts the program.
    Browser(root)
    root.mainloop()

# This allows for execution as well as for importing.
if __name__ == '__main__':
    main()

Note: I don't know whether this has something to do with the program being instructed to ignore proxies. My computer doesn't have any proxy settings turned on, by the way. Also, if there is a way to get the same response/output as a web browser such as Chrome would, I would love to hear it.


Solution

In order to answer your general question you need to understand how the web site in question operates, so this isn't really a Python question. Web sites frequently detect the browser's "make and model" with special detection code, often (as indicated in the comment on your question) starting with the User-Agent: HTTP header.

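By default, urllib.request sends a User-Agent of the form Python-urllib/3.x rather than a browser's. It would therefore make sense for Google's home page not to include any JavaScript-based functionality if the User-Agent identifies itself as a program.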
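
If you want to experiment with this, one option is to have the opener send a browser-like User-Agent instead of the default. The following is only a minimal sketch: the function name build_browser_like_opener and the User-Agent string are made up for illustration (copy the real value from your own browser's developer tools), and there is no guarantee that a given site keys its output on this header alone, since cookies, Accept-Language and JavaScript can all play a part.

```python
import urllib.request

# Hypothetical browser-like User-Agent string, for illustration only.
# Replace it with the value your own browser actually sends.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def build_browser_like_opener(user_agent=BROWSER_USER_AGENT):
    """Return an opener that ignores proxies and sends a custom User-Agent."""
    # An empty dict disables proxies; None would auto-detect system settings.
    proxy_handler = urllib.request.ProxyHandler(proxies={})
    opener = urllib.request.build_opener(proxy_handler)
    # Overwrite the default [('User-agent', 'Python-urllib/3.x')] header list.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


# What urllib.request sends when nothing is changed.
default_headers = urllib.request.build_opener().addheaders
print("Default headers:", default_headers)

# What the customised opener will send instead.
opener = build_browser_like_opener()
print("Custom headers: ", opener.addheaders)
```

In the program from the question, the returned opener could be assigned to self.opener in __init__, so that self.opener.open(query) in navigate sends the new header. Even then, the page saved to disk may still differ from what the browser shows, because the browser also runs scripts and sends cookies that this sketch does not.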

Licensed under: CC-BY-SA with attribution