Issue with parsing list of HTML with lxml and requests

Question 1

Try moving your second loop and your return line out of the first loop (dedent them one level) so that no redundant iteration happens and the complete list is returned. Something like the following:

from lxml import html
import requests as rq

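def first_page_links(links):
    recipe_links = []
    recipe_html = []

    # First pass: download and parse every page.
    for link in links:
        r = rq.get(link)
        recipe_html.append(html.fromstring(r.text))

    # Second pass: runs once, after every page has been parsed.
    for rhtml in recipe_html:
        # xpath() returns a list, so recipe_links ends up as a list of lists.
        recipe_links.append(rhtml.xpath('//*[@id="content"]/ul/li/a/@href'))

    return recipe_links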

Let us know if this works.

EDIT:

Consider the following:

y_list = []
final_list = []
for x in x_list:
    y_list.append(x)
    for y in y_list:
        final_list.append(y)

This is your function, simplified. Assuming x_list holds 3 URLs, the following happens:

x1 is appended to y_list.
y_list is processed with only x1 so far, so x1 alone is appended to final_list. final_list now contains: [x1]
x2 is appended to y_list.
y_list now contains x1 and x2. Both are processed and appended to final_list. final_list now contains: [x1, x1, x2].
x3 is appended to y_list. y_list now contains x1, x2, and x3, so all three are processed and appended again. final_list now contains: [x1, x1, x2, x1, x2, x3].
See where this is going? :)

Your second loop, which processes the items in the first list, sits inside the first loop, which grows that list one item at a time. As a result, the second loop reprocesses the whole first list on every iteration of the first loop, so earlier items get appended again and again. That is the redundant iteration.
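
If you want to check the walk-through above yourself, here is a small sketch that runs both loop layouts on three placeholder strings. The function names nested and sequential are made up for this illustration, and the strings stand in for your URLs.

```python
def nested(x_list):
    """Second loop inside the first: y_list is re-walked on every pass."""
    y_list = []
    final_list = []
    for x in x_list:
        y_list.append(x)
        for y in y_list:  # runs once per outer iteration
            final_list.append(y)
    return final_list


def sequential(x_list):
    """Second loop after the first: y_list is walked exactly once."""
    y_list = []
    final_list = []
    for x in x_list:
        y_list.append(x)
    for y in y_list:  # runs only after y_list is complete
        final_list.append(y)
    return final_list


urls = ["x1", "x2", "x3"]  # placeholders, not real URLs
print(nested(urls))      # ['x1', 'x1', 'x2', 'x1', 'x2', 'x3']
print(sequential(urls))  # ['x1', 'x2', 'x3']
```

With n input items the nested layout appends 1 + 2 + ... + n entries, which is why the duplicates pile up quickly as your list of URLs grows.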

There are many ways to do what you wanted, but if you're just appending to lists and need a single pass over each, the fix above is all that's needed.
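
As one of those other ways, here is a rough sketch that parses each page and pulls its links in a single loop, with no intermediate list of parsed pages. It assumes you want one flat list of hrefs rather than a list of lists, which is a guess about your goal. The names LINK_XPATH, download, extract_links and the fetch parameter are not from your code; fetch is only there so the function can be tried without hitting the network.

```python
from lxml import html
import requests

# XPath copied from the function above; it assumes the pages keep their
# links in a <ul> directly under the element with id="content".
LINK_XPATH = '//*[@id="content"]/ul/li/a/@href'


def download(url):
    """Default fetcher: return the body of `url` as text."""
    return requests.get(url).text


def extract_links(page_text):
    """Parse one page of HTML and return its matching hrefs as a list."""
    return html.fromstring(page_text).xpath(LINK_XPATH)


def first_page_links(links, fetch=download):
    """Collect hrefs from every URL in `links` into one flat list.

    `fetch` is a hypothetical hook (not in the original code) that maps
    a URL to HTML text, so a fake can be passed in for testing.
    """
    recipe_links = []
    for link in links:
        # extend() flattens; append() would produce a list of lists.
        recipe_links.extend(extract_links(fetch(link)))
    return recipe_links
```

Calling first_page_links(your_urls) downloads each page with requests. If you would rather keep the links grouped per page, swap extend for append.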

Question 2

Watch where the return is placed. You probably want to return after all the loops are finished:

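from lxml import html
import requests

def first_page_links(link):
    recipe_links = []
    recipe_html = []

    for x in link:
        page_request = requests.get(x)
        recipe_html.append(html.fromstring(page_request.text))

    print(recipe_html)  # debug output; this form works on Python 2 and 3

    # Separate loop with its own variable, so x is not reused and the
    # parsed pages are walked only once.
    for page in recipe_html:
        recipe_links.append(page.xpath('//*[@id="content"]/ul/li/a/@href'))

    # Return only after all the loops have finished.
    return recipe_links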
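
To see why the placement of return matters, here is a tiny sketch with plain lists instead of web pages. Both function names are invented for this example.

```python
def return_inside_loop(items):
    collected = []
    for item in items:
        collected.append(item)
        return collected  # leaves the function during the first iteration


def return_after_loop(items):
    collected = []
    for item in items:
        collected.append(item)
    return collected  # reached only once the loop is done


print(return_inside_loop([1, 2, 3]))  # [1]
print(return_after_loop([1, 2, 3]))   # [1, 2, 3]
```

A return indented inside the for body ends the function as soon as it is reached, so only the first item is ever handled.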