Why does my HtmlAgilityPack code work with some sites but not others? [duplicate]

https://stackoverflow.com/questions/21790329

11-10-2022

Question

With the code below, I can get paragraphs from Wikipedia, but not from Project Gutenberg:

private void buttonLoadHTML_Click(object sender, EventArgs e)
{
    string url = textBoxFirstURL.Text;
    GetParagraphsListFromHtml(url);
}

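// Takes a page URL (HtmlWeb.Load fetches and parses it) and returns the non-empty paragraph texts.
public List<string> GetParagraphsListFromHtml(string url)
{
    var pars = new List<string>();
    var getHtmlWeb = new HtmlWeb();
    var document = getHtmlWeb.Load(url);
    // SelectNodes returns null, not an empty collection, when nothing matches.
    var pTags = document.DocumentNode.SelectNodes("//p");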
    if (pTags != null)
    {
        foreach (var pTag in pTags)
        {
            if (!string.IsNullOrWhiteSpace(pTag.InnerText))
            {
                pars.Add(pTag.InnerText);
                MessageBox.Show(pTag.InnerText);
            }
        }
    }
    MessageBox.Show("done!");
    return pars;
}

If I enter "http://en.wikipedia.org/wiki/Web_api" in textBoxFirstURL, it works as expected: the paragraphs are displayed in a series of MessageBox invocations. However, if I instead enter http://www.gutenberg.org/files/19033/19033-h/19033-h.htm, I get this:

[Screenshot of the output; image not preserved]

Why would that be the case and is there a way to work around it?

UPDATE

The question linked as a duplicate is not the same question, and it has no answer, so the notice ("This question may already have an answer here") is untrue or, at the very least, misleading.

Solution

Project Gutenberg will redirect you to a 'Welcome Stranger' page if it doesn't recognize that you have been there before. Presumably that is through the use of a cookie. So, unless your code is maintaining a cookie collection across executions, you'll be redirected to that page.

This is the page I was redirected to when I clicked your link: http://www.gutenberg.org/ebooks/19033?msg=welcome_stranger

If you view the source of that page, you'll see there is only one paragraph tag in it that contains exactly the text you show in your screenshot.
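To make the redirect visible in code rather than in a MessageBox, a rough sketch along the following lines may help. It keeps one CookieContainer for all requests in the process and checks the final URL after redirects. The class and method names (CookieAwareScraper, IsWelcomeRedirect, GetParagraphsAsync) are made up for illustration, and the "msg=welcome_stranger" marker is simply taken from the redirect URL quoted above; it is an assumption that the site still behaves this way.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class CookieAwareScraper
{
    // One cookie container and client shared by every request in this process,
    // so cookies set by an earlier response are sent with later requests.
    private static readonly CookieContainer Cookies = new CookieContainer();
    private static readonly HttpClient Client = new HttpClient(
        new HttpClientHandler { CookieContainer = Cookies, UseCookies = true });

    // Assumption: the welcome page is recognisable by this query-string marker,
    // taken from the redirect URL quoted in the answer.
    public static bool IsWelcomeRedirect(Uri finalUri)
    {
        return finalUri != null
            && finalUri.Query.IndexOf("msg=welcome_stranger", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static async Task<List<string>> GetParagraphsAsync(string url)
    {
        var paragraphs = new List<string>();
        using (HttpResponseMessage response = await Client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();

            // After automatic redirects, RequestMessage.RequestUri holds the final URL.
            Uri finalUri = response.RequestMessage.RequestUri;
            if (IsWelcomeRedirect(finalUri))
                throw new InvalidOperationException("Redirected to the welcome page: " + finalUri);

            // Parse the downloaded HTML with HtmlAgilityPack.
            var document = new HtmlDocument();
            document.LoadHtml(await response.Content.ReadAsStringAsync());

            // SelectNodes returns null when no <p> elements are found.
            var pTags = document.DocumentNode.SelectNodes("//p");
            if (pTags == null) return paragraphs;
            foreach (var pTag in pTags)
            {
                if (!string.IsNullOrWhiteSpace(pTag.InnerText))
                    paragraphs.Add(pTag.InnerText);
            }
        }
        return paragraphs;
    }
}
```

This only keeps cookies for the lifetime of the process; persisting them across executions is left out. Whether a second request with the stored cookie actually returns the book text is not something this sketch can guarantee.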

The comments at the top of that page also contain the following notice:

DON'T USE THIS PAGE FOR SCRAPING.

Seriously. You'll only get your IP blocked.

Download http://www.gutenberg.org/feeds/catalog.rdf.bz2 instead, which contains all Project Gutenberg metadata in one RDF/XML file.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow