extract text from a webpage file using C/C++ [duplicate]

https://stackoverflow.com/questions/22970330

30-06-2023

Question

How can I extract text from a specific area of a webpage (in Arabic, not English), given its URL, using C/C++?

For example, given the URL of this Wikipedia article, I want to extract the body of the article (highlighted in the image below) and discard the other parts of the page, such as the heading and the menus on the right and left. I only need the body parsed into a string.

example image

Solution

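To get only the article text from a Wikipedia page, add ?action=render to your URL. The response is then the HTML of the article body alone, without the site header, sidebars, or menus.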
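Appending the parameter by hand is easy to get wrong when the URL already carries a query string or a fragment. A minimal sketch of a helper for that, where the name with_render_action is hypothetical and not part of any library:

```cpp
#include <string>

// Hypothetical helper: returns `url` with the action=render parameter added.
std::string with_render_action(const std::string& url) {
    // Split off a fragment ("#section"), which has to stay at the very end.
    const std::string::size_type hash = url.find('#');
    std::string base = url.substr(0, hash);
    const std::string fragment =
        (hash == std::string::npos) ? std::string() : url.substr(hash);

    // Use '?' for the first parameter and '&' if a query string already exists.
    base += (base.find('?') == std::string::npos) ? '?' : '&';
    base += "action=render";
    return base + fragment;
}
```

The sketch assumes the URL is already percent-encoded, as the Arabic article URL in this answer is, and it does not check whether an action parameter is already present.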

Then fetch that URL with a library such as libcurl. Search the web for curl/C++ tutorials if you have not used it before. You are looking for something like this (just to give you an idea):

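#include <stdio.h>
#include <curl/curl.h>

int main(void) {

    CURL* curl;
    CURLcode result;

    curl = curl_easy_init();
    if (curl == NULL) {
        fprintf(stderr, "curl_easy_init() failed\n");
        return 1;
    }

    /* action=render asks MediaWiki for the article body only */
    curl_easy_setopt(curl, CURLOPT_URL, "https://ar.wikipedia.org/wiki/%D8%B3%D9%8A_%D8%A5%D9%86_%D8%A5%D9%86_%D8%A7%D9%84%D8%B9%D8%B1%D8%A8%D9%8A%D8%A9?action=render");
    /* Wikimedia asks clients to identify themselves; this value is a placeholder */
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "example-fetcher/0.1");

    /* With no write callback set, libcurl prints the response body to stdout */
    result = curl_easy_perform(curl);
    if (result != CURLE_OK) {
        fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(result));
    }

    curl_easy_cleanup(curl);

    return result == CURLE_OK ? 0 : 1;
}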
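
The program above only prints the page, while the question asks for the body in a string. One common way to get that with libcurl is a write callback that appends each received chunk to a buffer. The following is a sketch in C++, assuming the buffer is a std::string; the name append_chunk is made up for illustration:

```cpp
#include <cstddef>
#include <string>

// Write callback with the signature libcurl expects for CURLOPT_WRITEFUNCTION.
// libcurl may call it several times, once per chunk of size * nmemb bytes.
// `userdata` is the pointer given via CURLOPT_WRITEDATA; this sketch assumes
// it points to a std::string.
std::size_t append_chunk(char* data, std::size_t size, std::size_t nmemb, void* userdata) {
    const std::size_t bytes = size * nmemb;
    static_cast<std::string*>(userdata)->append(data, bytes);
    return bytes;  // returning a different count makes libcurl abort the transfer
}

// Registration, placed before curl_easy_perform() in the program above
// (compiled as C++):
//   std::string body;
//   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_chunk);
//   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
// After a successful curl_easy_perform(), `body` holds the raw response bytes.
```

The string stores bytes, not characters. Wikipedia serves its pages as UTF-8, so each Arabic letter occupies more than one byte, and the content is still HTML that has to be parsed or stripped of tags before it is plain text.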

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow