extract text from a webpage file using C/C++ [duplicate]

https://stackoverflow.com/questions/22970330

30-06-2023

Question

How can I extract the text from a specific area of a webpage (the text is in Arabic, not English), given its URL, using C/C++?

For example: given the URL of this Wikipedia article, I want to extract the body of the article (highlighted in the image below) and throw away the other parts of the webpage, such as the heading and the menus on the right and left. I only need the body parsed into a string.

example image

Solution

To get only the article body from a Wikipedia page (as HTML, without the heading, menus, and other page chrome), append ?action=render to the URL.
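If the URL is built at run time, appending the parameter needs a little care, because a URL that already has a query string takes "&" rather than "?". A minimal sketch, assuming a hypothetical helper named with_render_action (it is not part of MediaWiki or libcurl):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the given page URL with action=render added.
std::string with_render_action(const std::string& url) {
    assert(!url.empty());
    // Split off a fragment (#section); the query string must come before it.
    const std::string::size_type hash = url.find('#');
    std::string base = url.substr(0, hash);
    const std::string fragment =
        (hash == std::string::npos) ? std::string() : url.substr(hash);
    // Use '&' if the URL already carries a query string, '?' otherwise.
    base += (base.find('?') == std::string::npos) ? '?' : '&';
    base += "action=render";
    return base + fragment;
}
```

For a plain /wiki/... URL this just appends ?action=render, which is the case described above.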

Then fetch that URL, for example with libcurl. Search the web for a libcurl C/C++ tutorial if you have not used it before. You are looking for something like this (just to give you an idea):

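#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    /* Create an easy handle; curl_easy_init() returns NULL on failure. */
    CURL* curl = curl_easy_init();
    if (!curl) {
        fprintf(stderr, "curl_easy_init() failed\n");
        return 1;
    }

    curl_easy_setopt(curl, CURLOPT_URL, "https://ar.wikipedia.org/wiki/%D8%B3%D9%8A_%D8%A5%D9%86_%D8%A5%D9%86_%D8%A7%D9%84%D8%B9%D8%B1%D8%A8%D9%8A%D8%A9?action=render");

    /* With no write callback set, libcurl writes the response body to stdout. */
    CURLcode result = curl_easy_perform(curl);
    if (result != CURLE_OK) {
        fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(result));
    }

    curl_easy_cleanup(curl);

    return result == CURLE_OK ? 0 : 1;
}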
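
The program above only prints the response. Since the question asks for the body in a string, one common approach is to register a write callback that appends each received chunk to a std::string. The following is a sketch rather than a complete program: the name append_to_string is made up for illustration, and the registration calls appear as comments so the block compiles without libcurl.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Write callback with the signature libcurl expects for CURLOPT_WRITEFUNCTION.
// libcurl may call it several times, once per received chunk. userdata is the
// pointer passed via CURLOPT_WRITEDATA (assumed here to be a std::string*).
std::size_t append_to_string(char* ptr, std::size_t size, std::size_t nmemb,
                             void* userdata) {
    assert(userdata != nullptr);
    std::string* body = static_cast<std::string*>(userdata);
    const std::size_t bytes = size * nmemb;  // chunk length; not NUL-terminated
    body->append(ptr, bytes);
    return bytes;  // returning a different count tells libcurl the write failed
}

// Registration, placed before curl_easy_perform(curl):
//   std::string body;
//   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_to_string);
//   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
```

After curl_easy_perform returns CURLE_OK, body holds the raw bytes of the response. For an Arabic Wikipedia page these are UTF-8 encoded HTML, so the string still contains markup that has to be stripped or parsed to obtain plain text.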

Licensed under: CC-BY-SA with attribution
