C++ cURL - how to save a full webpage to a file?

https://stackoverflow.com/questions/21820511

12-10-2022

Question

I'm trying to save a full webpage to a .txt file with C++ (Visual Studio 2013), using cURL. Everything works, but the website I'm trying to save uses a lot of JavaScript to generate the page. When I save the page with cURL, the .txt file has only ~170 lines. When I save it from Google Chrome (Ctrl+S) as an .htm file, the file has over 2000 lines. Is there any way to save the fully loaded webpage to a file? This is the code I'm using:

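#include <cstdio>    /* printf, fprintf */
#include <cstdlib>   /* malloc, realloc, free */
#include <cstring>   /* memcpy */
#include <fstream>   /* std::ofstream */
#include <curl/curl.h>

struct MemoryStruct {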
    char *memory;
    size_t size;
};

static size_t
WriteMemoryCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct MemoryStruct *mem = (struct MemoryStruct *)userp;

    /* keep the old pointer until realloc succeeds, so it is not leaked on failure */
    char *ptr = (char*)realloc(mem->memory, mem->size + realsize + 1);
    if (ptr == NULL) {
        /* out of memory! */
        printf("not enough memory (realloc returned NULL)\n");
        return 0;
    }
    mem->memory = ptr;

    memcpy(&(mem->memory[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->memory[mem->size] = 0;

    return realsize;
}


int main(void)
{
    CURL *curl_handle;
    CURLcode res;

    struct MemoryStruct chunk;

    chunk.memory = (char*)malloc(1);  /* will be grown as needed by the realloc above */
    chunk.size = 0;    /* no data at this point */

    curl_global_init(CURL_GLOBAL_ALL);

    /* init the curl session */
    curl_handle = curl_easy_init();

    /* specify URL to get */
    curl_easy_setopt(curl_handle, CURLOPT_URL, "http://www.example.com/");

    /* send all data to this function  */
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);

    /* we pass our 'chunk' struct to the callback function */
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&chunk);

    /* some servers don't like requests that are made without a user-agent
    field, so we provide one */
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "libcurl-agent/1.0");

    /* get it! */
    res = curl_easy_perform(curl_handle);

    /* check for errors */
    if (res != CURLE_OK) {
        fprintf(stderr, "curl_easy_perform() failed: %s\n",
            curl_easy_strerror(res));
    }
    else {
        /*
        * Now, our chunk.memory points to a memory block that is chunk.size
        * bytes big and contains the remote file.
        *
        * Do something nice with it!
        */

        printf("%lu bytes retrieved\n", (unsigned long)chunk.size);
    }
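    /* save the body only if the transfer succeeded; write() copies exactly
       chunk.size bytes instead of relying on a terminating NUL */
    if (res == CURLE_OK) {
        std::ofstream oplik("test.txt");
        oplik.write(chunk.memory, (std::streamsize)chunk.size);
    }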

    /* cleanup curl stuff */
    curl_easy_cleanup(curl_handle);

    if (chunk.memory)
        free(chunk.memory);

    /* we're done with libcurl, so clean it up */
    curl_global_cleanup();

    return 0;
}

Thanks for help, and sorry for my bad English.

Solution

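cURL can only save what the web server delivers: the raw HTML of the response. It does not execute JavaScript, so anything the page's scripts generate after loading never appears in the downloaded file, which would explain why your file is so much shorter than the one Chrome saves.

If you want to save anything beyond that, you must include a JavaScript interpreter (in practice, a browser engine) that builds the page the way a web browser does.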
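
As a rough way to see this in the downloaded data, you could count the `<script` tags in what cURL returned: they arrive as plain text and are never run. The sketch below is only an illustration. `count_script_tags` is a hypothetical helper, not part of libcurl, and a non-zero count is just a hint that the page may build some of its content in the browser.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libcurl): counts opening <script tags,
// case-insensitively, in HTML that has already been downloaded.
std::size_t count_script_tags(const std::string& html)
{
    // Lower-case a copy so that <SCRIPT> and <Script> are matched too.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string needle = "<script";
    std::size_t count = 0;
    // Closing tags ("</script") do not match, because of the slash after '<'.
    for (std::size_t pos = lower.find(needle); pos != std::string::npos;
         pos = lower.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}
```

In the program from the question, it could be called after a successful transfer as `count_script_tags(std::string(chunk.memory, chunk.size))`. It does not render anything; getting the generated markup still requires something that actually executes the scripts.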

Licensed under: CC-BY-SA with attribution
