Screen Scrape through login with Curl and PHP

https://stackoverflow.com/questions/19485144

01-07-2022

Question

I have been reading curl screen scraping information for hours and I can't seem to figure out what I am doing right or wrong. I am not sure how to tell if my login attempts work or not.

The goal is "simple": post to the login page, then pull data from a page behind the login.

From what I can tell from Tamper Data, the site seems to rely mainly on POST params for page navigation, so I am making two curl requests: one to log in, and one to get the HTML from the page. So far, the dump I get is this:

string(7097) "HTTP/1.1 200 OK
Set-Cookie: sp21webs=a11a060bf1DELETED000064000000; expires=Mon, 21-Oct-2013 01:47:02 GMT; path=/
Server: ""
Date: Mon, 21 Oct 2013 01:37:01 GMT
Content-type: text/html
Last-modified: Sun, 13 Oct 2013 21:54:39 GMT
Content-length: 6781
Etag: "1a7d-DELETED69f"
Accept-ranges: bytes

followed by what looks like the login page HTML.

I am not very familiar with how curl works. Here is my code:

$submit_url = "https://okbnetplaza.com/WBIG0001.html"; 

$curl = curl_init(); 
$cookie = 'cookies.txt';
$params = array (
   "__uid" => "<hidden>",
   "PIN" => "<hidden>",
   "__type" => "0001",
   "__gid" => "WBIG0001",
   "__func" => "%A3%CF%A3%CB",
   "__func2" => "%A5%ED%A5%B0%A5%A4%A5%F3",
   "RegType" => "0",
 );

curl_setopt($curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC ) ; 
curl_setopt($curl, CURLOPT_SSLVERSION,3); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2); 
curl_setopt($curl, CURLOPT_HEADER, true); 
curl_setopt($curl, CURLOPT_POST, true); 
curl_setopt($curl, CURLOPT_POSTFIELDS, $params ); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); 
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"); 
curl_setopt($curl, CURLOPT_URL, $submit_url); 
$result = curl_exec($curl); 

var_dump($result); 
curl_close($curl); 

echo "<h1> Login Work????</h1>";

$urltopost = "https://okbnetplaza.com/WBIG0001.html";
$datatopost = array (
   "__type" => "0033",
   "__gid" => "WBIG0005",
   "__func" => "%A3%CF%A3%CB",
   "AccountListType" => "1",
   "DispAccountInfo" => "00000000000000000000",
);

$ch = curl_init($urltopost);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_POSTFIELDS, $datatopost);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$returndata = curl_exec($ch);
var_dump($returndata);

I am not 100% sure the URLs are correct, because the site uses confusing JS.

The question: my current code does not seem to get through the login page. Do you see any issues with my curl requests that would stop the login? Do you see any way to make the login and scraping work?

Thanks in advance
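
For readers comparing against the code above, here is a rough sketch of one thing worth checking. It is not a confirmed fix for this site. When CURLOPT_POSTFIELDS is given an array, PHP sends the body as multipart/form-data, and values that are already percent-encoded (such as "%A3%CF%A3%CB") go out as literal text. The sketch assumes the login form is an ordinary urlencoded form, which the question does not confirm. The helper names build_post_body, post_form and looks_like_login_page are made up for illustration, and the "PIN" check is only a guess based on the field name in the login params.

```php
<?php
// Sketch only: assumes the site expects application/x-www-form-urlencoded.

// Values copied from Tamper Data are already percent-encoded, so decode them
// first and let http_build_query() encode everything exactly once.
function build_post_body(array $params): string
{
    return http_build_query(array_map('rawurldecode', $params));
}

// POSTs $params to $url, sharing cookies through $cookieJar, and returns
// the response body. The jar is written when $ch is released on return.
function post_form(string $url, array $params, string $cookieJar): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        // A string (not an array) makes cURL send a urlencoded body.
        CURLOPT_POSTFIELDS     => build_post_body($params),
        CURLOPT_COOKIEFILE     => $cookieJar,
        CURLOPT_COOKIEJAR      => $cookieJar,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Heuristic (an assumption): only the login page contains an input named PIN,
// so seeing it again after the login POST suggests the login did not succeed.
function looks_like_login_page(string $html): bool
{
    return stripos($html, 'name="PIN"') !== false;
}
```

With helpers like these, the first call would post the login params, and the second would post the account-list params with the same cookie jar path. Running looks_like_login_page() on each response gives a crude answer to "did the login work?". If the site sets its session through JavaScript, none of this will help.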

Solution

Just an FYI,

I ended up using CasperJS, then calling the script via exec from a PHP script.

Not perfect, but it is the best way to mimic browsing behavior that I could find.
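
A minimal sketch of what the PHP side of that could look like. This is not the actual script I used. The script name login.js, the option names, and the helper functions build_casper_command and run_casper are all hypothetical, and it assumes the casperjs binary is on the PATH. Options passed as --name=value are readable inside the CasperJS script through casper.cli.

```php
<?php
// Hypothetical helper: builds the shell command used to launch CasperJS.
// The binary name and script path are assumptions, not taken from the answer.
function build_casper_command(string $script, array $options = [], string $binary = 'casperjs'): string
{
    $parts = [escapeshellarg($binary), escapeshellarg($script)];
    foreach ($options as $name => $value) {
        // Each option is escaped as a whole so the shell sees one argument.
        $parts[] = escapeshellarg("--{$name}={$value}");
    }
    return implode(' ', $parts);
}

// Runs the CasperJS script and returns whatever it printed.
// Throws if the process exits with a non-zero status.
function run_casper(string $script, array $options = []): string
{
    $output = [];
    $exitCode = 0;
    // 2>&1 folds stderr into the captured output so errors are not lost.
    exec(build_casper_command($script, $options) . ' 2>&1', $output, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException(
            "casperjs exited with code {$exitCode}: " . implode("\n", $output)
        );
    }
    return implode("\n", $output);
}
```

The CasperJS script would do the login and navigation, then print the scraped HTML or JSON, which run_casper('login.js') hands back to PHP as a string. Command-line arguments are visible in the process list, so credentials are probably better read by the script from a file or environment variable than passed as options.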

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow