Website parsing: WebBrowser or HttpWebResponse

https://stackoverflow.com/questions/19541338

01-07-2022

Question

I ran into difficulties trying to parse data from my banking website. I would like to export my transaction history automatically on a daily basis, but the internet banking site offers no such feature. I am currently experimenting with simulating form filling and clicks to reach the download page and get the CSV file, which I can then parse.

I have tried several methods without success. Please point me in the right direction.

 public static void getNABLogin()
    {
        try
        {
            Console.WriteLine("ENTER to begin");
            //Console.ReadLine();
            System.Net.HttpWebRequest wr = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("https://ib.nab.com.au/nabib/index.jsp");
            wr.Timeout = 1000;
            wr.Method = "GET";
            wr.UserAgent = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36";
            wr.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
            wr.Headers.Add("Accept-Language", "en-GB,en-US;q=0.8,en;q=0.6");
            wr.Headers.Add("Accept-Encoding", "gzip,deflate,sdch");
            //wr.Connection = "Keep-Alive";
            wr.Host = "ib.nab.com.au";
            wr.KeepAlive = true;

            wr.CookieContainer = new CookieContainer();

            //////////This part will get me to the correct login page at least////////////////////
            // System.IO.Stream objStreamReceive ;
            // System.Text.Encoding objEncoding;
            // System.IO.StreamReader objStreamRead;
            // WebResponse objResponse;
            //string strOutput = string.Empty;

            //objResponse = wr.GetResponse();
            //objStreamReceive = objResponse.GetResponseStream();
            //objEncoding = System.Text.Encoding.GetEncoding("utf-8");
            //objStreamRead = new StreamReader(objStreamReceive, objEncoding); // Set function return value
            //strOutput = objStreamRead.ReadToEnd();
            ///////////////////////////////
            System.Net.HttpWebResponse wresp = (System.Net.HttpWebResponse)wr.GetResponse();

            System.Windows.Forms.WebBrowser wb = new System.Windows.Forms.WebBrowser();

            wb.DocumentStream = wresp.GetResponseStream();
            wb.ScriptErrorsSuppressed = true;

           wb.DocumentCompleted += (sndr, e) =>
            {
                /////////////After dumping the document text into a text file, I get a different page/////////////////
                //////////////I get the normal website instead of login page////////////////////////
               System.IO.StreamWriter file = new System.IO.StreamWriter("C:\\temp\\test.txt");
               Console.WriteLine(wb.DocumentText);
               file.WriteLine(wb.DocumentText);
               System.Windows.Forms.HtmlDocument d = wb.Document;

               System.Windows.Forms.HtmlElementCollection ctrlCol = d.GetElementsByTagName("script");
               foreach (System.Windows.Forms.HtmlElement tag in ctrlCol)
               {
                   tag.SetAttribute("src", string.Format("https://ib.nab.com.au{0}", tag.GetAttribute("src")));
               }


               ctrlCol = d.GetElementsByTagName("input");
               foreach (System.Windows.Forms.HtmlElement tag in ctrlCol)
               {
                   if (tag.GetAttribute("name") == "userid")
                   {
                       tag.SetAttribute("value", "123456");
                   }
                   else if (tag.GetAttribute("name") == "password")
                   {
                       tag.SetAttribute("value", "nabPassword");
                   }
                   file.WriteLine(tag.GetAttribute("name"));
               }

               file.Close();
               // object y = wb.Document.InvokeScript("validateLogin");
            };

           while (wb.ReadyState != System.Windows.Forms.WebBrowserReadyState.Complete)
           {
               System.Windows.Forms.Application.DoEvents();
           }
        }
        catch(Exception e)
        {
            System.IO.StreamWriter file = new System.IO.StreamWriter("C:\\temp\\error.txt");
            file.WriteLine(e.Message);
            Console.WriteLine(string.Format("error: {0}", e.Message));
            Console.ReadLine();
        }
    }

I call this method from a separate thread (as you probably know, WebBrowser needs an STA thread to work). As noted in the code comments, I get the login page correctly using the HttpWebResponse approach, but when I load that response into the WebBrowser via DocumentStream, I end up on a different page.

My next question is what to do once I reach the login page: how can I simulate clicks and fill in data? (My current theory is to post the data using HttpWebRequest.)

Please shed some light on this. Any comments or information would be very much appreciated. Thank you in advance.

Solution

You can use Selenium to drive a real browser to the page you want, then parse that page with HtmlAgilityPack. Both have C# support. A very simple console application can do the rest.

Selenium: http://www.seleniumhq.org/docs/02_selenium_ide.jsp#chapter02-reference

HtmlAgilityPack: https://htmlagilitypack.codeplex.com/wikipage?title=Examples

You can fill in a form and submit it like this with Selenium and C#:

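 using System;
 using OpenQA.Selenium;
 using OpenQA.Selenium.Chrome;
 using OpenQA.Selenium.Support.UI;

 // Assumption: Chrome is used here; any IWebDriver implementation works the same way
 IWebDriver driver = new ChromeDriver();
 // Navigate to the site
 driver.Navigate().GoToUrl("http://www.google.com.au");
 // Find the text input element by its name
 IWebElement query = driver.FindElement(By.Name("q"));
 // Enter something to search for
 query.SendKeys("Selenium");
 // Now submit the form
 query.Submit();
 // Google's search is rendered dynamically with JavaScript.
 // Wait for the page to load, timeout after 5 seconds
 WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(5));
 // Compare case-insensitively, because the title echoes the query as typed ("Selenium")
 wait.Until(d => d.Title.StartsWith("selenium", StringComparison.OrdinalIgnoreCase));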

And you can parse data (a table in this example) like this with HtmlAgilityPack:

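// doc is an HtmlAgilityPack.HtmlDocument that already holds the page HTML
var cols = doc.DocumentNode.SelectNodes("//table[@id='table2']//tr//td");
// SelectNodes returns null when nothing matches; cells are read in (name, age) pairs
for (int ii = 0; cols != null && ii + 1 < cols.Count; ii += 2)
{
    string name = cols[ii].InnerText.Trim();
    // Takes the second space-separated token, so the cell must hold a label followed by the number
    int age = int.Parse(cols[ii + 1].InnerText.Split(' ')[1]);
}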
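
The parsing snippet above uses a `doc` variable without showing where it comes from. One way to connect the two libraries is to load the HTML string that Selenium exposes as `driver.PageSource` into an HtmlAgilityPack `HtmlDocument`. The following is a minimal sketch under assumptions, not a definitive implementation. The `TableParser` class and `ParseNameAgeTable` method are hypothetical names. The table id `table2` comes from the example above. The sketch assumes cells alternate between a name and a value such as "Age 30", and that the HtmlAgilityPack NuGet package is installed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: turns an HTML string into (name, age) pairs.
public static class TableParser
{
    public static List<KeyValuePair<string, int>> ParseNameAgeTable(string html)
    {
        // Load the raw HTML (for example, the value of driver.PageSource)
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<KeyValuePair<string, int>>();

        // SelectNodes returns null, not an empty collection, when nothing matches
        var cols = doc.DocumentNode.SelectNodes("//table[@id='table2']//tr//td");
        if (cols == null)
        {
            return rows;
        }

        // Cells are assumed to alternate: name, then a value like "Age 30"
        for (int i = 0; i + 1 < cols.Count; i += 2)
        {
            string name = cols[i].InnerText.Trim();
            string[] parts = cols[i + 1].InnerText.Trim().Split(' ');

            // Skip pairs whose second cell does not match the assumed shape
            int age;
            if (parts.Length > 1 && int.TryParse(parts[1], out age))
            {
                rows.Add(new KeyValuePair<string, int>(name, age));
            }
        }

        return rows;
    }
}
```

Once the Selenium wait has completed, the call would be `TableParser.ParseNameAgeTable(driver.PageSource)`. Keeping the parser as a function of a plain string means it can be tried against a saved HTML file without launching a browser.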

Licensed under: CC-BY-SA with attribution
