Question

I have a complex HTML file that I need to parse in Objective-C. The HTML looks like this:

<HTML>
<TABLE width="100%" border="0" cellpadding="0" cellspacing="0">
    <tr>
        <td width="10" align="left" valign="top"><img src="http://www.indianrail.gov.in/main_text_left_top2.gif" alt="" width="8" height="8"></td>
        <td width="100%" align="left" valign="top" class="text_rail_top"><img src="http://www.indianrail.gov.in/blank.gif" alt="" width="1" height="8"></td>
        <td width="10" align="right" valign="top"><img src="http://www.indianrail.gov.in/main_text_rgt_top2.gif"alt="" width="8" height="8" ></td>
    </tr>
    <tr>
        <td height="400" align="right" valign="top" class="text_rail_left"></td>
        <td width="100%" align="left" valign="top" class="text_back_color"><table border="0" cellPadding="0" cellSpacing="0" width="100%"><tr>
            <td align="left" valign="top"><table width="100%" border="0" cellspacing="2" cellpadding="0"><tr>      <td align="middle">        <FONT SIZE = "1">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;        Indian Railways Online Website: <b><a TITLE = "Passenger Reservation System - CONCERT" href="http://www.indianrail.gov.in/index.html" target="_blank">http://www.indianrail.gov.in</b></a>  designed and hosted by CRIS.</FONT>      </td></tr></table></td>
        </tr><tr>
            <td align="left" valign="top"><table width="100%" border="0" cellspacing="2" cellpadding="0">
                <tr>
                    <td><table border="0" width="100%" /></td>
                </tr>
                <tr>
                    <td align="center" valign="top" class="inside_heading_text" colspan="4"><br />Trains Between A Pair of Stations </td>
                </tr>
                <td colspan="4"> </td>
                </tr>
        <tr>
            <td colspan="4" align="center" valign="top" class="Enq_heading"> You Queried For <SCRIPT LANGUAGE="JavaScript" SRC= "/js/inet_srcdest.js">

                function getCookie(http://www.indianrail.gov.in/tbisip_400x400.htm)</SCRIPT>
            <link href="http://www.indianrail.gov.in/cris_google.css" media="all" rel="Stylesheet" type="text/css" />
            <script language ="JavaScript">
                var searchQuery ='MUMBAI CENTRAL  DELHI          '
                </script><FORM NAME="Accavl" METHOD="POST"  ACTION="http://www.indianrail.gov.in/cgi_bin/inet_accavl_cgi1.cgi">
                    <TR>
                        <TD valign="top"><table width="98%" border="0" align="center" cellpadding="3" cellspacing="1" class="table_border">
                            <TR class="heading_table_top">
                                <TH>Origin</TH>
                                <TH>Destination</TH>
                            </TR>
                            <TR>
                                <TD class="table_border_both">MUMBAI CENTRAL -[BCT ]</TD>
                                <TD class="table_border_both">DELHI          -[DLI ]</TD>
                            </TR>
                        </TABLE>
                        </TD></TR>
                    <TR><td> </td></TR>
                    <TR>
                        <td class="main_text">Enter Quota:</td>
                        <td><SELECT NAME="lccp_quota" SIZE="1" >
                            <OPTION VALUE="CK">Tatkal Quota
                                <OPTION VALUE="LD">Ladies Quota
                                    <OPTION VALUE="DF">Defence Quota
                                        <OPTION VALUE="FT">Foreign Tourist Quota
                                            <OPTION VALUE="SS">Lower Berth Quota$
                                                <OPTION VALUE="YU">Yuva Quota
                                                    <OPTION VALUE="DP">Duty Pass Quota
                                                        <OPTION VALUE="HP">Handicaped Quota
                                                            <OPTION VALUE="PH">Parliament House
                                                                <OPTION selected VALUE="GN">General Quota
                                                                    </SELECT></TD></tr>
                    <tr>
                        <td class="main_text">Journey Date:</td><td><INPUT NAME="lccp_day" SIZE="2" VALUE="11" onchange="return changedate()"><SELECT NAME="lccp_month" SIZE="1" onClick="return changedate()"><OPTION selected VALUE="5">May<OPTION VALUE="6">Jun<OPTION VALUE="7">Jul</SELECT></TD></tr><INPUT  TYPE="HIDDEN" NAME="lccp_classopt" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class1" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class2" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class3" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class4" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class5" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class6" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class7" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class8" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_class9" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_cls10" SIZE="2" VALUE="ZZ"><INPUT  TYPE="HIDDEN" NAME="lccp_age" SIZE="2" VALUE="ADULT_AGE"><tr>
                            <td>&nbsp;</td><td><INPUT TYPE="Button" CLASS="btn_style" NAME="lccp_submitacc" ONCLICK="return submitavailability(0)" VALUE="Get Availability">&nbsp;<INPUT TYPE="Button" CLASS="btn_style" NAME="lccp_submitfare" ONCLICK="return submitfare(0)" VALUE="Get Full Fare">&nbsp;<INPUT TYPE="Button" CLASS="btn_style" NAME="lccp_submitpath" ONCLICK="return submitroute(0)" VALUE="Get Schedule">&nbsp;<INPUT TYPE="BUTTON" CLASS="btn_style" NAME="lccp_submitrun" ONCLICK="return submitrun(0)" VALUE="Get Running Status"></td></tr></table><br>
        <TABLE BORDER ALIGN=center><TABLE width="98%" border="1" bordercolor="#993300" align="center" cellpadding="3" cellspacing="1" class="table_border_both_left"><tr  class="heading_table_top">
            <TH ROWSPAN = 2 width="9%" >Train No.</TH>
            <TH ROWSPAN = 2 width="20%" >Train Name</TH>
            <TH ROWSPAN = 2 width="15%" >Origin</TH>
            <TH ROWSPAN = 2 width="8%" >Dep.Time</TH>
            <TH ROWSPAN = 2 width="14%" >Destination</TH>
            <TH ROWSPAN = 2 width="7%" >Arr.Time</TH>
            <TH COLSPAN = 7 width="7%" >Days Of Run</TH>
            <TH COLSPAN = 10 width="7%">Classes</TH>
        </TR>
        <TR class="heading_table_top">
            <TH width="3%">M</TH>
            <TH width="3%">T</TH>
            <TH width="3%">W</TH>
            <TH width="3%">T</TH>
            <TH width="3%">F</TH>
            <TH width="3%">S</TH>
            <TH width="3%">S</TH>
            <TH width="3%">1A</TH>
            <TH width="3%">2A</TH>
            <TH width="3%">FC</TH>
            <TH width="3%">3A</TH>
            <TH width="3%">CC</TH>
            <TH width="3%">SL</TH>
            <TH width="3%">2S</TH>
            <TH width="3%">3E</TH>
        </TR>
        <TR><TD><INPUT TYPE="RADIO" NAME="lccp_trndtl" VALUE="19019BDTSNZM YYYYYYYY "ONCLICK="return farefill('19019BDTSNZM YYYYYYYY ','19019','BDTS',0,0,1,0,1,0,1,0,0,0,0)" CHECKED>19019</TD>
            <TD ALIGN =Center TITLE = " Please look the following same trains list also "><A HREF="#SAMETRN">+DEHRADUN EXP   </A><A NAME="BACKSAMETRN"></A>
                <TD ALIGN =Center TITLE="Station CodeBDTS">BANDRA TERMINUS</TD>
                <TD ALIGN = Center>00:05</TD>
                <TD ALIGN = Center TITLE="Station Code NZM ">H NIZAMUDDIN   </TD>
                <TD ALIGN = Center>05:25</TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD>-</TD>
                <TD><INPUT TYPE="RADIO" Name="lccp_class2" VALUE="2A" ONCLICK="return deselectclass(1,0,1,0,1,0,1,0,0,0,0,'N','Y','N','N','N','N','N','N','N','N')"  CHECKED>
                    <TD>-</TD>
                    <TD><INPUT TYPE="RADIO" Name="lccp_class4" VALUE="3A" ONCLICK="return deselectclass(1,0,1,0,1,0,1,0,0,0,0,'N','N','N','Y','N','N','N','N','N','N')">
                        <TD>-</TD>
                        <TD><INPUT TYPE="RADIO" Name="lccp_class6" VALUE="SL" ONCLICK="return deselectclass(1,0,1,0,1,0,1,0,0,0,0,'N','N','N','N','N','Y','N','N','N','N')">
                            <TD>-</TD>
                            <TD>-</TD>
                            </TR></FONT>
        <TR><TD><INPUT TYPE="RADIO" NAME="lccp_trndtl" VALUE="19023BCT NDLSYYYYYYYY "ONCLICK="return farefill('19023BCT NDLSYYYYYYYY ','19023','BCT ',0,0,0,0,0,0,2,1,0,0,0)">19023</TD>
            <TD ALIGN =Center TITLE = " Please look the following same trains list also "><A HREF="#SAMETRN">+FZR JANATA EXP </A><A NAME="BACKSAMETRN"></A>
                <TD ALIGN =Center TITLE="Station CodeBCT ">MUMBAI CENTRAL </TD>
                <TD ALIGN = Center>07:25</TD>
                <TD ALIGN = Center TITLE="Station Code NDLS">NEW DELHI      </TD>
                <TD ALIGN = Center>12:45</TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD><FONT COLOR = green><B>Y</B></TD>
                <TD>-</TD>
                <TD>-</TD>
                <TD>-</TD>
                <TD>-</TD>
                <TD>-</TD>
                <TD><INPUT TYPE="RADIO" Name="lccp_class6" VALUE="SL" ONCLICK="return deselectclass(2,0,0,0,0,0,2,1,0,0,0,'N','N','N','N','N','Y','N','N','N','N')">
                    <TD><INPUT TYPE="RADIO" Name="lccp_class7" VALUE="2S" ONCLICK="return deselectclass(2,0,0,0,0,0,2,1,0,0,0,'N','N','N','N','N','N','Y','N','N','N')">
                        <TD>-</TD>
                        </TR></FONT>
        </TABLE>
        </BODY>
</HTML>

I want to parse the HTML using hpple to get the following output:

19019
BANDRA TERMINUS
00:05
H NIZAMUDDIN
05:25
2A
3A
SL

19023
MUMBAI CENTRAL
07:25
NEW DELHI
12:45
SL
2S

I started with the following XPath query:

NSString *tutorialsXpathQueryString = @"//table[@class='table_border_both_left']//td";

But it returns far too many results, which makes further parsing difficult. Can someone help me with an XPath query that lets me parse this more efficiently?

Thanks!


Solution

You can locate table rows with this:

// `driver` is assumed to be an already initialized Selenium WebDriver.
List<WebElement> tableRows = driver.findElements(By.xpath("//TABLE[@class='table_border_both_left']//tr[not(@class='heading_table_top')]"));

Then, within each row, extract the expected fields:

for (WebElement row : tableRows) {
    String trainNo = row.findElement(By.xpath("td[1]")).getText();  //or use xpath : td[1]/text()
    String origin = row.findElement(By.xpath("td[3]")).getText();     //or use xpath : td[3]/text()
    String deptTime = row.findElement(By.xpath("td[4]")).getText();     //or use xpath : td[4]/text()
    String destination = row.findElement(By.xpath("td[5]")).getText();     //or use xpath : td[5]/text()
    String arrTime = row.findElement(By.xpath("td[6]")).getText();     //or use xpath : td[6]/text()

    List<WebElement> radioButtons = row.findElements(By.xpath("td//input[not(@name='lccp_trndtl')]"));
    // or use xpath : //TABLE[@class='table_border_both_left']//tr[not(@class='heading_table_top')]//td//input[not(@name='lccp_trndtl')]//@value

    for (WebElement radio : radioButtons) {
        String value = radio.getAttribute("value");
    }
}

Apologies that the code is Java rather than Objective-C (I'm using Selenium WebDriver). I hope the XPath expressions themselves are still useful.
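For readers without a Selenium setup, the same row-wise idea can be tried with the JDK's built-in javax.xml.xpath. This is only a sketch under assumptions: SAMPLE is a hypothetical, trimmed-down, well-formed stand-in for the results table (the real page is not well-formed XML and would need an HTML parser first), and the names TrainRowXPath and extract are made up for illustration. XML matching is case-sensitive, so the expressions use the upper-case names of the sample.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TrainRowXPath {
    // Hypothetical, well-formed stand-in for the results table (not the real page).
    static final String SAMPLE =
          "<TABLE class='table_border_both_left'>"
        + "<TR class='heading_table_top'><TH>Train No.</TH></TR>"
        + "<TR><TD><INPUT TYPE='RADIO' NAME='lccp_trndtl' VALUE='19019BDTS'/>19019</TD>"
        + "<TD><A>+DEHRADUN EXP   </A></TD><TD>BANDRA TERMINUS</TD><TD>00:05</TD>"
        + "<TD>H NIZAMUDDIN   </TD><TD>05:25</TD><TD>-</TD>"
        + "<TD><INPUT TYPE='RADIO' NAME='lccp_class2' VALUE='2A'/></TD>"
        + "<TD><INPUT TYPE='RADIO' NAME='lccp_class4' VALUE='3A'/></TD>"
        + "<TD><INPUT TYPE='RADIO' NAME='lccp_class6' VALUE='SL'/></TD></TR>"
        + "<TR><TD><INPUT TYPE='RADIO' NAME='lccp_trndtl' VALUE='19023BCT'/>19023</TD>"
        + "<TD><A>+FZR JANATA EXP </A></TD><TD>MUMBAI CENTRAL </TD><TD>07:25</TD>"
        + "<TD>NEW DELHI      </TD><TD>12:45</TD><TD>-</TD>"
        + "<TD><INPUT TYPE='RADIO' NAME='lccp_class6' VALUE='SL'/></TD>"
        + "<TD><INPUT TYPE='RADIO' NAME='lccp_class7' VALUE='2S'/></TD></TR>"
        + "</TABLE>";

    // Returns one list per data row: train no., origin, dep. time,
    // destination, arr. time, followed by the offered class codes.
    static List<List<String>> extract(String xml) {
        XPath xp = XPathFactory.newInstance().newXPath();
        List<List<String>> result = new ArrayList<>();
        try {
            // Select every row of the results table except the two header rows.
            NodeList rows = (NodeList) xp.evaluate(
                "//TABLE[@class='table_border_both_left']//TR[not(@class='heading_table_top')]",
                new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                List<String> fields = new ArrayList<>();
                // Columns 1, 3, 4, 5, 6 relative to the row; normalize-space trims padding.
                for (int col : new int[] {1, 3, 4, 5, 6}) {
                    fields.add(xp.evaluate("normalize-space(TD[" + col + "])", row));
                }
                // Class radio buttons: every INPUT except the train selector.
                NodeList classes = (NodeList) xp.evaluate(
                    "TD/INPUT[not(@NAME='lccp_trndtl')]/@VALUE", row, XPathConstants.NODESET);
                for (int j = 0; j < classes.getLength(); j++) {
                    fields.add(classes.item(j).getNodeValue());
                }
                result.add(fields);
            }
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
        return result;
    }

    public static void main(String[] args) {
        for (List<String> row : extract(SAMPLE)) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Evaluating the per-column expressions relative to each row keeps the values of one train together, which is what makes the result easier to post-process than a flat list of every td.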

OTHER TIPS

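You can use an XPath union expression (the | operator) to return both the direct text() children of the TD elements and the @VALUE attributes of the INPUT elements. In XPath 2.0 this can be written as a single path:

//TABLE[@class='table_border_both_left']//TD/(text() | INPUT[@TYPE eq "RADIO"]/@VALUE)

hpple is built on libxml2, which implements XPath 1.0 only, so there the union has to join two complete paths and use = instead of eq:

//TABLE[@class='table_border_both_left']//TD/text() | //TABLE[@class='table_border_both_left']//TD/INPUT[@TYPE='RADIO']/@VALUE

libxml2's HTML parser lowercases element and attribute names, so under hpple the names in the query would probably need to be lowercase as well (table, td, input, @type, @value).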
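As a rough check of an XPath 1.0 spelling of this union, the sketch below runs it through the JDK's javax.xml.xpath, which is also an XPath 1.0 engine. The input is a hypothetical, well-formed one-row fragment rather than the real page, and the names UnionXPathSketch and values are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class UnionXPathSketch {
    // Hypothetical, well-formed one-row stand-in for the results table.
    static final String SAMPLE =
          "<TABLE class='table_border_both_left'><TR>"
        + "<TD><INPUT TYPE='RADIO' NAME='lccp_trndtl' VALUE='19019BDTS'/>19019</TD>"
        + "<TD>BANDRA TERMINUS</TD><TD>00:05</TD><TD>-</TD>"
        + "<TD><INPUT TYPE='RADIO' NAME='lccp_class2' VALUE='2A'/></TD>"
        + "</TR></TABLE>";

    static final String TABLE = "//TABLE[@class='table_border_both_left']";

    // XPath 1.0 form: the union joins two complete location paths.
    static final String UNION =
        TABLE + "//TD/text() | " + TABLE + "//TD/INPUT[@TYPE='RADIO']/@VALUE";

    // Returns the trimmed, non-empty values of all text and attribute nodes matched.
    static List<String> values(String xml) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xp.evaluate(
                UNION, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getNodeValue() yields the text of a text node or the value of an attribute.
                String value = nodes.item(i).getNodeValue().trim();
                if (!value.isEmpty()) {
                    out.add(value);
                }
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        values(SAMPLE).forEach(System.out::println);
    }
}
```

Note that the union is a flat list: it also picks up the "-" placeholders and the VALUE of the lccp_trndtl train selector, so those still have to be filtered out afterwards (or excluded with extra predicates), and the values are not grouped per train.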
Licensed under: CC-BY-SA with attribution