Question

I am trying to parse HTML using jsoup.

I used "Try jsoup" to check that the HTML is parsed correctly.

My code is :

    URL url = new URL("http://tw.search.bid.yahoo.com/search/ac;_ylt=AtqkyTO06sgGHho20HzmPEX3_rF8?ei=UTF-8&p=%E8%A1%A3%E6%9C%8D");
    Document doc;
    try {
        doc = Jsoup.parse(url, 3000);
        Elements descriptions = doc.select("div#srp_sl_result div.att-item");

        for (Element element : descriptions) {  
            System.out.println(element.ownText());
            System.out.println("--------------");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

But the results come back empty. I get the following output:

--------------

--------------

--------------

I am expecting output like:

女裝手套衣服*艾爾莎*暗釦長款披風式毛衣罩衫外套S~L【TAA1166】 出價 799 元 直購 799 元 運費80元 |    
30 次 | 剩 16小時 60分 賣家:艾爾莎時尚精品 (評價 25229) 在新北市
☆意樂舖☆【塑鋼衣架】ABS強化多功能神奇魔術衣架(收納衣服.領帶.皮帶.肩帶) 出價 35 元 直購 35 元 運費
55元 | 8 次 | 1天 6小時 賣家:意樂舖(創意樂園小舖) (評價 14613) 在新北市
HappyLife【YK1324】韓國超人氣乾濕兩用衣架 防滑魔術衣架 止滑衣架 衣服衣櫃衣櫥收納 出價 25 元 直購 
25 元 運費70元 | 16 次 | 2天 3小時 賣家:HappyLife快樂生活網 (評價 14360) 在新北市

Here is some sample HTML from the search page:

 <div class="att-item item yui3-g " data-url="https://login.yahoo.com/config/login?.intl=tw&amp;.pd=c%3D3Chd7Yq72e502eh4R99sgUvi5Q--&amp;.done=https%3A%2F%2Ftw.search.bid.yahoo.com%2Fsearch%2Fauction%2Fproduct%3Fei%3DUTF-8%26p%3D%25E8%25A1%25A3%25E6%259C%258D&amp;rr=2465463942"> 
    <div class="yui3-u"> 
        <div class="srp-pdimage"> 
            <a href="https://tw.page.bid.yahoo.com/tw/auction/e79010279;_ylt=ApstmFiftkQPQ2krNhqCT3xyFbN8;_ylv=3"> <img height="120" alt=" (DAJIN達錦衣服設計中心)棒壘球帽字凸繡200元,棒球帽,帽子,棒壘球服,棒球衣 " src="https://s.yimg.com/hg/ac/30/ea/e79010279-ac-4511xf9x0430x0600-s.jpg" /> </a> 
        </div> 
     </div> 
 </div>

What should I change in my code to get the expected output?

Please help me!

Solution

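Use the text() method instead of ownText(). ownText() returns only the text that sits directly inside the element itself, and div.att-item contains nothing but child elements. As the documentation states, text():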

Gets the combined text of this element and all its children.

Here is an updated example:

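import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is arbitrary; only the body of main matters.
public class SearchResultsExample {
    public static void main(String[] args) throws MalformedURLException {
        URL url = new URL("http://tw.search.bid.yahoo.com/search/"
                + "ac;_ylt=AtqkyTO06sgGHho20HzmPEX3_rF8?ei=UTF-8&p=%E8%A1%A3%E6%9C%8D");

        Document doc;
        try {
            // Fetch and parse the page with a 3000 ms timeout.
            doc = Jsoup.parse(url, 3000);
            Elements descriptions = doc.select("div#srp_sl_result div.att-item");

            for (Element element : descriptions) {
                // text() includes the text of all descendants, unlike ownText().
                System.out.println(element.text());
                System.out.println("--------------");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}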
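
To see the difference without hitting the network, here is a minimal sketch that parses a small inline HTML string. The markup is a made-up, trimmed-down stand-in for one search result, and the class and method names (OwnTextVsText, ownTextOfItem, textOfItem) are assumptions for illustration only. It relies on Jsoup.parse(String), Element.ownText() and Element.text().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextVsText {
    // Hypothetical, shortened stand-in for one div.att-item from the page.
    static final String HTML =
          "<div class=\"att-item\">"
        + "<div class=\"srp-pdtitle\"><a href=\"#\">Sample item</a></div>"
        + "<div class=\"srp-pdprice\"><span>Price</span> <em>399</em></div>"
        + "</div>";

    // Parses the sample markup and returns the outer item element.
    static Element firstItem() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("div.att-item").first();
    }

    // Text owned directly by the outer div (it has none).
    static String ownTextOfItem() {
        return firstItem().ownText();
    }

    // Combined text of the outer div and all of its descendants.
    static String textOfItem() {
        return firstItem().text();
    }

    public static void main(String[] args) {
        System.out.println("ownText: [" + ownTextOfItem() + "]");
        System.out.println("text:    [" + textOfItem() + "]");
    }
}
```

The first line prints empty brackets because the outer div holds only child elements, while the second line contains the title and price text gathered from the nested elements.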

OTHER TIPS

I visited the page you are trying to parse and ran this in the browser console:

$('div#srp_sl_result div.att-item')

The search returned a div:

<div class="att-item item yui3-u" data-url="https://login.yahoo.com/config/login?.intl=tw&amp;.pd=c%3D3Chd7Yq72e502eh4R99sgUvi5Q--&amp;.done=https%3A%2F%2Ftw.search.bid.yahoo.com%2Fsearch%2Fauction%2Fproduct%3Fei%3DUTF-8%26p%3D%25E8%25A1%25A3%25E6%259C%258D&amp;rr=3456505015" id="yui_3_14_1_3_1394093660536_452">
        <div class="wrap" id="yui_3_14_1_3_1394093660536_451">
            <div class="srp-pdimage" id="yui_3_14_1_3_1394093660536_450">
                <a href="https://tw.page.bid.yahoo.com/tw/auction/f61398121;_ylt=Ali1FeHY3kStUUeBmGO4vupyFbN8;_ylv=3?u=Y2583393636" id="yui_3_14_1_3_1394093660536_456">
                    <img width="200" alt=" HappyLife【SP323】納川6+1家庭裝真空收納袋/真空袋/壓縮袋/棉被衣物衣服收納~附吸氣管 " src="https://s.yimg.com/hg/ac/b6/51/f61398121-ac-6849xf8x0600x0400-s.jpg" id="yui_3_14_1_3_1394093660536_455">
                </a>
            </div>
            <div class="srp-pdhead">
                <div class="srp-pdinfo">
                    <a class="srp-bid" href="https://tw.page.bid.yahoo.com/tw/show/bid_hist;_ylt=Ahu0X7QeYNL6gEwV.IhDhWlyFbN8;_ylv=3?aID=f61398121">6 次</a>
                    <span>出價</span>
                    <em>399</em>
                    <span>元</span>
                    <span class="sep">|</span>
                </div>
                <div class="srp-pdprice">
                    <span>直購</span>
                    <em>399</em>
                    <span>元</span>
                </div>
            </div>
            <div class="srp-pdtitle">
                <a href="https://tw.page.bid.yahoo.com/tw/auction/f61398121;_ylt=AiNoFG2AOvisNBiTc.AyjgxyFbN8;_ylv=3?u=Y2583393636"> HappyLife【SP323】納川6+1家庭裝真空收納袋/真空袋/壓縮袋/棉被衣物衣服收納~附吸氣管 </a>
            </div>
            <div class="srp-pdftitle">
                <a href="https://tw.page.bid.yahoo.com/tw/auction/f61398121;_ylt=AiNoFG2AOvisNBiTc.AyjgxyFbN8;_ylv=3?u=Y2583393636"> HappyLife【SP323】納川6+1家庭裝真空收納袋/真空袋/壓縮袋/棉被衣物衣服收納~附吸氣管 </a>
            </div>
            <div class="srp-pdstore">
                <a class="srp-ico" href="https://tw.help.yahoo.com/auct/policy/protection.html#reward" alt="享買賣家五萬保障"></a>
                <a href="http://tw.user.bid.yahoo.com/tw/user/Y2583393636;_ylt=Akxsb34F0Y37vNFzvvX8aldyFbN8;_ylv=3">HappyLife快樂生活網</a>
            </div>
        </div>
    </div>

So I don't understand why so many elements are returned. In any case, element.ownText() returns only the text that belongs directly to that div, excluding the text of any inner elements. Nothing is printed because the div has no text of its own, only child elements.
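
If you want individual fields rather than one combined string, one option is to run further selectors on each item. The sketch below is only an illustration: the inline HTML is a shortened, hypothetical copy of the markup above with placeholder values, and the names ItemFields, titleOf and buyNowPriceOf are made up. It assumes the srp-pdtitle and srp-pdprice class names from the markup above still match the live page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ItemFields {
    // Hypothetical, shortened copy of one result item (placeholder values).
    static final String HTML =
          "<div id=\"srp_sl_result\">"
        + "<div class=\"att-item item\">"
        + "<div class=\"srp-pdprice\"><span>buy now</span><em>399</em></div>"
        + "<div class=\"srp-pdtitle\"><a href=\"#\"> Sample storage bag </a></div>"
        + "</div>"
        + "</div>";

    // Title text from the link inside div.srp-pdtitle.
    static String titleOf(Element item) {
        return item.select("div.srp-pdtitle a").text();
    }

    // Buy-now price from the <em> inside div.srp-pdprice.
    static String buyNowPriceOf(Element item) {
        return item.select("div.srp-pdprice em").text();
    }

    // Parses the sample markup and returns the first result item.
    static Element sampleItem() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("div#srp_sl_result div.att-item").first();
    }

    public static void main(String[] args) {
        Element item = sampleItem();
        System.out.println(titleOf(item) + " | " + buyNowPriceOf(item));
    }
}
```

Selecting each field separately lets you control the order and separators of the output, at the cost of depending on more class names that the site may change.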

Licensed under: CC-BY-SA with attribution