JSoup parsing HTML
27-10-2019
Problem
I am trying to use Jsoup to parse a non-well-formed HTML file with a DTD, which I retrieve through an InputStream, and get all the data in the TD fields. How can I do that with Jsoup? I already looked at http://jsoup.org/cookbook/ but I need an example to get started.
Thank you in advance.
I already tried a SAX parser, but I can't get the DTD to work.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="nl" lang="nl">
<TABLE class=personaltable cellSpacing=0 cellPadding=0>
<TBODY>
<TR class=alternativerow>
<TD>Nieuw beltegoed:</TD>
<TD>€ 1,00</TD></TR>
<TR>
<TD>Tegoed vorige periode:
<TD>€ 2,00</TD></TD></TR>
<TR class=alternativerow>
<TD>Tegoed tot 09-11-2011:
<TD>€ 10,00</TD></TD></TR>
<TR>
<TD>
<TD height=25></TD>
<TR class=alternativerow>
<TD>Verbruik sinds nieuw tegoed:</TD>
<TD>€ 0,33</TD></TR>
<TR>
<TD>Ongebruikt tegoed:</TD>
<TD>€ 12,00</TD></TR>
<TR class=alternativerow>
<TD class=f-Orange>Verbruik boven bundel:</TD>
<TD class=f-Orange>€ 0,00</TD></TR>
<TR>
<TD>Verbruik dat niet in de bundel zit*:</TD>
<TD>€ 0,00</TD></TR>
</TBODY>
</TABLE>
</html>
Edit: I am getting a force close; I need to use Jsoup in my AsyncTask. Here is the Logcat output:
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): FATAL EXCEPTION: main
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): java.lang.NullPointerException
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at com.sencide.AndroidLogin$MyTask.onPostExecute(AndroidLogin.java:276)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at com.sencide.AndroidLogin$MyTask.onPostExecute(AndroidLogin.java:1)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.AsyncTask.finish(AsyncTask.java:417)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.AsyncTask.access$300(AsyncTask.java:127)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.AsyncTask$InternalHandler.handleMessage(AsyncTask.java:429)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.Handler.dispatchMessage(Handler.java:99)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.Looper.loop(Looper.java:130)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.app.ActivityThread.main(ActivityThread.java:3835)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at java.lang.reflect.Method.invokeNative(Native Method)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at java.lang.reflect.Method.invoke(Method.java:507)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:847)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:605)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at dalvik.system.NativeStart.main(Native Method)
Here is the AsyncTask code:
public class MyTask extends AsyncTask<String, Integer, String> {
private Elements tdsFromSecondColumn=null;
}
protected String doInBackground(String... params) {
InputStream inputStreamActivity = response.getEntity().getContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStreamActivity));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
sb.append(line + "\n");
}
/******* CLOSE CONNECTION AND STREAM *******/
System.out.println(sb);
inputStreamActivity.close();
String kpn;
kpn = sb.toString();
Document doc = Jsoup.parse(kpn);
Elements tdsFromSecondColumn = doc.select("table.personaltable td:eq(1)");
}
@Override
protected void onPostExecute(String result) {
//publishProgress(false);
TextView tv = (TextView)findViewById(R.id.lbl_top);
for (Element tdFromSecondColumn : tdsFromSecondColumn) {
//System.out.println(tdFromSecondColumn.text());
tv.setText("");
tv.setText(tdFromSecondColumn.text());
}
}
}
Solution
So, you have an InputStream and not a URL? You should then use the Jsoup#parse() method which takes an InputStream:
Document document = Jsoup.parse(inputStream, charsetName, baseUri);
// ...
The charsetName should be the charset the document is originally encoded in. You can leave it null to let Jsoup decide or fall back to UTF-8. The baseUri should be the URL from which the HTML was originally served. You can leave it null; you just won't be able to resolve relative links.
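As a rough, self-contained sketch of the InputStream variant: the sample markup, the example.com base URI and the class and method names below are made up for illustration, and a ByteArrayInputStream stands in for the real response stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStreamSketch {

    // Hypothetical sample markup standing in for the real response body.
    static final String HTML =
            "<html><body><a href=\"page.html\">link</a></body></html>";

    // Parses the sample from an InputStream and returns the absolute URL
    // of the first link, resolved against the given base URI.
    static String resolveLink(String baseUri) {
        byte[] bytes = HTML.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            Document document = Jsoup.parse(in, "UTF-8", baseUri);
            return document.select("a").first().absUrl("href");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed base URI; use the URL the HTML was actually served from.
        System.out.println(resolveLink("http://example.com/dir/"));
    }
}
```

With a base URI, the relative href can be turned into an absolute URL via absUrl(); without one, only the raw attribute value is available.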
But if you actually have the original URL, then you could also just use Jsoup#connect():
Document document = Jsoup.connect(url).get();
// ...
Regardless of how you obtained the Document, you can use CSS selectors to select elements of interest in the document. See also the Jsoup cookbook on that subject. Here's an example which extracts all the data from the 2nd column of the <table> with a class name of personaltable:
Elements tdsFromSecondColumn = document.select("table.personaltable td:eq(1)");
for (Element tdFromSecondColumn : tdsFromSecondColumn) {
    System.out.println(tdFromSecondColumn.text());
}
which results in:
€ 1,00
€ 2,00
€ 10,00
€ 0,33
€ 12,00
€ 0,00
€ 0,00
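As for the NullPointerException in the edit: one likely cause, judging from the posted code, is that doInBackground() declares a new local variable named tdsFromSecondColumn instead of assigning to the field of the same name, so the field is still null when onPostExecute() iterates over it. The following plain-Java sketch (no Android or Jsoup involved; the class and method names are made up for illustration) shows the effect in isolation:

```java
import java.util.Arrays;
import java.util.List;

public class ShadowingSketch {

    // Field read by a later step, like tdsFromSecondColumn in MyTask.
    private List<String> values = null;

    // Buggy variant: the type in front of the name declares a new local
    // variable that hides the field, so the field stays null.
    void loadShadowed() {
        List<String> values = Arrays.asList("a", "b");
    }

    // Fixed variant: no type in front of the name, so the field is assigned.
    void loadIntoField() {
        values = Arrays.asList("a", "b");
    }

    boolean fieldIsNull() {
        return values == null;
    }

    public static void main(String[] args) {
        ShadowingSketch sketch = new ShadowingSketch();
        sketch.loadShadowed();
        System.out.println(sketch.fieldIsNull()); // true: field was never assigned
        sketch.loadIntoField();
        System.out.println(sketch.fieldIsNull()); // false: field now holds the list
    }
}
```

In MyTask, that would presumably mean dropping the Elements type in front of the assignment in doInBackground(), so that the field is set before onPostExecute() runs.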