HtmlUnit and Fragment Identifiers

https://stackoverflow.com/questions/4588199

14-10-2019

Question

I am asking how to deal with fragment identifiers: a link I want to grab information from contains one. It seems as if HtmlUnit discards the "#/db4mj" part of my URL and therefore loads the original URL.

Does anyone know of a way to deal with fragment identifiers? (I can post example code to explain further if needed.)

Edit

Since this has not been getting many views (or answers), I am going to add a bounty. Sorry it is only 50, but I only had 79 to start with.

Edit

Here is a code example as requested.

Our URL will be: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0

If you take a look at the content at that link, you will see several brushes, each of which has a URL of its own. So my script grabs the URL: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4

As you can see, it carries the fragment identifier #/dbwam4. Now I try to grab the content that is on this page, but HtmlUnit still thinks it is on the original URL.

Here is sample code from my script, which fails on the URL with the fragment identifier but has no problem with the original URL.

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

page = client.getPage(url)       // url with fragment identifier

// this element exists only on the URL with the fragment identifier, not on the original URL
// (single quotes inside the XPath so the Groovy string literal is not cut short)
img = page.getByXPath("//*[@id='gmi-ResViewSizer_img']")

I expected to be able to grab some information from the URL with the fragment identifier, but I cannot access it at all.

Solution

There is good news and bad news.

First, the good news: HtmlUnit appears to be working just fine.

If you visit the page with the fragment identifier URL in a browser with JavaScript turned off (for example using Firefox's QuickJava plugin), you will not see the "single brush view" that you want. The fragment is never sent to the server; it is the page's own script that reads it and swaps in the brush view.

So in order to acquire this page you need a WebClient with JavaScript enabled, i.e. call setJavaScriptEnabled(true).
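To make the point about the fragment concrete, here is a minimal sketch using only java.net.URI. It splits the question's URL into the part that goes into the HTTP request and the part that stays on the client. The class name FragmentDemo and the helper requestTarget are hypothetical names chosen for this illustration; they are not part of HtmlUnit.

```java
import java.net.URI;

class FragmentDemo {

    /** Builds the path-plus-query string that an HTTP client puts in the request line. */
    public static String requestTarget(String url) {
        URI uri = URI.create(url);
        String target = uri.getRawPath();
        if (uri.getRawQuery() != null) {
            target += "?" + uri.getRawQuery();
        }
        // The fragment is deliberately left out: it is not transmitted to the server.
        return target;
    }

    public static void main(String[] args) {
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/"
                + "?order=9&offset=0#/dbwam4";
        System.out.println("Sent to server:   " + requestTarget(url));
        System.out.println("Kept client-side: " + URI.create(url).getRawFragment());
    }
}
```

Both the original URL and the fragment URL produce the same request, so with scripting disabled the two responses cannot differ.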

And now the bad news:

I have not been able to acquire the "single brush view" page consistently using HtmlUnit with JavaScript turned on (I don't know why), although I have been able to acquire the full page on occasion.

The real problem is that the returned HTML is in such a bad state that it defied my attempts to parse it (I tried TagSoup, jsoup, Jaxen, etc.). I therefore suspect that parsing the page using XPath may not work for you.

So I think you will need to resort to regular expressions (which is far from ideal) or even some variant of String.indexOf("gmi-ResViewSizer_img").
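As a rough sketch of the regular-expression fallback, the following pulls the src attribute out of the tag carrying a given id, working on the raw page text. It assumes the target is an img tag with quoted attributes, which the answer does not confirm; the class ImgSrcExtractor, the method findImgSrc and the sample HTML are hypothetical. Like any regex over HTML it is fragile (it would, for instance, also accept a data-src attribute).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {

    // Finds a quoted src attribute inside a single tag.
    private static final Pattern SRC = Pattern.compile(
            "\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the src of the first img tag with the given id, if one is found. */
    public static Optional<String> findImgSrc(String html, String id) {
        // Locate the whole <img ...> tag whose id attribute equals the requested id.
        Pattern tag = Pattern.compile(
                "<img\\b[^>]*\\bid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"'][^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher tagMatcher = tag.matcher(html);
        if (!tagMatcher.find()) {
            return Optional.empty();
        }
        // Search for src only within the matched tag, not the whole page.
        Matcher srcMatcher = SRC.matcher(tagMatcher.group());
        return srcMatcher.find() ? Optional.of(srcMatcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the page source.
        String html = "<div><img src=\"http://example.com/brush.jpg\" "
                + "id=\"gmi-ResViewSizer_img\" alt=\"brush\"></div>";
        System.out.println(findImgSrc(html, "gmi-ResViewSizer_img").orElse("not found"));
    }
}
```

With HtmlUnit the input string could come from the page's raw response content rather than from the parsed DOM.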

I hope this helps.

EDIT

I managed to get something that works sporadically. I'm afraid I am not a Groovy convert yet, so it is in plain old Java.

I haven't looked at the source of HtmlUnit, but it is almost as if something in the process of running the save helps make the parsing work. Without the save I seem to get NullPointerExceptions.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.FalsifyingWebConnection;
import java.io.File;
import java.io.IOException;

public class TestProblem {

    public static void main(String[] args) throws IOException {
        WebClient client = new WebClient(BrowserVersion.FIREFOX_3_6);
        client.setJavaScriptEnabled(true);
        client.setCssEnabled(false);
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";
        client.setThrowExceptionOnScriptError(false);
        client.setThrowExceptionOnFailingStatusCode(false);
        client.setWebConnection(new FalsifyingWebConnection(client) {

            @Override
            public WebResponse getResponse(final WebRequest request) throws IOException {
                if ("www.google-analytics.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("d.unanimis.co.uk".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("edge.quantserve.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                if ("b.scorecardresearch.com".equals(request.getUrl().getHost())) {
                    return createWebResponse(request, "", "application/javascript"); // -> empty script
                }
                //
                if (request.getUrl().toString().startsWith("http://st.deviantart.net/css/v6core_jc.js")) {
                    WebResponse wr = super.getResponse(request);
                    return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                }
                if (request.getUrl().toString().startsWith("http://st.deviantart.net/css/v6loggedin_jc.js")) {
                    WebResponse wr = super.getResponse(request);
                    return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                }
                return super.getResponse(request);
            }
        });

        HtmlPage page = client.getPage(url);       //url with fragment identifier



        File saveFile = new File("saved.html");
        if(saveFile.exists()){
            saveFile.delete();
            saveFile = new File("saved.html");
        }
        page.save(saveFile);


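        HtmlElement img = page.getElementById("gmi-ResViewSizer_img");
        // img may be null if the scripts did not render the single brush view
        System.out.println(img != null ? img.toString() : "gmi-ResViewSizer_img not found");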

    }
}
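
The four tracker checks in the listing differ only in the host name, so they could be collapsed into one set lookup. A minimal sketch, with the hypothetical helper class TrackerHosts (not part of HtmlUnit) and the host names copied from the listing:

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TrackerHosts {

    // Hosts that the listing answers with an empty script.
    private static final Set<String> BLOCKED = new HashSet<>(Arrays.asList(
            "www.google-analytics.com",
            "d.unanimis.co.uk",
            "edge.quantserve.com",
            "b.scorecardresearch.com"));

    /** True if requests to this host should be answered with an empty script. */
    public static boolean isBlocked(String host) {
        return host != null && BLOCKED.contains(host);
    }

    public static void main(String[] args) {
        String host = URI.create("http://www.google-analytics.com/ga.js").getHost();
        System.out.println(host + " blocked: " + isBlocked(host));
    }
}
```

Inside getResponse, the four if blocks would then shrink to a single check of TrackerHosts.isBlocked(request.getUrl().getHost()) followed by the same createWebResponse(request, "", "application/javascript") call.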

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow