How can I fetch the search results (as HTML) for a given keyword and iterate over all result pages in Talend?

StackOverflow https://stackoverflow.com//questions/25018691

Question

I have a working job that reads user-specified keywords from keywords.csv and then runs a search query against a website for each keyword, inserting the keyword into the search URL. This used to be easy, because I could display all results on one page with the URL parameter li=100000. The site has since changed, so I now have to fetch the individual result pages and extract the results from each of them.

The problem is that I'm new to Talend and inherited this job from a colleague. Here is an overview of it, still using the old li=100000 approach and writing all results into separate text files (keyword_1.txt, keyword_2.txt, ...):

[Screenshot: overview of the Talend job]

Here is a sample query from the job portal website absolventa:

http://www.absolventa.de/stellenangebote?query%5Bcity%5D=&query%5Bradius%5D=100&query%5Btext%5D=SAP&utf8=%E2%9C%93

My question now is: how can I add a step in Talend that grabs page 1, then page 2, and so on, for each specified keyword, and integrate it into my existing process? Please explain at a basic level, since I'm new to Talend.

Thank you SO much in advance!

Solution

You'll want to use a tLoop component, set to a for loop, to iterate over all the possible pages.

Connect the tLoop directly after your tFlowToIterate component (between it and whatever component requests the page), then concatenate the loop iteration variable into your URL with something like:

"http://www.absolventa.de/stellenangebote?page=" +
((Integer)globalMap.get("tLoop_1_CURRENT_ITERATION")) +
"&query%5Bcity%5D=&query%5Bradius%5D=100&query%5Btext%5D=SAP&utf8=%E2%9C%93"
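To see what that expression evaluates to outside of Talend, here is a minimal standalone Java sketch. It assumes a plain HashMap as a stand-in for the globalMap that Talend provides inside a job, and the "keyword" key is purely hypothetical (your job will expose the current keyword under whatever name your tFlowToIterate generates). It also URL-encodes the keyword, which the hard-coded SAP example above does not need but a keyword with spaces would.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

class PageUrlSketch {
    // Base of the search URL taken from the question.
    static final String BASE = "http://www.absolventa.de/stellenangebote";

    // Builds the search URL for one keyword and one result page.
    static String buildUrl(String keyword, int page) {
        try {
            return BASE + "?page=" + page
                + "&query%5Bcity%5D=&query%5Bradius%5D=100&query%5Btext%5D="
                + URLEncoder.encode(keyword, "UTF-8") // spaces become '+'
                + "&utf8=%E2%9C%93";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for Talend's globalMap; in a real job Talend fills this for you.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tLoop_1_CURRENT_ITERATION", 2);
        globalMap.put("keyword", "SAP"); // hypothetical key, for illustration only

        int page = (Integer) globalMap.get("tLoop_1_CURRENT_ITERATION");
        String keyword = (String) globalMap.get("keyword");
        System.out.println(buildUrl(keyword, page));
    }
}
```

Inside Talend you would not need the class or the map, only the concatenation expression itself in the URL field of the component that fetches the page.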

I'm not sure how you could make the loop stop as soon as a page returns nothing (you'd need a while loop whose condition can be set by the processing inside the loop). But if you set your output file to append, a page that returns nothing simply adds nothing to the file. You'd still have to set the number of iterations in the tLoop high enough to cover every possible case, which means making lots of pointless requests for those empty pages.

You could probably rework it into a while loop with some extra effort, but I don't have much experience with that.
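For what it's worth, the while-loop idea could look roughly like this in plain Java. This is only a sketch of the stopping logic, not a Talend job: fetchPage is a hypothetical function standing in for whatever component downloads a page, and maxPages is an assumed safety cap so the loop cannot run forever. A real "empty" result page will most likely still contain HTML, so the emptiness check would have to look for the result entries rather than a blank string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

class PagedFetchSketch {
    // Fetches pages 1, 2, 3, ... and stops at the first empty page
    // or once maxPages pages have been collected, whichever comes first.
    static List<String> fetchAllPages(IntFunction<String> fetchPage, int maxPages) {
        List<String> pages = new ArrayList<>();
        int page = 1;
        boolean more = true; // the flag a while-type loop would test
        while (more && page <= maxPages) {
            String body = fetchPage.apply(page);
            if (body == null || body.trim().isEmpty()) {
                more = false; // nothing came back: stop requesting pages
            } else {
                pages.add(body);
                page++;
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Fake fetcher for illustration: only pages 1 to 3 have results.
        IntFunction<String> fake = p -> p <= 3 ? "results of page " + p : "";
        System.out.println(fetchAllPages(fake, 50).size()); // prints 3
    }
}
```

In a Talend job the same idea would mean keeping the "more" flag in globalMap, updating it after each page has been processed, and having the loop condition read it, so only one request is wasted on the first empty page.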

Licensed under: CC-BY-SA with attribution