Question

Summary
I am working on a PHP news crawler project and want to pull RSS news feeds from nearly a hundred news websites using wget (version 1.12), capturing the whole RSS feed files into a single directory (no hierarchy) on the local server, with these requirements:

  • Some of these websites do not have an RSS feed, so I will have to capture and parse their HTML, but for now I can concentrate on the XML feeds.
  • All feed files from all websites in one directory.
  • No extra content should be downloaded; any extra content (such as images) should stay hosted on the remote server.
  • Performance is important
  • Feed files need to be renamed before saving, according to my convention of source.category.type.xml (each remote XML URL has its own source, category and type, but not in my naming convention).
  • Some of these feeds do not include a news timestamp (such as <pubDate>), so I need an approach to handle news time that is robust and always works, even if the resulting times are slightly off.
  • To automate this, I need to run the wget command from a cron job on a regular basis.

url-list.txt includes:

http://source1/path/to/rss1  
http://source2/diiferent/path/to/rss2  
http://source3/path/to/rss3  
.  
.  
.  
http://source100/different/path/to/rss100

I want this:

localfeed/source1.category.type.xml  
localfeed/source2.category.type.xml  
localfeed/source3.category.type.xml  
.  
.  
.  
localfeed/source100.category.type.xml

Category and type can each take one of several predefined values, like sport, ...
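Since wget cannot derive source, category and type from the remote URLs, I am thinking of keeping that mapping myself. Below is a minimal sketch in PHP of what I have in mind; the array contents and the category names are only hypothetical examples, not my real configuration:

    <?php
    // Hypothetical mapping from each remote feed URL to my local naming convention.
    $feeds = array(
        'http://source1/path/to/rss1' => 'source1.sport.rss',
        // ... one entry per feed, up to source100
    );

    foreach ($feeds as $url => $name) {
        echo $url . '  ->  localfeed/' . $name . '.xml' . PHP_EOL;
    }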


What do I have?
At the very first level I should run wget against a list of remote URLs. According to the wget documentation:

  1. url-list.txt should consist of a series of URLs, one per line
  2. When running wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of FILE being preserved and the second copy being named FILE.1.
  3. Use of -O, as in wget -O FILE, is not intended to mean simply "use the name FILE instead of the one in the URL"; it writes all the downloads into that single file.
  4. Use -N for time stamping
  5. -w SECONDS will hold on for SECONDS seconds of time before next retrieval
  6. -nd forces wget not to create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions `.n')
  7. -nH disables generation of host-prefixed directories (the behavior which -r by default does).
  8. -P PREFIX sets directory prefix to PREFIX. The "directory prefix" is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree.
  9. -k converts links for offline browsing

    $ wget -nH -N -i url-list.txt
    


Issues (with wget, the cron job and PHP):

  1. How to handle news time? Is it better to save the timestamp in the file names, like source.category.type.timestamp.xml, or to fetch the change time using PHP's stat function like this:

    $stat = stat('source.category.type.xml');
    $time = $stat['mtime'];     // last modification time
    

    or is there another approach that is robust and always works? (A sketch of both options follows this list.)

  2. How to handle file names? I want to save the files locally under my own convention (source.category.type.xml), so I think wget options like --trust-server-names or --content-disposition will not help. I think I should use a while loop like this:

    while read -r url; do
      wget -nH -N -O nameConvention "$url"
    done < url-list.txt
    
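Regarding the first issue, here is a minimal sketch in PHP of both timestamp options I am considering; the file names and the timestamped naming convention are hypothetical examples:

    <?php
    // Hypothetical example of a feed file saved under my convention.
    $file = 'localfeed/source1.sport.rss.xml';

    // Option A: read the last modification time of the already-saved file.
    $mtime = filemtime($file);                 // same value as stat($file)['mtime']
    echo date('c', $mtime) . PHP_EOL;

    // Option B: embed a fetch timestamp in the file name at save time,
    // e.g. source.category.type.20150101T120000Z.xml (hypothetical convention).
    $stamped = 'localfeed/source1.sport.rss.' . gmdate('Ymd\THis\Z') . '.xml';
    // copy($file, $stamped);                  // or write the freshly fetched feed directly to $stamped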

Solution

I suggest staying away from wget for this task, as it makes your life really complicated for no reason. PHP is perfectly capable of fetching the downloads.

I would add all URLs to a database (it might be just a text file, as in your case). Then I would use a cron job to trigger the script. On each run I would check a fixed number of sites and put their RSS feeds into the folder. With file_get_contents and file_put_contents, for example, you are good to go. This gives you full control over what to fetch and how to save it.
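For example, here is a minimal sketch of such a fetch script; the batch size, the feeds.txt format of url|source.category.type lines, the offset file and the cron schedule are all assumptions of mine, not something your setup already has:

    <?php
    // Minimal sketch. Assumed setup: feeds.txt holds lines like "url|source.category.type",
    // output goes to ./localfeed, and a fixed batch is processed on each cron run,
    // e.g. crontab: */15 * * * * php /path/to/fetch_feeds.php
    $batchSize = 20;
    $lines  = file('feeds.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: array();
    $offset = (int) @file_get_contents('feeds.offset');       // where the previous run stopped

    foreach (array_slice($lines, $offset, $batchSize) as $line) {
        $parts = explode('|', $line, 2);
        if (count($parts) !== 2) {
            continue;                                          // skip malformed lines
        }
        list($url, $name) = $parts;
        $xml = @file_get_contents($url);                       // fetch the remote feed
        if ($xml !== false) {
            file_put_contents('localfeed/' . $name . '.xml', $xml);   // save under your naming convention
        }
    }

    // Advance the offset for the next cron run, wrapping around at the end of the list.
    $offset = ($offset + $batchSize >= count($lines)) ? 0 : $offset + $batchSize;
    file_put_contents('feeds.offset', $offset);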

Then I would use another script that goes over the files and does the parsing. Separating the scripts from the beginning will help you scale later on. For a simple site, just sorting the files by mtime should do the trick. For a big scale-out, I would use a job queue.
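A minimal sketch of that second script could look like this, again assuming the fetched feeds sit as *.xml files in ./localfeed:

    <?php
    // Minimal sketch: parse the saved feeds, oldest first by modification time.
    $files = glob('localfeed/*.xml');

    usort($files, function ($a, $b) {
        return filemtime($a) - filemtime($b);              // sort by mtime, ascending
    });

    foreach ($files as $file) {
        $feed = @simplexml_load_file($file);               // parse the RSS XML
        if ($feed === false || !isset($feed->channel->item)) {
            continue;                                      // skip files that are not valid RSS
        }
        foreach ($feed->channel->item as $item) {
            // Fall back to the file's mtime when the item has no <pubDate>.
            $time = isset($item->pubDate) ? strtotime((string) $item->pubDate) : filemtime($file);
            // ... hand ($item->title, $item->link, $time) over to the next processing step
        }
    }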

The overhead in PHP is minimal while the additional complexity by using wget is a big burden.

Licensed under: CC-BY-SA with attribution