Question

I am trying to download all ".m4a" podcast files from this base URL "http://runawaypodcast.com/wp-content/uploads/2014/" and ignore ones that have already been downloaded.

This is my current code (it doesn't search subdirectories):

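#!/bin/bash
# Collect every link on the index page (overwrite the temp file instead of appending).
lynx --dump 'http://runawaypodcast.com/wp-content/uploads/2014/01/' | awk '/http/{print $2}' > temp.txt
touch usedlinks.txt   # make sure the history file exists before grep reads it
while read -r link || [[ -n "$link" ]]; do
    if [[ $link == *.m4a ]]
    then
        # -x: match the whole line, -F: treat the URL as a fixed string, not a regex
        if grep -qxF -- "$link" usedlinks.txt; then
            echo "This episode has already been downloaded!"
        else
            # Record the link only if the download succeeded.
            wget "$link" && echo "$link" >> usedlinks.txt
        fi
    else
        echo "Non-audio file detected!"
    fi
done < temp.txt
rm temp.txt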

(I would also like to rename the downloaded files to a certain pattern. Could you help with that, too?)


Solution

There is no need for a script at all. All you need to do is read the wget man page :)

wget -np -nd -c -A.m4a -r -k -erobots=off http://runawaypodcast.com/wp-content/uploads/2014/

For mass file renaming there is a rename tool (check which version you have before using it, as it differs between distros).
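Because the rename tool varies between distros, a plain bash loop is a portable alternative. The following is only a minimal sketch: the function name `rename_episodes` and the target pattern `runaway_NNN.m4a` are made up for illustration, since the question does not say which pattern is wanted.

```shell
#!/bin/bash
# rename_episodes DIR: rename every .m4a file in DIR to
# runaway_001.m4a, runaway_002.m4a, ... in glob (alphabetical) order.
# The "runaway_NNN" pattern is a hypothetical example, not from the question.
rename_episodes() {
    local dir=$1 n=1 f new
    for f in "$dir"/*.m4a; do
        [ -e "$f" ] || continue      # the glob matched nothing
        new=$(printf '%s/runaway_%03d.m4a' "$dir" "$n")
        mv -n -- "$f" "$new"         # -n: never overwrite an existing file
        n=$((n + 1))
    done
}
```

It is meant to be run once on a freshly downloaded directory, for example `rename_episodes .`; running it again on already renamed files would need a smarter numbering scheme.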


Here are the full instructions for handling this download. The first download can be performed with:

wget -np -nd -c -A.m4a -r -k -erobots=off http://runawaypodcast.com/wp-content/uploads/2014/ \
-o download.log

Now the whole download log is in a file. To form a blacklist for future downloads, build a file list from the log:

v_black_list=$(sed -n '/--.*m4a/s=.*/==p' download.log | tr '\n' ',')

To run wget with this blacklist enabled, use the -R option:

wget -np -nd -c -A.m4a -r -k -erobots=off http://runawaypodcast.com/wp-content/uploads/2014/ \
-a download.log -R "$v_black_list"

Note that the second run uses -a instead of -o, so the log file is appended to rather than overwritten.
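To see what the sed expression picks out of the log, here is a small sketch that runs it on a made-up log excerpt. The helper name `build_blacklist`, the sample file names and the timestamps are assumptions for illustration, and the `sort -u` and `paste` steps are additions that drop duplicates and the trailing comma left by `tr`.

```shell
#!/bin/bash
# build_blacklist LOGFILE: print the .m4a names requested in a wget log,
# de-duplicated and comma-separated, ready for wget's -R option.
build_blacklist() {
    sed -n '/--.*m4a/s=.*/==p' "$1" | sort -u | paste -sd, -
}

# Hypothetical log excerpt, only to show which lines the sed address matches.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
--2014-01-05 12:00:00--  http://runawaypodcast.com/wp-content/uploads/2014/01/episode1.m4a
Saving to: 'episode1.m4a'
--2014-01-05 12:00:09--  http://runawaypodcast.com/wp-content/uploads/2014/02/episode2.m4a
Saving to: 'episode2.m4a'
EOF

v_black_list=$(build_blacklist "$sample_log")
echo "$v_black_list"    # prints: episode1.m4a,episode2.m4a
rm -f "$sample_log"
```

Only the request lines starting with `--` are matched; the sed substitution then strips everything up to the last slash, leaving the bare file name.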

Other tips

You can use wget to scrape the website for you in one line. It can download all of the *.m4a files and save them in the same directory layout as on the website. Here is a basic command to get you started, but you will need to tune the options to do exactly what you want:

wget -r -H -l1 -np -N -A.m4a -erobots=off http://runawaypodcast.com/wp-content/uploads/2014/
License: CC-BY-SA with attribution
Not affiliated with StackOverflow