Question

I want to run my scrapy crawler from a cron job.

I created a bash file, getdata.sh, in the directory where the scrapy project and its spiders are located:

#!/bin/bash
cd /myfolder/crawlers/
scrapy crawl my_spider_name

My crontab looks like this; I want to run the script every 5 minutes:

 */5 * * * * sh /myfolder/crawlers/getdata.sh 

But it doesn't work. What's wrong? Where is my error?

When I run the bash file from a terminal with sh /myfolder/crawlers/getdata.sh, it works fine.


Solution

I solved this problem by adding scrapy's install directory to PATH in the bash file:

#!/bin/bash

cd /myfolder/crawlers/
PATH=$PATH:/usr/local/bin
export PATH
scrapy crawl my_spider_name
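
A slightly more defensive variant of the script above could look like the sketch below. It assumes, as in the answer, that scrapy is installed in /usr/local/bin. The helper name run_spider and the exit codes are illustrative choices, not part of the original script.

```shell
#!/bin/bash
# Sketch only: run_spider is a hypothetical helper name, and /usr/local/bin
# is the install location assumed in the answer above.

# The body runs in a subshell ( ... ), so the cd and the PATH change
# do not leak into the caller.
run_spider() (
    project_dir=$1
    spider=$2

    # cron starts jobs with a minimal PATH, so extend it explicitly.
    PATH=$PATH:/usr/local/bin
    export PATH

    # Stop if the project directory is missing instead of crawling elsewhere.
    cd "$project_dir" || { echo "cannot cd to $project_dir" >&2; exit 1; }

    # Report a clear error when scrapy is still not on PATH.
    command -v scrapy >/dev/null 2>&1 || {
        echo "scrapy not found in PATH=$PATH" >&2
        exit 127
    }

    scrapy crawl "$spider"
)

# Example call, as it might appear at the end of getdata.sh:
# run_spider /myfolder/crawlers/ my_spider_name
```

Failing early on a bad directory or a missing executable leaves a readable message in cron's mail or log instead of a silent no-op.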

OTHER TIPS

Adding the following lines with crontab -e runs my scrapy crawl at 5 AM every day. This is a slightly modified version of crocs' answer:

PATH=/usr/bin
0 5 * * * cd project_folder/project_name/ && scrapy crawl spider_name

Without setting PATH, cron would give me the error "command not found: scrapy". I guess this is because cron does not use my login shell's PATH, and /usr/bin is where the scrapy executable lives on my Ubuntu system.

Note that the complete path of my scrapy project is /home/user/project_folder/project_name. I ran the env command from cron and noticed that the working directory is /home/user, so I left /home/user out of the crontab entry above.

The cron log can be helpful while debugging:

grep CRON /var/log/syslog
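
Besides reading the log, the "works in my terminal but not in cron" situation can often be reproduced by hand. The sketch below looks up a command under a stripped-down environment. It assumes cron's PATH is /usr/bin:/bin, which is a common default but should be checked on your system. The helper name find_in_cron_env is made up for illustration.

```shell
#!/bin/sh
# Sketch only: find_in_cron_env is a hypothetical helper name.
# Assumption: cron runs jobs with PATH=/usr/bin:/bin (verify on your system).

find_in_cron_env() {
    # env -i drops the inherited environment, so only the PATH given here
    # is used for the lookup inside the child shell.
    env -i PATH=/usr/bin:/bin /bin/sh -c 'command -v "$1"' sh "$1"
}

# Example:
# find_in_cron_env scrapy || echo "scrapy is not on the cron-like PATH"
```

If the lookup fails here but `command -v scrapy` succeeds in your normal terminal, the job needs either a PATH line or the full path to scrapy.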

For anyone who used pip3 (or similar) to install scrapy, here is a simple inline solution:

*/10 * * * * cd ~/project/path && ~/.local/bin/scrapy crawl something >> ~/crawl.log 2>&1

Replace:

*/10 * * * * with your cron pattern

~/project/path with the path to your scrapy project (where your scrapy.cfg is)

something with the spider name (use scrapy list in your project to find out)

~/crawl.log with your log file location (in case you want logging)

Another option is to skip the shell script and chain the two commands together directly in the cron job. Just make sure the PATH variable is set before the first scrapy job in the crontab. Run:

    crontab -e 

to edit and have a look. I have several scrapy crawlers which run at various times. Some every 5 mins, others twice a day.

    PATH=/usr/local/bin
    */5 * * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_1
    0 1,13 * * * cd /myfolder/crawlers/ && scrapy crawl my_spider_name_2

All jobs located after the PATH variable will find scrapy. Here the first one will run every 5 mins and the 2nd twice a day at 1am and 1pm. I found this easier to manage. If you have other binaries to run then you may need to add their locations to the path.

Check where scrapy is installed using "which scrapy" command. In my case, scrapy is installed in /usr/local/bin.
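
The lookup described above can be turned into a ready-made PATH line for the crontab. This is only a sketch: the helper name cron_path_line is hypothetical, and appending /usr/bin:/bin is an assumption about which other directories the job needs.

```shell
#!/bin/sh
# Sketch only: cron_path_line is a hypothetical helper name.
# It prints a PATH line that starts with the directory of the given command.

cron_path_line() {
    # command -v prints the full path of an executable found on PATH.
    cmd_path=$(command -v "$1") || { echo "$1 not found" >&2; return 1; }

    # /usr/bin:/bin is appended as an assumed baseline for other tools.
    printf 'PATH=%s:/usr/bin:/bin\n' "$(dirname "$cmd_path")"
}

# Example:
# cron_path_line scrapy
```

The printed line is written out in full rather than built from $PATH, so it can be pasted at the top of the crontab as is.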

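Open the crontab for editing with crontab -e and add the lines below. Write the PATH value out in full: crontab variable lines are not shell commands, so $PATH is not expanded there and export is not accepted.

PATH=/usr/local/bin:/usr/bin:/bin
*/5 * * * * cd /myfolder/path && scrapy crawl spider_name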

It should work. Scrapy runs every 5 minutes.

Does your shell script have execute permission?

For example, can you run

  /myfolder/crawlers/getdata.sh 

without the sh?

If you can, then you can drop the sh from the line in cron.
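
Running a script directly requires the execute bit, whereas `sh script.sh` only needs the file to be readable. A minimal sketch for checking and fixing the permission follows. The helper name ensure_executable is made up for illustration.

```shell
#!/bin/sh
# Sketch only: ensure_executable is a hypothetical helper name.

ensure_executable() {
    # Refuse to continue if the path is not a regular file.
    [ -f "$1" ] || { echo "$1: no such file" >&2; return 1; }

    # -x tests whether the current user may execute the file;
    # add the execute bit only when it is missing.
    [ -x "$1" ] || chmod +x "$1"
}

# Example:
# ensure_executable /myfolder/crawlers/getdata.sh
```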

In my case scrapy is at .local/bin/scrapy. Give the full path to scrapy and the spider name, and it works perfectly:

0 0 * * * cd /home/user/scraper/Folder_of_scriper/ && /home/user/.local/bin/scrapy crawl "name" >> /home/user/scrapy.log 2>&1

/home/user/scrapy.log is used to save the output and errors, so you can check whether the program ran or not.
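
The `>> file 2>&1` redirection appends both standard output and standard error to the same file. The sketch below wraps that pattern in a small function that also records a timestamp and the exit status, which makes it easier to tell from the log whether a given run happened. The helper name log_run and the log line format are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: log_run is a hypothetical helper name.
# Usage: log_run LOGFILE COMMAND [ARGS...]

log_run() {
    log_file=$1
    shift

    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') starting: $*"
        "$@"
        status=$?
        echo "=== exit status: $status"
    } >> "$log_file" 2>&1   # append stdout and stderr to the same file

    return "$status"
}

# Example, from inside a wrapper script run by cron:
# cd /home/user/scraper/Folder_of_scriper/ &&
#     log_run /home/user/scrapy.log /home/user/.local/bin/scrapy crawl "name"
```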

thank you.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow