Question

I have a csv-writing script that runs over a set of gathered urls like so: threaded(urls, write_csv, num_threads=5). The script writes to the csv, but it seems to rewrite the first row for each url rather than writing a new row for each subsequent url that is passed. The result is that the final csv has a single row containing the data from the last url. Do I need to add a counter and index to accomplish this, or restructure the program entirely? Here's the relevant code:

import csv
from thready import threaded

def get_links():
    #gather urls
    threaded(urls, write_csv, num_threads=5)

def write_csv(url):
    #the data dict with values that were previously assigned is defined here
    data = {
            'scrapeUrl': url,
            'model': final_model_num,
            'title': final_name, 
            'description': final_description, 
            'price': str(final_price), 
            'image': final_first_image, 
            'additional_image': final_images,
            'quantity': '1', 
            'subtract': '1', 
            'minimum': '1', 
            'status': '1', 
            'shipping': '1' 
    }
    # currently this writes the values but only to one row even though multiple urls are passed in
    with open("local/file1.csv", "w") as f:
        writer = csv.writer(f, delimiter=",")
        writer.writerows([data.keys()])
        writer.writerow([s.encode('ascii', 'ignore') for s in data.values()])

if __name__ == '__main__':
    get_links()

Solution

It appears that one problem is this line...

with open("local/file1.csv", "w") as f:

The output file is overwritten on each call to write_csv ("w" opens the file in write mode, and opening an existing file in write mode truncates it). Since the file is cleared every time the function is called, it gives the appearance of only one row ever being written.
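
For illustration only (append mode avoids the truncation, but it still doesn't make writes from multiple threads safe), the difference between the two modes looks like this...

# "w" truncates the file on every open, so only the last write survives
with open("demo.csv", "w") as f:
    f.write("first\n")
with open("demo.csv", "w") as f:
    f.write("second\n")
# demo.csv now contains only "second"

# "a" appends, so earlier content is kept
with open("demo.csv", "a") as f:
    f.write("third\n")
# demo.csv now contains "second" followed by "third"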

The bigger issue is that it is not good practice for multiple threads to write to a single file.

You could try this...

# keep only filesystem-safe characters when building a filename from the url
valid_chars = "-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
filename = ''.join(c for c in url if c in valid_chars)

# one output file per url, so the threads never share a file handle
with open("local/%s.csv" % filename, "w") as f:
    # rest of code...

...which will write each url to a different file (assuming the urls are unique). You could then recombine the files later. A better approach would be to put the data in a Queue and write it all after the call to threaded. Something like this...

import csv
import Queue

from thready import threaded

output_queue = Queue.Queue()

def get_links():
    #gather urls
    urls = ['www.google.com'] * 25
    threaded(urls, write_csv, num_threads=5)

def write_csv(url):
    # stand-in for the scraped data; put it on the queue instead of writing to a file here
    data = {'cat': 1, 'dog': 2}
    output_queue.put(data)

if __name__ == '__main__':

    get_links() # thready blocks until internal input queue is cleared

    # drain the queue on the main thread and write everything sequentially
    with open('output.csv', 'wb') as f:
        csv_out = csv.writer(f)
        header_written = False
        while not output_queue.empty():
            d = output_queue.get()
            if not header_written:
                csv_out.writerow(d.keys())   # header row, written once
                header_written = True
            csv_out.writerow(d.values())
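
On Python 3, the same idea works with the lowercase queue module, and since each queued item is a dict, csv.DictWriter is a natural fit for the drain loop. A rough sketch, where fieldnames is a hypothetical column order...

import csv
import queue   # Python 3 name for the Queue module

output_queue = queue.Queue()
fieldnames = ['scrapeUrl', 'model', 'title']    # hypothetical column order

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()                        # header row, written once
    while not output_queue.empty():
        writer.writerow(output_queue.get())     # one row per scraped url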

Other tips

Opening a file in write mode erases whatever was already in the file (as documented for the built-in open()). If multiple threads open the same file, whichever one opens it last will "win" and write its data to the file; the others will have their data overwritten.

You should probably rethink your approach. Multithreaded access to external resources like files is bound to cause problems. A better idea is to have the threaded portion of your code only retrieve the data from the urls, and then return it to a single-thread part that writes the data sequentially to the file.
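
For example, here is a minimal sketch of that pattern using the standard library's concurrent.futures instead of thready, with a hypothetical scrape(url) helper that returns a dict of fields (Python 3)...

import csv
from concurrent.futures import ThreadPoolExecutor

def scrape(url):
    # hypothetical helper: fetch and parse one url, return a dict of fields
    return {'scrapeUrl': url, 'title': '...', 'price': '...'}

def main(urls):
    # the threads only gather data; no file access happens here
    with ThreadPoolExecutor(max_workers=5) as pool:
        rows = list(pool.map(scrape, urls))

    # a single thread then writes everything sequentially
    with open('output.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)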

If you only have a small number of urls, you could dispense with threading altogether and just write a direct loop that iterates over the urls, opens the file once, and writes all the data.
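
That simpler version might look something like this (Python 3, again assuming urls is the gathered list and scrape(url) is a hypothetical helper returning a dict)...

import csv

# no threads: iterate the urls directly, open the file once, write as you go
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for i, url in enumerate(urls):
        data = scrape(url)                   # hypothetical helper returning a dict
        if i == 0:
            writer.writerow(data.keys())     # header row, written once
        writer.writerow(data.values())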

Licensed under: CC-BY-SA with attribution