Question

I want to process a huge text corpus. I have written two classes which a main class calls. I have removed some fields and methods to make the code more readable.

import queue
import threading
from numpy import int8
from scipy.sparse import csr_matrix

class Vectorizer():

    def __init__(self, numThread=10, queueMaxSize=100):
        self.docQueue = queue.Queue(queueMaxSize)
        self.words = {}
        self.dimension = 1024
        self.docNum = 0
        self.modelLock = threading.Lock()
        for i in range(numThread):
            t = WorkerThread(self)
            t.setDaemon(True)
            t.start()

    def vectorize(self, text):
        '''
        Add the text to docQueue; here text is the content of a
        document.
        '''
        self.docNum += 1
        docVector = createSparseVectorForThisDoc(self.docNum)
        self.docQueue.put((docVector, text))

    def initContextVector(self):
        #return sp.zeros(self.dimension, dtype=int)
        return csr_matrix((1, self.dimension), dtype=int8)
class WorkerThread(threading.Thread):

    def __init__(self, vectorizer):
        threading.Thread.__init__(self)
        self.vectorizer = vectorizer

    def run(self):
        while True:
            # get a document/text from the queue
            contextVector, text = self.vectorizer.docQueue.get()

            # do the work
            tokens, textLength = self.prepareTokens(text)
            # extract tokens and their context in order to vectorize them
            for token in tokens:
                self.vectorizer.modelLock.acquire()
                self.vectorizer.words[token] = self.vectorizer.words.get(token, self.vectorizer.initContextVector()) + contextVector
                self.vectorizer.modelLock.release()

            self.vectorizer.docQueue.task_done()

I measured the time spent on each statement, and most of the time is spent on the following line, which adds two sparse (though in fact not very sparse) vectors.

self.vectorizer.words[token] = self.vectorizer.words.get(token, self.vectorizer.initContextVector()) + contextVector

When I check the cores of the server using htop, I don't see good CPU utilization: the overall process uses about 130% of a core, but with 10 threads it should be using about 1000%. It never goes above 130%, yet the addition is a CPU-intensive job, isn't it? Is there anything I've done wrong?

Was it helpful?

Solution

Since you're having each thread take the lock around that expensive addition, every thread likely has to wait on the previous one anyway (and in CPython the GIL keeps threads from executing Python bytecode in parallel in the first place). You may consider breaking the work out into processes rather than threads if you have the memory to handle it. At the moment I can't figure out exactly what your lock is serializing, because it's all in that one line you highlighted separately.

Multiprocessing vs Threading Python
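A minimal sketch of the process-based approach, under some stated assumptions: each worker builds its own token-to-vector contributions for one document and the parent merges the partial results, so no lock is needed. The toy whitespace tokenizer and the dense NumPy vectors here are stand-ins for the asker's `prepareTokens()` and `csr_matrix` context vectors, and `DIMENSION` is shrunk for illustration.

```python
import multiprocessing as mp

import numpy as np

DIMENSION = 8  # small for illustration; the question uses 1024


def vectorize_doc(args):
    """Run in a worker process: return this document's partial model."""
    doc_id, text = args
    # stand-in for createSparseVectorForThisDoc(): one-hot on the doc id
    context_vector = np.zeros(DIMENSION, dtype=np.int64)
    context_vector[doc_id % DIMENSION] = 1
    partial = {}
    for token in text.split():  # stand-in for prepareTokens()
        partial[token] = partial.get(
            token, np.zeros(DIMENSION, dtype=np.int64)
        ) + context_vector
    return partial


def build_model(docs, processes=4):
    """Fan documents out to a pool, then merge partial models in the parent."""
    words = {}
    with mp.Pool(processes) as pool:
        for partial in pool.imap_unordered(vectorize_doc, enumerate(docs)):
            for token, vec in partial.items():
                words[token] = words.get(
                    token, np.zeros(DIMENSION, dtype=np.int64)
                ) + vec
    return words


if __name__ == "__main__":
    model = build_model(["a b a", "b c"])
    print(sorted(model))
```

Because the merge happens serially in the parent, this only pays off when per-document work dominates merging; each worker also needs enough memory for its partial dictionary, which is the trade-off mentioned above.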

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow