Question

I have a Python script that analyzes user behavior from log files.

The script reads several large files (about 50 GB each) using file.readlines(), analyzes them line by line, and saves the results in a dict of Python objects. After all lines have been analyzed, the dict is written to disk.

Since I have a server with 64 cores and 96 GB of memory, I start 10 processes of this script, each of which handles part of the data. In addition, to reduce the time spent on I/O, I use file.readlines(MAX_READ_LIMIT) instead of file.readline(), with MAX_READ_LIMIT set to 1 GB.

While running the script on the server, I watched resource usage with the top command. Although each process of my script occupies only about 3.5 GB of memory (about 40 GB in total), only 380 MB is left on the server (no other significant memory-consuming application is running on it at the same time).

So, where did the memory go? Shouldn't there be about 96 - 40 = 56 GB of memory left?

Please tell me if I have made a mistake in the observations above.

One hypothesis is that unused memory is NOT returned to the memory pool immediately, so I was wondering how to release unused memory explicitly and immediately.

I learned from the Python documentation that there are two complementary mechanisms for managing memory in Python, garbage collection and reference counting, and according to the Python docs:

Since the collector supplements the reference counting already used in Python, you can disable the collector if you are sure your program does not create reference cycles.

So, which one should I use in my case: del obj or gc.collect()?
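
For reference, the difference between the two mechanisms can be seen in a small experiment. This is only a sketch of CPython behaviour; the `Record` class is a hypothetical stand-in for whatever objects the analysis creates, and weak references are used purely to observe when an object is actually freed.

```python
import gc
import weakref


class Record:
    """Hypothetical stand-in for one of the analysis objects."""


# Case 1: no reference cycle. Dropping the last reference frees the object
# immediately through reference counting (CPython behaviour).
rec = Record()
probe = weakref.ref(rec)
del rec
freed_by_del = probe() is None

# Case 2: a reference cycle. `del` only removes the names; the two objects
# keep each other alive until the cyclic collector runs.
gc.disable()  # keep automatic collection from interfering with the demo
a, b = Record(), Record()
a.other, b.other = b, a
probe_cycle = weakref.ref(a)
del a, b
alive_after_del = probe_cycle() is not None
gc.collect()
freed_by_gc = probe_cycle() is None
gc.enable()

print(freed_by_del, alive_after_del, freed_by_gc)
```

Neither call is guaranteed to hand the freed memory back to the operating system; they make it reusable inside the process.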

Solution

using file.readlines(), then analyze data line by line

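This is a bad design. Called without an argument, readlines reads the entire file and returns a Python list of strings; with a size hint such as the 1 GB used here, it still builds a list holding roughly that much data at once. If you only need to process the data line by line, iterate over the file directly instead of using readlines: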

with open(filename) as f:
    for line in f:  # the file object yields one line at a time
        process(line)  # placeholder: put your per-line analysis here

This will massively reduce the amount of memory your program requires.
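
A minimal sketch of how the per-line analysis and the result dict could fit together with this pattern. The function name `count_events`, the tab-separated layout, and the idea of counting lines per user are assumptions made for illustration; the real analysis from the question would go in their place.

```python
from collections import Counter


def count_events(path):
    """Count log lines per user while holding only one line in memory at a time.

    Assumes (hypothetically) that each line is tab-separated and that the
    first field is a user id.
    """
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:  # the file object yields one line per iteration
            user_id = line.split("\t", 1)[0].rstrip("\n")
            if user_id:  # skip blank lines
                counts[user_id] += 1
    return counts
```

With this structure, memory use grows with the size of the result dict rather than with the size of the input file.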

Licensed under: CC-BY-SA with attribution