Question

I have a series of txt files with info for around 200 people. This info is generated and exported 5 or 6 times a day. Each txt file averages about 800 lines.

I set up a cron that calls (from php command line) a codeigniter controller that makes this process:

  • the constructor loads the model
  • a method gets the txt files from a folder, removes blanks and special chars from the filenames, and renames them
  • the files' paths are returned in an array
  • another method loops through the files array and calls $this->process($file)
  • process() reads each line from the file
  • it ignores blank lines and builds one array of values from each line read: array_filter(preg_split('/\s+/',$line));
  • finally, it calls model->insert_line($line) for each line (see the sketch below)
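In code, the flow is roughly this (a simplified sketch; names other than process() and insert_line() are placeholders, and the folder path is just an example):

<?php
class Import extends CI_Controller {

    public function __construct()
    {
        parent::__construct();
        $this->load->model('line_model'); // placeholder model name
    }

    public function run()
    {
        // get the txt files (already renamed to strip blanks/special chars)
        foreach ($this->get_files() as $file) {
            $this->process($file);
        }
    }

    private function process($file)
    {
        foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
            $values = array_filter(preg_split('/\s+/', $line));
            if (!empty($values)) {
                // one INSERT per line -- this runs ~800 times per file
                $this->line_model->insert_line($values);
            }
        }
    }

    private function get_files()
    {
        // returns the cleaned-up file paths as an array
        return glob('/path/to/txt/*.txt'); // folder path is a placeholder
    }
}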

How could I:

1- Optimize the code so I can lower the 2 min (avg.) execution time of each cron call? Each execution processes 5 or 6 txt files of about 800 lines each.

2- Set up the MySQL table so it can hold a very large quantity of records without trouble? Only two fields are stored: "code" int(2) and "fecha" timestamp, both set in a unique index (code, fecha).

I have a fast PC, and the table uses the InnoDB engine.


Solution 2

First approach

Have you tried:

$this->db->insert_batch('table', $data);

Where $data is an array with the rows/information you want to insert. I don't know the internals of that method (although looking at the code shouldn't be hard), but I'm almost sure it does the whole insertion in a single transaction.

The way you are doing it right now, calling an insert for each line, means opening a connection, doing checks, and everything else each statement needs in order to run. A bulk insert is the way to go in those cases, and that CI function does exactly that: it generates a single (multi-row) INSERT command that is executed in the same transaction.

You even have the advantage of being able to roll it back if one of the inserts fails, so the people who generate those files can massage or fix the data.
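For example, process() could collect the parsed lines and hand them over in one call. This is only a rough sketch: the table name 'lines' and the column mapping are assumptions based on the question, so adjust them to your schema:

private function process($file)
{
    $data = array();

    foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $values = array_values(array_filter(preg_split('/\s+/', $line)));
        if (count($values) >= 2) {
            $data[] = array(
                'code'  => $values[0],  // assumed column mapping
                'fecha' => $values[1],
            );
        }
    }

    if (!empty($data)) {
        // one multi-row INSERT instead of ~800 single-row INSERTs
        $this->db->insert_batch('lines', $data); // 'lines' is a placeholder table name
    }
}

If you want the explicit rollback behaviour, you can also wrap the call between CodeIgniter's $this->db->trans_start() and $this->db->trans_complete().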

Second approach

If you know that those files have a specific format, you could use MySQL's LOAD DATA INFILE statement, which will perform better than any tool you can write yourself.

The beauty of it is that you might be able to call it with:

$this->db->query($bulk_insert_command);

Where $bulk_insert_command is actually a string with something like:

LOAD DATA [LOW_PRIORITY | CONCURRENT] [LOCAL] INFILE 'file_name'
    [REPLACE | IGNORE]
    INTO TABLE tbl_name
    [CHARACTER SET charset_name]
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]
    [IGNORE number {LINES | ROWS}]
    [(col_name_or_user_var,...)]
    [SET col_name = expr,...]

As shown in the syntax above. Of course you'd have a function to build and sanitize this string, substituting the file name, options, and whatever else you need.
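Put together for this case, it might look something like the sketch below. The table name, column list and delimiters are assumptions; also note that LOCAL requires local_infile to be enabled, while without LOCAL the file must be readable by the MySQL server itself:

private function bulk_load($file)
{
    // escape() wraps the path in quotes for us
    $path = $this->db->escape(realpath($file));

    $bulk_insert_command = "LOAD DATA LOCAL INFILE {$path}
        IGNORE INTO TABLE lines
        FIELDS TERMINATED BY ' '
        LINES TERMINATED BY '\\n'
        (code, fecha)";

    $this->db->query($bulk_insert_command);
}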

And finally, make sure that whatever user you set up in database.php in your CI app has the FILE privilege:

GRANT FILE ON *.* TO user@localhost IDENTIFIED BY 'password';

So that the CI app does not generate an error when running such a query.

OTHER TIPS

You should profile your code to determine where the bottleneck(s) are.
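For a rough first pass, CodeIgniter's Benchmark class (or plain microtime()) is enough to see whether the time goes into reading/parsing or into the inserts. A sketch that could live inside process($file); the marker names are just examples:

$this->benchmark->mark('parse_start');
$rows = array();
foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $rows[] = array_filter(preg_split('/\s+/', $line));
}
$this->benchmark->mark('parse_end');

$this->benchmark->mark('insert_start');
// ... do the inserts here ...
$this->benchmark->mark('insert_end');

log_message('info',
    'parse: '.$this->benchmark->elapsed_time('parse_start', 'parse_end').'s, '.
    'insert: '.$this->benchmark->elapsed_time('insert_start', 'insert_end').'s');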

You can probably speed things up by splitting up the IO and the CPU tasks. There's no point in having multiple processes doing IO unless you've saved the files to multiple disks or something along those lines, so dedicate one IO process to reading the files into memory and putting them in a queue; then you can have multiple CPU processes pull files from the queue and process them.

If possible (i.e. if you have enough RAM), add the processed data to an in-memory queue, and when the IO process has finished reading all of the files into memory, have it write the processed data back to disk; if you don't have enough RAM to hold your files plus the processed data in memory, have the IO process alternate between reading and writing.

You should run enough CPU processes to utilize your hardware threads, which is probably the number of cores you've got on your CPU, or the number of cores * 2 if your CPU and OS support hyperthreading - run a few timing experiments with various numbers of processes to arrive at a good number.
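If you go that route, one simple way to get several worker processes from the CLI is pcntl_fork(). This is only a sketch: it assumes the pcntl extension is available (CLI only), the folder path and worker count are placeholders, and each child must open its own database connection:

$files   = glob('/path/to/txt/*.txt');   // placeholder path
$workers = 4;                            // tune this to your core count
$chunks  = array_chunk($files, max(1, (int) ceil(count($files) / $workers)));

$pids = array();
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === 0) {
        // child: open a fresh DB connection here, then handle its share of files
        foreach ($chunk as $file) {
            process_file($file);         // stands in for the parse + insert step
        }
        exit(0);
    }
    $pids[] = $pid;                      // parent keeps track of the children
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);        // wait for every child to finish
}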

If you profile the code and find that IO is the problem, then see if you can do something like save the files to a couple of zip files when they're first generated - this will lessen the amount of data you're reading from disk and will also make it more contiguous, at the cost of additional CPU processing when you unzip the data.
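If the generator can drop the txt files into a single zip, reading them back out is straightforward with PHP's ZipArchive (again a sketch; the zip path is a placeholder):

$zip = new ZipArchive();
if ($zip->open('/path/to/batch.zip') === TRUE) {
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if (substr($name, -4) === '.txt') {
            // read the whole entry as a string, then split it into lines
            $contents = $zip->getFromIndex($i);
            $lines = preg_split('/\R/', $contents, -1, PREG_SPLIT_NO_EMPTY);
            // hand $lines to the same parsing/insert step as before
        }
    }
    $zip->close();
}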

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow