Question

I have an application that uses XML::Twig to parse 20 XML files. Each file is about 0.5 GB, and the processing is done sequentially:

use XML::Twig;

foreach my $file (@files) {
    XML::Twig->new(
        keep_encoding => 1,
        twig_handlers => {
            # free memory after each completed <section>
            'section' => sub { $_->purge(); }
        }
    )->parsefile($file);
}

Is there a way in Perl to run this code in parallel, and if so, how? My application runs on Windows.


Solution

You should use Parallel::ForkManager from CPAN. It lets you fork a child process for each file and parse the files in parallel. Perl 5 also has threads, but the performance gain from them will probably not be significant here.

The example code from the module's documentation should do what you want, and I've reproduced it here for convenience. All it really does is create a manager object that caps the number of concurrent processes; for each piece of data (or file) it forks, returning in the parent while the child does the work and then terminates:

use strict;
use warnings;
use Parallel::ForkManager;

my $MAX_PROCESSES = 4;   # upper bound on concurrent children
my @all_data      = ();  # your work items (e.g. file names)

my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

foreach my $data (@all_data) {
  # Forks and returns the pid for the child; in the parent,
  # start() is true, so we skip ahead to the next item:
  my $pid = $pm->start and next;

  # ... do some work with $data in the child process ...

  $pm->finish; # Terminates the child process
}

$pm->wait_all_children;  # wait for every child to finish
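
Applied to your XML::Twig loop, a minimal sketch might look like this. The glob pattern and the cap of 4 workers are assumptions; tune the cap to your CPU and memory budget, since each child holds part of a 0.5 GB file:

use strict;
use warnings;
use Parallel::ForkManager;
use XML::Twig;

my @files = glob('data/*.xml');             # placeholder path to the 20 files
my $pm    = Parallel::ForkManager->new(4);  # placeholder worker cap

foreach my $file (@files) {
    $pm->start and next;   # parent: fork, then move on to the next file

    # Child: parse one file, purging each <section> once handled.
    XML::Twig->new(
        keep_encoding => 1,
        twig_handlers => { 'section' => sub { $_->purge(); } },
    )->parsefile($file);

    $pm->finish;           # exit the child
}

$pm->wait_all_children;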

One caveat on Windows: Perl has no native fork() there, so Parallel::ForkManager's forks are emulated with interpreter threads (ithreads), though it should still perform the task adequately. If you want real separate processes, you can go through the Win32 API instead, for example via Win32::API's CreateProcess() (or the higher-level Win32::Process module). There's also the Forks::Super package, which provides a richer fork() replacement, works on Windows as well, and can queue jobs for you; a sketch follows below.
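
If you go the Forks::Super route, a minimal sketch might look like the following. Again, the file glob and the MAX_PROC value of 4 are placeholders:

use strict;
use warnings;
use Forks::Super MAX_PROC => 4, ON_BUSY => 'queue';  # queue jobs past the cap
use XML::Twig;

my @files = glob('data/*.xml');   # placeholder path to your XML files

foreach my $file (@files) {
    # Forks::Super's fork() accepts a sub to run in the background
    # (emulated with threads on Windows, like the core fork):
    fork {
        sub => sub {
            XML::Twig->new(
                keep_encoding => 1,
                twig_handlers => { 'section' => sub { $_->purge(); } },
            )->parsefile($file);
        },
    };
}

waitall;   # wait for every background job to complete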

Licensed under: CC-BY-SA with attribution