Question

I have a PHP web form that accepts file uploads (image and text), from which text is extracted (OCR, and .pdf, .doc, etc. stripped to plain text). The text extraction is performed by using exec to invoke a jar file/command-line process (I am not in control of the source for either), which returns the text. There is no issue while testing; however, with 5 simultaneous PDF uploads (each about 5MB) the server load maxes out. The entire process (each upload) takes 10-15 seconds, and load drops back to normal immediately after.

I am assuming the issue is with Java and the resources allocated to the JRE for each exec call; invoking the jar file manually from the command line takes about 10 seconds, so nearly the same as a single upload response. Running the extraction as a background process is not possible because the HTTP response contains the 'data' processed from the uploaded files' text. I considered forking the process, but that doesn't help with the server load (it will probably make it worse). I am hoping to avoid rewriting the service entirely in Java.

Is there a way to pre-load the JRE, or to pipe successive files to the same Java process, or something of the like?

Solution

Sure, starting a JVM for each request is an extremely bad idea. That's exactly where Java is slow.

It should be pretty easy using, e.g., ServerSocket: start one long-running Java process and send requests to it. It's not the fastest solution, but it's simple and a guaranteed huge speedup.
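
A minimal sketch of such a long-running process, under some assumptions of mine: the class name ExtractionServer, port 9090, and the one-line protocol (the client sends a file path terminated by a newline, the server writes the extracted text and closes the connection) are all made up for illustration. The extractor function is a stand-in for whatever call into the original JAR actually produces the text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical long-running server: the JVM starts once and then serves many requests.
class ExtractionServer implements Runnable, AutoCloseable {
    private final ServerSocket serverSocket;
    private final Function<String, String> extractor;

    // Port 0 lets the OS pick a free port. The extractor is a placeholder for the real library call.
    ExtractionServer(int port, Function<String, String> extractor) throws IOException {
        // Bind to loopback only, so that only local processes (e.g. PHP on the same host) can connect.
        this.serverSocket = new ServerSocket(port, 50, InetAddress.getLoopbackAddress());
        this.extractor = extractor;
    }

    int port() {
        return serverSocket.getLocalPort();
    }

    @Override
    public void run() {
        // Requests are handled one at a time, which also caps the load the extraction can generate.
        while (!serverSocket.isClosed()) {
            try (Socket client = serverSocket.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                String path = in.readLine();          // assumed protocol: one file path per connection
                if (path != null) {
                    out.print(extractor.apply(path)); // reply with the extracted text
                    out.flush();
                }
            } catch (IOException | RuntimeException e) {
                // accept() throws once the socket is closed; the loop condition then ends the thread.
                // A failed extraction only drops that one connection.
            }
        }
    }

    @Override
    public void close() throws IOException {
        serverSocket.close();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder extractor; replace with a call into the original JAR.
        ExtractionServer server = new ExtractionServer(9090, path -> "extracted text of " + path);
        server.run();
    }
}
```

On the PHP side, the exec call could then be replaced by something like fsockopen() to the chosen port, fwrite() of the path plus a newline, and stream_get_contents() to read the reply until the server closes the connection.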


A JAR file is sometimes an "executable", but it's always a "library". It's actually just a renamed ZIP file, so you can easily look at what's inside (and I wouldn't call it reverse engineering). Its manifest (META-INF/MANIFEST.MF) names the main class in the Main-Class attribute. You can write your own class that calls the original main, or that ignores it and uses the JAR's classes directly.
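
If you'd rather not unzip the JAR by hand, the standard java.util.jar API can read the manifest for you. The following is a small sketch; the class name ManifestPeek is my own, and it simply prints the Main-Class attribute of whichever JAR path you pass in.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper that reports which class a JAR declares as its entry point.
class ManifestPeek {
    // Returns the Main-Class attribute, or null if the JAR has no manifest or no such attribute.
    static String mainClassOf(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            return manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(mainClassOf(args[0]));
    }
}
```

Running it as java ManifestPeek old.jar should print the fully qualified name of the class whose main method the JAR normally starts.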

For this you don't need to modify the original JAR at all. Just make your own, though you don't even need a JAR file: for something this simple, a single class should suffice. Then you call it like

java -cp "old.jar;." YourClass

assuming you're using Windows (otherwise replace ; with :), YourClass is in the default package (which is usually a bad idea, but OK for a single-class project), and YourClass.class (i.e., the compiled version of YourClass.java) is in the current working directory.
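
Here is one way YourClass could call the original main, as a sketch only. It assumes the original main prints the extracted text to System.out and does not call System.exit (if it does, this approach won't work as-is). EchoMain is a made-up stand-in for the real main class, so the example runs without old.jar; in practice you would pass the class name found in the manifest.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.lang.reflect.Method;
import java.util.Arrays;

// Stand-in for the original JAR's main class (hypothetical; only here to make the sketch runnable).
class EchoMain {
    public static void main(String[] args) {
        System.out.print("echo:" + args[0]);
    }
}

class YourClass {
    // Invokes mainClassName.main(args) inside this JVM and returns whatever it printed to System.out.
    // Synchronized because System.out is global: concurrent calls would mix their output.
    static synchronized String runMain(String mainClassName, String... args) throws Exception {
        Method main = Class.forName(mainClassName).getMethod("main", String[].class);
        PrintStream originalOut = System.out;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        System.setOut(new PrintStream(captured, true, "UTF-8"));
        try {
            main.invoke(null, (Object) args);  // the cast passes args as a single String[] argument
        } finally {
            System.out.flush();
            System.setOut(originalOut);        // always restore the real stdout
        }
        return captured.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // args[0]: name of the main class to call; the remaining arguments are passed through to it.
        String[] rest = Arrays.copyOfRange(args, 1, args.length);
        System.out.print(runMain(args[0], rest));
    }
}
```

Called once per process this saves nothing by itself; the gain comes from calling runMain repeatedly inside a process that stays alive between requests.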

I wouldn't go for a faster but more complicated solution like ServerSocketChannel, as it's not worth it. Starting a new JVM takes time; moreover, it starts out interpreting bytecode and only compiles it later. That's far worse than some communication overhead. A faster transport would only save you a few more microseconds.

OTHER TIPS

If I were you, I would first look for open-source projects that convert the files within PHP. When working in one language, adding another typically causes unnecessary work. Chances are there is a library for whatever your needs are, and it might even be faster than your current solution.

Given that you MUST use a Java lib:

File processing like this often ties up the CPU, and it only gets worse with larger files. Chances are there isn't much you can do about how long it takes to process a file, other than possibly limiting the file size.

However, you can control what the server is and is not doing. You should look into splitting the work between servers. The server that converts the files should have a more powerful CPU and shouldn't need a whole lot of RAM, while your web server should be big on RAM with a smaller CPU.

As for the data for each conversion, store it in a database until the conversion is done. Once the conversion is done, have your converting server connect to the database and store the relevant data, as well as a "done" flag.

From here you can just tell the client/browser to repeatedly check the database for the done flag (AJAX or Page Refresh).

Cheers!
-Nick

* Edit *

Also, your conversion server should never need to stop. Running it as a no-timeout application that constantly checks the database for new jobs is ideal, though it is also recommended that you configure it to shut down or hibernate during slow periods.
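
To make the job/"done"-flag idea concrete, here is a rough sketch of the polling worker. Everything in it is an assumption for illustration: the names ConversionWorker and Job are made up, the in-memory map stands in for the database table, and the extractor function stands in for the actual conversion call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical conversion worker; the map below plays the role of the jobs table in the database.
class ConversionWorker {
    // One row of the (assumed) jobs table.
    static final class Job {
        final String inputPath;
        volatile String result;  // extracted text, filled in by the worker
        volatile boolean done;   // the "done" flag the client polls for

        Job(String inputPath) {
            this.inputPath = inputPath;
        }
    }

    private final Map<Integer, Job> jobs = new ConcurrentHashMap<>();
    private final Function<String, String> extractor;

    ConversionWorker(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Called by the web server when an upload arrives.
    void submit(int id, String inputPath) {
        jobs.put(id, new Job(inputPath));
    }

    // Called (indirectly) by the browser's repeated AJAX check.
    boolean isDone(int id) {
        Job job = jobs.get(id);
        return job != null && job.done;
    }

    String resultOf(int id) {
        Job job = jobs.get(id);
        return job == null ? null : job.result;
    }

    // One polling pass: converts every job that is not done yet and returns how many it processed.
    int processPending() {
        int processed = 0;
        for (Job job : jobs.values()) {
            if (!job.done) {
                job.result = extractor.apply(job.inputPath);
                job.done = true;  // set last, so a reader that sees done also sees the result
                processed++;
            }
        }
        return processed;
    }

    // The never-stopping loop: keep polling, and sleep briefly whenever there was nothing to do.
    void runForever(long idleSleepMillis) throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            if (processPending() == 0) {
                Thread.sleep(idleSleepMillis);
            }
        }
    }
}
```

With a real database, submit, isDone, and resultOf would become queries against the jobs table, and the idle sleep is where a longer back-off during slow periods could go.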

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow