Question

I've got an application defined as follows:

{application, ps_barcode,
 [{description, "barcode image generator based on ps-barcode"},
  {vsn, "1.0"},
  {modules, [ps_barcode_app, ps_barcode_supervisor, barcode_data, wand, ps_bc]},
  {registered, [ps_bc, wand, ps_barcode_supervisor]},
  {applications, [kernel, stdlib]},
  {mod, {ps_barcode_app, []}},
  {start_phases, []}]}.

with the supervisor init looking like

init([]) ->
    {ok, {{one_for_one, 3, 10},
      [{tag1, 
        {wand, start, []},
        permanent,
        brutal_kill,
        worker,
        [wand]},
       {tag2,
        {ps_bc, start, []},
        permanent,
        10000,
        worker,
        [ps_bc]}]}}.

It's a barcode generator that uses a C component to do some of the image processing. The system errors and restarts correctly if asked to process nonexistent files, or to process files with insufficient permissions, but there's one particular error that results in a timeout from the wand module:

GPL Ghostscript 9.04: Unrecoverable error, exit code 1
GPL Ghostscript 9.04: Unrecoverable error, exit code 1
wand.c barcode_to_png 65 Postscript delegate failed `/tmp/tmp.1337.95765.926102': No such file or directory @ error/ps.c/ReadPSImage/827

** exception exit: {timeout,{gen_server,call,
                    [wand,{process_barcode,"/tmp/tmp.1337.95765.926102"}]}}
     in function  gen_server:call/2 (gen_server.erl, line 180)
     in call from ps_bc:generate/3 (ps_bc.erl, line 19)

(The ImageMagick error is inaccurate there: the file exists, but it's a PostScript file with errors, so it can't be interpreted normally. I assume that's what generates the Ghostscript error and causes the program to hang, but I'm not sure why it fails to return at all.)

The problem I've got is this: even though the timeout returns an error, the wand process seems to have hung in the background. I conclude this because any further call to wand returns another timeout error, including wand:stop for some reason. I'm not sure how much code to post, so I'm keeping it minimal and posting only the wand module itself. Let me know if I need to post other pieces.

-module(wand).

-behaviour(gen_server).

-export([start/0, stop/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
     terminate/2, code_change/3]).

-export([process/1]).

process(Filename) -> gen_server:call(?MODULE, {process_barcode, Filename}).

handle_call({process_barcode, Filename}, _From, State) ->
    State ! {self(), {command, Filename}},
    receive
      {State, {data, Data}} ->
        {reply, decode(Data), State}
    end;
handle_call({'EXIT', _Port, Reason}, _From, _State) ->
    exit({port_terminated, Reason}).

decode([0]) -> {ok, 0};
decode([1]) -> {error, could_not_read};
decode([2]) -> {error, could_not_write}.

%%%%%%%%%%%%%%%%%%%% generic actions
start() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
stop() -> gen_server:call(?MODULE, stop).

%%%%%%%%%%%%%%%%%%%% gen_server handlers
init([]) -> {ok, open_port({spawn, filename:absname("wand")}, [{packet, 2}])}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, Port) -> Port ! {self(), close}, ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

EDIT: I forgot to mention something that may be relevant: the hang only seems to happen when I run the application through application:load/application:start. If I test this component on its own by doing

c(wand).
wand:start().
wand:process("/tmp/tmp.malformed-file.ps").

It still errors, but the process dies for real. That is, I can do

wand:start().
wand:process("/tmp/tmp.existing-well-formed-file.ps").

and get the expected response. When it's started through the supervisor, it hangs instead and exhibits the behavior I described earlier.


Solution 2

It seems that using receive..after instead of a plain receive when dealing with the external C program forces a kill. I'm not sure why the other measures don't work though...

...
receive
  {State, {data, Data}} ->
    {reply, decode(Data), State}
after 3000 ->
  exit(wand_timeout)
end;
...

Also, at this point you have to hope that no legitimate operation takes longer than 3000 ms. It's not a problem in this particular case, but it might become one if I added more outputs to the C program.
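One likely reason the plain receive misbehaves: it runs inside handle_call, so it blocks the server process itself, and the gen_server:call timeout only fires in the caller. The sketch below illustrates the receive..after idea in a self-contained form. It is not the original wand module: the module name wand_timeout_sketch, the configurable timeout, and the fake_port/1 helper (a plain process standing in for the real port so the example runs without the C program) are all assumptions made for illustration. It also replies {error, timeout} and stops the server rather than calling exit/1, which is a swapped-in variation.

```erlang
-module(wand_timeout_sketch).
-behaviour(gen_server).

-export([start/1, process/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% TimeoutMs should stay below the 5000 ms default of gen_server:call/2,
%% otherwise the caller gives up before the server does.
start(TimeoutMs) ->
    gen_server:start({local, ?MODULE}, ?MODULE, [TimeoutMs], []).

process(Filename) ->
    gen_server:call(?MODULE, {process_barcode, Filename}).

init([TimeoutMs]) ->
    Owner = self(),
    %% Hypothetical stand-in for open_port/2 in the real module.
    Worker = spawn_link(fun() -> fake_port(Owner) end),
    {ok, {Worker, TimeoutMs}}.

handle_call({process_barcode, Filename}, _From, {Worker, TimeoutMs} = State) ->
    Worker ! {self(), {command, Filename}},
    receive
        {Worker, {data, Data}} ->
            {reply, decode(Data), State}
    after TimeoutMs ->
        %% Answer the caller, then stop so a supervisor can restart us.
        {stop, wand_timeout, {error, timeout}, State}
    end.

decode([0]) -> {ok, 0};
decode([1]) -> {error, could_not_read};
decode([2]) -> {error, could_not_write}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Simulated external program: never answers for the file name "hang",
%% answers with the success code for anything else.
fake_port(Owner) ->
    receive
        {Owner, {command, "hang"}} ->
            fake_port(Owner);
        {Owner, {command, _Filename}} ->
            Owner ! {self(), {data, [0]}},
            fake_port(Owner)
    end.
```

With a real port in place of the helper process, the port is closed when its owning process exits, so stopping the server also releases the port.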

OTHER TIPS

Not an answer, but here is what I would do in such a case: use gen_server:cast, handle the timeouts inside the gen_server, and send the result to the requester once all the work is done. Note that this change affects the requester side too.

But I may be wrong about all of this.
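A minimal sketch of what this tip might look like, under several assumptions: the module name wand_async_sketch, the {wand_result, Result} message shape, the {error, busy} reply, and the fake_port/1 helper (a plain process standing in for the real port) are all hypothetical. The server never blocks in a receive; port data and the timeout both arrive through handle_info.

```erlang
-module(wand_async_sketch).
-behaviour(gen_server).

-export([start/1, process_async/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start(TimeoutMs) ->
    gen_server:start({local, ?MODULE}, ?MODULE, [TimeoutMs], []).

%% Returns ok immediately; the caller later receives {wand_result, Result}.
process_async(Filename) ->
    gen_server:cast(?MODULE, {process_barcode, self(), Filename}).

init([TimeoutMs]) ->
    Owner = self(),
    %% Hypothetical stand-in for open_port/2 in the real module.
    Worker = spawn_link(fun() -> fake_port(Owner) end),
    {ok, #{worker => Worker, timeout => TimeoutMs, pending => undefined}}.

handle_cast({process_barcode, Requester, Filename},
            #{worker := Worker, timeout := TimeoutMs,
              pending := undefined} = State) ->
    Worker ! {self(), {command, Filename}},
    TRef = erlang:start_timer(TimeoutMs, self(), port_timeout),
    {noreply, State#{pending := {Requester, TRef}}};
handle_cast({process_barcode, Requester, _Filename}, State) ->
    %% A request is already in flight; this sketch simply refuses.
    Requester ! {wand_result, {error, busy}},
    {noreply, State}.

%% Data from the (simulated) port: cancel the timer and answer.
handle_info({Worker, {data, Data}},
            #{worker := Worker, pending := {Requester, TRef}} = State) ->
    erlang:cancel_timer(TRef),
    Requester ! {wand_result, decode(Data)},
    {noreply, State#{pending := undefined}};
%% The timer for the pending request fired: report and stop.
handle_info({timeout, TRef, port_timeout},
            #{pending := {Requester, TRef}} = State) ->
    Requester ! {wand_result, {error, timeout}},
    {stop, wand_timeout, State};
handle_info(_Info, State) ->
    {noreply, State}.

decode([0]) -> {ok, 0};
decode([1]) -> {error, could_not_read};
decode([2]) -> {error, could_not_write}.

handle_call(_Request, _From, State) -> {reply, {error, unsupported}, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Simulated external program: never answers for the file name "hang".
fake_port(Owner) ->
    receive
        {Owner, {command, "hang"}} ->
            fake_port(Owner);
        {Owner, {command, _Filename}} ->
            Owner ! {self(), {data, [0]}},
            fake_port(Owner)
    end.
```

The requester has to run its own receive for {wand_result, Result}, which is the requester-side change the tip mentions.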

Licensed under: CC-BY-SA with attribution