Question

I would like to parse logfiles produced by the fidonet mailer binkd. They are multi-line and, much worse, mixed: several instances can write into one logfile, for example:

      27 Dec 16:52:40 [2484] BEGIN, binkd/1.0a-545/Linux -iq /tmp/binkd.conf
    + 27 Dec 16:52:40 [2484] session with 123.45.78.9 (123.45.78.9)
    - 27 Dec 16:52:41 [2484] SYS BBSName
    - 27 Dec 16:52:41 [2484] ZYZ First LastName
    - 27 Dec 16:52:41 [2484] LOC City, Country
    - 27 Dec 16:52:41 [2484] NDL 115200,TCP,BINKP
    - 27 Dec 16:52:41 [2484] TIME Thu, 27 Dec 2012 21:53:22 +0600
    - 27 Dec 16:52:41 [2484] VER binkd/0.9.6a-173/Win32 binkp/1.1
    + 27 Dec 16:52:43 [2484] addr: 2:1234/56.78@fidonet
    - 27 Dec 16:52:43 [2484] OPT NDA CRYPT
    + 27 Dec 16:52:43 [2484] Remote supports asymmetric ND mode
    + 27 Dec 16:52:43 [2484] Remote requests CRYPT mode
    - 27 Dec 16:52:43 [2484] TRF 0 0
    *+ 27 Dec 16:52:43 [1520] done (from 2:456/78@fidonet, OK, S/R: 0/0 (0/0 bytes))*
    + 27 Dec 16:52:43 [2484] Remote has 0b of mail and 0b of files for us
    + 27 Dec 16:52:43 [2484] pwd protected session (MD5)
    - 27 Dec 16:52:43 [2484] session in CRYPT mode
    + 27 Dec 16:52:43 [2484] done (from 2:1234/56.78@fidonet, OK, S/R: 0/0 (0/0 bytes))

So the logfile is not only multi-line with an unpredictable number of lines per session, but several records can also be interleaved, e.g. session 1520 finished in the middle of session 2484. What would be the right direction in Hadoop to parse such a file? Or should I just parse line by line, somehow merge the lines into records later, and then write those records into a SQL database using another set of jobs?

Thanks.


Solution

The right direction for Hadoop is to develop your own input format, whose record reader reads the input line by line and produces logical records.
You could also do this in the mapper, which might be a bit simpler; the drawback is that it is not the standard way to package such code for Hadoop, so it is less reusable.
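
A minimal sketch of what such an input format could look like with the mapreduce API (the class names, the grouping by the `[pid]` field, and the "done" end-of-session heuristic are my own illustrative assumptions, not anything binkd or Hadoop provides):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class BinkdSessionInputFormat extends FileInputFormat<Text, Text> {

        // Sessions interleave arbitrarily, so treat each log file as one unsplittable unit.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
            return new BinkdSessionRecordReader();
        }

        public static class BinkdSessionRecordReader extends RecordReader<Text, Text> {
            // "[2484]" -> the process id, used here as the session key
            private static final Pattern PID = Pattern.compile("\\[(\\d+)\\]");

            private BufferedReader reader;
            private final Map<String, StringBuilder> open = new HashMap<>();
            private final Queue<String[]> complete = new ArrayDeque<>(); // {pid, record}
            private final Text key = new Text();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
                Path path = ((FileSplit) split).getPath();
                FileSystem fs = path.getFileSystem(ctx.getConfiguration());
                FSDataInputStream in = fs.open(path);
                reader = new BufferedReader(new InputStreamReader(in));
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                String line;
                while (complete.isEmpty() && (line = reader.readLine()) != null) {
                    Matcher m = PID.matcher(line);
                    if (!m.find()) {
                        continue;                      // line we cannot attribute to a session
                    }
                    String pid = m.group(1);
                    open.computeIfAbsent(pid, p -> new StringBuilder()).append(line).append('\n');
                    if (line.contains(" done ")) {     // heuristic end-of-session marker
                        complete.add(new String[] { pid, open.remove(pid).toString() });
                    }
                }
                if (complete.isEmpty()) {              // flush sessions that never saw "done"
                    if (open.isEmpty()) {
                        return false;
                    }
                    String pid = open.keySet().iterator().next();
                    complete.add(new String[] { pid, open.remove(pid).toString() });
                }
                String[] record = complete.poll();
                key.set(record[0]);
                value.set(record[1]);
                return true;
            }

            @Override public Text getCurrentKey()   { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress()    { return 0.0f; } // not tracked in this sketch
            @Override public void close() throws IOException { reader.close(); }
        }
    }

Making the files unsplittable keeps the grouping simple: you trade parallelism within a single file for correct handling of the interleaved sessions.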

The other direction you mentioned is not "natural" for Hadoop, in my view. Specifically, why use all the complicated (and expensive) shuffle machinery to join together lines that are already in your hands?

OTHER TIPS

First of all, parsing the file is not what you are trying to do; you are trying to extract some information from your data.

In your case you could consider a multi-step MR job, where the first MR job essentially (partially) sorts your input by session_id (doing some filtering? some aggregation? multiple reducers?), and then the reducer or a follow-up MR job does the actual calculation.
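
A rough sketch of that first step, assuming the process id in square brackets can serve as the session_id (the class names and the regex are illustrative):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SessionGrouping {

        // Mapper: key every log line by the [pid] it belongs to.
        public static class LineMapper extends Mapper<LongWritable, Text, Text, Text> {
            private static final Pattern PID = Pattern.compile("\\[(\\d+)\\]");
            private final Text sessionId = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                Matcher m = PID.matcher(line.toString());
                if (m.find()) {
                    sessionId.set(m.group(1));
                    context.write(sessionId, line);
                }
            }
        }

        // Reducer: all lines of one session arrive in the same reduce call;
        // merge them into one record or do the per-session calculation here.
        public static class SessionReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text sessionId, Iterable<Text> lines, Context context)
                    throws IOException, InterruptedException {
                StringBuilder record = new StringBuilder();
                for (Text line : lines) {
                    record.append(line.toString()).append('\n');
                }
                context.write(sessionId, new Text(record.toString()));
            }
        }
    }

Note that within one reduce group the lines are not guaranteed to arrive in their original order; if the order matters, add a secondary sort on the input byte offset.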

Without an explanation of what you are trying to extract from your log files it is hard to give a more definitive answer.

Also, if your data is small, maybe you can process it without the MR machinery at all?
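
If it does fit on one machine, a single pass over the file may be enough. A minimal sketch along those lines (again, the `[pid]` grouping and the "done" marker are assumptions about the log format):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class BinkdLogParser {
        private static final Pattern PID = Pattern.compile("\\[(\\d+)\\]");

        public static void main(String[] args) throws IOException {
            Map<String, StringBuilder> open = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                Matcher m = PID.matcher(line);
                if (!m.find()) {
                    continue;                       // line without a recognizable [pid]
                }
                String pid = m.group(1);
                open.computeIfAbsent(pid, p -> new StringBuilder())
                    .append(line).append('\n');
                if (line.contains(" done ")) {      // heuristic end-of-session marker
                    // a complete session record: print it, or insert it into SQL here
                    System.out.println("--- session " + pid + " ---");
                    System.out.print(open.remove(pid));
                }
            }
        }
    }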

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow