How to read multiline fwf format where row may or may not flow multiline

https://stackoverflow.com/questions/15105923

15-03-2022

Question

I get a trade report from one of my brokers as a text file, shown below. I am trying to parse it for analysis. The problem is that each record spans multiple rows, including an aggregate row (marked with * in the BUY or SELL column) and the fee rows below it.

  TRADE   SETTL  AT      BUY            SELL      CONTRACT DESCRIPTION           EX TRADE PRICE CC   DEBIT(DR)/CREDIT
 ------- ------- -- -------------- -------------- ------------------------------ -- ----------- -- --------------------
 11/26/2         F1                            1  JAN 13 SOYBEAN MEAL            01   424.70    US
                                                  ELECTRONIC TRADE
                 F1                            1*                                    COMMISSION US               1.20DR
                 F1                                                     EXCHANGE & CLEARING FEE US                .81DR
                 F1                                                                     NFA FEE US                .02DR
                 F1                                                     TOTAL COMMISSION & FEES US               2.03DR
 11/28/2         F1             1                 DEC 12 SWISS FRANC             16  107.490    US
                                                  ELECTRONIC TRADE
                 F1             1*                                                   COMMISSION US               1.20DR
                 F1                                                     EXCHANGE & CLEARING FEE US                .54DR
                 F1                                                                     NFA FEE US                .02DR
                 F1                                                     TOTAL COMMISSION & FEES US               1.76DR
 11/29/2         F1             2                 MAR 13 NEW COCOA               06    24.61    US
                                                  ELECTRONIC TRADE
                 F1             2*                                                   COMMISSION US               2.40DR
                 F1                                                     EXCHANGE & CLEARING FEE US               4.00DR
                 F1                                                                     NFA FEE US                .04DR
                 F1                                                     TOTAL COMMISSION & FEES US               6.44DR
 12/03/2         F1             1                 DEC 12 IMM EURO FX             16     1.30000 US
                                                  ELECTRONIC TRADE
                 F1             1*                                                   COMMISSION US               1.20DR
                 F1                                                     EXCHANGE & CLEARING FEE US                .54DR
                 F1                                                                     NFA FEE US                .02DR
                 F1                                                     TOTAL COMMISSION & FEES US               1.76DR
 12/07/2         F1                            3  DEC 12 US $ INDEX              13    80.245   US
                                                  ELECTRONIC TRADE
 12/07/2         F1             3                 DEC 12 US $ INDEX              13    80.250   US
                                                  ELECTRONIC TRADE
                 F1             3*             3*                                    COMMISSION US               7.20DR
                 F1                                                     EXCHANGE & CLEARING FEE US               8.10DR
                 F1                                                                     NFA FEE US                .12DR
                 F1                                                     TOTAL COMMISSION & FEES US              15.42DR

For now I am only interested in the aggregated information: the CONTRACT DESCRIPTION, the BUY and SELL quantities marked with *, and the rows below them, i.e. the COMMISSION, EXCHANGE & CLEARING FEE, NFA FEE and TOTAL COMMISSION & FEES values given in the last column, DEBIT(DR)/CREDIT.

Any pointers on how I can go about doing this?

I tried using read.fwf, but it doesn't work for me because the multiline layout is not the same for each record.

Ultimately, if nothing works, I will have to write a line-by-line parser, which I am trying to avoid for now to see if it can be done in a more elegant manner.

Solution

Since your data are grouped by date, I read the file with scan as whitespace-separated tokens and process each date group with lapply.

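# Read the file as whitespace-separated tokens.
dat <- scan('yourfile_name', what = 'character')
# Each group starts at a date token; the extra index closes the last group
# (length(dat) + 1 so that the final token is not dropped).
ids <- c(grep('[0-9]+/[0-9]+/[0-9]', dat), length(dat) + 1)
lapply(head(seq_along(ids), -1), function(x)
{
  y <- dat[ids[x]:(ids[x+1]-1)]   # tokens belonging to group x
  list( desc = paste(y[4:8], collapse=' '),   # tokens 4 to 8 after the date
        dd = y[1],                            # the trade date
        debit_credit = y[grep('.*DR', y)],    # amounts flagged as debit
        trde_price = as.numeric(y[grep('[0-9]+[.][0-9]+$', y)])
       )
})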
[[1]]
[[1]]$desc
[1] "JAN 13 SOYBEAN MEAL 01"
[[1]]$dd
[1] "11/26/2"
[[1]]$debit_credit
[1] "1.20DR" ".81DR"  ".02DR"  "2.03DR"
[[1]]$trde_price
[1] 424.7

[[2]]
[[2]]$desc
[1] "DEC 12 SWISS FRANC 16"

.....

PS: This approach loses the BUY/SELL information. Hope this helps.
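One way to keep the BUY/SELL side is to work on whole lines instead of tokens, so that column positions survive. The following is only a sketch, not the answer's method: `parse_trades` and its helper names are hypothetical, the column boundaries are taken from the dashed ruler line under the header, and it assumes every record ends with a TOTAL COMMISSION & FEES row as in the sample.

```r
# Sketch only: column positions are read from the dashed ruler line, and each
# record is assumed to end with a "TOTAL COMMISSION & FEES" row.
parse_trades <- function(lines) {
  ruler <- grep("^\\s*-+(\\s+-+)+\\s*$", lines)[1]
  stopifnot(!is.na(ruler))
  runs  <- gregexpr("-+", lines[ruler])[[1]]
  start <- as.integer(runs)
  end   <- start + attr(runs, "match.length") - 1L
  # Cut column i (1 = TRADE, 4 = BUY, 5 = SELL, 6 = CONTRACT DESCRIPTION).
  field <- function(x, i) trimws(substring(x, start[i], end[i]))

  body <- lines[-seq_len(ruler)]          # everything below the ruler
  body <- body[nzchar(trimws(body))]      # drop blank lines

  # The line after a TOTAL row opens the next record.
  is_total <- grepl("TOTAL COMMISSION & FEES", body, fixed = TRUE)
  rec <- cumsum(c(TRUE, head(is_total, -1)))

  # Last token of a line with its trailing letters (e.g. "DR") stripped.
  amount <- function(line) {
    tok <- sub("^.*\\s", "", trimws(line))
    as.numeric(sub("[A-Z]+$", "", tok))
  }

  rows <- lapply(split(body, rec), function(x) {
    is_trade <- grepl("^\\s*[0-9]{2}/[0-9]{2}/", x)   # rows starting with a date
    star <- grep("[0-9]\\*", x, value = TRUE)[1]      # the aggregate row
    pick <- function(label) grep(label, x, fixed = TRUE, value = TRUE)[1]
    qty  <- function(i) as.numeric(sub("*", "", field(star, i), fixed = TRUE))
    data.frame(
      contract = paste(unique(field(x[is_trade], 6)), collapse = "; "),
      buy = qty(4), sell = qty(5),                    # NA when the side is empty
      commission = amount(star),
      exchange_clearing = amount(pick("EXCHANGE & CLEARING FEE")),
      nfa = amount(pick("NFA FEE")),
      total = amount(pick("TOTAL COMMISSION & FEES")),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}
```

It would be called as parse_trades(readLines('yourfile_name')) and returns one data frame row per record. Amounts come back without the DR suffix, so the debit/credit sign would still need handling if the report also contains credits.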

Other suggestions

agstudy's answer looks very helpful. I'm going to suggest an alternative approach: fix the bleeping input file first. If you can't get to the source program and change the output format, at the very least you can do the following in any text editor (even, dare I say it, Microsoft Word :-) ).

Edit: the suggestions below are backwards, i.e. you probably want to keep only the end-of-lines that are followed by a date string. The concept is the same, but modify the search term to find "anything but...". Sorry for the misdirection.

Do a global search and replace for a paragraph mark (end of line) followed by two digits and a "/", and replace it with a tab and the same two digits and "/".

In Word, this would be FIND what ^13([0-9]{2,2}/) and REPLACE with ^t\1; editors that support regular expressions will do it a little differently. Now your source file has one (longish) row for each date entry, and you can easily extract the columns of interest.
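The same clean-up can be done in R rather than in a text editor. This is a minimal sketch of the corrected direction from the edit note above (keep a line break only in front of a dated row); `collapse_by_date` is a hypothetical helper name, and the date pattern assumes rows start with two digits, a slash and two digits, as in the sample.

```r
# Sketch: keep a line break only in front of a dated row; every other row is
# appended to the dated row above it, separated by `sep`.
collapse_by_date <- function(lines, sep = "\t") {
  lines <- lines[nzchar(trimws(lines))]                 # drop blank lines
  is_start <- grepl("^\\s*[0-9]{2}/[0-9]{2}/", lines)   # rows that begin with a date
  grp <- cumsum(is_start)                               # 0 for rows before the first date
  keep <- grp > 0                                       # discards the header and ruler
  pieces <- split(lines[keep], grp[keep])
  unname(vapply(pieces, paste, character(1), collapse = sep))
}
```

Each element of the result is one dated row followed by its continuation rows, which can be split again on the tab. Note that in the sample the 12/07/2 record has two dated rows before a single aggregate row, so the first of them would end up on a row of its own without any fees.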

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow