Question

I am trying to process a "segmentation file" called a .TextGrid (generated by the Praat program).

The original format looks like this:

File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0 
xmax = 243.761375 
tiers? <exists> 
size = 17 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "phones" 
        xmin = 0 
        xmax = 243.761 
        intervals: size = 2505 
        intervals [1]:
            xmin = 0 
            xmax = 0.4274939687384032 
            text = "_" 
        intervals [2]:
            xmin = 0.4274939687384032 
            xmax = 0.472 
            text = "v" 
        intervals [3]:
[...]

(This is then repeated to EOF, with intervals [3] to [n] for each of the n items (layers of annotation) in the file.)

Somebody proposed a solution using the rPython R package.

Unfortunately:

  • I don't have a good knowledge of Python.
  • rPython is not available for R 3.0.2 (which I am using).
  • My aim is to develop this parser for my analysis exclusively in the R environment.

Right now my aim is to segment this file into multiple data frames. Each data frame should contain one item (layer of annotation).

library(stringr)  # for str_trim()

# Load the data, splitting each line at "=" into two columns
txtgrid <- read.delim("./xxx_01_xx.textgrid", sep="=", dec=".", header=FALSE)
# Trim white space from the first column
txtgrid[,1] <- str_trim(txtgrid[,1])
# Convert row.names to numeric
num.row <- as.numeric(row.names(txtgrid))
# Add the row numbers as a column (I want to keep them for later processing)
txtgrid <- data.frame(num.row, txtgrid)
colnames(txtgrid) <- c("num.row", "object", "value")
head(txtgrid)

The output of head(txtgrid) is very raw, so here are the first 20 lines of the textgrid, txtgrid[1:20,]:

   num.row          object                value
1        1       File type           ooTextFile
2        2    Object class             TextGrid
3        3            xmin                   0 
4        4            xmax          243.761375 
5        5 tiers? <exists>                     
6        6            size                  17 
7        7        item []:                     
8        8       item [1]:                     
9        9           class        IntervalTier 
10      10            name              phones 
11      11            xmin                   0 
12      12            xmax             243.761 
13      13 intervals: size                2505 
14      14  intervals [1]:                     
15      15            xmin                   0 
16      16            xmax  0.4274939687384032 
17      17            text                   _ 
18      18  intervals [2]:                     
19      19            xmin  0.4274939687384032 
20      20            xmax               0.472 

Now that I have pre-processed it, I can:

# Find the number of the rows where I want to split (i.e. Item)
tier.begining <- txtgrid[grep("item", txtgrid$object, perl=TRUE), ]
# And save those numbers in a variable
x <- as.numeric(row.names(tier.begining))

The variable x gives me the row numbers at which each item starts, so my data should be split into several data frames just before each of these rows.

I have 18 items minus 1 (the first item is item [] and includes all the other items), so the vector x is:

x
 [1]     7     8 10034 14624 19214 22444 25674 28904 31910 35140 38146 38156 38566 39040 39778 40222 44800
[18] 45018

How can I tell R to segment this data frame into multiple data frames, textgrid$nameoftheItem, so that I get as many data frames as I have items? For example:

textgrid$phones
         item [1]:
            class = "IntervalTier" 
            name = "phones" 
            xmin = 0 
            xmax = 243.761 
            intervals: size = 2505 
            intervals [1]:
            xmin = 0 
            xmax = 0.4274939687384032 
            text = "_" 
            intervals [2]:
            xmin = 0.4274939687384032 
            xmax = 0.472 
            text = "v" 
            [...]
            intervals [n]:
textgrid$syllable
    item [2]:
            class = "IntervalTier" 
            name = "syllable" 
            xmin = 0 
            xmax = 243.761 
            intervals: size = 1200
            intervals [1]:
            xmin = 0 
            xmax = 0.500
            text = "ve" 
            intervals [2]:
            [...]
            intervals [n]:
    textgrid$item[n]

I wanted to use

txtgrid.new <- split(txtgrid, f=x)

But I get this warning:

Warning message: In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : data length is not a multiple of split variable

I don't get the desired output: the row numbers don't follow each other and the file seems to be all mixed up.

I have also tried which, daply (from plyr) and subset, but never got them to work properly.

I welcome any ideas for structuring this data properly and efficiently. Ideally I should be able to link items (layers of annotation) to each other (via the xmin and xmax of different layers), as well as across multiple TextGrid files. This is just the beginning.

Solution

The length of the split vector should be equal to the number of rows in the data.frame.
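
To make this concrete, here is a minimal sketch on a made-up six-row data frame (the data and the names `toy`, `f` and `pieces` are illustrative, not taken from the question). It contrasts a vector of break positions, like x in the question, with a vector of one group label per row, which is what split() expects. The labels are built with cumsum(grepl(...)), which is just one convenient way to get them.

```r
# Toy data standing in for the parsed TextGrid (illustrative values only)
toy <- data.frame(object = c("item [1]:", "name", "xmin",
                             "item [2]:", "name", "xmin"),
                  value  = c("", "phones", "0", "", "syllable", "0"),
                  stringsAsFactors = FALSE)

# Break positions, as in the question: the rows where a tier starts.
# This has 2 elements for 6 rows, so it is not a valid grouping for split().
x <- grep("item", toy$object)           # 1 4

# One label per row: each tier start increments the group id
f <- cumsum(grepl("item", toy$object))  # 1 1 1 2 2 2

# Now every row belongs to exactly one group
pieces <- split(toy, f)
length(pieces)                          # 2
```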

Try the following:

# Drop the file header, up to and including the "item []:" row
txtgrid.sub <- txtgrid[-(1:grep("item", txtgrid$object)[1]), ]

# Rows (within txtgrid.sub) where each tier starts
starts <- grep("item", txtgrid.sub$object)

# One group id per row: tier i runs from starts[i] to the row before starts[i + 1]
splits <- rep(seq_along(starts), diff(c(starts, nrow(txtgrid.sub) + 1)))

df.list <- split(txtgrid.sub, splits)
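
The question asks for elements that can be reached by tier name (textgrid$phones and so on). One possible way to get there, sketched here on a hand-built stand-in for df.list (the values and the helper name `tier.names` are made up for illustration), is to read the "name" row of each piece and use it as the list name. This assumes every tier has a "name" row in the object column, as in the sample file.

```r
# Dummy stand-in for df.list: one data frame per tier, using the
# object/value columns from the question (values are made up)
df.list <- list(
  `1` = data.frame(object = c("item [1]:", "class", "name", "xmin"),
                   value  = c("", "IntervalTier ", "phones ", "0 "),
                   stringsAsFactors = FALSE),
  `2` = data.frame(object = c("item [2]:", "class", "name", "xmin"),
                   value  = c("", "IntervalTier ", "syllable ", "0 "),
                   stringsAsFactors = FALSE)
)

# Take the first "name" row of each tier and strip surrounding white space
tier.names <- vapply(df.list, function(d) {
  gsub("^\\s+|\\s+$", "", d$value[d$object == "name"][1])
}, character(1))

# make.unique() guards against two tiers sharing the same name
names(df.list) <- make.unique(tier.names)

df.list$phones
```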

EDIT:

You could then simplify the data by doing something like this:

l <- lapply(df.list, function(x) {
  tmp <- as.data.frame(t(x[, 3, drop=FALSE]), stringsAsFactors=FALSE)
  names(tmp) <- make.unique(make.names(x[, 2]))
  tmp
})

library(plyr)
do.call(rbind.fill, l)


  item..1..        class     name xmin    xmax intervals..size
1      <NA> IntervalTier   phones    0 243.761            2505
2      <NA> IntervalTier syllable    0 243.761            2505
  intervals..1.. xmin.1             xmax.1 text intervals..2..
1           <NA>      0 0.4274939687384032    _           <NA>
2           <NA>      0 0.4274939687384032    _           <NA>
              xmin.2 xmax.2
1 0.4274939687384032  0.472
2               <NA>   <NA>

NB: I've used dummy data for the above.
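
The wide layout above gets one set of columns per interval, which may become unwieldy for a tier with 2505 intervals. As an alternative, here is a rough sketch that reshapes a single tier into a long table with one row per interval. It uses made-up data, and the names `tier`, `id`, `body` and `intervals` are illustrative. It assumes an IntervalTier in which every interval has exactly one xmin, one xmax and one text row.

```r
# Dummy tier in the object/value layout from the question (made-up values)
tier <- data.frame(
  object = c("item [1]:", "class", "name", "xmin", "xmax", "intervals: size",
             "intervals [1]:", "xmin", "xmax", "text",
             "intervals [2]:", "xmin", "xmax", "text"),
  value  = c("", "IntervalTier", "phones", "0", "243.761", "2",
             "", "0", "0.43", "_",
             "", "0.43", "0.472", "v"),
  stringsAsFactors = FALSE)

# Interval id per row: 0 for the tier header, then 1, 2, ... for each interval.
# The pattern matches "intervals [n]:" but not "intervals: size".
id <- cumsum(grepl("^intervals \\[", tier$object))

# Keep only the rows that belong to an interval
body <- tier[id > 0, ]
id <- id[id > 0]

# Pick out each field by name; this relies on the one-row-per-field assumption
intervals <- data.frame(
  interval = unique(id),
  xmin = as.numeric(body$value[body$object == "xmin"]),
  xmax = as.numeric(body$value[body$object == "xmax"]),
  text = body$value[body$object == "text"],
  stringsAsFactors = FALSE)

intervals
```

With tiers in this shape, comparing xmin and xmax across layers becomes an ordinary data frame operation.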

OTHER TIPS

You seem to have found a good solution elsewhere, but I thought I might as well put this here for reference:

I recently finished a first working version of a JSON converter for Praat objects that could have been used for this. You can save the TextGrid as a JSON file using the script save_as_json.praat included in this plugin (again: I am the author of that plugin).

Copied from this other answer to a similar question, once you have the plugin installed you can use the script from the Save menu in Praat or run it like this from another script:

runScript: preferencesDirectory$ + "/plugin_jjatools/save_as_json.praat",
  ..."/output/path", "Pretty printed" 

Once that is done, you can read it into R using rjson like this:

> library(rjson)
> tg <- fromJSON(file='/path/to/your_textgrid.json')
> str(tg)
List of 5
$ File type   : chr "json"
$ Object class: chr "TextGrid"
$ start       : num 0
$ end         : num 1.82
$ tiers       :List of 2
    ..$ :List of 5
    .. ..$ class    : chr "IntervalTier"
    .. ..$ name     : chr "keyword"
    .. ..$ start    : num 0
    .. ..$ end      : num 1.82
    .. ..$ intervals:List of 3
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 0
    .. .. .. ..$ end  : num 0.995
    .. .. .. ..$ label: chr ""
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 0.995
    .. .. .. ..$ end  : num 1.5
    .. .. .. ..$ label: chr "limite"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.5
    .. .. .. ..$ end  : num 1.82
    .. .. .. ..$ label: chr ""
    ..$ :List of 5
    .. ..$ class    : chr "IntervalTier"
    .. ..$ name     : chr "segments"
    .. ..$ start    : num 0
    .. ..$ end      : num 1.82
    .. ..$ intervals:List of 8
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 0
    .. .. .. ..$ end  : num 0.995
    .. .. .. ..$ label: chr ""
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 0.995
    .. .. .. ..$ end  : num 1.07
    .. .. .. ..$ label: chr "l"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.07
    .. .. .. ..$ end  : num 1.15
    .. .. .. ..$ label: chr "i"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.15
    .. .. .. ..$ end  : num 1.23
    .. .. .. ..$ label: chr "m"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.23
    .. .. .. ..$ end  : num 1.28
    .. .. .. ..$ label: chr "i"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.28
    .. .. .. ..$ end  : num 1.37
    .. .. .. ..$ label: chr "t"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.37
    .. .. .. ..$ end  : num 1.5
    .. .. .. ..$ label: chr "e"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.5
    .. .. .. ..$ end  : num 1.82
    .. .. .. ..$ label: chr ""

You can then access an individual interval using, for example, tg$tiers[[tier_number]]$intervals[[interval_number]].
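
If a data frame is more convenient than a nested list, something along these lines could flatten it. This is only a sketch: the list `tg` below is built by hand to mimic part of the str() output above so that it runs without the JSON file, and `tier_to_df` and `all_intervals` are hypothetical names, not part of rjson or the plugin.

```r
# Hand-built list mimicking part of the str(tg) output above
tg <- list(tiers = list(
  list(class = "IntervalTier", name = "keyword", start = 0, end = 1.82,
       intervals = list(
         list(start = 0,     end = 0.995, label = ""),
         list(start = 0.995, end = 1.5,   label = "limite"),
         list(start = 1.5,   end = 1.82,  label = "")))))

# Turn one tier into a data frame with one row per interval
tier_to_df <- function(tier) {
  iv <- tier$intervals
  data.frame(
    tier  = tier$name,
    start = vapply(iv, function(i) i$start, numeric(1)),
    end   = vapply(iv, function(i) i$end,   numeric(1)),
    label = vapply(iv, function(i) i$label, character(1)),
    stringsAsFactors = FALSE)
}

# Stack all tiers into a single long data frame
all_intervals <- do.call(rbind, lapply(tg$tiers, tier_to_df))
all_intervals
```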

Licensed under: CC-BY-SA with attribution