Parsing a numbered transcript into XML

https://stackoverflow.com/questions/9367413

28-10-2019

Question

I'd like to write a scraper that parses the transcripts from the Leveson Inquiry, which are available as plain text in the following format:

         1                                      Thursday, 2 February 2012

         2   (10.00 am)

         3   LORD JUSTICE LEVESON:  Good morning.

         4   MR BARR:  Good morning, sir.  We're going to start today

         5       with witnesses from the mobile phone companies,

         6       Mr Blendis from Everything Everywhere, Mr Hughes from

         7       Vodafone and Mr Gorham from Telefonica.

         8   LORD JUSTICE LEVESON:  Very good.

         9   MR BARR:  We're going to listen to them all together, sir.

        10       Can I ask that the gentlemen are sworn in, please.

        11                   MR JAMES BLENDIS (affirmed)

        12                     MR ADRIAN GORHAM (sworn)

        13                      MR MARK HUGHES (sworn)

        14                       Questions by MR BARR

        15   MR BARR:  Can I start, please, Mr Hughes, with you.  Could

        16       you tell us the position that you hold and a little bit

        17       about your professional background, please?

        18   MR HUGHES:  Yes, sure.  I'm currently head of fraud risk and

        19       security for Vodafone UK.  I have been in that position

        20       since August 2011 and I've worked in the fraud risk and

        21       security department in Vodafone since October 2006.

        22   Q.  Mr Gorham, if I could ask you the same question, please.

        23   MR GORHAM:  I'm the head of fraud and security for

        24       Telefonica O2, I've been in that role for ten years and

        25       have been in the industry for 13.


                                         1

(Full example)

Ultimately, I'd like to build an XML file structured like this:

<hearing date="2012-02-02" time="10:00">
    <quote speaker="Lord Justice Leveson" page="1" line="3">Good morning.</quote>
    <quote speaker="Mr Barr" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</quote>
    <quote speaker="Lord Justice Leveson" page="1" line="8">Very good.</quote>
[... and on ...]
</hearing>

...any help?

(Also, note that "MR BARR:" simply changes to "Q." at a certain point.)

Thanks very much!

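Solution

Let me start by saying that this is not a foolproof script. There may be things I have forgotten or overlooked, but it is a proof of concept for you to improve, extend, or simply take ideas from.

The text layout is regular enough to work with. The script splits the transcript into lines and matches each line against a handful of patterns.

Example script: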

<?php
/*
Proof of Concept : Transcript to XML by Robjong

? :
    - action on date change (what to do when the date changes?)
    - what to do with lines like "MR MARK HUGHES (sworn)" (make it a note?!)
    - what to do with lines like "Questions by MR BARR" (make it a note?!)
    - detect events/notes in quotes better? (e.g: MR BLENDIS: (Nods head).)


Notes :

    - desperately needs error checking/handling!!!! (for now it just got in the way)
    - it might well be that the configuration of PHP will cause file_get_contents to fail,
      try curl or download it manually and read the local file
    - if you are running PHP < 5.2.4, change the \h in the pattern to \s or [\t ]

*/

# basic usage
// get the transcript as plain text
$txt = file_get_contents( 'http://www.levesoninquiry.org.uk/wp-content/uploads/2012/02/Transcript-of-Morning-Hearing-2-February-2012.txt' );
// convert transcript to XML
$xml = transcriptToXML_beta( $txt );
// we have the transcript as XML, now what?
file_put_contents( 'transcript.xml', $xml ); // let's write it to a file
echo $xml;


function transcriptToXML_beta( $string ) { // beta is just to emphasize lack of thorough testing
    $lines = explode( "\n", $string ); // split text into an array of lines
    if( !is_array( $lines ) ) { // the provided string was not multiline
        return false;
    }

    // these vars will hold the data we need to build our XML
    $date = ''; // the date of the transcript
    $time = ''; // transcript time
    $page = 1; // this will hold the current page number

    $linenr = ''; // this will hold the line nr
    $speaker = ''; // the name of the speaker
    $text = ''; // transcribed text attributed to the speaker
    $new = false; // will be true if a new item has been matched
    $event = ''; // this will hold notes that are in a quote but need to be stored separately (events)

    $xml = ''; // this will be the XML string
    $i = 0; // count the lines to display actual line number for debugging
    foreach( $lines as $line ) { // loop over lines
        $i++;
        if( !preg_match( "/[[:graph:]]/", $line ) ) { // line is empty, does not contain printable characters....
            continue; // ....so we skip to the next one
        }

        if( preg_match( "/^\h*\d+\h+(?P<date>[a-z]+,\h+\d+\h+[a-z]+\h\d{4})\s*$/i", $line, $match ) ) { # it looks like a date
            $date = $match['date']; // set date
            $speaker = ''; // reset vars
            $text = '';
            continue;// no need to handle this line any further
        } elseif( preg_match( "/^\h*\d+\h+([A-Z]+(?:\s+[A-Z]+){0,4}\h+\(.*?\)|(?i:questions\h+by)[A-Z\h]+)\s*$/", $line, $match ) ) { # (queued) event, uppercase text followed by text between parentheses, or a "Questions by ..." line
            $event .= "    <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry to queue, to be added after current quote
            continue;// no need to handle this line any further
        } elseif( preg_match( "/^\h*(\d*)\h*\(\h*(?P<time>\d{1,2}[:.]\d{1,2}\h*[ap]m)\)\s*$/i", $line, $match ) ) { # seems we have a time entry
            $time = $match['time']; // set time
            $xml .= "    <time page=\"{$page}\" line=\"{$match[1]}\">" . strtoupper( str_replace( '.', ':', $match['time'] ) ) . "</time>\n"; // add time as entry
            $speaker = ''; // reset vars
            $text = '';
            continue;// no need to handle this line any further
        } elseif( preg_match( "/^\h*(\d+)\s*$/", $line, $match ) ) { # line has just one or more digits, we assume it's a pagenr
            if( $match[1] <= $page ) { // if the number is not higher than the current page number ignore it, this avoids triggering errors for 'empty lines' that only have a line number
                continue;
            }
            $page = (int) $match[1] + 1; // set pagenr, add one because the nr is at the bottom of the page
            continue;// no need to handle this line any further
        } elseif( preg_match( "/^\h*\d+\s+\(([[:print:]]+)\)\s*$/", $line, $match ) && !$speaker ) { # note, text is between parentheses
            $xml .= "    <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
            continue;// no need to handle this line any further
        } elseif( preg_match( "/^\h*\d+\h+([A-Z\h]+\(.*?\))\s*$/", $line, $match ) && !$speaker ) { # note, uppercase text followed by text between parentheses, only if not in quote
            $xml .= "    <event type=\"note\" speaker=\"\" page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note (the capture group above supplies $match[1])
            continue;// no need to handle this line any further
        } elseif( preg_match("/^\h*(?P<linenr>\d+)\h+(?P<speaker>(?:\h[A-Z]+(?:\h[A-Z]+){0,4}))[:.]\h*(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # new speaker entry
            if( $new && $speaker ) { // if we have one open we need to add it first
                $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n"; // add quote
                $new = false; // reset
                if( $event ) { // if we have a queued note we need to add that too
                    $xml .= $event; // add entry to XML string
                    $event = ''; // clear the queue
                }
            }
            $speaker = trim( $match['speaker'] ); // assign match to speaker var
            $linenr = (int) $match['linenr']; // assign line number
            $text = trim( $match['text'] ); // assign text
            $new = true; // set new match bool
        } elseif( preg_match( "/^\h*(?P<linenr>\d+)\h+(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # follow up text
            $text .= ' ' . trim( $match['text'] ); // append text
        } else { # unknown line (add check for linenr only lines or remove the $match[1] <= $page check from the pagenr match conditional)
            // not sure what kind of line this is... add it as a separate 'type'?!
            trigger_error( "Unable to parse line {$i} \"{$line}\"" ); // emit a notice (an exception could be thrown here instead)
            continue; // no need to handle this line any further
        }

        if( !$new && $speaker ) {
            $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
            $speaker = ''; // reset vars
            $text = '';
            $new = false;
            if( $event ) { // if we have a queued note we need to add that too
                $xml .= $event; // add entry to XML string
                $event = ''; // clear the queue
            }
        }
    }

    // if we have a match open we need to handle it, this might happen because we do not assign the match in the same iteration as we matched it
    if( $new ) {
        $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
    }

    if( !trim( $xml ) ) { // no text found so $xml is still an empty string
        return false;
    }

    $date = new DateTime( $date ); // instantiate DateTime with the date from the transcript, e.g. "Thursday, 2 February 2012"
    $date = $date->format( 'Y-m-d' ); // format date
    // now we need to wrap the nodes in a root node
    $xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<hearing date=\"{$date}\">\n{$xml}</hearing>\n";

    return $xml; // return the XML
}
?>
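One thing the script above does not do is escape the transcribed text before concatenating it into the XML string, so a stray `&` or `<` in a quote would produce a malformed document. A minimal sketch of one way around that, assuming the XMLWriter extension is available; `buildHearingXml` and the shape of the `$entries` array are hypothetical and not part of the script above:

```php
<?php
// Hypothetical helper: build the <hearing> document from already-parsed entries.
// Each entry is assumed to be an array with 'speaker', 'page', 'line' and 'text' keys.
function buildHearingXml( $date, array $entries ) {
    $w = new XMLWriter();
    $w->openMemory();                    // write into a string buffer
    $w->setIndent( true );               // one element per line, for readability
    $w->startDocument( '1.0', 'UTF-8' );
    $w->startElement( 'hearing' );
    $w->writeAttribute( 'date', $date );
    foreach( $entries as $entry ) {
        $w->startElement( 'quote' );
        $w->writeAttribute( 'speaker', $entry['speaker'] );
        $w->writeAttribute( 'page', (string) $entry['page'] );
        $w->writeAttribute( 'line', (string) $entry['line'] );
        $w->text( $entry['text'] );      // XMLWriter escapes & and < here
        $w->endElement();                // </quote>
    }
    $w->endElement();                    // </hearing>
    $w->endDocument();
    return $w->outputMemory();
}
```

The parsing loop would then only need to collect entries in an array instead of appending strings; wrapping the values in `htmlspecialchars()` inside the existing string concatenation would be a smaller change with a similar effect.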

I'll update the comments and the script later today.

Sample output:

<hearing date="2012-02-02"> 
    <time page="1" line="2">10:00 AM</time> 
    <entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="3">Good morning.</entry> 
    <entry type="quote" speaker="MR BARR" page="1" line="4">Good morning, sir.  We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</entry> 
    <entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="8">Very good.</entry> 
    <entry type="quote" speaker="MR BARR" page="1" line="9">We're going to listen to them all together, sir. Can I ask that the gentlemen are sworn in, please.</entry> 
    <event page="1" line="9">MR JAMES BLENDIS (affirmed)</event> 
    <event page="1" line="9">MR ADRIAN GORHAM (sworn)</event> 
    <event page="1" line="9">MR MARK HUGHES (sworn)</event> 
    <event page="1" line="9">Questions by MR BARR</event>
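The sample output keeps the speaker names in upper case, reports the time as "10:00 AM", and would label the later "Q." lines with the speaker "Q", whereas the question asks for "Lord Justice Leveson", "10:00" and the questioner's name. A rough sketch of post-processing helpers that could close that gap; all four function names are hypothetical, and treating "Q." as the person named in the most recent "Questions by ..." line is an assumption about the transcript format:

```php
<?php
// Hypothetical helper: "LORD JUSTICE LEVESON" -> "Lord Justice Leveson".
// Note: names such as "MCDONALD" would come out as "Mcdonald".
function normalizeSpeaker( $speaker ) {
    return ucwords( strtolower( trim( $speaker ) ) );
}

// Hypothetical helper: "10.00 am" -> "10:00", "2.15 pm" -> "14:15".
// Returns null when the string does not look like a time.
function normalizeTime( $time ) {
    if( !preg_match( '/^(\d{1,2})[:.](\d{2})\h*([ap])m$/i', trim( $time ), $m ) ) {
        return null;
    }
    $hour = (int) $m[1] % 12;            // 12 am becomes 0, 12 pm becomes 0 + 12
    if( strtolower( $m[3] ) === 'p' ) {
        $hour += 12;
    }
    return sprintf( '%02d:%s', $hour, $m[2] );
}

// Hypothetical helper: pull the name out of a "Questions by MR BARR" event,
// or return null when the text is not such a line.
function questionerFromEvent( $eventText ) {
    if( preg_match( '/^questions\h+by\h+([A-Z][A-Z\h]*?)\h*$/i', trim( $eventText ), $m ) ) {
        return strtoupper( $m[1] );
    }
    return null;
}

// Hypothetical helper: replace the speaker "Q" with the current questioner, if known.
function resolveSpeaker( $speaker, $questioner ) {
    if( $speaker === 'Q' && $questioner !== '' ) {
        return $questioner;
    }
    return $speaker;
}
```

In the parsing loop this would mean remembering the result of `questionerFromEvent()` whenever an event is queued, and passing each matched speaker through `resolveSpeaker()` and `normalizeSpeaker()` before writing the entry.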

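By the way, just out of curiosity, what do you need this for?

Other tips

In general this is a very hard problem, and it is beyond the scope of Stack Overflow. That said, if I had to do this, I would take an iterative approach: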

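1. Identify the regularities in the text layout and devise a candidate grammar.
2. Write a parser using that grammar. Make the parsing very strict and discard anything that does not match (with an error message).
3. Run it over the whole text.
4. Examine the output and the mismatches, fix the grammar, and identify special cases.
5. Go back to step 3.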

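How far you take each of those steps is up to you; you can judge for yourself whether the parser is producing what you want. Any solution will also need manual intervention, before or after parsing, to clean up low-frequency inconsistencies.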
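The iterative loop described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the patterns assume the column layout of the excerpt in the question (three spaces between the line number and a speaker, seven before continuation text), which may not hold for the whole transcript.

```php
<?php
// Hypothetical candidate grammar: one deliberately strict pattern per kind of line.
// Rules are tried in order, so the more specific ones come first.
function candidateGrammar() {
    return array(
        'date'    => '/^\h*\d+\h+[A-Za-z]+,\h+\d{1,2}\h+[A-Za-z]+\h+\d{4}\h*$/',
        'time'    => '/^\h*\d+\h+\(\d{1,2}[.:]\d{2}\h*[ap]m\)\h*$/i',
        'page'    => '/^\h*\d+\h*$/',
        'speaker' => '/^\h*\d+\h{3}(?:[A-Z]+(?:\h[A-Z]+){0,4}:|Q\.)\h+\S.*$/',
        'text'    => '/^\h*\d+\h{7}\S.*$/',
    );
}

// Return the name of the first rule that matches, or null if none does.
function classifyLine( $line, array $grammar ) {
    foreach( $grammar as $type => $pattern ) {
        if( preg_match( $pattern, $line ) ) {
            return $type;
        }
    }
    return null;
}

// Run the grammar over the whole text; count matches per rule and keep every
// rejected line (keyed by its 1-based line number) for inspection.
function checkTranscript( $txt, array $grammar ) {
    $counts = array();
    $rejected = array();
    foreach( preg_split( '/\R/', $txt ) as $i => $line ) {
        if( trim( $line ) === '' ) {
            continue; // blank lines carry no information
        }
        $type = classifyLine( $line, $grammar );
        if( $type === null ) {
            $rejected[$i + 1] = $line;
            continue;
        }
        $counts[$type] = isset( $counts[$type] ) ? $counts[$type] + 1 : 1;
    }
    return array( 'counts' => $counts, 'rejected' => $rejected );
}
```

With this grammar, centered lines such as "Questions by MR BARR" or "MR MARK HUGHES (sworn)" end up in `rejected`, which is the cue to add a rule for them and run the check again.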

License: CC-BY-SA with attribution

Not affiliated with Stack Overflow