Parsing .srt files

https://stackoverflow.com//questions/11659118

11-12-2019

Question

1
00:00:00,074 --> 00:00:02,564
Previously on Breaking Bad...

2
00:00:02,663 --> 00:00:04,393
Words...

I need to parse SRT files with PHP and print all the subtitles in the file using variables.

I couldn't find the right regular expressions. While doing this I need to capture the id, the time, and the subtitle text into variables. When printing, there must be no array() output or the like; it must print just as it appears in the original file.

I mean I need to print it like this:

$number <br> (e.g. 1)
$time <br> (e.g. 00:00:00,074 --> 00:00:02,564)
$subtitle <br> (e.g. Previously on Breaking Bad...)

By the way, I have this code, but it doesn't match the lines. It needs to be edited, but how?

$srt_file = file('test.srt',FILE_IGNORE_NEW_LINES);
$regex = "/^(\d)+ ([\d]+:[\d]+:[\d]+,[\d]+) --> ([\d]+:[\d]+:[\d]+,[\d]+) (\w.+)/";

foreach($srt_file as $srt){

    preg_match($regex,$srt,$srt_lines);

    print_r($srt_lines);
    echo '<br />';

}

Solution

Here is a short and simple state machine that parses the SRT file line by line:

define('SRT_STATE_SUBNUMBER', 0);
define('SRT_STATE_TIME',      1);
define('SRT_STATE_TEXT',      2);
define('SRT_STATE_BLANK',     3);

$lines   = file('test.srt');

$subs    = array();
$state   = SRT_STATE_SUBNUMBER;
$subNum  = 0;
$subText = '';
$subTime = '';

foreach($lines as $line) {
    switch($state) {
        case SRT_STATE_SUBNUMBER:
            $subNum = trim($line);
            $state  = SRT_STATE_TIME;
            break;

        case SRT_STATE_TIME:
            $subTime = trim($line);
            $state   = SRT_STATE_TEXT;
            break;

        case SRT_STATE_TEXT:
            if (trim($line) == '') {
                $sub = new stdClass;
                $sub->number = $subNum;
                list($sub->startTime, $sub->stopTime) = explode(' --> ', $subTime);
                $sub->text   = $subText;
                $subText     = '';
                $state       = SRT_STATE_SUBNUMBER;

                $subs[]      = $sub;
            } else {
                $subText .= $line;
            }
            break;
    }
}

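if ($state == SRT_STATE_TEXT) {
    // If the file was missing the trailing blank line, we end up in this
    // state here. Build the last subtitle from what was read and add it
    // (reusing the previous $sub object would overwrite its text).
    $sub = new stdClass;
    $sub->number = $subNum;
    list($sub->startTime, $sub->stopTime) = explode(' --> ', $subTime);
    $sub->text   = $subText;
    $subs[]      = $sub;
}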

print_r($subs);

Result:

Array
(
    [0] => stdClass Object
        (
            [number] => 1
            [stopTime] => 00:00:24,400
            [startTime] => 00:00:20,000
            [text] => Altocumulus clouds occur between six thousand
        )

    [1] => stdClass Object
        (
            [number] => 2
            [stopTime] => 00:00:27,800
            [startTime] => 00:00:24,600
            [text] => and twenty thousand feet above ground level.
        )

)

You can then loop over the array of subs or access them by array offset:

echo $subs[0]->number . ' says ' . $subs[0]->text . "\n";

To show all the subs, loop over each one and display it:

foreach($subs as $sub) {
    echo $sub->number . ' begins at ' . $sub->startTime .
         ' and ends at ' . $sub->stopTime . '.  The text is: <br /><pre>' .
         $sub->text . "</pre><br />\n";
}
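The question asked for output that mirrors the original file: number, time range, then text, each followed by a line break. A minimal sketch of that, assuming subtitle objects shaped like the ones built by the state machine (number, startTime, stopTime, text). The helper name formatSub and the sample data are made up for illustration:

```php
<?php
// Hypothetical helper: renders one subtitle as number, time range and text,
// each followed by <br />, in the same order as in the .srt file.
function formatSub(stdClass $sub): string
{
    $time = $sub->startTime . ' --> ' . $sub->stopTime;
    // Escape the text for HTML, then turn inner line breaks into <br />
    // so multi-line subtitles keep their lines.
    $text = nl2br(htmlspecialchars(rtrim($sub->text)));

    return $sub->number . "<br />\n" . $time . "<br />\n" . $text . "<br />\n";
}

// Sample data standing in for one entry of $subs.
$sub = new stdClass;
$sub->number    = '1';
$sub->startTime = '00:00:00,074';
$sub->stopTime  = '00:00:02,564';
$sub->text      = "Previously on Breaking Bad...\n";

echo formatSub($sub);
```

Note that htmlspecialchars() also escapes markup such as <i> that some subtitle files contain; drop that call if the tags should be passed through to the browser.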

Further reading: SubRip Text File Format

Other tips

That is not going to match, because your $srt_file array might look like this:

Array
([0] => '1',
[1] => '00:00:00,074 --> 00:00:02,564',
[2] => 'Previously on Breaking Bad...',
[3] => '',
[4] => '2',
...
)

Your regex is not going to match any of those elements, because it expects the number, the time range, and the text all on the same line.

If your intent is to read the entire file into one long memory hog of a string, use file_get_contents to get the entire file contents into a single string. Then use preg_match_all to get all the regex matches.

Otherwise, you could loop through the array and try to match various regex patterns to determine whether each line is an id, a time range, or text, and handle it accordingly. Obviously, you may also want some logic to make sure you are getting the values in the correct order (id, then time range, then text).

Group the file() array into chunks of 4 using array_chunk(), then skip the last entry of each chunk, since it is a blank line. This assumes every subtitle has exactly one line of text; multi-line subtitles would throw the chunks out of alignment. Like this:
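A minimal sketch of the whole-file approach, assuming well-formed timestamps of the form HH:MM:SS,mmm. The function name parseSrtString and the array keys are made up for illustration. Pass it the result of file_get_contents('test.srt'); the sample string below stands in for that:

```php
<?php
// Hypothetical helper: parses a whole SRT string with one regular expression.
function parseSrtString(string $srt): array
{
    // Drop a UTF-8 byte order mark, if present, so the first number matches.
    if (strncmp($srt, "\xEF\xBB\xBF", 3) === 0) {
        $srt = substr($srt, 3);
    }
    // Normalize line endings so the pattern only has to deal with "\n".
    $srt = str_replace(["\r\n", "\r"], "\n", $srt);

    $pattern = '/^(\d+)\n'                            // 1: subtitle number
             . '(\d{2}:\d{2}:\d{2},\d{3}) --> '       // 2: start time
             . '(\d{2}:\d{2}:\d{2},\d{3})[^\n]*\n'    // 3: stop time
             . '(.*?)(?=\n{2,}|\n*\z)/ms';            // 4: text, up to a blank line or the end

    preg_match_all($pattern, $srt, $matches, PREG_SET_ORDER);

    $subs = [];
    foreach ($matches as $m) {
        $subs[] = ['number' => $m[1], 'start' => $m[2], 'stop' => $m[3], 'text' => $m[4]];
    }
    return $subs;
}

// Sample input; in practice this would come from file_get_contents().
$sample = "1\r\n00:00:00,074 --> 00:00:02,564\r\nPreviously on Breaking Bad...\r\n\r\n"
        . "2\r\n00:00:02,663 --> 00:00:04,393\r\nWords...\r\n";

foreach (parseSrtString($sample) as $sub) {
    echo $sub['number'] . "<br />\n";
    echo $sub['start'] . ' --> ' . $sub['stop'] . "<br />\n";
    echo nl2br($sub['text']) . "<br />\n";
}
```

With PREG_SET_ORDER each element of $matches holds one complete subtitle, which keeps the loop simple.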
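The line-by-line variant could look roughly like this. It is only a sketch: classifySrtLine and parseSrtLines are hypothetical names, and lines that arrive out of order are simply skipped rather than reported:

```php
<?php
// Hypothetical helper: labels a single line as 'blank', 'id', 'time' or 'text'.
function classifySrtLine(string $line): string
{
    $line = trim($line);
    if ($line === '') {
        return 'blank';
    }
    if (preg_match('/^\d+$/', $line)) {
        return 'id';
    }
    if (preg_match('/^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}/', $line)) {
        return 'time';
    }
    return 'text';
}

// Accepts an id only when a new block is expected and a time range only
// right after an id; anything else inside a block counts as text.
function parseSrtLines(array $lines): array
{
    $subs    = [];
    $current = null;
    $expect  = 'id';

    foreach ($lines as $line) {
        $type = classifySrtLine($line);
        if ($expect === 'id' && $type === 'id') {
            $current = ['id' => trim($line), 'time' => '', 'text' => []];
            $expect  = 'time';
        } elseif ($expect === 'time' && $type === 'time') {
            $current['time'] = trim($line);
            $expect = 'text';
        } elseif ($expect === 'text' && $type !== 'blank') {
            $current['text'][] = trim($line);
        } elseif ($expect === 'text') {
            // A blank line closes the current subtitle.
            $subs[] = $current;
            $expect = 'id';
        }
    }
    if ($expect === 'text') {
        $subs[] = $current; // file ended without a trailing blank line
    }
    return $subs;
}

$subs = parseSrtLines(file('test.srt') ?: []);
```

Tracking the expected line type is what keeps a subtitle whose text happens to be a bare number from being mistaken for the next id.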

foreach( array_chunk( file( 'test.srt'), 4) as $entry) {
    list( $number, $time, $subtitle) = $entry;
    echo $number . '<br />';
    echo $time . '<br />';
    echo $subtitle . '<br />';
}

I made a class to convert a .srt file to an array. Each entry of the array has the following properties:

id: a number representing the id of the subtitle (2)
start: float, the start time in seconds (24.443)
end: float, the end time in seconds (27.647)
startString: the start time in human-readable format (00:00:24.443)
endString: the end time in human-readable format (00:00:27.647)
duration: the duration of the subtitle, in ms (3204)
text: the text of the subtitle (The peacocks ruled over Gongmen city.)

The code is PHP 7:
<?php

namespace VideoSubtitles\Srt;

class SrtToArrayTool
{
    public static function getArrayByFile(string $file): array
    {
        $ret = [];

        // Read the file lazily, one right-trimmed line at a time.
        $gen = function ($filename) {
            $file = fopen($filename, 'r');
            while (($line = fgets($file)) !== false) {
                yield rtrim($line);
            }
            fclose($file);
        };

        $item = [];
        $text = '';
        $n = 0; // 0: expecting the id, 1: expecting the time range, 2: reading text

        foreach ($gen($file) as $line) {
            if ('' !== $line) {
                if (0 === $n) {
                    $item['id'] = $line;
                    $n++;
                } elseif (1 === $n) {
                    $p = explode('-->', $line);
                    $start = str_replace(',', '.', trim($p[0]));
                    $end = str_replace(',', '.', trim($p[1]));
                    $startTime = self::toMilliSeconds(str_replace('.', ':', $start));
                    $endTime = self::toMilliSeconds(str_replace('.', ':', $end));
                    $item['start'] = $startTime / 1000;
                    $item['end'] = $endTime / 1000;
                    $item['startString'] = $start;
                    $item['endString'] = $end;
                    $item['duration'] = $endTime - $startTime;
                    $n++;
                } else {
                    // Join multi-line subtitle text with line breaks.
                    if ('' !== $text) {
                        $text .= PHP_EOL;
                    }
                    $text .= $line;
                }
            } elseif (0 !== $n) {
                // A blank line closes the current subtitle.
                $item['text'] = $text;
                $ret[] = $item;
                $text = '';
                $n = 0;
            }
        }

        // Keep the last subtitle when the file does not end with a blank line.
        if ($n >= 2) {
            $item['text'] = $text;
            $ret[] = $item;
        }

        return $ret;
    }

    // Expects "HH:MM:SS:mmm" and returns the total in milliseconds.
    private static function toMilliSeconds(string $duration): int
    {
        $p = explode(':', $duration);
        return (int)$p[0] * 3600000 + (int)$p[1] * 60000 + (int)$p[2] * 1000 + (int)$p[3];
    }
}

Or take a look at it here: https://github.com/lingtalfi/videosubtitles

You can use this project: https://github.com/captioning/captioning

Sample code:
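The time handling in that class boils down to converting an SRT timestamp into milliseconds. A standalone sketch of just that arithmetic, where the function name srtTimeToMs is made up for illustration and is not part of the linked project:

```php
<?php
// Hypothetical helper: converts an SRT timestamp such as "00:00:24,443"
// into milliseconds. Returns null when the string is not in that shape.
function srtTimeToMs(string $time): ?int
{
    if (!preg_match('/^(\d+):(\d{2}):(\d{2})[,.](\d{3})$/', trim($time), $m)) {
        return null;
    }
    // hours, minutes and seconds scaled to milliseconds, plus the fraction
    return (int)$m[1] * 3600000 + (int)$m[2] * 60000 + (int)$m[3] * 1000 + (int)$m[4];
}

$start = srtTimeToMs('00:00:24,443'); // 24443
$end   = srtTimeToMs('00:00:27,647'); // 27647

echo ($end - $start) . " ms\n";       // 3204 ms
echo ($start / 1000) . " s\n";        // 24.443 s
```

Keeping the value as an integer number of milliseconds avoids floating-point rounding when computing durations; divide by 1000 only when seconds are needed for display.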

<?php
require_once __DIR__.'/../vendor/autoload.php';

use Captioning\Format\SubripFile;

try {
    $file = new SubripFile('your_file.srt');

    foreach ($file->getCues() as $line) {
        echo 'start: ' . $line->getStart() . "<br />\n";
        echo 'stop: ' . $line->getStop() . "<br />\n";
        echo 'startMS: ' . $line->getStartMS() . "<br />\n";
        echo 'stopMS: ' . $line->getStopMS() . "<br />\n";
        echo 'text: ' . $line->getText() . "<br />\n";
        echo "=====================<br />\n";
    }

} catch(Exception $e) {
    echo "Error: ".$e->getMessage()."\n";
}

Sample output:

> php index.php
start: 00:01:48,387<br />
stop: 00:01:53,269<br />
startMS: 108387<br />
stopMS: 113269<br />
text: ┘ç┘à╪د┘ç┘┌»█î ╪▓█î╪▒┘┘ê█î╪│ ╪ذ╪د ┌ر█î┘█î╪ز ╪ذ┘┘ê╪▒█î ┘ê ┌ر╪»┌ر x265
=====================<br />
start: 00:02:09,360<br />
stop: 00:02:12,021<br />
startMS: 129360<br />
stopMS: 132021<br />
text: .┘à╪د ┘╪ذ╪د┘è╪» ╪ز┘┘ç╪د┘è┘è ╪د┘è┘╪ش╪د ╪ذ╪د╪┤┘è┘à -
┌╪▒╪د ╪ا<br />
=====================<br />
start: 00:02:12,022<br />
stop: 00:02:14,725<br />
startMS: 132022<br />
stopMS: 134725<br />
text: ..╪د┌»┘ç ┘╛╪»╪▒╪ز -
.╪د┘ê┘ ┘ç┘è┌┘ê┘é╪ز ┘à╪ز┘ê╪ش┘ç ╪▒┘╪ز┘┘à┘ê┘ ┘┘à┘è╪┤┘ç -<br />
=====================<br />

It can be done by splitting on line breaks in PHP. I managed to do it successfully; let me show my code:

$srt = preg_split("/\\r\\n\\r\\n/", trim($movie->SRT));
$result[$i]['IMDBID'] = $movie->IMDBID;
$result[$i]['TMDBID'] = $movie->TMDBID;

Here $movie->SRT is the subtitle text in the format you posted in this question. As we can see, the time blocks are separated from each other by two line breaks. I hope this answers your question.
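A slightly fuller sketch of the same idea, which also handles files that use "\n" instead of "\r\n" and splits each block into its parts. The function name parseSrtBlocks and the array keys are made up for illustration:

```php
<?php
// Hypothetical helper: splits SRT text into blocks on blank lines, then
// splits each block into number, time range and text.
function parseSrtBlocks(string $srt): array
{
    // Normalize line endings so only "\n" has to be handled below.
    $srt = str_replace(["\r\n", "\r"], "\n", $srt);

    $subs = [];
    // A blank line (possibly containing spaces) separates two blocks.
    foreach (preg_split('/\n\s*\n/', trim($srt)) as $block) {
        // Limit 3: everything after the time line stays together as the text.
        $parts = explode("\n", $block, 3);
        if (count($parts) < 2) {
            continue; // not a complete block, skip it
        }
        $subs[] = [
            'number' => trim($parts[0]),
            'time'   => trim($parts[1]),
            'text'   => $parts[2] ?? '',
        ];
    }
    return $subs;
}

$sample = "1\r\n00:00:00,074 --> 00:00:02,564\r\nPreviously on Breaking Bad...\r\n\r\n"
        . "2\r\n00:00:02,663 --> 00:00:04,393\r\nWords...\r\n";

foreach (parseSrtBlocks($sample) as $sub) {
    echo $sub['number'] . "<br />\n" . $sub['time'] . "<br />\n" . nl2br($sub['text']) . "<br />\n";
}
```

Splitting each block with a limit of 3 means multi-line subtitle text survives intact instead of being cut at its first line break.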

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow