Extracting text from IRC logs

https://stackoverflow.com/questions/13251791

27-11-2021
Question

I'd like to extract text from IRC logs. I have a regular IRC log from irssi like this:

00:12 -!- Barbora [post@gw1-nat-041.roburnet.sk] has joined #post.sk
00:12 -!- mirinda [~post@195.91.55.136] has quit [Broken pipe]
00:12 -!- rogue1 [post@86-41-114-24-dynamic.b-ras2.lmk.limerick.eircom.net] has joined #post.sk
00:12 -!- Komunista is now known as Anonym9901
00:13 -!- ajka [~post@78.141.102.209] has quit [Client exited]
00:16 < blackmamba> no fuj
00:16 < blackmamba> Komunista: lol
00:16 < blackmamba> "este trochu"
00:16 < blackmamba> "je taky velky"
00:17 -!- majopo [post@adsl-d192.84-47-63.t-com.sk] has quit [Client exited]
00:19 -!- Anonym9901 is now known as Komunista
00:19 -!- dido84 [post@BSN-143-83-49.dial-up.dsl.siol.net] has quit [Client exited]
00:19 < Komunista> no?
00:20 < Komunista> ja by som*nadavka*l
00:20 < Komunista> ako pes
00:20 -!- Komunista is now known as Anonym53560

What I need is output like this:

no fuj lol este trochu je taky velky no ja by som*nadavka*l ako pes

So, just words separated by whitespace and nothing else: no nicks, no quotation marks, no question marks, etc. I need it as input for LDA.

I plan to remove the nicks in a postprocessing step, since I think that will be easier. Or is there a better way?

I'd prefer PHP with regex, but I'm not good at it, which is why I'm asking for help.

Thank you for your time!

EDIT:

I now use this code (thanks to m.buettner):

$input = ... ;
$smiles = [">:]", ":-)", ":)", ":o)", ":]", ":3", ":c)", ":>", "=]", "8)", "=)", ":}", ":^)", ">:D", ":-D", ":D", "8-D", "x-D", "X-D", "=-D", "=D", "=-3", "8-)", ">:[", ":-(", ":(", ":-c", ":c", ":-<", ":-[", ":[", ":{", ">.>", "<.<", ">.<", ">;]", ";-)", ";)", "*-)", "*)", ";-]", ";]", ";D", ";^)", ">:P", ":-P", ":P", "X-P", "x-p", ":-p", ":p", "=p", ":-Þ", ":Þ", ":-b", ":b", "=p", "=P", ">:o", ">:O", ":-O", ":O", "°o°", "°O°", ":O", "o_O", "o.O", "8-0", ">:\\", ">:/", ":-/", ":-.", ":\\", "=/", "=\\", ":S", ":'(", ";'("];

$input = str_replace($smiles, '', $input);
$resultStr = '';
preg_match_all('/^\d\d:\d\d\s+<[%|\s|@|+][_a-zA-Z0-9]*>\s([^\r\n]*)/m', $input, $matches);
$resultStr = implode(' ', $matches[1]);
$resultStr = preg_replace('/[^\w\s*]+/', '', $resultStr);

preg_match_all('/<[%|\s|@|+][_a-zA-Z0-9]*>/m', $input, $nicks);
$nicks[0] = str_replace(['<', '>', ' ', '%', '+', '$', '@'], '', $nicks[0]);
$resultStr = str_replace($nicks[0], '', $resultStr);

Any suggestions to improve it would be appreciated ;)

Solution

Something like this?

preg_match_all('/^\d\d:\d\d\s+<[^>]*>([^\r\n]*)/m', $input, $matches);

$resultStr = implode(' ', $matches[1]);
$resultStr = preg_replace('/[^\w\s*]+/', '', $resultStr);

First we match everything after hh:mm < name> up to the end of the line. Then we join those results with spaces, and finally we remove every character that is not a word character (\w: letters, digits and the underscore), whitespace or an asterisk. Add any other characters you want to keep to the character class in the preg_replace call.
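
To see the steps above end to end, here is a minimal sketch run on a shortened copy of the log from the question. It goes a little beyond the two calls above: it also captures the nicks and removes them from the text (the question asks for that), and it collapses repeated whitespace at the end. The names $nicks, $alternation and $text are only illustrative, and the optional prefix class (space, @, +, %) is an assumption taken from the code in the question, not something irssi guarantees.

```php
<?php
// Shortened copy of the log excerpt from the question.
$input = <<<'LOG'
00:12 -!- Komunista is now known as Anonym9901
00:16 < blackmamba> no fuj
00:16 < blackmamba> Komunista: lol
00:16 < blackmamba> "este trochu"
00:19 -!- Anonym9901 is now known as Komunista
00:19 < Komunista> no?
00:20 < Komunista> ja by som*nadavka*l
LOG;

// Group 1: nick without its one-character prefix (assumed: space, @, + or %).
// Group 2: the message text up to the end of the line.
preg_match_all('/^\d\d:\d\d\s+<[\s@+%]?([^>]*)>([^\r\n]*)/m', $input, $matches);

// Join all messages into one string.
$text = implode(' ', $matches[2]);

// Remove nicks while punctuation is still intact, so nicks that contain
// special characters can still be found. Only whole-word hits are removed.
$nicks = array_filter(array_unique($matches[1]), fn($n) => $n !== '');
if ($nicks) {
    $alternation = implode('|', array_map(fn($n) => preg_quote($n, '/'), $nicks));
    $text = preg_replace('/(?<!\w)(?:' . $alternation . ')(?!\w)/', '', $text);
}

// Drop everything that is not a word character, whitespace or an asterisk.
$text = preg_replace('/[^\w\s*]+/', '', $text);

// Collapse runs of whitespace left behind by the previous steps.
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text, "\n"; // no fuj lol este trochu no ja by som*nadavka*l
```

The arrow functions need PHP 7.4 or later. Only nicks that actually spoke in the log are removed; nicks that appear solely in join/quit lines would need to be collected separately. Also note that without the u modifier \w matches ASCII word characters only, so for a UTF-8 log with accented letters you would probably want to add u to the cleanup pattern (or use something like \p{L}\p{N}), otherwise those letters get stripped.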

Licensed under: CC-BY-SA with attribution