Finding the maximum occurring string within a text file

https://stackoverflow.com/questions/10259631

02-06-2021

Question

So I've seen questions asked before along the lines of finding the maximum occurrence of a string within a file, but all of those rely on knowing what to look for.

I have what you might almost call a flat-file database that grabs a bunch of input data and wraps different parts of it in HTML span tags with referencing ids.

Each line comes out in this kind of fashion:

<p>
<span class="ip">58.106.**.***</span> 
Wrote <span class='text'>some text</span>
<span class='effect1'> and caused seizures </span>
<span class='time'>23:47</span> 
</p>

How would I then go about finding the #text contents that occur the most times?

i.e. if I had

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span id='text'>woof</span>
    <span class='effect1'> and caused seizures </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and caused mind-splosion </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and used no effect </span>
    <span class='time'>23:47</span> 
    </p>

<p>
    <span class="ip">58.106.**.***</span> 
    Wrote <span class='text'>meow</span>
    <span class='effect1'> and used no effect </span>
    <span class='time'>23:47</span> 
    </p>

the output would be 'meow'.

How would I accomplish this in PHP?

Solution

First off: Your format is not conducive to this type of data manipulation; you might want to consider changing it.

That said, based on this structure the logical solution would be to leverage DOMXPath, as Dani says. This could have been problematic because of all the duplicate ids in there, but in practice it works (after emitting a boatload of warnings, which is one more reason that the data structure warrants revision).

Here's some code to go with the idea:

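// get_input() is a placeholder for however you read the file contents
$input = '<body>'.get_input().'</body>';
$doc = new DOMDocument;
$doc->loadHTML($input); // lots of warnings, duplicate ids!
$xpath = new DOMXPath($doc);
// Select the text nodes inside every element with id="text"
$result = $xpath->query("//*[@id='text']/text()");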

$occurrences = array();
foreach ($result as $item) {
    if (!isset($occurrences[$item->wholeText])) {
        $occurrences[$item->wholeText] = 0;
    }
    $occurrences[$item->wholeText]++;
}

// Sort by count, descending, so the first key is the most common text
arsort($occurrences);
reset($occurrences);

echo "The most common text is '".key($occurrences).
     "', which occurs ".current($occurrences)." times.";

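The warnings mentioned above come from libxml, and PHP lets you collect them instead of printing them. The following is a minimal sketch of that idea, not part of the original answer: the `$html` sample is made up and stands in for whatever `get_input()` returns, while `libxml_use_internal_errors()`, `libxml_get_errors()` and `libxml_clear_errors()` are standard PHP functions.

```php
<?php
// Hypothetical sample with a duplicated id, standing in for get_input().
$html = "<p>Wrote <span id='text'>woof</span></p>"
      . "<p>Wrote <span id='text'>meow</span></p>";

// Buffer libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML('<body>' . $html . '</body>');

// Keep the collected messages for inspection, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The attribute test still matches every element, duplicate ids or not.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//*[@id='text']/text()");

echo count($errors), " parser message(s) collected\n";
echo $nodes->length, " matching text node(s) found\n";
```

How many messages are collected depends on the libxml version, so treat that number as informational only.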

Update (seeing as you fixed the duplicate id issue): you would simply change the XPath query to "//*[@class='text']/text()" so that it continues to match. However, this approach remains inefficient because it re-parses the whole file on every run, so if one or more of these apply:

you are going to do this all the time
you have lots of data
you need it to be really fast

then changing the data format is a good idea.
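For reference, here is a rough sketch of the class-based variant wrapped in a function. The names `most_common_text()` and `$sample` are hypothetical and chosen only for illustration; the sketch assumes PHP 7.3 or later (for `array_key_first()`), trims whitespace around each value, and picks an arbitrary winner when two texts tie.

```php
<?php
/**
 * Return [text, count] for the most frequent class="text" content,
 * or null when no such element is found.
 */
function most_common_text(string $html): ?array
{
    // Silence parser warnings while loading the fragment.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect the text of every element whose class attribute is exactly "text".
    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query("//*[@class='text']/text()") as $node) {
        $texts[] = trim($node->nodeValue);
    }
    if ($texts === []) {
        return null;
    }

    // Tally the values and sort by count, highest first.
    $counts = array_count_values($texts);
    arsort($counts);
    $top = array_key_first($counts);

    // Numeric-looking strings become integer keys, so cast back to string.
    return [(string) $top, $counts[$top]];
}

// Made-up sample data modelled on the question.
$sample = "<p>Wrote <span class='text'>woof</span></p>"
        . "<p>Wrote <span class='text'>meow</span></p>"
        . "<p>Wrote <span class='text'>meow</span></p>"
        . "<p>Wrote <span class='text'>meow</span></p>";

[$text, $count] = most_common_text($sample);
echo "The most common text is '{$text}', which occurs {$count} times.\n";
// prints: The most common text is 'meow', which occurs 3 times.
```

Note that `@class='text'` only matches elements whose class attribute is exactly `text`; an element with several classes would need a different test.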

OTHER TIPS

Have a look at DOMXPath: you can use an XPath query to get all the #text contents and then find the most used one with PHP.
One problem is that you used the same id several times, which is not valid HTML, so the DOM parser might break.

Licensed under: CC-BY-SA with attribution
