Find arbitrary patterns common to a group of strings

https://stackoverflow.com/questions/10151336

31-05-2021

Question

Background:

I am developing a program that iterates over all the movies and TV series episodes stored on my computer, rates them (using Rotten Tomatoes) and sorts them in order of rating.

I extract the movie name by removing all the unnecessary text such as '.avi', '720p' etc. from the file name.

I am using Java.

Problem:

Some folders contain movie files such as:

Episode 301 Rainforest Schmainforest.avi

Episode 302 Spontaneous Combustion.avi

The word 'Episode' and numbers are valid and common in movie titles, so I can't simply remove them. However, it is clear from the repetitive nature of the names that 'Episode' and '3XX' should be removed.

Another folder might be:

720p.S5.E1.cripple fight.avi

720p.S5.E2.towelie.avi

Many arbitrary patterns like these exist in different groups of files, and I need something to recognise these arbitrary patterns so I can extract the keywords. It would be infeasible to write a regex for each case.

Summary:

Is there a tool or API that I can use to find complex repetitive patterns (must be able to match sequences of numbers)? [something like a longest common sequence library]

Solution

Well, you could simply take all the filtered names in your dir, and do a simple word-count. You could give extra weight to words that occur in (roughly) the same spot every time.

In the end you'd have a count and a weight for each word, and you need to decide where to draw the line. A word probably won't occur in every file in the dir (there may be images or samples), but if most files have a certain word, it's not "the" or something like that, and maybe it always appears "at the start" or "in the second spot", you can filter it out.
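One way to read the word-count idea above is the minimal Java sketch below. It is only an illustration under stated assumptions: the class and method names (`TokenStats`, `tokenize`, `positionalFrequency`) are hypothetical, names are assumed to split on spaces, dots and underscores, and "weight" is interpreted as the fraction of files that have the same token at the same position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class TokenStats {

    /** Splits a file name into lower-case tokens on spaces, dots and underscores (an assumed rule). */
    static List<String> tokenize(String fileName) {
        List<String> tokens = new ArrayList<>();
        for (String t : fileName.toLowerCase().split("[ ._]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Maps "position:token" to the fraction of file names that have
     * that token at that 0-based position.
     */
    static Map<String, Double> positionalFrequency(List<String> fileNames) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : fileNames) {
            List<String> tokens = tokenize(name);
            for (int i = 0; i < tokens.size(); i++) {
                counts.merge(i + ":" + tokens.get(i), 1, Integer::sum);
            }
        }
        Map<String, Double> frequency = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequency.put(e.getKey(), e.getValue() / (double) fileNames.size());
        }
        return frequency;
    }
}
```

For the two "Episode 30x" names from the question, the keys `0:episode` and `4:avi` would score 1.0, while `1:301` and `1:302` would each score 0.5. Because the episode numbers differ from file to file, a plain count like this does not flag them; catching them would need an extra step, such as collapsing digits into a placeholder, which this sketch does not attempt.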

But this wouldn't work for, to pick a random example, Friends episodes. They're all called "The One Where...". That would be filtered out by every sane version of the algorithm you're after.

The bottom line is: I don't think you can, because of the Friends-episode problem. There's just not enough distinction between wanted repetition and unwanted repetition.

The only thing you can do is make a blacklist of stuff you want to filter, like you already seem to do with the avi / 720p thing.

OTHER TIPS

I believe that what you are asking for is not trivial. Pattern extraction, as opposed to mere recognition, is well within the fields of artificial intelligence and knowledge discovery. I have encountered several related libraries for Java, but most need a lot of additional code to define even the simplest task.

Since this is a rather hot research area, you might want to perform a cursory search in Google Scholar, using appropriate keywords.

Disclaimer: before you use any library or algorithm found via the Internet, you should investigate its legal status. Unfortunately, quite a few of the algorithms developed in active research areas are encumbered by patents and such...

I have a kind-of answer posted here
http://pastebin.com/Eb0cQyKd

I wanted to remove non-unique parts of file names such as '720dpi', 'Episode', 'xvid', 'ac3' without specifying in advance what they would be. But I wanted to keep information like S01E01. I had created a huge blacklist, but it wasn't convenient because the list kept on changing.

The code linked above uses Python (not Java) to remove all non-unique words in a file name. Basically it creates a list of all the words used in the file names, and puts any word that comes up in most of the files into a dictionary. Then it iterates through the files and deletes all these dictionary words from them.

The script also does some cleaning: some movies use underscores ('_') or periods ('.') to separate words in the filenames. I convert all these to spaces.

I have used it a lot recently and it works well.
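Since the question is about Java, here is a rough Java sketch of the approach described in this answer. It is not a port of the linked script, which is not reproduced here: the class name `CommonWordStripper`, its methods, and the `threshold` parameter (the fraction of files a word must exceed to count as "most") are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class CommonWordStripper {

    /** Converts periods and underscores to spaces, as the answer describes. */
    static String normalize(String fileName) {
        return fileName.replaceAll("[._]+", " ").trim();
    }

    /** Words (lower-cased) that occur in more than threshold * N of the N file names. */
    static Set<String> commonWords(List<String> fileNames, double threshold) {
        Map<String, Integer> filesContaining = new HashMap<>();
        for (String name : fileNames) {
            // A set, so a word repeated inside one name is counted once per file.
            Set<String> seen = new HashSet<>(
                    Arrays.asList(normalize(name).toLowerCase().split("\\s+")));
            for (String word : seen) {
                if (!word.isEmpty()) {
                    filesContaining.merge(word, 1, Integer::sum);
                }
            }
        }
        Set<String> common = new HashSet<>();
        for (Map.Entry<String, Integer> e : filesContaining.entrySet()) {
            if (e.getValue() > threshold * fileNames.size()) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    /** Returns each name with separators normalized and the common words removed. */
    static List<String> strip(List<String> fileNames, double threshold) {
        Set<String> common = commonWords(fileNames, threshold);
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            StringBuilder kept = new StringBuilder();
            for (String word : normalize(name).split("\\s+")) {
                if (word.isEmpty() || common.contains(word.toLowerCase())) {
                    continue;
                }
                if (kept.length() > 0) {
                    kept.append(' ');
                }
                kept.append(word);
            }
            result.add(kept.toString());
        }
        return result;
    }
}
```

Calling `CommonWordStripper.strip(names, 0.5)` on the two "720p.S5..." names from the question would drop `720p`, `S5` and `avi` and keep `E1 cripple fight` and `E2 towelie`. The same logic would also strip "The One" from a folder of Friends episodes, so the limitation raised in the first answer still applies.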

Licensed under: CC-BY-SA with attribution
