Analyzing, categorizing and indexing metadata

https://stackoverflow.com/questions/533036

22-08-2019

Question

I have a large (~2.5M records) database of image metadata. Each record represents an image and has a unique ID, a description field, a comma-separated list of keywords (say 20-30 keywords per image), and some other fields. There's no real database schema, and I have no way of knowing which keywords exist in the database without iterating over every image and counting them. Also, the metadata comes from several different suppliers, who each have their own ideas about how to fill out the different fields.

There are some things I would like to do with this metadata, but since I'm totally new to these kinds of algorithms, I don't even know where to begin looking.

1. Some of these images have certain usage restrictions on them (given in text), but each supplier phrases them differently, and there is no way to guarantee consistency. I'd like to have a simple test I could apply to an image that gives an indication of whether that image is free from restrictions or not. It doesn't have to be perfect, just 'good enough'. I suspect I could use some kind of Bayesian filter for this, right? I could train the filter with a corpus of images that I know are either restricted or restriction-free, and then the filter would be able to make predictions for the rest of the images? Or are there better ways?
2. I would also like to be able to index these images according to 'keyword likeness', so that if I have one image, I could quickly tell which other images it shares the most keywords with. Ideally, the algorithm would also take into account that some keywords are more significant than others and weigh them differently. I don't even know where to start looking here, and would be very glad for any pointers :)

I'm working primarily in Java, but language choice is irrelevant here. I'm more interested in learning what approaches would be best for me to start reading up on. Thanks in advance :)

Solution

(1) Looks like a classification problem with words in your text as features, and "Restricted" and "Not Restricted" as your labels. Bayesian filtering or any classification algorithm should do the trick.
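As a rough illustration of (1), here is a minimal multinomial naive Bayes sketch using only the Python standard library. The class name, the tokenizer, the label strings and the four training texts are all made up for the example; a real training set would be a hand-labelled sample of the suppliers' restriction text, and an existing library implementation would likely serve better than this toy.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples, invented for this sketch.
clf = NaiveBayes()
clf.train("Editorial use only. Not for commercial use.", "restricted")
clf.train("No resale without prior written permission.", "restricted")
clf.train("Royalty free, no restrictions.", "free")
clf.train("Free for any use worldwide.", "free")

print(clf.predict("For editorial use only"))
print(clf.predict("Royalty free worldwide"))
```

With only four training texts the predictions are fragile; the point is just the shape of the train/predict loop.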

(2) Looks like a clustering problem. First you want to come up with a good similarity function that returns a similarity score for two images based on their keywords. Cosine similarity might be a good starting point, since you are comparing keywords. From there you can compute a similarity matrix and just remember a list of 'nearest neighbors' for each image in your dataset, or you can go further and use a clustering algorithm to come up with actual clusters of images.
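A small sketch of the similarity idea in (2) follows. It treats each image as a set of keywords and, as one possible way to make rare keywords count more than common ones, weights each keyword by its inverse document frequency (IDF). The IDF weighting, the function names and the four sample images are assumptions added for illustration, not part of the answer above.

```python
import math
from collections import Counter


def idf_weights(keyword_sets):
    """Weight each keyword by log(N / document frequency): rarer means heavier."""
    n = len(keyword_sets)
    df = Counter(kw for kws in keyword_sets.values() for kw in kws)
    return {kw: math.log(n / count) for kw, count in df.items()}


def cosine(a, b, idf):
    """Cosine similarity of two keyword sets, each keyword weighted by its IDF."""
    dot = sum(idf[kw] ** 2 for kw in a & b)
    norm_a = math.sqrt(sum(idf[kw] ** 2 for kw in a))
    norm_b = math.sqrt(sum(idf[kw] ** 2 for kw in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(image_id, keyword_sets, idf, k=2):
    """Brute-force search: score every other image and keep the top k."""
    target = keyword_sets[image_id]
    scored = [(other, cosine(target, kws, idf))
              for other, kws in keyword_sets.items() if other != image_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical sample data; "photo" appears everywhere, so its weight is zero.
images = {
    "img1": {"beach", "sunset", "photo"},
    "img2": {"beach", "sunset", "palm", "photo"},
    "img3": {"city", "night", "photo"},
    "img4": {"city", "skyline", "photo"},
}
idf = idf_weights(images)
print(nearest_neighbors("img1", images, idf))
```

The brute-force scan is only workable for small sets. At 2.5M records, an inverted index from keyword to image IDs would probably be needed so that only images sharing at least one keyword are scored.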

Since you have so many records, you might want to skip computing the entire similarity matrix, and just compute clusters for a small, random sample of your dataset. You can then add the other data points to the appropriate clusters. If you want to preserve more similarity information you can look into soft clustering.
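The sample-then-assign idea can be sketched as below. To keep the example short it swaps in a greedy "leader" clustering and plain Jaccard similarity, which is not necessarily the algorithm or similarity function the answer has in mind; the threshold, the function names and the sample data are likewise assumptions.

```python
import random


def jaccard(a, b):
    """Fraction of keywords two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sample(sample_ids, keyword_sets, threshold=0.3):
    """Greedy leader clustering: an image that is similar enough to an
    existing leader is absorbed, otherwise it becomes a new leader."""
    leaders = []
    for image_id in sample_ids:
        kws = keyword_sets[image_id]
        if not any(jaccard(kws, keyword_sets[leader]) >= threshold
                   for leader in leaders):
            leaders.append(image_id)
    return leaders


def assign(image_id, leaders, keyword_sets):
    """Return the leader (cluster) most similar to the given image."""
    kws = keyword_sets[image_id]
    return max(leaders, key=lambda leader: jaccard(kws, keyword_sets[leader]))


# Hypothetical sample data.
images = {
    "img1": {"beach", "sunset", "photo"},
    "img2": {"beach", "sunset", "palm", "photo"},
    "img3": {"city", "night", "photo"},
    "img4": {"city", "skyline", "photo"},
}

# Step 1: cluster only a small random sample of the dataset.
rng = random.Random(0)
sample = rng.sample(sorted(images), 3)
leaders = cluster_sample(sample, images)

# Step 2: attach every image to its closest cluster without
# ever computing the full pairwise similarity matrix.
clusters = {image_id: assign(image_id, leaders, images) for image_id in images}
print(clusters)
```

Each image is compared only against the handful of leaders, so the cost grows with the number of clusters rather than with the square of the number of images.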

Hopefully that will get you started.

OTHER TIPS

You should definitely start by turning your 'list of keywords' field into a real tagging scheme. The easiest one is a table of tags with a many-to-many relationship to the image table (that is, a third table where each record has a foreign key to an image and another foreign key to a keyword). That makes it really fast to find all images with a certain set of keywords.
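The three-table layout described above might look like this with Python's built-in `sqlite3` module. The table and column names, the helper functions and the sample rows are all invented for the sketch; any relational database would work the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE tag   (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    -- The link table: one row per (image, tag) pair.
    CREATE TABLE image_tag (
        image_id INTEGER NOT NULL REFERENCES image(id),
        tag_id   INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (image_id, tag_id)
    );
    CREATE INDEX idx_image_tag_tag ON image_tag(tag_id);
""")


def add_image(image_id, description, keyword_field):
    """Split the comma-separated keyword field and store normalized tags."""
    conn.execute("INSERT INTO image VALUES (?, ?)", (image_id, description))
    for raw in keyword_field.split(","):
        name = raw.strip().lower()
        if not name:
            continue
        conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tag WHERE name = ?", (name,)).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO image_tag VALUES (?, ?)", (image_id, tag_id))


def images_with_all(tags):
    """Return the IDs of images that carry every one of the given tags."""
    tags = sorted({t.strip().lower() for t in tags})
    placeholders = ",".join("?" * len(tags))
    rows = conn.execute(
        f"""SELECT it.image_id
            FROM image_tag it JOIN tag t ON t.id = it.tag_id
            WHERE t.name IN ({placeholders})
            GROUP BY it.image_id
            HAVING COUNT(*) = ?
            ORDER BY it.image_id""",
        (*tags, len(tags)),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical records.
add_image(1, "Sunset on a beach", "Beach, sunset, Palm")
add_image(2, "City at night", "city, night, sunset")
add_image(3, "Beach sunset", "beach,sunset")
print(images_with_all(["beach", "sunset"]))
```

Once the tags live in their own table, a simple `GROUP BY` on `image_tag` also answers the question of which keywords exist and how often, without iterating over every image.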

The Bayesian filter to detect restriction phrasing is interesting. I'd say go for it, unless you're pressed for time. If that's the case, a few simple pattern-matching rules should pick up 90-95% of cases or more, and the rest could be quickly finished by hand by a couple of operators.
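A pattern-matching pass of that kind could be as small as the following sketch. The phrases are guesses at what suppliers might write, not taken from any real data; the actual list would have to come from reading a sample of the restriction fields.

```python
import re

# Hypothetical phrasings that signal a usage restriction.
RESTRICTION_PATTERNS = [
    r"\beditorial use only\b",
    r"\bnot for (commercial|advertising) use\b",
    r"\bno (resale|merchandising)\b",
    r"\b(prior|written) (permission|approval)\b",
    r"\brestrict(ed|ions?)\b",
]
RESTRICTED = re.compile("|".join(RESTRICTION_PATTERNS), re.IGNORECASE)

# Phrases that mention restrictions only to say there are none.
NEGATIONS = re.compile(
    r"\b(no|without|free of|free from) (usage )?restrictions?\b", re.IGNORECASE)


def looks_restricted(text):
    """Return True if the text matches any restriction pattern."""
    text = NEGATIONS.sub(" ", text)  # drop "no restrictions" style phrases first
    return RESTRICTED.search(text) is not None


for sample in ["Editorial use only", "No restrictions", "Royalty free"]:
    print(sample, "->", looks_restricted(sample))
```

Records that match no pattern and also contain no clear "unrestricted" wording are the natural candidates for the manual review mentioned above.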

Licensed under: CC-BY-SA with attribution
