How to delete almost duplicate files

https://stackoverflow.com/questions/17913251

04-06-2022

Question

Edit 2:

Solved, see my answer waaaaaaay below.

Edit:

After banging my head a few times, I almost did it. Here's my (not cleaned up, you can tell I was troubleshooting a bunch of stuff) code:

http://pastebin.com/ve4Qkj2K

And here's the problem: It works sometimes and other times not so much. For example, it will work perfectly with some files, then leave one of the longest codes instead of the shortest one, and for others it will delete maybe 2 out of 5 duplicates, leaving 3 behind. If it just performed reliably, I might be able to fix it, but I don't understand the seemingly random behavior. Any ideas?

Original Post:

Just so you know, I'm just beginning with Python, and I'm using Python 3.3.

So here's my problem:

Let's say I have a folder with about 5,000 files in it. Some of these files have very similar names, but different contents and possibly different extensions. After a readable name, there is a code, always with a "(" or a "[" (no quotes) before it. The name and code are separated by a space. For example:

    something (TZA).blah
    something [TZZ].another
    hello (YTYRRFEW).extension
    something (YJTR).another_ext

I'm trying to keep only one of the "something" files and delete the others. Another fact that may be important is that there is usually more than one code, such as "something (THTG) (FTGRR) [GTGEES!#!].yet_another_random_extension", all separated by spaces. Although it doesn't matter 100%, it would be best to keep the one with the fewest codes.

I made some (very very short) code to get a list of all files:

    import glob
    # glob.glob already returns a list, so no empty list is needed first
    files = glob.glob("*")

but after this I'm pretty much lost. Any help would be appreciated, even if it's just pointing me in the right direction!

Solution 2

I got it! The version I ended up with works (99%) perfectly. Although it needs to make multiple passes, reading, analyzing, and deleting over 2 thousand files took about 2 seconds on my pitiful, slow notebook. My final version is here:

http://pastebin.com/i7SE1mh6

The only tiny bug is that if the final item in the list has a duplicate, it will leave it there (and no more than 2). That's very simple to manually correct so I didn't bother to fix it (ain't nobody got time fo that and all).
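For anyone who runs into the same last-item problem: one way to sidestep it is to group the files by their readable name in a dictionary instead of comparing neighbours in a list, so every file, including the final one, is handled in a single pass. The following is only a minimal sketch and not the pastebin code. The helper names `split_name` and `pick_duplicates` are made up for illustration, and it assumes that the readable name itself never contains " (" or " [" and that the extension is whatever follows the last dot.

```python
import os
import re
from collections import defaultdict

# Assumption: every code starts with " (" or " [" (space, then bracket).
CODE_START = re.compile(r" [(\[]")

def split_name(filename):
    """Return (readable_name, number_of_codes) for one file name."""
    stem, _ext = os.path.splitext(filename)
    match = CODE_START.search(stem)
    if match is None:
        return stem, 0  # no code at all
    base = stem[:match.start()]
    return base, len(CODE_START.findall(stem))

def pick_duplicates(filenames):
    """Return the names to delete, keeping the file with the fewest codes per readable name."""
    groups = defaultdict(list)
    for name in filenames:
        base, n_codes = split_name(name)
        groups[base].append((n_codes, name))
    to_delete = []
    for entries in groups.values():
        entries.sort()  # fewest codes first; ties are broken by file name
        to_delete.extend(name for _n, name in entries[1:])
    return to_delete

if __name__ == "__main__":
    files = [f for f in os.listdir(".") if os.path.isfile(f)]
    for name in pick_duplicates(files):
        # Dry run: replace print with os.remove(name) once the list looks right.
        print("would delete:", name)
```

Printing the candidates first, rather than deleting straight away, makes it easy to check the result on a copy of the folder before anything is removed.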

Hope sometime in the future this could actually help somebody other than me.

I didn't get too many answers here, but it WAS a pretty unusual problem, so thanks anyway. See ya.

Other tips

I would suggest building a separate array of the bare file names (the part before the codes) and, for each one, checking whether it also appears anywhere else in the array, that is, at any index other than the one currently being checked in the loop. The

    if str_fragment in name

condition checks whether the string str_fragment occurs anywhere inside the string name, which can be useful here as well.
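A rough sketch of that idea, under the assumption that the bare name is everything before the first " (" or " [". The helpers `bare_name` and `has_twin` are hypothetical names chosen for this example, and a file without any code would keep its extension in the bare name here.

```python
import re

def bare_name(filename):
    """Readable part of a file name: everything before the first ' (' or ' ['."""
    return re.split(r" [(\[]", filename, maxsplit=1)[0]

def has_twin(index, bare_names):
    """True if the bare name at `index` also appears at any other index."""
    others = bare_names[:index] + bare_names[index + 1:]
    return bare_names[index] in others  # list membership: exact match only

files = [
    "something (TZA).blah",
    "something [TZZ].another",
    "hello (YTYRRFEW).extension",
]
bare = [bare_name(f) for f in files]
for i, f in enumerate(files):
    print(f, "-> has a twin" if has_twin(i, bare) else "-> unique")

# Caution: `in` between two strings is a substring test, so it can over-match.
print("some" in "something")  # True, although these are different names
```

Note the difference between the two uses of `in`: against a list it compares whole names, while against a string it matches fragments, so the substring form is best kept for cases where partial matches are actually wanted.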

License: CC-BY-SA with attribution

Not affiliated with StackOverflow