What's the best way to deduplicate with the info I have?

https://stackoverflow.com/questions/13074762

14-07-2021
|

Domanda

I need to find and remove duplicate files (.pst) and eventually get the unique emails. Currently, I am using Powershell to recursively go through folders to find only .pst files and then export specific metadata into a .csv file. It has been suggested to me to import the .csv into SQL to do comparisons (name, dates on the files, etc...). After that, I'm stuck.

What language or program would be best suited to get the files I need and delete the rest of them? I'm pretty much working in VB.Net (could attempt C#) and powershell.

Soluzione

I'll assume that you did import the .csv into an SQL database. Let's say the table name is psts.

First, to find out how many records have the same email address,

SELECT email, count(*)
FROM psts
GROUP BY email

Next, you don't want to see the ones which have one value, so,

SELECT email, count(*)
FROM psts
GROUP BY email
HAVING count(*) >= 2

to get a list of those records,

SELECT out.email, count(*), out.<other fields you want to see>
FROM psts out
  INNER JOIN psts pst_count ON out.email = pst_count.email
GROUP BY out.email
HAVING count(*) >= 2

Altri suggerimenti

You can use whatever you want really, as long as it has a way to edit the filesystem and can execute SQL queries.

I don't know if VB.net has SQL server support, but since it's a .net language I'm pretty sure it does.

I'd say go with whatever language you know that supports those two features :)

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow