Question

I am new to PowerShell and have a little Linux bash scripting experience. I have been looking for a way to list the files on a server that contain Social Security numbers. I found the commands below in my research, and they did exactly what I wanted when I tested them on my home computer, except that they returned no results from my Word and Excel test documents. Is there a way to use a PowerShell command to get results from the various Office documents as well? The server holds almost entirely Word and Excel files, plus a few PowerPoint files.

PS C:\Users\Stephen> Get-ChildItem -Path C:\Users -Recurse -Exclude *.exe, *.dll | `
Select-String "\d{3}[- ]\d{2}[- ]\d{4}"

Documents\SSN:1:222-33-2345
Documents\SSN:2:111-22-1234
Documents\SSN:3:111 11 1234

PS C:\Users\Stephen> Get-ChildItem -Recurse | ?{ findstr.exe /mprc:. $_.FullName } | `
Select-String "[0-9]{3}[- ][0-9]{2}[- ][0-9]{4}"

Documents\SSN:1:222-33-2345
Documents\SSN:2:111-22-1234
Documents\SSN:3:111 11 1234

Solution

Is there a way to use a PowerShell command to get results from the various Office documents as well? This server is almost all Word and Excel files with a few PowerPoints.

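Select-String only searches files as plain text, and Office documents are binary or zipped formats, which is why your test documents returned nothing. When working with MS Office files from PowerShell, the usual route is to drive the Office applications through their COM interfaces and pull out the information you need (this requires Office to be installed on the machine running the script).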

If you are new to PowerShell, COM will involve something of a learning curve, as very little beginner-level documentation exists online.

Therefore I strongly advise starting small:

  • First, focus on opening a single Word document and reading its contents into a string.
  • Once that works, focus on extracting the relevant information (PowerShell's -match operator is very helpful here).
  • Once you can handle a single Word document, find all *.docx files in a folder and repeat the process on each one: foreach ($file in (ls *.docx)) { # work on $file }
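The three steps above could be sketched roughly as follows. This is only a minimal sketch, assuming Microsoft Word is installed where the script runs. The function names (Get-SsnMatch, Get-WordText, Find-SsnInWordFiles) are made up for illustration, and the pattern is the one from the question with the "|" dropped from the character classes and word boundaries (\b) added, which is my own tweak.

```powershell
# Pattern from the question; the '|' is removed and \b is added (an assumption,
# so that longer digit runs are not reported as SSNs).
$ssnPattern = '\b\d{3}[- ]\d{2}[- ]\d{4}\b'

function Get-SsnMatch {
    # Returns every substring of $Text that looks like an SSN.
    param([string]$Text)
    [regex]::Matches($Text, $ssnPattern) | ForEach-Object { $_.Value }
}

function Get-WordText {
    # Opens one document read-only through Word's COM interface
    # and returns its body text as a single string.
    param($Word, [string]$Path)
    $doc = $Word.Documents.Open($Path, $false, $true)  # FileName, ConfirmConversions, ReadOnly
    try { $doc.Content.Text }
    finally { $doc.Close(0) }                          # 0 = wdDoNotSaveChanges
}

function Find-SsnInWordFiles {
    # Repeats the single-document steps for every .docx below $Folder.
    param([string]$Folder)
    $word = New-Object -ComObject Word.Application
    $word.Visible = $false
    try {
        foreach ($file in Get-ChildItem -Path $Folder -Recurse -Filter *.docx) {
            foreach ($hit in Get-SsnMatch (Get-WordText $word $file.FullName)) {
                [pscustomobject]@{ File = $file.FullName; Match = $hit }
            }
        }
    }
    finally {
        $word.Quit()   # always shut the hidden Word instance down
    }
}
```

Something like Find-SsnInWordFiles C:\Users would then list one row per hit. Excel and PowerPoint each have their own COM object model, so they would need their own text-extraction function in place of Get-WordText.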

Here's some reading (admittedly, all this is for Excel as I build automated Excel charting tools, but the lessons will be very helpful for automating any Office application)

Other suggestions

If you only need to cover .docx and .xlsx files, you could also simply unzip them (both formats are ZIP archives of XML files) and search the extracted XML directly. The search then has to ignore XML tags, that is, allow one or more XML elements between any two digits.
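A rough sketch of the unzip idea, under a few assumptions: it uses the .NET System.IO.Compression.ZipFile class rather than extracting to disk, it strips the tags before matching instead of building a tag-tolerant regex (a simpler substitute for what is described above), and it only reads the parts where text normally lives (word/document.xml, xl/sharedStrings.xml, and the worksheet XML). The function names are hypothetical.

```powershell
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Same pattern as in the question, minus the '|'; the \b anchors are an assumption.
$ssnPattern = '\b\d{3}[- ]\d{2}[- ]\d{4}\b'

function Get-OoxmlText {
    # Returns the text of the main XML parts of a .docx/.xlsx with all tags stripped.
    param([string]$Path)
    $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
    try {
        $parts = $zip.Entries | Where-Object {
            $_.FullName -match '^(word/document\.xml|xl/sharedStrings\.xml|xl/worksheets/.+\.xml)$'
        }
        foreach ($entry in $parts) {
            $reader = New-Object System.IO.StreamReader($entry.Open())
            try { $xml = $reader.ReadToEnd() } finally { $reader.Dispose() }
            # Keep paragraph and cell boundaries as newlines, then drop every other tag,
            # so a number split across several runs is joined back together.
            $xml = $xml -replace '</w:p>|</si>|</c>', "`n"
            $xml -replace '<[^>]+>', ''
        }
    }
    finally { $zip.Dispose() }
}

function Find-SsnInOoxml {
    # Scans every .docx and .xlsx below $Folder and emits one object per match.
    param([string]$Folder)
    Get-ChildItem -Path $Folder -Recurse -Include *.docx, *.xlsx -File | ForEach-Object {
        $file = $_
        $text = (Get-OoxmlText $file.FullName) -join "`n"
        [regex]::Matches($text, $ssnPattern) | ForEach-Object {
            [pscustomobject]@{ File = $file.FullName; Match = $_.Value }
        }
    }
}
```

This needs no Office installation, but it will not see the older binary .doc and .xls formats, and headers, footers, and comments live in other XML parts that this sketch skips.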

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow