There are a lot of things to say, but I'll try to stay concise.
PEP8: Style Guide for Python Code
You should use lowercase with underscores for local variables. Take a look at PEP8: Style Guide for Python Code.
File objects and the with statement
Use the with statement to open a file: it closes the file automatically when the block ends, even if an error occurs. See File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
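As a minimal sketch of the with statement (the temporary file and its contents are made up so the example is self-contained), the file is closed as soon as the block ends:

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on any real path.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as sample_file:
    sample_file.write("first\nsecond\n")

# The with statement closes the file when the block exits,
# even if an exception is raised inside it.
with open(sample_path) as sample_file:
    lines = [line.strip() for line in sample_file]

# The variable still exists after the block, but the file is closed.
file_is_closed = sample_file.closed
os.remove(sample_path)
```

No explicit close() call is needed anywhere.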
Escape Windows filenames
Backslashes in Windows filenames can cause problems in Python programs. You must escape each backslash by doubling it, or use a raw string.
For example, if your Windows filename is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt", or use a raw string: r"dir\notUsed.txt". If you don't, the "\n" will be interpreted as a newline!
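Note: if you need to support Unicode filenames on Python 2, you can use Unicode raw strings: ur"dir\notUsed.txt". On Python 3, the ur prefix is a syntax error and is not needed, because every str is already Unicode: r"dir\notUsed.txt" is enough.
See also question 19065115 on Stack Overflow.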
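The difference between the three spellings can be checked directly. This is a small sketch using the same example filename as above:

```python
escaped = "dir\\notUsed.txt"   # doubled backslash
raw = r"dir\notUsed.txt"       # raw string: the backslash is kept as-is
broken = "dir\notUsed.txt"     # "\n" is turned into a newline character

same_path = (escaped == raw)       # both spell the same filename
has_newline = "\n" in broken       # the unescaped version is corrupted
```

The escaped and raw forms are the same string; only the unescaped one contains a newline and no backslash at all.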
Store the filenames in a set: it is an optimized collection without duplicates.
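# Raw string, so the backslash is kept as-is (on Python 2, use ur"..." for a Unicode path).
not_used_path = r"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    # One entry per line, without the trailing newline.
    not_used_set = set(line.strip() for line in not_used_file)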
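To illustrate what the set buys you, here is a small sketch. The file is simulated with io.StringIO and the names are invented. Skipping blank lines is an extra assumption, not something the code above does:

```python
import io

# Stand-in for the text file; the names are invented for the example.
not_used_file = io.StringIO(u"icon_old\nbanner-2012\nicon_old\n\n")

# Strip each line and skip blank ones; the set drops the duplicate entry.
not_used_set = {line.strip() for line in not_used_file if line.strip()}

# Membership tests on a set do not scan the whole collection.
is_listed = "icon_old" in not_used_set
```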
Compile your regex
It is more efficient to compile a regex once when it is used many times. Again, you should use raw strings to avoid backslash interpretation.
import re
pattern = re.compile(r"\w/([\w\-]+)")
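As a quick sketch of how the compiled pattern behaves, it is built once here and reused on two sample strings (the strings are invented):

```python
import re

# Compiled once, then reused for every string.
pattern = re.compile(r"\w/([\w\-]+)")

samples = ["a/foo-bar b/baz_1", "no slash here"]

# findall() returns the captured group, i.e. the name after the slash.
found = [pattern.findall(text) for text in samples]
```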
Warning: the os.listdir() function returns a list of filenames, not a list of full paths. See this function in the Python documentation.
In your example, you list a desktop directory, 'C:\Users\xxx\Desktop\dir', with os.listdir(), and then try to open each XML file in this directory with open(files, "r+"). This fails unless your current working directory happens to be that desktop directory. The classic approach is to use the os.path.join() function, like this:
import os
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    # Build the full path; the bare filename is relative to the current directory.
    desktop_path = os.path.join(desktop_dir, filename)
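The same point can be tried safely in a temporary directory. In this sketch the directory and the file name are created only for the demonstration:

```python
import os
import shutil
import tempfile

# Temporary directory with one file in it (names are invented).
base_dir = tempfile.mkdtemp()
with open(os.path.join(base_dir, "layout.xml"), "w") as f:
    f.write("<root/>")

# os.listdir() returns bare names, not paths.
names = os.listdir(base_dir)

# Joining with the directory gives paths that can be opened from anywhere.
full_paths = [os.path.join(base_dir, name) for name in names]
all_exist = all(os.path.isfile(path) for path in full_paths)

shutil.rmtree(base_dir)
```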
If you want to extract the filename's extension, you can use the os.path.splitext() function.
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    # Skip anything that is not an XML file (case-insensitive check).
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)
You can simplify this with a list comprehension:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']
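The filter can also be wrapped in a small helper and tried on a temporary directory. In this sketch, list_xml_files is a hypothetical name and the sample files are invented:

```python
import os
import shutil
import tempfile


def list_xml_files(directory):
    """Return the sorted full paths of the .xml files directly inside directory."""
    return sorted(
        os.path.join(directory, filename)
        for filename in os.listdir(directory)
        if os.path.splitext(filename)[1].lower() == ".xml"
    )


# Temporary directory with a mix of files (names are invented).
demo_dir = tempfile.mkdtemp()
for name in ("a.xml", "B.XML", "notes.txt", "xml"):
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("")

xml_names = [os.path.basename(path) for path in list_xml_files(demo_dir)]
shutil.rmtree(demo_dir)
```

The file named "xml" is skipped because os.path.splitext() gives it an empty extension, and "B.XML" is kept thanks to the lower() call.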
Parse an XML file
How do you parse an XML file? This is a great question! There are several possibilities:
- use a regex: efficient but dangerous;
- use a SAX parser: efficient too, but confusing and difficult to maintain;
- use a DOM parser: less efficient but clearer.
Consider using the lxml package (see: http://lxml.de/).
The regex approach is dangerous because, the way you read the file, you ignore the XML encoding. And that is bad! Very bad indeed! XML files are usually encoded in UTF-8, so you should first decode the UTF-8 byte stream. A simple way to do that is to use codecs.open() to open an encoded file.
import codecs
for xml_path in xml_list:
    # Decode the file as UTF-8 while reading it.
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
With this solution, the full XML content is stored in the content variable as a Unicode string. You can then use a Unicode regex to parse the content.
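Here is a self-contained sketch of decoding on read and then applying the regex. It uses io.open(), which accepts the same encoding argument as codecs.open() and exists on both Python 2 and 3; that substitution, the sample XML and the names in it are assumptions made for the example:

```python
import io
import os
import re
import tempfile

pattern = re.compile(r"\w/([\w\-]+)")

# Write a small UTF-8 file containing a non-ASCII character (invented content).
fd, xml_path = tempfile.mkstemp(suffix=".xml")
os.close(fd)
with io.open(xml_path, "w", encoding="UTF-8") as xml_file:
    xml_file.write(u'<root title="caf\u00e9"><item src="a/logo"/></root>')

# Reading with an explicit encoding yields decoded text, not raw bytes.
with io.open(xml_path, "r", encoding="UTF-8") as xml_file:
    content = xml_file.read()
os.remove(xml_path)

# The regex now runs on Unicode text.
names = pattern.findall(content)
```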
Finally, you can use a set intersection to find whether a given XML file contains names in common with the text file.
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    # Names that appear both in the text file and in this XML file.
    print(not_used_set & actual_set)
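For comparison, the DOM-style option mentioned earlier could look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree rather than lxml, and the XML layout, the attribute values and the names in not_used_set are all invented, so treat it as an illustration under those assumptions rather than a drop-in replacement:

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: the XML layout and the names are invented for the example.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<root><item src="a/logo"/><item src="b/old-banner"/></root>'
)
not_used_set = {"old-banner", "unused-icon"}

# The parser honors the declared encoding itself, so it is fed raw bytes.
root = ET.fromstring(xml_bytes)

actual_set = set()
for element in root.iter():
    for value in element.attrib.values():
        # Keep the part after the first "/" (roughly what the regex group captures).
        if "/" in value:
            actual_set.add(value.split("/", 1)[1])

# Names present both in the text file and in this XML document.
common = not_used_set & actual_set
```

Because the parser works on elements and attributes, text inside comments or element bodies is not picked up by accident, which is one reason a parser is usually safer than a regex here.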