Match two files against each other and write output as file - Python

Question

There are a lot of things to say, but I'll try to stay concise.

PEP8: Style Guide for Python Code

You should use lower case with underscores for local variables. Take a look at PEP8: Style Guide for Python Code.

File objects and `with` statement

Use the with statement to open a file, so that the file is closed automatically when the block exits. See File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
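As a minimal sketch of that pattern (the file name example.txt and its contents are made up for this illustration), the file object is closed as soon as the with block ends:

```python
import os
import tempfile

# Hypothetical file, created in a throwaway directory for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as out_file:
    out_file.write("first\nsecond\n")

# Reading the lines back; the file is closed when the block exits,
# even if an exception is raised inside it.
with open(path) as in_file:
    lines = [line.strip() for line in in_file]

print(lines)
print(in_file.closed)
```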

Escape Windows filenames

Backslashes in Windows filenames can cause problems in Python programs. You must escape the string using double backslashes or use raw strings.

For example: if your Windows filename is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt" or use a raw string r"dir\notUsed.txt". If you don't do that, the "\n" will be interpreted as a newline!
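A short sketch of the difference, using the same example filename; the variable names are only for illustration:

```python
escaped = "dir\\notUsed.txt"   # backslash escaped by doubling it
raw = r"dir\notUsed.txt"       # raw string: the backslash is kept as-is
broken = "dir\notUsed.txt"     # "\n" is turned into a newline character

print(escaped == raw)          # both spell the same path
print("\n" in broken)          # the broken version contains a newline
print(len(raw), len(broken))   # the broken version is one character shorter
```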

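Note: on Python 2, if you need to support Unicode filenames, you can use Unicode raw strings: ur"dir\notUsed.txt". The ur prefix is a syntax error on Python 3, where a plain raw string such as r"dir\notUsed.txt" is already Unicode.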

See also question 19065115 on Stack Overflow.

Store the filenames in a set: it is a collection with fast membership tests and no duplicates

not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set(line.strip() for line in not_used_file)

Compile your regex

It is more efficient to compile a regex that is used many times. Again, you should use raw strings to avoid backslash interpretation.

import re
pattern = re.compile(r"\w/([\w\-]+)")
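To see what this pattern captures, here is a small sketch run against a made-up XML fragment. The element name, attribute name and paths are assumptions, since the real XML files are not shown here:

```python
import re

pattern = re.compile(r"\w/([\w\-]+)")

# Hypothetical XML fragment, invented for this illustration.
sample = '<item src="images/logo-small.png"/><item src="icons/arrow_up.png"/>'

# With a single group, findall() returns the captured text of each match:
# the run of word characters and hyphens that follows a "/".
names = pattern.findall(sample)
print(names)
```

The compiled pattern object can then be reused for every file without being recompiled.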

Warning: os.listdir() returns a list of filenames, not a list of full paths. See os.listdir() in the Python documentation.

In your example, you list a desktop directory 'C:\Users\xxx\Desktop\dir' with os.listdir(), and then you try to open each XML file in that directory with open(files, "r+"). This only works if your current working directory happens to be that directory. The usual approach is to build the full path with os.path.join(), like this:

import os

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)

If you want to extract the filename's extension, you can use the os.path.splitext() function.

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)

You can simplify this with a list comprehension:

desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']
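The same filter can be wrapped in a small helper and tried on a throwaway directory. This is only a sketch: the function name list_xml_files and the sample filenames are hypothetical.

```python
import os
import tempfile


def list_xml_files(directory):
    """Return the sorted full paths of the .xml files directly in directory."""
    return sorted(os.path.join(directory, filename)
                  for filename in os.listdir(directory)
                  if os.path.splitext(filename)[1].lower() == ".xml")


# Throwaway directory with made-up filenames, standing in for the desktop directory.
demo_dir = tempfile.mkdtemp()
for name in ("a.xml", "B.XML", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

found = [os.path.basename(path) for path in list_xml_files(demo_dir)]
print(found)
```

Because the extension is lowercased before the comparison, B.XML is kept and notes.txt is skipped.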

Parse an XML file

How do you parse an XML file? This is a great question! There are several possibilities:
- use a regex: efficient but dangerous;
- use a SAX parser: efficient too, but confusing and difficult to maintain;
- use a DOM parser: less efficient but clearer. Consider using the lxml package (see http://lxml.de/).

The regex approach is dangerous because, the way you read the file, the XML encoding is ignored, and that is a real problem. XML files are usually encoded in UTF-8, so you should first decode the UTF-8 byte stream. A simple way to do that is to use codecs.open() to open an encoded file.

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()

With this solution, the full XML content is stored in the content variable as a Unicode string. You can then use a Unicode regex to parse the content.
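As a rough sketch of what "Unicode regex" means in practice: on Python 2, \w only matches non-ASCII letters when the re.UNICODE flag is set, while on Python 3 this is already the default for text patterns. The sample file and its content below are made up for the illustration.

```python
import codecs
import os
import re
import tempfile

# Hypothetical XML content with a non-ASCII name, written as UTF-8 bytes.
xml_bytes = u'<file path="docs/caf\u00e9-menu"/>'.encode("utf-8")
xml_path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(xml_path, "wb") as raw_file:
    raw_file.write(xml_bytes)

# Decode the byte stream while reading, as described above.
with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
    content = xml_file.read()

# re.UNICODE lets \w match accented letters on Python 2 as well.
unicode_pattern = re.compile(r"\w/([\w\-]+)", re.UNICODE)
matches = unicode_pattern.findall(content)
print(matches == [u"caf\u00e9-menu"])
```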

Finally, you can use a set intersection to find out whether a given XML file contains names in common with the text file.

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
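For comparison, here is a sketch of the DOM-style alternative mentioned earlier. It uses the standard library's xml.etree.ElementTree rather than lxml, so that nothing needs to be installed. The document structure (a files root with file elements carrying a path attribute) and the names in not_used_set are assumptions invented for this example, so adapt them to the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; element and attribute names are invented.
xml_bytes = (b'<?xml version="1.0" encoding="UTF-8"?>'
             b'<files>'
             b'<file path="images/logo-small"/>'
             b'<file path="icons/arrow_up"/>'
             b'</files>')

# Hypothetical content of the text file, already loaded into a set.
not_used_set = {"logo-small", "unused-banner"}

# Given bytes, the parser reads the encoding declaration itself.
root = ET.fromstring(xml_bytes)

# Keep the last path component of every path attribute.
actual_set = set(elem.get("path").rsplit("/", 1)[-1]
                 for elem in root.iter("file"))

common = not_used_set & actual_set
print(common)
```

The parser takes care of decoding and of the XML syntax, so only the extraction of the names depends on the layout of the files.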