Question

I'm new to Python. My second time coding in it. The main point of this script is to take a text file that contains thousands of lines of file names (sNotUsed file) and match it against about 50 XML files. The XML files may contain up to thousands of lines each and are formatted as most XML's are. I'm not sure what the problem with the code so far is. The code is not fully complete as I have not added the part where it writes the output back to an XML file, but the current last line should be printing at least once. It is not, though.

Examples of the two file formats are as follows:

TEXT FILE:

fileNameWithoutExtension1
fileNameWithoutExtension2
fileNameWithoutExtension3
etc.

XML FILE:

<blocks> 

<more stuff="name"> 
     <Tag2> 
        <Tag3 name="Tag3">
                 <!--COMMENT-->
                 <fileType>../../dir/fileNameWithoutExtension1</fileType>
                 <fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>

MY CODE SO FAR:

import os
import re

sNotUsed=list()
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open snotused txt file
for lines in sFile:
    sNotUsed.append(lines)
#sNotUsed = sFile.readlines() # read all lines and assign to list
sFile.close() # close file

xmlFiles=list() # list of xmlFiles in directory
usedS=list() # list of S files that do not match against sFile txt

search = "\w/([\w\-]+)"

# getting the list of xmlFiles
filelist=os.listdir('C:\Users\xxx\Desktop\dir')
for files in filelist:
    if files.endswith('.xml'):
        xmlFile = open(files, "r+") # open first file with read + write access
        xmlComp = xmlFile.readlines() # read lines and assign to list
        for lines in xmlComp: # iterate by line in list of lines
            temp = re.findall(search, lines)
            #print temp
            if temp:
                if temp[0] in sNotUsed:
                    print "yes" # debugging. I know there is at least one match for sure, but this is not being printed.

TO HELP CLEAR THINGS UP: Sorry, I guess my question wasn't very clear. I would like the script to go through each XML line by line and see if the FILENAME part of that line matches with the exact line of the sNotUsed.txt file. If there is match then I want to delete it from the XML. If the line doesn't match any of the lines in the sNotUsed.txt then I would like it be part of the output of the new modified XML file (which will overwrite the old one). Please let me know if still not clear.

EDITED, WORKING CODE

import os
import re
import codecs

sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open sNotUsed txt file
sNotUsed=sFile.readlines() # read all lines and assign to list
sFile.close() # close file

search = re.compile(r"\w/([\w\-]+)")

sNotUsed=[x.strip().replace(',','') for x in sNotUsed]
directory=r'C:\Users\xxx\Desktop\dir'
filelist=os.listdir(directory) # getting the list of xmlFiles
# for each file in the list
for files in filelist:
    if files.endswith('.xml'): # make sure it is an XML file
        xmlFile = codecs.open(os.path.join(directory, files), "r", encoding="UTF-8") # open first file with read
        xmlComp = xmlFile.readlines() # read lines and assign to list
        print xmlComp
        xmlFile.close() # closing the file since the lines have already been read and assigned to a variable
        xmlEdit = codecs.open(os.path.join(directory, files), "w", encoding="UTF-8") # opening the same file again and overwriting all existing lines
        for lines in xmlComp: # iterate by line in list of lines
            #headerInd = re.search(search, lines) # used to get the headers, comments, and ending blocks
            temp = re.findall(search, lines) # finds all strings that match the regular expression compiled above and makes a list for each
            if temp: # if the list is not empty
                if temp[0] not in sNotUsed: # if the first (and only) value in each list is not in the sNotUsed list
                    xmlEdit.write(lines) # write it in the file
            else: # if the list is empty
                xmlEdit.write(lines) # write it (used to preserve the beginning and ending blocks of the XML, as well as comments)
Was it helpful?

Solution

There is a lot of things to say but I'll try to stay concise.

PEP8: Style Guide for Python Code

You should use lower case with underscores for local variables. take a look at the PEP8: Style Guide for Python Code.

File objects and with statement

Use the with statement to open a file, see: File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects

Escape Windows filenames

Backslashes in Windows filenames can cause problems in Python programs. You must escape the string using double backslashes or use raw strings.

For example: if your Windows filename is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt" or use a raw string r"dir\notUsed.txt". If you don't do that, the "\n" will be interpreted as a newline!

Note: if you need to support Unicode filenames, you can use Unicode raw strings: ur"dir\notUsed.txt".

See also the question 19065115 in StockOverFlow.

store the filenames in a set: it is an optimized collection without duplicates

not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set([line.strip() for line in not_used_file])

Compile your regex

It is more efficient to compile a regex when used numerous times. Again, you should use raw strings to avoid backslashes interpretation.

pattern = re.compile(r"\w/([\w\-]+)")

Warning: os.listdir() function return a list of filenames not a list of full paths. See this function in the Python documentation.

In your example, you read a desktop directory 'C:\Users\xxx\Desktop\dir' with os.listdir(). And then you want to open each XML file in this directory with open(files, "r+"). But this is wrong, until your current working directory isn't your desktop directory. The classic usage is to used os.path.join() function like this:

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)

If you want to extract the filename's extension, you can use the os.path.splitext() function.

desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)

You can simplify this with a comprehension list:

desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']

Parse a XML file

How to parse a XML file? This is a great question! There a several possibility: - use regex, efficient but dangerous; - use SAX parser, efficient too but confusing and difficult to maintain; - use DOM parser, less efficient but clearer... Consider using lxml package (@see: http://lxml.de/)

It is dangerous, because the way you read the file, you don't care of the XML encoding. And it is bad! Very bad indeed! XML files are usually encoded in UTF-8. You should first decode UTF-8 byte stream. A simple way to do that is to use codecs.open() to open an encoded file.

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()

With this solution, the full XML content is store in the content variable as an Unicode string. You can then use a Unicode regex to parse the content.

Finally, you can use a set intersection to find if a given XML file contains commons names with the text file.

for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top