Question

I have a tab-separated file that I need to convert into XML with the relevant child nodes. The file looks like this:

Miscellaneous           
Ceremonial      
    Test1           
    Test2
    Test3
Sport       
    Athletics   
    Basketball  
    Biathlon    
    Boxing  
    Canoeing    
    Clay Pigeon Shooting    
    Climbing    
    Cricket 
    Cycling 
    Diving  
    Football    
    Football    
    Freefall    
    Gliding 
    Hill Walking    
    Hockey  
    Martial Arts    
        Karate
        Judo
        Jujitsu
    Modern Pentathlon   
    Mountaineering  
    Orienteering    
    Parachuting 
    Paragliding 
    Parascending    
    Polo    
    Rugby   
    Rugby League    
    Rugby Union 
    Soccer  

I am stuck at the third level of nodes, i.e. the children of Martial Arts.

Here is the code I have written; it works fine up to the second level.

Could anyone please tell me what to fix to make it work for the third and deeper levels?

<cfif structKeyExists(form, "xlsfile") and len(form.xlsfile)>

<!--- Destination outside of web root --->
<cfset dest = getTempDirectory() />
<cffile action="upload" destination="#dest#" filefield="xlsfile" result="upload" nameconflict="makeunique">
<cfset theFileUploaded = upload.serverDirectory & "/" & upload.serverFile />
<cffile action="read" file="#theFileUploaded#" variable="theFile">
<cfset CrLf = chr(10) & chr(13) />
<cfset counter = 0 />

<cfset dataStr = structNew()>
<cfset isRoot = false>
<cfset tabCount = 0>
<cfset counter = 1>
<cfset childCounter = 1>
<cfset previousResult = 1>

<cfloop list="#theFile#" index="run" delimiters="#CrLf#">
    <!--- The test value. --->
    <cfset strTest = #Rtrim(run)# />
    <!--- The instance counter. --->
    <cfset intCount = 0 />
    <!--- Get the initial position. --->
    <cfset intPosition = Find( chr(9), strTest, 0 ) />
    <!--- Keep searching till no more instances are found. --->
    <cfloop condition="intPosition">
        <!--- Increment instance counter. --->
        <cfset intCount = (intCount + 1) />
        <!--- Get the next position. --->
        <cfset intPosition = Find(chr(9), strTest, (intPosition + Len( chr(9) ))) />
    </cfloop>

    <!--- Output the number of target instances.
     <cfoutput>
        --- #intCount# --- <br/>
    </cfoutput>         --->
    <cfset childNode = "Child" & counter>
    <cfdump var="#intCount-tabCount#">
    <!--- Root --->
    <cfif intCount eq 0>
        <cfset dataStr.root = strTest>
        <cfset tabCount = intCount>
    <!--- Child at level 1 ---> 
    <cfelseif tabCount eq 0 >
        <cfset tabCount = intCount>
        <cfset dataStr[childNode] = StructNew()>
        <cfset dataStr[childNode].root = strTest>
    <!--- Child at sub levels --->  
    <cfelseif ((intCount-tabCount) eq 0) or ((intCount-tabCount) eq 1)>

        <cfif previousResult eq 0 and intCount-tabCount eq 1>
            <cfdump var="#strTest#">
        </cfif> 

            <cfset tabCount = intCount>         
            <cfset tabCount = intCount>
            <cfset subChildNode = "Child" & childCounter>
            <cfset dataStr[childNode][subChildNode] = strTest>      
            <cfset childCounter = childCounter+1>
            <cfset previousResult = intCount-tabCount>

    <cfelseif previousResult eq 0>
        <cfset counter = counter+1>
        <cfset childNode = "Child" & counter>
        <cfset childCounter = 1>
        <cfset tabCount = intCount>
        <cfset dataStr[childNode] = StructNew()>
        <cfset dataStr[childNode].root = strTest>                       
    <cfelse>
        <cfset counter = counter+1>
        <cfset childNode = "Child" & counter>
        <cfset childCounter = 1>
        <cfset tabCount = intCount>
        <cfset dataStr[childNode] = StructNew()>
        <cfset dataStr[childNode].root = strTest>
    </cfif>


</cfloop>

<cfdump var="#dataStr#">


Solution

I'm going to answer this on the assumption that you are struggling with the concept of recursion and hierarchical (parent-child) data structures, as you did not make it clear in your question what exactly is the problem.

Your loop within a loop handles the first two levels, but you can already see from your own code that it has become cumbersome to manage. If your tabbed txt file suddenly gets a fourth or fifth level of children, you'll have to keep updating your code.

The solution to this is to write a recursive function; that is, a function that calls itself.

First, set up a base XML document object, whose root node we'll arbitrarily call "Categories":

<cfset XmlDoc = XmlNew(true) />
<cfset XmlDoc.xmlRoot = XmlElemNew(XmlDoc, "Categories") />

Let's also read in the contents of your txt file (the one you've provided above which is tabbed reflecting the hierarchy of parents-to-children):

<cffile action="read" file="c:\workspace\nodes.txt" variable="nodes">

Obviously, the txt file can come from anywhere, so I'll leave that to you to adjust. Just take note that we end up with a variable named "nodes" which contains the contents of your tabbed txt file above.

Next, you're going to pass the XmlDoc, along with the current node (which, to start, will be the root) and the file content, into a new function you will write:

<cfset parseNodes( XmlDoc, XmlDoc['Categories'], nodes ) />

Now, you're going to write the recursive function that processes your 'nodes' variable, converting what it finds into xml elements and attaching them to the root xml element which was passed in to start, that being 'categories'. Let's look at this function in detail before I blast the entire thing at you:

<cffunction name="parseNodes" returntype="string">
    <cfargument name="rootXml" type="xml" required="true" />
    <cfargument name="parentNode" type="xml" required="true" />
    <cfargument name="content" type="string" required="true" />
    <cfargument name="level" type="numeric" required="false" default="0" />

Argument 1 is the root xml document, which you will continue to pass via recursive calls, as it's required for xml node generation (via XmlElemNew()).

Argument 2 is the parent xml node you'll be attaching children to.

Argument 3 is the current content (what remains of your parsed tabbed txt file) which you'll see in a moment we eat away at during processing.

Argument 4 is a marker we'll use to keep track of what "layer" we're currently at in the parent-child hierarchy. To start, we're at the highest level (0), since we did not specify the argument when we called the parseNodes() function above.

<cfset var thisLine = "" />
<cfset var localContent = arguments.content />

We declare these with var so they are local to each call of the function. Without var, CF would put them in the page's shared Variables scope, and each recursive call would overwrite the values its caller was still using.

<cfloop condition="Len(localContent)">

We'll next begin to loop on a condition: Loop until there is no more length to the localContent variable. We do this because as we recursively call ourselves, we're going to need to continue to "eat up" the content we've already processed, which will prevent us from re-processing it over and over as we enter-and-exit the recursed function call.

<cfset thisLine = ListGetAt(localContent, 1, Chr(13) & Chr(10)) />

We'll grab the first line in the txt file, using a new line as a delimiter.

<cfif CountIt(thisLine, chr(9)) eq arguments.level>

Here, we are going to count the number of tabs discovered in the current line we are processing. The CountIt() function is an external UDF which is available on CFLib.org; I will include it below in the final code preview. We count the number of tabs to determine if the line we are working on matches the current level in the parent-child hierarchy. So, for example, if we are at root (0) and we count 1 tab, we know right away that we're not at the right level and therefore need to recurse down.
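One thing to watch for: the sample in the question appears to have trailing whitespace on some lines, and the asker's own code calls Rtrim() on each line before counting. If those trailing characters are tabs, CountIt() would count them too and report a level that is too deep. A minimal sketch of a helper that counts only the tabs at the start of a line follows; the name countLeadingTabs is hypothetical and not part of the original answer.

```cfml
<cfscript>
// Hypothetical helper (not from the original answer): counts only the tab
// characters at the very start of a line, so trailing tabs are ignored.
function countLeadingTabs(line) {
    var count = 0;
    // Walk forward while the next character is a tab (chr(9)).
    while (count lt len(line) and mid(line, count + 1, 1) eq chr(9)) {
        count = count + 1;
    }
    return count;
}
</cfscript>
```

If you need this, call countLeadingTabs(thisLine) wherever the code calls CountIt(thisLine, chr(9)).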

<cfset arguments.parentNode.XmlChildren[ArrayLen(arguments.parentNode.XmlChildren)+1] = XmlElemNew(arguments.rootXml, '#Replace(Trim(thisLine),' ','_','ALL')#') />
<cfset arguments.parentNode.XmlChildren[ArrayLen(arguments.parentNode.XmlChildren)].XmlText = Trim(thisLine) />

We've determined we are at the correct level, so we add a new element to the XmlChildren array, with its XmlName set to the value we parsed out (held in thisLine), and then set that element's XmlText to the same trimmed value. Notice that when the actual XmlElemNew() function is called, we Trim() thisLine to be safe and convert any blank spaces to underscores, as white space is invalid in the name of an XML element (i.e. <My Xml Node> would produce an error). This only handles spaces; a label containing other characters that are not allowed in XML names would still fail.

<cfset localContent = ListDeleteAt(localContent, 1, Chr(13) & Chr(10)) />

Here is where we "eat up" the line of content in your txt file that we've processed, so that it won't get processed again. We treat the content as a list again (using CR and LF as delimiters; each character in the delimiter string counts as a delimiter on its own, so their order does not matter) and delete the first (topmost) item.

Now, the next two lines are what we do if we determine we are NOT on the correct level of the parent-child hierarchy:

<cfelseif CountIt(thisLine, chr(9)) gt arguments.level>

  <cfset localContent = parseNodes( arguments.rootXml, arguments.parentNode.XmlChildren[ArrayLen(arguments.parentNode.XmlChildren)], localContent, arguments.level + 1 ) />

Here we determine that the count of tabs in the current line is greater than the level we are working on, and therefore must recurse down. That happens in the next line, in which the parseNodes() function, which we are already in, is called again, but with slightly updated parameters:

  1. We still pass the root xml document.
  2. We now pass the most recently created child element as the new parent node.
  3. We pass in our current txt content (remember, this is the one we are "eating up" as we go).
  4. We pass in the current level in the hierarchy plus 1, so that when we arrive within the body of the function again, we are working on the correct level.

Finally, and most importantly, notice that the return of the method updates the localContent variable. This is important! Recursive calls to the function are also going to "eat up" the parsed txt file, so it is important to make certain that each outside call also works with the most up-to-date parsed (and eaten up) content.

The last condition executes if the count of tabs is less than the current tier, which means we need to exit the current recursive iteration, and return to the parent, being sure to return the "eaten up" content we've processed thus far in this iteration:

    <cfelse>

        <cfreturn localContent />

    </cfif>

</cfloop>

<cfreturn '' />

</cffunction>

You now have a single function that can recursively call itself and handle any number of tiers of parent-child relationships.

COMPLETED CODE

<!--- CR + LF; the list functions treat each character as a separate delimiter --->
<cfset nl = chr(13) & chr(10) />
<cfset tab = chr(9) />

<cfscript>
//@author Peini Wu (pwu@hunter.com) 
function CountIt(str, c) {
    var pos = findnocase(c, str, 1);
    var count = 0;

    if(c eq "") return 0;

    while(pos neq 0){
        count = count + 1;
        pos = findnocase(c, str, pos+len(c));
    }

    return count;
}
</cfscript>

<cffunction name="parseNodes" returntype="string">
    <cfargument name="rootXml" type="xml" required="true" />
    <cfargument name="parentNode" type="xml" required="true" />
    <cfargument name="content" type="string" required="true" />
    <cfargument name="level" type="numeric" required="false" default="0" />

    <cfset var thisLine = "" />
    <cfset var localContent = arguments.content />

    <!--- we will loop until the localContent is entirely processed/eaten up, and we'll trim it as we go --->
    <cfloop condition="Len(localContent)">

        <cfset thisLine = ListGetAt(localContent, 1, nl) />

        <!--- handle everything at my level (as specified by arguments.level) --->      
        <cfif CountIt(thisLine, tab) eq arguments.level>

            <cfset arguments.parentNode.XmlChildren[ArrayLen(arguments.parentNode.XmlChildren)+1] = XmlElemNew(arguments.rootXml, '#Replace(Trim(thisLine),' ','_','ALL')#') />

            <cfset arguments.parentNode.XmlChildren[ArrayLen(arguments.parentNode.XmlChildren)].XmlText = Trim(thisLine) />         

            <!--- this line has been processed, so strip it away --->
            <cfset localContent = ListDeleteAt(localContent, 1, nl) />

        <!--- the current line is the next level down, so we must recurse upon ourselves --->           
        <cfelseif CountIt(thisLine, tab) gt arguments.level>

            <cfset localContent = parseNodes( arguments.rootXml, arguments.parentNode.XmlChildren[ArrayLen(arguments.parentNode.XmlChildren)], localContent, arguments.level + 1 ) />

        <!--- the current level is completed, and the next line processed is determined as a "parent", so we return what we have processed thus far, allowing the recursed parent function
        to continue processing from that point --->     
        <cfelse>

            <cfreturn localContent />

        </cfif>

    </cfloop>

    <!--- at the very end, we've processed the entire text file, so we can simply return an empty string --->
    <cfreturn '' />
</cffunction>

<cffile action="read" file="c:\workspace\cf\sandbox\nodes.txt" variable="nodes">

<cfset XmlDoc = XmlNew(true) />
<cfset XmlDoc.xmlRoot = XmlElemNew(XmlDoc, "Categories") />
<cfset parseNodes( XmlDoc, XmlDoc['Categories'], nodes ) />

<cfdump var="#XmlDoc#" />

<textarea rows="40" cols="40">
<cfoutput>#HTMLEditFormat(ToString(XmlDoc))#</cfoutput>
</textarea>
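To try the function without a file on disk, you can feed it a small string built in memory. This is only a sketch: it assumes parseNodes() and CountIt(), along with the nl and tab variables from the code above, are already defined on the same page. The names sampleNodes and sampleDoc are made up for this example.

```cfml
<!--- Hypothetical in-memory sample: three levels, lines separated by LF --->
<cfset lf = Chr(10) />
<cfset sampleNodes = "Sport" & lf
    & Chr(9) & "Martial Arts" & lf
    & Chr(9) & Chr(9) & "Karate" & lf
    & Chr(9) & Chr(9) & "Judo" & lf
    & Chr(9) & "Polo" />

<!--- Same setup as above, with a separate document for the test --->
<cfset sampleDoc = XmlNew(true) />
<cfset sampleDoc.xmlRoot = XmlElemNew(sampleDoc, "Categories") />
<cfset parseNodes(sampleDoc, sampleDoc.xmlRoot, sampleNodes) />

<cfdump var="#sampleDoc#" />
```

Tracing the function by hand on this input, Karate and Judo should end up under Martial_Arts, and Polo should be a sibling of Martial_Arts under Sport.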

CAVEAT

You didn't make clear in your question what format of final XML you'd like, so this process here creates a somewhat redundant structure of nodes which match their values (which isn't very useful):

<?xml version="1.0" encoding="UTF-8"?>
<Categories>
  <Miscellaneous>Miscellaneous</Miscellaneous>

This is probably NOT what you are going to want down the road, but unless you specify further, I must guess and come up with assumptions to keep the example simple.
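If you would rather not turn every label into an element name, one possible alternative (purely an assumption about what you might want) is to use the same element name for every node and put the label in an attribute. A minimal standalone sketch, where the element name "Category" and the attribute "name" are invented for illustration:

```cfml
<!--- Hypothetical alternative shape: <Category name="Martial Arts"> --->
<cfset demoDoc = XmlNew(true) />
<cfset demoDoc.xmlRoot = XmlElemNew(demoDoc, "Categories") />

<!--- Create a generic element and store the label as an attribute --->
<cfset newNode = XmlElemNew(demoDoc, "Category") />
<cfset newNode.XmlAttributes["name"] = "Martial Arts" />

<!--- Attach it to the parent, here the document root --->
<cfset ArrayAppend(demoDoc.xmlRoot.XmlChildren, newNode) />

<cfdump var="#demoDoc#" />
```

Inside parseNodes(), the same three steps would replace the two lines that create the element and set its XmlText, with Trim(thisLine) as the attribute value. That also sidesteps the problem of labels that are not valid XML element names.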

Licensed under: CC-BY-SA with attribution