Parsing structure/hierarchy of html and recreating it in a different form with javascript

StackOverflow https://stackoverflow.com/questions/23146998

Question

I am trying to crawl some webpages with JavaScript to gather information about the hierarchy of their content. I'm using casperjs to do the crawling, and that is working OK so far.

The information that I want to parse is structured like this:

<ul>
    <a></a>
    <li>
        <h3>
            <a>
                Category
                <span>Description for Category</span>
            </a>
        </h3>
        <div>
            <ul>
                <li>
                    <a>SubCategory</a>
                </li>
            </ul>
        </div>
    </li>
</ul>

But what I want to end up with is this.

<ul>
    <li>Category
        <ul>
            <li>SubCategory</li>
        </ul>
    </li>
</ul>

I want to use the above HTML in a different webpage, so basically I'll be writing it to a file from casperjs so that I can then copy and paste it into another document. I'm crawling because it's tedious to do manually (90-some pages and lots of data per page).

What's the best way to deconstruct/parse a hierarchy and then recreate it? Stay within the DOM and restructure using jQuery? Pull it out into a tree structure and rebuild it later?

Solution

Please note that this is a specific solution and will only work for the exact markup layout that you provided.

I created a parser in jQuery that reads HTML markup from a textarea and converts it into the format that you want:

$(function(){

    $("button").click(function(){
        //Read in HTML
        $("#parser").html($("textarea").val());

        //Parse: each category is an <a> inside an <h3> inside an <li>
        var categories = $("#parser > ul").find("li h3 a");
        //Drop the description <span> so only the category name remains
        categories.find("span").remove();

        //Output result (escaped so it is displayed as text in #result)
        var output = "&lt;ul&gt;\n";
        for(var i = 0; i < categories.length; i++)
        {
            //Get subcategories for this category
            var subCategories = $(categories[i]).closest("h3").siblings("div").find("ul li a");

            //Add markup to output
            output += "\t&lt;li&gt;" + minimize($(categories[i]).html()) + "\n\t\t&lt;ul&gt;\n";

            for(var j = 0; j < subCategories.length; j++)
            {
                output += "\t\t\t&lt;li&gt;" + $(subCategories[j]).html() + "&lt;/li&gt;\n";
            }

            //Close this category's sub-list and list item
            output += "\t\t&lt;/ul&gt;\n\t&lt;/li&gt;\n";
        }
        //Close the outer list once, after all categories
        output += "&lt;/ul&gt;\n";

        $("#result").html(output);
    });

});

//Removes every run of two or more white-space characters (the indentation
//and line breaks around the text). Single spaces between words are kept.
function minimize(str)
{
    return str.replace(/\s{2,}/g, '');
}


It was a lot of work and is heavily customized. As I said earlier, the selectors used here are tailored to this specific markup layout.

Example:

var categories = $("#parser > ul").find("li h3 a");

This selects the <ul> directly below #parser and finds every <a> inside an <h3> inside an <li>; those are the categories. Later the code uses

$(categories[i]).closest("h3").siblings("div").find("ul li a");

which walks up from the category <a> to its <h3>, then looks in the sibling <div> for descendants of the form <ul><li><a></a></li></ul>.

So if the format is not this:

<ul>
    <li>
        <h3>
            <a>Category</a>
        </h3>
        <div>
            <ul>
                <li>
                    <a>Subcategory</a>
                </li>
            </ul>
        </div>
    </li>
</ul>

It will not work.

Other tips

I ended up going with this approach:

  1. Scrape the tags from the existing website and assemble them into a nested JavaScript object (an array of objects).
  2. Write that object out to a file with JSON.stringify.
  3. Load it into a new page as a JavaScript object, and build the ul/li structure with a recursive function that traverses the object.
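
As a rough illustration of steps 2 and 3, here is a minimal sketch. The shape of the intermediate object (a `name` plus an optional `children` array) and the helper names `escapeHtml` and `renderList` are my own assumptions for the example, not something taken from the actual scraper; the scraping in step 1 is left out.

```javascript
// Hypothetical intermediate data: each node has a `name` and an
// optional `children` array of nodes with the same shape.
var tree = [
    {
        name: "Category",
        children: [
            { name: "SubCategory" }
        ]
    }
];

// Escape text so it can be placed safely inside HTML.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// Recursively turn an array of nodes into nested <ul>/<li> markup.
// `depth` only controls indentation (two 4-space steps per level).
function renderList(nodes, depth) {
    depth = depth || 0;
    var pad = new Array(depth * 2 + 1).join("    ");
    var html = pad + "<ul>\n";
    nodes.forEach(function (node) {
        html += pad + "    <li>" + escapeHtml(node.name);
        if (node.children && node.children.length) {
            // Nested list goes inside the same <li>.
            html += "\n" + renderList(node.children, depth + 1) + pad + "    ";
        }
        html += "</li>\n";
    });
    return html + pad + "</ul>\n";
}

// Step 2: serialize (this string is what would be written to a file).
var serialized = JSON.stringify(tree);

// Step 3: parse it back and build the markup.
var markup = renderList(JSON.parse(serialized));
```

For the sample data, `markup` has the same structure as the target markup in the question. Keeping the data as plain objects in the middle means the scraping code and the rendering code can be changed independently.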

I found it too hard to get my head around modifying the DOM as in the other answer(s). It was easier to break it down into multiple steps, with a well-structured JavaScript object in the middle.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow