Question

I need to create an array that contains all the text from a page, without using jQuery. This is my HTML:

<html>
<head>
    <title>Hello world!</title>
</head>
<body>
    <h1>Hello!</h1>
    <p>
        <div>What are you doing?</div>
        <div>Fine, and you?</div>
    </p>
    <a href="http://google.com">Thank you!</a>
</body>
</html>

Here is what I want to get:

text[1] = "Hello world!";
text[2] = "Hello!";
text[3] = "What are you doing?";
text[4] = "Fine, and you?";
text[5] = "Thank you!";

Here is what I have tried, but it does not seem to work correctly in my browser:

var elements = document.getElementsByTagName('*');
console.log(elements);

PS. I need to use document.getElementsByTagName('*'); and exclude "script" and "style".

Solution

You could do something like this:

    var array = [];
    var elements = document.body.getElementsByTagName("*");

    for (var i = 0; i < elements.length; i++) {
        var current = elements[i];
        // Check that the element has no child elements and that its text is not just whitespace
        if (current.children.length === 0 && current.textContent.replace(/\s/g, '') !== '') {
            array.push(current.textContent);
        }
    }

Demo

result = ["What are you doing?", "Fine, and you?"]

Or you could use document.documentElement.getElementsByTagName('*'); which also covers the elements inside <head>, such as <title>.

Also make sure your code runs only after the DOM has loaded, by placing it inside this handler:

document.addEventListener('DOMContentLoaded', function(){

   /// Code...
});

If it's just the title you need, you may as well do this:

array.push(document.title);

This saves looping through scripts and styles.
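Since the question also asks to exclude "script" and "style", the loop from this answer could be wrapped in a function that skips those tags. This is only a sketch: the name collectLeafText and the skip list are assumptions of mine, not part of the answer, and unlike the loop above it trims each string before storing it. It accepts any array-like list of elements, so it can be fed the result of getElementsByTagName('*').

```javascript
// Hypothetical helper (not from the answer): collects the text of leaf
// elements, skipping <script> and <style>.
function collectLeafText(elements) {
  var skip = { SCRIPT: true, STYLE: true };
  var out = [];
  for (var i = 0; i < elements.length; i++) {
    var el = elements[i];
    if (skip[String(el.tagName).toUpperCase()]) continue;
    var text = (el.textContent || "").trim();
    // Keep only elements with no child elements and some non-whitespace text
    if (el.children.length === 0 && text !== "") out.push(text);
  }
  return out;
}

// In a browser, call this after DOMContentLoaded has fired:
if (typeof document !== "undefined") {
  console.log(collectLeafText(document.documentElement.getElementsByTagName("*")));
}
```

Passing document.documentElement rather than document.body means the <title> element is visited too, so document.title does not need to be pushed separately.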

OTHER TIPS

If you want the contents of the entire page, you should be able to use

var allText = document.body.textContent;

Internet Explorer before IE9 does not support textContent; those versions offer the innerText property instead, which is similar but not identical. The MDN page about textContent has more detail.

Now one problem here is that textContent will get you the content of any <style> or <script> tags, which may or may not be what you want. If you don't want that, you could use something like this:

function getText(startingPoint) {
  var text = "";
  function gt(start) {
    if (start.nodeType === 3)
      text += start.nodeValue;
    else if (start.nodeType === 1)
      if (start.tagName != "SCRIPT" && start.tagName != "STYLE")
        for (var i = 0; i < start.childNodes.length; ++i)
          gt(start.childNodes[i]);
  }
  gt(startingPoint);
  return text;
}

Then:

var allText = getText(document.body);

Note: this (or document.body.innerText) will get you all the text, but in a depth-first order. Getting all the text from a page in the order that a human actually sees it once the page is rendered is a much more difficult problem, because it'd require the code to understand the visual effects (and visual semantics!) of the layout as dictated by CSS (etc).

edit — if you want the text "stored into an array", I suppose on a node-by-node basis (?), you'd simply substitute array appends for the string concatenation in the above:

function getTextArray(startingPoint) {
  var text = [];
  function gt(start) {
    if (start.nodeType === 3)
      text.push(start.nodeValue);
    else if (start.nodeType === 1)
      if (start.tagName != "SCRIPT" && start.tagName != "STYLE")
        for (var i = 0; i < start.childNodes.length; ++i)
          gt(start.childNodes[i]);
  }
  gt(startingPoint);
  return text;
}
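One thing to be aware of: text nodes that hold nothing but indentation and line breaks also end up in the array returned by getTextArray. A possible variant, sketched here under the hypothetical name getTrimmedTextArray, trims each node value and drops the ones that come out empty. It relies only on nodeType, nodeValue, tagName, and childNodes, just like the functions above.

```javascript
// Hypothetical variant of getTextArray that skips whitespace-only text nodes.
function getTrimmedTextArray(startingPoint) {
  var text = [];
  function gt(node) {
    if (node.nodeType === 3) {               // text node
      var value = node.nodeValue.trim();
      if (value !== "") text.push(value);
    } else if (node.nodeType === 1 &&        // element node
               node.tagName !== "SCRIPT" && node.tagName !== "STYLE") {
      for (var i = 0; i < node.childNodes.length; ++i) {
        gt(node.childNodes[i]);
      }
    }
  }
  gt(startingPoint);
  return text;
}
```

In a browser it would be called the same way as before, for example getTrimmedTextArray(document.body).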

Seems to be a one line solution (fiddle):

document.body.innerHTML.replace(/^\s*<[^>]*>\s*|\s*<[^>]*>\s*$|>\s*</g,'').split(/<[^>]*>/g)

This may fail if there are complicated scripts in the body, though, and I know that parsing HTML with regular expressions is not a very clever idea, but for simple cases or for demo purposes it still can be suitable, can't it? :)
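To make the one-liner easier to try out, here is a sketch that applies the same two regular expressions to an HTML string instead of document.body.innerHTML. The function name splitHtmlText and the final filter step are my additions: nested tags can leave empty strings in the split result, so those are dropped.

```javascript
// Hypothetical wrapper around the one-liner above; takes an HTML string.
function splitHtmlText(html) {
  return html
    // Strip the first tag, the last tag, and whitespace between adjacent tags
    .replace(/^\s*<[^>]*>\s*|\s*<[^>]*>\s*$|>\s*</g, "")
    // Split on whatever tag remnants are left
    .split(/<[^>]*>/g)
    // Added step: drop empty or whitespace-only pieces
    .filter(function (s) { return s.trim() !== ""; });
}

var sample = '\n<h1>Hello!</h1>\n<div>What are you doing?</div>\n' +
             '<a href="http://google.com">Thank you!</a>\n';
console.log(splitHtmlText(sample));
// logs ["Hello!", "What are you doing?", "Thank you!"]
```

In a browser, splitHtmlText(document.body.innerHTML) would be the equivalent call, with the same caveats about scripts in the body.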

Walk the DOM tree, get all the text nodes, get the nodeValue of the text node.

var result = [];
var itr = document.createTreeWalker(
    document.getElementsByTagName("html")[0],
    NodeFilter.SHOW_TEXT,
    null, // no filter
    false);
while(itr.nextNode()) {
    if(itr.currentNode.nodeValue != "")
        result.push(itr.currentNode.nodeValue);
}
alert(result);
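The walker above also returns the text inside <script> and <style>, as well as whitespace-only nodes. A filter can take care of both; the following is a sketch under some assumptions. The function name collectTextNodes is hypothetical, and the numeric constants are spelled out (they are the values of NodeFilter.SHOW_TEXT, NodeFilter.FILTER_ACCEPT, and NodeFilter.FILTER_REJECT) so the function does not depend on the NodeFilter global. Because the walker only shows text nodes, the filter has to look at each node's parent to find the enclosing tag.

```javascript
// Values of NodeFilter.SHOW_TEXT, FILTER_ACCEPT and FILTER_REJECT
var SHOW_TEXT = 4, FILTER_ACCEPT = 1, FILTER_REJECT = 2;

// Hypothetical helper: doc is a Document, root is the node to start from.
function collectTextNodes(doc, root) {
  var walker = doc.createTreeWalker(root, SHOW_TEXT, {
    acceptNode: function (node) {
      var parent = node.parentNode;
      var tag = parent ? parent.nodeName.toUpperCase() : "";
      // Reject text that lives directly inside <script> or <style>
      if (tag === "SCRIPT" || tag === "STYLE") return FILTER_REJECT;
      // Reject whitespace-only text nodes
      return node.nodeValue.trim() === "" ? FILTER_REJECT : FILTER_ACCEPT;
    }
  });
  var result = [];
  while (walker.nextNode()) {
    result.push(walker.currentNode.nodeValue.trim());
  }
  return result;
}
```

In a browser this would be called as collectTextNodes(document, document.documentElement).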

Alternate method: Split on the HTML tag's textContent.

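var result = document.getElementsByTagName("html")[0].textContent.split("\n");
// Iterate backwards so that splice() does not skip the entry after a removed one
for (var i = result.length - 1; i >= 0; i--)
    if (result[i].trim() == "")   // also drops lines that contain only whitespace
        result.splice(i, 1);
alert(result);
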
    <html>
    <head>
            <title>Hello world!</title>
    </head>
    <body>
            <h1>Hello!</h1>
            <p>
                    <div>What are you doing?</div>
                    <div>Fine, 
                        <span> and you? </span>
                    </div>
            </p>
            <a href="http://google.com">Thank you!</a>
            <script type="text/javascript">
                function getLeafNodesOfHTMLTree(root) {
                    if (root.nodeType == 3) {
                        return [root];
                    } else {
                        var all = [];
                        for (var i = 0; i < root.childNodes.length; i++) {
                            var ret2 = getLeafNodesOfHTMLTree(root.childNodes[i]);
                            all = all.concat(ret2);
                        }
                        return all;
                    }
                }
                var allnodes = getLeafNodesOfHTMLTree(document.getElementsByTagName("html")[0]);
                console.log(allnodes);
                // In modern browsers that support Array.prototype.filter and map:
                // keep only text nodes that contain non-whitespace characters
                allnodes = allnodes.filter(function (node) {
                    return node && node.nodeValue && node.nodeValue.replace(/\s/g, '').length;
                });
                // Replace each text node with its text
                allnodes = allnodes.map(function (node) {
                    return node.nodeValue;
                });
                console.log(allnodes);
            </script>
    </body>
    </html>
Licensed under: CC-BY-SA with attribution