Question

Please help: I am parsing HTML using MSHTML. My code for getting all attributes of a particular tag looks like this:

void GetAttributes(MSHTML::IHTMLElementPtr pColumnInnerElement)
{
    IHTMLDOMNode *pElemDN = NULL;
    LONG lACLength;
    MSHTML::IHTMLAttributeCollection *pAttrColl;
    IDispatch* pACDisp;
    VARIANT vACIndex;
    IDispatch* pItemDisp;
    IHTMLDOMAttribute* pItem;
    BSTR bstrName;
    VARIANT vValue;
    VARIANT_BOOL vbSpecified;
    pColumnInnerElement->QueryInterface(IID_IHTMLDOMNode, (void**)&pElemDN);
    if (pElemDN != NULL)
    {
        pElemDN->get_attributes(&pACDisp);
        pACDisp->QueryInterface(IID_IHTMLAttributeCollection, (void**)&pAttrColl);
        pAttrColl->get_length(&lACLength);
        vACIndex.vt = VT_I4;
        for (int i = 0; i < lACLength; i++)
        {

            vACIndex.lVal = i;
            pItemDisp = pAttrColl->item(&vACIndex);
            if (pItemDisp != NULL)
            {
               pItemDisp->QueryInterface(IID_IHTMLDOMAttribute, (void**)&pItem);
               pItem->get_specified(&vbSpecified);
               pItem->get_nodeName(&bstrName);
               pItem->get_nodeValue(&vValue);

               if (vbSpecified)
                cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
               pItem->Release();
               pItemDisp->Release(); // release only when item() returned a non-NULL pointer
            }

        }
        pElemDN->Release();
        pACDisp->Release();
        pAttrColl->Release();
    }
}

The problem is that for the tag <input id="Switch l_id2" class="pointer" name="Switch" onclick='SetControl("Switch l",1)' type="button" value="OK"> it prints every attribute except the value attribute. get_specified returns false for the value attribute.

My output is

id :Switch l_id2
class :pointer
onclick :SetControl("Switch l",1)
type :button
name :Switch

Any idea why? Also, which other attributes may have this problem?

Note

I tried the following check on the attribute name instead of the specified flag, and it shows the correct result for value:

        if (strcmp(_com_util::ConvertBSTRToString(bstrName), "value") == 0)
        {
            cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
        }
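
A side note on the check above: _com_util::ConvertBSTRToString returns a newly allocated char* that the caller has to delete[], so calling it inline like this leaks on every call. Since a BSTR is a wchar_t* on Windows, the name can be compared as a wide string directly. Below is a minimal, portable sketch of that idea; EqualsIgnoreCase and ShouldPrintAttribute are hypothetical helper names, not part of MSHTML.

```cpp
#include <cwctype>

// Hypothetical helper: case-insensitive comparison of two wide strings.
// A BSTR is a wchar_t* on Windows, so it can be passed here directly.
bool EqualsIgnoreCase(const wchar_t* a, const wchar_t* b)
{
    if (a == nullptr || b == nullptr)
        return a == b;
    while (*a != L'\0' && *b != L'\0')
    {
        if (std::towlower(static_cast<std::wint_t>(*a)) !=
            std::towlower(static_cast<std::wint_t>(*b)))
            return false;
        ++a;
        ++b;
    }
    return *a == *b; // equal only if both strings end at the same position
}

// Hypothetical helper: print an attribute when MSHTML reports it as
// specified, or when it is the "value" attribute (the workaround above).
bool ShouldPrintAttribute(const wchar_t* name, bool specified)
{
    return specified || EqualsIgnoreCase(name, L"value");
}
```

In the loop, this would turn the condition into if (ShouldPrintAttribute(bstrName, vbSpecified == VARIANT_TRUE)), with no string conversion needed for the comparison.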

Solution

If you are working in managed (C++/CLI) VC++, you can consider the HTML Agility Pack, available via NuGet.

If sticking to MSHTML is not necessary, you could parse the HTML documents as XML instead, provided the markup is well-formed XML (XHTML). That would let you walk all the tags and attributes with a lot of flexibility, and there are plenty of XML parsers available for C++.

This library looks compact, simple, and efficient (available for multiple platforms): https://github.com/leethomason/tinyxml2

Another one is: http://pugixml.org/

This link may help if you want to get rid of the MSHTML dependency: http://www.codeproject.com/Articles/30342/Remove-Microsoft-mshtml-dependency

OTHER TIPS

Do you really care about the specified flag? You said you want to process all attributes. If that is the case, you can ignore the specified flag and simply process every attribute.

Another thing: if I were you, I would use CComPtr instead of all the raw COM pointers.
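
To illustrate why a smart pointer helps: CComPtr (from ATL's atlbase.h) calls Release() in its destructor, so every exit path releases the interface. The sketch below is not CComPtr itself. ScopedRef, IRefCounted, FakeNode and GetNode are hypothetical stand-ins that mimic the pattern without any Windows headers.

```cpp
// Stand-in for a COM interface: only the reference-counting part of IUnknown.
struct IRefCounted
{
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
    virtual ~IRefCounted() = default;
};

// Hypothetical object that counts references and records its destruction.
class FakeNode : public IRefCounted
{
public:
    explicit FakeNode(bool* destroyedFlag) : refs_(1), destroyed_(destroyedFlag) {}
    unsigned long AddRef() override { return ++refs_; }
    unsigned long Release() override
    {
        unsigned long remaining = --refs_;
        if (remaining == 0)
        {
            *destroyed_ = true;
            delete this;
        }
        return remaining;
    }
private:
    unsigned long refs_;
    bool* destroyed_;
};

// Minimal owning pointer in the spirit of CComPtr: Release() runs in the
// destructor, so no manual cleanup is needed on any exit path.
template <typename T>
class ScopedRef
{
public:
    ScopedRef() : p_(nullptr) {}
    ~ScopedRef() { if (p_) p_->Release(); }
    ScopedRef(const ScopedRef&) = delete;
    ScopedRef& operator=(const ScopedRef&) = delete;
    T** operator&() { return &p_; }      // lets a getter fill the pointer
    T* operator->() const { return p_; }
    explicit operator bool() const { return p_ != nullptr; }
private:
    T* p_;
};

// Mimics a COM getter that returns an owned reference via an out-parameter.
void GetNode(FakeNode** out, bool* destroyedFlag)
{
    *out = new FakeNode(destroyedFlag);
}

// Returns true if the node was released when the smart pointer left scope.
bool ReleasedOnScopeExit()
{
    bool destroyed = false;
    {
        ScopedRef<FakeNode> node;
        GetNode(&node, &destroyed);
        if (!node)
            return false;
        // No explicit Release() here: the destructor takes care of it.
    }
    return destroyed;
}
```

With the real CComPtr, a declaration such as CComPtr<IHTMLDOMNode> pElemDN; replaces the raw pointer, and the manual Release() calls at the end of the function go away.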

I have never worked with this before, but according to the library docs and the DOM spec, get_nodeValue() does different things depending on the type of node object. Try calling get_nodeValue() or get_nodeName() on the IHTMLDOMNode object. It seems that some properties, such as "value", "ID" and "Name", are not part of the attribute collection under the DOM.


MSHTML docs:

DOM spec:

Check whether the element is an input, then query it for the IID_IHTMLInputElement interface and call get_value.

Licensed under: CC-BY-SA with attribution