How to navigate the DOM using Nokogiri

https://stackoverflow.com/questions/657468

19-08-2019

Question

I'm trying to fill in the variables parent_element_h1 and parent_element_h2. Can anyone help me use Nokogiri to get the information I need into those variables?

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <h1>Foo</h1>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <h2>Bar</h2>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# start_here is given: a Nokogiri::XML::Element for the <div> with the id 'X2'
start_here = parent.at('div.block#X2')

# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
parent_element_h1 = 

# this should be a Nokogiri::XML::Element of the nearest, previous h2. 
# in this example it's the one with the value 'Bar'
parent_element_h2 =

Please note: the start_here element could be anywhere in the document. The HTML data is only an example. That said, the <h1> and <h2> headings could each be a sibling of start_here or a child of a sibling of start_here.

The following recursive method is a good starting point, but it doesn't work for the <h1> because that heading is a child of a sibling of start_here:

def search_element(_block,_style)
  unless _block.nil?
    if _block.name == _style
      return _block
    else
      search_element(_block.previous,_style)
    end
  else
    return false
  end
end

parent_element_h1 = search_element(start_here,'h1')
parent_element_h2 = search_element(start_here,'h2')

After accepting an answer, I came up with my own solution. It works like a charm and I think it's pretty cool.

Solution

I guess this comes a few years too late, but I felt compelled to post because all the other solutions are far too complicated.

It's a single statement with XPath:

# doc is the parsed document (called value in the question)
start = doc.at('div.block#X2')

start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>

start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>

This matches headings that are direct preceding siblings as well as headings nested inside preceding siblings. Because the union is returned in document order, the last() predicate ensures that you get the nearest preceding match, whichever branch it comes from.

Other tips

The approach I would take (if I understand your problem) is to use XPath or CSS to find your "start_here" element and the parent element you want to search under. Then walk the tree recursively starting from the parent, stopping when you hit the "start_here" element, and keep track of the last element that matched your style along the way.

Something like:

parent = value.search("//body").first
div = value.search("//div[@id = 'X2']").first

find = FindPriorTo.new(div)

assert_equal('Foo', find.find_from(parent, 'h1').text)
assert_equal('Bar', find.find_from(parent, 'h2').text)

Where FindPriorTo is a simple class to handle the recursion:

class FindPriorTo
  def initialize(stop_element)
    @stop_element = stop_element
  end

  def find_from(parent, style)
    @should_stop = nil
    @last_style  = nil

    recursive_search(parent, style)
  end

  def recursive_search(parent, style)
    parent.children.each do |ch|
      recursive_search(ch, style)
      return @last_style if @should_stop

      @should_stop = (ch == @stop_element)
      @last_style = ch if ch.name == style
    end

    @last_style    
  end

end

If this approach isn't scalable enough, you may be able to optimize things by rewriting recursive_search so that it doesn't use recursion, and by passing in both styles you're looking for and keeping track of the last match for each, so that you don't have to traverse the tree a second time.

I'd also suggest monkey patching Node to hook in while the document is being parsed, but it looks like all of that is written in C. You might be better off using something other than Nokogiri that has a native Ruby SAX parser (maybe REXML), or, if speed is your real concern, doing the search portion in C/C++ using Xerces or similar. I don't know how well those cope with parsing HTML, though.
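
As a rough illustration of the event-driven idea, here is a sketch that uses REXML's pure-Ruby stream listener interface. It rests on two assumptions: the markup is well-formed XML (REXML is not an HTML parser), and the text of each heading is enough, since a stream listener sees events rather than element objects. The class name HeadingListener and its parameters are hypothetical.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

xml = <<~XML
  <html>
    <body>
      <div class='block' id='X1'>
        <h1>Foo</h1>
      </div>
      <h2>Bar</h2>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
      <h2>Too late</h2>
    </body>
  </html>
XML

# Hypothetical listener: remembers the text of the most recent heading of each
# requested kind seen before the element whose id equals stop_id.
class HeadingListener
  include REXML::StreamListener

  attr_reader :last_text

  def initialize(stop_id, tags)
    @stop_id = stop_id
    @tags = tags
    @last_text = {}
    @current = nil
    @done = false
  end

  def tag_start(name, attrs)
    return if @done

    # Iterating as pairs works whether attrs arrives as a hash or as pairs.
    if attrs.any? { |key, value| key == 'id' && value == @stop_id }
      @done = true
    elsif @tags.include?(name)
      @current = name
      @last_text[name] = +''
    end
  end

  def text(value)
    @last_text[@current] << value if @current && !@done
  end

  def tag_end(name)
    @current = nil if name == @current
  end
end

listener = HeadingListener.new('X2', %w[h1 h2])
REXML::Document.parse_stream(xml, listener)

puts listener.last_text['h1']
puts listener.last_text['h2']
```

The parser still reads the rest of the input after the stop element; the listener simply ignores those events.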

Maybe this will do it. I'm not sure about the performance, or whether there are cases I haven't thought of.

def find(root, start, tag)
    ps, res = start, nil
    until res or (ps == root)
        ps  = ps.previous || ps.parent
        res = ps.css(tag).last
        res ||= ps.name == tag ? ps : nil
    end
    res || "Not found!"
end

parent_element_h1 =  find(parent, start_here, 'h1')

This is my own solution (kudos to my coworker for helping me with this!). It uses a recursive method that examines every element, regardless of whether it is a sibling or a child of another sibling.

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <h1>Foo</h1>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <h2>Bar</h2>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# start_here is given: a Nokogiri::XML::Element for the <div> with the id 'X2'
@start_here = parent.at('div.block#X2')

# Search for parent elements of kind "_style" starting from _start_element
def search_for_parent_element(_start_element, _style)
  unless _start_element.nil?
    # have we already found what we're looking for?
    if _start_element.name == _style
      return _start_element
    end
    # _start_element is a div.block other than @start_here itself
    if _start_element[:class] == "block" && _start_element[:id] != @start_here[:id]
      # begin recursion with last child inside div.block
      from_child = search_for_parent_element(_start_element.children.last, _style)
      if(from_child)
        return from_child
      end
    end
    # begin recursion with previous element
    from_child = search_for_parent_element(_start_element.previous, _style) 
    return from_child ? from_child : false
  else
    return false
  end
end

# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
puts parent_element_h1 = search_for_parent_element(@start_here,"h1")

# this should be a Nokogiri::XML::Element of the nearest, previous h2. 
# in this example it's the one with the value 'Bar'
puts parent_element_h2 = search_for_parent_element(@start_here,"h2")

You can copy/paste it and run it as a Ruby script.

If you don't know the relationship between the elements, you can search for them this way (anywhere in the document):


# html code
text = "insert your html here"
# get doc object
doc = Nokogiri::HTML(text)
# get elements with the specified tag
elements = doc.search("//your_tag")

However, if you need to submit a form, you should use Mechanize:


require 'mechanize'

# create mech object (current Mechanize releases use Mechanize, not WWW::Mechanize)
mech = Mechanize.new
# load site
mech.get("address")
# select a form, in this case, I select the first form. You can select the one you need 
# from the array
form = mech.page.forms.first
# you fill the fields like this: form.name_of_the_field
form.element_name  = value
form.other_element = other_value

You can search the descendants of a Nokogiri::XML::Element using CSS selectors. You can traverse ancestors with the .parent method.

parent_element_h1 = value.css("h1").first.parent
parent_element_h2 = value.css("h2").first.parent

Licensed under: CC-BY-SA with attribution
