Is there a more Pythonic way to merge two HTML header rows with colspans?

https://stackoverflow.com/questions/277187

07-07-2019

Question

I am using BeautifulSoup in Python to parse some HTML. One of the problems I am dealing with is that the colspans can differ from one header row to the next. (Header rows, in my jargon, are the rows that have to be combined to get the column headings.) That is, one cell may span several columns above or below it, and the words need to be appended or prepended depending on the spanning. Below is a routine that does this. I use BeautifulSoup to pull the colspans and the contents of each cell in each row. longHeader is the contents of the header row with the most items, and spanLong is a list with the colspan of each item in that row. This works, but it does not look very Pythonic.

Also, this will not work if the diff is < 0. I can fix that with the same approach I used to get this far. But before I do, I am wondering whether anyone can take a quick look and suggest a more Pythonic approach. I am a long-time SAS programmer, so I struggle to break the mold: I tend to write code as if I were writing a SAS macro.

longHeader=['','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader=['','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
combinedHeader=[]
sumSpanLong=0
sumSpanShort=0
spanDiff=0
longHeaderCount=0

for each in range(len(shortHeader)):
    sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
    sumSpanShort=sumSpanShort+spanShort[each]
    spanDiff=sumSpanShort-sumSpanLong
    if spanDiff==0:
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        continue
    for i in range(0,spanDiff):
            combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
            longHeaderCount=longHeaderCount+1
            sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
            spanDiff=sumSpanShort-sumSpanLong
            if spanDiff==0:
                combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
                longHeaderCount=longHeaderCount+1
                break

print combinedHeader

Solution

You have a lot going on in this example.

You have "over-processed" the Beautiful Soup Tag objects to make lists. Leave them as Tags.
All merge algorithms of this kind are hard. It helps to treat the two things being merged symmetrically.

Here is a version that should work directly with the Beautiful Soup Tag objects. In addition, this version assumes nothing about the lengths of the two rows.

def merge3( row1, row2 ):
    i1= 0
    i2= 0
    result= []
    while i1 != len(row1) or i2 != len(row2):
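        if i1 == len(row1):
            # row1 is exhausted: take the remaining cells from row2
            result.append( ' '.join(row2[i2].contents) )
            i2 += 1
        elif i2 == len(row2):
            # row2 is exhausted: take the remaining cells from row1
            result.append( ' '.join(row1[i1].contents) )
            i1 += 1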
        else:
            if row1[i1]['colspan'] < row2[i2]['colspan']:
                # Fill extra cols from row1
                c1= row1[i1]['colspan']
                while c1 != row2[i2]['colspan']:
                    result.append( ' '.join(row2[i2].contents) )
                    c1 += 1
            elif row1[i1]['colspan'] > row2[i2]['colspan']:
                # Fill extra cols from row2
                c2= row2[i2]['colspan']
                while row1[i1]['colspan'] != c2:
                    result.append( ' '.join(row1[i1].contents) )
                    c2 += 1
            else:
                assert row1[i1]['colspan'] == row2[i2]['colspan']
                pass
            txt1= ' '.join(row1[i1].contents)
            txt2= ' '.join(row2[i2].contents)
            result.append( txt1 + " " + txt2 )
            i1 += 1
            i2 += 1
    return result
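
The advice about symmetry can be illustrated without Tag objects. The following is only a minimal Python 3 sketch, not the answerer's method: it assumes each row has already been reduced to a list of (text, colspan) pairs, and the name merge_symmetric is made up for illustration. Both rows are walked with the same bookkeeping (columns remaining in the current cell), so neither is treated as the "long" one.

```python
def merge_symmetric(row1, row2):
    """Merge two header rows given as lists of (text, colspan) pairs.

    Returns (label, width) pairs, one per stretch of columns in which
    neither row starts a new cell. Both rows must cover the same width.
    """
    if sum(span for _, span in row1) != sum(span for _, span in row2):
        raise ValueError("rows span different numbers of columns")

    merged = []
    i1 = i2 = 0
    rem1 = row1[0][1] if row1 else 0  # columns left in the current row1 cell
    rem2 = row2[0][1] if row2 else 0  # columns left in the current row2 cell
    while i1 < len(row1) and i2 < len(row2):
        width = min(rem1, rem2)
        label = " ".join(t for t in (row1[i1][0], row2[i2][0]) if t)
        merged.append((label, width))
        rem1 -= width
        rem2 -= width
        if rem1 == 0:  # current row1 cell is used up, move to the next
            i1 += 1
            if i1 < len(row1):
                rem1 = row1[i1][1]
        if rem2 == 0:  # same step for row2
            i2 += 1
            if i2 < len(row2):
                rem2 = row2[i2][1]
    return merged


top = [("A", 3), ("B", 1)]
bottom = [("x", 1), ("y", 2), ("z", 1)]
print(merge_symmetric(top, bottom))
```

Because the two rows are handled identically, swapping the arguments only changes the word order inside each label.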

Other tips

Here is a modified version of your algorithm. zip is used to iterate over the short spans and short headers together, and a class object is used to count and step through the long items, as well as to combine the headers. while is more appropriate for the inner loop. (Forgive the overly short names.)

class collector(object):
    def __init__(self, header):
        self.longHeader = header
        self.combinedHeader = []
        self.longHeaderCount = 0

    def combine(self, shortValue):
        self.combinedHeader.append(
            [self.longHeader[self.longHeaderCount]+' '+shortValue] )
        self.longHeaderCount += 1
        return self.longHeaderCount

def main():
    longHeader = [
        '','','bananas','','','','','','','','','','trains','','planes','','','','']
    shortHeader = [
        '','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
    spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
    spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
    sumSpanLong=0
    sumSpanShort=0

    combiner = collector(longHeader)
    for sLen,sHead in zip(spanShort,shortHeader):
        sumSpanLong += spanLong[combiner.longHeaderCount]
        sumSpanShort += sLen
        while sumSpanShort - sumSpanLong > 0:
            combiner.combine(sHead)
            sumSpanLong += spanLong[combiner.longHeaderCount]
        combiner.combine(sHead)

    return combiner.combinedHeader
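
The running totals sumSpanLong and sumSpanShort are cumulative sums of the colspans, that is, the right-hand edge of each cell. As a hedged alternative (a Python 3 sketch, not part of the answer above), those edges can be computed up front with itertools.accumulate, and bisect can then find the short-header cell that covers each long-header cell. It assumes, as in the question's data, that every long cell lies entirely inside one short cell.

```python
from bisect import bisect_left
from itertools import accumulate

long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']
span_short = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]
span_long = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]

# Right-hand column edge of every cell in each row.
long_ends = list(accumulate(span_long))
short_ends = list(accumulate(span_short))

combined_header = []
for text, end in zip(long_header, long_ends):
    # First short cell whose right edge reaches this long cell's right edge.
    covering = bisect_left(short_ends, end)
    combined_header.append([text + ' ' + short_header[covering]])

print(combined_header)
```

The result keeps the one-item-list shape used in the question, so it can be compared directly with the output of the original routine.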

Maybe look at the zip function for some parts of the problem:

>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]

>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>>
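
Note that the zip output above has 15 pairs although the long header has 19 cells: zip stops at the shorter input. The short Python 3 sketch below (where zip returns an iterator rather than a list) shows that truncation next to itertools.zip_longest, which pads the shorter row instead. Neither call accounts for colspans, so this only illustrates the pairing behaviour.

```python
from itertools import zip_longest

long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']

# zip stops at the shorter input, so the last four long cells are dropped.
pairs = list(zip(long_header, short_header))

# zip_longest keeps every cell and pads the shorter row with fillvalue.
padded = list(zip_longest(long_header, short_header, fillvalue=''))

print(len(pairs), len(padded))
```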

enumerate can help avoid some of the complex indexing with counters:

>>> diff_list = []
>>> for place, header in enumerate(short_header):
        diff_list.append(abs(span_short[place] - span_long[place]))

>>> for place, num in enumerate(diff_list):
        if num:
            new_shortlist.extend(short_header[place] for item in range(num+1))
        else:
            new_shortlist.append(short_header[place])

>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',...
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...
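
A related idea, sketched here in Python 3 under the assumption that both rows cover the same total number of columns: repeat every cell once per column it spans, then zip the two expanded rows. The helper name expand is made up for illustration. Unlike the session above, this yields one entry per physical column (29 for the question's data) rather than one per long-header cell.

```python
def expand(cells, spans):
    """Repeat each cell text once for every column the cell spans."""
    return [text for text, span in zip(cells, spans) for _ in range(span)]


long_header = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
               'trains', '', 'planes', '', '', '', '']
short_header = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
                '', 'cargo', '', 'all other', '', '']
span_short = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]
span_long = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]

# Pair the rows column by column and join the non-empty parts.
columns = [
    ' '.join(part for part in pair if part)
    for pair in zip(expand(long_header, span_long),
                    expand(short_header, span_short))
]

print(columns)
```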

More Pythonic names may add clarity:

for each in range(len(short_header)):
    sum_span_long += span_long[long_header_count]
    sum_span_short += span_short[each]
    span_diff = sum_span_short - sum_span_long
    if not span_diff:
        combined_header.append...

I guess I am going to answer my own question, but I did receive a lot of help. Thanks for all of it. I got S.LOTT's answer working after a few small corrections. (They may be so small as to not be visible (joke).) So now the question is: why is this more Pythonic? I think I can see that it is less dense and that it works with the raw inputs instead of derived lists. I cannot judge whether it is easier to read, though it is easy to read.

S.LOTT's answer, corrected

row1=headerCells[0]
row2=headerCells[1]

i1= 0
i2= 0
result= []
while i1 != len(row1) or i2 != len(row2):
    if i1 == len(row1):
        result.append( ' '.join(row1[i1]) )
        i2 += 1
    elif i2 == len(row2):
        result.append( ' '.join(row2[i2]) )
        i1 += 1
    else:
        if int(row1[i1].get("colspan","1")) < int(row2[i2].get("colspan","1")):
            c1= int(row1[i1].get("colspan","1"))
            while c1 != int(row2[i2].get("colspan","1")):
                txt1= ' '.join(row1[i1])  # needed to add when working adjust opposing case
                txt2= ' '.join(row2[i2])  # needed to add when working adjust opposing case
                result.append( txt1 + " " + txt2 )  # needed to add when working adjust opposing case
                print 'stayed in middle', 'i1=',i1,'i2=',i2, ' c1=',c1
                c1 += 1
                i1 += 1    # Is this the problem it
        elif int(row1[i1].get("colspan","1"))> int(row2[i2].get("colspan","1")):
            # Fill extra cols from row2  Make same adjustment as above
            c2= int(row2[i2].get("colspan","1"))
            while int(row1[i1].get("colspan","1")) != c2:
                result.append( ' '.join(row1[i1]) )
                c2 += 1
                i2 += 1
        else:
            assert int(row1[i1].get("colspan","1")) == int(row2[i2].get("colspan","1"))
            pass

        txt1= ' '.join(row1[i1])
        txt2= ' '.join(row2[i2])
        result.append( txt1 + " " + txt2 )
        print 'went to bottom', 'i1=',i1,'i2=',i2
        i1 += 1
        i2 += 1
print result
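
The expression int(cell.get("colspan","1")) appears many times in the code above. One possible tidy-up, offered only as a sketch, is a small helper; the name colspan_of is made up for illustration. It relies only on a dict-style get method, which the code above already uses on the cells, so plain dicts stand in for Tag objects in this Python 3 example.

```python
def colspan_of(cell, default=1):
    """Return the cell's colspan as an int.

    cell is any object with a dict-style get(); a missing or
    non-numeric colspan falls back to default.
    """
    try:
        return int(cell.get("colspan", default))
    except (TypeError, ValueError):
        return default


# Plain dicts used as stand-ins for parsed table cells.
cells = [{"colspan": "3"}, {}, {"colspan": "oops"}]
print([colspan_of(cell) for cell in cells])
```

With such a helper the comparisons shrink to something like colspan_of(row1[i1]) < colspan_of(row2[i2]).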

Well, I have an answer now. I was thinking this over and decided to use parts of every answer. I still need to decide whether I want a class or a function, but I have an algorithm that I think is probably more Pythonic than any of the others. It borrows heavily from the answers provided by some very generous people. I appreciate them a lot, because I have learned a great deal from them.

To save time on building test cases, I am going to paste the complete code I have been working with in IDLE and follow it with a sample HTML file. Apart from deciding between a class and a function (and I need to think about how I am using this code in my program), I would be happy to see any improvements that make the code more Pythonic.

from BeautifulSoup import BeautifulSoup
original=file(r"C:\testheaders.htm").read()

soupOriginal=BeautifulSoup(original)
all_Rows=soupOriginal.findAll('tr')

header_Rows=[]
for each in range(len(all_Rows)):
    header_Rows.append(all_Rows[each])

header_Cells=[]
for each in header_Rows:
    header_Cells.append(each.findAll('td'))

temp_Header_Row=[]
header=[]
for row in range(len(header_Cells)):
    for column in range(len(header_Cells[row])):
        x=int(header_Cells[row][column].get("colspan","1"))
        if x==1:
            temp_Header_Row.append( ' '.join(header_Cells[row][column]) )
        else:
            for item in range(x):
                temp_Header_Row.append( ''.join(header_Cells[row][column]) )
    header.append(temp_Header_Row)
    temp_Header_Row=[]
combined_Header=zip(*header)

for each in combined_Header:
    print each
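
The code above targets Python 2 and the old BeautifulSoup 3 import. For readers on Python 3, here is a hedged sketch of the same expand-then-zip idea that uses only the standard library's html.parser instead of BeautifulSoup. The class and function names (HeaderRowParser, combined_headers) and the tiny sample table are made up for illustration. It assumes every row covers the same number of columns and that cells are properly closed.

```python
from html.parser import HTMLParser


class HeaderRowParser(HTMLParser):
    """Collect each <tr> as a list of cell texts, repeated colspan times."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside a <tr>
        self._span = 0     # colspan of the open cell, 0 outside a cell
        self._text = []    # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._span:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span:
            text = " ".join("".join(self._text).split())  # tidy whitespace
            self._row.extend([text] * self._span)
            self._span = 0
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def combined_headers(html):
    """Return one combined heading per column of the table in html."""
    parser = HeaderRowParser()
    parser.feed(html)
    parser.close()
    return [" ".join(part for part in column if part)
            for column in zip(*parser.rows)]


sample = """
<TABLE>
<TR><TD> </TD><TD nowrap colspan="2">FOODS</TD></TR>
<TR><TD>Name</TD><TD>APPLE PIE</TD><TD>BANANAS</TD></TR>
</TABLE>
"""
print(combined_headers(sample))
```

The design choice is the same as in the code above: once every cell is repeated across the columns it spans, zip(*rows) lines the rows up column by column and no span arithmetic is needed.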

Below is the correct content of the test file. Sorry, I tried to attach it, but I could not get that to work:

<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<TR valign="bottom"> <TD width="40%"> </TD> <TD width="5%"> </TD> <TD width="3%"> </TD> <TD width="3%"> </TD> <TD width="1%"> </TD> <TD width="5%"> </TD> <TD width="3%"> </TD> <TD width="3%"> </TD> <TD width="1%"> </TD> <TD width="5%"> </TD> <TD width="3%"> </TD> <TD width="1%"> </TD> <TD width="1%"> </TD> <TD width="5%"> </TD> <TD width="3%"> </TD> <TD width="1%"> </TD> <TD width="1%"> </TD> <TD width="5%"> </TD> <TD width="3%"> </TD> <TD width="3%"> </TD> <TD width="1%"> </TD> </TR>
<TR style="font-size: 10pt" valign="bottom"> <TD> </TD> <TD> </TD> <TD> </TD> <TD> </TD> <TD> </TD> <TD> </TD> <TD> </TD> <TD> </TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">FOODS WE LIKE</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2"> </TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2"> </TD> <TD> </TD> </TR>
<TR style="font-size: 10pt" valign="bottom"> <TD> </TD> <TD> </TD> <TD nowrap align="CENTER" colspan="6">SILLY STUFF</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">OTHER THAN</TD> <TD> </TD> <TD> </TD> <TD nowrap align="CENTER" colspan="6">FAVORITE PEOPLE</TD> <TD> </TD> </TR>
<TR style="font-size: 10pt" valign="bottom"> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">MONTY PYTHON</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">CHERRYPY</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">APPLE PIE</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">MOTHERS</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">FATHERS</TD> <TD> </TD> </TR>
<TR style="font-size: 10pt" valign="bottom"> <TD nowrap align="left">Name</TD> <TD> </TD> <TD nowrap align="right" colspan="2">SHOWS</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">PROGRAMS</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">BANANAS</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">PERFUME</TD> <TD> </TD> <TD> </TD> <TD nowrap align="right" colspan="2">TOOLS</TD> <TD> </TD> </TR>
</TABLE>

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow