Question

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

Is there a regular expression that extracts only the contents of <title>, so that I don't have to remove the tags?

Solution

Use ( ) in the regexp and group(1) (https://docs.python.org/2/library/re.html#re.MatchObject.group) in Python to retrieve the captured string. re.search (https://docs.python.org/2/library/re.html#re.search) returns None if it finds no match, so do not call group() directly on its result:

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)

Other tips

Try using capture groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Note that starting with Python 3.8 and the introduction of assignment expressions (PEP 572, https://www.python.org/dev/peps/pep-0572/) (the := operator), it is possible to improve a bit on Krzysztof Krason's solution (https://stackoverflow.com/a/1327389/9297144) by capturing the match result directly in the if condition as a variable and reusing it in the body of the condition:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
# hello

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

The pieces of code provided so far do not cope with exceptions. May I suggest:

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns an empty string by default if the pattern is not found, or the first captured group otherwise.

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

I recommend Beautiful Soup. Soup is a very good library for parsing your whole HTML document.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# .string gives the text inside <title>; .name would only give the tag name 'title'
titleName = soup.title.string

I think this should suffice:

import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)

... assuming that your text (HTML) is in a variable named "text".

This also assumes that no other HTML tags can legally be embedded inside an HTML TITLE tag, and that there is no way to legally embed any other < character within such a container/block.

However ...

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you are going to write a full parser, which would be extra work when various HTML, XML and SGML parsers are already in the standard libraries.)

If you are handling "real world" tag soup HTML (which often does not conform to any SGML/XML validator), then use the BeautifulSoup package (https://www.crummy.com/software/BeautifulSoup/). It is not in the standard libraries (yet), but it is widely recommended for this purpose.

Another option is lxml (http://lxml.de/), which is written for properly structured (standards-conformant) HTML. It also has an option to fall back to using BeautifulSoup as a parser: ElementSoup (http://lxml.de/elementsoup.html).

Licensed under: CC-BY-SA (https://creativecommons.org/licenses/by-sa/3.0/) with attribution (https://stackoverflow.blog/2009/06/25/attribution-required/)
Not affiliated with StackOverflow (https://stackoverflow.com/)
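As a rough illustration of the two approaches discussed in the answers above, here is a minimal sketch that puts a guarded regex next to a parser-based version built only on the standard library's html.parser module. The names TITLE_RE, title_with_regex, TitleParser, title_with_parser and sample are hypothetical and chosen for this example only. The regex variant is an assumption on my part rather than something taken from the answers: it uses a non-greedy group and re.DOTALL so that a title spanning several lines is still matched.

```python
import re
from html.parser import HTMLParser

# Non-greedy group so the match stops at the first </title>;
# DOTALL lets "." cross line breaks inside the title;
# [^>]* tolerates attributes on the opening tag.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_with_regex(html):
    """Return the title text, or None when no <title> element is found."""
    match = TITLE_RE.search(html)
    # re.search returns None on failure, so check before calling group().
    return match.group(1).strip() if match else None


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_with_parser(html):
    """Return the title text, or None when no <title> element is found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() if parser.done else None


sample = "<html><head><TITLE>\n  Hello, world\n</TITLE></head><body></body></html>"
print(title_with_regex(sample))   # Hello, world
print(title_with_parser(sample))  # Hello, world
print(title_with_regex("<p>no title here</p>"))  # None
```

Both helpers return None instead of raising when the page has no title, so the caller can test the result with a plain if. For messy real-world pages the parser-based route, or Beautiful Soup as recommended above, is likely the safer choice.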