JavaScript regex to extract anchor text and URL from anchor tags

https://stackoverflow.com/questions/369147

21-08-2019

Question

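I have a paragraph of text in a JavaScript variable called 'input_content', and that text contains multiple anchor tags/links. I would like to match all of the anchor tags and extract the anchor text and URL, and put them into an array like (or similar to) this: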

Array
(
    [0] => Array
        (
            [0] => <a href="http://yahoo.com">Yahoo</a>
            [1] => http://yahoo.com
            [2] => Yahoo
        )
    [1] => Array
        (
            [0] => <a href="http://google.com">Google</a>
            [1] => http://google.com
            [2] => Google
        )
)

I've taken a crack at it (http://pastie.org/339755), but I am stumped beyond this point. Thanks for the help!

Solution

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

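This assumes that your anchors will always be in the form <a href="...">...</a>, i.e. it won't work if there are any other attributes (for example, target). The regular expression can be improved to accommodate this.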

To break down the regular expression:

/ -> start regular expression
  [^<]* -> skip all characters until the first <
  ( -> start capturing first token
    <a href=" -> capture first bit of anchor
    ( -> start capturing second token
        [^"]+ -> capture all characters until a "
    ) -> end capturing second token
    "> -> capture more of the anchor
    ( -> start capturing third token
        [^<]+ -> capture all characters until a <
    ) -> end capturing third token
    <\/a> -> capture last bit of anchor
  ) -> end capturing first token
/g -> end regular expression, add global flag to match all anchors in string

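Each time the anonymous function is called, it receives the three tokens as its second, third and fourth arguments, namely arguments[1], arguments[2] and arguments[3].

arguments[1] is the entire anchor.
arguments[2] is the href part.
arguments[3] is the inner text.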

We'll use a hack to push these three arguments as a new array onto our main matches array. The arguments built-in variable is not a true JavaScript array, so we have to apply the slice Array method to it to extract the items we want:

Array.prototype.slice.call(arguments, 1, 4)

This extracts the items from arguments starting at index 1 and ending at index 4 (not inclusive).
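As a small standalone illustration of that slice call (the function name `pickTokens` and its inputs are hypothetical, chosen only for this sketch), the following shows that `arguments` is array-like rather than a real array, and how slice copies indices 1 through 3 out of it:

```javascript
// Hypothetical helper: returns arguments[1]..arguments[3] as a real array.
function pickTokens() {
    // `arguments` has a length and numeric indices, but it is not an Array instance.
    var isRealArray = Array.isArray(arguments);
    // Borrow Array.prototype.slice: start at index 1, stop before index 4.
    var tokens = Array.prototype.slice.call(arguments, 1, 4);
    return { isRealArray: isRealArray, tokens: tokens };
}

// Mimic what replace() passes: the full match, three captured tokens, then the offset.
var result = pickTokens("full match", "token 1", "token 2", "token 3", 42);
console.log(result.isRealArray); // false
console.log(result.tokens);
```

The full match at index 0 and the trailing offset are left out, which is why the start and end indices are 1 and 4.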

var input_content = "blah \
    <a href=\"http://yahoo.com\">Yahoo</a> \
    blah \
    <a href=\"http://google.com\">Google</a> \
    blah";

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

alert(matches.join("\n"));

Gives:

<a href="http://yahoo.com">Yahoo</a>,http://yahoo.com,Yahoo
<a href="http://google.com">Google</a>,http://google.com,Google
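The answer notes that the regular expression could be improved to accommodate other attributes. One possible way to do that is sketched below. The function name `extractAnchors` and the relaxed pattern are assumptions for illustration, not part of the original answer, and the pattern still assumes a double-quoted href and anchor text with no nested tags.

```javascript
// Hypothetical helper: collects [full anchor, href, text] for each simple anchor.
function extractAnchors(html) {
    // "<a", any attributes, a double-quoted href, any further attributes, ">",
    // then text containing no "<", then the closing tag.
    var anchorPattern = /<a\b[^>]*?\bhref="([^"]*)"[^>]*>([^<]*)<\/a>/gi;
    var matches = [];
    var m;
    // With the g flag, each exec() call resumes after the previous match.
    while ((m = anchorPattern.exec(html)) !== null) {
        matches.push([m[0], m[1], m[2]]);
    }
    return matches;
}

var sample = 'blah <a target="_blank" href="http://yahoo.com">Yahoo</a> blah ' +
    '<a href="http://google.com" class="ext">Google</a> blah';
var found = extractAnchors(sample);
console.log(found.length);
console.log(found[0][1], found[0][2]);
```

Using exec() in a loop avoids relying on replace() for its side effects, at the cost of a few more lines.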

Other answers

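Since you're presumably running the JavaScript in a web browser, regex seems like a bad idea for this. If the paragraph came from the page in the first place, get a handle on its container, call .getElementsByTagName() to get the anchors, and then extract the values you want from them.

If that's not possible, create a new HTML element object, assign your text to its .innerHTML property, and then call .getElementsByTagName().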

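I think Joel has the right of it. Regexes are notorious for playing poorly with markup, simply because there are too many possibilities to consider. Are there other attributes on the anchor tags? What order are they in? Is the separating whitespace always a single space? Since you already have the browser's HTML parser available, it's best to put that to work instead.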

function getLinks(html) {
    var container = document.createElement("p");
    container.innerHTML = html;

    var anchors = container.getElementsByTagName("a");
    var list = [];

    for (var i = 0; i < anchors.length; i++) {
        var href = anchors[i].href;
        var text = anchors[i].textContent;

        if (text === undefined) text = anchors[i].innerText;

        list.push(['<a href="' + href + '">' + text + '</a>', href, text]);
    }

    return list;
}

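This returns an array like the one you describe, regardless of how the links are stored. You can change the function to work on a passed element instead of text by renaming the parameter to "container" and removing the first two lines. The textContent/innerText property gets the text displayed for the link, stripped of any markup (bold/italic/font/…). If you want to preserve the markup, you can replace .textContent with .innerHTML and remove the inner if () statement.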
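The element-based variant described above might look like the sketch below. The name `getLinksFromContainer` is hypothetical, and the function only assumes its argument exposes `getElementsByTagName`, as DOM elements do. The stand-in object in the usage lines exists solely so the sketch can run outside a browser; in a page you would pass a real element.

```javascript
// Hypothetical variant of getLinks() that takes an element instead of HTML text.
function getLinksFromContainer(container) {
    var anchors = container.getElementsByTagName("a");
    var list = [];
    for (var i = 0; i < anchors.length; i++) {
        var href = anchors[i].href;
        var text = anchors[i].textContent;
        // Fallback for very old browsers that lack textContent.
        if (text === undefined) text = anchors[i].innerText;
        list.push(['<a href="' + href + '">' + text + '</a>', href, text]);
    }
    return list;
}

// Stand-in for a DOM element (an assumption for this demo only).
var fakeContainer = {
    getElementsByTagName: function (tagName) {
        return tagName === "a"
            ? [{ href: "http://yahoo.com/", textContent: "Yahoo" }]
            : [];
    }
};
var links = getLinksFromContainer(fakeContainer);
console.log(links[0].join(","));
```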

I think jQuery would be your best bet. This isn't the best script, and I'm sure others can offer something better, but it creates exactly the array you're looking for.

<script type="text/javascript">
    // From http://brandonaaron.net Thanks!
    jQuery.fn.outerHTML = function() {
        return $('<div>').append( this.eq(0).clone() ).html();
    };    

    var items = new Array();
    var i = 0;

    $(document).ready(function(){
        $("a").each(function(){
            items[i] = {el:$(this).outerHTML(),href:this.href,text:this.text};
            i++;      
        });
    });

    function showItems(){
        alert(items);
    }

</script>

To extract the URL:

var pattern = /.*href="(.*)".*/; var url = string.replace(pattern, '$1');

Demo:

//var string = '<a id="btn" target="_blank" class="button" href="https://yourdomainame.com:4089?param=751&amp;2ndparam=2345">Buy Now</a>;'
//uncomment the above as an example of link.outerHTML

var string = link.outerHTML
var pattern = /.*href="(.*)".*/;
var href = string.replace(pattern,'$1');
alert(href)
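One caveat with the pattern above: the greedy `(.*)` runs to the last double quote in the string, so it returns the bare URL only when href is the final attribute. A minimal sketch of a narrower match follows; the function name `extractHref` and the sample markup are assumptions for illustration.

```javascript
// Hypothetical helper: returns the double-quoted href value, or null if none is found.
function extractHref(outerHTML) {
    // [^"]* stops at the first closing quote, so later attributes are not swallowed.
    var m = /\bhref="([^"]*)"/i.exec(outerHTML);
    return m ? m[1] : null;
}

var markup = '<a href="https://example.com/buy?id=751" target="_blank" class="button">Buy Now</a>';

// Greedy pattern from the answer: captures through the last quote in the string.
var greedy = markup.replace(/.*href="(.*)".*/, '$1');
console.log(greedy);

// Narrower pattern: captures only the href value.
console.log(extractHref(markup));
```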

For the "anchor text", why not use link.innerHTML?

For the benefit of searchers: I created something that works with additional attributes in the anchor tag. For those not familiar with regex, the dollar values ($1 etc.) are the regex group matches.

var text = 'This is my <a target="_blank" href="www.google.co.uk">link</a> Text';
var urlPattern = /([^+>]*)[^<]*(<a [^>]*(href="([^>^\"]*)")[^>]*>)([^<]+)(<\/a>)/gi;
var output = text.replace(urlPattern, "$1___$2___$3___$4___$5___$6");
alert(output);

See it working on jsfiddle and Regex101.

Alternatively, you can get the information from the groups like this:

var returnText = text.replace(urlPattern, function(fullText, beforeLink, anchorContent, href, lnkUrl, linkText, endAnchor){
                    return "The bits you want e.g. linkText";
                });
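To turn that callback form into the array the question asks for, one option is to push from inside the callback and ignore the string that replace returns. This is a sketch only: `collectLinks` is a hypothetical name, and the pattern is the `urlPattern` from this answer, reused unchanged.

```javascript
// Hypothetical helper built on the urlPattern from the answer above.
function collectLinks(text) {
    var urlPattern = /([^+>]*)[^<]*(<a [^>]*(href="([^>^\"]*)")[^>]*>)([^<]+)(<\/a>)/gi;
    var links = [];
    // replace() is used only to visit each match; its return value is discarded.
    text.replace(urlPattern, function (fullText, beforeLink, anchorContent, href, lnkUrl, linkText, endAnchor) {
        // Rebuild the whole anchor from its opening tag, text, and closing tag.
        links.push([anchorContent + linkText + endAnchor, lnkUrl, linkText]);
        return fullText;
    });
    return links;
}

var collected = collectLinks('This is my <a target="_blank" href="www.google.co.uk">link</a> Text');
console.log(collected[0]);
```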

License: CC-BY-SA with attribution
