Question

I am finally parsing through Wikipedia's wiki text. It contains text of the following type:

{{Airport-list|the Solomon Islands}}

* '''AGAF''' (AFT) – [[Afutara Airport]] – [[Afutara]]
* '''AGAR''' (RNA) – [[Ulawa Airport]] – [[Arona]], [[Ulawa Island]]
* '''AGAT''' (ATD) – [[Uru Harbour]] – [[Atoifi]], [[Malaita]]
* '''AGBA''' – [[Barakoma Airport]] – [[Barakoma]]

I need to retrieve all lines, in a single array, that start with the pattern

* '''

I think regular expressions are called for here, but I am really bad at the regular expression part.

In another example, I have the following text:

{{otheruses}}
{{Infobox Settlement
|official_name          = Doha
|native_name        = {{rtl-lang|ar|الدوحة}} ''ad-Dawḥa''
|image_skyline          = Doha Sheraton.jpg
|imagesize              = 
|image_caption          = West Bay at night
|image_map              = QA-01.svg
|mapsize                = 100px
|map_caption            = Location of the municipality of Doha within [[Qatar]].
|pushpin_map            =
|pushpin_label_position = 
|pushpin_mapsize        = 
|subdivision_type       = [[Countries of the world|Country]]
|subdivision_name       = [[Qatar]]
|subdivision_type1      = [[Municipalities of Qatar|Municipality]]
|subdivision_name1      = [[Ad Dawhah]]
|established_title      = Established
|established_date       = 1850
|area_total_km2         = 132
|area_total_sq_mi       = 51
|area_land_km2          = 
|area_land_sq_mi        = 
|area_water_km2         = 
|area_water_sq_mi       = 
|area_water_percent     = 
|area_urban_km2         = 
|area_urban_sq_mi       =
|area_metro_km2         = 
|area_metro_sq_mi       = 
|population_as_of       = 2004
|population_note        = 
|population_footnotes = <ref name=poptotal>[http://www.planning.gov.qa/Qatar-Census-2004/Flash/introduction.html Qatar 2004 Census]</ref>
|population_total       = 339847
|population_metro       = 998651
|population_density_km2 = 2574
|population_density_sq_mi = 6690
|latd=25 |latm=17 | lats=12 |latNS=N 
|longd=51|longm=32 | longs=0| longEW=E 
|coordinates_display    = inline,title
|coordinates_type       = type:city_region:QA
|timezone               = [[Arab Standard Time|AST]]
|utc_offset             = +3
|website                = 
|footnotes              = 
}} <!-- Infobox ends -->
'''Doha''' ({{lang-ar|الدوحة}}, ''{{transl|ar|ad-Dawḥa}}'' or ''{{unicode|ad-Dōḥa}}'') is the [[capital city]] of [[Qatar]].  It has a population of 400,051 according to the 2005 census,<ref name="autogenerated1">[http://www.hotelrentalgroup.com/Qatar/Sheraton%20Doha%20Hotel%20&%20Resort.htm Sheraton Doha Hotel & Resort | Hotel discount bookings in Qatar<!-- Bot generated title -->]</ref> and is located in the [[Ad Dawhah]] municipality on the [[Persian Gulf]].  Doha is Qatar's largest city, with over 80% of the nation's population residing in Doha or its surrounding [[suburbs]], and is also the economic center of the country. 
It is also the seat of government of Qatar, which is ruled by [[Sheikh Hamad bin Khalifa Al Thani]]–the current ruling Emir of Qatar. 

Here I need to extract the infobox. The infobox includes all the text starting at the first occurrence of

{{Infobox Settlement

and ending with the first occurrence of

}} <!-- Infobox ends -->

I am completely lost when it comes to regular expressions and could use some help here. I am using PHP.


Edit! Help!

I have been fighting this for 40 hours and cannot get the stupid regular expression to work properly :( So far I only have this:

{infobox [^ b ( r | n)}}} ( r | n) b]*[ b ( r | n)}} ( r | n) ( r | n) b

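But it does not work. I want it to read all the string data starting at {{infobox and ending with a \n}}\n.

I am using PHP and cannot get this to work :( It returns the first occurrence of }}, ignoring the fact that I want it to search for a line feed before the }}. :'(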


Solution

I need to extract the infobox ...

Try this, this time making sure dotall mode is enabled:

\{\{Infobox.*?(?=\}\} <!-- Infobox ends -->)


And again, explanation for that:

(?xs)    # x=comment mode, s=dotall mode
\{\{     # two opening braces (special char, so needs escaping here.)
Infobox  # literal text
.*?      # any char (including newlines), non-greedily match zero or more times.
(?=      # begin positive lookahead
\}\}     # two closing braces
[ ]<!--[ ]Infobox[ ]ends[ ]--> # literal text (spaces need brackets in comment mode)
)        # end positive lookahead

This will match up to (but excluding) the ending expression. If you need the ending included in the match, drop the lookahead wrapper, (?= and its closing parenthesis, and keep just its contents.
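
In PHP, dotall mode is the s modifier written after the closing delimiter. The following is a minimal sketch of both variants (ending excluded, ending included), assuming the wiki text is already held in a string. The sample text and the variable names are illustrative and not taken from the question's real data source.

```php
<?php
// Hypothetical sample input; real page text would come from wherever you load it.
$wikiText = "{{otheruses}}\n"
    . "{{Infobox Settlement\n"
    . "|official_name = Doha\n"
    . "|native_name   = {{rtl-lang|ar|الدوحة}} ''ad-Dawḥa''\n"
    . "}} <!-- Infobox ends -->\n"
    . "'''Doha''' is the [[capital city]] of [[Qatar]].\n";

// "#" is used as the pattern delimiter; the trailing "s" enables dotall mode.
$withoutEnding = '#\{\{Infobox.*?(?=\}\} <!-- Infobox ends -->)#s';
// Same pattern with the lookahead wrapper dropped, so the match includes the ending.
$withEnding    = '#\{\{Infobox.*?\}\} <!-- Infobox ends -->#s';

$infoboxBody = null;
$infoboxFull = null;

if (preg_match($withoutEnding, $wikiText, $m)) {
    $infoboxBody = $m[0];   // stops just before "}} <!-- Infobox ends -->"
}
if (preg_match($withEnding, $wikiText, $m)) {
    $infoboxFull = $m[0];   // ends with "}} <!-- Infobox ends -->"
}

echo $infoboxFull, "\n";
```

Note that the nested {{rtl-lang|...}} template does not end the match early, because its closing braces are not followed by the marker comment.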

Update, based on comment to answer:

\{\{Infobox.*?(?=\n\}\}\n)

Same as above, but lookahead looks for two braces on their own line.

To optionally allow the comment also, use:

\{\{Infobox.*?(?=\n\}\}(?: <!-- Infobox ends -->)?\n)
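
A possible way to wrap the last pattern in PHP is sketched below. The function name extractInfobox and the line-ending normalisation step are assumptions added for illustration; the answer itself only supplies the pattern.

```php
<?php
/**
 * Hypothetical helper (not part of the answer): returns the infobox text,
 * or null when no infobox closed by a line starting with "}}" is found.
 */
function extractInfobox(string $wikiText): ?string
{
    // Assumption: the input may use Windows line endings; the pattern expects "\n".
    $wikiText = str_replace("\r\n", "\n", $wikiText);

    // "s" = dotall. The lookahead requires "}}" at the start of a line,
    // optionally followed by the " <!-- Infobox ends -->" comment.
    $pattern = '#\{\{Infobox.*?(?=\n\}\}(?: <!-- Infobox ends -->)?\n)#s';

    return preg_match($pattern, $wikiText, $m) === 1 ? $m[0] : null;
}

// Made-up inputs covering both endings.
$plain     = "{{Infobox Settlement\n|name = {{lang|ar|x}} y\n|utc_offset = +3\n}}\nText.\n";
$commented = "{{Infobox Settlement\n|utc_offset = +3\n}} <!-- Infobox ends -->\nText.\n";

var_dump(extractInfobox($plain));
var_dump(extractInfobox($commented));
var_dump(extractInfobox("no infobox here"));
```

Because the terminator sits inside a lookahead, the returned text stops before the closing braces; append "\n}}" yourself if you need the complete template.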

Other tips

MediaWiki is open-source. Have a look at their source code ... ;-)

I think the best way is to merge all lines into one string, especially for the infobox.

Then something along the lines of

$reg = "/\n(\* '''[^\n]*)/"; // PCRE patterns in PHP need delimiters (here: /)

for the first part (it captures everything after a newline that starts with * ''' up to the next newline).

And for the second part I'm not quite sure right now, but this is a nice place to play around a bit: http://www.solmetra.com/scripts/regex/index.php

And here is a short reference for regular expression syntax: http://www.regular-expressions.info/reference.html

I need to retrieve all lines in a single array which start with the pattern * '''

Enable multiline mode and ensure dotall mode is disabled, and use this:

^\* '''.*$


That expression dissected is:

(?xm-s) # Flags:
        # x enables comment mode (spaces ignored, hashes start comments)
        # m enables multiline mode (^$ match lines)
        # -s disables dotall (so . does not match newline)
^       # start of line
\*      # literal asterisk
[ ]     # literal space (needs brackets in comment mode, but not otherwise)
'''     # three literal apostrophes
.*      # any character (excluding newline), greedily matched zero or many times.
$       # end of line
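
In PHP this maps onto preg_match_all with the m modifier and without s. The following is a minimal sketch using a few lines from the question's airport list; the variable names are illustrative.

```php
<?php
// Sample lines taken from the question's airport list.
$wikiText = "{{Airport-list|the Solomon Islands}}\n"
    . "\n"
    . "* '''AGAF''' (AFT) – [[Afutara Airport]] – [[Afutara]]\n"
    . "* '''AGAR''' (RNA) – [[Ulawa Airport]] – [[Arona]], [[Ulawa Island]]\n"
    . "* '''AGBA''' – [[Barakoma Airport]] – [[Barakoma]]\n";

// "m" makes ^ and $ match at line boundaries; "s" is left off so "." stops at newlines.
// Inside a double-quoted PHP string, "\*" and the trailing "$" pass through unchanged.
preg_match_all("/^\* '''.*$/m", $wikiText, $matches);

$airportLines = $matches[0];   // plain array, one element per matching line
print_r($airportLines);
```

If the text uses \r\n line endings, each matched line will probably keep a trailing \r, so normalising line endings first may be worthwhile.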
License: CC-BY-SA with attribution
Not affiliated with StackOverflow