Question

I have a string such as this:

Advances in the field of radiotherapy

I want common stop words such as "in", "the", "of", etc. removed from the string, and the remaining words joined with "OR", so that the result looks like this:

Advances OR field OR radiotherapy

The list of stop words can grow, so I don't want to use a replace() function to remove the stop words. Is there a way that I can keep a list of all the stop words and use that list to process strings?

An XSLT 2.0 solution is fine.

Solution

You can define a global parameter with your stop words, e.g.

<xsl:param name="stop-words" select="'in', 'the', 'of'"/>

and then use analyze-string e.g.

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:param name="stop-words" select="'in', 'the', 'of'"/>

<xsl:param name="rep" select="'OR'"/>

<xsl:variable name="regex" 
  select="concat('(^|\W)(', string-join($stop-words, '|'), ')', '(\W(', string-join($stop-words, '|'), '))*($|\W)')"/>

<xsl:template match="@* |  node()">
  <xsl:copy>
    <xsl:apply-templates select="@* , node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="text()" priority="5">
  <xsl:analyze-string select="." regex="{$regex}">
    <xsl:matching-substring>
      <xsl:value-of select="concat(regex-group(1), $rep, regex-group(5))"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>

With the input being

<text>Advances in the field of radiotherapy</text>

the output with Saxon 9.5 is

<text>Advances OR field OR radiotherapy</text>
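
To see why the template uses regex-group(1) and regex-group(5), it can help to write out what $regex evaluates to for the three sample stop words. The literal below is only that expansion, worked out by hand from the concat() call above; it is not an extra part of the stylesheet.

```xml
<!-- Hand expansion of $regex for the stop words 'in', 'the', 'of'.
     Capturing groups, counted by opening parenthesis:
       1: (^|\W)            start of string, or the non-word character before the run
       2: (in|the|of)       the first stop word of the run
       3: (\W(in|the|of))   each further stop word with its separator (repeated by *)
       4: (in|the|of)       the stop word inside group 3
       5: ($|\W)            end of string, or the non-word character after the run -->
<xsl:variable name="regex" select="'(^|\W)(in|the|of)(\W(in|the|of))*($|\W)'"/>
```

For the match " in the ", groups 1 and 5 each hold a single space, so concat(regex-group(1), $rep, regex-group(5)) produces " OR ".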

Based on your comment, I think you simply want a function that strips the stop words and then replaces the remaining whitespace with " OR ":

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:mf="http://example.com/mf"
  exclude-result-prefixes="mf">

<xsl:param name="stop-words" select="'in', 'the', 'of'"/>

<xsl:param name="rep" select="' OR '"/>

<xsl:variable name="regex" 
  select="concat('(^|\W)(', string-join($stop-words, '|'), ')', '(\W(', string-join($stop-words, '|'), '))*($|\W)')"/>

<xsl:function name="mf:process">
  <xsl:param name="input"/>
  <xsl:sequence select="replace(replace($input, $regex, '$1$5'), '\s+', $rep)"/>
</xsl:function>

<xsl:template match="@* |  node()">
  <xsl:copy>
    <xsl:apply-templates select="@* , node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="text()[normalize-space()]" priority="5">
  <xsl:value-of select="mf:process(.)"/>
</xsl:template>

</xsl:stylesheet>

which transforms

<root>
<text>Advances in the field of radiotherapy</text>
<text>Advances made in the field of radiotherapy</text>
</root>

into

<root>
<text>Advances OR field OR radiotherapy</text>
<text>Advances OR made OR field OR radiotherapy</text>
</root>
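
Since the question mentions that the list of stop words can grow, one option is to keep the list in a separate XML file and load it into the same parameter. This is only a sketch: the file name stop-words.xml and its element names are hypothetical, and the file is assumed to sit next to the stylesheet.

```xml
<!-- Assumed layout of the hypothetical file stop-words.xml:
       <stop-words>
         <word>in</word>
         <word>the</word>
         <word>of</word>
       </stop-words>
     document() resolves the relative URI against the stylesheet's base URI;
     string() turns each word element into a plain string. -->
<xsl:param name="stop-words"
  select="document('stop-words.xml')/stop-words/word/string()"/>
```

The rest of the stylesheet stays unchanged, because $regex is still built from $stop-words with string-join(). Words containing regular expression metacharacters would need escaping before they go into the pattern.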

It might be possible to simplify the pattern further; I will leave that as an exercise.
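
One possible simplification, as a sketch rather than a drop-in replacement, avoids the regular expression entirely: split the text on whitespace, drop every token that appears in the stop-word list, and join the rest. It assumes words are separated by whitespace only (attached punctuation such as "of," would not be recognised), and it compares in lower case, which the regex version does not do. The function name mf:process-tokens is made up for this example.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:mf="http://example.com/mf"
  exclude-result-prefixes="xs mf">

<xsl:param name="stop-words" as="xs:string*" select="'in', 'the', 'of'"/>

<xsl:param name="rep" as="xs:string" select="' OR '"/>

<!-- Hypothetical helper: tokenize on single spaces after normalizing whitespace,
     keep tokens whose lower-cased form is not a stop word, join them with $rep. -->
<xsl:function name="mf:process-tokens" as="xs:string">
  <xsl:param name="input" as="xs:string"/>
  <xsl:sequence
    select="string-join(
              tokenize(normalize-space($input), ' ')[not(lower-case(.) = $stop-words)],
              $rep)"/>
</xsl:function>

<!-- Identity template: copy everything that is not handled below. -->
<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* , node()"/>
  </xsl:copy>
</xsl:template>

<!-- Process only text nodes that contain non-whitespace characters. -->
<xsl:template match="text()[normalize-space()]" priority="5">
  <xsl:value-of select="mf:process-tokens(.)"/>
</xsl:template>

</xsl:stylesheet>
```

For the sample text "Advances in the field of radiotherapy" the function returns "Advances OR field OR radiotherapy". Because stop words are filtered out before joining, text that begins or ends with a stop word does not pick up a stray separator.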

Licensed under: CC-BY-SA with attribution