Question

I'm building an application with eXist-db that works with TEI files and transforms them into HTML.

For the search function I configured Lucene to ignore some of the tags.

<collection xmlns="http://exist-db.org/collection-config/1.0" xmlns:teins="http://www.tei-c.org/ns/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">

        <fulltext default="none" attributes="false"/>

        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text match="//teins:TEI">

                <inline qname="p"/>
                <inline qname="text"/>

                <ignore qname="teins:del"/>
                <ignore qname="teins:sic"/>
                <ignore qname="teins:index"/>
                <ignore qname="teins:term"/>
                <ignore qname="teins:note"/>

            </text>
        </lucene>

    </index>
</collection>

That mostly works: text from the ignored elements no longer produces search hits, but it still shows up in the snippets before and after the matched text, which are returned by the kwic module. Is there a way to remove it from the snippets, or to apply an XSL transformation before indexing?

Example TEI:

...daß er sie zu entwerten sucht. Wie 
                   <index>
                        <term>Liebe</term>
                        <index>
                            <term>und Hass</term>
                        </index>
                    </index>
Liebe Ausströmung inneren Wertes ist,... 

When I search for "Ausströmung", the query returns

 ....sucht. Wie Liebe und Hass Liebe    Ausströmung     inneren Wertes ist,...

But it should return

 ....sucht. Wie Liebe   Ausströmung     inneren Wertes ist,...

When I search for "Hass", this text snippet does not show up in the results.

For the search functions, I'm closely following the Shakespeare example in the documentation.


Solution

Let's take eXist-db's Shakespeare app as the point of departure. Say you have index entries there. You do not want hits in the index terms, and the index configuration takes care of that. But you also do not want the terms output to the KWIC display, and that part you have to code yourself.

If you look in app.xql, you will see a function named app:filter, which is called from app:show-hits. You can use it to remove parts of the KWIC output, based on the name of the parent element of each text node that is output.

This will give what you want:

declare %private function app:filter($node as node(), $mode as xs:string) as xs:string? {
    (: qname attributes of all <ignore> elements in the index configuration :)
    let $ignored-qnames := doc('/db/system/config/db/apps/shakespeare/collection.xconf')//*:ignore/@qname/string()
    (: strip the namespace prefix, e.g. 'teins:term' becomes 'term';
       this assumes every ignored qname carries a prefix :)
    let $ignored-elements :=
        for $qname in $ignored-qnames
        return substring-after($qname, ':')
    return
        (: drop text nodes whose parent element should not appear in the KWIC output :)
        if (local-name($node/parent::*) = ('speaker', 'stage', 'head', $ignored-elements))
        then ()
        else if ($mode eq 'before')
        then concat($node, ' ')
        else concat(' ', $node)
};

You can of course hard-code the elements to ignore, as in ('speaker', 'stage', 'head', 'sic', 'term', 'note'), but I wanted to show that you do not have to. ('index' is not needed in that list: the text of an index entry always sits inside 'term', and the filter looks only at the parent of each text node.) If you do not hard-code the elements, you should certainly move the assignment of $ignored-elements out of the function, for instance to a variable declared in the query prolog, so that collection.xconf is not read from the database for every text node encountered. Reading it per node, as above, is wasteful; I have put it all in one function only for the sake of simplicity.
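
A minimal sketch of that refactoring, assuming it lives in the same app.xql module (so the app prefix is already declared) and that the configuration path is the one used above. The variable name $app:ignored-elements is made up for this example and is not part of the Shakespeare app:

```xquery
(: Prolog variable: the index configuration is read when the variable is
   initialised, not on every call to app:filter.
   $app:ignored-elements is a hypothetical name chosen for this sketch. :)
declare variable $app:ignored-elements as xs:string* :=
    for $qname in doc('/db/system/config/db/apps/shakespeare/collection.xconf')//*:ignore/@qname/string()
    (: assumes every ignored qname carries a prefix, e.g. 'teins:term' :)
    return substring-after($qname, ':');

declare %private function app:filter($node as node(), $mode as xs:string) as xs:string? {
    (: suppress text nodes whose parent element is in the ignore list :)
    if (local-name($node/parent::*) = ('speaker', 'stage', 'head', $app:ignored-elements))
    then ()
    else if ($mode eq 'before')
    then concat($node, ' ')
    else concat(' ', $node)
};
```

The filter logic is unchanged; only the lookup of the ignored element names has moved out of the function body.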

PS: Namespace prefixes can be anything you choose, but the conventional prefix for the http://www.tei-c.org/ns/1.0 namespace is "tei", and changing it to "teins" can only lead to confusion.

Licensed under: CC-BY-SA with attribution