Sitecore 7 Lucene: strip HTML from computed field

https://stackoverflow.com/questions/23219964

07-07-2023

Question

I am concatenating all "paragraph" child nodes of an "article" node in a computed field, so that an article can be searched for and found by its paragraph contents.

To do this, I added the following under the <fields hint="raw:AddComputedIndexField"> node:

<field fieldName="Paragraphs" storageType="YES" indexType="TOKENIZED">
    MyWebsite.ComputedFields.Paragraphs,MyWebsite
</field>

In this computed field, I concatenate the paragraph HTML bodies. I assumed Sitecore would strip the HTML for me (as it does for rich text fields), but it does not.

For "rich text" fields, the RichTextFieldReader appears to be what strips out the HTML tags, and decompiling the code confirms this. The RichTextFieldReader is configured in the FieldReaders section. I tried adding a raw:AddFieldReaderByFieldName section below it, but that does not seem to do anything.

The full section looks as follows, but does not work in this setup:

<FieldReaders type="Sitecore.ContentSearch.FieldReaders.FieldReaderMap, Sitecore.ContentSearch">
    <mapFieldByTypeName hint="raw:AddFieldReaderByFieldTypeName">
    ....default stuff here...
    </mapFieldByTypeName>
    <mapFieldByFieldName hint="raw:AddFieldReaderByFieldName">
        <fieldReader fieldName="Paragraphs" fieldReaderType="Sitecore.ContentSearch.FieldReaders.RichTextFieldReader, Sitecore.ContentSearch"></fieldReader>
    </mapFieldByFieldName>
</FieldReaders>

Are there any other clues on how to achieve this (by config, not by using the HTML Agility Pack etc.)?

Solution

The problem is that mapFieldByFieldName expects to match a field with that name on the Sitecore item, not a custom computed field in your index, so the field reader is never called.

I don't know how to achieve this from config. However, if you do not want to use the HTML Agility Pack (HAP) directly but are willing to write some code, then after concatenating the fields in your computed field class, just do what Sitecore does in the GetPlainText() method:

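// Needs: using System.Web; (HttpUtility) and using System.Text.RegularExpressions; (Regex)
string input = "concatenated string";
// Remove anything that looks like a tag, then decode entities such as &amp;
return HttpUtility.HtmlDecode(Regex.Replace(input, "<[^>]*>", string.Empty));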

Alternatively, use the utility method Sitecore.StringUtil.RemoveTags(text).
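
For illustration, the regex approach can be wrapped in a small helper that the computed field class calls on the concatenated string just before returning it. This is only a sketch under some assumptions: the class name HtmlStripper and the method name ToPlainText are hypothetical and not part of Sitecore. It also departs from the one-liner above in two ways. It replaces each tag with a space instead of string.Empty, so that adjacent paragraphs such as "<p>a</p><p>b</p>" do not get glued together into "ab". And it uses System.Net.WebUtility instead of HttpUtility, so that it compiles without a reference to System.Web.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative, not Sitecore API.
public static class HtmlStripper
{
    // Same tag pattern as in the answer: "<", anything up to the next ">", then ">".
    private static readonly Regex Tags = new Regex("<[^>]*>", RegexOptions.Compiled);

    // Runs of whitespace left behind once the tags are gone.
    private static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Assumption: replace each tag with a space (the answer uses string.Empty)
        // so words from neighbouring paragraphs stay separated.
        string withoutTags = Tags.Replace(html, " ");

        // Decode entities such as &amp; into their plain characters.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Collapse the extra whitespace and trim both ends.
        return Whitespace.Replace(decoded, " ").Trim();
    }
}
```

With this in place, HtmlStripper.ToPlainText("<p>Hello &amp; bye</p><p>Next</p>") returns "Hello & bye Next". Like any regex-based stripping, it is a rough approximation and will not handle every HTML edge case (for example a ">" inside an attribute value).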

Licensed under: CC-BY-SA with attribution