Configuring LucidWorks Include Paths to only crawl certain file types

https://stackoverflow.com//questions/12691190

12-12-2019

Question

I'm trying to configure a LucidWorks web data source to index only certain file types. However, when I set Include paths to .*\.html so that only .html files are crawled (as a simplified example), only the top-level folder gets indexed. Crawl depth is set to -1, and when I leave Include paths blank, the whole sub-tree is crawled as expected.

I've looked at their documentation for creating a web data source and for Using Regular Expressions, and can't find a reason why .*\.html would not work, since .* should match any sequence of characters.

Solution

While proofreading the question, I had an idea that turned out to be the correct solution. Posting it here for posterity.

The content being crawled is a file share, so the crawler relies on the web server's directory listings to discover files. Those listing URLs were filtered out because they don't end in .html, so the crawler never descended into the subfolders. Simply adding .*/ to the Include paths fixed the problem.
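
To illustrate why the directory listings matter, here is a small Python simulation of the behavior described above. It is only a sketch, not LucidWorks code: the `SITE` tree, the `example.com` URLs, and the `crawl` function are all hypothetical, and it assumes that an include pattern must match the entire URL and that the seed URL is always fetched. LucidWorks evaluates Java regular expressions, but these two simple patterns behave the same way in Python's `re` module.

```python
import re

# Hypothetical file share exposed through web-server directory listings.
# Keys are listing URLs; values are the links found on each listing page.
SITE = {
    "http://example.com/share/": [
        "http://example.com/share/index.html",
        "http://example.com/share/notes.txt",
        "http://example.com/share/docs/",
    ],
    "http://example.com/share/docs/": [
        "http://example.com/share/docs/guide.html",
        "http://example.com/share/docs/deep/",
    ],
    "http://example.com/share/docs/deep/": [
        "http://example.com/share/docs/deep/page.html",
    ],
}


def crawl(seed, include_patterns):
    """Return the sorted list of file URLs that would be indexed."""
    compiled = [re.compile(p) for p in include_patterns]
    indexed, queue, seen = [], [seed], {seed}
    while queue:
        url = queue.pop(0)
        for link in SITE.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            # Assumption: a URL is kept only if some pattern matches it in full.
            if not any(c.fullmatch(link) for c in compiled):
                continue
            if link.endswith("/"):
                queue.append(link)    # directory listing: follow it
            else:
                indexed.append(link)  # file: index it
    return sorted(indexed)


html_only = crawl("http://example.com/share/", [r".*\.html"])
with_dirs = crawl("http://example.com/share/", [r".*\.html", r".*/"])

print(html_only)  # only the top-level .html file
print(with_dirs)  # .html files from every level of the tree
```

With only .*\.html, the docs/ listing is rejected before it can be followed, so nothing below the top level is ever seen. Adding .*/ lets the listings through while still excluding files such as notes.txt.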

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow