Solr - Is there a way to speed up my import

https://stackoverflow.com/questions/17336613

01-06-2022
|

Frage

I have a relational database model This is the basics of my data-config.xml

<entity name="MyMainEntity" pk="pID" query="select ... from [dbo].[TableA] inner join TableB on ...">
    <entity name="Entity1" pk="Id1" query="SELECT [Text] Tag from [Table2] where ResourceId = '${MyMainEntity.pId}'"></entity>
            <entity name="Entity1" pk="Id2" query="SELECT [Text] Tag from [Table2] where ResourceId2 = '${MyMainEntity.pId}'"></entity>
    <entity name="LibraryItem" pk="ResourceId" 
            query="select SKU
                    FROM [TableB] 
                    INNER JOIN ...
                    ON ...
                    INNER JOIN ...
                    ON ...
                    WHERE ... AND ...'">
    </entity>
</entity>

Now, this takes a lot of time.
10000 rows in the first query and then each other inner entities are fetched later (around 10 rows each).

If I use a db profiler I see a the three inner entities query running over and over (3 select sentences than again 3 select sentences over and over)
This is really not efficient.
And the import can run over 40 hrs ()
Now,
What are my options to run it faster .

Obviously there is an option to flat the tables to one big table - but that will create a lot of other side effects. I would really like to avoid that extra effort and run solr on my production relational tables.
So far it works great out of the box and I am searching here if there is a configuration tweak.
If I will flat the rows that - does the schema.xml need to be change too? or the same fields that are multivalued will keep being multivalued.

Thanks.

Lösung

without changing the schema of the DB, the first thing to try is caching. If the inner entities cache well, gains will be substantial.

Maybe the wiki is not uptodate so you should check the jira issues, namely solr-2382 and maybe have a look at solr-2948 too.

A second path could be trying multithreading DIH, but it's more tricky. At one point this was optional, but later was removed cause it was buggy, and I think now there was some jira issue trying to reimplement it, try look it up, but I recommend caching first.

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow