Question

I have 30-ish million html documents in a file system. There is no emergency, the files are in a reasonable directory tree, it's not breaking the file system. But I'd like to be able to organize and query them. EDIT: Let's make a specific requirement to query for documents in xpath or a similar syntax. For example, //span[@class="class"] to return all documents with the specified element and attribute.

Options considered:

a) A search index like elasticsearch or lucene. This would definitely flatten the document hierarchy and provide search as well as certain metadata. However, no xpath or xquery to explore document properties on the fly.

b) A big data / stream processing framework like spark. I'd have to write batch jobs for any analysis, but it would be more scalable. This looks really hard to try out because of the learning curve and need for compute resources.

c) An XML database, or a database with an XML type. Now HTML is not XML, and my documents are unlikely to be XHTML compliant (another good thing to find out actually, how many of these documents are loadable using a stock XML parser). Obviously it would be super cool if all the attributes and values would be indexed ahead of time.

What's the most promising approach or pattern?

Solution

I'm assuming:

  • You want to be able to search through thirty million HTML documents with XPath.
  • The search itself should be relatively fast (say two seconds).
  • You are looking for a way to pre-process those files and store the processed information so that you can perform the actual searches.

The first thing you may try is to walk through all those documents and check whether they can be understood by a (possibly lenient) XML parser. This would give you an idea of whether you can store those documents as XML in the first place. It is likely that some documents won't parse; in that case, check manually why (if there aren't too many) and see if this can be fixed. If there are too many of them, you may look at HTML parsers which can generate XHTML from whatever they parsed. A simple Google search shows that there is a tool for that, an HTML to XHTML converter, although I haven't tried it and don't know how good it is.
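
As a rough illustration, if you bulk-load the raw files as text into a staging table first, PostgreSQL (which the tips below also use) can count how many would fail a strict XML parser. The raw_docs table here is made up:

-- raw_docs(path text, body text) is a hypothetical staging table
select count(*) as not_well_formed
from raw_docs
where not xml_is_well_formed_document(body);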

Once you have all the documents in the form of XHTML, you can simply put them in a one-column table in a relational database which supports an XML type, such as Microsoft SQL Server. That's pretty much all: you should be able to query the table with XPath, letting the database do all the work.
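
As a minimal sketch of that pattern (the answer mentions SQL Server, but the same idea is shown here in PostgreSQL, with a table layout matching the examples further down):

create table tb001 (id serial primary key, xml xml);

-- all documents containing a <span class="class"> element
select id from tb001 where xpath_exists('//span[@class="class"]', xml);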

Another possibility could be to use a NoSQL database which stores documents in the form of XML and supports XPath searches (sorry, I can't give a specific product name, since my knowledge of NoSQL databases is very limited).

Other tips

Indexing options with PostgreSQL

A sequential scan over 30-ish million HTML documents should take less than an hour on cheap notebooks these days, and most database servers work with XML. If this performance level is acceptable, you can use your preferred database server. But if you need faster queries, you will need to develop a good indexing strategy.

PostgreSQL is a good choice because it has query optimizations that let several concurrent sequential scans share the same pass over the table, and it supports advanced indexing such as:

Extract classes into an array and index it

If your queries are based on the existence of a span with a given class, you can create a function that returns all classes in an array, and then index that array. It would look something like this:

-- one way to fill in the stub with PostgreSQL's xpath(); splitting space-separated class lists is omitted
create function extractspanclasses(doc xml) returns text[] immutable language sql
as $$ select array(select unnest(xpath('//span/@class', doc))::text) $$;

create index tb001_idx001 on tb001 using GIN (extractspanclasses(xml));

With that in place, the following queries should use the index:

select * from tb001 where extractspanclasses(xml) @> '{class001,class002}';
-- @> means "contains": the document has both classes

select * from tb001 where extractspanclasses(xml) @> '{class001}';

The array need not be limited to classes: you can combine classes and precalculated flags in it according to your interface, something like hasthis, hasthat, hasjavascript, hassvg, hasclass1, hasclass2, and so on.
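
A hypothetical sketch of such a combined extractor (the function and flag names are invented, and XML namespace handling for SVG is glossed over):

create function extractfeatures(doc xml) returns text[] immutable language sql as $$
select extractspanclasses(doc)
    || case when xpath_exists('//script', doc) then '{hasjavascript}'::text[] else '{}' end
    || case when xpath_exists('//svg', doc) then '{hassvg}'::text[] else '{}' end
$$;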

To speed up your queries, the index does not need to match the exact XPath you are looking for. The index is only the first step: it just has to narrow things down enough to avoid running expensive XPath tests on every row.
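
Put differently, a query can pair the indexed array filter with the exact XPath test, so the expensive check runs only on the rows that survive the index. For example:

select * from tb001
where extractspanclasses(xml) @> '{class001}'          -- indexed pre-filter
and xpath_exists('//span[@class="class001"]/a', xml);  -- exact test on the few survivors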

Indexed key=value store

PostgreSQL has a module called hstore (https://www.postgresql.org/docs/10/hstore.html) that can be indexed, though I have never used it myself.
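
An untested sketch of how that might look (how the attrs column would be populated is left open):

create extension if not exists hstore;
alter table tb001 add column attrs hstore;
create index tb001_idx002 on tb001 using GIN (attrs);

select * from tb001 where attrs @> 'class=>class001';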

What about scale?

Scaling depends a lot on the kinds of queries you run. For instance, my biggest app has 8 billion records with very demanding throughput and latency constraints, and I have been running it on PostgreSQL for about 12 years now.

What about document databases?

I am designing a new app and did a benchmark of possible solutions a few months ago. One of the comparisons was MongoDB vs. PostgreSQL for storage and retrieval of JSON documents. A simple test generated several million JSON documents of 0.5 to 2 KB in size, all with the same schema, indexed by three JSON keys. PostgreSQL won by a very large margin. I was not expecting this, because there is a lot of hype around document databases.
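
Indexing a jsonb document by individual keys in PostgreSQL looks roughly like this (the table and key names are invented, not those of the actual benchmark):

create table jdocs (id bigserial primary key, doc jsonb);
create index jdocs_key1_idx on jdocs ((doc->>'key1'));
create index jdocs_key2_idx on jdocs ((doc->>'key2'));
create index jdocs_key3_idx on jdocs ((doc->>'key3'));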

What about spark?

If I understood correctly, Spark is a parallel processing infrastructure. Without an indexing strategy in the database, Spark would have to filter all documents every time. If you have access to large computational grids this may pay off, but for XPath-like queries over 30-ish million records I doubt it would be faster than PostgreSQL running sequential scans, unless you need something far heavier and more CPU-intensive than element-class XPath filters.

It might depend on what you are comfortable with, but one way to approach it is to write your own parser with a library such as BeautifulSoup in Python. That would give you detailed control over exactly what data and information you want to index, and full freedom to create your own structure.

If you do not want to code anything yourself, you can try importing everything into an available local search engine or document database.

Licensed under: CC-BY-SA with attribution