Problem

How do we compile analytics from millions of rows in a PostgreSQL table?

We pull order data from multiple CRMs and need to compile the data for reporting; each CRM has its own orders table. We compile these tables into a compiled_orders table in 24-hour increments.

Our current implementation uses SQL views to aggregate results and SUM the columns:

CREATE OR REPLACE VIEW crm1_sql_views AS
  SELECT
      account_id
    , name
    , COUNT(*) AS order_count
    , SUM(CASE WHEN
        status = 0
        THEN 1 ELSE 0 END) AS approved_count
    , SUM(CASE WHEN
        status = 0
        THEN total ELSE 0 END) AS approved_total
  FROM crm1_orders
  WHERE
    is_test = false
  GROUP BY
    account_id
    , name
  ;

We select the data we want from this view. The issue that we are running into is that a query like this pulls all the order data for a client into memory. If a client has 20M orders, it becomes extremely slow, and sometimes the query results are larger than the available memory/cache.
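
For illustration, even a query for a single client (42 is just a placeholder id) still forces PostgreSQL to scan and aggregate every one of that client's underlying order rows:

SELECT order_count, approved_count, approved_total
FROM crm1_sql_views
WHERE account_id = 42;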

How do we incrementally/consistently/quickly take 20M records in a table and compile them into another table?

Increasing hardware is one solution, but we feel that is not the correct solution right now. We looked at materialized views, but since each CRM has its own tables, they would have major maintenance implications every time we added a new CRM to our offering.

The goal is for our end users to answer questions like:

  • How many orders did we receive last week/month/year?
  • What weekday do I receive the most orders?
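
For example, if compiled_orders is keyed by account and day (the order_date and order_count column names below are assumptions for illustration), those questions become small scans over pre-aggregated daily rows:

-- Orders received in the last week
SELECT SUM(order_count)
FROM compiled_orders
WHERE account_id = 42
  AND order_date >= CURRENT_DATE - INTERVAL '7 days';

-- Which weekday brings in the most orders
SELECT TO_CHAR(order_date, 'Day') AS weekday, SUM(order_count) AS orders
FROM compiled_orders
WHERE account_id = 42
GROUP BY weekday
ORDER BY orders DESC
LIMIT 1;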

What technologies/methodologies/terms do we need to look at and research?

  • Sharding
  • ETL
  • Data Pipelines
  • "Big Data" tools
  • NoSQL

Solution

I don't see a need to change the whole database technology or infrastructure just because you need a bit of optimization here. Start with something simple like writing a stored procedure (or maybe a client program in your favorite programming language) and collect the results in a new table. If you do it right, the memory needed will be proportional to the number of distinct (account_id, name) pairs, not more. I guess that number is much smaller than the number of orders.
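
As a minimal sketch of that approach (the ordered_at timestamp column and the compiled_orders layout are assumptions; only account_id, name, status, total, and is_test come from the question), one 24-hour slice can be aggregated and appended per CRM, so only the grouped rows ever need to be held in memory:

CREATE OR REPLACE FUNCTION compile_crm1_orders(target_day date)
RETURNS void AS $$
BEGIN
  -- Aggregate one day of crm1 orders into the reporting table.
  INSERT INTO compiled_orders
    (account_id, name, order_date, order_count, approved_count, approved_total)
  SELECT
      account_id
    , name
    , target_day
    , COUNT(*)
    , SUM(CASE WHEN status = 0 THEN 1 ELSE 0 END)
    , SUM(CASE WHEN status = 0 THEN total ELSE 0 END)
  FROM crm1_orders
  WHERE is_test = false
    AND ordered_at >= target_day
    AND ordered_at <  target_day + 1   -- half-open 24-hour window
  GROUP BY account_id, name;
END;
$$ LANGUAGE plpgsql;

-- e.g. run nightly per CRM: SELECT compile_crm1_orders(CURRENT_DATE - 1);

With a unique index on (account_id, name, order_date) the insert could also be made idempotent via ON CONFLICT ... DO UPDATE, so a failed or repeated run does not double-count a day.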

On a larger scale, read up on data warehousing and how to model things like a "star schema" for the kind of queries you mentioned. You will find plenty of books, tutorials and information on the web for this. "ETL" is indeed the right term to search for, since that is the kind of process you need to fill your "data warehouse".
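
As a rough illustration of what a star schema could look like here (every name below is hypothetical), orders from all CRMs land in one narrow fact table whose keys point at small dimension tables, and the reporting queries join only what they need:

-- Hypothetical star schema for order reporting
CREATE TABLE dim_account (
    account_key  serial PRIMARY KEY,
    account_id   bigint NOT NULL,
    name         text   NOT NULL,
    crm_source   text   NOT NULL          -- 'crm1', 'crm2', ...
);

CREATE TABLE dim_date (
    date_key     date PRIMARY KEY,
    year         int  NOT NULL,
    month        int  NOT NULL,
    weekday      int  NOT NULL            -- 0 = Sunday ... 6 = Saturday
);

CREATE TABLE fact_orders (
    account_key    int     NOT NULL REFERENCES dim_account,
    date_key       date    NOT NULL REFERENCES dim_date,
    order_count    bigint  NOT NULL,
    approved_count bigint  NOT NULL,
    approved_total numeric NOT NULL
);

Each new CRM then only needs its own extract/transform step that maps its orders into this shared shape; the fact and dimension tables themselves stay unchanged.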

Other tips

I agree with Doc Brown: if you're looking for a long-term, permanent solution, consider loading your data into a data warehouse.

Implementing a data warehouse will improve your query speed and overall system performance.

Data warehouses are purpose-built for fast data retrieval and analysis, and they are designed to store large volumes of data while keeping it quick to query.

Further, a data warehouse takes a large burden off the operational environment and effectively distributes system load across the organization's technology infrastructure.

Once a data warehouse is in place and populated with data, it becomes part of a BI solution, and your end users will be able to create reports on their orders directly.

To load your PostgreSQL data into a data warehouse you will need an ETL tool such as Alooma.

License: CC-BY-SA with attribution
Not affiliated with softwareengineering.stackexchange