Question

I'm building a PHP app to prefill third party PDF account forms with client data, and am getting stuck on the database design.

The current form has about 70 fields, which seems like far too many to set up as individual columns, especially as some (e.g. company/trust information) are not relevant depending on the type of account the client requires.

I've tried to normalize, but it seems like there would be a lot of joins, and it would also require several subqueries for things like multiple addresses.

It also means a ton of extra queries to check whether rows exist when updating, to decide if the script needs to do an INSERT, a DELETE or an UPDATE, whereas if it were all in one row, it would basically just be an UPDATE each time.

Not sure if this helps but here is a list of most of the fields:

id, account_type, account_phone, account_email, account_designation, account_adviser, account_source, account_complete, account_residential_unit_number, account_residential_street_number, account_residential_street_name, account_residential_street_type, account_residential_suburb, account_residential_state, account_residential_postcode, account_postal_unit_number, account_postal_street_number, account_postal_street_name, account_postal_street_type, account_postal_suburb, account_postal_state, account_postal_postcode, individual_1_title, individual_1_firstname, individual_1_middlename, individual_1_lastname, individual_1_dob, individual_1_occupation, individual_1_email, individual_1_phone, individual_1_unit_number, individual_1_street_number, individual_1_street_name, individual_1_street_type, individual_1_suburb, individual_1_state, individual_1_postcode, individual_2_title, individual_2_firstname, individual_2_middlename, individual_2_lastname, individual_2_dob, individual_2_occupation, individual_2_email, individual_2_phone, individual_2_unit_number, individual_2_street_number, individual_2_street_name, individual_2_street_type, individual_2_suburb, individual_2_state, individual_2_postcode, company_name, company_date, company_unit_number, company_street_number, company_street_name, company_street_type, company_suburb, company_state, company_postcode, trust_name, trust_date, settlement_bank, settlement_account, settlement_bsb

The most this will need to handle is around 200,000 applications, and once the data is in the database, it won't change very often, if at all - not sure if that is relevant?

So really I just wanted to figure out the smartest way to design this, even if it's just a name or topic to research further.

Solution

Generally speaking, you can divide databases into two broad categories:

  1. OLTP Systems

    Online Transaction Processing systems are normally write-intensive, i.e. a lot of updates compared to reads of the data. These are typically the day-to-day applications used by business users of all kinds, e.g. data capture, admin etc. These databases are usually normalized to the extreme and then selectively denormalized for performance gains in certain areas.

  2. OLAP/DSS Systems

    Online Analytical Processing databases are normally large, data-warehouse-like systems, used to support analytic activities such as data mining, data cubes etc. Typically the information is used by a more limited set of users than in OLTP. These databases are normally very denormalised.

Go read this for a short description of the two and the main differences between them: OLTP VS OLAP

Regarding your INSERT/UPDATE/DELETE point, go read about MySQL's INSERT ... ON DUPLICATE KEY UPDATE statement, which will resolve that issue for you easily. The equivalent is called a MERGE operation in most other database systems.
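As a rough illustration, an "insert or update in one statement" looks like this in MySQL (the accounts table and its columns here are just examples, not your actual schema):

    INSERT INTO accounts (id, account_type, account_phone, account_email)
    VALUES (42, 'individual', '0400 000 000', 'client@example.com')
    ON DUPLICATE KEY UPDATE
        account_type  = VALUES(account_type),
        account_phone = VALUES(account_phone),
        account_email = VALUES(account_email);

If a row with id 42 already exists it is updated, otherwise it is inserted, so your PHP code never has to check for existence first.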

Now, I don't understand why you are worried about JOINs. I have had tables with hundreds of millions (500,000,000+) of rows that I joined with other tables that were also large in size, and the queries ran very fast. So designing a database to eliminate joins is NOT a good idea.

My suggestion is:

If you are designing an OLTP system, normalise as much as possible, then denormalise to increase performance where needed. For an OLAP system, look at star schemas etc. and don't even bother normalising it first. By the way, most OLAP systems normally use an OLTP system as a data source.
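To make the "normalise first" advice concrete, here is one possible shape for the form data in the question. It is only a sketch under my own assumptions; the table and column names are illustrative, not a finished design:

    CREATE TABLE accounts (
        id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        account_type VARCHAR(30)  NOT NULL,
        phone        VARCHAR(20),
        email        VARCHAR(100),
        adviser      VARCHAR(100),
        complete     TINYINT(1)   NOT NULL DEFAULT 0
    );

    CREATE TABLE individuals (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        account_id INT UNSIGNED NOT NULL,
        position   TINYINT      NOT NULL,           -- 1 or 2 on the form
        title      VARCHAR(10),
        firstname  VARCHAR(50),
        middlename VARCHAR(50),
        lastname   VARCHAR(50),
        dob        DATE,
        occupation VARCHAR(100),
        FOREIGN KEY (account_id) REFERENCES accounts (id)
    );

    -- one row per address, whatever it belongs to and whatever kind it is
    CREATE TABLE addresses (
        id            INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        account_id    INT UNSIGNED NOT NULL,
        individual_id INT UNSIGNED NULL,            -- NULL for account-level addresses
        address_type  ENUM('residential','postal','registered') NOT NULL,
        unit_number   VARCHAR(10),
        street_number VARCHAR(10),
        street_name   VARCHAR(100),
        street_type   VARCHAR(30),
        suburb        VARCHAR(100),
        state         CHAR(3),
        postcode      CHAR(4),
        FOREIGN KEY (account_id) REFERENCES accounts (id)
    );

Company and trust details would go in their own small tables in the same way, so an account only gets those rows when the account type actually needs them.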

OTHER TIPS

Usually I normalise and then denormalise for performance. However:

If I didn't have too much validation to do, e.g. valid addresses or duplicated individuals,

And I didn't want to reuse parts of the data for another version of the form, e.g. selecting an existing individual's name and address etc.,

And I didn't want to analyse it, e.g. find all mentions of Fred Bloggs,

And my users were happy with entering all of this in one form (I wouldn't be),

Then I'd go with denormalised from the get-go.

The thing is, if you normalise, then denormalising later if required is fairly trivial and low risk. Normalising denormalised data, on the other hand, usually means de-duplication, which is likely to be really painful both data- and design-wise.
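For example, if you had normalised accounts and individuals tables along the lines of the sketch above (names assumed, not from the question), producing a flat, denormalised copy later is a single statement:

    CREATE TABLE account_flat AS
    SELECT a.id,
           a.account_type,
           i1.firstname AS individual_1_firstname,
           i1.lastname  AS individual_1_lastname,
           i2.firstname AS individual_2_firstname,
           i2.lastname  AS individual_2_lastname
    FROM accounts a
    LEFT JOIN individuals i1 ON i1.account_id = a.id AND i1.position = 1
    LEFT JOIN individuals i2 ON i2.account_id = a.id AND i2.position = 2;

Going the other way means picking apart 70 columns and de-duplicating people and addresses by hand.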

Normalize your input, de-normalize the output. Meaning, for reporting, extract your data into a de-normalized format like Mongo and use that for querying. Or, create rollups of some sort. With large datasets, I have found it most efficient to extract the reporting data from the input data.

I find denormalized data extremely painful to work with at a very basic level. What if I want a tally of the number of people who live in Georgia? In your denormalized structure I would have to count where individual_1_state = 'GA' or individual_2_state = 'GA'.
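In SQL terms the difference looks roughly like this (the wide applications table stands in for the single-table design from the question, and the individuals/addresses tables are from the normalised sketch earlier; all names are assumptions):

    -- denormalised: every individual "slot" has to be checked separately
    SELECT SUM(individual_1_state = 'GA') + SUM(individual_2_state = 'GA') AS people_in_ga
    FROM applications;

    -- normalised: one column holds every state value
    SELECT COUNT(*) AS people_in_ga
    FROM individuals i
    JOIN addresses ad ON ad.individual_id = i.id
    WHERE ad.state = 'GA';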

This is not too bad I guess, but to anyone who has seen the ease of querying that normalization provides, it is quite painful.

Normalization establishes the foundation for more and more complex queries. Without it, you will find it increasingly difficult to implement richer data analysis.

Normalization also provides the basis for integrity and consistency in your database. If you have all the occurrences of a particular thing (state abbreviations) in one place (one column), you can easily check and constrain those values to disallow nonexistent codes.
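One concrete way to do that in MySQL, assuming an addresses table with a single state column as in the earlier sketch (the lookup table and names are illustrative):

    CREATE TABLE states (
        code CHAR(3) PRIMARY KEY    -- e.g. 'NSW', 'VIC', 'QLD'
    );

    ALTER TABLE addresses
        ADD CONSTRAINT fk_addresses_state
        FOREIGN KEY (state) REFERENCES states (code);

With the 70-column layout you would have to repeat that check across every *_state column instead of constraining one.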

The rationale for normalization goes on and on, but I hope I hit a few no-brainers.

This is a no-brainer - all you have now is a noun soup which you have shoved into a single table-storage shoebox, with some ID glued to the beginning of each row.

Create some kind of schema. If this is more like OLAP -- and you decide on a star schema -- it will have dimensions in 2NF-5NF and facts in 2NF-6NF. For OLTP (or other warehouse models) aim for BCNF to 6NF.

I would argue that you do not even have 1NF here; gluing that ID at the beginning does not count as preventing duplicates. Therefore, you cannot de-normalize from this point even if you wanted to :) -- ok, maybe you could put a comma-separated list somewhere to make things definitely not in 1NF.

Joins are what relational databases do, so do not worry about that.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow