Question

I'm developing an app which requires user-defined custom fields on a contacts table. This contacts table can contain many millions of contacts.

We're looking at using a secondary metadata table which stores information about the fields, along with a tertiary value table which stores the actual data.

Here's the rough schema:

CREATE TABLE [dbo].[Contact](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [FirstName] [nvarchar](max) NULL,
    [MiddleName] [nvarchar](max) NULL,
    [LastName] [nvarchar](max) NULL,
    [Email] [nvarchar](max) NULL
)

CREATE TABLE [dbo].[CustomField](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [FieldName] [nvarchar](50) NULL,
    [Type] [varchar](50) NULL
) 

CREATE TABLE [dbo].[ContactAndCustomField](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [ContactID] [int] NULL,
    [FieldID] [int] NULL,
    [FieldValue] [nvarchar](max) NULL
)

However, this approach introduces a lot of complexity, particularly when importing CSV files that contain multiple custom fields. At the moment this requires an update/join statement plus a separate insert statement for every individual custom field. Joins are also required to return custom field data for multiple rows at once.
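
For illustration, here is roughly what that looks like under this schema. The staging table dbo.CsvStage, its columns, and the hard-coded field IDs are hypothetical names invented for this sketch:

-- Hypothetical staging table holding the raw CSV rows; one INSERT
-- like this is needed for every custom column in the file.
INSERT INTO dbo.ContactAndCustomField (ContactID, FieldID, FieldValue)
SELECT c.ID, cf.ID, s.Custom1
FROM dbo.CsvStage AS s
JOIN dbo.Contact AS c ON c.Email = s.Email
JOIN dbo.CustomField AS cf ON cf.FieldName = N'Custom1';

-- Reading two custom fields back already costs one join per field.
SELECT c.ID, c.Email,
       f1.FieldValue AS Custom1,
       f2.FieldValue AS Custom2
FROM dbo.Contact AS c
LEFT JOIN dbo.ContactAndCustomField AS f1
    ON f1.ContactID = c.ID AND f1.FieldID = 1
LEFT JOIN dbo.ContactAndCustomField AS f2
    ON f2.ContactID = c.ID AND f2.FieldID = 2;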

I've argued for this structure instead:

CREATE TABLE [dbo].[Contact](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [FirstName] [nvarchar](max) NULL,
    [MiddleName] [nvarchar](max) NULL,
    [LastName] [nvarchar](max) NULL,
    [Email] [nvarchar](max) NULL,
    [CustomField1] [nvarchar](max) NULL,
    [CustomField2] [nvarchar](max) NULL,
    [CustomField3] [nvarchar](max) NULL /* etc, adding lots of empty fields */
)

CREATE TABLE [dbo].[ContactCustomField](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [FieldIndex] [int] NULL, 
    [FieldName] [nvarchar](50) NULL,
    [Type] [varchar](50) NULL
) 

The downside of this second approach is that there is a finite number of custom fields that must be specified when the contacts table is created. I don't think that's a major hurdle given the performance benefits it will surely have when importing large CSV files and returning result sets.
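
For comparison, under this wide-table design the same import collapses into one set-based statement (again using the hypothetical dbo.CsvStage staging table from above):

-- The whole CSV import becomes a single insert, with no per-field statements.
INSERT INTO dbo.Contact (FirstName, LastName, Email, CustomField1, CustomField2)
SELECT s.FirstName, s.LastName, s.Email, s.Custom1, s.Custom2
FROM dbo.CsvStage AS s;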

What approach is the most efficient for large numbers of rows? Are there any downsides to the second technique that I'm not seeing?

Was it helpful?

Solution

Microsoft introduced sparse columns exactly for this type of problem. The point is that in a "classic" design you end up with a large number of columns, most of them NULL for any particular row. The same is true with sparse columns, except that the NULLs don't require any storage. Moreover, you can group the sparse columns into a column set and read and modify them all at once as XML.
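
As a minimal sketch of the contact table rewritten this way (column names are illustrative, not a prescribed design):

CREATE TABLE [dbo].[Contact](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [FirstName] [nvarchar](max) NULL,
    [LastName] [nvarchar](max) NULL,
    [Email] [nvarchar](max) NULL,
    -- NULLs in sparse columns take no storage at all
    [CustomField1] [nvarchar](max) SPARSE NULL,
    [CustomField2] [nvarchar](max) SPARSE NULL,
    [CustomField3] [nvarchar](max) SPARSE NULL,
    -- the column set exposes every sparse column as one XML value
    [CustomFields] [xml] COLUMN_SET FOR ALL_SPARSE_COLUMNS
)

One caveat: once a column set is defined, SELECT * returns the single XML column set instead of the individual sparse columns, so queries should name the columns they want.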

Performance- and storage-wise, sparse columns are the winner.

http://technet.microsoft.com/en-us/library/cc280604.aspx

Other tips

Query performance for any "property bag table" approach is comically slow, but if you need flexibility you can either have a dynamic table that is changed via an editor, or you have a property bag table. So when you need it, you need it.

But expect the performance to be slow.

The best approach would likely be a ContactCustomFields table whose fields are determined by an editor.
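
A sketch of that idea, with a hypothetical one-row-per-contact dbo.ContactCustomFields table whose columns the editor adds and drops at runtime:

CREATE TABLE [dbo].[ContactCustomFields](
    [ContactID] [int] NOT NULL PRIMARY KEY
    -- real, typed columns below this point are managed by the field editor
)

-- When a user defines a new custom field, the editor issues DDL such as:
ALTER TABLE dbo.ContactCustomFields ADD [Birthday] [date] NULL;
ALTER TABLE dbo.ContactCustomFields ADD [LoyaltyTier] [nvarchar](50) NULL;

Keeping these columns in a side table rather than on dbo.Contact itself means field additions and removals never touch the main contacts table.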

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow