Question

I'm currently working on a sort of meta-modeler: a free web service where people can input data and run several models on it.

The task I'm currently struggling with is this: users need to enter data column by column, consisting of a number n of IDs, a number m of attributes, and a number k of classes, with the conditions n, m > 0 and k >= 0. The data is heterogeneous, so IDs can be either numeric or text, and the same goes for attributes and classes. For simplicity, I'm assuming there will be no nulls in the data.

I'm currently thinking on:

1) Creating a table with more than enough columns (all initially null), so that I work with only the non-null columns (determined from user input). However, this would limit the size of the datasets people could submit.

2) Creating a specialized data structure in a programming language, doing all the work there, and finally creating a table dynamically to store the resulting data.

3) Using a database specialized for this kind of data (maybe a document-based DB).

4) Creating a data structure in the RDBMS itself (I'm using PostgreSQL), say a variable-size array, so that I can create the table directly from user input using only 3 variable-length arrays (one for IDs, one for attributes, and one for classes). However, since attributes and IDs can be of different types, the array would have to support heterogeneous data types, and I don't know whether that is possible in an RDBMS or in SQL.
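Regarding option 4: SQL arrays are typed (e.g., text[] or integer[]), so a truly mixed-type array isn't directly possible, but PostgreSQL's json/jsonb column type can hold heterogeneous values. A minimal sketch of the idea, with the actual database calls shown only in comments (the table and column names are invented, and a psycopg2 connection is assumed):

```python
import json

# Columns as a user might enter them; note the mixed types within each list.
ids = ["row-1", 2, "row-3"]
attributes = [[3.5, "red"], [1.2, "blue"], [7.0, "green"]]
classes = ["yes", "no", "yes"]

# A SQL array must be homogeneous, but a jsonb column accepts any JSON
# value, so each row can be serialized as a single self-describing blob.
rows = [
    json.dumps({"id": i, "attributes": a, "class": c})
    for i, a, c in zip(ids, attributes, classes)
]

# With psycopg2 (hypothetical names), the storage side would look like:
# cur.execute("CREATE TABLE datasets (row jsonb)")
# cur.executemany("INSERT INTO datasets (row) VALUES (%s)",
#                 [(r,) for r in rows])
```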

I've been searching for information on this but have found nothing so far. Any pointers to a package, language library, extension, paper, thesis, or technical report with relevant information would be appreciated. Personal experience with doing something similar would also be useful.


Solution

I've done something like what you're describing using MongoDB; I think your time is best spent on some sort of NoSQL approach rather than building a specialized one-off solution. If you're using Python, I've had excellent experiences with PyMongo for handling reads and writes from within my code.
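As an illustration, here is one way such column-wise input could be shaped into documents for MongoDB. The helper name and the database/collection names are made up, and the actual PyMongo calls are left in comments because they need a running server:

```python
# Shape column-wise user input into one document per row. MongoDB is
# schemaless, so mixed types across documents are not a problem.
def rows_to_documents(ids, attributes, classes):
    return [
        {"_id": i, "attributes": a, "class": c}
        for i, a, c in zip(ids, attributes, classes)
    ]

docs = rows_to_documents(
    ids=[1, "sample-2"],                      # heterogeneous IDs
    attributes=[[0.4, "low"], [0.9, "high"]],
    classes=["A", "B"],
)

# With PyMongo (assuming a local mongod and invented names):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# client.metamodeler.datasets.insert_many(docs)
```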

I would strongly caution you against approach #1. It could break very easily in the future, and there are databases designed to handle exactly this problem!

OTHER TIPS

You should be using a combination of 2) and 3) in your scenario above.

For 2), use JSON: it lets you construct an arbitrary structure on the fly, with as many columns as you need, without tying yourself to a fixed schema. JSON also has wide language support (e.g., Python, R, Java, Scala).
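As a small illustration of that point (the field names here are invented), Python's standard json module round-trips a mixed-type structure with nothing declared ahead of time:

```python
import json

# A dataset with mixed value types and nested structure; no schema
# is fixed in advance, and another dataset could use different fields.
dataset = {
    "ids": ["a1", 2, "c3"],
    "attributes": {"height": [1.7, 1.6, 1.8], "colour": ["red", "blue", "red"]},
    "classes": [0, 1, 0],
}

encoded = json.dumps(dataset)
decoded = json.loads(encoded)
assert decoded == dataset  # structure and types survive the round trip
```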

For 3), I agree that MongoDB is the simplest option, as it is designed to store JSON-like documents.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange