What's the canonical way to store arbitrary (possibly marked up) text in SQL?

StackOverflow https://stackoverflow.com/questions/95459

  •  01-07-2019
  •  | 
  •  

Question

What do wikis/stackoverflow/etc. do when it comes to storing text? Is the text broken at newlines? Is it broken into fixed-length chunks? How do you best store arbitrarily long chunks of text?

Was it helpful?

Solution

nvarchar(max) ftw. because over complicating simple things is bad, mmkay?

OTHER TIPS

I guess if you need to offer the ability to store large chunks of text and you don't mind not being able to look into their content too much when querying, you can use CLobs.

This all depends on the RDBMS that you are using as well as the types of text that you are going to store. If the text is formatted into sizable chunks of data that mean something in and of themselves, like, say header/body, then you might want to break the data up into columns of these types. It may take multiple tables to use this method depending on the content that you are dealing with.

I don't know how other RDBMS's handle it, but I know that that it's not a good idea to have more than one open ended column in each table (text or varchar(max)). So you will want to make sure that only one column has unlimited characters.

Regarding PostgreSQL - use type TEXT or BYTEA. If you need to read random chunks you may consider large objects.

If you need to worry about keeping things like formatting strings, quotes, and other "cruft" in the text, as code would likely have, then the special characters need to be completely escaped first - otherwise on submission the db, they might end up causing an invalid command to be issued.

Most scripting languages have tools to do this built-in natively.

I guess it depends on where you want to store the text, if you need things like transactions etc.

Databases like SQL Server have a type that can store long text fields. In SQL Server 2005 this would primarily be nvarchar(max) for long unicode text strings. By using a database you can benefit from transactions and easy backup/restore assuming you are using the database for other things like StackOverflow.com does.

The alternative is to store text in files on disk. This may be fairly simple to implement and can work in environments where a database is not available or overkill.

Regards the format of the text that is stored in a database or file, it is probably very close to the input. If it's HTML then you would just push it through a function that would correctly escape it.

Something to remember is that you probably want to be using unicode or UTF-8 from creation to storage and vice-versa. This will allow you to support additional languages. Any problem with this encoding mechanism will corrupt your text. Historically people may have defaulted to ASCII based on the assumption they were saving disk space etc.

For SQL Server:

Use a varchar(max) to store. I think the upper limit is 2 GB.

Don't try to escape the text yourself. Pass the text through a parameterizing structure that will do the escapes properly for you. In .Net you'd add a parameter to a SqlCommand, or just use LinqToSQL (which then manages the SqlCommand for you).

I suspect StackOverflow is storing text in markdown format in arbitrarily-sized 'text' column. Maybe as UTF8 (but it might be UTF16 or something. I'm guessing it's SQL Server, which I don't know much about).

As a general rule you want to store stuff in your database in the 'rawest' form possible. That is, do all your decoding, and possibly cleaning, but don't do anything else with it (for example, if it's Markdown, don't encode it to HTML, leave it in its original 'raw' format)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top