Question

As is well known, in relational databases adding a new column can require the data to be reallocated (https://stackoverflow.com/questions/463677/alter-table-without-locking-the-table) so that each row stays contiguous on disk.


I would like to understand how this is achieved in wide-column stores such as Cassandra, which are sparse and can handle many dynamic column insertions (http://www.datastax.com/dev/blog/thrift-to-cql3, "Dynamic Column family").

Thanks!


Solution

Although Cassandra allows the definition of "columns" within a "table", these are much less strict than in a relational database schema. As it says here:

different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.

In this world, a new value is added to a row only when that row is written. The software tracks where it has written each piece of the row and maintains whatever pointers it needs to return the complete row on demand. It may choose to delete the old version, freeing that disk space for reuse, and write the new row elsewhere in contiguous storage. In Cassandra's case, writes land in immutable SSTable files, so the pieces of one row can sit in several files until a read merges them or compaction rewrites them together. There is no implication that making a schema change in one place automatically propagates to other, similar rows.
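To make the idea concrete, here is a minimal sketch of a sparse store in Python. It is a toy model built on my own assumptions, not Cassandra's actual storage engine: the names SparseStore, flush and compact are hypothetical, and everything lives in memory. It shows that adding a column to one row is just another write, that a read merges the pieces of a row from wherever they were written, and that a later compaction step can put each row back together.

```python
from typing import Dict, List, Tuple

# Toy model only (hypothetical names); not Cassandra's real on-disk format.
Cell = Tuple[int, str]                 # (write timestamp, value)
Segment = Dict[str, Dict[str, Cell]]   # row key -> column name -> cell


class SparseStore:
    def __init__(self) -> None:
        self.segments: List[Segment] = [{}]  # last segment receives writes
        self.clock = 0                       # logical write timestamp

    def write(self, row_key: str, column: str, value: str) -> None:
        """Record one cell; no other row or earlier segment is touched."""
        self.clock += 1
        self.segments[-1].setdefault(row_key, {})[column] = (self.clock, value)

    def flush(self) -> None:
        """Seal the current segment; later writes go to a fresh one."""
        self.segments.append({})

    def _merge(self, row_key: str) -> Dict[str, Cell]:
        """Collect the newest cell per column across all segments."""
        merged: Dict[str, Cell] = {}
        for segment in self.segments:
            for column, cell in segment.get(row_key, {}).items():
                if column not in merged or cell[0] > merged[column][0]:
                    merged[column] = cell
        return merged

    def read(self, row_key: str) -> Dict[str, str]:
        """Return the complete row, whatever columns it happens to have."""
        return {col: value for col, (_, value) in self._merge(row_key).items()}

    def compact(self) -> None:
        """Rewrite every row into a single segment, then open a new one."""
        row_keys = {key for segment in self.segments for key in segment}
        compacted = {key: self._merge(key) for key in row_keys}
        self.segments = [compacted, {}]


store = SparseStore()
store.write("alice", "email", "alice@example.com")
store.flush()
store.write("alice", "phone", "555-0100")   # new column, only for this row
store.write("bob", "city", "Oslo")          # bob has a different column set

alice_before = store.read("alice")  # merged from two segments
store.compact()
alice_after = store.read("alice")   # same row, now stored in one segment
```

Note that writing the "phone" column never touches the segment that already holds "email", and "bob" never acquires a "phone" column at all.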

For key-value stores the requirements are even more lax. What constitutes a "column" is entirely up to the application. All the storage engine sees is a blob of bits, which it writes to disk and indexes. How or where on disk it holds these bits is neither here nor there to the application.
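A small Python sketch of that division of labour, under my own assumptions: BlobStore is a hypothetical in-memory stand-in for a key-value engine, and JSON is just one encoding the application might pick. The store only ever handles bytes; the notion of a "column" exists solely in the encode and decode functions on the application side.

```python
import json
from typing import Dict


class BlobStore:
    """Hypothetical key-value engine: it stores and returns opaque bytes."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> bytes:
        return self._data[key]


def encode_row(columns: Dict[str, str]) -> bytes:
    """Application-side choice: represent the columns as a JSON object."""
    return json.dumps(columns, sort_keys=True).encode("utf-8")


def decode_row(blob: bytes) -> Dict[str, str]:
    return json.loads(blob.decode("utf-8"))


kv = BlobStore()
kv.put(b"user:1", encode_row({"name": "Ann"}))

# "Adding a column" is a read-modify-write done by the application;
# the store just sees one blob replaced by another.
row = decode_row(kv.get(b"user:1"))
row["email"] = "ann@example.com"
kv.put(b"user:1", encode_row(row))

updated = decode_row(kv.get(b"user:1"))
```

Because the engine never interprets the value, no other key is affected and nothing resembling a schema change takes place.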

Not all relational databases require all parts of a row to be contiguous. Oracle allows a single row to span multiple blocks. SQL Server uses off-page pointers for long text columns, and FILESTREAM allows data to be stored in the operating system's file system, outside the DBMS's own data files.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange