Implementing soft delete with minimal impact on performance and code

https://stackoverflow.com/questions/7366849

28-10-2019

Question

There are some similar questions on the topic, but they are not really helping me.

I want to implement a soft delete feature like on StackOverflow, where items are not really deleted, but just hidden. I am using a SQL database. Here are 3 options:

Add an is_deleted boolean field.
- Advantages: Simple.
- Disadvantages: No date record. Forces me to add is_deleted = 0 to every query.
Add a deleted_date date field, which is set to NULL while the row is not deleted.
- Advantages: Has date.
- Disadvantages: Still clutters my queries.

Both of the above will also impact performance, because all these useless rows stay in the table and still have to be maintained in the indexes. Also, an index on the deleted column won't help when fetching the non-deleted rows (the majority): a full table scan is still needed.
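To make the deleted_date variant concrete, here is a minimal sketch using Python's built-in sqlite3 module. The posts table, its columns, and the helper names are assumptions made for illustration; the point is only that every read has to carry the deleted_date IS NULL filter.

```python
import sqlite3
from datetime import datetime, timezone


def make_posts_db():
    # Hypothetical schema: deleted_date is NULL while the row is live.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " deleted_date TEXT)"
    )
    return conn


def soft_delete(conn, post_id):
    # Mark the row as deleted instead of removing it.
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE posts SET deleted_date = ? "
        "WHERE id = ? AND deleted_date IS NULL",
        (now, post_id),
    )


def restore(conn, post_id):
    # Undeleting is just clearing the date again.
    conn.execute("UPDATE posts SET deleted_date = NULL WHERE id = ?", (post_id,))


def live_titles(conn):
    # The filter below is the "clutter" every query on live rows needs.
    rows = conn.execute(
        "SELECT title FROM posts WHERE deleted_date IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]
```

The is_deleted variant looks the same, with a 0/1 flag in place of the date.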

Another option is to create a separate table to hold deleted items:

Advantages: Improved performance when querying non-deleted rows. No need to add conditions to my queries on non-deleted rows. Easier on index maintenance.
Disadvantages: Complexity: Requires data migration for both deletion and undeletion. Need for new tables. Referential integrity is harder to handle.
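The data migration mentioned above could look roughly like this sketch, again with Python's sqlite3. The posts and deleted_posts tables and the function names are hypothetical; the copy and the delete run in one transaction so a row is never lost or duplicated halfway.

```python
import sqlite3


def make_archive_db():
    # Hypothetical schema: one live table and one table for deleted rows.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL
        );
        CREATE TABLE deleted_posts (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            deleted_date TEXT NOT NULL
        );
        """
    )
    return conn


def archive_post(conn, post_id):
    # "with conn" commits both statements together or rolls both back.
    with conn:
        conn.execute(
            "INSERT INTO deleted_posts (id, title, deleted_date) "
            "SELECT id, title, datetime('now') FROM posts WHERE id = ?",
            (post_id,),
        )
        conn.execute("DELETE FROM posts WHERE id = ?", (post_id,))


def unarchive_post(conn, post_id):
    # Undeletion is the same move in the opposite direction.
    with conn:
        conn.execute(
            "INSERT INTO posts (id, title) "
            "SELECT id, title FROM deleted_posts WHERE id = ?",
            (post_id,),
        )
        conn.execute("DELETE FROM deleted_posts WHERE id = ?", (post_id,))
```

Queries on posts need no extra condition here, but any table that references posts.id would have to be handled as part of the same move.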

Is there a better option?

Solution

If the key is numeric, I handle a "soft delete" by negating the key. (Of course, this won't work for identity keys.) You don't need to change your code at all, and you can easily restore the record by multiplying the key by -1.

Just another approach to give some thought to... If the key is alphanumeric, you can do something similar by prepending a unique "marker" character. Since deleted records will all begin with this marker, they will end up off by themselves in the index.
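A small sketch of the negated-key idea, using Python's sqlite3. The customers table, the application-assigned customer_no key, and the function names are assumptions for illustration, and the sketch assumes nothing else references the key.

```python
import sqlite3


def make_customers_db():
    # Hypothetical table whose numeric key is assigned by the application.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        " customer_no INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    return conn


def negate_key(conn, customer_no):
    # Soft delete: flip the sign so lookups by the original key miss the row.
    conn.execute(
        "UPDATE customers SET customer_no = -customer_no "
        "WHERE customer_no = ? AND customer_no > 0",
        (customer_no,),
    )


def restore_key(conn, customer_no):
    # Restore: multiply by -1 again to bring back the original key.
    conn.execute(
        "UPDATE customers SET customer_no = -customer_no WHERE customer_no = ?",
        (-customer_no,),
    )


def find_customer(conn, customer_no):
    # Ordinary lookup code, unchanged by the soft-delete scheme.
    row = conn.execute(
        "SELECT name FROM customers WHERE customer_no = ?", (customer_no,)
    ).fetchone()
    return row[0] if row else None
```

Lookups by key need no change, but a query that lists the whole table would still return the negated rows unless it filters on customer_no > 0.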

Other tips

I personally would base my answer off of how often you anticipate your users wanting to access that deleted data or "restore" that deleted data.

If it's often, then I would go with a "Date_Deleted" field and put a calculated "IsDeleted" property on my POCO in the code.
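As a rough illustration of the calculated flag, here is the same idea with a plain Python dataclass standing in for the POCO. The Post class and its fields are hypothetical; only date_deleted is stored, and is_deleted is derived from it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Post:
    # Hypothetical entity; date_deleted maps to the Date_Deleted column.
    id: int
    title: str
    date_deleted: Optional[datetime] = None

    @property
    def is_deleted(self) -> bool:
        # Calculated, never stored: a row is deleted once it has a date.
        return self.date_deleted is not None
```

Because the flag is derived, it cannot drift out of sync with the date.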

If it's never (or almost never) then a history table or deleted table is good for the benefits you explained.

I personally almost never use deleted tables (and opt for isDeleted or date_deleted) because of the potential risk to referential integrity. You have A -> B and you remove the record from B... You now have to manage referential integrity yourself because of your design choice.

In my opinion, the best way forward, when thinking about scaling and eventual table/database sizes is your third option - a separate table for deleted items. Such a table can eventually be moved to a different database to support scaling.

I believe you have listed the three most common options. As you have seen, each has advantages and disadvantages. Personally, I like taking the longer view on things.

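Let's suppose we create a field called dead to mark deleted rows. We can then create an index that covers only the rows where dead is false (a partial or filtered index, in databases that support one). That way, queries that use this index, for example through a USE INDEX hint, search only the non-deleted rows.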
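One way to try this out is SQLite, which supports partial indexes and has an INDEXED BY clause that plays the role of an index hint. The sketch below assumes a hypothetical items table and index name; other databases use different syntax for both features.

```python
import sqlite3


def make_items_db():
    # Hypothetical schema; the index only contains rows where dead = 0.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE items (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            dead INTEGER NOT NULL DEFAULT 0
        );
        CREATE INDEX idx_items_live_name ON items (name) WHERE dead = 0;
        """
    )
    return conn


# The query repeats "dead = 0" as a literal so SQLite can match it to the
# index condition. INDEXED BY makes the statement fail if the index cannot
# be used, instead of silently scanning the table.
LIVE_BY_NAME = (
    "SELECT id FROM items INDEXED BY idx_items_live_name "
    "WHERE dead = 0 AND name = ?"
)


def find_live(conn, name):
    return [row[0] for row in conn.execute(LIVE_BY_NAME, (name,))]


def query_plan(conn, name):
    # The last column of EXPLAIN QUERY PLAN holds the readable plan text.
    rows = conn.execute("EXPLAIN QUERY PLAN " + LIVE_BY_NAME, (name,)).fetchall()
    return " ".join(str(row[-1]) for row in rows)
```

Deleted rows never enter the index, so they add nothing to its size or to lookups on live rows.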

I think your analysis of the options is good but you missed a few relevant points which I list below. Almost all implementations that I have seen use some sort of deleted or versioning field on the row as you suggest in your first two options.

Using one table with a deleted flag: If your indexes all contain the deleted flag field first and your queries mostly contain a where isdeleted=false type of condition, then it DOES solve your performance problems, and the indexes very efficiently exclude the deleted rows. Similar logic could be used for the deleted date option.
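A sketch of the flag-first index, using Python's sqlite3 with a hypothetical orders table. The index lists is_deleted before customer_id, so a query that filters on both columns can seek straight to the live rows of one customer.

```python
import sqlite3


def make_orders_db():
    # Hypothetical schema; the composite index leads with the deleted flag.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL,
            is_deleted INTEGER NOT NULL DEFAULT 0
        );
        CREATE INDEX idx_orders_deleted_customer
            ON orders (is_deleted, customer_id);
        """
    )
    return conn


LIVE_ORDERS = (
    "SELECT id FROM orders "
    "WHERE is_deleted = 0 AND customer_id = ? ORDER BY id"
)


def live_orders(conn, customer_id):
    return [row[0] for row in conn.execute(LIVE_ORDERS, (customer_id,))]


def query_plan(conn, customer_id):
    # The last column of EXPLAIN QUERY PLAN holds the readable plan text.
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + LIVE_ORDERS, (customer_id,)
    ).fetchall()
    return " ".join(str(row[-1]) for row in rows)
```

Calling query_plan shows whether the planner picked the index; the exact plan wording varies between SQLite versions.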

Using two tables: In general you need to make massive changes to reports, because some reports may refer to deleted data (old sales figures might refer to a deleted sales category, for example). One can overcome this by creating a view that is a union of the two tables to read from, and writing only to the active records table.
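The union view could be sketched as follows with Python's sqlite3. The categories, deleted_categories, and sales tables, the all_categories view, and the is_deleted marker column are all assumptions for illustration.

```python
import sqlite3


def make_sales_db():
    # Hypothetical schema: reports read the view, writes go to categories.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE categories (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE deleted_categories (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE sales (
            id INTEGER PRIMARY KEY,
            category_id INTEGER NOT NULL,
            amount INTEGER NOT NULL
        );
        CREATE VIEW all_categories AS
            SELECT id, name, 0 AS is_deleted FROM categories
            UNION ALL
            SELECT id, name, 1 AS is_deleted FROM deleted_categories;
        """
    )
    return conn


def sales_by_category(conn):
    # The report joins against the view, so old sales still find their
    # category even after it has been moved to deleted_categories.
    rows = conn.execute(
        "SELECT c.name, SUM(s.amount) "
        "FROM sales AS s JOIN all_categories AS c ON c.id = s.category_id "
        "GROUP BY c.name ORDER BY c.name"
    )
    return rows.fetchall()
```

The sketch assumes an id never appears in both tables at once, which holds if deletion moves the row rather than copying it.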

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow