Question

My apologies if this has already been answered. I've searched SO and of course here on DBA, and am surprised to find no close matches. Specifically, I'm looking for a solution which survives schema differences.


I finally got a development server with which to shadow a production data warehouse server. I'll be testing schema and sproc changes here, eventually publishing such changes to the prod server. To do so, I'll need reasonably current data, but it need not be updated in real time. My plan is to copy end-state data from the main server to the dev server as the final step in overnight ETL. What's the best way to achieve this?

Parameters

All automatic copying will be from prod to dev. Any copying from dev to prod will be by hand, and generally will just be DDL.

Because the dev server's schema is expected to change independently of the production server, I don't want the copy process to fail when the target schema differs (it's OK if the specific tables in flux can't be kept in synch, of course). Otherwise I would just DROP DATABASE and restore last night's backup.

Keeping the schema, SPs, views, and UDFs in synch is not necessary, or even desirable. They should only change when I specifically change them.

Records can be changed retroactively in most tables, so incremental updates will probably not be practical.

The volume of data is 14 GB for the most critical tables, plus about 200 GB of less important data which can be updated weekly.

My goal is to finish the process within two hours. The servers are co-located, and should have high throughput. Copying (INSERT INTO..SELECT * FROM ProdServer..) a single table with 600 MB of data and 300 of indices took 7.5 minutes; not great. Of some concern, another table with 11 GB of data and 8 GB of indices did not finish in 130 minutes, when I cancelled it. I'll test this again without indices.

Taking the prod database offline is acceptable if it can be kept brief - no more than 30 minutes. The dev database can be offline for hours if necessary.

If I can temporarily exclude a specific table, keeping it static for a few days while testing, that would be valuable.

Options

1: Replication

I could create a daily snapshot after the end of ETL, and publish it to the dev server. I've not used replication before, but this seems to be the kind of scenario for which it is meant. Time for me to learn a new facet of the technology?

2: Bulk Copy and BULK INSERT

I could write a script which iterates through every table in the prod database and spits out the content to a uniquely-named file. On the dev side, I'd loop through these files and TRUNCATE/BULK INSERT into the target table, with a TRY..CATCH block in case the schema has changed. I'm not sure if this would perform acceptably, but it would be fairly simple to implement.

3: INSERT from Linked Server

I can run a TRUNCATE/INSERT..SELECT for each table, probably pulling from the dev side. This would be simple, and should be fast, especially if one drops and re-created the indices. To cope with schema changes, one could identify the intersection of the field lists for each pair of tables and only attempt to copy those fields; this would help if many fields were NULLable.


Are there other options? Is there an easier way that I'm overlooking? Are there any gotchas that you've hit when working on similar projects?

This question describes the export side, but the desired output was CSV, so the options are limited (and BCP works great, as answered).

This question talks about quickly copying a whole database, including schema. It specifically excludes replication.

This question discusses bi-directional synchronization, for which replication is suggested.

Was it helpful?

Solution

I'd consider backup/restore of a mix of full and differential backups from prod to dev. Then synch between this restored copy to your actual dev databases

SQL Server 2008 R2+ supports backup compression in Standard Edition (In SQL Server 2008 it was Enterprise Edition only) which makes this easier.

Reasons:

  • You test your backup integrity and restore capability
  • You can reset your dev database easily if some code monkey bollixes it
  • You can compare "before" and "after" DDL, performance etc
  • You can snapshot the restored database as needed to roll this back if you make changes

I've used this before and will use it again for these reasons

I suggested this in my answer to The smallest backup possible ... with SQL Server

OTHER TIPS

Contiguous integration with migrations. Deployment of V. Next means running the upgrade scripts for V. Prev to v. Next. Scripts are source, checked in. Never consider the database binary files (MDF, LDF) as the true schema. consider the source (.sql) as the schema, always operate on the .sql files, test the .sql files until you're confident on the upgrade and then deploy the upgrade by running the .sql file.

Diff based deployment is riddled with problems as you are at the mercy of the diff tool (some are better than other, personally I would never trust my live deployment on any of them). Copy based deployment is not even worth mentioning.

No change occurs on liver server w/o going through this process. Ever.

How about copy of prod server to dev? Why would you ever need to do it, other than compromise data by exposing it to unauthorized eyes? Deploy a test/dev server using the same procedure, contiguous integration and migrations.

In some scenarios is acceptable to have a test replica of prod, or a dev replica of prod. Make it part of the contiguous integration build drop: the nightly drop starts with a prod backup, runs the migrations from v. prod to v. dev, the result is the dev variant of the prod database, freshly dropped. BTW, did you notice that the drop also just tested the migration from v. prod to v. dev? Does data has to be anonymized or some other transformation for the dev environment? Make it a dev specific deployment step, of course controlled by scripts checked into the source tree.

I would recommend using SQL Server Integration Services (SSIS) for this type of requirement. The power in SSIS would be the ease of error handling (as required by your question), the ability to schedule it, and the flawless maintainability of a DTSX package. You won't have to worry about explicitly linking servers or any downtime.

SSIS nicely handles the separation of concerns. It's an independent package that, when used with SQL Server Agent, can be scheduled nightly or you can run it on demand.

Snapshot replication would be the only kind of replication that might work. But it could prevent schema changes like the other versions do (transactional, merge/peer-to-peer). But you would just be better off with a backup and restore (Remus' ideas look spot-on) as you will gain nothing with replication for this scenario.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top