Question

I have to use and maintain an old database schema for game servers, and it's a really bad one. Every column whose data could have contained a non-numeric character was stored as text.
I've converted every column to the proper data type, but now I'm facing an issue with setting the primary index.
It should be id, the column that contains each user's unique identification string (it's a varchar).
Because of the previous lack of indexing, and an innocent, unsolvable bug caused by our multiple game servers (we have plenty, and had plenty more in the past) accessing the same tables, we have some duplicate rows, so we can't set the column as the primary index.

I have very little experience with MySQL or SQL in general. I don't know how to write a query to remove the duplicates.

One of our tables has two columns, id and lst (varchar). For this one, the duplicates have completely identical rows, due to a lack of limiting in the update query.

The other has the same id column and quite a lot more. Only three columns matter, though: id, cur (int) and mdl (varchar). The rule for choosing between duplicates here is more complex. Firstly, whichever row has an mdl other than a specific value (let it be "default.mdl", for instance) is more likely to be the latest info. Secondly, the row with the highest cur value is more likely to be the correct one.

Based on these, I only need to keep the latest (most likely to be correct) row in each (not both) of the two tables for every id.

How do I do this with only SQL?

Edit: The reason why I'm not doing this manually is that each table has ~186,000 rows, and I estimate that 1/20 (~9,000) rows are duplicates.

Solution

The easiest way to do this is probably to create a few temporary working tables, then copy and move some data around.

It's a little hard to tell you what exactly to do since there's no schema to reference, but hopefully this will get you on the right track. It assumes that the table name of the first table you mentioned is my_table_1, the second is my_table_2, that you have permission to create / drop tables, and that you've backed up your database (if you haven't backed it up, stop now):

# First, add a surrogate auto-increment key so that duplicate rows can be told apart.
ALTER TABLE `my_table_1`
  ADD `id_new` INT( 10 ) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

ALTER TABLE `my_table_2`
  ADD `id_new` INT( 10 ) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
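
# Optional sanity check (a sketch, not one of the required steps): list the ids
# that appear more than once, so you know roughly how many rows should go away.
# Run it once per table; it only assumes the table and column names used above.
SELECT `id`, COUNT(*) AS `copies`
  FROM `my_table_1`
  GROUP BY `id`
  HAVING COUNT(*) > 1;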

# Next, build the structure to back up the existing values for future reference.
CREATE TABLE `temp_table_backup` (
  `id_orig` varchar( 255 ) NOT NULL,
  `id_new` int( 10 ) NULL DEFAULT NULL,
  `lst` varchar( 255 ) NULL DEFAULT NULL,
  `cur` int( 10 ) NULL DEFAULT NULL,
  `mdl` varchar ( 255 ) NULL DEFAULT NULL
);

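# Now copy the old values to the backup table. There's deliberately no GROUP BY:
# grouping by id while selecting the other columns is rejected under
# ONLY_FULL_GROUP_BY (the default since MySQL 5.7.5), and it would keep only one
# arbitrary row per id. This way every combination of matching rows is kept.
INSERT INTO temp_table_backup
  SELECT
    my_table_1.id,
    my_table_1.id_new,
    my_table_1.lst,
    my_table_2.cur,
    my_table_2.mdl
  FROM
    my_table_1
  INNER JOIN
    my_table_2
  ON
    my_table_1.id = my_table_2.id;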

# Index the old id columns (non-unique for now) so the self-joins below stay fast.
ALTER TABLE `my_table_1` ADD INDEX `tmp_id_idx` ( `id` );
ALTER TABLE `my_table_2` ADD INDEX `tmp_id_idx` ( `id` );

# Create a table to use temporarily. I'm avoiding temporary tables because of the
# complexity of this whole thing. It holds the id_new of every row to delete.
CREATE TABLE `temp_table_1` (
  `id_new` INT( 10 ) UNSIGNED NOT NULL PRIMARY KEY
);

# The duplicates in my_table_1 are identical rows, so keep the row with the
# lowest id_new for each id and mark all the others as stale.
INSERT INTO temp_table_1
  SELECT DISTINCT
    p2.id_new
  FROM
    my_table_1 AS p1
  INNER JOIN
    my_table_1 AS p2
  ON
    p1.id = p2.id
  AND
    p1.id_new < p2.id_new;

# Create another table (temporarily) for my_table_2. This one's kinda tricky,
# because the rows are compared according to different criteria.
CREATE TABLE `temp_table_2` (
  `id_new` INT( 10 ) UNSIGNED NOT NULL PRIMARY KEY
);

# A row (t2) is stale if another row (t1) with the same id beats it: first by
# having a mdl other than 'default.mdl', then by the higher cur, then by the
# lower id_new as a tie-breaker. This assumes cur and mdl are never NULL.
INSERT INTO temp_table_2
  SELECT DISTINCT
    t2.id_new
  FROM
    `my_table_2` AS t1
  INNER JOIN
    `my_table_2` AS t2
  ON
    t1.id = t2.id
  WHERE
    (t1.mdl != 'default.mdl') > (t2.mdl != 'default.mdl')
  OR (
    (t1.mdl != 'default.mdl') = (t2.mdl != 'default.mdl')
    AND (t1.cur > t2.cur OR (t1.cur = t2.cur AND t1.id_new < t2.id_new))
  );

# Delete stale values...
DELETE
  FROM my_table_1
  WHERE id_new IN (SELECT id_new FROM temp_table_1);
# Again...
DELETE
  FROM my_table_2
  WHERE id_new IN (SELECT id_new FROM temp_table_2);

# The helper indexes are no longer needed.
ALTER TABLE `my_table_1` DROP INDEX `tmp_id_idx`;
ALTER TABLE `my_table_2` DROP INDEX `tmp_id_idx`;
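
# Optional check before touching the columns (a sketch, not one of the required
# steps): both queries should return no rows if every id is now unique.
SELECT `id`, COUNT(*) AS `copies`
  FROM `my_table_1`
  GROUP BY `id`
  HAVING COUNT(*) > 1;
SELECT `id`, COUNT(*) AS `copies`
  FROM `my_table_2`
  GROUP BY `id`
  HAVING COUNT(*) > 1;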

# Next, drop the surrogate id_new columns and make the original id the primary
# key. ADD PRIMARY KEY fails with a duplicate-entry error if any id still repeats.
ALTER TABLE
  `my_table_1`
DROP `id_new`;

ALTER TABLE
  `my_table_1`
ADD PRIMARY KEY ( `id` );

ALTER TABLE
  `my_table_2`
DROP `id_new`;

ALTER TABLE
  `my_table_2`
ADD PRIMARY KEY ( `id` );

# Optional. We're done with these tables but you can drop or keep them if you want.
DROP TABLE IF EXISTS temp_table_1;
DROP TABLE IF EXISTS temp_table_2;
DROP TABLE IF EXISTS temp_table_backup;
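
As a side note, here is a minimal sketch of how the "which row wins" rule for the second table could be previewed in a single query. It assumes MySQL 8.0 or later (window functions aren't available before that) and the same placeholder table name, my_table_2, used above; row_pref and ranked are names made up for this example. It only reads data, so it is best run before the deletes to see which row would be kept for each id:

```sql
# Preview only: lists the one row per id that the rule prefers.
SELECT id, cur, mdl
FROM (
  SELECT
    id,
    cur,
    mdl,
    ROW_NUMBER() OVER (
      PARTITION BY id
      # Rows with a non-default mdl come first, then the highest cur.
      ORDER BY (mdl <> 'default.mdl') DESC, cur DESC
    ) AS row_pref
  FROM my_table_2
) AS ranked
WHERE row_pref = 1;
```

Rows that tie on both criteria are numbered in an unspecified order, so add another ORDER BY column if you need a deterministic tie-breaker.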
Licensed under: CC-BY-SA with attribution