Wednesday, March 31, 2010

Deleting duplicate rows in Oracle

Deleting duplicate rows in Oracle seems to be the simplest to me (may be because I have used it more or may be its quite simple really!). Lets get down straight to the code and do a recap of how to delete duplicate rows in Oracle.

Simple Solution
Note that this simple solution assumes that there are no primary or unique keys on the table but the combination of data in a set of columns defines the duplicates. The example shows three key columns, but you can use more or less columns to identify unique rows based on your actual data.

DELETE FROM data_table
WHERE rowid not in
(SELECT MIN(rowid)
FROM data_table
GROUP BY keycol1, keycol2, keycol3

However, there could be cases where the table has a sequence / auto-generated column as the primary key. In such cases too, you can use the above query after substituting rowid with the id column name.

Advanced Solution
If you are looking for an advanced and faster solution with more features & flexibility, you can use the following solution that I found on devx. This solution uses RANK() function's capabilities to provide the extra features.

FROM data_table
(SELECT tmp_key
(SELECT tmp_key, keycol1, keycol2,
RANK() OVER (PARTITION BY keycol1, keycol2 ORDER BY tmp_key) AS seq_num
(SELECT rowid as tmp_key, keycol1, keycol2
FROM data_table
WHERE (keycol1, keycol2) IN
(SELECT keycol1, keycol2
FROM data_table
GROUP BY keycol1, keycol2
HAVING COUNT(0) > 1)))
WHERE seq_num > 1)

Let me break up this complicated query into parts and explain in detail.
  1. The innermost sub-query (lines 12 - 15) finds those records that have duplicates. This gives us the sub-set of data that are only duplicate rows.
  2. The next outer sub-query (lines 9 - 11) uses ROWD psuedo-column to create a single key column. This adds a key column to the set of data rows we are interested in.
  3. The next outer sub-query (lines 6 - 8) uses RANK() function to dynamically assign sequence numbers to rows in the set of duplicate rows and orders them. Note that there are a set of sequence numbers each for each set of duplicate rows. This assigns a sequence number to and orders the rows inside each set of duplicates.
  4. The next outer sub-query (lines 4, 5 & 16) selects the duplicate rows based on the sequence number assigned by the RANK function is above step.
  5. The outer-most query (lines 1 - 3) deletes the duplicate rows.

Note: In case the table has a unique / primary key, substitute that instead of the ROWID psuedo-column in the above solution to get the same results.

Now, the extra features that using RANK provides are that you can choose the number of rows you want to keep. For example, you want to keep only the top 10 rows, you can change the numbers in lines 15 & 16 of the above query from 1 to 10 and add a proper ORDER BY to line 7 to get the data sorted in the required manner to determine the "top 10 rows".

Similarly, if you want to keep only the latest row, you can add a proper ORDER BY clause to line 7 to get the data sorted in the required manner to determine the latest row.

Remember to change the table & column names in the above queries.
Replace data_table with your actual table name and
keycol1..n with the actual table column names!

Check out my next post here for doing this same thing in Sybase.

No comments:


Related Posts Plugin for WordPress, Blogger...