Removing Duplicate DB Records

While working on a Microsoft SQL Server DTS package the other day, I needed to handle the possibility of duplicate records coming from an external data source. While totally forgetting the more obvious solutions, I developed an alternative approach that takes advantage of the very useful properties of certain mathematical rings (to be arcane).

While massaging data in a temporary table before going to the final destination, I needed to find and delete any duplicate records (records with matching UserID fields). I added two integer fields to the temporary table, CounterID and DuplicateCount. CounterID was an identity column. This is similar to the AutoNumber of Microsoft Access: the database server will automatically choose a unique integer, in increasing numerical order, for each record. After inserting the data, ordered by UserID, I ran these queries (table name changed for security):

  1. UPDATE DTS_Temp1
  2. SET DuplicateCount = (
  3.     SELECT Count(sub.UserID) FROM DTS_Temp1 sub
  4.     WHERE sub.UserID = DTS_Temp1.UserID
  5. )
  6.  
  7. DELETE FROM DTS_Temp1
  8. WHERE CounterID % DuplicateCount <> 0

The UPDATE query counts the number of duplicate records found for each UserID and puts that count in the DuplicateCount field of the appropriate records. The DELETE query is the interesting part. We use the modulo operator (the % symbol) to delete any record whose CounterID field is not a multiple of the number of duplicates. This works because (1) we inserted data into this table ordered by UserID so all duplicate records have adjacent CounterID values, such as 3–4–5, and (2) any range of n integers corresponding to n duplicate records will contain a multiple of n.

For the sake of practicality and maintainability, I ended up implementing the standard correlated subquery method of deleting duplicates:

  1. DELETE FROM DTS_Temp1
  2. WHERE CounterID >
  3.  ( SELECT min(sub.CounterID)
  4.    FROM DTS_Temp1 AS sub
  5.    WHERE sub.UserID = DTS_Temp1.UserID )

Leave a Reply

You must be logged in to post a comment.