Wednesday, March 28, 2012

optimizing a query to delete duplicates

I have a DELETE statement that deletes duplicate data from a table. It
takes a long time to execute, so I thought I'd seek advice here. The
structure of the table is little funny. The following is NOT the table,
but the representation of the data in the table:

+----+
| a | b |
+--+--+
| 123 | 234 |
| 345 | 456 |
| 123 | 123 |
+--+--+

As you can see, the data is tabular. This is how it is stored in the table:

+--+----+----+
| Row | FieldName | FieldValue |
+--+----+----+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
| 3 | a | 123 |
| 3 | b | 234 |
+--+----+----+

What I need is to delete all records having the same "Row" when there exists
the same set of records with a different (smaller, to be precise) "Row".
Using the example above, what I need to get is:

+--+----+----+
| Row | FieldName | FieldValue |
+--+----+----+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
+--+----+----+

A slow way of doing this seem to be:

DELETE FROM X
WHERE Row IN
(SELECT DISTINCT Row FROM X x1
WHERE EXISTS
(SELECT * FROM X x2
WHERE x2.Row < x1.Row
AND NOT EXISTS
(SELECT * FROM X x3
WHERE x3.Row = x2.Row
AND x3.FieldName = x2.FieldName
AND x3.FieldValue <> x1.FieldValue)))

Can this be done faster, better, and cheaper?my knee-jerk reaction is:

Why is it important to optimize it? I think you should delete the
duplicates, then create a constraint that prevents them from recurring.

If, for some reason, you are unable to fix the application that creates
these duplicates, and creating a constraint causes errors in the application
that you can't tolerate, then I suppose an alternative would be to create a
trigger that deletes them upon entry. Having a composite index on the
columns that are being duplicated would enable such a trigger to run
quickly.

But looking at your query, I find it strangely complex.

Why not just:

DELETE FROM X
WHERE EXISTS (SELECT * FROM X x2
WHERE x2.Row < x.Row
AND X.FieldName = x2.FieldName
AND X.FieldValue = x2.FieldValue)

Am I missing something? Your NOT EXISTS has me a bit confused... I think it
might delete data in situations other than described.

Also, NOT EXISTS is generally slow.|||On 2004-07-15, Aaron W. West <tallpeak@.hotmail.NO.SPAM> wrote:
> Why is it important to optimize it? I think you should delete the
> duplicates, then create a constraint that prevents them from recurring.

Such constraint may not be created. This table is a temporary table, where
data from an input file is loaded. Duplicate sets of records must be
deleted because the data then goes into permanent tables. Those table have
constraints against duplicates.

> But looking at your query, I find it strangely complex.

Me too. I'm trying to improve it. Its complexity seems to hinder its
performance.

> Why not just:
> DELETE FROM X
> WHERE EXISTS (SELECT * FROM X x2
> WHERE x2.Row < x.Row
> AND X.FieldName = x2.FieldName
> AND X.FieldValue = x2.FieldValue)

This would delete records that should not be deleted. Here's an example:

+--+----+----+
| Row | FieldName | FieldValue |
+--+----+----+
| 1 | a | 123 |
| 1 | b | 234 |
| 2 | a | 345 |
| 2 | b | 456 |
| 3 | a | 123 |
| 3 | b | 666 |
+--+----+----+

Here the combination of values for "a" and "b" on every "Row" is
different. There are no duplicates here. The query that you proposed would
delete the second to last row

+--+----+----+
| 3 | a | 123 |
+--+----+----+

because it has the same FieldName and FieldValue as the first row.

Think of it the data this way:

+--+--+
| a | b |
+--+--+
| 123 | 234 |
| 345 | 456 |
| 123 | 666 |
+--+--+

No duplicate rows here.|||Hi

You could try only selecting the correct data when you move it into the
permanent tables. But the following may work better:

DELETE FROM X1
FROM X X1 JOIN X X2
ON x2.Row < x1.Row
AND x1.Fieldvalue = x2.Fieldvalue
AND x1.FieldName = x2.FieldName

John

"Alexander Anderson" <no@.spam.com> wrote in message
news:slrncfe0ft.mk1.alex@.Toronto-HSE-ppp3682122.sympatico.ca...
> I have a DELETE statement that deletes duplicate data from a table. It
> takes a long time to execute, so I thought I'd seek advice here. The
> structure of the table is little funny. The following is NOT the table,
> but the representation of the data in the table:
> +----+
> | a | b |
> +--+--+
> | 123 | 234 |
> | 345 | 456 |
> | 123 | 123 |
> +--+--+
> As you can see, the data is tabular. This is how it is stored in the
table:
> +--+----+----+
> | Row | FieldName | FieldValue |
> +--+----+----+
> | 1 | a | 123 |
> | 1 | b | 234 |
> | 2 | a | 345 |
> | 2 | b | 456 |
> | 3 | a | 123 |
> | 3 | b | 234 |
> +--+----+----+
> What I need is to delete all records having the same "Row" when there
exists
> the same set of records with a different (smaller, to be precise) "Row".
> Using the example above, what I need to get is:
> +--+----+----+
> | Row | FieldName | FieldValue |
> +--+----+----+
> | 1 | a | 123 |
> | 1 | b | 234 |
> | 2 | a | 345 |
> | 2 | b | 456 |
> +--+----+----+
> A slow way of doing this seem to be:
> DELETE FROM X
> WHERE Row IN
> (SELECT DISTINCT Row FROM X x1
> WHERE EXISTS
> (SELECT * FROM X x2
> WHERE x2.Row < x1.Row
> AND NOT EXISTS
> (SELECT * FROM X x3
> WHERE x3.Row = x2.Row
> AND x3.FieldName = x2.FieldName
> AND x3.FieldValue <> x1.FieldValue)))
> Can this be done faster, better, and cheaper?

No comments:

Post a Comment