Frage

I have a table containing about 500 points and am looking for duplicates within a tolerance. This takes less than a second and gives me 500 rows. Most have a distance of zero because it gives the same point (PointA = PointB)

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   -- AND
   -- PointA.ObjectId <> PointB.ObjectID
ORDER BY ObjectIDa

If I use the commented out lines near the bottom, I get 14 rows but the execution time goes up to 14 seconds. Not that big a deal until my point table expands to 10's of thousands.

I apologize in advance if the answer is already out there. I did look, but being new I get lost reading posts which are way over my head.

ADDENDUM: ObjectID is a bigint and the PK for the table, so I realized that I could change the statement to

AND PointA.ObjectID > PointB.ObjectID

This now takes half the time and gives me half the results (7 rows in 7 seconds). I now don't get duplicates (as in Point 4 is close to Point 8 followed by Point 8 is close to Point 4). However the performance still concerns me as the table will be very large, so any performance issues will become problems.

ADDENDUM 2: Changing the order of the JOIN and AND (or WHERE as suggested) as below makes no difference either.

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.ObjectId < PointB.ObjectID
    WHERE
    PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa

I find it fascinating that I can change the @Tol value to something large that returns over 100 rows with no change in performance even though it requires many computations. But then adding a simple A

War es hilfreich?

Lösung

This is a fun question.

It's not unrealistic that you get a large performance improvement by changing from "<>" to ">".

As others have mentioned, the trick is to get the most out of your indexes. Certainly by using ">", you should easily get the server to limit to that specific range on your PK - avoiding looking "backwards" when you've already checked looking "forwards".

This improvement will scale - will help as you add rows. But you're right to worry it won't help prevent any increase in work. As you're correctly thinking, as long as you have to scan a larger number of rows, it will take longer. And that's the case here because we always want to compare everything.

If the first part is looking good, just the TOL check, have you thought about splitting out the second part entirely?

Change the first part to dump into a temp table as

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST

into #AllDuplicatesWithRepeats

FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON 
    PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa

And they you can write the direct query that skips duplicates, below. It isn't special, but against that small set in the temp table it should be perfectly speedy.

Select
    *
from    
    #AllDuplicatesWithRepeats d1
        left join #AllDuplicatesWithRepeats d2 on (
                        d1.objectIDa = d2.objectIDb
                        and
                        d1.objectIDb = d2.objectIDa
                        )
where
    d2.objectIDb is null

Andere Tipps

The execution plan is probably doing something behind the scenes when you add in the ObjectID comparison. Check the execution plan to see if the two different versions of the query are, for example, using an index seek vs. a table scan. If so, consider experimenting with query hints.

As a workaround, you could always use a subquery:

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    ObjectIDa,
    PTNameA,
    PTdescA,
    ObjectIDb,
    PTNameB,
    PTdescB,
    DIST
FROM
(
SELECT 
  PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   -- AND
   -- PointA.ObjectId <> PointB.ObjectID
) Subquery
WHERE ObjectIDa <> ObjectIDb
ORDER BY ObjectIDa

Try using PointA.ObjectId <> PointB.ObjectID with a WHERE clause between the JOIN and the ORDER BY clause.

Like so:

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
WHERE PointA.ObjectId <> PointB.ObjectID
ORDER BY ObjectIDa

With kudos to @Mike_M here is the edited Select which runs in 2 seconds.

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST

into #AllDuplicatesWithRepeats

FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL  
ORDER BY ObjectIDa

Select
    *
from    
    #AllDuplicatesWithRepeats d1
Where
    d1.ObjectIDa < d1.ObjectIDb
Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top