Question

Starting from: X can have zero or more Y; X is associated with a particular Z.

My function counts all X that have no Y and are associated with a particular Z. This is my UDF:

ALTER FUNCTION [dbo].[CountXWithNoYByZ]
(
    @zUID int
)
RETURNS int
AS
BEGIN

DECLARE @result int

IF (@zUID IS NULL)
    RETURN -1;

SET @result = 
(
    SELECT COUNT(DISTINCT X.UID)
    FROM X
    INNER JOIN XZ ON X.UID = XZ.X_UID
    LEFT JOIN Y ON X.UID = Y.X_UID
    WHERE (Y.UID IS NULL) AND (XZ.Z_UID = @zUID)
)

RETURN @result
END

The usage of this function:

 DECLARE @myCount int
 SET @myCount = dbo.CountXWithNoYByZ(@zUID)
 SELECT @myCount

This function is very slow for me (about 8 seconds with roughly 10,000 records in the X table and roughly 20,000 records in the Y table) when I call it as a scalar function, but the same query takes less than 1 second when I run it outside the function. Why?

NOTE: I am aware that a UDF can be slow when it is used inside a SELECT, because it runs once per row, but I don't use it that way; it runs only once, assigned with SET in a stored procedure for statistics purposes (alongside other functions that have no performance problems).

EDIT: Well, I restarted SQL Server and now it is faster, but that does not mean the case is solved...

I am new to this, but I am trying to attach the execution plan... hope it helps! Estimated execution plan


Solution

Several thoughts/things I notice:

  1. Your query plan uses only scans (no seeks) to retrieve its data. An index scan is often only marginally better than a table scan in terms of performance, and a "clustered index scan" is effectively a table scan. It would be interesting to compare this plan with the (presumably much more efficient) one you get when running the SQL statements in-line, rather than as a function.

  2. (Accepted answer) In cases like this, where a query's performance varies widely depending on how it is run, the cause can be an unholy alliance between unevenly distributed data and cached query plans. A little background: SQL Server supports an optimization called "parameter sniffing", where it chooses a plan based on the particular parameter values it sees when the query is compiled. If you say "WHERE Breed='Pomeranian'" and there are only 5, it will use one plan, but if you say "WHERE Breed='Mutt'" and there are 10,000, it will use a different plan. The trouble comes when the plan is not rebuilt for the new value and a cached one is reused instead, resulting in the Pomeranian plan being applied to the Mutt query. With functions and stored procs, if you want to ensure the plan fits the current values, you must force a recompile of the plan on every execution. (This has a cost in and of itself, though, so you should only do it when you have a known performance issue due to this particular cause.) For a function, you would include a "RECOMPILE" hint in the relevant query; for a stored proc, you can specify "WITH RECOMPILE" right in the CREATE PROC statement. Very good link about this here

  3. You could experiment with some different approaches for your query structure itself: a JOIN + DISTINCT can be a resource hog if the JOIN creates a bunch of duplicate results and the DISTINCT just throws them all away again. This is especially possible in your case. Why join to every single one of those Y rows when you are, in fact, actively uninterested in any row where the join succeeds? A "NOT EXISTS" may be faster (depending on how many needlessly joined child rows we're talking about) because it can stop probing Y as soon as it finds the first match.

Something like this:

SELECT X.UID
FROM X
  INNER JOIN XZ ON X.UID = XZ.X_UID
WHERE (XZ.Z_UID = @zUID)
  AND NOT EXISTS (SELECT 1 FROM Y WHERE X.UID = Y.X_UID)
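
Since your function returns a count rather than the rows themselves, the same shape could be wrapped in an aggregate. This is only a sketch against the table and column names from your question; the value assigned to @zUID is a made-up placeholder, and I kept DISTINCT on the assumption that XZ might contain the same X/Z pair more than once (if it cannot, a plain COUNT(*) should be enough).

```sql
-- Sketch: count version of the NOT EXISTS query.
-- Assumes the X, XZ and Y tables and columns from the question.
DECLARE @zUID int;
SET @zUID = 1;  -- hypothetical Z identifier, replace with a real one

SELECT COUNT(DISTINCT X.UID) AS XWithNoY
FROM X
  INNER JOIN XZ ON X.UID = XZ.X_UID
WHERE (XZ.Z_UID = @zUID)
  -- Anti-join: keep only the X rows that have no matching Y row
  AND NOT EXISTS (SELECT 1 FROM Y WHERE Y.X_UID = X.UID);
```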

You would have to do some benchmarking, because of course this may also give you worse performance, depending on your data.
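
For point 2, a recompile hint inside your function might look roughly like the following. Treat it as a sketch rather than a drop-in fix: I have not run it against your schema, and the query itself is left exactly as in your question so that the hint is the only change. The one structural difference is that the assignment uses SELECT instead of SET, on the assumption that the OPTION clause has to be attached to the query statement itself.

```sql
-- Sketch: the function from the question with a statement-level recompile hint.
-- Assumes the X, XZ and Y tables and columns from the question.
ALTER FUNCTION [dbo].[CountXWithNoYByZ]
(
    @zUID int
)
RETURNS int
AS
BEGIN
    DECLARE @result int;

    IF (@zUID IS NULL)
        RETURN -1;

    -- SELECT assignment instead of SET, so that the OPTION clause
    -- can be attached to the query.
    SELECT @result = COUNT(DISTINCT X.UID)
    FROM X
        INNER JOIN XZ ON X.UID = XZ.X_UID
        LEFT JOIN Y ON X.UID = Y.X_UID
    WHERE (Y.UID IS NULL) AND (XZ.Z_UID = @zUID)
    OPTION (RECOMPILE);  -- build a fresh plan for this statement on every call

    RETURN @result;
END
```

To see whether it makes any difference, you could call the function before and after the change with SET STATISTICS TIME ON in the same session and compare the reported elapsed times. If the numbers don't move, stale plans are probably not your problem, and the hint is only costing you compile time.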

Licensed under: CC-BY-SA with attribution