Scalar VS Tabular Function¶

Scalar Functions¶

Scalar functions (sometimes referred to as User-Defined Functions / UDFs) return a single value as a return value, not as a result set, and can be used in most places within a query or SET statement, except for the FROM clause (and maybe other places?). Also, scalar functions can be called via EXEC, just like Stored Procedures, though there are not many occasions to make use of this ability.

Table-Valued Functions¶

Table-Valued Functions (TVFs) return result sets, and can be used in a FROM clause, JOIN, or CROSS APPLY / OUTER APPLY of any query, but unlike simple Views, cannot be the target of any DML statements (INSERT / UPDATE / DELETE). These can also be created in both T-SQL and SQLCLR.

Introduction¶

Table-Valued Functions have been around since SQL Server version 2005. Basically a Table-Valued Function is a function that returns a table, thus it can be used as a table in a query.

First sounds like nothing new since a view has been an available mechanism for a much longer time. That’s true, partly. Function has the same benefits as a view when it comes to data protection enforced via privileges or simplifying a query. However, a Table-Valued Function has at least few advantages:

Parameterization, a function can receive parameters so the logic inside the function can be adjusted better than using traditional predicate pushing. Programmability, a view can have certain amount of logic (calculations, case-structures etc.) but it’s still quite column bound so more complex logic is hard or impossible to create.

Creating a Simple Table-Valued Function with (some kind of) Logic¶

First, let's create a small table to store some data:

CREATE TABLE TrackingItem (
   Id       int  NOT NULL IDENTITY(1,1),
   Issued   date NOT NULL,
   Category int  NOT NULL
);
CREATE INDEX X_TrackingItem_Issued ON TrackingItem (Issued);

And then add few rows for test data:

INSERT INTO TrackingItem (Issued, Category) VALUES ( DATEADD( day, 0, GETDATE()), 1);
INSERT INTO TrackingItem (Issued, Category) VALUES ( DATEADD( day, 1, GETDATE()), 2);
INSERT INTO TrackingItem (Issued, Category) VALUES ( DATEADD( day, 4, GETDATE()), 1);
INSERT INTO TrackingItem (Issued, Category) VALUES ( DATEADD( day, 4, GETDATE()), 2);

Now, if we would need a result set which would:

Include all the columns from TrackingTable

Include an extra Modified (date) column

Not have even numbers in Category

The Modified-column indicates when the changes into the data have been made Return only TrackingItem-rows having the Id greater than or equal to the parameter passed.

The Table-Valued Function could look like this:

CREATE FUNCTION TrackingItemsModified(@minId int)
RETURNS @trackingItems TABLE (
   Id       int      NOT NULL,
   Issued   date     NOT NULL,
   Category int      NOT NULL,
   Modified datetime NULL
) 
AS
BEGIN
   INSERT INTO @trackingItems (Id, Issued, Category)
   SELECT ti.Id, ti.Issued, ti.Category 
   FROM   TrackingItem ti
   WHERE  ti.Id >= @minId; 

   UPDATE @trackingItems
   SET Category = Category + 1,
       Modified = GETDATE()
   WHERE Category%2 = 0;

   RETURN;
END;

The function defines a new table called @trackingItems. This is a temporary table stored in the tempdb. The contents of this table will be return value for the function when the function exits.

First, the function inserts all the desired rows from the TrackingItem-table to the temporary table. After that, the contents of the temporary table are modified based on the specifications and then returned to the caller.

Using the Function¶

The next step is to use the function. If we want to select all the rows having Id equal to 2 or more, the query would look like:

1	`SELECT * FROM TrackingItemsModified(2);`

And the results:

Id  Issued      Category  Modified
--  ----------  --------  -----------------------
2   2011-03-11  3         2011-03-10 23:46:53.523
3   2011-03-14  1         NULL
4   2011-03-14  3         2011-03-10 23:46:53.523

As the result is a table, it can be used like one. For example, if we want to query all the original tracking items that don’t exist in this subset, the query could be:

SELECT *
FROM  TrackingItem ti
WHERE ti.Id NOT IN (SELECT tim.Id
                    FROM   TrackingItemsModified(2) tim)

An the results would be:

Id  Issued      Category
--  ----------  --------
1   2011-03-10  1

Generating Data¶

So, Table-Valued Functions can be used to return modified data from one or more tables in the database. But since they are programmable functions, they can also generate data.

One quite common problem is to query all dates from a specified period and then have some results from a table which doesn’t have entries for all the dates. In our test data, there is a row for today and tomorrow but the next few dates are missing. So, if we want to get the amount of tracking items for each day for the next seven days, it wouldn’t be so simple. One typical solution is to create a table that contains all the necessary dates and then use that table in the query. Table-Valued Function can be used as an alternative. If we pass the date range to a function, we can create the necessary data on-the-fly with a simple loop.

CREATE FUNCTION DatesBetween(@startDate date, @endDate date)
RETURNS @dates TABLE (
   DateValue date NOT NULL
) 
AS
BEGIN
   WHILE (@startDate <= @endDate) BEGIN
      INSERT INTO @dates VALUES (@startDate);
      SET @startDate = DATEADD(day, 1, @startDate);
   END;

   RETURN;
END;

And the query for the TrackingItem amounts would be:

SELECT d.DateValue,
       (SELECT COUNT(*)
        FROM   TrackingItem ti
        WHERE  d.DateValue = ti.Issued) AS Items
FROM DatesBetween(DATEADD(day, 1, GETDATE()), DATEADD(day, 7, GETDATE())) d
ORDER BY d.DateValue;

So the results would be something like:

DateValue   Items
----------  ------
2011-03-12  1
2011-03-13  0
2011-03-14  0
2011-03-15  2
2011-03-16  0
2011-03-17  0
2011-03-18  0

What About Performance¶

Since this is a procedural approach. The performance won’t be as good as it would be using a good, set-based approach. However, since functions can provide more flexibility from the programming point of view, let’s have a look at larger amounts of data.

First, we’ll ensure that the statistics are fine and then see what happens if we take the same query but for a period of 50 years:

UPDATE STATISTICS TrackingItem WITH FULLSCAN;

SELECT d.DateValue,
       (SELECT COUNT(*)
        FROM   TrackingItem ti
        WHERE  d.DateValue = ti.Issued) AS Items
FROM DatesBetween(DATEADD(YEAR, -25, GETDATE()), DATEADD(YEAR, 25, GETDATE())) d
ORDER BY d.DateValue;

The query plan looks like:

DBAImg

Not very good, we’re going to do scans to tracking items in a loop based on the days returned from the function. Execution statistics verify this:

Table 'TrackingItem'. Scan count 18264, logical reads 36528, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
Table '#24927208'. Scan count 1, logical reads 30, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 4087 ms, elapsed time = 4232 ms.

Now, if we add a clustered primary key index to the table (which is a great opportunity in certain situations), we will first re-create the function:

DROP FUNCTION DatesBetween;

CREATE FUNCTION DatesBetween(@startDate date, @endDate date)
RETURNS @dates TABLE (
   DateValue date NOT NULL PRIMARY KEY CLUSTERED
) 
AS
BEGIN
   WHILE (@startDate <= @endDate) BEGIN
      INSERT INTO @dates VALUES (@startDate);
      SET @startDate = DATEADD(day, 1, @startDate);
   END;

   RETURN;
END;

So what happens with the exact same query:

DBAImg

The plan has changed since the new index is taken into account, but the approach is still the same.

Statistics:

Table 'TrackingItem'. Scan count 18264, logical reads 36528, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
Table '#33D4B598'. Scan count 1, logical reads 30, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 4165 ms, elapsed time = 4304 ms.

It seems that the problem lies inside the function. Let’s separate the date generation and see how it goes:

SELECT * 
FROM DatesBetween(DATEADD(YEAR, -25, GETDATE()), DATEADD(YEAR, 25, GETDATE())) d

The plan:

DBAImg

And the statistics:

Table '#33D4B598'. Scan count 1, logical reads 30, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 4103 ms, elapsed time = 4268 ms.

So what’s happening? Few things:

The statistics we see do not include the whole operation, just getting the results from the function (and in the previous examples, joining the data). There’s absolutely no way that a single scan reading just 30 pages would take 4 seconds. It’s the data generation and inserts to the temporary table that's taking the time.

A bad plan is chosen simply because the optimizer cannot know how many rows are going to be returned. There’s no statistical data to use since we’re not querying any table (actually this effect would happen in all functions that contain multiple statements). So, the optimizer assumes that the result set is going to be small.

Having a look at the statistics for the Table-Valued Function access verifies this (note the Estimated Number of Rows):

DBAImg

Unfortunately, there’s no way to tell the optimizer (during the optimization phase) an estimate for the row count. This feature would be a great thing to have in the future.

Let’s take a look other way round: Use small amounts of dates against a big table. First, let’s add a few rows to the TrackingItem-table. 100’000 rows randomly for the next 50 years:

SET NOCOUNT ON
DECLARE @counter int = 0;
BEGIN
   WHILE (@counter < 100000) BEGIN
      INSERT INTO TrackingItem (Issued, Category) 
      VALUES ( DATEADD( day, RAND() * 365 * 50, GETDATE()), 1);

      SET @counter = @counter + 1;
   END;
END;

And now let’s update the statistics again and query for few days in the far future:

UPDATE STATISTICS TrackingItem WITH FULLSCAN;

SELECT d.DateValue,
       (SELECT COUNT(*)
        FROM   TrackingItem ti
        WHERE  d.DateValue = ti.Issued) AS Items
FROM DatesBetween(DATEADD(MONTH, 101, GETDATE()), _
            DATEADD(MONTH, 102, GETDATE())) d
ORDER BY d.DateValue;

Results (partial):

DateValue   Items
----------  -----
2019-08-11  7
2019-08-12  6
2019-08-13  2
2019-08-14  3
2019-08-15  5
2019-08-16  1
2019-08-17  5
2019-08-18  7
...

The plan:

DBAImg

And the statistics:

Table 'TrackingItem'. Scan count 32, logical reads 97, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
Table '#33D4B598'. Scan count 1, logical reads 2, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 104 ms.

So the plan was quite good when there were relatively small amounts of rows coming out of the function.

When Creating a Table Valued Function in Management Studio¶

The easiest way to start creating a table valued function (well, actually any function or procedure) is to use SQL Server Management Studio and start creating an anonymous T-SQL block. Let's take an example with the DatesBetween function. The following will create and fill a table with 10'000 dates:

DECLARE @startDate DATETIME = GETDATE();
DECLARE @endDate DATETIME = DATEADD(day, 9999, GETDATE());
DECLARE @dates 
   TABLE (
      DateValue date NOT NULL
   );
BEGIN
   WHILE (@startDate <= @endDate) BEGIN
      INSERT INTO @dates VALUES (@startDate);
      SET @startDate = DATEADD(day, 1, @startDate);
   END;
END;

On my test machine, this takes roughly 5 seconds to complete. Not very good at all. The problem is that every time the insert is made, the database engine returns a message:

1	`(1 row(s) affected)`

Even worse, because I had both statistics time and io set on, every insert returns information about the statistics:

SQL Server Execution Times: 
   CPU time = 0 ms,  elapsed time = 0 ms.
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
Table '#286302EC'. Scan count 0, logical reads 1, physical reads 0, 
                   read-ahead reads 0, lob logical reads 0, 
                   lob physical reads 0, lob read-ahead reads 0.
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

So I'm getting 10'000 of these messages in the message tab and filling the messages takes time. So if I set both statistics off and use:

1	`SET NOCOUNT ON`

in the beginning of the script, the execution is done in less than a second.

This is good to bear in mind when using anonymous blocks in the Management Studio. The same effect won't happen when the T-SQL block is inside a function and the function is called from the Management Studio since only the messages for the calling SQL statement will be returned, not from the inside of the function.

Conclusions¶

Table-Valued Function is a good alternative for a view or an extra table when parameterization is needed or complex logic is included and especially when the amount of data returning from the function is relatively small.

For example, if the date generator should be able to generate dates for different granularities (days, working days, weekends, etc.), we would possibly need several tables or tagged rows if the traditional approach is used.