I need to remove numbers from a string while processing some data as part of an ETL process in a data warehouse. I have to do this for many columns and potentially many tables so I decided to put the logic into a Scalar-Valued Function (SVF), understanding there are performance detriments when using SVFs. Generally this will be used on relatively small data sets for cleaning “codes” that will be insert into a slowly changing dimension. I was curious about how to do this. My instinct was to do it using nested replace functions. I did a quick google search to see what options the collective would recommend. I came across a post on Stack Overflow that mentioned the nested replace functions, and then someone also mentioned using a recursive Common Table Expression (CTE), which I thought was a creative suggestion.
https://stackoverflow.com/questions/13240298/remove-numbers-from-string-sql-server/
I have a soft spot in my heart for recursive CTEs so I thought I’d try both options and see which solution performs better. Let’s start with building the functions:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 |
--create Nested function create function udf_RemoveNumbers_Nested ( -- Add the parameters for the function here @String varchar(200) ) returns varchar(200) as begin -- Declare the return variable here declare @Result varchar(200); -- Add the T-SQL statements to compute the return value here select @Result = replace( replace( replace( replace( replace( replace( replace( replace( replace( replace( @String, '0', '' ) , '1', '' ) , '2', '' ) , '3', '' ) , '4', '' ) , '5', '' ) , '6', '' ) , '7', '' ) , '8', '' ) , '9', '' ); -- Return the result of the function return @Result; end; go --creat CTE function alter function udf_RemoveNumbers_CTE ( -- Add the parameters for the function here @String varchar(200) ) returns varchar(200) as begin -- Declare the return variable here declare @Result varchar(200); -- Add the T-SQL statements to compute the return value here ; with cte as ( select @String as string , 0 as n union all select convert( varchar(200), replace( string, n, '' )) , n + 1 from cte where n <= 9 ) select @Result = ( select top 1 string from cte order by n desc ); -- Return the result of the function return @Result; end; go |
Now that we have some functions we need some test data so let’s create a simple table and populate it with simi-random data by leveraging the newid() function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
--test table create table dbo.test ( string varchar(200) ); go --insert data declare @n int = 0; while @n < 1000 begin insert into dbo.test ( string ) select convert( varchar(200), newid()); set @n = @n + 1; end; |
Run the test by calling each function against the test table. I made the calls as separate batches to the server so that I can grab some execution query statistics afterwards. I ran the test 10 times to get a satisfactory sample set for the performance test.
1 2 3 4 5 |
--test select dbo.udf_RemoveNumbers_Nested(string) from dbo.test; go select dbo.udf_RemoveNumbers_CTE(string) from dbo.test; go |
I measured performance by utilizing the execution query statistics. You can not easily do this any other way. The execution plan and IO statistics for scalar value functions are nearly useless.
1 2 3 4 5 6 7 8 9 10 11 |
--get performance SELECT text , execution_count , total_logical_reads / execution_count as AvgReads , total_worker_time / execution_count / 1000 as AvgWorkersTime_ms , total_elapsed_time / execution_count / 1000 as AvgElapsedTime_ms FROM sys.dm_exec_query_stats CROSS APPLY sys.dm_exec_sql_text(sql_handle) WHERE text LIKE '%udf_RemoveNumbers_%' OPTION (RECOMPILE) |
Here is the result. As you can see, the nested replace function performs much better than the CTE. I hypothesized this would be the case, but it’s nice to be able to prove it so you can feel comfortable moving forward with a solution.
Type | Avg Reads | Avg Workers Time (ms) | Avg Elapsed Time (ms) |
---|---|---|---|
CTE | 102,010 | 681 | 746 |
Nested | 9 | 25 | 25 |
To learn more about performance of user defined functions (UDFs) in general, see this great post on mssqltips:
https://www.mssqltips.com/sqlservertip/4689/understand-the-performance-behavior-of-sql-server-scalar-user-defined-functions/