Tag Archives: tsql

Remove Numbers from Text SVF, Nested vs CTE

I need to remove numbers from a string while processing some data  as part of an ETL process in a data warehouse.  I have to do this for many columns and potentially many tables so I decided to put the logic into a Scalar-Valued Function (SVF), understanding there are performance detriments when using SVFs. Generally this will be used on relatively small data sets for cleaning “codes” that will be insert into a slowly changing dimension.  I was curious about how to do this.  My instinct was to do it using nested replace functions.  I did a quick google search to see what options the collective would recommend.  I came across a post on Stack Overflow that mentioned the nested replace functions, and then someone also mentioned using a recursive Common Table Expression (CTE), which I thought was a creative suggestion.

https://stackoverflow.com/questions/13240298/remove-numbers-from-string-sql-server/

I have a soft spot in my heart for recursive CTEs so I thought I’d try both options and see which solution performs better.  Let’s start with building the functions:

Now that we have some functions we need some test data so let’s create a simple table and populate it with simi-random data by leveraging the newid() function.

Continue reading Remove Numbers from Text SVF, Nested vs CTE

Search for Column Name in All Tables in All Databases

Use a text wild card to search for a column name in all tables in each database.  Stored procedure sp_msforeachdb is used to cycle through each database.  The sys tables are used to obtain table and column information.  The output will include Database Name, Schema Name, Table Name, Column Name, and code to easily select the top 10 from a table returned in the result.

Row Count on Large Tables Fast

There are ~550 million records in a table I am trying to get a row count on.  Typically I have always used the “COUNT(*)” method to get a row count. I was curious if there is a faster way.  I had read in some forums that “COUNT(1)” or “COUNT(Primary_Key)” could be theoretically faster but there was some argument on the topic.  I decided to run execution plans on all three to see how the query engine is handling it.  According to the execution plan all three operations are treated the same.

row count execution plan

I decided to then prove they are treated the same by running each one three times and taking the average execution time.  Although the results show small differences , this only caused by fluctuations on server load from other processes.   If I took enough measurements they would  converge.

Results:
COUNT(*) = 15.0 seconds
COUNT(1) = 16.3 seconds
COUNT(PK) = 18.0 seconds

Change in Direction

For my purposes I am trying to quickly obtain a count for all tables and databases on a server.  This a large production data warehouse with many databases and hundreds of tables.  I want to minimize the impact to the server and in this case that is more important to me than accuracy.  An alternative method to reading the table directly is to get the statistics from sys.dm_db_partition_stats. This is a Dynamic Management View (DMV) that reads the information about the partition, and by summing it up for a given object you can obtain the row count.  This row count is an estimation.  Any transactions that are occurring would not be captured.

dm_db_partition_stats execution plan

Althought the execution plan looks more complicated, it does not have to read a large table and instead only reads 1 or a handful of records via the system DMV.

Considerations

To stress the point; this is a fast but not necessarily accurate method to get row count.  If you are doing some validation, such as comparing tables, then I recommend using count() and possible checksum().

More Information About dm_db_partition_stats: https://msdn.microsoft.com/en-us/library/ms187737.aspx