I'm working on a project with a feature that copies entire tables from a database on one site to a database on another site. Each site has about 30 tables, which are normally transferred at the same time; to protect the server and network, only 7 "insert" instances can be active simultaneously and the rest wait for an open slot. This has worked fine on individual sites during our testing phase and can download all tables at speeds up to 5,000 rows per second.
In addition to that feature we have an update system that tracks changes on our client sites' tables, registers them in a separate table, and once every 5 minutes the server site pulls that data down so the tables stay updated with any new information. This also runs perfectly fine.
During testing we only used 5 sites, but we are now starting to add more, and we have run into an issue we do not completely understand. What we see is something like this: we add 2 sites, with 30 inserts each, without issue, but during the 3rd site's insert, the largest tables stop mid-transfer, wait for 30 seconds and then display an error. Some smaller ones complete successfully, but the last ones then experience the same stop, wait 30 seconds and fail. We then do a cleanup, where all data from that 3rd site is removed from the server, and retry, only to see the exact same result. The largest tables stop at the exact same location, and some small ones complete, but eventually they also stop mid-transfer.
In our log, the following message repeats multiple times:
15/12/2022 10:32:36.052 FAIL TaskHandlerAddTabel[0] Error during add table task Add [tablename] from [site] (ID: 2485008)
Exception type: AggregateException
One or more errors occurred. (Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.
The statement has been terminated.)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at TaskHandlerAddTabel._DoWork(Object sender, DoWorkEventArgs e) in [path]\TaskHandlerAddTabel.cs:line 128
Exception type: SqlException
Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.
The statement has been terminated.
SQL error number: -2
at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at Microsoft.Data.SqlClient.SqlBulkCopy.RunParser(BulkCopySimpleResultSet bulkCopyHandler)
at Microsoft.Data.SqlClient.SqlBulkCopy.CopyBatchesAsyncContinuedOnSuccess(BulkCopySimpleResultSet internalResults, String updateBulkCommandText, CancellationToken cts, TaskCompletionSource`1 source)
Exception type: Win32Exception
The wait operation timed out.
As you can read from this error dump, which mixes our custom messages with the original SQL messages and stack traces (some redactions made for confidentiality), I use SqlBulkCopy to do the transfer. Here is the isolated SqlBulkCopy code:
SqlBulkCopy bulkCopy = globalDatalayer.CreateSqlBulkCopy(_server.GetConnectionStringWithUserPass());
bulkCopy.BatchSize = _bulkCopyLimit;
bulkCopy.EnableStreaming = true;
bulkCopy.DestinationTableName = $"tablename";
try
{
var task = bulkCopy.WriteToServerAsync(dataReader, CancellationToken);
task.Wait();
CancellationToken.ThrowIfCancellationRequested();
}
catch
{
throw;
}
finally
{
try
{
bulkCopy.Dispose();
}
catch
{
}
}
Now, I do not think there is anything wrong with the SqlBulkCopy implementation, as it works very well up until we see the stop->wait->error sequence. We can also see that the updates running on the other added sites keep running fine and apply changes as they go.
What I have tried:
Our current theory is that we somehow overload the SQL Server, and here is why.
As described, the system works for the first couple of sites but then starts hitting the stop->wait->error sequence while the updates keep running fine. But if we let the system sit overnight, still running updates, and try to add the 3rd site in the morning, then the 3rd site succeeds. If we continue to add more sites, then the 5th or 6th site experiences the stop->wait->error scenario. And again, if we let it sit for 12-24 hours, we can start again the next day.
We looked into the server's vitals, but everything looks nominal. Granted, the server is busy during our table additions, but not critically so; nothing seems to be screaming. We can still contact the server without issues, and a monitoring client we built to keep an eye on the server is behaving normally.
So: the connection to the server works. Updates to the tables work (small amounts of work every 5 minutes for each site). Access to the server itself works and its vitals are not critical. But after several tables have been bombarded with inserts (we are talking tens of millions of inserts across different tables), SqlBulkCopy stops working, but only for some time (12-24 hours), after which it starts working again. Do any SQL Server experts here have a suggestion as to what is going on? My own theory is that we somehow fill up a buffer or log in the SQL Server, and once it is full it can't handle any more until it has been processed, after which we can go again. The 12-24 hour cooldown hasn't been investigated much and could be as short as a 6-hour window; we tried waiting for an hour, but that didn't help.
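One cheap way to probe the buffer/log theory (a hedged sketch; connectionString is a placeholder for our server connection string) is to ask SQL Server, while a transfer is stalled, why the transaction log of each database cannot currently be truncated. The log_reuse_wait_desc column on sys.databases is a standard catalog view column; a value like LOG_BACKUP or ACTIVE_TRANSACTION on the destination database would support the "log fills up" theory, while NOTHING would tend to rule it out:

```csharp
using System;
using Microsoft.Data.SqlClient;

// Hypothetical diagnostic; "connectionString" stands in for the real
// connection string used by the transfer.
using var conn = new SqlConnection(connectionString);
conn.Open();

// log_reuse_wait_desc reports why each database's transaction log cannot
// currently be truncated ("NOTHING" means the log is not the bottleneck).
using var cmd = new SqlCommand(
    "SELECT name, log_reuse_wait_desc FROM sys.databases;", conn);
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetString(0)}: {reader.GetString(1)}");
```

Running this during a stalled bulk copy, rather than after the cleanup, would show what the server is actually waiting on at the moment of failure.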
Update 1:
We did try extending the timeout on the SqlBulkCopy, but it just delays the timeout exception. Waiting forever is not an option we want to pursue, as it can actually cause problems on the client side if it can't deliver updates (the inserts need to happen first). Plus, customers start asking why they do not get updated data.
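For reference, extending the timeout is just a property on the bulk copy object (this is the standard SqlBulkCopy.BulkCopyTimeout; the value below is only an example). It may be worth noting that its default is 30 seconds, which matches the 30-second stall we observe before each error:

```csharp
// BulkCopyTimeout is in seconds; the default of 30 matches the observed
// 30-second stall before the "Execution Timeout Expired" error.
// A value of 0 would mean "wait forever", which we do not want.
bulkCopy.BulkCopyTimeout = 600; // e.g. 10 minutes
```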
I should also note that we looked into the server's Event Viewer, but no issues were reported by SQL Server.
Update 2:
One of my colleagues ran a batch again today, which resulted in the same failure. We then looked into some server processes, as another colleague suggested that a cleanup routine might be the reason we can try again the next day. We do have a routine that, among other things, rebuilds indexes and handles some logs. We ran it and retried, and this time all tables completed except one, which we currently think was just a fluke; we ran it again and it worked. So we are now working on doing automatic index rebuilds when adding tables, to see if that helps with the issue. I also read online today about someone else who had this issue, and he was rebuilding indexes.
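The automatic rebuild we are testing is essentially this (a sketch; the schema/table name is a placeholder and conn is an open SqlConnection to the destination server). ALTER INDEX ... REBUILD is standard T-SQL; CommandTimeout = 0 disables the client-side command timeout, since rebuilds on large tables can easily exceed the default 30 seconds:

```csharp
using Microsoft.Data.SqlClient;

// Hypothetical sketch: rebuild all indexes on the destination table after a
// large add. "conn" is an open SqlConnection; the table name is a placeholder.
using var cmd = new SqlCommand(
    "ALTER INDEX ALL ON [dbo].[tablename] REBUILD;", conn);
cmd.CommandTimeout = 0; // no client-side timeout; rebuilds can take a while
cmd.ExecuteNonQuery();
```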