I'm working on a project with a feature that copies entire tables from a database on one site to a database on another site. Each site has about 30 tables, which are normally transferred at the same time; to protect the server and network, only 7 "insert" instances can be active simultaneously and the rest wait for an open slot. This has worked fine on individual sites during our testing phase and can download all tables at speeds up to 5,000 rows per second.
In addition to that feature we have an update system that tracks changes on our client sites' tables, registers them in a separate table, and once every 5 minutes the server site pulls that data down so the tables stay updated with any new information. This also runs perfectly fine.
During testing we only used 5 sites, but we are now starting to add more, and we have run into an issue we do not completely understand. What we see is something like this: we add 2 sites, with 30 inserts each, without issue, but during the 3rd site's insert, the largest tables stop mid-transfer, wait for 30 seconds and then display an error. Some smaller ones complete successfully, but the last ones then experience the same stop, wait 30 seconds and fail. We then do a cleanup, where all data from that 3rd site is removed from the server, and retry, only to see the exact same result. The largest tables stop at the exact same location, and some small ones complete, but eventually they also stop mid-transfer.
In our log, the following message repeats multiple times:
15/12/2022 10:32:36.052 FAIL TaskHandlerAddTabel[0] Error during add table task Add [tablename] from [site] (ID: 2485008)
Exception type: AggregateException
One or more errors occurred. (Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.
The statement has been terminated.)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at TaskHandlerAddTabel._DoWork(Object sender, DoWorkEventArgs e) in [path]\TaskHandlerAddTabel.cs:line 128
Exception type: SqlException
Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.
The statement has been terminated.
SQL error number: -2
at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at Microsoft.Data.SqlClient.SqlBulkCopy.RunParser(BulkCopySimpleResultSet bulkCopyHandler)
at Microsoft.Data.SqlClient.SqlBulkCopy.CopyBatchesAsyncContinuedOnSuccess(BulkCopySimpleResultSet internalResults, String updateBulkCommandText, CancellationToken cts, TaskCompletionSource`1 source)
Exception type: Win32Exception
The wait operation timed out.
As you can read from this error dump, which mixes our custom messages with the original SQL messages and stack traces (some redactions made for confidentiality), I use SqlBulkCopy to do the transfer. Here is the isolated SqlBulkCopy code:
SqlBulkCopy bulkCopy = globalDatalayer.CreateSqlBulkCopy(_server.GetConnectionStringWithUserPass());
bulkCopy.BatchSize = _bulkCopyLimit;
bulkCopy.EnableStreaming = true;
bulkCopy.DestinationTableName = $"tablename";
try
{
var task = bulkCopy.WriteToServerAsync(dataReader, CancellationToken);
task.Wait();
CancellationToken.ThrowIfCancellationRequested();
}
catch
{
throw;
}
finally
{
try
{
bulkCopy.Dispose();
}
catch
{
}
}
Now, I do not think there is anything wrong with the SqlBulkCopy implementation, as it works very well up until we see the stop->wait->error sequence. We can also see that the updates running on the other added sites keep running fine and apply changes as they go.
What I have tried:
Our current theory is that we somehow overload the SQL Server, and here is why.
As described, the system works for the first couple of sites but then starts hitting the stop->wait->error sequence while the updates keep running fine. But if we let the system sit overnight, still running updates, and try to add the 3rd site in the morning, then the 3rd site succeeds. If we continue to add more sites, then the 5th or 6th site experiences the stop->wait->error scenario. And again, if we let it sit for 12-24 hours, we can start again the next day.
We looked into the server's vitals, but everything looks nominal. Granted, the server is busy during our table additions, but not critically so; nothing seems to be screaming. We can still contact the server without issues, and a monitoring client we built to keep an eye on the server is behaving normally.
So: the connection to the server works. Updates to the tables work (small amounts of work every 5 minutes for each site). Access to the server itself works and its vitals are not critical. But after several tables have been bombarded with inserts (we are talking tens of millions of inserts across different tables), SqlBulkCopy stops working, but only for some time (12-24 hours), after which it starts working again. Do any SQL Server experts here have a suggestion as to what is going on? My own theory is that we somehow fill up a buffer or log in the SQL Server, and once it is full it can't handle any more until it has been processed, after which we can go again. The 12-24 hour cooldown hasn't been investigated much and could be as short as a 6-hour window; we tried waiting for an hour, but that didn't help.
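One cheap way to probe the buffer/log theory (a hedged sketch; connectionString is a placeholder for our server connection string) is to ask SQL Server, while a transfer is stalled, why the transaction log of each database cannot currently be truncated. The log_reuse_wait_desc column on sys.databases is a standard catalog view column; a value like LOG_BACKUP or ACTIVE_TRANSACTION on the destination database would support the "log fills up" theory, while NOTHING would tend to rule it out:

```csharp
using System;
using Microsoft.Data.SqlClient;

// Hypothetical diagnostic; "connectionString" stands in for the real
// connection string used by the transfer.
using var conn = new SqlConnection(connectionString);
conn.Open();

// log_reuse_wait_desc reports why each database's transaction log cannot
// currently be truncated ("NOTHING" means the log is not the bottleneck).
using var cmd = new SqlCommand(
    "SELECT name, log_reuse_wait_desc FROM sys.databases;", conn);
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetString(0)}: {reader.GetString(1)}");
```

Running this during a stalled bulk copy, rather than after the cleanup, would show what the server is actually waiting on at the moment of failure.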
Update 1:
We did try extending the timeout on the SqlBulkCopy, but it just delays the timeout exception. Waiting forever is not an option we want to pursue, as it can actually cause problems on the client side if it can't deliver updates (the inserts need to happen first). Plus, customers start asking why they do not get updated data.
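For reference, extending the timeout is just a property on the bulk copy object (this is the standard SqlBulkCopy.BulkCopyTimeout; the value below is only an example). It may be worth noting that its default is 30 seconds, which matches the 30-second stall we observe before each error:

```csharp
// BulkCopyTimeout is in seconds; the default of 30 matches the observed
// 30-second stall before the "Execution Timeout Expired" error.
// A value of 0 would mean "wait forever", which we do not want.
bulkCopy.BulkCopyTimeout = 600; // e.g. 10 minutes
```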
I should also note that we looked into the server's Event Viewer, but no issues were reported by SQL Server.
Update 2:
One of my colleagues ran a batch again today, which resulted in the same failure. We then looked into some server processes, as another colleague suggested that a cleanup routine might be the reason we can try again the next day. We do have a routine that, among other things, rebuilds indexes and handles some logs. We ran it and retried, and this time all tables completed except one, which we currently think was just a fluke; we ran it again and it worked. So we are now working on doing automatic index rebuilds when adding tables, to see if that helps with the issue. I also read online today about someone else who had this issue, and he was rebuilding indexes.
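The automatic rebuild we are testing is essentially this (a sketch; the schema/table name is a placeholder and conn is an open SqlConnection to the destination server). ALTER INDEX ... REBUILD is standard T-SQL; CommandTimeout = 0 disables the client-side command timeout, since rebuilds on large tables can easily exceed the default 30 seconds:

```csharp
using Microsoft.Data.SqlClient;

// Hypothetical sketch: rebuild all indexes on the destination table after a
// large add. "conn" is an open SqlConnection; the table name is a placeholder.
using var cmd = new SqlCommand(
    "ALTER INDEX ALL ON [dbo].[tablename] REBUILD;", conn);
cmd.CommandTimeout = 0; // no client-side timeout; rebuilds can take a while
cmd.ExecuteNonQuery();
```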