this days, I updated Service Management Automation (SMA) in our production environment from version 2016 UR7 to version 2019.
after the update, I had an issue with Windows Azure Pack, which couldn’t display the dashboard for the Automation provider.
after some research, i found out, that the root cause was some months earlier. we have a lot of concurrent jobs running and some times the SMA workers was overloaded and all new jobs where hanging in the queue. the solution I found on some other websites, was to mark this queued jobs as failed inside the database and then restart the worker service:
USE SMA
-- New Jobs
UPDATE [Core].[Jobs] SET StatusID=5 WHERE StatusID=1
-- Running Jobs
UPDATE [Core].[Jobs] SET StatusID=5 WHERE StatusID=3
-- Suspended Jobs
UPDATE [Core].[Jobs] SET StatusID=5 WHERE StatusID=8
after setting this jobs to failed state, I had to configure the Runbook worker to start only one Runbook per Sandbox process in the file C:\Program Files\Microsoft System Center\Service Management Automation\Orchestrator.Settings.config:
<add key="MaxRunningJobs" value="1"/> <add key="TotalAllowedJobs" value="1"/>
the Runbook worker could now be restarted and was working properly again. Stability was much better, than before.
unfortunately, there is an other problem, with the workaround of setting the jobs to failed. every 15 minutes the SMA database executes a Purge Job, to remove old Job entries and all stream datas for this old jobs. but this Purge Job is removing only Jobs with a valid EndDate, which is still NULL when you set the Job to failed with the query above. in my case i had over 15000 old jobs inside the database.
to set hanging jobs to failed, you should use this query:
USE SMA
-- New Jobs
update [Core].[Jobs] set StatusID=5, EndTime=LastModifiedTime where StatusID=1
-- Running Jobs
update [Core].[Jobs] set StatusID=5, EndTime=LastModifiedTime where StatusID=3
-- Suspended Jobs
update [Core].[Jobs] set StatusID=5, EndTime=LastModifiedTime where StatusID=8
this query sets the hanging jobs to failed inside the Jobs table and sets the EndTime, which is important for the Purge Job. After this, the Purge Job will run for some minutes and will remove the 1000 oldest jobs per batch until the defined age. you can define the age of the oldest jobs with the commandlet Set-Sma
after all old jobs where removed, my Windows Azure Pack was working again properly