I hate this alert. As a SCOM admin, your main job is to keep SCOM healthy and make sure it is actually monitoring your environment. One of the first things I check in the morning is the Operations Manager Active Alerts view (sorted by Repeat Count). This morning I found a few machines with a high number of “Operations Manager Failed to Start a Process” alerts. Here’s how I went about resolving one of them.
1. On the client server, looked in the OpsMgr event log. Too much red/yellow!
2. Noticed a lot of event IDs 21403 and 21402. None of the scripts/rules/monitors were running properly!
4. Found an event that referenced a well-known script. These are scripts that remain in the Monitoring Host Temporary Files folder after being run and are often the first files in the list, e.g. SCOMpercentageCPUTimeCounter.vbs, MemoryUtilization.vbs, WMIFunctionalCheck.vbs.
5. Manually ran the script using the working directory and parameters from the event log entry.
6. The first time, it ran quickly and returned the property bag.
7. Ran it again…this time it hung. (You can add “wscript.echo time” statements to the script to output the start and finish times if you want to know how long it ran.)
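If you find yourself repeating steps 5–7 on several scripts, they can be sketched as a small helper. This is a hypothetical Python sketch, not anything SCOM ships: it runs a command from a given working directory and reports the wall-clock time, the same information the wscript.echo trick gives you. On the agent you would pass the cscript.exe command line and working directory taken from the event log entry.

```python
import subprocess
import sys
import time

def run_and_time(cmd, cwd=None, timeout=300):
    """Run a command the way you would run it by hand: same working
    directory, same arguments, with a wall-clock timer around it."""
    start = time.monotonic()
    try:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True,
                                text=True, timeout=timeout)
        return result.stdout, time.monotonic() - start
    except subprocess.TimeoutExpired:
        # The script hung past the timeout, like the second run in step 6.
        return None, time.monotonic() - start

# Portable stand-in for the real call; on the SCOM agent you would use
# something like:
#   run_and_time(["cscript.exe", "//NoLogo", "SCOMpercentageCPUTimeCounter.vbs",
#                 <parameters from the event>], cwd=<working dir from the event>)
out, secs = run_and_time([sys.executable, "-c", "print('property bag')"])
print(out.strip(), f"({secs:.2f}s)")
```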
8. I had two choices now: override the script timeout, or find out what was making the script run long. In this case the script ran quickly the first time, so I wanted to investigate why it ran long the second time.
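To make the first choice concrete: the script timeout simply bounds how long the process is allowed to run before it is killed. A minimal Python sketch of that behavior (illustrative only; SCOM enforces its timeout inside the workflow, not with code like this), using a sleep as a stand-in for a hung script:

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout):
    """Terminate the child if it exceeds `timeout` seconds, roughly what
    the agent does when a script outlives its configured timeout."""
    try:
        subprocess.run(cmd, timeout=timeout)
        return "completed"
    except subprocess.TimeoutExpired:
        return "timed out"

# A stand-in script that "hangs" longer than the allowed timeout:
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout=1)
# And one that finishes well inside it:
fast = run_with_timeout([sys.executable, "-c", "pass"], timeout=5)
print(hung, fast)
```

Raising the timeout override just moves that line; it doesn’t explain why a script that used to finish in seconds suddenly doesn’t, which is why investigating was the better choice here.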
8. Browsed to the vbs and opened it. I noticed it was making a WMI call to gather data.
9. Opened wbemtest and manually ran the WMI query from the script…and it too hung.
10. Now I know WMI is problematic, even though I see no errors in the application or system event logs. WMI is a very common data source for SCOM and very prone to issues. This would explain why so many rules/monitors were failing.
11. Applied the WMI hotfixes recommended by Kevin Holman (http://blogs.technet.com/b/kevinholman/archive/2009/01/27/which-hotfixes-should-i-apply.aspx).
12. Rebooted the server and the errors were gone!