Mark's Sysinternals Blog

Sunday, August 28, 2005

The Case of the Intermittent (and Annoying) Explorer Hangs

I have several computers in my home network where each one has a general designated purpose. For example, one is my game machine, another is my home development system, and a third is where I manage my pictures and home videos (and they all double as test systems). I recently started experiencing intermittent temporary Explorer hangs of up to a minute in length on the development system when I opened certain folders and scrolled through their contents. We’ve all experienced Explorer hangs and the first few I wrote off to standard Explorer flakiness. However, after the initial hang I could navigate the folders fine for a while, but a few minutes later I’d experience a long hang again. When I recognized the reliability of the hangs I suspected that something was broken and I decided to investigate.

The first step in my analysis was to examine the hanging Explorer thread to see if it revealed clues as to the cause of the hang. For that I turned to Process Explorer. I opened the process properties dialog for Explorer.exe and clicked on the Threads tab. Then I repeated the step of opening a folder that exhibited the hang effect and noted that one of the threads that executed in response to the action had a start address in a DLL with a promising name: BrowseUI. I double-clicked on the thread to view its stack:

A thread’s stack is a record of functions that have invoked other functions, and you read it from bottom to top. For example, in this stack the GetPathSpeed function in SHELL32.dll at frame 29 invoked MultinetGetConnectionPerformanceW in MPR.dll at frame 28. This stack trace proceeds all the way into the kernel, ntkrnlpa.exe, and then into the Mup driver. The calls to SHDefExtractIconW (frame 31), MPR.dll (Multiple Provider Router) and Mup.sys (Multiple UNC Provider driver) told me that Explorer was trying to access a network path to obtain icons for the folder that I had opened. I suspected that the network path in question was not valid and that the apparent hang was a timeout waiting for a remote computer to respond.

My next step was to determine what network path Explorer was trying to access and why it was accessing a network path to get an icon in the first place. I knew that Filemon could tell me the path that was causing problems so I launched it and set a filter so that it would report only errors. The resulting trace looked like this:

Aha! Now I was getting somewhere. The reference to the path on the Development computer resulted in a BAD NETWORK PATH error and for good reason: several days earlier I’d decommissioned that computer.

There was just one question left to answer: where was Explorer getting the path that referenced the decommissioned computer? I knew the answer lay in the Registry and that gave me two options: I could search for the path in Regedit or I could watch for the reference in Regmon. I chose the latter so I started Regmon and set the include filter to “development”. I refreshed a problematic folder and saw this in Regmon’s output:

There was the answer. I have for a long time used an older version of Paint Shop Pro as my picture viewer because it has the features I need and because you don’t need to run an installer to use it. If you run its executable, Psp.exe, Paint Shop Pro configures its settings automatically and runs correctly even from network shares. That means that I can put it on one system in my network and run it from the others.

The system from which I’d originally run it was the decommissioned computer, Development. Paint Shop creates browse files in the folders you view with its browse functionality and the Regmon trace revealed that it registers an icon for those browse files. Since I hadn’t performed any kind of uninstall the icon for Paint Shop Pro’s browse file type was still registered to the original location of the executable on the missing computer. When Explorer came across one of those browse files it tried to load the icon, and because Explorer is, even today, sadly largely single-threaded, when the network driver waited for the phantom system to respond Explorer’s user interface hung.
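A stale association like this one can also be hunted for mechanically. The following is a minimal Python sketch, not the method I used: it assumes you have already gathered (file type, icon path) pairs, for example from DefaultIcon values in the Registry, and it flags the file types whose icon lives on a particular remote machine. The function names, the file type names and the share path in the sample data are all hypothetical.

```python
import re

# Matches the server component of a UNC path such as \\Server\share\app.exe,0
_UNC_HOST = re.compile(r'^"?\\\\([^\\]+)\\')

def unc_host(icon_path):
    """Return the server name if icon_path is a UNC path, else None."""
    match = _UNC_HOST.match(icon_path.strip())
    return match.group(1) if match else None

def icons_on_host(associations, host):
    """Return the file types whose icon path points at the given host.

    associations: iterable of (file_type, icon_path) pairs (assumed input,
    e.g. gathered from DefaultIcon registry values).
    """
    wanted = host.lower()
    return [
        file_type
        for file_type, icon_path in associations
        if (unc_host(icon_path) or "").lower() == wanted
    ]

# Hypothetical sample data for illustration only.
sample = [
    ("txtfile", r"%SystemRoot%\system32\shell32.dll,-152"),
    ("ExampleBrowseFile", r"\\Development\tools\Psp.exe,0"),
    ("ExampleLocal", r"C:\Tools\viewer.exe,1"),
]
stale = icons_on_host(sample, "development")
print(stale)
```

The comparison is case-insensitive because computer names in UNC paths are not case-sensitive. Only the second sample entry refers to the Development machine, so it is the only one reported.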

Now I was left with fixing the problem. I found after some digging that there is no way to easily manage default icon associations through any supported means in the Explorer UI. I could have hunted down every Paint Shop Pro-related file in Explorer's File Types tab in the Folder Options dialog and deleted the associations, but instead I just manually deleted Paint Shop Pro’s registry keys and browse files from my systems.

Using Process Explorer, Filemon and Regmon I had diagnosed and fixed the problem in about 15 minutes. What’s disturbing to me about this troubleshooting example is that the average user confronted with the same scenario would have had no way of knowing what was causing the hangs or of fixing them. This is just one example of the many types of Windows issues that cause users to complain that their systems slow down over time and that result in a general “I don’t know, just reinstall” mentality. I hope Vista does better.

Update: Here's an article by Larry Seltzer on this blog entry where he points out that some users confronted with this type of problem might suspect malware to be responsible: http://www.eweek.com/article2/0,1895,1855391,00.asp

posted by Mark Russinovich @ 7:52 AM (53) comments

Wednesday, August 17, 2005

Unkillable Processes

Have you ever terminated an application only to see in your favorite task manager (Process Explorer, of course) that the process still exists? Or have you tried logging out or shutting down only to have the logoff or shutdown stall indefinitely for no apparent reason? These scenarios are usually the result of buggy device drivers that don’t properly handle the cancellation of outstanding I/O requests.

Over the last few years I’ve developed a tool called Notmyfault that demonstrates a number of common device driver bugs, including accessing freed memory, overrunning buffers, and leaking memory. The crashes generated by Notmyfault are featured in the crash analysis chapter of the Windows Internals book I coauthored with Dave Solomon. I’ve recently added a new error selection, Hang Irp, in order to show the effects of drivers that don’t cancel I/O requests.

When you run Notmyfault and select the Hang Irp bug Notmyfault sends an I/O request into its helper driver, Myfault.sys, that Myfault.sys never completes. The names of the executable and driver reinforce the fact that user-mode code can never directly cause a Windows crash: Notmyfault relies on the Myfault driver to do the dirty work. The Notmyfault thread that issues the request never continues executing because it ends up stuck in the kernel waiting for the I/O request to complete. However, because Notmyfault issues the request from a second thread the UI remains responsive and you can issue other bugs, more hanging IRPs, or try to terminate the process.
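The effect of issuing the request from a second thread can be illustrated with a small user-mode analogy. This is a Python sketch only, not Notmyfault's code: a threading.Event that nothing ever sets stands in for the I/O completion that Myfault.sys never delivers, and the function name is hypothetical.

```python
import threading

def issue_request(never_completes):
    """Stand-in for a synchronous I/O call: block until the request completes."""
    never_completes.wait()  # the simulated driver never sets this event

# Event that plays the role of the I/O completion; nothing ever sets it.
completion = threading.Event()

# daemon=True lets this demo exit; a real hung thread gets no such escape.
worker = threading.Thread(target=issue_request, args=(completion,), daemon=True)
worker.start()

# The main ("UI") thread is not blocked: it gives up waiting after a short
# timeout and carries on while the worker stays stuck.
worker.join(timeout=0.2)
print("worker still blocked:", worker.is_alive())
```

Had the main thread called issue_request directly, it would have blocked too, which is what happens to an application that sends such a request from its UI thread.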

Terminating Notmyfault reveals the effect of a hung IRP. Even after you close the Notmyfault window the Notmyfault process still shows in Process Explorer’s process list. Logging off and back in, even into a different account, does not cause the zombied process to exit. So what’s going on under the hood? If you’ve configured Process Explorer to take advantage of Microsoft’s symbol support (steps for doing so are documented in Process Explorer’s help file) you can view the stack of the hung thread by double-clicking on the Notmyfault process, navigating to the resulting Process Properties dialog’s Threads tab, and double-clicking on the thread:

A stack reflects a history of subroutine invocation and reads top to bottom from most to least recent. The stack above indicates that Notmyfault called DeviceIoControl, which called ZwDeviceIoControlFile. ZwDeviceIoControlFile transitioned into kernel mode (the frames that are prefixed with “ntkrnlpa.exe”) where the kernel’s system call dispatcher executed NtDeviceIoControlFile. Since the I/O request was synchronous the I/O manager waits for the driver at which the I/O is targeted to complete the request.

When a process terminates the Process Manager performs process rundown, which includes terminating all the threads in the process, closing handles to opened system resources (e.g. files and registry keys) and tearing down the address space of the process. When the Process Manager sees that a terminating thread has outstanding I/O requests it informs the drivers processing the requests that the requests should be cancelled. You can see that in the stack as the call to IopCancelAlertedRequest. Because the completion of an I/O request requires access to the address space of the owning thread’s process the system can’t finish tearing down a process until all its I/O requests have completed or been cancelled. The I/O Manager has no choice but to wait indefinitely, which you can see in the stack as the call to KeWaitForSingleObject.
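The cancel-then-wait behavior described above can be modeled in a few lines. The following Python sketch is a toy model, not kernel code: ToyRequest and rundown are hypothetical names, and the timeout exists only so that the demonstration terminates, whereas the real system waits indefinitely.

```python
import threading

class ToyRequest:
    """Toy model of an outstanding I/O request (not a real IRP)."""
    def __init__(self, driver_honors_cancel):
        self.driver_honors_cancel = driver_honors_cancel
        self.completed = threading.Event()

    def cancel(self):
        # A well-behaved driver completes the request when asked to cancel;
        # a buggy one ignores the cancellation and leaves it pending.
        if self.driver_honors_cancel:
            self.completed.set()

def rundown(requests, timeout):
    """Cancel every request, then wait for each; return True if all finished.

    The real system waits indefinitely; the timeout exists only so that this
    demonstration terminates.
    """
    for request in requests:
        request.cancel()
    return all(request.completed.wait(timeout) for request in requests)

print(rundown([ToyRequest(True)], timeout=0.1))
print(rundown([ToyRequest(True), ToyRequest(False)], timeout=0.1))
```

The first call returns True because the only request completes when it is cancelled. The second returns False: a single request whose driver ignores cancellation is enough to keep the rundown from finishing.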

If you run across this type of problem in the real world you’ll need to run a kernel debugger to look at the outstanding I/O requests of any hung threads and determine the driver that owns them. If the system is hung you need to debug it from a second computer running a kernel debugger. Since the system as a whole isn’t hung when you create a hung thread with Notmyfault you can use local kernel debugging with LiveKd or, if you’re running Windows XP or higher, the local kernel debugging built into the Debugging Tools for Windows. If you’ve never used a kernel debugger the easiest approach is to download the Debugging Tools for Windows and then run LiveKd from the directory in which you install the tools.

The first kernel debugger command to execute is one that displays the hung process and its threads. For each thread listed, look at the IRP List area, which is a list of its outstanding I/O requests. Here’s the command to dump the hung process and partial output that includes the IRP list for the Notmyfault thread:

kd> !process 0 7 notmyfault.exe
PROCESS 8183ad18 SessionId: 0 Cid: 02dc Peb: 7ffdf000 ParentCid: 04e4
DirBase: 08b40280 ObjectTable: e107cd10 HandleCount: 23.
Image: NotMyfault.exe
VadRoot 817d8d68 Vads 44 Clone 0 Private 98. Modified 1. Locked 0.
…
THREAD 81810560 Cid 02dc.02e4 Teb: 7ffdd000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable
81821d0c NotificationEvent
IRP List:
82370f68: (0006,0094) Flags: 40000000 Mdl: 00000000
…

The next step is to look at the IRP (I/O Request Packet) or IRPs you find:

kd> !irp 82370f68
Irp is active with 1 stacks 1 is current (= 0x82370fd8)
No Mdl Thread 81810560: Irp stack trace.
cmd flg cl Device File Completion-Context
>[ e, 0] 5 0 8172daa8 81821cb0 00000000-00000000
*** ERROR: Module load completed but symbols could not be loaded for myfault.sys
\Driver\MYFAULT
Args: 00000000 00000000 83360020 00000000

The output reports that \Driver\Myfault, the internal name of the Myfault driver, owns the IRP and is therefore the driver that’s guilty of not completing the I/O and not responding to the system’s cancellation request. The error regarding missing symbols for myfault.sys is expected since Microsoft only stores symbols for its own drivers and components.
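If you save the debugger output to a text file, the two lookups above are easy to automate. This Python sketch assumes the output has the same layout as the excerpts shown in this post; the function names are hypothetical, and the parsing has not been checked against other debugger versions.

```python
import re

def irp_addresses(process_output):
    """Return IRP addresses listed under 'IRP List:' in !process output."""
    addresses = []
    in_irp_list = False
    for line in process_output.splitlines():
        stripped = line.strip()
        if stripped == "IRP List:":
            in_irp_list = True
            continue
        if in_irp_list:
            match = re.match(r"([0-9a-fA-F]{8}): \(", stripped)
            if match:
                addresses.append(match.group(1))
            else:
                in_irp_list = False  # first non-IRP line ends the list
    return addresses

def owning_drivers(irp_output):
    """Return driver object names (e.g. \\Driver\\MYFAULT) in !irp output."""
    return re.findall(r"^\s*(\\Driver\\\S+)", irp_output, flags=re.MULTILINE)

# Excerpts copied from the debugger output shown in this post.
process_text = """THREAD 81810560 Cid 02dc.02e4 Teb: 7ffdd000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable
81821d0c NotificationEvent
IRP List:
82370f68: (0006,0094) Flags: 40000000 Mdl: 00000000
"""
irp_text = r"""Irp is active with 1 stacks 1 is current (= 0x82370fd8)
>[ e, 0] 5 0 8172daa8 81821cb0 00000000-00000000
\Driver\MYFAULT
Args: 00000000 00000000 83360020 00000000
"""

print(irp_addresses(process_text))
print(owning_drivers(irp_text))
```

Each address returned by irp_addresses is what you would pass to the !irp command, and owning_drivers picks the driver object name out of the result.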

The reason that the Notmyfault bug does not result in logoff or shutdown hangs is that the system doesn’t care if user applications really terminate during either of those activities. As long as the TerminateProcess API returns success, which it does for such zombie processes, the system is happy. However, if Explorer or one of the core system processes gets into a zombie state the system will be effectively hung.

posted by Mark Russinovich @ 3:52 PM (17) comments
