# Chapter 29 Primitive Thread Synchronization Constructs

When a thread pool thread blocks, the thread pool creates additional threads, and the time and memory resources required to create, destroy, and schedule threads are very expensive. When many developers see that they have threads in their program that are not doing anything useful, they tend to create more threads in hopes that the new threads will do something useful. The key to building scalable and responsive applications is to not block the threads you have so that they can be used and reused to execute other tasks. Chapter 27, “Compute-Bound Asynchronous Operations,” focused on how to use existing threads to perform compute-bound operations, and Chapter 28, “I/O-Bound Asynchronous Operations,” focused on how to use threads when performing I/O-bound operations.

In this chapter, I focus on thread synchronization. Thread synchronization is used to prevent corruption when multiple threads access shared data at the same time. I emphasize at the same time, because thread synchronization is all about timing. If you have some data that is accessed by two threads and those threads cannot possibly touch the data simultaneously, then thread synchronization is not required at all. In Chapter 28, I discussed how different sections of async functions can be executed by different threads. Here we could potentially have two different threads accessing the same variables and data. But async functions are implemented in such a way that it is impossible for two threads to access this same data at the same time. Therefore, no thread synchronization is required when code accesses data contained within the async function.

This is ideal because thread synchronization has many problems associated with it. First, it is tedious and extremely error-prone. In your code, you must identify all data that could potentially be touched by multiple threads at the same time. Then you must surround this code with additional code that acquires and releases a thread synchronization lock. The lock ensures that only one thread at a time can access the resource. If you forget to surround just one block of code with a lock, then the data will become corrupted. Also, there is no way to prove that you have added all your locking code correctly. You just have to run your application, stress-test it a lot, and hope that nothing goes wrong. In fact, you should test your application on a machine that has as many CPUs as possible because the more CPUs you have, the better chance that two or more threads will attempt to access the resource at the same time, making it more likely you’ll detect a problem.

The second problem with locks is that they hurt performance. It takes time to acquire and release a lock because there are additional method calls, and because the CPUs must coordinate with each other to determine which thread will acquire the lock first. Having the CPUs in the machine communicate with each other this way hurts performance. For example, let’s say that you have code that adds a node to the head of a linked list.

// This class is used by the LinkedList class
public class Node { 
 internal Node m_next; 
 // Other members not shown
}
public sealed class LinkedList {
 private Node m_head;
 public void Add(Node newNode) {
 // The following two lines perform very fast reference assignments
 newNode.m_next = m_head;
 m_head = newNode;
 }
}

This Add method simply performs two reference assignments that can execute extremely fast. Now, if we want to make Add thread safe so that multiple threads can call it simultaneously without corrupting the linked list, then we need to have the Add method acquire and release a lock.

public sealed class LinkedList {
 private SomeKindOfLock m_lock = new SomeKindOfLock();
 private Node m_head;
 public void Add(Node newNode) {
 m_lock.Enter();
 // The following two lines perform very fast reference assignments
 newNode.m_next = m_head;
 m_head = newNode;
 m_lock.Leave();
 }
}

Although Add is now thread safe, it has also become substantially slower. How much slower depends on the kind of lock chosen; I will compare the performance of various locks in this chapter and in Chapter 30, “Hybrid Thread Synchronization Constructs.” But even the fastest lock could make the Add method several times slower than the version of it that didn’t have any lock code in it at all. Of course, the performance becomes significantly worse if the code calls Add in a loop to insert several nodes into the linked list.

The third problem with thread synchronization locks is that they allow only one thread to access the resource at a time. This is the lock’s whole reason for existing, but it is also a problem, because blocking a thread causes more threads to be created. So, for example, if a thread pool thread attempts to acquire a lock that it cannot have, it is likely that the thread pool will create a new thread to keep the CPUs saturated with work. As discussed in Chapter 26, “Thread Basics,” creating a thread is very expensive in terms of both memory and performance. And to make matters even worse, when the blocked thread gets to run again, it will run with this new thread pool thread; Windows is now scheduling more threads than there are CPUs, and this increases context switching, which also hurts performance.

The summary of all of this is that thread synchronization is bad, so you should try to design your applications to avoid as much of it as possible. To that end, you should avoid shared data such as static fields. When a thread uses the new operator to construct an object, the new operator returns a reference to the new object. At this point in time, only the thread that constructs the object has a reference to it; no other thread can access that object. If you avoid passing this reference to another thread that might use the object at the same time as the creating thread, then there is no need to synchronize access to the object.

Try to use value types, because they are always copied, so each thread operates on its own copy. Finally, it is OK to have multiple threads accessing shared data simultaneously if that access is read-only. For example, many applications create some data structures during their initialization. Once initialized, the application can create as many threads as it wants; if all these threads just query the data, then all the threads can do this simultaneously without acquiring or releasing any locks. The String type is an example of this: after a String object is created, it is immutable; so many threads can access a single String object at the same time without any chance of the String object becoming corrupted.

💡Summary: The key to building scalable and responsive applications is to not block the threads you have, so that they can be used (and reused) to execute other tasks. Thread synchronization prevents data corruption when multiple threads access shared data at the same time. The emphasis is on at the same time, because thread synchronization is really all about timing: if some data is accessed by two threads but those threads cannot possibly touch it simultaneously, no thread synchronization is required at all. Two different threads may access the same variables and data, but because of the way async functions are implemented, it is impossible for two threads to access the same data at the same time; therefore, no thread synchronization is required when code accesses data contained within an async function. Needing no synchronization is ideal, because thread synchronization has many problems. The first is that it is tedious and extremely error-prone: you must identify all data that could be touched by multiple threads at the same time, and then surround that code with additional code that acquires and releases a thread synchronization lock. The lock ensures that only one thread at a time accesses the resource. If you forget to protect just one block of code, the data becomes corrupted. Also, there is no way to prove that you have added all the locking code correctly; you can only run the application, stress-test it heavily, and hope that nothing goes wrong. In fact, you should test on a machine with as many CPUs (or CPU cores) as possible, because more CPUs make it more likely that two or more threads will access the resource at the same time, making problems easier to detect. The second problem with locks is that they hurt performance: acquiring and releasing a lock takes time, because extra methods are called and because the CPUs must coordinate with each other to decide which thread acquires the lock first. Having the CPUs in the machine communicate this way hurts performance. The third problem is that locks allow only one thread to access the resource at a time. That is the lock's whole reason for existing, but it is also a problem, because blocking a thread causes more threads to be created. Creating a thread is an expensive operation that consumes a lot of memory and time. Worse, when the blocked thread runs again, it runs alongside the new thread pool thread; Windows is now scheduling more threads than there are CPUs, which increases context switching and hurts performance further. The sum of all this is that thread synchronization is bad, so you should design your applications to avoid as much of it as possible. Specifically, avoid shared data such as static fields. When a thread uses the new operator to construct an object, new returns a reference to the object; at that moment, only the constructing thread has a reference to it, and no other thread can access it. If you avoid passing that reference to another thread that might use the object at the same time, there is no need to synchronize access to the object. Try to use value types, because they are always copied and each thread operates on its own copy. Finally, it is fine for multiple threads to access shared data simultaneously as long as that access is read-only.

# Class Libraries and Thread Safety

Now, I’d like to say a quick word about class libraries and thread synchronization. Microsoft’s Framework Class Library (FCL) guarantees that all static methods are thread safe. This means that if two threads call a static method at the same time, no data will get corrupted. The FCL had to do this internally because there is no way that multiple companies producing different assemblies could coordinate on a single lock for arbitrating access to the resource. The Console class contains a static field, and many of its methods acquire and release a lock on this field to ensure that only one thread at a time is accessing the console.
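
To illustrate the pattern, here is a minimal sketch of my own (the ThreadSafeLogger name is hypothetical, and the real Console class’s internals differ; I use C#’s lock statement, a hybrid construct covered in Chapter 30).

internal static class ThreadSafeLogger {
 // Hypothetical example: a private static field that all the static methods lock on
 private static readonly Object s_lock = new Object();
 public static void WriteLine(String message) {
  lock (s_lock) { // only one thread at a time gets past this point
   Console.WriteLine(message);
  }
 }
}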

For the record, making a method thread safe does not mean that it internally takes a thread synchronization lock. A thread-safe method means that data doesn’t get corrupted if two threads attempt to access the data at the same time. The System.Math class has a static Max method implemented as follows.

public static Int32 Max(Int32 val1, Int32 val2) {
 return (val1 < val2) ? val2 : val1;
}

This method is thread safe even though it doesn’t take any lock. Because Int32 is a value type, the two Int32 values passed to Max are copied into the method, so multiple threads can call Max simultaneously, with each thread working on its own data, isolated from every other thread.

On the other hand, the FCL does not guarantee that instance methods are thread safe, because adding all the locking code would hurt performance too much. And, in fact, if every instance method acquired and released a lock, you would ultimately end up with just one thread running in your application at any given time, which hurts performance even more. As mentioned earlier, when a thread constructs an object, only this thread has a reference to the object, no other thread can access that object, and no thread synchronization is required when invoking instance methods. However, if the thread then exposes the reference to the object—by placing it in a static field, passing it as the state argument to ThreadPool.QueueUserWorkItem or to a Task, and so on—then thread synchronization is required if the threads could attempt simultaneous non-read-only access.

It is recommended that your own class libraries follow this pattern; that is, make all your static methods thread safe and make all your instance methods not thread safe. There is one caveat to this pattern: if the purpose of the instance method is to coordinate threads, then the instance method should be thread safe. For example, one thread can cancel an operation by calling CancellationTokenSource’s Cancel method, and another thread detects that it should stop what it’s doing by querying the corresponding CancellationToken’s IsCancellationRequested property. These two instance members have some special thread synchronization code inside them to ensure that the coordination of the two threads goes as expected.
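
Here is a minimal sketch of my own (names and timings are hypothetical) showing this kind of coordination: one thread polls IsCancellationRequested while another thread calls Cancel.

using System;
using System.Threading;

internal static class CancellationDemo {
 public static void Main() {
  var cts = new CancellationTokenSource();
  // One thread does work, periodically checking whether it should stop
  ThreadPool.QueueUserWorkItem(_ => {
   Int64 iterations = 0;
   while (!cts.Token.IsCancellationRequested) iterations++;
   Console.WriteLine("Worker noticed cancellation after {0:N0} iterations", iterations);
  });
  Thread.Sleep(1000); // let the worker run for a while
  cts.Cancel(); // thread safe by design; coordinates with IsCancellationRequested
  Console.ReadLine(); // keep the process alive long enough to see the output (demo only)
 }
}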

💡Summary: Microsoft’s Framework Class Library (FCL) guarantees that all static methods are thread safe, meaning that if two threads call a static method at the same time, no data gets corrupted. The FCL had to do this internally, because multiple companies producing different assemblies could never coordinate on a single lock for arbitrating access to a resource. The Console class contains a static field, and many of its methods acquire and release a lock on this field to ensure that only one thread at a time accesses the console. For the record, making a method thread safe does not mean that it must internally acquire a thread synchronization lock; a thread-safe method simply means that data does not get corrupted when two threads attempt to access it at the same time. On the other hand, the FCL does not guarantee that instance methods are thread safe, because adding all the locking would cause an enormous performance loss. Moreover, if every instance method had to acquire and release a lock, in effect only one thread would ever be running in your application at any given time, with obvious consequences for performance. As described earlier, no thread synchronization is required when invoking instance methods on a freshly constructed object. However, if the thread then exposes the object reference—by placing it in a static field, or by passing it as the state argument to ThreadPool.QueueUserWorkItem or to a Task—then thread synchronization is required whenever multiple threads could attempt simultaneous non-read-only access. It is recommended that your own class libraries follow the FCL’s pattern: make all your static methods thread safe and all your instance methods not thread safe. One caveat: if the purpose of an instance method is to coordinate threads, then that instance method should be thread safe.

# Primitive User-Mode and Kernel-Mode Constructs

In this chapter, I explain the primitive thread synchronization constructs. By primitive, I mean the simplest constructs that are available to use in your code. There are two kinds of primitive constructs: user-mode and kernel-mode. Whenever possible, you should use the primitive user-mode constructs, because they are significantly faster than the kernel-mode constructs; they use special CPU instructions to coordinate threads, which means that the coordination occurs in hardware (which is what makes it fast). But this also means that the Windows operating system never detects that a thread is blocked on a primitive user-mode construct. Because a thread pool thread blocked on a user-mode primitive construct is never considered blocked, the thread pool will not create a new thread to replace the temporarily blocked thread. In addition, these CPU instructions block the thread for an incredibly short period of time.

Wow! All of this sounds great, doesn’t it? And it is great, which is why I recommend using these constructs as much as possible. However, there is a downside—only the Windows operating system kernel can stop a thread from running so that it is not wasting CPU time. A thread running in user mode can be preempted by the system, but the thread will be scheduled again as soon as possible. So, a thread that wants to acquire some resource, but can’t get it, spins in user mode. This potentially wastes a lot of CPU time, which would be better spent performing other work or even just letting the CPU go idle to conserve power.

This brings us to the primitive kernel-mode constructs. The kernel-mode constructs are provided by the Windows operating system itself. As such, they require that your application’s threads call functions implemented in the operating system kernel. Having threads transition from user mode to kernel mode and back incurs a big performance hit, which is why kernel-mode constructs should be avoided. However, they do have a positive feature—when a thread uses a kernel-mode construct to acquire a resource that another thread has, Windows blocks the thread so that it is no longer wasting CPU time. Then, when the resource becomes available, Windows resumes the thread, allowing it to access the resource.

A thread waiting on a construct might block forever if the thread currently holding the construct never releases it. If the construct is a user-mode construct, the thread is running on a CPU forever, and we call this a livelock. If the construct is a kernel-mode construct, the thread is blocked forever, and we call this a deadlock. Both of these are bad, but of the two, a deadlock is always preferable to a livelock, because a livelock wastes both CPU time and memory (the thread’s stack, etc.), whereas a deadlock wastes only memory.

In an ideal world, we’d like to have constructs that take the best of both worlds. That is, we’d like a construct that is fast and non-blocking (like the user-mode constructs) when there is no contention. But when there is contention for the construct, we’d like it to be blocked by the operating system kernel. Constructs that work like this do exist; I call them hybrid constructs, and I will discuss them in Chapter 30. It is very common for applications to use the hybrid constructs, because in most applications, it is rare for two or more threads to attempt to access the same data at the same time. A hybrid construct keeps your application running fast most of the time, and occasionally it runs slowly to block the thread. The slowness usually doesn’t matter at this point, because your thread is going to be blocked anyway.

Many of the CLR’s thread synchronization constructs are really just object-oriented class wrappers around Win32 thread synchronization constructs. After all, CLR threads are Windows threads, which means that Windows schedules and controls the synchronization of threads. Windows thread synchronization constructs have been around since 1992, and a ton of material has been written about them. Therefore, I give them only cursory treatment in this chapter.

💡Summary: A primitive is the simplest construct available for use in your code. There are two kinds of primitive constructs: user-mode and kernel-mode. You should prefer the primitive user-mode constructs, which are significantly faster than the kernel-mode constructs because they use special CPU instructions to coordinate threads. This means the coordination occurs in hardware (which is what makes it fast), but it also means that the Microsoft Windows operating system never detects that a thread is blocked on a primitive user-mode construct. Because a thread pool thread blocked on a user-mode primitive construct is never considered blocked, the thread pool will not create a new thread to replace the temporarily blocked thread. In addition, these CPU instructions block the thread for only an extremely short time. But they have a downside: only the Windows operating system kernel can stop a thread from running (so that it does not waste CPU time). A thread running in user mode can be preempted by the system, but the thread will be scheduled again as soon as possible. So a thread that wants a resource but cannot get it spins in user mode, potentially wasting a great deal of CPU time that would be better spent performing other useful work—or even just letting the CPU go idle to conserve power. Kernel-mode constructs are provided by the Windows operating system itself, so they require the application’s threads to call functions implemented in the operating system kernel. Transitioning a thread from user mode to kernel mode (or back) incurs a big performance hit, which is why kernel-mode constructs should be avoided. However, they have an important advantage: when a thread uses a kernel-mode construct to acquire a resource owned by another thread, Windows blocks the thread so that it no longer wastes CPU time; when the resource becomes available, Windows resumes the thread, allowing it to access the resource. A thread waiting on a construct may block forever if the thread that owns the construct never releases it. With a user-mode construct, the thread runs on a CPU forever, which we call a livelock; with a kernel-mode construct, the thread blocks forever, which we call a deadlock. Both are bad, but of the two, a deadlock is always preferable to a livelock, because a livelock wastes both CPU time and memory (the thread’s stack, and so on), whereas a deadlock wastes only memory. An ideal construct would combine the best of both: fast and non-blocking when there is no contention (like the user-mode constructs), but blocked by the operating system kernel when contention exists. Constructs like this do exist; I call them hybrid constructs. Hybrid constructs keep your application running fast most of the time, occasionally running slowly to block a thread. Many of the CLR’s thread synchronization constructs are really just object-oriented class wrappers around Win32 thread synchronization constructs. After all, CLR threads are Windows threads, which means that Windows schedules the threads and controls thread synchronization.

# User-Mode Constructs

The CLR guarantees that reads and writes to variables of the following data types are atomic: Boolean, Char, (S)Byte, (U)Int16, (U)Int32, (U)IntPtr, Single, and reference types. This means that all bytes within that variable are read from or written to all at once. So, for example, if you have the following class:

internal static class SomeType {
 public static Int32 x = 0;
}

and if a thread executes this line of code:

SomeType.x = 0x01234567;

then the x variable will change from 0x00000000 to 0x01234567 all at once (atomically). Another thread cannot possibly see the value in an intermediate state. For example, it is impossible for some other thread to query SomeType.x and get a value of 0x01230000. Now suppose that the x field in the preceding SomeType class is an Int64. If a thread executes this line of code:

SomeType.x = 0x0123456789abcdef;

it is possible that another thread could query x and get a value of 0x0123456700000000 or 0x0000000089abcdef, because the read and write operations are not atomic. This is called a torn read.
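
If you want to observe a torn read for yourself, here is a small demonstration of my own (not from the original text). It is only likely to show tearing when run as a 32-bit (x86) process, because a 64-bit process performs Int64 writes atomically.

using System;
using System.Threading;

internal static class TornReadDemo {
 private static Int64 s_x; // Int64 reads/writes are not atomic in a 32-bit process
 public static void Main() {
  // The writer thread alternates the field between two known values
  var writer = new Thread(() => {
   while (true) { s_x = 0x0123456789abcdef; s_x = 0; }
  }) { IsBackground = true };
  writer.Start();
  // The reader thread looks for a value that is neither of the two written values
  while (true) {
   Int64 value = s_x; // an ordinary, potentially torn, read
   if (value != 0 && value != 0x0123456789abcdef) {
    Console.WriteLine("Torn read detected: 0x{0:x16}", value);
    return;
   }
  }
 }
}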

Although atomic access to a variable guarantees that the read or write happens all at once, it does not guarantee when the read or write will happen, due to compiler and CPU optimizations. The primitive user-mode constructs discussed in this section are used to enforce the timing of these atomic read and write operations. In addition, these constructs can also force atomic and timed access to variables of additional data types: (U)Int64 and Double.

There are two kinds of primitive user-mode thread synchronization constructs:

  • Volatile constructs, which perform an atomic read or write operation on a variable containing a simple data type at a specific time

  • Interlocked constructs, which perform an atomic read and write operation on a variable containing a simple data type at a specific time

All the volatile and interlocked constructs require you to pass a reference (memory address) to a variable containing a simple data type.

# Volatile Constructs

Back in the early days of computing, software was written using assembly language. Assembly language is very tedious, because programmers must explicitly state everything—use this CPU register for this, branch to that, call indirect through this other thing, and so on. To simplify programming, higher-level languages were introduced. These higher-level languages introduced common useful constructs, like if/else, switch/case, various loops, local variables, arguments, virtual method calls, operator overloads, and much more. Ultimately, these language compilers must convert the high-level constructs down to the low-level constructs so that the computer can actually do what you want it to do.

In other words, the C# compiler translates your C# constructs into Intermediate Language (IL), which is then converted by the just-in-time (JIT) compiler into native CPU instructions, which must then be processed by the CPU itself. In addition, the C# compiler, the JIT compiler, and even the CPU itself can optimize your code. For example, the following ridiculous method can ultimately be compiled into nothing.

private static void OptimizedAway() {
 // Constant expression is computed at compile time resulting in zero
 Int32 value = (1 * 100) - (50 * 2);
 // If value is 0, the loop never executes
 for (Int32 x = 0; x < value; x++) {
 // There is no need to compile the code in the loop because it can never execute
 Console.WriteLine("Jeff");
 }
}

In this code, the compiler can see that value will always be 0; therefore, the loop will never execute and consequently, there is no need to compile the code inside the loop. This method could be compiled down to nothing. In fact, when JITting a method that calls OptimizedAway, the JITter will try to inline the OptimizedAway method’s code. Because there is no code, the JITter will even remove the code that tries to call OptimizedAway. We love this feature of compilers. As developers, we get to write the code in the way that makes the most sense to us. The code should be easy to write, read, and maintain. Then compilers translate our intentions into machine-understandable code. We want our compilers to do the best job possible for us.

When the C# compiler, JIT compiler, and CPU optimize our code, they guarantee us that the intention of the code is preserved. That is, from a single-threaded perspective, the method does what we want it to do, although it may not do it exactly the way we described in our source code. However, the intention might not be preserved from a multithreaded perspective. Here is an example where the optimizations make the program not work as expected.

internal static class StrangeBehavior {
 // As you'll see later, mark this field as volatile to fix the problem
 private static Boolean s_stopWorker = false;
 public static void Main() {
 Console.WriteLine("Main: letting worker run for 5 seconds");
 Thread t = new Thread(Worker);
 t.Start();
 Thread.Sleep(5000);
 s_stopWorker = true;
 Console.WriteLine("Main: waiting for worker to stop");
 t.Join();
 }
 private static void Worker(Object o) {
 Int32 x = 0;
 while (!s_stopWorker) x++;
 Console.WriteLine("Worker: stopped when x={0}", x);
 }
}

In this code, the Main method creates a new thread that executes the Worker method. This Worker method counts as high as it can before being told to stop. The Main method allows the Worker thread to run for five seconds before telling it to stop by setting the static Boolean field to true. At this point, the Worker thread should display what it counted up to, and then the thread will terminate. The Main thread waits for the Worker thread to terminate by calling Join, and then the Main thread returns, causing the whole process to terminate.

Looks simple enough, right? Well, the program has a potential problem due to all the optimizations that could happen to it. You see, when the Worker method is compiled, the compiler sees that s_stopWorker is either true or false, and it also sees that this value never changes inside the Worker method itself. So the compiler could produce code that checks s_stopWorker first. If s_stopWorker is true, then Worker: stopped when x=0 will be displayed. If s_stopWorker is false, then the compiler produces code that enters an infinite loop that increments x forever. You see, the optimizations cause the loop to run very fast because checking s_stopWorker only occurs once before the loop; it does not get checked with each iteration of the loop.

If you actually want to see this in action, put this code in a .cs file and compile the code by using C#’s /platform:x86 and /optimize+ switches. Then run the resulting EXE file, and you’ll see that the program runs forever. Note that you have to compile for x86, ensuring that the x86 JIT compiler is used at run time. The x86 JIT compiler is more mature than the x64 JIT compiler, so it performs more aggressive optimizations. The x64 JIT compiler does not perform this particular optimization, and therefore the program runs to completion. This highlights another interesting point about all of this. Whether your program behaves as expected depends on a lot of factors, such as which compiler version and compiler switches are used, which JIT compiler is used, and which CPU your code is running on. In addition, to see the preceding program run forever, you must not run the program under a debugger because the debugger causes the JIT compiler to produce unoptimized code that is easier to step through.
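
One way to defeat this optimization without the volatile keyword (discussed later in this chapter) is to read the field by calling Volatile.Read. Here is a sketch of how the Worker method could be rewritten.

private static void Worker(Object o) {
 Int32 x = 0;
 // Volatile.Read forces s_stopWorker to be re-read from memory on every iteration,
 // so the check cannot be hoisted out of the loop by the compiler
 while (!Volatile.Read(ref s_stopWorker)) x++;
 Console.WriteLine("Worker: stopped when x={0}", x);
}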

Let’s look at another example, which has two threads that are both accessing two fields.

internal sealed class ThreadsSharingData { 
 private Int32 m_flag = 0; 
 private Int32 m_value = 0; 
 // This method is executed by one thread 
 public void Thread1() { 
 // Note: These could execute in reverse order
 m_value = 5; 
 m_flag = 1; 
 } 
 // This method is executed by another thread 
 public void Thread2() { 
 // Note: m_value could be read before m_flag
 if (m_flag == 1) 
 Console.WriteLine(m_value); 
 } 
}

The problem with this code is that the compilers/CPU could translate the code in such a way as to reverse the two lines of code in the Thread1 method. After all, reversing the two lines of code does not change the intention of the method. The method needs to get a 5 in m_value and a 1 in m_flag. From a single-threaded application’s perspective, the order of executing this code is unimportant. If these two lines do execute in reverse order, then another thread executing the Thread2 method could see that m_flag is 1 and then display 0.

Let’s look at this code another way. Let’s say that the code in the Thread1 method executes in program order (the way it was written). When compiling the code in the Thread2 method, the compiler must generate code to read m_flag and m_value from RAM into CPU registers. It is possible that RAM will deliver the value of m_value first, which would contain a 0. Then the Thread1 method could execute, changing m_value to 5 and m_flag to 1. But Thread2’s CPU register doesn’t see that m_value has been changed to 5 by this other thread. Then the value in m_flag could be read from RAM into a CPU register; because m_flag is now 1, Thread2 again displays 0.

This is all very scary stuff and is more likely to cause problems in a release build of your program than in a debug build of your program, making it particularly tricky to detect these problems and correct your code. Now, let’s talk about how to correct your code.

The static System.Threading.Volatile class offers two static methods that look like this.

public static class Volatile {
 public static void Write(ref Int32 location, Int32 value); 
 public static Int32 Read(ref Int32 location); 
}

These methods are special. In effect, these methods disable some optimizations usually performed by the C# compiler, the JIT compiler, and the CPU itself. Here’s how the methods work:

  • The Volatile.Write method forces the value in location to be written to at the point of the call. In addition, any earlier program-order loads and stores must occur before the call to Volatile.Write.

  • The Volatile.Read method forces the value in location to be read from at the point of the call. In addition, any later program-order loads and stores must occur after the call to Volatile.Read.

💡Important note: I know that these concepts can be confusing, so let me summarize with a simple rule: when threads communicate with each other via shared memory, write the last value by calling Volatile.Write and read the first value by calling Volatile.Read.

So now we can fix the ThreadsSharingData class by using these methods.

internal sealed class ThreadsSharingData {
 private Int32 m_flag = 0;
 private Int32 m_value = 0;
 // This method is executed by one thread 
 public void Thread1() {
 // Note: 5 must be written to m_value before 1 is written to m_flag
 m_value = 5;
 Volatile.Write(ref m_flag, 1);
 }
 // This method is executed by another thread 
 public void Thread2() {
 // Note: m_value must be read after m_flag is read 
 if (Volatile.Read(ref m_flag) == 1) 
 Console.WriteLine(m_value);
 }
}

First, notice that we are following the rule. The Thread1 method writes two values out to fields that are shared by multiple threads. The last value that we want written (setting m_flag to 1) is performed by calling Volatile.Write. The Thread2 method reads two values from fields shared by multiple threads, and the first value being read (m_flag) is performed by calling Volatile.Read.

But what is really happening here? Well, for the Thread1 method, the Volatile.Write call ensures that all the writes above it are completed before a 1 is written to m_flag. Because m_value = 5 is before the call to Volatile.Write, it must complete first. In fact, if there were many variables being modified before the call to Volatile.Write, they would all have to complete before 1 is written to m_flag. Note that the writes before the call to Volatile.Write can be optimized to execute in any order; it’s just that all the writes have to complete before the call to Volatile.Write.

For the Thread2 method, the Volatile.Read call ensures that all variable reads after it start after the value in m_flag has been read. Because reading m_value is after the call to Volatile.Read, the value must be read after having read the value in m_flag. If there were many reads after the call to Volatile.Read, they would all have to start after the value in m_flag has been read. Note that the reads after the call to Volatile.Read can be optimized to execute in any order; it’s just that the reads can’t start happening until after the call to Volatile.Read.

# C#’s Support for Volatile Fields

Making sure that programmers call the Volatile.Read and Volatile.Write methods correctly is a lot to ask. It’s hard for programmers to keep all of this in their minds and to start imagining what other threads might be doing to shared data in the background. To simplify this, the C# compiler has the volatile keyword, which can be applied to static or instance fields of any of these types: Boolean, (S)Byte, (U)Int16, (U)Int32, (U)IntPtr, Single, or Char. You can also apply the volatile keyword to reference types and to any enum field whose enumerated type has an underlying type of (S)Byte, (U)Int16, or (U)Int32. The JIT compiler ensures that all accesses to a volatile field are performed as volatile reads and writes, so that it is not necessary to explicitly call Volatile's static Read or Write methods. Furthermore, the volatile keyword tells the C# and JIT compilers not to cache the field in a CPU register, ensuring that all reads and writes of the field actually go to memory.

Using the volatile keyword, we can rewrite the ThreadsSharingData class as follows.

internal sealed class ThreadsSharingData {
 private volatile Int32 m_flag = 0;
 private Int32 m_value = 0;
 // This method is executed by one thread 
 public void Thread1() {
 // Note: 5 must be written to m_value before 1 is written to m_flag
 m_value = 5;
 m_flag = 1;
 }
 // This method is executed by another thread 
 public void Thread2() {
 // Note: m_value must be read after m_flag is read 
 if (m_flag == 1)
 Console.WriteLine(m_value);
 }
}

There are some developers (and I am one of them) who do not like C#’s volatile keyword, and they think that the language should not provide it. Our thinking is that most algorithms require few volatile read or write accesses to a field and that most other accesses to the field can occur normally, improving performance; seldom is it required that all accesses to a field be volatile. For example, it is difficult to interpret how to apply volatile read operations to algorithms like this one.

m_amount = m_amount + m_amount; // Assume m_amount is a volatile field defined in a class

Normally, an integer number can be doubled simply by shifting all bits left by 1 bit, and many compilers can examine the preceding code and perform this optimization. However, if m_amount is a volatile field, then this optimization is not allowed. The compiler must produce code to read m_amount into a register and then read it again into another register, add the two registers together, and then write the result back out to the m_amount field. The unoptimized code is certainly bigger and slower; it would be unfortunate if it were contained inside a loop.

Furthermore, C# does not support passing a volatile field by reference to a method. For example, if m_amount is defined as a volatile Int32, attempting to call Int32’s TryParse method causes the compiler to generate a warning as shown here.

Boolean success = Int32.TryParse("123", out m_amount); 
// The preceding line causes the C# compiler to generate a warning: 
// CS0420: a reference to a volatile field will not be treated as volatile
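
A common way to avoid this warning is to copy the volatile field into an ordinary local variable, pass the local by reference, and then store the result back into the field; the final assignment is still performed as a volatile write.

Int32 temp; // an ordinary local; safe to pass by reference
Boolean success = Int32.TryParse("123", out temp);
m_amount = temp; // a volatile write, because m_amount is declared volatile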

Finally, volatile fields are not Common Language Specification (CLS) compliant because many languages (including Visual Basic) do not support them.

# Interlocked Constructs

Volatile’s Read method performs an atomic read operation, and its Write method performs an atomic write operation. That is, each method performs either an atomic read operation or an atomic write operation. In this section, we look at the static System.Threading.Interlocked class’s methods. Each of the methods in the Interlocked class performs an atomic read and write operation. In addition, all the Interlocked methods are full memory fences. That is, any variable writes before the call to an Interlocked method execute before the Interlocked method, and any variable reads after the call execute after the call.

The static methods that operate on Int32 variables are by far the most commonly used methods. I show them here.

public static class Interlocked { 
 // return (++location) 
 public static Int32 Increment(ref Int32 location); 
 // return (--location) 
 public static Int32 Decrement(ref Int32 location); 
 // return (location += value) 
 // Note: value can be a negative number allowing subtraction 
 public static Int32 Add(ref Int32 location, Int32 value); 
 // Int32 old = location; location = value; return old; 
 public static Int32 Exchange(ref Int32 location, Int32 value); 
 // Int32 old = location; 
 // if (location == comparand) location = value;
 // return old; 
 public static Int32 CompareExchange(ref Int32 location, Int32 value, Int32 comparand); 
 ... 
}

There are also overloads of the preceding methods that operate on Int64 values. Furthermore, the Interlocked class offers Exchange and CompareExchange methods that take Object, IntPtr, Single, and Double, and there is also a generic version in which the generic type is constrained to class (any reference type).
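
As a quick illustration of why these methods matter, here is a small example of my own (not from the text): several tasks increment a shared counter by calling Interlocked.Increment. An ordinary s_count++ would lose updates under contention, because it compiles into separate read, add, and write operations.

using System;
using System.Threading;
using System.Threading.Tasks;

internal static class InterlockedCounter {
 private static Int32 s_count = 0;
 public static void Main() {
  Task[] tasks = new Task[10];
  for (Int32 t = 0; t < tasks.Length; t++) {
   tasks[t] = Task.Run(() => {
    for (Int32 n = 0; n < 100000; n++)
     Interlocked.Increment(ref s_count); // atomic read-modify-write
   });
  }
  Task.WaitAll(tasks);
  Console.WriteLine(s_count); // always displays 1000000
 }
}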

Personally, I love the Interlocked methods, because they are relatively fast and you can do so much with them. Let me show you some code that uses the Interlocked methods to asynchronously query several web servers and concurrently process the returned data. This code is pretty short, never blocks any threads, and uses thread pool threads to scale automatically, consuming up to the number of CPUs available if its workload could benefit from it. In addition, the code, as is, supports accessing up to 2,147,483,647 (Int32.MaxValue) web servers. In other words, this code is a great model to follow for your own scenarios.

internal sealed class MultiWebRequests {
 // This helper class coordinates all the asynchronous operations
 private AsyncCoordinator m_ac = new AsyncCoordinator();
 // Set of web servers we want to query & their responses (Exception or Int32)
 // NOTE: Even though multiple threads could access this dictionary simultaneously,
 // there is no need to synchronize access to it because the keys are 
 // read-only after construction
 private Dictionary<String, Object> m_servers = new Dictionary<String, Object> {
 { "http://Wintellect.com/", null },
 { "http://Microsoft.com/", null },
 { "http://1.1.1.1/", null } 
 };
 public MultiWebRequests(Int32 timeout = Timeout.Infinite) {
 // Asynchronously initiate all the requests all at once
 var httpClient = new HttpClient();
 foreach (var server in m_servers.Keys) {
 m_ac.AboutToBegin(1);
 httpClient.GetByteArrayAsync(server)
 .ContinueWith(task => ComputeResult(server, task));
 }
 // Tell AsyncCoordinator that all operations have been initiated and to call
 // AllDone when all operations complete, Cancel is called, or the timeout occurs
 m_ac.AllBegun(AllDone, timeout);
 }
 private void ComputeResult(String server, Task<Byte[]> task) {
 Object result;
 if (task.Exception != null) {
 result = task.Exception.InnerException;
 } else {
 // Process I/O completion here on thread pool thread(s)
 // Put your own compute-intensive algorithm here...
 result = task.Result.Length; // This example just returns the length
 }
 // Save result (exception/sum) and indicate that 1 operation completed
 m_servers[server] = result;
 m_ac.JustEnded();
 }
 // Calling this method indicates that the results don't matter anymore
 public void Cancel() { m_ac.Cancel(); }
 // This method is called after all web servers respond, 
 // Cancel is called, or the timeout occurs
 private void AllDone(CoordinationStatus status) {
 switch (status) {
 case CoordinationStatus.Cancel:
 Console.WriteLine("Operation canceled.");
 break;
 case CoordinationStatus.Timeout:
 Console.WriteLine("Operation timed-out.");
 break;
 case CoordinationStatus.AllDone:
 Console.WriteLine("Operation completed; results below:");
 foreach (var server in m_servers) {
 Console.Write("{0} ", server.Key);
 Object result = server.Value;
 if (result is Exception) {
 Console.WriteLine("failed due to {0}.", result.GetType().Name);
 } else {
 Console.WriteLine("returned {0:N0} bytes.", result);
 }
 }
 break;
 }
 }
}

OK, the preceding code doesn’t actually use any Interlocked methods directly, because I encapsulated all the coordination code in a reusable class called AsyncCoordinator, which I’ll explain shortly. Let me first explain what this class is doing. When the MultiWebRequests class is constructed, it initializes an AsyncCoordinator and a dictionary containing the set of server URIs (and their future result). It then issues all the web requests asynchronously one right after the other. It does this by first calling AsyncCoordinator’s AboutToBegin method, passing it the number of requests about to be issued. Then it initiates the request by calling HttpClient’s GetByteArrayAsync. This returns a Task, and I then call ContinueWith on this Task so that when the server replies with the bytes, they can be processed by my ComputeResult method concurrently via many thread pool threads. After all the web servers’ requests have been made, the AsyncCoordinator’s AllBegun method is called, passing it the name of the method (AllDone) that should execute when all the operations complete and a timeout value. As each web server responds, various thread pool threads will call the MultiWebRequests’s ComputeResult method. This method processes the bytes returned from the server (or any error that may have occurred) and saves the result in the dictionary collection. After storing each result, AsyncCoordinator’s JustEnded method is called to let the AsyncCoordinator object know that an operation completed.

If all the operations have completed, then the AsyncCoordinator will invoke the AllDone method to process the results from all the web servers. The code executing the AllDone method will be the thread pool thread that just happened to get the last web server response. If timeout or cancellation occurs, then AllDone will be invoked via whatever thread pool thread notifies the AsyncCoordinator of timeout or using whatever thread happened to call the Cancel method. There is also a chance that the thread issuing the web server requests could invoke AllDone itself if the last request completes before AllBegun is called.

Note that there is a race because it is possible that all web server requests complete, AllBegun is called, timeout occurs, and Cancel is called all at the exact same time. If this happens, then the AsyncCoordinator will select a winner and three losers, ensuring that the AllDone method is never called more than once. The winner is identified by the status argument passed into AllDone, which can be one of the symbols defined by the CoordinationStatus type.

internal enum CoordinationStatus { AllDone, Timeout, Cancel };

Now that you get a sense of what happens, let’s take a look at how it works. The AsyncCoordinator class encapsulates all the thread coordination logic in it. It uses Interlocked methods for everything to ensure that the code runs extremely fast and that no threads ever block. Here is the code for this class.

internal sealed class AsyncCoordinator {
 private Int32 m_opCount = 1; // Decremented when AllBegun calls JustEnded
 private Int32 m_statusReported = 0; // 0=false, 1=true
 private Action<CoordinationStatus> m_callback;
 private Timer m_timer;
 // This method MUST be called BEFORE initiating an operation 
 public void AboutToBegin(Int32 opsToAdd = 1) { 
 Interlocked.Add(ref m_opCount, opsToAdd); 
 }
 // This method MUST be called AFTER an operation’s result has been processed 
 public void JustEnded() {
 if (Interlocked.Decrement(ref m_opCount) == 0) 
 ReportStatus(CoordinationStatus.AllDone);
 }
 // This method MUST be called AFTER initiating ALL operations
 public void AllBegun(Action<CoordinationStatus> callback, 
 Int32 timeout = Timeout.Infinite) {
 m_callback = callback;
 if (timeout != Timeout.Infinite)
 m_timer = new Timer(TimeExpired, null, timeout, Timeout.Infinite);
 JustEnded();
 }
 private void TimeExpired(Object o) { ReportStatus(CoordinationStatus.Timeout); }
 public void Cancel() { ReportStatus(CoordinationStatus.Cancel); }
 private void ReportStatus(CoordinationStatus status) {
 // If status has never been reported, report it; else ignore it
 if (Interlocked.Exchange(ref m_statusReported, 1) == 0) 
 m_callback(status);
 }
}

The most important field in this class is the m_opCount field. This field keeps track of the number of asynchronous operations that are still outstanding. Just before each asynchronous operation is started, AboutToBegin is called. This method calls Interlocked.Add to add the number passed to it to the m_opCount field in an atomic way. Adding to m_opCount must be performed atomically because web servers could be processing responses on thread pool threads as more operations are being started. As web server responses are processed, JustEnded is called. This method calls Interlocked.Decrement to atomically subtract 1 from m_opCount. Whichever thread happens to set m_opCount to 0 calls ReportStatus.

💡Note: It is important that the m_opCount field is initialized to 1 (not 0). While the thread executing the constructor is issuing the web server requests, m_opCount is at least 1, which guarantees that AllDone cannot be invoked. Before the constructor calls AllBegun, m_opCount can never reach 0. When the constructor calls AllBegun, AllBegun internally calls JustEnded, which decrements m_opCount, effectively undoing the effect of initializing it to 1. Now m_opCount can reach 0, but only after all the web server requests have been initiated.

The ReportStatus method arbitrates the race that can occur among all the operations completing, the timeout occurring, and Cancel being called. ReportStatus must make sure that only one of these conditions is considered the winner so that the m_callback method is invoked only once. Arbitrating the winner is done via calling Interlocked.Exchange, passing it a reference to the m_statusReported field. This field is really treated as a Boolean variable; however, it can’t actually be a Boolean variable because there are no Interlocked methods that accept a Boolean variable. So I use an Int32 variable instead where 0 means false and 1 means true.

Inside ReportStatus, the Interlocked.Exchange call will change m_statusReported to 1. But only the first thread to do this will see Interlocked.Exchange return a 0, and only this thread will invoke the callback method. Any other threads that call Interlocked.Exchange will get a return value of 1, effectively notifying these threads that the callback method has already been invoked and therefore it should not be invoked again.

# Implementing a Simple Spin Lock

The Interlocked methods are great, but they mostly operate on Int32 values. What if you need to manipulate a bunch of fields in a class object atomically? In this case, we need a way to stop all threads but one from entering the region of code that manipulates the fields. Using Interlocked methods, we can build a thread synchronization lock.

internal struct SimpleSpinLock {
 private Int32 m_ResourceInUse; // 0=false (default), 1=true
 public void Enter() {
 while (true) {
 // Always set resource to in-use
 // When this thread changes it from not in-use, return
 if (Interlocked.Exchange(ref m_ResourceInUse, 1) == 0) return;
 // Black magic goes here...
 }
 }
 public void Leave() {
 // Set resource to not in-use
 Volatile.Write(ref m_ResourceInUse, 0);
 }
}

And here is a class that shows how to use the SimpleSpinLock.

public sealed class SomeResource {
 private SimpleSpinLock m_sl = new SimpleSpinLock();
 public void AccessResource() {
 m_sl.Enter();
 // Only one thread at a time can get in here to access the resource...
 m_sl.Leave();
 }
}

The SimpleSpinLock implementation is very simple. If two threads call Enter at the same time, Interlocked.Exchange ensures that one thread changes m_ResourceInUse from 0 to 1 and sees that m_ResourceInUse was 0. This thread then returns from Enter so that it can continue executing the code in the AccessResource method. The other thread will change m_ResourceInUse from a 1 to a 1. This thread will see that it did not change m_ResourceInUse from a 0, and this thread will now start spinning continuously, calling Exchange until the first thread calls Leave.

When the first thread is done manipulating the fields of the SomeResource object, it calls Leave, which internally calls Volatile.Write and changes m_ResourceInUse back to a 0. This causes the spinning thread to then change m_ResourceInUse from a 0 to a 1, and this thread now gets to return from Enter so that it can access SomeResource object’s fields.

There you have it. This is a simple implementation of a thread synchronization lock. The big potential problem with this lock is that it causes threads to spin when there is contention for the lock. This spinning wastes precious CPU time, preventing the CPU from doing other, more useful work. As a result, spin locks should only ever be used to guard regions of code that execute very quickly.

Spin locks should not typically be used on single-CPU machines, because the thread that holds the lock can’t quickly release it if the thread that wants the lock is spinning. The situation becomes much worse if the thread holding the lock is at a lower priority than the thread wanting to get the lock, because now the thread holding the lock may not get a chance to run at all, resulting in a livelock situation. Windows sometimes boosts a thread’s priority dynamically for short periods of time. Therefore, boosting should be disabled for threads that are using spin locks; see the PriorityBoostEnabled properties of System.Diagnostics.Process and System.Diagnostics.ProcessThread. There are issues related to using spin locks on hyperthreaded machines, too. In an attempt to circumvent these kinds of problems, many spin locks have some additional logic in them; I refer to the additional logic as Black Magic. I’d rather not go into the details of Black Magic because it changes over time as more people study locks and their performance. However, I will say this: The FCL ships with a structure, System.Threading.SpinWait, which encapsulates the state-of-the-art thinking around this Black Magic.

Putting a Delay in the Thread's Processing

The “Black Magic” is all about getting the thread that wants the resource to pause its execution, so that the thread that currently owns the resource can run its code and relinquish the resource. To do this, the SpinWait structure internally calls Thread’s static Sleep, Yield, and SpinWait methods. I’ll briefly describe these methods in this sidebar.

A thread can tell the system that it does not want to be schedulable for a certain amount of time by calling Thread’s static Sleep method:

public static void Sleep(Int32 millisecondsTimeout);
public static void Sleep(TimeSpan timeout);

This method causes the thread to suspend itself for the specified amount of time. Calling Sleep allows the thread to voluntarily give up the remainder of its time slice. The system makes the thread not schedulable for approximately the amount of time specified. That’s right—if you tell the system you want a thread to sleep for 100 milliseconds, it will sleep approximately that long, but possibly several seconds or even minutes more. Remember that Windows is not a real-time operating system. Your thread will probably wake up at the right time, but whether it does depends on what else is going on in the system.

You can call Sleep, passing the value in System.Threading.Timeout.Infinite (defined as -1) for the millisecondsTimeout parameter. This tells the system to never schedule the thread, but it is not a useful thing to do. It is much better to have the thread exit, reclaiming its stack and kernel object. You can pass 0 to Sleep, which tells the system that the calling thread relinquishes the remainder of its current time slice, forcing the system to schedule another thread. However, the system can reschedule the thread that just called Sleep (this happens if there are no other schedulable threads at the same or higher priority).

A thread can ask Windows to schedule another thread on the current CPU by calling Thread’s Yield method:

public static Boolean Yield();

If Windows finds another thread ready to run on the current processor, then Yield returns true, the thread that called Yield ends its time slice early, the selected thread runs for one time slice, and then the thread that called Yield is scheduled again and starts running with a fresh new time slice. If Windows does not find another thread ready to run on the current processor, then Yield returns false, and the thread that called Yield continues running for the remainder of its time slice.

The Yield method exists to give a thread of equal or lower priority that is starving for CPU time a chance to run. A thread calls this method if it wants a resource that is currently owned by another thread. The hope is that Windows will schedule the thread that currently owns the resource and that this thread will relinquish it. Then, when the thread that called Yield runs again, the resource will be available.

Calling Yield falls somewhere between calling Thread.Sleep(0) and Thread.Sleep(1). Thread.Sleep(0) will not let a lower-priority thread run, whereas Thread.Sleep(1) always forces a context switch, and, due to the resolution of the internal system timer, Windows always forces the thread to sleep longer than 1 millisecond.

Hyperthreaded CPUs really let only one thread run at a time, so when executing spin loops on these CPUs, you need to force the current thread to pause so that the CPU has a chance to switch to the other thread and let it run. A thread can force itself to pause, allowing a hyperthreaded CPU to switch to its other thread, by calling Thread’s SpinWait method:

public static void SpinWait(Int32 iterations);

Calling this method actually executes a special CPU instruction; it does not tell Windows to do anything (because Windows already thinks that it has scheduled two threads on the CPU). On a non-hyperthreaded CPU, this special CPU instruction is simply ignored.

For more information about these methods, see their Win32 equivalents: Sleep, SwitchToThread, and YieldProcessor. Also, for more information about adjusting the resolution of the system timer, see the Win32 timeBeginPeriod and timeEndPeriod functions.
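
To give a feel for how SpinWait is used, here is a hedged sketch of my own showing how SimpleSpinLock’s Enter method might fill in its “Black Magic” placeholder; this is an illustration, not the FCL’s actual SpinLock implementation.

public void Enter() {
 // SpinWait escalates from busy-spinning to yielding the processor and
 // eventually sleeping as the number of SpinOnce calls accumulates
 SpinWait spinWait = new SpinWait();
 while (Interlocked.Exchange(ref m_ResourceInUse, 1) != 0)
  spinWait.SpinOnce(); // the "Black Magic" pause goes here
}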

The FCL also includes a System.Threading.SpinLock structure that is similar to my SimpleSpinLock class shown earlier, except that it uses the SpinWait structure to improve performance. The SpinLock structure also offers timeout support. By the way, it is interesting to note that my SimpleSpinLock and the FCL’s SpinLock are both value types. This means that they are lightweight, memory-friendly objects. A SpinLock is a good choice if you need to associate a lock with each item in a collection, for example. However, you must make sure that you do not pass SpinLock instances around, because they are copied and you will lose any and all synchronization. And although you can define instance SpinLock fields, do not mark the field as readonly, because its internal state must change as the lock is manipulated.
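
For reference, here is a minimal usage sketch of the FCL’s SpinLock; the Enter/Exit pattern with a lockTaken flag is the structure’s actual API, whereas the surrounding names are mine.

private static SpinLock s_lock = new SpinLock(); // note: NOT marked readonly

public static void AccessResource() {
 Boolean lockTaken = false;
 try {
  s_lock.Enter(ref lockTaken); // sets lockTaken to true once the lock is acquired
  // Only one thread at a time executes in here...
 }
 finally {
  if (lockTaken) s_lock.Exit();
 }
}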

# The Interlocked Anything Pattern

Many people look at the Interlocked methods and wonder why Microsoft doesn't create a richer set of interlocked methods that can be used in a wider range of scenarios. For example, it would be nice if the Interlocked class offered Multiply, Divide, Minimum, Maximum, And, Or, Xor, and a bunch of other methods. Although the Interlocked class doesn’t offer these methods, there is a well-known pattern that allows you to perform any operation on an Int32 in an atomic way by using Interlocked.CompareExchange. In fact, because Interlocked.CompareExchange has additional overloads that operate on Int64, Single, Double, Object, and a generic reference type, this pattern will actually work for all these types, too.

This pattern is similar to optimistic concurrency patterns used for modifying database records. Here is an example of the pattern that is being used to create an atomic Maximum method.

public static Int32 Maximum(ref Int32 target, Int32 value) {
 Int32 currentVal = target, startVal, desiredVal;
 // Don't access target in the loop except in an attempt 
 // to change it because another thread may be touching it 
 do {
 // Record this iteration's starting value
 startVal = currentVal;
 // Calculate the desired value in terms of startVal and value
 desiredVal = Math.Max(startVal, value);
 // NOTE: the thread could be preempted here!
 // if (target == startVal) target = desiredVal
 // Value prior to potential change is returned
 currentVal = Interlocked.CompareExchange(ref target, desiredVal, startVal);
 // If the starting value changed during this iteration, repeat 
 } while (startVal != currentVal);
 // Return the maximum value when this thread tried to set it
 return desiredVal;
}

Now let me explain exactly what is going on here. Upon entering the method, currentVal is initialized to the value in target at the moment the method starts executing. Then, inside the loop, startVal is initialized to this same value. Using startVal, you can perform any operation you want. This operation can be extremely complex, consisting of thousands of lines of code. But, ultimately, you must end up with a result that is placed into desiredVal. In my example, I simply determine whether startVal or value contains the larger value.

Now, while this operation is running, another thread could change the value in target. It is unlikely that this will happen, but it is possible. If this does happen, then the value in desiredVal is based off an old value in startVal, not the current value in target, and therefore, we should not change the value in target. To ensure that the value in target is changed to desiredVal if no thread has changed target behind our thread’s back, we use Interlocked.CompareExchange. This method checks whether the value in target matches the value in startVal (which identifies the value that we thought was in target before starting to perform the operation). If the value in target didn’t change, then CompareExchange changes it to the new value in desiredVal. If the value in target did change, then CompareExchange does not alter the value in target at all.

CompareExchange returns the value that is in target at the time when CompareExchange is called, which I then place in currentVal. Then, a check is made comparing startVal with the new value in currentVal. If these values are the same, then a thread did not change target behind our thread’s back, target now contains the value in desiredVal, the while loop does not loop around, and the method returns. If startVal is not equal to currentVal, then a thread did change the value in target behind our thread’s back, target did not get changed to our value in desiredVal, and the while loop will loop around and try the operation again, this time using the new value in currentVal that reflects the other thread’s change.
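
Here is a small usage sketch of my own (assuming the Maximum method above and the System.Threading.Tasks namespace are available): several tasks race through Maximum, and regardless of how their operations interleave, the field ends up holding the largest value any task offered.

private static Int32 s_max = Int32.MinValue;

public static void MaximumDemo() {
 Task[] tasks = new Task[4];
 for (Int32 t = 0; t < tasks.Length; t++) {
  Int32 seed = t; // capture a per-task seed
  tasks[t] = Task.Run(() => {
   Random r = new Random(seed);
   for (Int32 i = 0; i < 100000; i++)
    Maximum(ref s_max, r.Next()); // atomically keeps the larger value
  });
 }
 Task.WaitAll(tasks);
 Console.WriteLine("Largest value observed: {0:N0}", s_max);
}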

Personally, I have used this pattern in a lot of my own code and, in fact, I made a generic method, Morph, which encapsulates this pattern.

delegate Int32 Morpher<TResult, TArgument>(Int32 startValue, TArgument argument, 
 out TResult morphResult);
static TResult Morph<TResult, TArgument>(ref Int32 target, TArgument argument, 
 Morpher<TResult, TArgument> morpher) {
 TResult morphResult;
 Int32 currentVal = target, startVal, desiredVal;
 do {
 startVal = currentVal;
 desiredVal = morpher(startVal, argument, out morphResult);
 currentVal = Interlocked.CompareExchange(ref target, desiredVal, startVal);
 } while (startVal != currentVal);
 return morphResult;
}
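
As an illustration of how Morph might be called (my example, not from the text), here is a hypothetical morpher that atomically doubles an Int32 field named m_amount and reports whether the doubling overflowed.

// Hypothetical morpher: doubles startValue and reports overflow via morphResult
private static Int32 Doubler(Int32 startValue, Object argument, out Boolean overflowed) {
 overflowed = startValue > Int32.MaxValue / 2; // ignores negative overflow for brevity
 return unchecked(startValue * 2);
}

// Atomically double m_amount; overflowed reports whether the result wrapped
Boolean overflowed = Morph<Boolean, Object>(ref m_amount, null, Doubler);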

💡Summary: The CLR guarantees that reads and writes to variables of the following data types are atomic: Boolean, Char, (S)Byte, (U)Int16, (U)Int32, (U)IntPtr, Single, and reference types. This means that all bytes within such a variable are read from or written to all at once. Although atomic access to a variable guarantees that the read or write happens all at once, it does not guarantee when the operation happens, due to compiler and CPU optimizations. The primitive user-mode constructs are used to enforce the timing of these atomic read and write operations; in addition, they can force atomic, correctly timed access to variables of type (U)Int64 and Double. There are two kinds of primitive user-mode thread synchronization constructs: volatile constructs, which perform an atomic read or write operation on a variable containing a simple data type at a specific time, and interlocked constructs, which perform an atomic read and write operation on such a variable at a specific time. The C# compiler translates your C# constructs into Intermediate Language (IL); the JIT compiler then converts the IL into native CPU instructions, which the CPU itself processes. Moreover, the C# compiler, the JIT compiler, and even the CPU itself may optimize your code. When they do, they guarantee that the intention of the code is preserved from a single-threaded perspective: the method does what you want it to do, even if not exactly in the way you described in your source code. From a multithreaded perspective, however, your intention is not necessarily preserved. The x86 JIT compiler is more mature than the x64 JIT compiler, so it is more aggressive about optimizing; other JIT compilers do not perform that particular optimization, so the program runs to completion as expected. This highlights another interesting point: whether a program works as expected depends on many factors, such as which compiler version and compiler switches are used, which JIT compiler is used, and which CPU the code runs on. In addition, to see the preceding program loop forever, you must not run it under a debugger, because the debugger causes the JIT compiler to produce unoptimized code (to make single-stepping easier). The static System.Threading.Volatile class offers two special static methods that effectively disable some of the optimizations usually performed by the C# compiler, the JIT compiler, and the CPU. Volatile.Write forces the value in location to be written at the point of the call; in addition, all earlier program-order loads and stores must occur before the call to Volatile.Write. Volatile.Read forces the value in location to be read at the point of the call; in addition, all later program-order loads and stores must occur after the call to Volatile.Read. To simplify programming, the C# compiler offers the volatile keyword, which can be applied to static or instance fields of any of these types: Boolean, (S)Byte, (U)Int16, (U)Int32, (U)IntPtr, Single, and Char; it can also be applied to reference-type fields and to any enum field whose underlying type is (S)Byte, (U)Int16, or (U)Int32. The JIT compiler ensures that all accesses to a volatile field are performed as volatile reads and writes, so it is not necessary to explicitly call Volatile's static Read or Write methods. Furthermore, the volatile keyword tells the C# and JIT compilers not to cache the field in a CPU register, ensuring that all reads and writes of the field actually go to RAM. Each of the Interlocked class's methods performs an atomic read and write operation. In addition, all the Interlocked methods establish full memory fences: any variable writes before a call to an Interlocked method execute before the call, and any variable reads after the call execute after the call. The Interlocked methods are convenient, but they mostly operate on Int32 values. You can use the Interlocked methods to build a thread synchronization lock such as a simple spin lock. The biggest problem with such a lock is that, under contention, threads spin; this spinning wastes precious CPU time that the CPU could be using to do other, more useful work, so spin locks should only be used to guard regions of code that execute very quickly. Spin locks should generally not be used on single-CPU machines, because a thread that wants the lock spins while the thread that owns the lock cannot release it quickly. The situation gets even worse if the thread owning the lock is at a lower priority than the thread wanting the lock, because the owning thread may never get a chance to run at all, resulting in a livelock. Windows sometimes boosts a thread's priority dynamically for short periods of time, so priority boosting should be disabled for threads that use spin locks. Hyperthreaded machines also have problems with spin locks. To work around these problems, many spin locks contain some additional logic, which I refer to as “Black Magic.” I won't go into its details here, because the logic can change as more people study locks and their performance. The FCL ships with a structure, System.Threading.SpinWait, that encapsulates the state-of-the-art thinking about this Black Magic. The FCL also includes a System.Threading.SpinLock structure, similar to the SimpleSpinLock class shown earlier except that it uses the SpinWait structure to improve performance; the SpinLock structure also offers timeout support. Interestingly, both my SimpleSpinLock and the FCL's SpinLock are value types, which means they are lightweight, memory-friendly objects. But be sure not to pass SpinLock instances around, because they get copied and you lose all synchronization. And although you can define instance SpinLock fields, do not mark such a field as readonly, because the lock's internal state must change as the lock is manipulated.

# Kernel-Mode Constructs

Windows offers several kernel-mode constructs for synchronizing threads. The kernel-mode constructs are much slower than the user-mode constructs. This is because they require coordination from the Windows operating system itself. Also, each method call on a kernel object causes the calling thread to transition from managed code to native user-mode code to native kernel-mode code and then return all the way back. These transitions require a lot of CPU time and, if performed frequently, can adversely affect the overall performance of your application.

However, the kernel-mode constructs offer some benefits over the primitive user-mode constructs, such as:

  • When a kernel-mode construct detects contention on a resource, Windows blocks the losing thread so that it is not spinning on a CPU, wasting processor resources.

  • Kernel-mode constructs can synchronize native and managed threads with each other.

  • Kernel-mode constructs can synchronize threads running in different processes on the same machine.

  • Kernel-mode constructs can have security applied to them to prevent unauthorized accounts from accessing them.

  • A thread can block until all kernel-mode constructs in a set are available or until any one kernel-mode construct in a set has become available.

  • A thread can block on a kernel-mode construct specifying a timeout value; if the thread can’t have access to the resource it wants in the specified amount of time, then the thread is unblocked and can perform other tasks.

The two primitive kernel-mode thread synchronization constructs are events and semaphores. Other kernel-mode constructs, such as the mutex, are built on top of these two primitive constructs. For more information about the Windows kernel-mode constructs, see the book Windows via C/C++, Fifth Edition (Microsoft Press, 2007) by myself and Christophe Nasarre.

The System.Threading namespace offers an abstract base class called WaitHandle. The WaitHandle class is a simple class whose sole purpose is to wrap a Windows kernel object handle. The FCL provides several classes derived from WaitHandle. All classes are defined in the System.Threading namespace. The class hierarchy looks like this.

WaitHandle
 	EventWaitHandle
 		AutoResetEvent
 		ManualResetEvent
 	Semaphore
 	Mutex

Internally, the WaitHandle base class has a SafeWaitHandle field that holds a Win32 kernel object handle. This field is initialized when a concrete WaitHandle-derived class is constructed. In addition, the WaitHandle class publicly exposes methods that are inherited by all the derived classes. Every method called on a kernel-mode construct represents a full memory fence. WaitHandle’s interesting public methods are shown in the following code (some overloads for some methods are not shown).

public abstract class WaitHandle : MarshalByRefObject, IDisposable {
 // WaitOne internally calls the Win32 WaitForSingleObjectEx function. 
 public virtual Boolean WaitOne();
 public virtual Boolean WaitOne(Int32 millisecondsTimeout);
 public virtual Boolean WaitOne(TimeSpan timeout);
 // WaitAll internally calls the Win32 WaitForMultipleObjectsEx function 
 public static Boolean WaitAll(WaitHandle[] waitHandles);
 public static Boolean WaitAll(WaitHandle[] waitHandles, Int32 millisecondsTimeout);
 public static Boolean WaitAll(WaitHandle[] waitHandles, TimeSpan timeout);
 // WaitAny internally calls the Win32 WaitForMultipleObjectsEx function 
 public static Int32 WaitAny(WaitHandle[] waitHandles);
 public static Int32 WaitAny(WaitHandle[] waitHandles, Int32 millisecondsTimeout);
 public static Int32 WaitAny(WaitHandle[] waitHandles, TimeSpan timeout);
 public const Int32 WaitTimeout = 258; // Returned from WaitAny if a timeout occurs
 // Dispose internally calls the Win32 CloseHandle function – DON’T CALL THIS.
 public void Dispose();
}

There are a few things to note about these methods:

  • You call WaitHandle’s WaitOne method to have the calling thread wait for the underlying kernel object to become signaled. Internally, this method calls the Win32 WaitForSingleObjectEx function. The returned Boolean is true if the object became signaled or false if a timeout occurs.

  • You call WaitHandle’s static WaitAll method to have the calling thread wait for all the kernel objects specified in the WaitHandle[] to become signaled. The returned Boolean is true if all of the objects became signaled or false if a timeout occurs. Internally, this method calls the Win32 WaitForMultipleObjectsEx function, passing TRUE for the bWaitAll parameter.

  • You call WaitHandle’s static WaitAny method to have the calling thread wait for any one of the kernel objects specified in the WaitHandle[] to become signaled. The returned Int32 is the index of the array element corresponding to the kernel object that became signaled, or WaitHandle.WaitTimeout if no object became signaled while waiting. Internally, this method calls the Win32 WaitForMultipleObjectsEx function, passing FALSE for the bWaitAll parameter.

  • The array that you pass to the WaitAny and WaitAll methods must contain no more than 64 elements or else the methods throw a System.NotSupportedException.

  • You call WaitHandle’s Dispose method to close the underlying kernel object handle. Internally, these methods call the Win32 CloseHandle function. You can only call Dispose explicitly in your code if you know for a fact that no other threads are using the kernel object. This puts a lot of burden on you as you write your code and test it. So, I would strongly discourage you from calling Dispose; instead, just let the garbage collector (GC) do the cleanup. The GC knows when no threads are using the object anymore, and then it will get rid of it. In a way, the GC is doing thread synchronization for you automatically!
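
To tie these rules together, here's a minimal sketch of my own (the class name and values are just for illustration) showing WaitAny with a timeout.

using System;
using System.Threading;

public static class WaitAnyDemo {
 public static void Main() {
  WaitHandle[] handles = new WaitHandle[] {
   new AutoResetEvent(false), // Index 0: not signaled
   new AutoResetEvent(true)   // Index 1: initially signaled
  };
  // Wait up to 5,000 milliseconds for any one object to become signaled
  Int32 result = WaitHandle.WaitAny(handles, 5000);
  if (result == WaitHandle.WaitTimeout)
   Console.WriteLine("Timed out; no object became signaled");
  else
   Console.WriteLine("The object at index {0} became signaled", result); // Index 1
 }
}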

💡 Note: In some cases, when a thread that is in a COM single-threaded apartment blocks, the thread can wake up internally to pump messages. For example, the blocked thread will wake up to process a Windows message sent from another thread. This is done to support COM interoperability. For most applications, this is not a problem; in fact, it is a good thing. However, if your code takes another thread synchronization lock during the processing of the message, then deadlock could occur. As discussed in Chapter 30, all of the hybrid locks call these methods internally, so the same benefits and problems exist when you use the hybrid locks.

The versions of WaitOne and WaitAll that do not accept a timeout parameter really ought to be prototyped as having a void return type, not Boolean, because the implied timeout is infinite (System.Threading.Timeout.Infinite) and these methods can return only true. When you call any of these methods, you do not need to check their return value.

As already mentioned, the AutoResetEvent, ManualResetEvent, Semaphore, and Mutex classes are all derived from WaitHandle, so they inherit WaitHandle’s methods and their behavior. However, these classes introduce additional methods of their own, and I’ll address those now.

First, the constructors for all of these classes internally call the Win32 CreateEvent function (AutoResetEvent passes FALSE for the bManualReset parameter; ManualResetEvent passes TRUE), the CreateSemaphore function, or the CreateMutex function. The handle value returned from all of these calls is saved in a private SafeWaitHandle field defined inside the WaitHandle base class.

Second, the EventWaitHandle, Semaphore, and Mutex classes all offer static OpenExisting methods, which internally call the Win32 OpenEvent, OpenSemaphore, or OpenMutex functions, passing a String argument that identifies an existing named kernel object. The handle value returned from all of these functions is saved in a newly constructed object that is returned from the OpenExisting method. If no kernel object exists with the specified name, a WaitHandleCannotBeOpenedException is thrown.
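
For example, here's a minimal sketch of my own (the name string is hypothetical) showing how one process might open a named semaphore that another process may have created.

using System;
using System.Threading;

public static class OpenExistingDemo {
 public static void Main() {
  try {
   // Try to open an existing kernel object with the specified name
   using (Semaphore s = Semaphore.OpenExisting("SomeUniqueStringIdentifyingMyApp")) {
    Console.WriteLine("Opened the named semaphore created by another process");
   }
  }
  catch (WaitHandleCannotBeOpenedException) {
   Console.WriteLine("No kernel object with this name exists");
  }
 }
}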

A common usage of the kernel-mode constructs is to create the kind of application that allows only one instance of itself to execute at any given time. Examples of single-instance applications are Microsoft Outlook, Windows Live Messenger, Windows Media Player, and Windows Media Center. Here is how to implement a single-instance application.

using System;
using System.Threading;
public static class Program {
 public static void Main() {
 Boolean createdNew;
 
 // Try to create a kernel object with the specified name
 using (new Semaphore(0, 1, "SomeUniqueStringIdentifyingMyApp", out createdNew)) {
 if (createdNew) {
 // This thread created the kernel object so no other instance of this
 // application must be running. Run the rest of the application here...
 } else {
 // This thread opened an existing kernel object with the same string name;
 // another instance of this application must be running now.
 // There is nothing to do in here, let's just return from Main to terminate
 // this second instance of the application.
 }
 }
 }
}

In this code, I am using a Semaphore, but it would work just as well if I had used an EventWaitHandle or a Mutex because I’m not actually using the thread synchronization behavior that the object offers. However, I am taking advantage of some thread synchronization behavior that the kernel offers when creating any kind of kernel object. Let me explain how the preceding code works. Let’s say that two instances of this process are started at exactly the same time. Each process will have its own thread, and both threads will attempt to create a Semaphore with the same string name (“SomeUniqueStringIdentifyingMyApp,” in my example). The Windows kernel ensures that only one thread actually creates a kernel object with the specified name; the thread that created the object will have its createdNew variable set to true.

For the second thread, Windows will see that a kernel object with the specified name already exists; the second thread does not get to create another kernel object with the same name, although if this thread continues to run, it can access the same kernel object as the first process’s thread. This is how threads in different processes can communicate with each other via a single kernel object. However, in this example, the second process’s thread sees that its createdNew variable is set to false. This thread now knows that another instance of this process is running, and the second instance of the process exits immediately.

# Event Constructs

Events are simply Boolean variables maintained by the kernel. A thread waiting on an event blocks when the event is false and unblocks when the event is true. There are two kinds of events. When an auto-reset event is true, it wakes up just one blocked thread, because the kernel automatically resets the event back to false after unblocking the first thread. When a manual-reset event is true, it unblocks all threads waiting for it because the kernel does not automatically reset the event back to false; your code must manually reset the event back to false. The classes related to events look like this.

public class EventWaitHandle : WaitHandle {
 public Boolean Set(); // Sets Boolean to true; always returns true
 public Boolean Reset(); // Sets Boolean to false; always returns true
} 
public sealed class AutoResetEvent : EventWaitHandle {
 public AutoResetEvent(Boolean initialState);
}
public sealed class ManualResetEvent : EventWaitHandle {
 public ManualResetEvent(Boolean initialState);
}
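
To see the difference between the two kinds of events, consider this little sketch of my own: three threads wait on a ManualResetEvent, and setting it releases all three; had the event been an AutoResetEvent, the same Set call would have released exactly one.

using System;
using System.Threading;

public static class EventKindsDemo {
 public static void Main() {
  ManualResetEvent mre = new ManualResetEvent(false); // false: waiters will block
  for (Int32 n = 0; n < 3; n++) {
   new Thread(() => {
    mre.WaitOne(); // Blocks until the event becomes true
    Console.WriteLine("Thread {0} unblocked", Thread.CurrentThread.ManagedThreadId);
   }).Start();
  }
  Thread.Sleep(1000); // Give the three threads a chance to block (crude, but fine for a demo)
  mre.Set(); // All three threads unblock; the event remains true until Reset is called
  // Had mre been an AutoResetEvent, this Set would have unblocked exactly one thread
 }
}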

Using an auto-reset event, we can easily create a thread synchronization lock whose behavior is similar to the SimpleSpinLock class I showed earlier.

internal sealed class SimpleWaitLock : IDisposable {
 private readonly AutoResetEvent m_available;
 public SimpleWaitLock() {
 m_available = new AutoResetEvent(true); // Initially free
 } 
 public void Enter() {
 // Block in kernel until resource available
 m_available.WaitOne();
 }
 public void Leave() { 
 // Let another thread access the resource
 m_available.Set();
 }
 public void Dispose() { m_available.Dispose(); }
}

You would use this SimpleWaitLock exactly the same way that you'd use the SimpleSpinLock. In fact, the external behavior is exactly the same; however, the performance of the two locks is radically different. When there is no contention on the lock, the SimpleWaitLock is much slower than the SimpleSpinLock, because every call to SimpleWaitLock's Enter and Leave methods forces the calling thread to transition from managed code to the kernel and back, which is bad. But when there is contention, the losing thread is blocked by the kernel and is not spinning and wasting CPU cycles, which is good. Note also that constructing the AutoResetEvent object and calling Dispose on it cause managed-to-kernel transitions as well, negatively affecting performance. But these calls usually happen rarely, so they are not something to be too concerned about.

To give you a better feel for the performance differences, I wrote the following code.

public static void Main() {
 Int32 x = 0;
 const Int32 iterations = 10000000; // 10 million
 // How long does it take to increment x 10 million times?
 Stopwatch sw = Stopwatch.StartNew();
 for (Int32 i = 0; i < iterations; i++) {
 x++;
 }
 Console.WriteLine("Incrementing x: {0:N0}", sw.ElapsedMilliseconds);
 // How long does it take to increment x 10 million times 
 // adding the overhead of calling a method that does nothing?
 sw.Restart();
 for (Int32 i = 0; i < iterations; i++) {
 M(); x++; M();
 }
 Console.WriteLine("Incrementing x in M: {0:N0}", sw.ElapsedMilliseconds);
 // How long does it take to increment x 10 million times 
 // adding the overhead of calling an uncontended SpinLock?
 SpinLock sl = new SpinLock(false);
 sw.Restart();
 for (Int32 i = 0; i < iterations; i++) {
 Boolean taken = false; sl.Enter(ref taken); x++; sl.Exit();
 }
 Console.WriteLine("Incrementing x in SpinLock: {0:N0}", sw.ElapsedMilliseconds);
 // How long does it take to increment x 10 million times 
 // adding the overhead of calling an uncontended SimpleWaitLock?
 using (SimpleWaitLock swl = new SimpleWaitLock()) {
 sw.Restart();
 for (Int32 i = 0; i < iterations; i++) {
 swl.Enter(); x++; swl.Leave();
 }
 Console.WriteLine("Incrementing x in SimpleWaitLock: {0:N0}", sw.ElapsedMilliseconds);
 }
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static void M() { /* This method does nothing but return */ }

When I run the preceding code, I get the following output.

Incrementing x: 8                        Fastest
Incrementing x in M: 69                  ~9x slower
Incrementing x in SpinLock: 164          ~21x slower
Incrementing x in SimpleWaitLock: 8,854  ~1,107x slower

As you can clearly see, just incrementing x took only 8 milliseconds. To call empty methods before and after incrementing x made the operation take nine times longer! Then, executing code in a method that uses a user-mode construct caused the code to run 21 (164 / 8) times slower. But now, see how much slower the program ran using a kernel-mode construct: 1,107 (8,854 / 8) times slower! So, if you can avoid thread synchronization, you should. If you need thread synchronization, then try to use the user-mode constructs. Always try to avoid the kernel-mode constructs.

# Semaphore Constructs

Semaphores are simply Int32 variables maintained by the kernel. A thread waiting on a semaphore blocks when the semaphore is 0 and unblocks when the semaphore is greater than 0. When a thread waiting on a semaphore unblocks, the kernel automatically subtracts 1 from the semaphore’s count. Semaphores also have a maximum Int32 value associated with them, and the current count is never allowed to go over the maximum count. Here is what the Semaphore class looks like.

public sealed class Semaphore : WaitHandle {
 public Semaphore(Int32 initialCount, Int32 maximumCount);
 public Int32 Release(); // Calls Release(1); returns previous count
 public Int32 Release(Int32 releaseCount); // Returns previous count
}

So now let me summarize how these three kernel-mode primitives behave:

  • When multiple threads are waiting on an auto-reset event, setting the event causes only one thread to become unblocked.

  • When multiple threads are waiting on a manual-reset event, setting the event causes all threads to become unblocked.

  • When multiple threads are waiting on a semaphore, releasing the semaphore causes releaseCount threads to become unblocked (where releaseCount is the argument passed to Semaphore’s Release method).

Therefore, an auto-reset event behaves very similarly to a semaphore whose maximum count is 1. The difference between the two is that Set can be called multiple times consecutively on an auto-reset event, and still only one thread will be unblocked, whereas calling Release multiple times consecutively on a semaphore keeps incrementing its internal count, which could unblock many threads. By the way, if you call Release on a semaphore too many times, causing its count to exceed its maximum count, then Release will throw a SemaphoreFullException.
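
Here's a small sketch of my own demonstrating the count behavior just described.

using System;
using System.Threading;

public static class SemaphoreCountDemo {
 public static void Main() {
  Semaphore s = new Semaphore(0, 2); // Current count=0, maximum count=2
  Console.WriteLine(s.Release());    // Count: 0 -> 1; prints the previous count (0)
  Console.WriteLine(s.Release(1));   // Count: 1 -> 2; prints the previous count (1)
  try {
   s.Release(); // The count would exceed the maximum of 2
  }
  catch (SemaphoreFullException) {
   Console.WriteLine("A semaphore's count can never exceed its maximum");
  }
 }
}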

Using a semaphore, we can re-implement the SimpleWaitLock as follows, so that it gives multiple threads concurrent access to a resource (which is not necessarily a safe thing to do unless all threads access the resource in a read-only fashion).

public sealed class SimpleWaitLock : IDisposable {
 private readonly Semaphore m_available;
 public SimpleWaitLock(Int32 maxConcurrent) {
 m_available = new Semaphore(maxConcurrent, maxConcurrent);
 }
 public void Enter() {
 // Block in kernel until resource available
 m_available.WaitOne();
 }
 public void Leave() {
 // Let another thread access the resource
 m_available.Release(1);
 }
 public void Dispose() { m_available.Close(); }
}
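
For example, a hypothetical caller could allow at most two threads at a time into a region like this.

using System;
using System.Threading;

public static class SimpleWaitLockDemo {
 public static void Main() {
  using (SimpleWaitLock swl = new SimpleWaitLock(2)) {
   Thread[] threads = new Thread[4];
   for (Int32 n = 0; n < 4; n++) {
    threads[n] = new Thread(() => {
     swl.Enter(); // At most two threads get in; the others block in the kernel
     Console.WriteLine("Thread {0} is accessing the resource", Thread.CurrentThread.ManagedThreadId);
     Thread.Sleep(2000); // Simulate read-only access to the resource
     swl.Leave(); // Let one blocked thread (if any) in
    });
    threads[n].Start();
   }
   foreach (Thread t in threads) t.Join(); // Don't Dispose the lock until all threads are done
  }
 }
}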

# Mutex Constructs

A Mutex represents a mutual-exclusive lock. It works similarly to an AutoResetEvent or a Semaphore with a count of 1 because all three constructs release only one waiting thread at a time. The following shows what the Mutex class looks like.

public sealed class Mutex : WaitHandle {
 public Mutex();
 public void ReleaseMutex();
}

Mutexes have some additional logic in them, which makes them more complex than the other constructs. First, a Mutex object records which thread obtained it by querying the calling thread's Int32 ID. When a thread calls ReleaseMutex, the Mutex makes sure that the calling thread is the same thread that obtained the Mutex. If the calling thread is not the thread that obtained the Mutex, then the Mutex object's state is unaltered and ReleaseMutex throws a System.ApplicationException. Also, if a thread owning a Mutex terminates for any reason, then some thread waiting on the Mutex will be awakened by having a System.Threading.AbandonedMutexException thrown. Usually, this exception will go unhandled, terminating the whole process. This is good because a thread acquired the Mutex and it is likely that the thread terminated before it finished updating the data that the Mutex was protecting. If a thread catches AbandonedMutexException, then it could attempt to access the corrupt data, leading to unpredictable results and security problems.
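
Here's a small sketch of my own that deliberately forces the abandoned-Mutex situation so you can see the exception.

using System;
using System.Threading;

public static class AbandonedMutexDemo {
 public static void Main() {
  Mutex m = new Mutex();
  // This thread acquires the Mutex and then terminates WITHOUT calling ReleaseMutex
  Thread t = new Thread(() => m.WaitOne());
  t.Start();
  t.Join(); // Wait for the thread to die while still owning the Mutex
  try {
   m.WaitOne(); // The wait completes by throwing AbandonedMutexException
  }
  catch (AbandonedMutexException) {
   Console.WriteLine("The owning thread terminated without releasing the Mutex");
  }
 }
}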

Second, Mutex objects maintain a recursion count indicating how many times the owning thread owns the Mutex. If a thread currently owns a Mutex and then that thread waits on the Mutex again, the recursion count is incremented and the thread is allowed to continue running. When that thread calls ReleaseMutex, the recursion count is decremented. Only when the recursion count becomes 0 can another thread become the owner of the Mutex.

Most people do not like this additional logic. The problem is that these “features” have a cost associated with them. The Mutex object needs more memory to hold the additional thread ID and recursion count information. And, more importantly, the Mutex code has to maintain this information, which makes the lock slower. If an application needs or wants these additional features, then the application code could have done this itself; the code doesn’t have to be built into the Mutex object. For this reason, a lot of people avoid using Mutex objects.

Usually a recursive lock is needed when a method takes a lock and then calls another method that also requires the lock, as the following code demonstrates.

internal class SomeClass : IDisposable {
 private readonly Mutex m_lock = new Mutex();
 public void Method1() {
 m_lock.WaitOne();
 // Do whatever...
 Method2(); // Method2 recursively acquires the lock
 m_lock.ReleaseMutex();
 }
 public void Method2() {
 m_lock.WaitOne();
 // Do whatever...
 m_lock.ReleaseMutex();
 }
 public void Dispose() { m_lock.Dispose(); }
}

In the preceding code, code that uses a SomeClass object could call Method1, which acquires the Mutex, performs some thread-safe operation, and then calls Method2, which also performs some thread-safe operation. Because Mutex objects support recursion, the thread will acquire the lock twice and then release it twice before another thread can own the Mutex. If SomeClass had used an AutoResetEvent instead of a Mutex, then the thread would block when it called Method2's WaitOne method.

If you need a recursive lock, then you could create one easily by using an AutoResetEvent.

internal sealed class RecursiveAutoResetEvent : IDisposable {
 private AutoResetEvent m_lock = new AutoResetEvent(true);
 private Int32 m_owningThreadId = 0;
 private Int32 m_recursionCount = 0;
 public void Enter() {
 // Obtain the calling thread's unique Int32 ID
 Int32 currentThreadId = Thread.CurrentThread.ManagedThreadId;
 // If the calling thread owns the lock, increment the recursion count
 if (m_owningThreadId == currentThreadId) {
 m_recursionCount++; 
 return;
 }
 // The calling thread doesn't own the lock, wait for it
 m_lock.WaitOne();
 // The calling thread now owns the lock; initialize the owning thread ID & recursion count
 m_owningThreadId = currentThreadId;
 m_recursionCount = 1;
 }
 public void Leave() {
 // If the calling thread doesn't own the lock, we have an error
 if (m_owningThreadId != Thread.CurrentThread.ManagedThreadId) 
 throw new InvalidOperationException();
 // Subtract 1 from the recursion count
 if (--m_recursionCount == 0) {
 // If the recursion count is 0, then no thread owns the lock
 m_owningThreadId = 0; 
 m_lock.Set(); // Wake up 1 waiting thread (if any)
 }
 }
 public void Dispose() { m_lock.Dispose(); }
}

Although the behavior of the RecursiveAutoResetEvent class is identical to that of the Mutex class, a RecursiveAutoResetEvent object will have far superior performance when a thread tries to acquire the lock recursively, because all the code that is required to track thread ownership and recursion is now in managed code. A thread has to transition into the Windows kernel only when first acquiring the AutoResetEvent or when finally relinquishing it to another thread.
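
For instance, the earlier SomeClass example could hypothetically be reworked to use RecursiveAutoResetEvent like this; only the first Enter and the final Leave touch the kernel.

internal class SomeClass : IDisposable {
 private readonly RecursiveAutoResetEvent m_lock = new RecursiveAutoResetEvent();
 public void Method1() {
  m_lock.Enter(); // First acquisition: transitions into the kernel
  // Do whatever...
  Method2();      // Recursive acquisition: handled entirely in managed code
  m_lock.Leave();
 }
 public void Method2() {
  m_lock.Enter(); // Same thread: just increments the recursion count
  // Do whatever...
  m_lock.Leave(); // Decrements the count; the event is Set only when the count hits 0
 }
 public void Dispose() { m_lock.Dispose(); }
}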

💡 Summary: Windows offers several kernel-mode constructs for synchronizing threads. Kernel-mode constructs are much slower than user-mode constructs, partly because they require cooperation from the Windows operating system itself, and partly because each method call on a kernel object causes the calling thread to transition from managed code to native user-mode code to native kernel-mode code, and then return all the way back. These transitions require a lot of CPU time and, if performed frequently, can adversely affect the overall performance of an application. But kernel-mode constructs offer advantages that the primitive user-mode constructs do not:

1. When a kernel-mode construct detects contention on a resource, Windows blocks the losing thread so that it does not occupy a CPU "spinning," needlessly wasting processor resources.
2. Kernel-mode constructs can synchronize native and managed threads with each other.
3. Kernel-mode constructs can synchronize threads running in different processes on the same machine.
4. Kernel-mode constructs can have security applied to them to prevent unauthorized accounts from accessing them.
5. A thread can block until all kernel-mode constructs in a set are available, or until any one construct in the set becomes available.
6. A thread blocking on a kernel-mode construct can specify a timeout value; if the thread can't get access to the resource it wants in the specified amount of time, it is unblocked and can perform other tasks.

Events and semaphores are the two primitive kernel-mode thread synchronization constructs; other kernel-mode constructs, such as the mutex, are built on top of these two primitives. The System.Threading namespace offers an abstract base class called WaitHandle, a simple class whose sole purpose is to wrap a Windows kernel object handle. The FCL provides several classes derived from WaitHandle, all defined in the System.Threading namespace. Internally, the WaitHandle base class has a SafeWaitHandle field that holds a Win32 kernel object handle; this field is initialized when a concrete WaitHandle-derived class is constructed. In addition, the WaitHandle class exposes methods that are inherited by all the derived classes, and every method called on a kernel-mode construct represents a full memory fence. A few things to note about these methods:

1. You call WaitHandle's WaitOne method to have the calling thread wait for the underlying kernel object to become signaled. Internally, this method calls the Win32 WaitForSingleObjectEx function; the returned Boolean is true if the object became signaled, or false if a timeout occurs.
2. You call WaitHandle's static WaitAll method to have the calling thread wait for all the kernel objects specified in the WaitHandle[] to become signaled; it returns true if all of the objects became signaled, or false on timeout. Internally, it calls the Win32 WaitForMultipleObjectsEx function, passing TRUE for the bWaitAll parameter.
3. You call WaitHandle's static WaitAny method to have the calling thread wait for any one of the kernel objects specified in the WaitHandle[] to become signaled. The returned Int32 is the index of the array element corresponding to the kernel object that became signaled, or WaitHandle.WaitTimeout if no object became signaled while waiting. Internally, it calls the Win32 WaitForMultipleObjectsEx function, passing FALSE for the bWaitAll parameter.
4. The array passed to the WaitAny and WaitAll methods must contain no more than 64 elements, or the methods throw a System.NotSupportedException.
5. You call WaitHandle's Dispose method to close the underlying kernel object handle; internally, it calls the Win32 CloseHandle function. You should call Dispose explicitly only if you are certain that no other thread is using the kernel object; this places a huge burden on you as you write and test your code, so I strongly discourage calling Dispose explicitly. Instead, let the garbage collector (GC) do the cleanup; the GC knows when no thread is using the object anymore and cleans it up automatically.

The versions of WaitOne and WaitAll that do not accept a timeout parameter ought to return void rather than Boolean, because the implied timeout is infinite (System.Threading.Timeout.Infinite) and they can return only true. The AutoResetEvent, ManualResetEvent, Semaphore, and Mutex classes all derive from WaitHandle, so they inherit WaitHandle's methods and behavior. First, the constructors for all of these classes internally call the Win32 CreateEvent (passing FALSE or TRUE for the bManualReset parameter), CreateSemaphore, or CreateMutex functions; the handle value returned from these calls is saved in the private SafeWaitHandle field defined inside the WaitHandle base class. Second, the EventWaitHandle, Semaphore, and Mutex classes all offer static OpenExisting methods, which internally call the Win32 OpenEvent, OpenSemaphore, or OpenMutex functions, passing a String argument that identifies an existing named kernel object; the returned handle value is saved in a newly constructed object returned from the OpenExisting method. If no kernel object exists with the specified name, a WaitHandleCannotBeOpenedException is thrown.

An event is really just a Boolean variable maintained by the kernel. A thread waiting on an event blocks when the event is false and unblocks when it is true. There are two kinds of events: auto-reset and manual-reset. When an auto-reset event is true, it wakes up just one blocked thread, because the kernel automatically resets the event back to false after unblocking the first thread, leaving the remaining threads blocked. When a manual-reset event is true, it unblocks all threads waiting for it, because the kernel does not automatically reset the event back to false; your code must manually reset the event back to false. An auto-reset event makes it easy to create a thread synchronization lock (SimpleWaitLock), which is used in exactly the same way as SimpleSpinLock. In fact, the external behavior is identical; however, the performance of the two locks is radically different. When there is no contention on the lock, SimpleWaitLock is much slower than SimpleSpinLock, because every call to SimpleWaitLock's Enter and Leave methods forces the calling thread to transition from managed code to the kernel and back, which is bad. But when there is contention, the losing thread is blocked by the kernel and is not spinning and wasting CPU time, which is good. Note also that constructing the AutoResetEvent object and calling Dispose on it cause managed-to-kernel transitions, too, negatively affecting performance; but because these calls happen rarely, they are generally not something to worry about. Avoid thread synchronization if you can; if you must have it, prefer the user-mode constructs, and avoid the kernel-mode constructs whenever possible.

A semaphore is really just an Int32 variable maintained by the kernel. A thread waiting on a semaphore blocks when the semaphore is 0 and unblocks when it is greater than 0; when a thread waiting on a semaphore unblocks, the kernel automatically subtracts 1 from the semaphore's count. A semaphore also has a maximum Int32 value associated with it, and the current count is never allowed to exceed the maximum count. An auto-reset event behaves very much like a semaphore whose maximum count is 1. The difference between the two is that Set can be called multiple times consecutively on an auto-reset event and still only one thread will be unblocked, whereas calling Release multiple times consecutively on a semaphore keeps incrementing its internal count, which could unblock many threads. Incidentally, calling Release on a semaphore too many times, so that its count would exceed its maximum, causes Release to throw a SemaphoreFullException.

A mutex represents a mutual-exclusive lock. It works like an AutoResetEvent or a Semaphore with a count of 1: all three release only one waiting thread at a time. The mutex has some additional logic in it, which makes it more complex than the other constructs. First, a Mutex object records which thread obtained it by querying the calling thread's Int32 ID. When a thread calls ReleaseMutex, the Mutex makes sure that the calling thread is the same thread that obtained it; if it is not, the Mutex object's state is unaltered and ReleaseMutex throws a System.ApplicationException. Also, if a thread owning a Mutex terminates for any reason, some thread waiting on the Mutex is awakened by having a System.Threading.AbandonedMutexException thrown. Usually this exception goes unhandled, terminating the whole process; this is good, because the thread acquired the Mutex and probably terminated before it finished updating the data the Mutex was protecting. If another thread were to catch AbandonedMutexException, it might attempt to access the corrupted data, leading to unpredictable results and security problems. Second, a Mutex object maintains a recursion count indicating how many times the owning thread owns it. If a thread currently owns a Mutex and then waits on that Mutex again, the count is incremented and the thread is allowed to continue running. When the thread calls ReleaseMutex, the count is decremented, and only when the count reaches 0 can another thread become the owner of the Mutex. Typically, a recursive lock is needed when a method takes a lock and then calls another method that also requires the lock. Because Mutex objects support recursion, the thread acquires the lock twice and must release it twice before another thread can own the Mutex.