A discussion of the memory barrier documentation draft posted on LKML [classic]

Posted: 2007-04-02 21:26

xiaozhaoz 2006-3-9 03:19

Let's discuss this example from the draft, and why it is written the way it is:
+ (2) Multiprocessor interaction
+
+ When there's a system with more than one processor, these may be working
+ on the same set of data, but attempting not to use locks as locks are
+ quite expensive. This means that accesses that affect both CPUs may have
+ to be carefully ordered to prevent error.
+
+ Consider the R/W semaphore slow path. In that, a waiting process is
+ queued on the semaphore, as noted by it having a record on its stack
+ linked to the semaphore's list:
+
+ struct rw_semaphore {
+ ...
+ struct list_head waiters;
+ };
+
+ struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ };
+
+ To wake up the waiter, the up_read() or up_write() functions have to read
+ the pointer from this record to know as to where the next waiter record
+ is, clear the task pointer, call wake_up_process() on the task, and
+ release the task struct reference held:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+ If any of these steps occur out of order, then the whole thing may fail.
+
+ Note that the waiter does not get the semaphore lock again - it just waits
+ for its task pointer to be cleared. Since the record is on its stack, this
+ means that if the task pointer is cleared _before_ the next pointer in the
+ list is read, then another CPU might start processing the waiter and it
+ might clobber its stack before up*() functions have a chance to read the
+ next pointer.
+
+ CPU 0 CPU 1
+ =============================== ===============================
+ down_xxx()
+ Queue waiter
+ Sleep
+ up_yyy()
+ READ waiter->task;
+ WRITE waiter->task;
+ <preempt>
+ Resume processing
+ down_xxx() returns
+ call foo()
+ foo() clobbers *waiter
+ </preempt>
+ READ waiter->list.next;
+ --- OOPS ---
+
+ This could be dealt with using a spinlock, but then the down_xxx()
+ function has to get the spinlock again after it's been woken up, which is
+ a waste of resources.
+
+ The way to deal with this is to insert an SMP memory barrier:
+
+ READ waiter->list.next;
+ READ waiter->task;
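+ smp_mb(); // Why is smp_mb() needed here? I think mb() should be enough: as long as the reads are guaranteed to come before the write and the wakeup, the sleeping task cannot be woken early and cause out-of-order accesses.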
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+ In this case, the barrier makes a guarantee that all memory accesses
+ before the barrier will happen before all the memory accesses after the
+ barrier. It does _not_ guarantee that all memory accesses before the
+ barrier will be complete by the time the barrier is complete
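
The ordering that the draft asks for can be sketched in portable C11. This is only an illustration under stated assumptions, not the kernel's code: `struct waiter` and `wake_one()` are hypothetical names, and `atomic_thread_fence(memory_order_seq_cst)` is used as a stand-in for smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rwsem_waiter; not the kernel's definition. */
struct waiter {
    struct waiter *next;      /* plays the role of waiter->list.next */
    _Atomic(void *) task;     /* the sleeper waits until this becomes NULL */
};

/*
 * Wake one waiter and return the next record in the list.
 * The fence stands in for smp_mb(): both reads of *w are ordered
 * before the store that allows the sleeper to reuse its stack.
 */
struct waiter *wake_one(struct waiter *w, void **woken_task)
{
    /* READ waiter->list.next */
    struct waiter *next = w->next;
    /* READ waiter->task */
    void *task = atomic_load_explicit(&w->task, memory_order_relaxed);

    /* stand-in for smp_mb() */
    atomic_thread_fence(memory_order_seq_cst);

    /* WRITE waiter->task */
    atomic_store_explicit(&w->task, NULL, memory_order_relaxed);

    /* From here on *w may be clobbered by the woken task; do not touch it. */
    *woken_task = task;       /* what CALL wakeup would operate on */
    return next;
}
```

The point of the sketch is that `next` is captured before the store to `task`, so the caller never has to dereference `w` again once the sleeper has been released.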

albcamus 2006-3-9 05:37

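>+ smp_mb(); // Why is smp_mb() needed here? I think mb() should be enough: as long as the reads are guaranteed to come before the write and the wakeup, the sleeping task cannot be woken early and cause out-of-order accesses.

Looking at the definition of smp_mb(), mb() seems to be the stricter of the two: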
include/asm-i386/system.h

#ifdef CONFIG_SMP
#define smp_mb()        mb()
#define smp_rmb()        rmb()
#define smp_wmb()        wmb()
#define smp_read_barrier_depends()        read_barrier_depends()
#define set_mb(var, value) do { xchg(&var, value); } while (0)
#else
#define smp_mb()        barrier()
#define smp_rmb()        barrier()
#define smp_wmb()        barrier()
#define smp_read_barrier_depends()        do { } while(0)
#define set_mb(var, value) do { var = value; barrier(); } while (0)
#endif

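What I don't understand is how mb() differs from barrier(). The former is certainly stricter than the latter, but is the barrier() macro of any use? It is defined as:
#define barrier() __asm__ __volatile__("": : :"memory")

All it does is tell GCC, through the clobber list of the inline asm, that the statement modifies memory. What does GCC actually do with that information?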

xiaozhaoz 2006-3-9 06:59

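My understanding of barrier() is that it is only a compiler barrier; adding it to the code causes cache invalidation.

mb(), on the other hand, is a hardware barrier: at run time it prevents the CPU from reordering cache accesses.

What I want to know is how smp_mb() and mb() differ on SMP. Judging from the names, it seems
smp_mb() should keep the caches of multiple CPUs consistent, while mb() only guarantees consistent cache access on the local CPU.
But in practice that does not appear to be the case.

Of course, on a CPU such as x86, smp_mb() == mb(), but the memory barrier functions are meant to be generic across all CPUs, so other CPUs may behave differently.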

albcamus 2006-3-9 07:37

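Originally posted by xiaozhaoz at 2006-3-9 14:59:
My understanding of barrier() is that it is only a compiler barrier; adding it to the code causes cache invalidation.

mb(), on the other hand, is a hardware barrier: at run time it prevents the CPU from reordering cache accesses.

What I want to know is ...

>My understanding of barrier() is that it is only a compiler barrier; adding it to the code causes cache invalidation.
>mb(), on the other hand, is a hardware barrier: at run time it prevents the CPU from reordering cache accesses.
Thank you very much! Having read the code and the comments, I think that unless it is understood this way, some things really cannot be explained.

>It seems smp_mb() should keep the caches of multiple CPUs consistent, while mb() only guarantees consistent cache access on the local CPU.
But the article you pointed me to says: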
A given CPU always perceives its own memory operations as occurring in program order. That is, memory-reordering issues arise only when a CPU is observing other CPUs' memory operations.
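So it seems that when only one agent accesses memory, a barrier is never needed. Only when two or more agents (CPUs, DMA controllers) access memory, and one of them observes the other, is a barrier required.

My own understanding is that when one CPU executes an instruction with the lock; prefix (or an instruction such as xchg), the other CPUs are made to take some action to synchronize their caches. A friend on CLF said that the CPU's LOCK# pin is wired to the LOCK# pin of the northbridge, so before an instruction with the lock; prefix executes, the northbridge asserts LOCK# and locks the bus, releasing it only once the instruction has completed. Does locking the bus automatically invalidate the caches of all CPUs? If so, then mb() can also guarantee that the caches of all CPUs are consistent.

Open to further discussion :)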

xiaozhaoz 2006-3-9 09:04

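Originally posted by albcamus at 2006-3-9 15:37:
My own understanding is that when one CPU executes an instruction with the lock; prefix (or a serializing instruction such as xchg), the other CPUs are made to take some action to synchronize their caches. A friend on CLF said that the CPU's LOCK# pin is wired to the LOCK# pin of the northbridge, so before an instruction with the lock; prefix executes, the northbridge asserts LOCK# and locks the bus, releasing it only once the instruction has completed. Does locking the bus automatically invalidate the caches of all CPUs? If so, then mb() can also guarantee that the caches of all CPUs are consistent.

You are right.
Checking the manual, I found the following: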

LOCK ── Assert LOCK# Signal Prefix
Opcode Instruction Clocks Description
F0 LOCK 0 Assert LOCK# signal for the next instruction

Description

The LOCK prefix causes the LOCK# signal of the 80386 to be asserted
during execution of the instruction that follows it. In a multiprocessor
environment, this signal can be used to ensure that the 80386 has
exclusive use of any shared memory while LOCK# is asserted. The
read-modify-write sequence typically used to implement test-and-set on the
80386 is the BTS instruction.

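So lock gives one CPU exclusive use of shared memory (main memory??), but it does not invalidate the cache.

Consistency between the cache and memory is determined by the cache's write policy, write-through or write-back. That is why smp_rmb() costs less than smp_mb().

As for how the caches of multiple CPUs are kept consistent with one another, I don't understand that yet; please enlighten me. :D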

Originally posted by xiaozhaoz at 2006-3-9 17:04:

You are right.
Checking the manual, I found the following:

LOCK ── Assert LOCK# Signal Prefix
Opcode Instruction Clocks Description
F0 LOCK 0 Assert LOCK# signal for the next instruction

Description
...

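Please don't be so polite; I'm a newbie who hasn't even finished reading ULK2.

You said:
>lock gives one CPU exclusive use of shared memory (main memory??), but it does not invalidate the cache.
I think lock (or cpuid, xchg and so on) causes the local CPU's cache to be written out to memory in full, and that this in turn invalidates the caches of the other CPUs. IA32 implements snooping (bus watching) inside each CPU: it monitors the bus for writes to memory (by some CPU or by a DMA controller) and, whenever one happens, invalidates the corresponding cache line. So as long as a lock instruction makes the local CPU write to memory, it necessarily invalidates the relevant cache lines of all CPUs.

There may be two exceptions: 1. with a write-through policy the cache coherence problem does not arise at all (Linux uses a write-back policy for all memory); 2. the TLB is also a cache, but its consistency (at least on IA32) cannot be handled by snooping; instead, the INVALIDATE_TLB_VECTOR inter-processor interrupt has to be sent to the other CPUs.

Please correct me where I am wrong :)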

Solaris12 2006-3-9 10:14

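Originally posted by albcamus at 2006-3-9 15:37:

>My understanding of barrier() is that it is only a compiler barrier; adding it to the code causes cache invalidation.
>mb(), on the other hand, is a hardware barrier: at run time it prevents the CPU from reordering cache accesses.
Thank you very ...

I happen to have been reading about the Solaris locking mechanisms these past few days, and they touch on a similar question. If you compare the x86 and sparc locks, you will find that mutex_enter on x86 uses only lock, with no explicit barriers, whereas sparc is different. A look at the manual shows that lock plays a double role here: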

AMD64 Architecture Programmer's Manual, Volume 2, System Programming.

"Read/write barrier instructions
force all prior reads or writes to complete before
subsequent reads or writes are executed....
...
Serializing instructions, I/O instructions, and locked
instructions can also be used as read/write barriers." - Page 198, 199

"Locked Instructions - Before completing a locked instruction
(an instruction executed using the LOCK prefix), all
previous reads and writes must be written to memory, and
the locked instruction must complete before completing
subsequent writes." - Page 206

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/intel/ia32/ml/lock_prim.s

ENTRY_NP(mutex_enter)
        movq    %gs:CPU_THREAD, %rdx    /* rdx = thread ptr */
        xorl    %eax, %eax              /* rax = 0 (unheld adaptive) */
        lock                            /* the lock prefix also acts as a barrier here */
        cmpxchgq %rdx, (%rdi)           /* acquire the lock */
        jnz     mutex_vector_enter
.mutex_enter_lockstat_patch_point:
        ret

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/sparc/v9/ml/lock_prim.s

ENTRY(mutex_enter)
        mov     THREAD_REG, %o1
        casx    [%o0], %g0, %o1         ! try to acquire as adaptive (this acquires the lock)
        brnz,pn %o1, 1f                 ! locked or wrong type
        membar  #LoadLoad               ! explicit memory barrier instruction
.mutex_enter_lockstat_patch_point:
        retl
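
Both listings follow the same pattern: an atomic compare-and-swap takes the lock, combined with whatever barrier the architecture needs. A rough C11 sketch of that fast path is given below. It is not the Solaris implementation; `struct sketch_mutex`, `sketch_mutex_tryenter()` and `sketch_mutex_exit()` are hypothetical names, and the memory orderings are one reasonable choice rather than a transcription of the assembly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical adaptive-mutex word: 0 means unheld, otherwise the owner id. */
struct sketch_mutex {
    atomic_uintptr_t owner;
};

/* Fast path only: returns true if the lock was taken, false if it was held. */
bool sketch_mutex_tryenter(struct sketch_mutex *m, uintptr_t self)
{
    uintptr_t expected = 0;   /* like xorl %eax,%eax: expect "unheld" */

    /*
     * Acquire ordering on success keeps the critical section from being
     * reordered before the lock acquisition; the compiler picks the
     * instructions (and any barrier) that the target needs for that.
     */
    return atomic_compare_exchange_strong_explicit(
        &m->owner, &expected, self,
        memory_order_acquire, memory_order_relaxed);
}

/* Release ordering makes the critical section visible before the unlock. */
void sketch_mutex_exit(struct sketch_mutex *m)
{
    atomic_store_explicit(&m->owner, 0, memory_order_release);
}
```

On x86 a compiler will typically turn the compare-exchange into a lock-prefixed cmpxchg with no separate fence, which matches the observation that lock does double duty there.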

Solaris12 2006-3-10 03:06

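Originally posted by albcamus at 2006-3-9 15:37:
But the article you pointed me to says:
A given CPU always perceives its own memory operations as occurring in program order. That is, memory-reordering issues arise only when a CPU is observing other CPUs' memory operations.
So it seems that when only one agent accesses memory, a barrier is never needed. Only when two or more agents (CPUs, DMA controllers) access memory, and one of them observes the other, is a barrier required.

This inference is correct. To keep the pipeline efficient the CPU does reorder memory reads, but always on the basis of analysing the dependencies between instructions, so even with reordering, the accesses to any one location still execute in order.

Therefore, on a single-CPU system the programmer can still assume that the CPU executes instructions in the order given by the program. Correctness is guaranteed entirely by the CPU itself.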

albcamus 2006-3-10 03:27

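Originally posted by Solaris12 at 2006-3-10 11:06:

This inference is correct. To keep the pipeline efficient the CPU does reorder memory reads, but always on the basis of analysing the dependencies between instructions, so even with reordering, the accesses to any one location still execute in order.

Therefore, on a single-CPU system the programmer can still ...

Yes. I went back yesterday and looked at the IA32 manual and at a book, "Modern Processor Design", and they confirm this.

One more point, about snooping and cache coherence on SMP:

Every IA32 CPU has to implement the MESI protocol (M: Modified; E: Exclusive; S: Shared; I: Invalid)

The CPU's bus-monitoring unit constantly watches all memory writes on the bus so that it can adjust the state of its own cache. Some material I found online: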

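MESI is short for the four functions "modified, exclusive, shared, invalid"; every cache module must carry out these four separate functions according to the MESI protocol.

  ● Modified: if a region of memory is held in the cache of only one CPU, that CPU may modify the data without notifying the other CPUs.

  ● Exclusive: at any one time only one CPU may modify or update a given region of memory.

  ● Shared: if a region of memory is held in the caches of several CPUs, a CPU that modifies the data must notify the other CPUs.

  ● Invalid: once a CPU's access to cached data fails, the data must be read from memory again.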
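
In the usual textbook presentation, M, E, S and I are four states that a single cache line can be in, and bus snooping drives the transitions between them. The sketch below models those transitions for one line in one cache. It is a simplified teaching model under stated assumptions, not a description of any particular IA32 implementation; the names `mesi_state`, `mesi_event` and `mesi_next()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum mesi_state { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

/* LOCAL_* come from this CPU; REMOTE_* are accesses snooped on the bus. */
enum mesi_event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE };

/*
 * Next state of one cache line. others_have_copy only matters when an
 * invalid line is filled by a local read.
 */
enum mesi_state mesi_next(enum mesi_state s, enum mesi_event e,
                          bool others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == MESI_INVALID)
            return others_have_copy ? MESI_SHARED : MESI_EXCLUSIVE;
        return s;                 /* a hit leaves the state unchanged */
    case LOCAL_WRITE:
        return MESI_MODIFIED;     /* other copies, if any, get invalidated */
    case REMOTE_READ:
        if (s == MESI_INVALID)
            return MESI_INVALID;
        return MESI_SHARED;       /* a Modified line is written back first */
    case REMOTE_WRITE:
        return MESI_INVALID;      /* another CPU takes ownership */
    }
    return s;
}
```

For example, a line that this CPU holds as Modified drops to Shared when another CPU reads it, and to Invalid when another CPU writes it.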

　　

Solaris12 2006-3-10 05:44

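Originally posted by albcamus at 2006-3-10 11:27:
One more point, about snooping and cache coherence on SMP:

Every IA32 CPU has to implement the MESI protocol (M: Modified; E: Exclusive; S: Shared; I: Invalid)

I just discussed this with someone. On an SMP system, the root causes of memory reordering are probably the following:

1. Modern CPUs execute instructions in parallel, which makes the order in which memory is written or read indeterminate.

2. The data and instruction buffers inside each CPU, and the coherence between the caches of the different CPUs.

So the principles mentioned earlier can be understood as follows: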

  1. A given CPU always perceives its own memory operations as occurring in program order. That is, memory-reordering issues arise only when a CPU is observing other CPUs' memory operations.

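  Reordering that occurs on a uniprocessor system is handled by the CPU itself; only on SMP systems do kernel programmers have to deal with memory reordering. The reasons are the two points above.

  2. An operation is reordered with a store only if the operation accesses a different location than does the store.

  If the reordered operations include a store, then the other operation necessarily accesses a memory location unrelated to the one the store accesses.

  3. Aligned simple loads and stores are atomic.

  Simple load and store operations on aligned data are atomic. This implies that a load or store of unaligned data may appear out of order to other CPUs.

  4. Linux-kernel synchronization primitives contain any needed memory barriers, which is a good reason to use these primitives.

  The synchronization primitives of any operating system contain the necessary memory barrier instructions, and the Solaris code I gave earlier is no exception.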

xiaozhaoz 2006-3-10 06:40

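Yesterday I looked up some material. At the hardware level, as both of you said, consistency between memory and cache, and coherence between caches, are guaranteed by the CPU's own hardware mechanisms; all we need to do is use certain instructions such as lock and mfence.
What the lock instruction actually does is generate a lock signal; whether that means driving a pin to the northbridge high, I could not find out. On CPUs after the P6, lock has been optimized somewhat and may not lock the host bus.

If the host bus does have to be locked, for example for a memory access, then in a two-CPU system the MRM owns the bus while the LRM uses PHIT and PHITM to snoop whether the access hits a line in its own cache; on a hit it notifies the MRM and the relevant cache line is invalidated.
The mfence instruction only exists from the P4 onward, and it costs less than lock.

This is why smp_mb() == mb() on SMP, while smp_mb() == barrier() on UP. barrier() is only a compiler barrier: after it, gcc turns accesses to values held in registers into memory accesses, to guarantee consistency.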

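There is one thing none of us has considered, though. An x86 SMP system generally has a two-level bus structure: the host bus and the PCI bus (or some other I/O bus).

There are also three-level structures, for example where several CPUs share an L2 or L3 cache, so that there is a further local bus layer on top of the host bus. An ADI CPU I used to work on had this kind of multi-core structure; I don't know whether the HT P4 is built the same way.

The PCI bus is connected to the host bus through the PCI/host bridge, and memory sits on the host bus. PCI I/O has to be mmapped into memory, and for a PCI device to access MMIO it relies on DMA to read and write memory. That memory may be accessed by DMA and by the CPU at the same time, so how is their consistency handled? Most of the LKML discussion is about exactly this: memory vs I/O, and I/O vs I/O.

If the I/O is mapped into memory as MMIO and the variable is declared volatile, it is not held in the cache, so there is no consistency problem between cache and memory.
But it seems impossible for all the MMIO that DMA accesses to stay out of the cache, so how is that part guaranteed? From what they say, using memory barriers in code where DMA reads and writes memory also seems to be quite complicated. Could anyone who knows this area well explain it?

Many thanks to everyone for the help. :D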

xiaozhaoz 2006-3-13 11:55

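First summary:

What is usually called SMP is a shared-bus architecture that also shares memory; the memory may be main memory or it may be cache.

For a single CPU, keeping the contents of the CPU cache and of memory in sync involves the following points:
1. Although the CPU may change the order of execution, the reordered instructions are still correct in a UP environment.
2. For synchronization between the CPU's cache and memory, the only problem to consider is the one caused by DMA and the CPU accessing memory at the same time; this has to be handled when writing the driver, using suitable instructions and mb. In other words, on UP only synchronization between the CPU and DMA needs to be considered.

On SMP, synchronization between CPUs, and between CPUs and DMA, both need to be considered.

Everything discussed above concerns synchronization between CPUs.
SMP is a shared-bus structure, and in general there are two levels of bus: the host bus, and the PCI bus or some other I/O bus.
The host bus connects the CPUs, the memory and the host/PCI bridge (what is usually called the northbridge).
The PCI bus connects the host/PCI bridge, the PCI master and slave devices, and the PCI/ISA bridge, which is what is usually called the southbridge.

So for a PCI device to map its registers into memory, it has to go through the host/PCI bridge, which accesses the host bus and from there reaches memory. The mapping of and access to that memory are carried out by the bridge together with DMA.

Multiple CPUs also have to go through the host bus to access memory.

It follows that any CPU or DMA engine that wants to access memory has to lock the bus, because the bus is shared. Likewise, for a modification of memory to become known to other devices there has to be a signalling mechanism: when some device modifies memory, something must snoop the bus and then notify the other device through a signal. For example, when DMA accesses memory the CPU monitors the bus, and HIT and HITM are used to tell the CPU that the modified content hits its cache, so the relevant cache entry, generally 64 bits, has to be invalidated. This process works its way up through the cache levels one at a time.

To keep DMA data from being cached in the CPU, people generally declare it volatile. This approach is inefficient: every time, the CPU has to lock the bus and access memory to obtain the content.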

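Everything above is about hardware.

On the software side, the order in which code executes may be changed by both the compiler and the CPU.
To make sure that code accessing memory executes in the specified order, the smp_*mb*() macros must be used.

On a single CPU, smp_*mb*() is only a compiler barrier; it merely stops the compiler from wrongly optimizing code that accesses memory:
#define barrier() __asm__ __volatile__("": : :"memory")
volatile tells the compiler that this statement must not be dropped, and "memory" is a clobber that tells the compiler:
1. Memory has been modified, so values held in registers must be re-fetched from memory after this statement.
2. The assembly must be generated in the original order of the code.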
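
The effect of the "memory" clobber can be seen in a small GCC/Clang-specific sketch. The macro is the one quoted above; `shared_flag` and `read_twice()` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Compiler barrier as quoted in the thread; it emits no machine instruction. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Hypothetical variable that some other context might change. */
int shared_flag;

int read_twice(void)
{
    int first = shared_flag;    /* the compiler may keep this in a register */

    barrier();                  /* compiler must assume memory has changed */

    int second = shared_flag;   /* so this load is issued again from memory */
    return first + second;
}
```

Without the barrier, an optimizing compiler would be free to load `shared_flag` once and reuse the value; with it, the second load cannot be merged with the first. Nothing here constrains the CPU itself, which is why this is only sufficient on UP.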

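On SMP, smp_*mb*() is a combination of a hardware barrier and a compiler barrier:
#define smp_mb() mb()
#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
alternative() is there to keep the instructions compatible across CPUs: CPUs before the P4 have no mfence instruction, so lock; addl $0,0(%%esp) is used on them. alternative() includes a "memory" clobber, so it also acts as a compiler barrier.
What lock does is assert the lock signal and take over the host bus, while the other CPUs snoop the bus and invalidate the corresponding entries in their caches, generally 64 bits each.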
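
The UP/SMP split has a close analogue in standard C11, which may make the distinction easier to see: `atomic_signal_fence()` constrains only the compiler, while `atomic_thread_fence()` also constrains the hardware. The sketch below is an analogy under stated assumptions, not the kernel's definition; `SKETCH_CONFIG_SMP`, `sketch_smp_mb()`, `sketch_publish()` and `sketch_consume()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* SKETCH_CONFIG_SMP is a hypothetical switch standing in for CONFIG_SMP. */
#ifdef SKETCH_CONFIG_SMP
/* Hardware plus compiler barrier, in the spirit of mb(). */
#define sketch_smp_mb() atomic_thread_fence(memory_order_seq_cst)
#else
/* Compiler-only barrier, in the spirit of barrier(). */
#define sketch_smp_mb() atomic_signal_fence(memory_order_seq_cst)
#endif

int sketch_data;
atomic_int sketch_ready;

void sketch_publish(int value)
{
    sketch_data = value;
    sketch_smp_mb();    /* order the data store before the flag store */
    atomic_store_explicit(&sketch_ready, 1, memory_order_relaxed);
}

/* Returns the published value, or -1 if nothing has been published yet. */
int sketch_consume(void)
{
    if (!atomic_load_explicit(&sketch_ready, memory_order_relaxed))
        return -1;
    sketch_smp_mb();    /* order the flag load before the data load */
    return sketch_data;
}
```

Built without `SKETCH_CONFIG_SMP`, the fences cost nothing at run time, which is only safe when a single CPU executes both sides.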

For the situations in which a memory barrier is needed, see:
http://www.linuxjournal.com/article/8211
http://www.linuxjournal.com/article/8212
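and the attached document.

As for memory barriers between I/O DMA and the CPU, everyone is welcome to continue the discussion.

One more question: why doesn't the glibc shipped in Linux distributions come in separate UP and SMP versions? One would expect a system library that has already been compiled to binary to be incompatible between UP and SMP, since the implementation of locks and the like depends on mb.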
