OS Support For Transactional Memory

Candy · **Posted:** Fri May 04, 2007 10:39 am

I'm still not sure. What if you have a disk resource and a memory resource that both need to work under one transaction? You can do either atomically, but not both, unless I'm missing something.

Colonel Kernel · **Posted:** Fri May 04, 2007 12:14 pm

Brendan wrote:

For an example, consider a large transaction that reads from 1000 addresses spread across 10 pages and writes to 100 addresses spread across 2 pages. In this case "compiler based STM" has all the overhead for 1000 reads and 100 writes, while "OS based STM" has the overhead for 10 reads and 2 writes.

I don't think typical transactions would be that large. The kind of lock-based code I find myself writing (that would still be a pain to re-write using lock-free techniques) usually updates less than 10 shared memory locations, all on the same page (likely near other shared variables that get updated concurrently). But I'm only one data point.

In general, the use of large transactions is discouraged for performance reasons anyway (in databases and TM).

Brendan wrote:

mystran wrote:

Personally I prefer software transactional memory for its flexibility, but to do it properly (i.e. -- not using paging) you really need intervention from the compiler, since it has to emit the proper read and write barriers... basically it instruments the code within an "atomic block" so that it can keep track of transactions.

To put it simply, I don't think transactional memory is really the OS' responsibility. HTM is up to the hardware (although this probably requires some kernel support) and STM is up to the compiler and language run-time environment.

I'm not mystran.

Quote:

OS based STM wouldn't be less flexible

You can't have small, fine-grained transactions. Applications have to rely on your OS' policies for optimistic versus pessimistic transaction management. With the compiler-based approach, transactions can be very fine-grained and you can change the transaction policy by choosing a different run-time library.

Candy wrote:

I'm still not sure. What if you have a disk resource and a memory resource that both need to work under one transaction? You can do either atomically, but not both, unless I'm missing something.

How often have you needed to change something in a file atomically, while holding a lock on a piece of shared memory?

Kevin McGuire · **Posted:** Fri May 04, 2007 5:56 pm

Colonel Kernel wrote:

You can't have small, fine-grained transactions. Applications have to rely on your OS' policies for optimistic versus pessimistic transaction management. With the compiler-based approach, transactions can be very fine-grained and you can change the transaction policy by choosing a different run-time library.

You're post was broad for me, but if you intention was just that then I suppose asking for clarification would be out of the way. However, if it is not out of the way I would interested in you're concept of these:

a. What would be considered a fine-grained transaction. I am assuming the data size of the transaction, right?
a1. What is the primary factor in preventing or limiting fine-grained transactions?

b. What functionality could a transaction policy provide?
b1. How could a application be limited to one policy because transactional memory was implemented in kernel?

Brandon wrote:

For an example, consider a large transaction that reads from 1000 addresses spread across 10 pages and writes to 100 addresses spread across 2 pages. In this case "compiler based STM" has all the overhead for 1000 reads and 100 writes, while "OS based STM" has the overhead for 10 reads and 2 writes. Also, "OS based STM" can make use of existing mechanisms built into the CPU (page faults, accessed & dirty flags) to keep track of accesses, while "compiler based STM" needs to insert additional code at every access.

a. I had one-hundred threads do one write each across two pages would that not increase the collision domain and come up around the same figure as the "compiler based STM" overhead considering each instance of the transaction expense to be one to one with you're paging mechanism?
It would appear that a worst case scenario would be all thread enter into a read state which is broken by one thread writing causing the outstanding reads to become invalid and have enter a retry phase which each literation of reads and a write the thread with a successful write/commit will exit the equation.

(Let c be the transaction count; Let t be the thread count with one write or read transaction per period.)
a1. If the expense is not one to one what would be a more appropriate one as you're guess or maybe a general idea of why it would be significantly more enough to make a difference?

And of course the compiler ability to (maybe) do a much finer grain and decrease the collision domain..

Colonel Kernel · **Posted:** Fri May 04, 2007 11:39 pm

Kevin McGuire wrote:

a. What would be considered a fine-grained transaction. I am assuming the data size of the transaction, right?

Data size, and also data locality. With compiler-based STM, you can have independent transactions involving words that are right next to each other in memory without having one transaction trip up the other one and cause it to re-try. With a paging-based approach, collisions are far more likely (as you've pointed out).

Quote:

a1. What is the primary factor in preventing or limiting fine-grained transactions?

The primary factor is your implementation technique. Compiler-based STM can be very fine-grained. Hardware transactional memory (HTM) is typically based on cache lines, so it is more coarse-grained, but still not as coarse-grained as a page-based approach, which is what Brendan is proposing.

Quote:

b. What functionality could a transaction policy provide?

It's not so much providing functionality as it is providing a choice of performance trade-offs. Depending on the implementation technique, there can be "optimistic" or "pessimistic" approaches to reading and writing in transaction. I sometimes get these two confused, so bear with me.

I think an optimistic write policy means that the transaction writes to the target location immediately, keeping an undo log off to the side. This makes commits cheap and roll-backs expensive. A pessimistic policy re-directs writes to a separate copy of the target data, making rollbacks cheap and commits expensive. Depending on the run-time behaviour of the transactions, the performance of the two approaches varies. In general it isn't really possible to claim that one is always better than the other, because the situation can change (just like there is no single page-replacement policy that is optimal for all applications).

Quote:

b1. How could a application be limited to one policy because transactional memory was implemented in kernel?

It would only be limited within whatever mechanism was offered by the kernel. So, if an app didn't use the kernel's mechanism, it could do whatever it liked. But that begs the question of why should the kernel implement it in the first place.

Especially if, as I suspect, a page-based approach would be sub-optimal for most applications (if I find any relevant papers that dispute this, I'll post them here).

Colonel Kernel · **Posted:** Sat May 05, 2007 12:11 am

Ok, here's one:

http://www.cse.ucsd.edu/~calder/papers/ ... 06-PTM.pdf

The down side is, even this technique assumes additional hardware support. I haven't finished reading it yet, but I'm getting a hunch that page-based TM is not feasible with today's CPUs...

Brendan · **Posted:** Sat May 05, 2007 1:48 am

Hi,

Colonel Kernel wrote:

Brendan wrote:

For an example, consider a large transaction that reads from 1000 addresses spread across 10 pages and writes to 100 addresses spread across 2 pages. In this case "compiler based STM" has all the overhead for 1000 reads and 100 writes, while "OS based STM" has the overhead for 10 reads and 2 writes.

I don't think typical transactions would be that large. The kind of lock-based code I find myself writing (that would still be a pain to re-write using lock-free techniques) usually updates less than 10 shared memory locations, all on the same page (likely near other shared variables that get updated concurrently). But I'm only one data point.

In general, the use of large transactions is discouraged for performance reasons anyway (in databases and TM).

For a high performance database engine that may need to handle high contention, I wouldn't use any form of transactional memory. I also wouldn't let any programmer who lacks the necessary experience with traditional locking anywhere need the code.

IMHO all forms of STM are only useful for low contention situations, and exist to make multi-threaded software easier to write rather than to improve performance.

Colonel Kernel wrote:

Quote:

b1. How could a application be limited to one policy because transactional memory was implemented in kernel?

It would only be limited within whatever mechanism was offered by the kernel. So, if an app didn't use the kernel's mechanism, it could do whatever it liked. But that begs the question of why should the kernel implement it in the first place.

Especially if, as I suspect, a page-based approach would be sub-optimal for most applications (if I find any relevant papers that dispute this, I'll post them here).

So far I've seen a few STM libraries for C and C++, and an obscure language that supports STM (Haskell).

If you're writing something in Pascal and want to use STM, what are your current choices? Would you write your own Pascal compiler (or Pascal library, if that's possible), or manually insert code to support transactions into your application? What if you're an assembly programmer like me?

Would someone who lacks the experience or time necessary to use traditional re-entrancy locking correctly have the experience or time necessary to implement their own STM support correctly?

Colonel Kernel wrote:

Ok, here's one:

http://www.cse.ucsd.edu/~calder/papers/ ... 06-PTM.pdf

The down side is, even this technique assumes additional hardware support. I haven't finished reading it yet, but I'm getting a hunch that page-based TM is not feasible with today's CPUs...

IMHO this paper suggests that proposed HTM techniques are broken, and then offers ways of fixing the proposed HTM techniques using page-based extensions. Ironically one of the reasons that the proposed HTM techniques are broken is that transactions may be too large for the cache.

Cheers,

Brendan

Colonel Kernel · **Posted:** Sat May 05, 2007 11:55 am

Brendan wrote:

For a high performance database engine that may need to handle high contention, I wouldn't use any form of transactional memory.

You misunderstood me. I was talking about recommended database transaction sizes, not about using STM in the implementation of a database engine (which really has little or nothing to do with implementing database transactions). It was an analogy.

Quote:

IMHO all forms of STM are only useful for low contention situations, and exist to make multi-threaded software easier to write rather than to improve performance.

True, but:

From what I've read, current STM approaches tend not to scale well to large transactions (hardware support helps a lot with this), so even if performance isn't the #1 goal, avoiding terrible performance should still be important.
Real contention in typical multi-threaded code is probably much lower than your intuition would suggest. However, "accidental contention" can be a huge problem if the implementation is too coarse-grained.

Quote:

So far I've seen a few STM libraries for C and C++, and an obscure language that supports STM (Haskell).

If you're writing something in Pascal and want to use STM, what are your current choices? Would you write your own Pascal compiler (or Pascal library, if that's possible), or manually insert code to support transactions into your application? What if you're an assembly programmer like me?

What I've been trying to tell you is that transactional memory is currently being researched. Those libraries you've seen are mostly for experimentation, to get a sense of the characteristics of the way multi-threaded code would interact with STM. They are research tools, not commercial tools.

You probably can't use garbage collection with Pascal without some experimental run-time library either. I know you can't use GC with assembler.

So the fact that TM is not universally available to everyone is nothing new. After all, it is research.

I think that transactional memory "for the masses" requires hardware, OS, and compiler support. Obviously this is a few years off from commercial reality.

This will all turn out to be moot anyway if your idea isn't feasible to implement. I'm assuming for now that since none of the papers talk about a paging-based approach on current hardware, that must mean there is some critical flaw with the idea that we're not seeing yet.

Quote:

Would someone who lacks the experience or time necessary to use traditional re-entrancy locking correctly have the experience or time necessary to implement their own STM support correctly?

Of course not. That's why researchers with PhDs that are much smarter than both of us are trying to figure this out.

Quote:

IMHO this paper suggests that proposed HTM techniques are broken, and then offers ways of fixing the proposed HTM techniques using page-based extensions. Ironically one of the reasons that the proposed HTM techniques are broken is that transactions may be too large for the cache.

I don't agree with that interpretation. What the paper is saying is that the first HTM techniques have really good performance, but impose hard resource limits on the programmer (cache size limits, and time limits imposed by interrupts and context switches). It isn't that the techniques are broken, it's that they are at too low a level of abstraction to be used directly. The proposal is really about virtualizing the transactions to abstract the time & space limits away from the programmer, in much the same way that virtual memory abstracts away the limits of physical memory (by analogy, not by implementation). The idea is that the current HTM techniques will make the common case fast, while the virtualization makes the whole system easy enough to actually use.

Brendan · **Posted:** Mon May 07, 2007 9:48 am

Hi,

Let's summerise the different techniques:

Compiler-based STM - allows different performance trade-offs to be made by different libraries and/or different compilers. Works by inserting additional code for each memory access (has "per access" overhead). The "commit()" has "per accessed location" overhead. Doesn't suffer from any "false sharing" problems. Is unlikely to allow programmers to mix languages and can't be implemented in a practical manner for some languages (e.g. assembly). No upgrade path to HTM (i.e. a binary compiled for STM still uses STM when HTM is available).

Page-based STM - doesn't allow different performance trade-offs to be made by different libraries and/or different compilers. Works by using CPUs paging support (has page fault penalty for first access to a page, but no "per access" overhead). The "commit()" has "per accessed page" overhead. Does suffer from "false sharing" problems. Allows programmers to mix languages and can be used by any language. Transparent upgrade path to HTM (i.e. a binary wouldn't know or care if it's using STM or HTM).

HTM - doesn't allow different performance trade-offs to be made by different libraries and/or different compilers. Works by using the CPU's caching (has cache miss penalty for first access to a cache line, but no "per access" overhead). The "commit()" has "per accessed cache line" overhead. Does suffer from "false sharing" problems but not as much as page-based STM (granularity is probably 128 bytes, instead of 4096 bytes). Allows programmers to mix languages and can be used by any language. Is unlikely to be usable in practice without linear memory manager assistance. Software designed for HTM won't work on computers without HTM support (unless an OS used something like page-based STM when HTM isn't supported). May never exist unless the CPU manufacturers consider HTM to be a profitable/marketable feature.

Colonel Kernel wrote:

This will all turn out to be moot anyway if your idea isn't feasible to implement. I'm assuming for now that since none of the papers talk about a paging-based approach on current hardware, that must mean there is some critical flaw with the idea that we're not seeing yet.

It is entirely possible to implement paged-based STM. The only critical flaw with page-based STM is that it's not very portable (i.e. you'd need different code on each architecture and the target architecture must have a suitable MMU).

It would also have different performance characteristics, but I wouldn't consider that a critical flaw (that'd be like refusing to consider interpretted languages because compiled code is faster, even though there's good reasons to use interpretted languages for some things).

I'd assume there's a variety of reasons why we can't find papers talking about the page-based approach:

Simon Peyton Jones

Of course just because we can't find papers talking about the page-based approach doesn't necessarily mean that they've never been written...

Cheers,

Brendan

Colonel Kernel · **Posted:** Mon May 07, 2007 1:22 pm

Brendan wrote:

Compiler-based STM - allows different performance trade-offs to be made by different libraries and/or different compilers.

The policy would be implemented by the run-time environment, not the compiler.

Quote:

The "commit()" has "per accessed location" overhead.

Only if the policy is "pessimistic" (i.e. -- lazy writes). For an "optimistic" policy (i.e. -- eager writes), the overhead is on abort, not commit.

Quote:

Is unlikely to allow programmers to mix languages

I disagree -- since most of the complexity is in the run-time environment, I think it would be possible for a cross-language environment like the CLR to support STM for any of the supported languages. The compilers would have to support it too, to some degree (it may be possible to implement a lot of the instrumentation at the step where native code is generated from IL).

Quote:

HTM - doesn't allow different performance trade-offs to be made by different libraries and/or different compilers.

While this is true, it may be possible to have different policies in the OS support code that goes along with the special HTM hardware... I haven't read enough about it to say for sure.

Quote:

Is unlikely to be usable in practice without linear memory manager assistance.

Probably more than just the linear memory manager. What part of the OS maintains transaction state across interrupts, thread switches, and thread migrations?

Quote:

May never exist unless the CPU manufacturers consider HTM to be a profitable/marketable feature.

Given the massive industry-wide push into multi-core, and the realization that new techniques are required to make multi-threaded programming easier (necessary in order to sell more multi-core chips), I think Intel and AMD will be all over this as soon as the research has stabilized somewhat and reached some consensus.

Quote:

It is entirely possible to implement paged-based STM.

I said feasible, not possible.

Quote:

The only critical flaw with page-based STM is that it's not very portable (i.e. you'd need different code on each architecture and the target architecture must have a suitable MMU).

That's not a critical flaw -- lots of kernel code is not portable. I was thinking about it some more, and my initial hunch that conflict detection would be too slow doesn't really hold up, assuming I understand your proposed implementation correctly:

A thread starts a transaction by accessing a page somewhere in "transaction space" (a process-global area), which causes a page fault.
The MM allocates a copy of the underlying physical page, and maps the copy as valid at the same virtual address, for the faulting thread only. Other threads would have their own mappings for that virtual page that still have the "not present" bit set. This is already a pain in the butt for OSes that (unlike yours) do not support thread-specific page mappings.
As the thread accesses more pages, more of these per-thread mappings are created. They are all chained together in some list maintained by the MM so that it can keep track of the thread's "transaction working set".
If other threads trigger a page fault somewhere in the first thread's transaction working set, the transaction would have to be aborted, which involves freeing all the pages copied by the MM on behalf of the thread in the transaction.
A commit would involve changing the page mappings for all the threads to point to the copies worked on by the thread that is doing the commit.

Did I miss anything?

Besides the issue of conflicts, which IMO are both expensive and likely with this scheme, it has another big flaw: Only a certain region of the address space can be used for transactional memory. Programs that have large, complex data structures allocated in the heap that are shared by multiple threads would need extra support from the user-level heap allocator as well (to avoid "false conflicts", it would have to try and interleave "related" allocations across many different pages -- but it doesn't know which allocations are "related"!). The compiler and linker would also have to arrange to have global data loaded in the transactional area. Suddenly this technique doesn't look very transparent to the rest of the system.

Quote:

It would also have different performance characteristics, but I wouldn't consider that a critical flaw (that'd be like refusing to consider interpretted languages because compiled code is faster, even though there's good reasons to use interpretted languages for some things).

If it's terribly slow under normal conditions, I would say it isn't really helping.

Quote:

Of course just because we can't find papers talking about the page-based approach doesn't necessarily mean that they've never been written...

True, but given the efficiency of Google and the popularity of the topic, I don't find this to be very likely.

Brendan · **Posted:** Tue May 08, 2007 3:58 pm

Hi,

Colonel Kernel wrote:

Quote:

The "commit()" has "per accessed location" overhead.

Only if the policy is "pessimistic" (i.e. -- lazy writes). For an "optimistic" policy (i.e. -- eager writes), the overhead is on abort, not commit.

I discount all optimistic policies, as they tend to allow transactions to read inconsistent state (see this Wikipedia page). While it's technically possible to detect this in exception handlers and recover (e.g. abort transactions if an exception occurs while a transaction is working on inconsistent data), it's messy and doesn't help if the code goes into an infinite loop.

Of course programmers could be very careful and (try to) avoid writing code that can't handle inconsistent state, but this introduces the possibility of bugs that result in race conditions that are very hard to find and debug. It just doesn't make any sense considering the main idea of STM is to make concurrency easier than traditional locking (rather than harder).

Colonel Kernel wrote:

Quote:

Is unlikely to allow programmers to mix languages

I disagree -- since most of the complexity is in the run-time environment, I think it would be possible for a cross-language environment like the CLR to support STM for any of the supported languages. The compilers would have to support it too, to some degree (it may be possible to implement a lot of the instrumentation at the step where native code is generated from IL).

I was thinking more of using GCC to compile some C and linking it with object files compiled by Delphi and NASM. I was also thinking of the current implementations of STM (which are mostly libraries, not run-time environments).

You are right though - it'd be technically possible for someone like Microsoft to create a suite of compatible compilers that all support the same STM implementation.

Colonel Kernel wrote:

Quote:

HTM - doesn't allow different performance trade-offs to be made by different libraries and/or different compilers.

While this is true, it may be possible to have different policies in the OS support code that goes along with the special HTM hardware... I haven't read enough about it to say for sure.

I guess it'd also be technically possible to have different policies in an OS's page-based STM support code too...

Colonel Kernel wrote:

Quote:

Is unlikely to be usable in practice without linear memory manager assistance.

Probably more than just the linear memory manager. What part of the OS maintains transaction state across interrupts, thread switches, and thread migrations?

I don't know, but whatever part of the OS does would probably just call the linear memory manager routines for shifting data from the CPUs tranactional cache to temporary pages of RAM (where the temporary pages of RAM would need to be handled by the linear memory manager during "transaction abort exceptions" and "transaction commit exceptions").

Colonel Kernel wrote:

Quote:

May never exist unless the CPU manufacturers consider HTM to be a profitable/marketable feature.

Given the massive industry-wide push into multi-core, and the realization that new techniques are required to make multi-threaded programming easier (necessary in order to sell more multi-core chips), I think Intel and AMD will be all over this as soon as the research has stabilized somewhat and reached some consensus.

I see 4 possibilities:

automatic parallelization

Colonel Kernel wrote:

Besides the issue of conflicts, which IMO are both expensive and likely with this scheme, it has another big flaw: Only a certain region of the address space can be used for transactional memory. Programs that have large, complex data structures allocated in the heap that are shared by multiple threads would need extra support from the user-level heap allocator as well (to avoid "false conflicts", it would have to try and interleave "related" allocations across many different pages -- but it doesn't know which allocations are "related"!). The compiler and linker would also have to arrange to have global data loaded in the transactional area. Suddenly this technique doesn't look very transparent to the rest of the system.

I'm used to having one heap for process space and another heap for thread space. Transactional space would need at least another heap, but there's no reason transactional space can't have several more heaps (one for each set of related data). How processes manage their heaps is up to the processes (and to be honest, I have a tendancy to use statically allocated space with large sparse arrays, "allocation on demand" and explicit "freeLinearPage()" calls, with no heap managers at all).

The compiler and linker doesn't have to arrange for global data to be loaded into the transactional area. It works the same as thread space and the ".bss" section in process space - it's all just uninitialised data (conceptually filled with zeros).

Colonel Kernel wrote:

Quote:

It would also have different performance characteristics, but I wouldn't consider that a critical flaw (that'd be like refusing to consider interpretted languages because compiled code is faster, even though there's good reasons to use interpretted languages for some things).

If it's terribly slow under normal conditions, I would say it isn't really helping.

The overhead (currently) consists of:

Note: the spinlock ensures that only one CPU can be messing with page table entries in (any thread's) transactional space at a time. It's used in the page fault handler (for any page faults caused by accesses to transactional space) and in the "commit()" code (only for transactions that modified data and weren't marked as "should abort" before the commit).

There's also overhead when transactions need to be retried. For my current algorithm, transactions that don't modify data never need to be retried, and when transactions that do modify data conflict the newest transaction is aborted.

Of course compiler-based STM has been improved by different researchers over time, and I'm still thinking about possible optimizations for page-based STM.

Colonel Kernel wrote:

Quote:

Of course just because we can't find papers talking about the page-based approach doesn't necessarily mean that they've never been written...

True, but given the efficiency of Google and the popularity of the topic, I don't find this to be very likely.

I'm not convinced that the infinite powers of Google extend to hardcopies sitting on shelves in university libraries that were never released in electronic form to the general public, or to work done by large commercial companies like Intel or IBM who may not want competitors to see their results (regardless of whether the results are actually useful or not). As for the papers that Google does list, I've probably only looked at 10% of them...

Cheers,

Brendan

ruibo · **Posted:** Tue May 29, 2007 7:56 am

you should read this paper:

Tradeoffs in Transactional Memory Virtualization

Candy · **Posted:** Tue May 29, 2007 9:58 am

ruibo wrote:

you should read this paper:

Tradeoffs in Transactional Memory Virtualization

Do you have a link?

Brynet-Inc · **Posted:** Tue May 29, 2007 10:09 am

A quick google search shows a few results Candy..

Is this what you're referring to ruibo?

http://carlstrom.com/publications/tcc_asplos2006.pdf

Colonel Kernel · **Posted:** Tue May 29, 2007 10:30 am

I've started reading it (am being interrupted by this annoyingly persistent thing called my job

). It looks like a discussion of how to use paging to virtualize TM beyond the limits imposed by hardware-based approaches. I don't think it's a scheme that can perform well enough without dedicated hardware support (I could be wrong though... I'm only a few pages in).

@ruibo: Thanks for the tip!

Brendan · **Posted:** Wed May 30, 2007 7:03 am

Hi,

Colonel Kernel wrote:

I've started reading it (am being interrupted by this annoyingly persistent thing called my job

). It looks like a discussion of how to use paging to virtualize TM beyond the limits imposed by hardware-based approaches. I don't think it's a scheme that can perform well enough without dedicated hardware support (I could be wrong though... I'm only a few pages in).

First they explain that HTM isn't really that useful by itself, and propose "XTM" which virtualizes HTM so that it can cope with task switches, working sets larger than HTM alone can handle, etc.

They also show that their XTM can work without any hardware support at all, which makes it pure paged-based STM, and they provide a comparison between XTM with hardware support and XTM with no hardware support (or a comparison between "HTM with stuff to make it practical" and "page-based STM"). See Figure 9 and section 7.6 "XTM-only Transactional Memory".

Their results show that (for their implementation) pure paged-based STM is between 3 times slower and 8.3 times slower than XTM with hardware support, and the difference depends on how much overflows the hardware (where paged-based STM is 8.3 times slower than "HTM with no need for virtualization").

If you assume that "HTM with no need for virtualization" is as fast as we could reasonably expect, then IMHO 8.3 times slower is quite acceptable considering that HTM can't exist until CPU manufacturers implement it while page-based STM can exist now.

@ruibo: Thanks - it's a nice find!

Cheers,

Brendan

OSDev.org

OS Support For Transactional Memory

Who is online