|
#1
|
|
|
|
|
Hi,
this question is a bit of an odd ball I'm afraid. What I need to know to be cache conscious when implementing a hash table is how an OS Linux/UNIX reads data from RAM into a cache line. Let's assume I have: * a CPU with a cache line size of 64 bytes * the page size of my OS set to 4 kb * my OS is a 32-bit system * my own memory allocator serving me one full page at a time How do I need to layout my data elements on a page to ensure they are placed cache consciously? Eg: 1) A page is allocated. It is empty apart from a page header of 40 bytes. 2) I want to insert a Data element1 of 50 bytes into the page. Questions: 1) Do I need to align the page placement with my cache line size? Meaning I have to write my element with an offset of 64 bytes from the page start, leaving 14 bytes un-used? Page ------------------- I Page header 40bytes I 16 bytes un-used I Data element1 50 bytes That would mean wasting a lot of space, especially since Data element2 would also have to take cache line size into account leaving another 14 un-used. Please, say it isn't so... 2) Say I have a function blork with the following variables declared in C: void blork(void) { struct foo { int var1; int var2; }; int var3; int var4 ..... Will the cache line look like this: var1, var2, var3, var4.... or var3, var4.... Thanks in advance /Sune |
|
|
|
#2
|
|
|
|
|
Sune wrote:
> Hi, > > this question is a bit of an odd ball I'm afraid. What I need to know > to be cache conscious when implementing a hash table is how an OS > Linux/UNIX reads data from RAM into a cache line. The operating system does not manage the cache directly. Cache replacement happens far too frequently for OS involvement. It is built built into the cache hardware. Moreover, many processors have more than one cache. A processor may have, for example, a large off-processor cache implemented as SRAM, a smaller but faster on-chip cache, and a very small, very fast cache physically close to the part of the processor that actually executes instructions. At any given level, data and code may be kept in the same cache or in different caches. In multiprocessor, multiple core, and multiple thread situations there are choices about which caches are shared and how the sharing works. The various caches may have different line sizes. Do you understand "set associativity"? Do you have a higher level objective than "being cache conscious"? There are general rules that help e.g. if your objective is to get good performance for various types of data. Are you dealing with data that can be modified and is actively shared between processors? That creates different issues from data that is either infrequently shared or infrequently modified. If you do need the exact rules, you need to know the processor make and model, and read the specifications for that processor. Patricia |
|
#3
|
|
|
|
|
On 23 Aug, 15:50, Patricia Shanahan <p> wrote:
[..] > are general rules that help e.g. if your objective is to get good > performance for various types of data. Are you dealing with data that > can be modified and is actively shared between processors? That creates > different issues from data that is either infrequently shared or > infrequently modified. > > If you do need the exact rules, you need to know the processor make and > model, and read the specifications for that processor. > > Patricia Hi Patricia, thanks for trying to set me straight. Ok, so my concern is performance, and a way to improve performance is to place data in memory according to its use. The issue get's more complicated with SMP so let's skip this for a while and discuss a single processor and L1 data cache and data placement in memory only. I think this is called spatial locality. Given this and the knowledge of my using AMD and Intel processors, can you help me with the 2 questions above? Thanks /Sune |
|
#4
|
|
|
|
|
Sune wrote:
> On 23 Aug, 15:50, Patricia Shanahan <p> wrote: > > Hi Patricia, > > thanks for trying to set me straight. Ok, so my concern is > performance, and a way to improve performance is to place data in > memory according to its use. The issue get's more complicated with SMP > so let's skip this for a while and discuss a single processor and L1 > data cache and data placement in memory only. I think this is called > spatial locality. > > Given this and the knowledge of my using AMD and Intel processors, can > you help me with the 2 questions above? .... Here are some quick notes. I may post some more advice later. These ideas depend only on the general outline of how caches are managed, so following them will work on many processors. The only code I've consciously tuned for a specific processor's cache design is some benchmark-critical code such as a matrix multiply that was going to affect Linpack benchmark performance. The main thing you can control is spacial locality. Referencing a single byte brings in a whole cache line. You want the rest of the cache line to contain data that will be used in the near future. Do not add any padding - padding is certain to be unused. (The SMP issue is important, because in some multiprocessor situations it is important to avoid putting certain variables in the same cache line. If so, padding can be useful.) Ideally, keep access-related data together. Unfortunately, programmer estimates of usage patterns tend to be reliable only for large data structures that are accessed by a relatively simple algorithm. I've had some surprises using a logic analyzer connected to the processor-memory bus to see what is really happening. Keep your data small. Caches work best when the program can run for significant periods without going outside a cache. For memory intensive access to large data structures, try to work in memory address order as much as possible. That both increases spacial re-use, and many processors have sequential access prefetch capability that will bring in the next line before there is a cache miss for it. Cache collisions involve cache line addresses that differ by a large power of two. If you need to do something like matrix transposition that is likely to involve some big stride memory access, structure the matrix to avoid large strides that are powers of two or close to powers of two. Once you get some data into cache, do as much as possible with it. Loop tiling can help if you have nested loops: [url down] Patricia |
|
#5
|
|
|
|
|
On 23 Aug, 16:33, Patricia Shanahan <p> wrote:
[..] > > Cache collisions involve cache line addresses that differ by a large > power of two. If you need to do something like matrix transposition that > is likely to involve some big stride memory access, structure the matrix > to avoid large strides that are powers of two or close to powers of two. > > Once you get some data into cache, do as much as possible with it. Loop > tiling can help if you have nested loops:[..] > > Patricia Ok, thanks. I get what you are saying. So now the only thing left is my questions ;-) Maybe I don't know enough to be able to ask the right things... I allocate a page (page aligned) which starts on address 0xA000. Immediately there is a header of 40 bytes. How should I place my data ? For example if I assume I need to store my data in cache line aligned this is what I need to do for question1 above: H is header ? is padding to make the data layout cache line aligned 0xA000: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH?????????? ?????????????? 0xA040: DDDDDDDDDDDDDataElement111111111111111111111111111 1111111??????????????? 0xA080: DDDDDDDDDDDDDataElement222222222222222222222222222 2222??????????????? : If I have a pointer referencing address 0xA050 (the E in DataElement1) the following is read into the cache line: DDDDDDDDDDDDDataElement111111111111111111111111111 1111111??????????????? Am I correct? Thanks again /Sune |
|
#6
|
|
|
|
|
Sune wrote:
.... > So now the only thing left is my questions ;-) Maybe I don't know > enough to be able to ask the right things... > > I allocate a page (page aligned) which starts on address 0xA000. > Immediately there is a header of 40 bytes. How should I place my > data ? For example if I assume I need to store my data in cache line > aligned this is what I need to do for question1 above: *Why* do you assume you need to store your data cache line aligned? Patricia |
|
#7
|
|
|
|
|
Responding to Sune...
> this question is a bit of an odd ball I'm afraid. What I need to know > to be cache conscious when implementing a hash table is how an OS > Linux/UNIX reads data from RAM into a cache line. Shanahan's advice is very good. However, I would like to push back on why you want to optimize at this level at all... > Say I have a function blork with the following variables declared in > C: > > void blork(void) > { > struct foo { > int var1; > int var2; > }; > > int var3; > int var4 > .... > > Will the cache line look like this: > var1, var2, var3, var4.... > > or > var3, var4.... It is very difficult to optimize CPU caching at the 3GL level. You really need to do it at the Assembly level. That's because the hardware caching algorithm is optimized around the ALU pipeline and instruction set so you really need to understand what instructions are used and what their order is to do it properly. But a good 3GL optimizing compiler will generate object code that is not at all intuitive from the source listing and can be very difficult to understand. [Some compilers that have an option to print line-by-line Assembly with the source do not even optimize when they do that; the Assembly displayed is what you would get in an unoptimized debug mode compilation. And that would not do you much good.] Essentially you need to look at the compiler's Assembly output and design macros that will force access the of data in a preferred instruction order given the CPU caching algorithm (in addition to modifying the data structures to play well with the macros!). That would be <relatively> easy in a language like BLISS that has a good macro facility but it will be both very difficult and tedious in C. In addition, such optimization will be very dependent on what you are doing with the data. So if you access the same data in different execution contexts (i.e., needing different instructions) the optimization might be fine for one context but worse than the compiler's way of doing things for another. IOW, you may need multiple sets of macros for accessing the same data in different contexts. Finally, a good 3GL optimizing compiler chooses instructions and puts them in a particular order for good reasons. So optimizing the instruction order just to optimize the CPU cache could negate the compiler's attempts to optimize other things for a net loss in performance -- though that should be rare unless the macros force different instructions to be used. (I suggest using macros above because you probably only want to override the compiler optimization in particular situations.) Corollary: Since such low level optimization is highly dependent on the hardware platform, your optimization can become obsolete when the hardware is upgraded. Thus your custom optimization may hard-wire in a performance hit, especially if the compiler is also upgraded. Bottom line: the days of ordering FORTRAN COMMONs by variable size and using two additions rather than one multiplication by 3 are long gone. It is usually not worth the trouble to do optimization at the CPU cache level unless you have a highly specialized application. Generally speaking, managing the overall memory access via things like nested loop order is much more effective because the scale is larger, especially when the application is large enough to have VM page faulting problems. Usually you can do that by simply looking at your critical 3GL procedures and viewing each data element access as loading a register from memory. If you view those register loads as defining clumps of data that should be close to adjacent in memory and/or all accessed in a closed scope, you can then optimize your data structures, iterations, and statements so that happens. (The optimizations one routinely does to minimize VM page faulting are useful and usually will be sufficient, even if VM paging itself isn't an issue.) Then the hardware caching algorithm will almost always do the right thing. FWIW I would not do such optimization unless I had hard data to indicate it was really needed and I would look closely at the data to make sure it wasn't the data collection itself that was skewing the results. Unless you are doing embedded systems with a brain dead processor, have hard real-time constraints, or have to use a cretinous C compiler, I think it would be highly unlikely that worrying about the CPU cache itself would be worthwhile. |
|
#8
|
|
|
|
|
On 23 Aug, 19:19, Patricia Shanahan <p> wrote:
> Sune wrote: > > ... >> > *Why* do you assume you need to store your data cache line aligned? > > Patricia For example, this: 'INSN_CACHE_LINE_WIDTH The length in bytes of each cache line. The cache is divided into cache lines which are disjoint slots, each holding a contiguous chunk of data fetched from memory. Each time data is brought into the cache, an entire line is read at once. The data loaded into a cache line is always aligned on a boundary equal to the line size. ' is stated about gcc here: http://www.delorie.com/gnu/docs/gcc/gccint_136.html Similar statements can be found here: [url down] If I'm misinterpreting this I'd obviously need some support. Thanks again /Sune |
|
#9
|
|
|
|
|
On 23 Aug, 19:22, "H. S. Lahman" <hlah> wrote:
[..] > think it would be highly unlikely that worrying about the CPU cache > itself would be worthwhile. > > -- > Life is the only flaw in an otherwise perfect nonexistence > -- Schopenhauer > > H. S. Lahman > H.lah...@verizon.net > software blog:[..] Ok, point taken. But if we loose the alignment on the stack and just go on to this: I allocate a page (page aligned) which starts on address 0xA000. Immediately there is a header of 40 bytes. How should I place my data ? For example if I assume I need to store my data in cache line aligned this is what I need to do for question1 above: H is header ? is padding to make the data layout cache line aligned 0xA000: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH?????????? ?????????????? 0xA040: DDDDDDDDDDDDDataElement111111111111111111111111111 1111111??????????????? 0xA080: DDDDDDDDDDDDDataElement222222222222222222222222222 2222??????????????? : If I have a pointer referencing address 0xA050 (the E in DataElement1) the following is read into the cache line: DDDDDDDDDDDDDataElement111111111111111111111111111 1111111??????????????? The reason for doing this would be to assure that I don't use 2 cache lines for data that easily fits into 1. Comments? BRs /Sune |
|
#10
|
|
|
|
|
Sune wrote:
.... > The reason for doing this would be to assure that I don't use 2 cache > lines for data that easily fits into 1. > > Comments? You seem to be very focussed on a single memory reference. That is usually a mistake when thinking about caches. If you are not getting *many* uses out of a typical cache miss you are in deep performance trouble. If your data structure and algorithms are such that you are doing a lot of scattered references with no possibility of cache reuse, you need to reexamine your choices at a far higher level. Patricia |
|
#11
|
|
|
|
|
On Sun, 23 Aug 2009 12:00:26 -0700, Patricia Shanahan wrote:
> Sune wrote: > ... > > You seem to be very focused on a single memory reference. That is > usually a mistake when thinking about caches. If you are not getting > *many* uses out of a typical cache miss you are in deep performance > trouble. > > If your data structure and algorithms are such that you are doing a lot > of scattered references with no possibility of cache reuse, you need to > reexamine your choices at a far higher level. I agree with Patricia's advice: 0) "travel light": don't make your data structure larger than needed 1) once a cache slot is "faulted in" , make sure most of it's contents are valuable to the program at that time: avoid pulling in a cache line because you need only one byte. Since your application appears to be a hashtable: in most cases linear probing has a better locality then chaining. (chaining is "cleaner" if your app requires a lot of deletes, or if the table's size is very variable). So the choise of your hashing/overflow scheme will depend on the usage pattern of your hash-table. YMMV. 2) if your data happens to be on a diskpage: forget about L1/L2 cache; the performance will be dominated by disk I/O anyway. HTH, AvK |
|
#12
|
|
|
|
|
On 23 Aug, 19:07, Sune <sune_ahlg> wrote:
.... [..] > : > > If I have a pointer referencing address 0xA050 (the E in DataElement1) > the following is read into the cache line: > DDDDDDDDDDDDDataElement111111111111111111111111111 1111111??????????????? > > The reason for doing this would be to assure that I don't use 2 cache > lines for data that easily fits into 1. > > Comments? If I understand correctly you are concerned about your data splitting across cache lines. Your quote earlier also suggests you think this will be a problem. In your first post you say that you want to implement a hash table and, to an extent, you are right to have a small concern over cache performance - but only a small one. First, you should be aware that if it is OK if a 50-byte element is split over two cache lines. Once both parts are in cache the CPU will not slow down when you change from accessing the part in the first cache line to accessing the part in the second cache line. You can treat both cache lines as the same. They will both be equally fast. Second, caches work best a) when you repeatedly access the same areas of memory or b) when you step through memory sequentially. In the latter case they effectively read ahead for bytes you will want in the future. Since you are implementing a hash table you will have little sequential access but the cache will retain data already read and you might also get some benefit when your hash function indexes entries adjacent to ones you aready have cached. For example, given a 64-byte line size, if you have cached the last 10 bytes of one entry you will also have cached the first 54 bytes of the next. Third, modern machines have multiple levels of cache. Your outer levels may have larger cache lines. Fetching from these will still be quicker than fetching from memory. Fourth, if you don't need to access all 50 bytes at once you can split your 50-byte records into smaller parts and store then in two hash tables. HOWEVER, this is rarely a good idea. It can help for much larger records but I doubt it will in this case. Fifth, and this is the most important point for performance, if your 50-byte records contain any 16-bit, 32-bit or 64-bit quantities, make sure you keep those values aligned to their natural boundaries. So, for example, 32-bit values that you load and store from 32-bit registers would be aligned to 4-byte boundaries. So, re. your question on alignment you'd normally be best to place your records adjacent to each other but ensure that any values within the records are properly aligned. Don't use any padding other than that required for field alignment. To reiterate: it's ok to split records across cache lines even if they will fit into one line. Just make sure your field alingments are right. James |
|
#13
|
|
|
|
|
Moi wrote:
> On Sun, 23 Aug 2009 12:00:26 -0700, Patricia Shanahan wrote: >> I agree with Patricia's advice: > 0) "travel light": don't make your data structure larger than needed > 1) once a cache slot is "faulted in" , make sure most of it's contents > are valuable to the program at that time: avoid pulling in a cache > line because you need only one byte. > > Since your application appears to be a hashtable: in most cases linear > probing has a better locality then chaining. I agree with this, except for the idea of hashtable as an application. A hashtable is a data structure, and associated algorithms, implementing a mapping function. If the program is in serious performance trouble due to cache misses in map lookup, I would question the choice of data structure. Patricia |
|
#14
|
|
|
|
|
>this question is a bit of an odd ball I'm afraid. What I need to know
>to be cache conscious when implementing a hash table is how an OS >Linux/UNIX reads data from RAM into a cache line. The OS doesn't. The CPU does. It's handled at a level below the OS. If there were cache-loading code, it's something you'd definitely want in cache. >Let's assume I have: >* a CPU with a cache line size of 64 bytes >* the page size of my OS set to 4 kb >* my OS is a 32-bit system The bit-ness of a CPU or OS is meaningful description of anything. For example, 32-bit Windows and 32-bit Linux are *MUCH* more different than 32-bit Linux and 64-bit Linux. >* my own memory allocator serving me one full page at a time >How do I need to layout my data elements on a page to ensure they are >placed cache consciously? Place variables that are USED together PHYSICALLY together (e.g. in the same struct), where possible. >1) >A page is allocated. It is empty apart from a page header of 40 >bytes. > >2) >I want to insert a Data element1 of 50 bytes into the page. > >Questions: > >1) >Do I need to align the page placement with my cache line size? Meaning >I have to write my element with an offset of 64 bytes from the page >start, leaving 14 bytes un-used? First, performance is not required, so cache optimization is not required. One of the basic principles of cache *pessimization* is to spread everything out so that no two variables are in the same cache line, to maximize the number of cache misses. A cache miss vs. a cache hit is a much bigger performance difference than a variable at the end of the cache line vs. a variable at the beginning of a cache line. Put variables that are used together physically together, so hopefully they are in the same cache line. Scan multi-dimensional arrays in physical order if possible. >Say I have a function blork with the following variables declared in >C: > >void blork(void) >{ >struct foo { >int var1; >int var2; >}; > >int var3; >int var4 >.... > >Will the cache line look like this: >var1, var2, var3, var4.... > >or >var3, var4.... There is no guarantee that the order of declaration of auto variables has any effect at all on their ordering in memory. The ordering might be based on a SHA-256 hash of the variable *name*. 4 ints will usually fit in the same cache line. |
|
#15
|
|
|
|
|
Gordon Burditt wrote:
..... > > There is no guarantee that the order of declaration of auto variables > has any effect at all on their ordering in memory. The ordering > might be based on a SHA-256 hash of the variable *name*. 4 ints > will usually fit in the same cache line. > In any case, in most code in C or similar, the active area of the stack is rarely not in cache. It is used too often, for call/return, for auto variables, and for compiler generated temporaries. Patricia |
|
|
|
|
| Similar Threads | |
| Being cache conscious - How is data read from memory into cache line? Hi, this question is a bit of an odd ball I'm afraid. What I need to know to be cache conscious when implementing a hash table is how an OS Linux/UNIX reads data from RAM... |
|
| Optimizing algorithms for L1 Data Cache (and maybe even L1 Instruction Cache) Hello, Knowning how cpu's work and how they get instructions and data from main memory and which tricks are used is relevant and handy for optimizing algorithms. So I... |
|
| Command-line tool to release cache-memory? Hi! I´d like to know if there is any command-line utility to free up the used cache memory (the one in RAM memory)... I´ve done some research and I´ve seen thousands of... |
|
| 2x2MB L3 Cache, Third Level Cache Memory Hi all, in the registry setting HKLM\CurrentConrtolSet\Control\Session Manager\Memory Management you can adjust the second level cache memory (L2 cache) with the value... |
|
| SHOW MEMORY /CACHE /FULL = 164% Read Hit Rate ??? ES-40 (4x 667-MHz) 3-GB memory VMS V-7.2-1 I increased the VCC_MAXSIZE to 1024000 (512-MB). I do the following: |
|
|
All times are GMT. The time now is 09:39 PM. | Privacy Policy
|