keyongtech


  keyongtech > linux.development.apps > 08/2009

 #1  
08-23-09, 12:38 PM
Sune
Hi,

this question is a bit of an odd ball I'm afraid. What I need to know
to be cache conscious when implementing a hash table is how an OS
Linux/UNIX reads data from RAM into a cache line.

Let's assume I have:
* a CPU with a cache line size of 64 bytes
* the page size of my OS set to 4 kb
* my OS is a 32-bit system
* my own memory allocator serving me one full page at a time

How do I need to layout my data elements on a page to ensure they are
placed cache consciously?

Eg:

1)
A page is allocated. It is empty apart from a page header of 40
bytes.

2)
I want to insert a Data element1 of 50 bytes into the page.

Questions:

1)
Do I need to align the page placement with my cache line size? Meaning
I have to write my element with an offset of 64 bytes from the page
start, leaving 14 bytes un-used?
Page -------------------
I Page header 40bytes I 16 bytes un-used I Data element1 50 bytes

That would mean wasting a lot of space, especially since Data element2
would also have to take cache line size into account leaving another
14 un-used.

Please, say it isn't so...


2)
Say I have a function blork with the following variables declared in
C:

void blork(void)
{
struct foo {
int var1;
int var2;
};

int var3;
int var4
.....

Will the cache line look like this:
var1, var2, var3, var4....

or
var3, var4....

Thanks in advance
/Sune
 #2  
08-23-09, 02:50 PM
Patricia Shanahan
Sune wrote:
> Hi,
>
> this question is a bit of an odd ball I'm afraid. What I need to know
> to be cache conscious when implementing a hash table is how an OS
> Linux/UNIX reads data from RAM into a cache line.


The operating system does not manage the cache directly. Cache
replacement happens far too frequently for OS involvement. It is built
built into the cache hardware.

Moreover, many processors have more than one cache. A processor may
have, for example, a large off-processor cache implemented as SRAM, a
smaller but faster on-chip cache, and a very small, very fast cache
physically close to the part of the processor that actually executes
instructions. At any given level, data and code may be kept in the same
cache or in different caches. In multiprocessor, multiple core, and
multiple thread situations there are choices about which caches are
shared and how the sharing works. The various caches may have different
line sizes.

Do you understand "set associativity"?

Do you have a higher level objective than "being cache conscious"? There
are general rules that help e.g. if your objective is to get good
performance for various types of data. Are you dealing with data that
can be modified and is actively shared between processors? That creates
different issues from data that is either infrequently shared or
infrequently modified.

If you do need the exact rules, you need to know the processor make and
model, and read the specifications for that processor.

Patricia
 #3  
08-23-09, 03:06 PM
Sune
On 23 Aug, 15:50, Patricia Shanahan <p> wrote:
[..]
> are general rules that help e.g. if your objective is to get good
> performance for various types of data. Are you dealing with data that
> can be modified and is actively shared between processors? That creates
> different issues from data that is either infrequently shared or
> infrequently modified.
>
> If you do need the exact rules, you need to know the processor make and
> model, and read the specifications for that processor.
>
> Patricia


Hi Patricia,

thanks for trying to set me straight. Ok, so my concern is
performance, and a way to improve performance is to place data in
memory according to its use. The issue get's more complicated with SMP
so let's skip this for a while and discuss a single processor and L1
data cache and data placement in memory only. I think this is called
spatial locality.

Given this and the knowledge of my using AMD and Intel processors, can
you help me with the 2 questions above?

Thanks
/Sune
 #4  
08-23-09, 03:33 PM
Patricia Shanahan
Sune wrote:
> On 23 Aug, 15:50, Patricia Shanahan <p> wrote:
>
> Hi Patricia,
>
> thanks for trying to set me straight. Ok, so my concern is
> performance, and a way to improve performance is to place data in
> memory according to its use. The issue get's more complicated with SMP
> so let's skip this for a while and discuss a single processor and L1
> data cache and data placement in memory only. I think this is called
> spatial locality.
>
> Given this and the knowledge of my using AMD and Intel processors, can
> you help me with the 2 questions above?

....

Here are some quick notes. I may post some more advice later. These
ideas depend only on the general outline of how caches are managed, so
following them will work on many processors. The only code I've
consciously tuned for a specific processor's cache design is some
benchmark-critical code such as a matrix multiply that was going to
affect Linpack benchmark performance.

The main thing you can control is spacial locality. Referencing a single
byte brings in a whole cache line. You want the rest of the cache line
to contain data that will be used in the near future. Do not add any
padding - padding is certain to be unused. (The SMP issue is important,
because in some multiprocessor situations it is important to avoid
putting certain variables in the same cache line. If so, padding can be
useful.)

Ideally, keep access-related data together. Unfortunately, programmer
estimates of usage patterns tend to be reliable only for large data
structures that are accessed by a relatively simple algorithm. I've had
some surprises using a logic analyzer connected to the processor-memory
bus to see what is really happening.

Keep your data small. Caches work best when the program can run for
significant periods without going outside a cache.

For memory intensive access to large data structures, try to work in
memory address order as much as possible. That both increases spacial
re-use, and many processors have sequential access prefetch capability
that will bring in the next line before there is a cache miss for it.

Cache collisions involve cache line addresses that differ by a large
power of two. If you need to do something like matrix transposition that
is likely to involve some big stride memory access, structure the matrix
to avoid large strides that are powers of two or close to powers of two.

Once you get some data into cache, do as much as possible with it. Loop
tiling can help if you have nested loops:
[url down]

Patricia
 #5  
08-23-09, 05:27 PM
Sune
On 23 Aug, 16:33, Patricia Shanahan <p> wrote:
[..]
>
> Cache collisions involve cache line addresses that differ by a large
> power of two. If you need to do something like matrix transposition that
> is likely to involve some big stride memory access, structure the matrix
> to avoid large strides that are powers of two or close to powers of two.
>
> Once you get some data into cache, do as much as possible with it. Loop
> tiling can help if you have nested loops:[..]
>
> Patricia


Ok, thanks. I get what you are saying.

So now the only thing left is my questions ;-) Maybe I don't know
enough to be able to ask the right things...

I allocate a page (page aligned) which starts on address 0xA000.
Immediately there is a header of 40 bytes. How should I place my
data ? For example if I assume I need to store my data in cache line
aligned this is what I need to do for question1 above:

H is header
? is padding to make the data layout cache line aligned

0xA000:
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH?????????? ??????????????
0xA040:
DDDDDDDDDDDDDataElement111111111111111111111111111 1111111???????????????
0xA080:
DDDDDDDDDDDDDataElement222222222222222222222222222 2222???????????????
:

If I have a pointer referencing address 0xA050 (the E in DataElement1)
the following is read into the cache line:
DDDDDDDDDDDDDataElement111111111111111111111111111 1111111???????????????

Am I correct?

Thanks again
/Sune
 #6  
08-23-09, 06:19 PM
Patricia Shanahan
Sune wrote:
....
> So now the only thing left is my questions ;-) Maybe I don't know
> enough to be able to ask the right things...
>
> I allocate a page (page aligned) which starts on address 0xA000.
> Immediately there is a header of 40 bytes. How should I place my
> data ? For example if I assume I need to store my data in cache line
> aligned this is what I need to do for question1 above:


*Why* do you assume you need to store your data cache line aligned?

Patricia
 #7  
08-23-09, 06:22 PM
H. S. Lahman
Responding to Sune...

> this question is a bit of an odd ball I'm afraid. What I need to know
> to be cache conscious when implementing a hash table is how an OS
> Linux/UNIX reads data from RAM into a cache line.


Shanahan's advice is very good. However, I would like to push back on
why you want to optimize at this level at all...

> Say I have a function blork with the following variables declared in
> C:
>
> void blork(void)
> {
> struct foo {
> int var1;
> int var2;
> };
>
> int var3;
> int var4
> ....
>
> Will the cache line look like this:
> var1, var2, var3, var4....
>
> or
> var3, var4....


It is very difficult to optimize CPU caching at the 3GL level. You
really need to do it at the Assembly level. That's because the hardware
caching algorithm is optimized around the ALU pipeline and instruction
set so you really need to understand what instructions are used and what
their order is to do it properly. But a good 3GL optimizing compiler
will generate object code that is not at all intuitive from the source
listing and can be very difficult to understand. [Some compilers that
have an option to print line-by-line Assembly with the source do not
even optimize when they do that; the Assembly displayed is what you
would get in an unoptimized debug mode compilation. And that would not
do you much good.]

Essentially you need to look at the compiler's Assembly output and
design macros that will force access the of data in a preferred
instruction order given the CPU caching algorithm (in addition to
modifying the data structures to play well with the macros!). That would
be <relatively> easy in a language like BLISS that has a good macro
facility but it will be both very difficult and tedious in C.

In addition, such optimization will be very dependent on what you are
doing with the data. So if you access the same data in different
execution contexts (i.e., needing different instructions) the
optimization might be fine for one context but worse than the compiler's
way of doing things for another. IOW, you may need multiple sets of
macros for accessing the same data in different contexts.

Finally, a good 3GL optimizing compiler chooses instructions and puts
them in a particular order for good reasons. So optimizing the
instruction order just to optimize the CPU cache could negate the
compiler's attempts to optimize other things for a net loss in
performance -- though that should be rare unless the macros force
different instructions to be used. (I suggest using macros above because
you probably only want to override the compiler optimization in
particular situations.)

Corollary: Since such low level optimization is highly dependent on the
hardware platform, your optimization can become obsolete when the
hardware is upgraded. Thus your custom optimization may hard-wire in a
performance hit, especially if the compiler is also upgraded.

Bottom line: the days of ordering FORTRAN COMMONs by variable size and
using two additions rather than one multiplication by 3 are long gone.
It is usually not worth the trouble to do optimization at the CPU cache
level unless you have a highly specialized application. Generally
speaking, managing the overall memory access via things like nested loop
order is much more effective because the scale is larger, especially
when the application is large enough to have VM page faulting problems.

Usually you can do that by simply looking at your critical 3GL
procedures and viewing each data element access as loading a register
from memory. If you view those register loads as defining clumps of data
that should be close to adjacent in memory and/or all accessed in a
closed scope, you can then optimize your data structures, iterations,
and statements so that happens. (The optimizations one routinely does to
minimize VM page faulting are useful and usually will be sufficient,
even if VM paging itself isn't an issue.) Then the hardware caching
algorithm will almost always do the right thing.

FWIW I would not do such optimization unless I had hard data to indicate
it was really needed and I would look closely at the data to make sure
it wasn't the data collection itself that was skewing the results.
Unless you are doing embedded systems with a brain dead processor, have
hard real-time constraints, or have to use a cretinous C compiler, I
think it would be highly unlikely that worrying about the CPU cache
itself would be worthwhile.
 #8  
08-23-09, 07:00 PM
Sune
On 23 Aug, 19:19, Patricia Shanahan <p> wrote:
> Sune wrote:
>
> ...
>>

> *Why* do you assume you need to store your data cache line aligned?
>
> Patricia


For example, this:
'INSN_CACHE_LINE_WIDTH
The length in bytes of each cache line. The cache is divided into
cache lines which are disjoint slots, each holding a contiguous chunk
of data fetched from memory. Each time data is brought into the cache,
an entire line is read at once. The data loaded into a cache line is
always aligned on a boundary equal to the line size. '

is stated about gcc here:
http://www.delorie.com/gnu/docs/gcc/gccint_136.html

Similar statements can be found here:
[url down]

If I'm misinterpreting this I'd obviously need some support.

Thanks again
/Sune
 #9  
08-23-09, 07:07 PM
Sune
On 23 Aug, 19:22, "H. S. Lahman" <hlah> wrote:
[..]
> think it would be highly unlikely that worrying about the CPU cache
> itself would be worthwhile.
>
> --
> Life is the only flaw in an otherwise perfect nonexistence
>     -- Schopenhauer
>
> H. S. Lahman
> H.lah...@verizon.net
> software blog:[..]


Ok, point taken. But if we loose the alignment on the stack and just
go on to this:

I allocate a page (page aligned) which starts on address 0xA000.
Immediately there is a header of 40 bytes. How should I place my
data ? For example if I assume I need to store my data in cache line
aligned this is what I need to do for question1 above:

H is header
? is padding to make the data layout cache line aligned

0xA000:
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH?????????? ??????????????
0xA040:
DDDDDDDDDDDDDataElement111111111111111111111111111 1111111???????????????
0xA080:
DDDDDDDDDDDDDataElement222222222222222222222222222 2222???????????????
:

If I have a pointer referencing address 0xA050 (the E in DataElement1)
the following is read into the cache line:
DDDDDDDDDDDDDataElement111111111111111111111111111 1111111???????????????

The reason for doing this would be to assure that I don't use 2 cache
lines for data that easily fits into 1.

Comments?

BRs
/Sune
 #10  
08-23-09, 08:00 PM
Patricia Shanahan
Sune wrote:
....
> The reason for doing this would be to assure that I don't use 2 cache
> lines for data that easily fits into 1.
>
> Comments?


You seem to be very focussed on a single memory reference. That is
usually a mistake when thinking about caches. If you are not getting
*many* uses out of a typical cache miss you are in deep performance
trouble.

If your data structure and algorithms are such that you are doing a lot
of scattered references with no possibility of cache reuse, you need to
reexamine your choices at a far higher level.

Patricia
 #11  
08-23-09, 09:13 PM
Moi
On Sun, 23 Aug 2009 12:00:26 -0700, Patricia Shanahan wrote:

> Sune wrote:
> ...
>
> You seem to be very focused on a single memory reference. That is
> usually a mistake when thinking about caches. If you are not getting
> *many* uses out of a typical cache miss you are in deep performance
> trouble.
>
> If your data structure and algorithms are such that you are doing a lot
> of scattered references with no possibility of cache reuse, you need to
> reexamine your choices at a far higher level.


I agree with Patricia's advice:
0) "travel light": don't make your data structure larger than needed
1) once a cache slot is "faulted in" , make sure most of it's contents
are valuable to the program at that time: avoid pulling in a cache
line because you need only one byte.

Since your application appears to be a hashtable: in most cases linear
probing has a better locality then chaining.
(chaining is "cleaner" if your app requires a lot of deletes, or if the
table's size is very variable).
So the choise of your hashing/overflow scheme will depend on the usage
pattern of your hash-table. YMMV.

2) if your data happens to be on a diskpage: forget about L1/L2 cache;
the performance will be dominated by disk I/O anyway.

HTH,
AvK
 #12  
08-23-09, 09:14 PM
James Harris
On 23 Aug, 19:07, Sune <sune_ahlg> wrote:

....

[..]
> :
>
> If I have a pointer referencing address 0xA050 (the E in DataElement1)
> the following is read into the cache line:
> DDDDDDDDDDDDDataElement111111111111111111111111111 1111111???????????????
>
> The reason for doing this would be to assure that I don't use 2 cache
> lines for data that easily fits into 1.
>
> Comments?


If I understand correctly you are concerned about your data splitting
across cache lines. Your quote earlier also suggests you think this
will be a problem. In your first post you say that you want to
implement a hash table and, to an extent, you are right to have a
small concern over cache performance - but only a small one.

First, you should be aware that if it is OK if a 50-byte element is
split over two cache lines. Once both parts are in cache the CPU will
not slow down when you change from accessing the part in the first
cache line to accessing the part in the second cache line. You can
treat both cache lines as the same. They will both be equally fast.

Second, caches work best a) when you repeatedly access the same areas
of memory or b) when you step through memory sequentially. In the
latter case they effectively read ahead for bytes you will want in the
future.

Since you are implementing a hash table you will have little
sequential access but the cache will retain data already read and you
might also get some benefit when your hash function indexes entries
adjacent to ones you aready have cached. For example, given a 64-byte
line size, if you have cached the last 10 bytes of one entry you will
also have cached the first 54 bytes of the next.

Third, modern machines have multiple levels of cache. Your outer
levels may have larger cache lines. Fetching from these will still be
quicker than fetching from memory.

Fourth, if you don't need to access all 50 bytes at once you can split
your 50-byte records into smaller parts and store then in two hash
tables. HOWEVER, this is rarely a good idea. It can help for much
larger records but I doubt it will in this case.

Fifth, and this is the most important point for performance, if your
50-byte records contain any 16-bit, 32-bit or 64-bit quantities, make
sure you keep those values aligned to their natural boundaries. So,
for example, 32-bit values that you load and store from 32-bit
registers would be aligned to 4-byte boundaries.

So, re. your question on alignment you'd normally be best to place
your records adjacent to each other but ensure that any values within
the records are properly aligned. Don't use any padding other than
that required for field alignment.

To reiterate: it's ok to split records across cache lines even if they
will fit into one line. Just make sure your field alingments are
right.

James
 #13  
08-23-09, 09:31 PM
Patricia Shanahan
Moi wrote:
> On Sun, 23 Aug 2009 12:00:26 -0700, Patricia Shanahan wrote:
>> I agree with Patricia's advice:

> 0) "travel light": don't make your data structure larger than needed
> 1) once a cache slot is "faulted in" , make sure most of it's contents
> are valuable to the program at that time: avoid pulling in a cache
> line because you need only one byte.
>
> Since your application appears to be a hashtable: in most cases linear
> probing has a better locality then chaining.


I agree with this, except for the idea of hashtable as an application. A
hashtable is a data structure, and associated algorithms, implementing a
mapping function. If the program is in serious performance trouble
due to cache misses in map lookup, I would question the choice of data
structure.

Patricia
 #14  
08-23-09, 09:38 PM
Gordon Burditt
>this question is a bit of an odd ball I'm afraid. What I need to know
>to be cache conscious when implementing a hash table is how an OS
>Linux/UNIX reads data from RAM into a cache line.


The OS doesn't. The CPU does. It's handled at a level below the OS.
If there were cache-loading code, it's something you'd definitely want
in cache.

>Let's assume I have:
>* a CPU with a cache line size of 64 bytes
>* the page size of my OS set to 4 kb
>* my OS is a 32-bit system

The bit-ness of a CPU or OS is meaningful description of anything.
For example, 32-bit Windows and 32-bit Linux are *MUCH* more
different than 32-bit Linux and 64-bit Linux.

>* my own memory allocator serving me one full page at a time


>How do I need to layout my data elements on a page to ensure they are
>placed cache consciously?


Place variables that are USED together PHYSICALLY together (e.g.
in the same struct), where possible.

>1)
>A page is allocated. It is empty apart from a page header of 40
>bytes.
>
>2)
>I want to insert a Data element1 of 50 bytes into the page.
>
>Questions:
>
>1)
>Do I need to align the page placement with my cache line size? Meaning
>I have to write my element with an offset of 64 bytes from the page
>start, leaving 14 bytes un-used?


First, performance is not required, so cache optimization is not
required.

One of the basic principles of cache *pessimization* is to spread
everything out so that no two variables are in the same cache line,
to maximize the number of cache misses. A cache miss vs. a cache
hit is a much bigger performance difference than a variable at the
end of the cache line vs. a variable at the beginning of a cache
line.

Put variables that are used together physically together, so hopefully
they are in the same cache line. Scan multi-dimensional arrays in
physical order if possible.

>Say I have a function blork with the following variables declared in
>C:
>
>void blork(void)
>{
>struct foo {
>int var1;
>int var2;
>};
>
>int var3;
>int var4
>....
>
>Will the cache line look like this:
>var1, var2, var3, var4....
>
>or
>var3, var4....


There is no guarantee that the order of declaration of auto variables
has any effect at all on their ordering in memory. The ordering
might be based on a SHA-256 hash of the variable *name*. 4 ints
will usually fit in the same cache line.
 #15  
08-23-09, 09:50 PM
Patricia Shanahan
Gordon Burditt wrote:
.....
>
> There is no guarantee that the order of declaration of auto variables
> has any effect at all on their ordering in memory. The ordering
> might be based on a SHA-256 hash of the variable *name*. 4 ints
> will usually fit in the same cache line.
>


In any case, in most code in C or similar, the active area of the stack
is rarely not in cache. It is used too often, for call/return, for auto
variables, and for compiler generated temporaries.

Patricia

Similar Threads
Being cache conscious - How is data read from memory into cache line?

Hi, this question is a bit of an odd ball I'm afraid. What I need to know to be cache conscious when implementing a hash table is how an OS Linux/UNIX reads data from RAM...

Optimizing algorithms for L1 Data Cache (and maybe even L1 Instruction Cache)

Hello, Knowning how cpu's work and how they get instructions and data from main memory and which tricks are used is relevant and handy for optimizing algorithms. So I...

Command-line tool to release cache-memory?

Hi! I´d like to know if there is any command-line utility to free up the used cache memory (the one in RAM memory)... I´ve done some research and I´ve seen thousands of...

2x2MB L3 Cache, Third Level Cache Memory

Hi all, in the registry setting HKLM\CurrentConrtolSet\Control\Session Manager\Memory Management you can adjust the second level cache memory (L2 cache) with the value...

SHOW MEMORY /CACHE /FULL = 164% Read Hit Rate ???

ES-40 (4x 667-MHz) 3-GB memory VMS V-7.2-1 I increased the VCC_MAXSIZE to 1024000 (512-MB). I do the following:


All times are GMT. The time now is 09:39 PM. | Privacy Policy