



AGI enthusiasts may be puzzled by my previous post: Since we don’t understand what exactly intelligence is or how to build it, surely it’s completely backwards to focus in such excruciating detail on the transient technological details of computer components ordered from newegg.com!
In answer, I say this:
So, moving on… my new computer has now been completely assembled and I installed 64-bit Windows Vista on it (curse me if you like, but I’m not really interested in operating system wars and Windows works fine for my needs). Now that it’s working, I can do some analysis and take actual measurements of instruction execution and the memory hierarchy. The figures I arrived at are rough approximations derived from reading specifications, running SiSoft Sandra, and writing some short C/Assembly test programs. Detailed performance analysis is a hugely complicated task in general.
Instruction execution performance can be characterized with throughput and latency. What I mean by these:
Throughput differs from latency because the processor overlaps the execution of multiple instructions, so parts of several instructions can be executing at the same time.
Consider the instruction mulps xmm0, xmm1. This multiplies the four single-precision floating-point numbers in xmm0, element by element, with the four in xmm1 and stores the four products back in xmm0 (so it does four floating-point multiplies). This instruction has a latency of 4 cycles, but if the algorithm being computed can do other work in the meantime using other resources, the throughput can reach one instruction per cycle.
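To make the latency/throughput distinction concrete, here is a minimal sketch in C using SSE intrinsics (`_mm_mul_ps` is the intrinsic that normally maps to mulps). It is not one of my actual test programs; the function names are made up for illustration, and the choice of four accumulators is an assumption based on the 4-cycle latency quoted above. The first function forms a single dependency chain, so each multiply has to wait for the previous one. The second keeps four independent chains going, which gives the processor a chance to overlap them.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_loadu_ps, ... */

/* Hypothetical helper: multiplies n_vectors 4-float vectors together,
   lane by lane, using ONE accumulator. Every multiply depends on the
   previous result, so the loop is bound by multiply latency. */
void product_one_chain(const float *data, size_t n_vectors, float out[4])
{
    __m128 acc = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < n_vectors; i++)
        acc = _mm_mul_ps(acc, _mm_loadu_ps(data + 4 * i));
    _mm_storeu_ps(out, acc);
}

/* Same lane-wise product, but with FOUR independent accumulators.
   The four multiplies in the loop body do not depend on each other,
   so they can be in flight at the same time. */
void product_four_chains(const float *data, size_t n_vectors, float out[4])
{
    assert(n_vectors % 4 == 0);  /* keeps the sketch short: no remainder loop */
    __m128 a0 = _mm_set1_ps(1.0f), a1 = a0, a2 = a0, a3 = a0;
    for (size_t i = 0; i < n_vectors; i += 4) {
        a0 = _mm_mul_ps(a0, _mm_loadu_ps(data + 4 * (i + 0)));
        a1 = _mm_mul_ps(a1, _mm_loadu_ps(data + 4 * (i + 1)));
        a2 = _mm_mul_ps(a2, _mm_loadu_ps(data + 4 * (i + 2)));
        a3 = _mm_mul_ps(a3, _mm_loadu_ps(data + 4 * (i + 3)));
    }
    /* Combine the four partial products at the end. */
    __m128 acc = _mm_mul_ps(_mm_mul_ps(a0, a1), _mm_mul_ps(a2, a3));
    _mm_storeu_ps(out, acc);
}
```

One caveat: floating-point multiplication is not associative, so the two versions can differ in the last bits for general inputs. That is also why a compiler will not do this transformation for you unless you explicitly allow it.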
Now let’s look at the instruction movaps xmm0, [rax]. This means: move a vector of four floating-point numbers (16 bytes total) from the memory address stored in the 64-bit register rax into the 128-bit register xmm0. The memory address must be a multiple of 16 (that is, the data must be 16-byte aligned). The throughput of this instruction is as high as one instruction per cycle, but the instruction latency varies wildly depending on whether and where the memory data is cached.
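From C, the alignment requirement shows up like this. The sketch below is only an illustration: the helper names are hypothetical, and `_mm_malloc`/`_mm_free` are the allocation helpers that GCC, Clang and MSVC ship alongside the SSE intrinsics. `_mm_load_ps` and `_mm_store_ps` are the aligned load/store intrinsics, so the pointer passed to them has to be a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_load_ps, _mm_store_ps, _mm_malloc, _mm_free */

/* Hypothetical helper: true if p is a multiple of 16. */
int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical helper: allocates n floats on a 16-byte boundary.
   Release the result with _mm_free(), not free(). */
float *alloc_floats_aligned(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* Scales n floats in place, four at a time, using aligned loads
   and stores. Assumes buf is 16-byte aligned and n is a multiple of 4. */
void scale_aligned(float *buf, size_t n, float factor)
{
    assert(is_16_byte_aligned(buf));
    assert(n % 4 == 0);
    __m128 f = _mm_set1_ps(factor);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(buf + i);          /* aligned 16-byte load */
        _mm_store_ps(buf + i, _mm_mul_ps(v, f));  /* aligned 16-byte store */
    }
}
```

If the data cannot be guaranteed to be aligned, `_mm_loadu_ps` is the unaligned counterpart.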
Each core has a 32KB “level 1” data cache with a 3 cycle latency. Most algorithms can get pretty close to filling in the level-1 cache access latency delays with other instructions, so that latency isn’t too much of a problem.
The CPU has 12MB of “level 2” cache. The latency of L2 cache access is approximately 18 cycles. So getting the needed data transferred from L2 to L1 before it is needed, and then working with it for a while before moving on to more data, is important to avoid waiting for access to the L2 cache.
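One common way to "work with it for a while" is to process the data in blocks that fit in L1 and finish all the work on one block before touching the next. The sketch below is a toy illustration, not a measurement: the function names are made up, the per-element operation is arbitrary, and the block size of 4096 floats (16KB, half of the 32KB L1) is simply an assumption that would need tuning.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_FLOATS 4096  /* 16KB per block: an assumed size, half of a 32KB L1 */

/* An arbitrary stand-in for "some work" done on every element. */
static void one_pass(float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] * 0.5f + 1.0f;
}

/* Every pass sweeps the whole array, so for a large array each pass
   has to pull the data in from L2 or main memory again. */
void passes_unblocked(float *data, size_t n, int passes)
{
    assert(passes >= 0);
    for (int p = 0; p < passes; p++)
        one_pass(data, n);
}

/* All passes are applied to one L1-sized block before moving on,
   so after the first pass the block is already in the L1 cache. */
void passes_blocked(float *data, size_t n, int passes)
{
    assert(passes >= 0);
    for (size_t start = 0; start < n; start += BLOCK_FLOATS) {
        size_t len = (n - start < BLOCK_FLOATS) ? (n - start) : BLOCK_FLOATS;
        for (int p = 0; p < passes; p++)
            one_pass(data + start, len);
    }
}
```

Both versions compute the same result, because each element goes through the same sequence of operations. This reordering only works when the passes are independent per element (or per block); algorithms with dependencies across the whole array need more care.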
The 16GB of main memory on my computer is much worse. The latency for accessing it is something on the order of 160 cycles. Since most AGI-related algorithms are likely to operate on vast amounts of data, it is crucial to try hard to prefetch data before it is needed, and then to work with it for a while once it is in cache, because the throughput of main memory is only something like 10GB/sec.
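Here is a rough sketch of what software prefetching can look like in C, using the `_mm_prefetch` intrinsic. The function name and the prefetch distance of 8 iterations are assumptions for illustration only; the right distance depends on how much work each iteration does and has to be found by measurement. The example gathers values from a table through an index array, which is the kind of irregular access pattern the hardware cannot easily predict on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

#define PREFETCH_DISTANCE 8  /* hypothetical: how many iterations to look ahead */

/* Sums table[idx[0]] + table[idx[1]] + ... + table[idx[n-1]].
   While working on element i, it asks the memory system to start
   fetching the element that will be needed PREFETCH_DISTANCE
   iterations later, so the fetch overlaps with useful work. */
float sum_gather_prefetched(const float *table, const size_t *idx, size_t n)
{
    assert(n == 0 || (table != NULL && idx != NULL));
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&table[idx[i + PREFETCH_DISTANCE]],
                         _MM_HINT_T0);
        total += table[idx[i]];
    }
    return total;
}
```

A prefetch is only a hint: it does not change the result, and a badly chosen distance can cost more than it saves. For plain sequential scans the processor's own hardware prefetcher usually does a good job already, so explicit prefetching tends to matter most for scattered accesses like this one.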
Two architectural features, vector computation and memory latency, are the most important concepts to keep in mind when trying to figure out how to make the best use of these amazing CPU chips.










11:23 pm - November 14th, 2008
Great article, Thanks!