14 Nov 2008 @ 9:50 PM 
 

Hardware

 

AGI enthusiasts may be puzzled by my previous post: Since we don’t understand what exactly intelligence is or how to build it, surely it’s completely backwards to focus in such excruciating detail on the transient technological details of computer components ordered from newegg.com!

In answer, I say this:

  • I suspect that there are many different designs that can exhibit general intelligence. I don’t have any proof of that, but just about anybody fascinated by AGI accepts that at least two such designs exist — brains and one (yet to be found) computer-hosted AGI implementation. If there are two, it seems likely to me that there are more than two. So, given no strong reason to prefer one type of design over another, why not search for designs that are natural fits for the available computing machinery? And that does not mean Turing Machines.
  • I am moving away from AGI as an explicit goal, at least for the time being. Now that my head has come out of the clouds a bit, mundane technology issues seem suddenly relevant to things I might do.
  • It’s interesting. These chips are one of the pinnacles of our technical achievement as a civilization. As a techno-geek, I think the details are glorious.
  • An unhurried step-by-step look at computational building blocks might generate some useful ideas about how to assemble them into more complicated structures.

So, moving on… my new computer is now fully assembled, and I have installed 64-bit Windows Vista on it (curse me if you like, but I’m not really interested in operating system wars and Windows works fine for my needs). Now that it’s working, I can do some analysis and take actual measurements of instruction execution and the memory hierarchy. The figures I arrived at are simple approximations derived from reading specifications, running SiSoft Sandra and writing some short C/Assembly test programs. Detailed performance analysis is a hugely complicated task in general.

Instruction execution performance can be characterized by throughput and latency. Here is what I mean by these:

  • Throughput: how many instructions can complete per clock cycle, assuming no inter-instruction dependencies.
  • Latency: how many cycles it takes for an individual instruction to complete.

Throughput differs from latency because the processor overlaps the execution of multiple instructions, so parts of several instructions can be executing at the same time.
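To make the distinction concrete, here is a minimal sketch (not one of the test programs mentioned above; the function names are made up for illustration). Both functions compute the same dot product. In the first, every addition depends on the previous one, so the loop is limited by latency. In the second, four independent accumulators give the processor unrelated work to overlap, which moves it closer to the throughput limit. How much faster the second version actually runs depends on the compiler and the chip.

```c
#include <stddef.h>

/* One accumulator: each add must wait for the previous add to finish,
   so the loop runs at roughly one element per add latency. */
float dot_chain(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Four accumulators: the four dependency chains are independent,
   so their instructions can overlap in the pipeline. */
float dot_split(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n % 4 != 0 */
        acc0 += a[i] * b[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

Floating-point addition is not associative, so for general inputs the two functions can differ in the last few bits. That is also why a compiler will usually not perform this transformation on its own unless told that reordering is acceptable.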

Consider the instruction mulps xmm0, xmm1. This multiplies the four floating-point numbers held in xmm0, element by element, by the four held in xmm1, and stores the four products back in xmm0 (so it does four floating-point multiplies). This instruction has a latency of 4 cycles, but if the algorithm being computed can do other work in the meantime using other resources, the throughput can reach one instruction per cycle.
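From C, the same operation is usually reached through the SSE intrinsics in xmmintrin.h rather than hand-written assembly. The sketch below assumes a compiler that defines __SSE__ (GCC and Clang on x86-64 do) and falls back to plain scalar code otherwise. The function name mul4 is made up for illustration.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Element-wise product of two groups of four floats:
   out[i] = x[i] * y[i] for i = 0..3. */
void mul4(const float *x, const float *y, float *out)
{
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);              /* unaligned loads, so any pointer works */
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_mul_ps(vx, vy));   /* the multiply typically becomes mulps */
#else
    for (size_t i = 0; i < 4; i++)            /* portable fallback, one lane at a time */
        out[i] = x[i] * y[i];
#endif
}
```

Calling mul4 with x = {1, 2, 3, 4} and y = {5, 6, 7, 8} fills out with {5, 12, 21, 32}.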

Now let’s look at the instruction movaps xmm0, [rax]. This means: move a vector of four floating-point numbers (16 bytes total) from the memory address stored in the 64-bit register rax into the 128-bit register xmm0. The memory address must be a multiple of 16 (that is, 16-byte aligned), otherwise the instruction faults. The throughput of this instruction is as high as one instruction per cycle, but the instruction latency varies wildly depending on whether and where the memory data is cached.
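The alignment requirement is easy to check and easy to satisfy from C. The sketch below is an illustration under a few assumptions: a C11 compiler, __SSE__ defined on x86 builds, and the made-up helper names is_aligned16 and sum_aligned. The intrinsic _mm_load_ps is the aligned load, which compilers generally emit as movaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* True when p is a multiple of 16, which is what movaps requires. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Sum n floats using aligned 16-byte loads.
   Assumes src is 16-byte aligned and n is a multiple of 4. */
float sum_aligned(const float *src, size_t n)
{
    assert(is_aligned16(src) && n % 4 == 0);
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(src + i));  /* aligned load */

    float lanes[4];
    _mm_storeu_ps(lanes, acc);                        /* spill the four partial sums */
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#else
    float acc = 0.0f;                                 /* portable fallback */
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    return acc;
#endif
}
```

A buffer declared as `_Alignas(16) float buf[8];` satisfies the requirement. A pointer four bytes into that buffer does not, and passing it to the aligned load would fault.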

Each core has a 32KB “level 1” data cache with a 3 cycle latency. Most algorithms can get pretty close to filling in the level-1 cache access latency delays with other instructions, so that latency isn’t too much of a problem.

The CPU has 12MB of “level 2” cache. The latency of L2 cache access is approximately 18 cycles. To avoid waiting on the L2 cache, it is important to get the needed data transferred from L2 to L1 before it is needed, and then to work with it for a while before moving on to more data.

The 16GB of main memory on my computer is much worse. The latency for accessing it is on the order of 160 cycles. Since most AGI-related algorithms are likely to operate on vast amounts of data, it is crucial to prefetch data before it is needed, which hides the latency, and then to work with it for a while once it is in cache, because the throughput of main memory is only something like 10GB/sec.
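One common way to observe these latencies is a pointer chase: walk a table in which each entry holds the index of the next one to visit, in random order, so that every load depends on the one before it and nothing can be fetched ahead of time. This is a sketch of that general technique, not the test programs behind the figures above; build_cycle and chase are made-up names, and rand() is used only to keep the example short.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fill next[0..n-1] with a random permutation that forms one single cycle
   through all n slots (Sattolo's algorithm). rand() is a weak generator,
   but any j < i still yields a valid single cycle. */
void build_cycle(size_t *next, size_t n, unsigned seed)
{
    if (n == 0)
        return;
    for (size_t i = 0; i < n; i++)
        next[i] = i;

    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j is in [0, i), never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` hops. Each load needs the result of the
   previous one, so the time per hop approximates the access latency of
   whichever level of the hierarchy the table fits in. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];
    return p;
}
```

Timing a long chase and dividing by the number of hops gives a rough per-access figure. Under these assumptions, a table that fits in 32KB should show something near the level-1 latency, one of a few megabytes something near the L2 latency, and one much larger than 12MB something near the main memory latency. The returned value should be used (printed, for instance) so the compiler cannot discard the loop.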

Two architectural features — Vector Computation and Memory Latency — are the most important concepts to keep in mind when trying to figure out how to make best use of these amazing CPU chips.

Categories: Computer Hardware
Posted By: Derek
Last Edit: 07 Dec 2008 @ 09:45 PM


Responses to this post » (2 Total)

 
  1. Hardware | The Technology 2.0 Blog said...
    10:29 pm - November 14th, 2008

    [...] Here is the original: Hardware [...]

  2. Chuck said...
    11:23 pm - November 14th, 2008

    Great article, Thanks!

 
