14 Nov 2008 @ 9:50 PM 

AGI enthusiasts may be puzzled by my previous post: Since we don’t understand what exactly intelligence is or how to build it, surely it’s completely backwards to focus in such excruciating detail on the transient technological details of computer components ordered from newegg.com!

In answer, I say this:

  • I suspect that there are many different designs that can exhibit general intelligence. I don’t have any proof of that, but just about anybody fascinated by AGI accepts that at least two such designs exist — brains and one (yet to be found) computer-hosted AGI implementation. If there are two, it seems likely to me that there are more than two. So, given no strong reason to prefer one type of design over another, why not search for designs that are natural fits for the available computing machinery? And that does not mean Turing Machines.
  • I am moving away from AGI as an explicit goal, at least for the time being. Now that my head has come out of the clouds a bit, mundane technology issues seem suddenly relevant to things I might do.
  • It’s interesting. These chips are one of the pinnacles of our technical achievement as a civilization. As a techno-geek, I think the details are glorious.
  • An unhurried step-by-step look at computational building blocks might generate some useful ideas about how to usefully assemble them into more complicated structures.

So, moving on… my new computer has now been completely assembled and I installed 64-bit Windows Vista on it (curse me if you like, but I’m not really interested in operating system wars and Windows works fine for my needs). Now that it’s working, I can do some analysis and take actual measurements of instruction execution and the memory hierarchy. The figures I arrived at are simple approximations derived from reading specifications, running SiSoft Sandra and writing some short C/Assembly test programs. Detailed performance analysis is a hugely complicated task in general.

Instruction execution performance can be characterized with throughput and latency. What I mean by these:

  • Throughput: how many instructions can be executed per clock cycle, assuming no inter-instruction dependencies.
  • Latency: how many cycles it takes for an individual instruction to complete.

Throughput differs from latency because the processor overlaps the execution of multiple instructions, so parts of several instructions can be executing at the same time.

Consider the instruction mulps xmm0, xmm1. This multiplies each of the four floating-point numbers in xmm0 by the corresponding number in xmm1 and stores the four products back in xmm0 (so it does four floating-point multiplies). This instruction has a latency of 4 cycles, but if the algorithm being computed can do other work in the meantime using other resources, the throughput can reach one instruction per cycle.
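The difference between latency and throughput can be sketched in plain C. This is a scalar analogy, not the author's test program: the function names are made up for illustration, and whether the compiler actually emits mulps depends on its settings. The first loop forms one long dependency chain, so each multiply has to wait for the previous one; the second keeps four independent chains going, which gives the processor other work to do while each multiply is in flight.

```c
#include <stddef.h>

/* One accumulator: every multiply depends on the previous result,
   so the loop is limited by multiply latency. */
float product_chained(const float *x, size_t n) {
    float acc = 1.0f;
    for (size_t i = 0; i < n; i++)
        acc *= x[i];
    return acc;
}

/* Four independent accumulators: the multiplies within one iteration
   do not depend on each other, so they can overlap in the pipeline. */
float product_interleaved(const float *x, size_t n) {
    float a0 = 1.0f, a1 = 1.0f, a2 = 1.0f, a3 = 1.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 *= x[i];
        a1 *= x[i + 1];
        a2 *= x[i + 2];
        a3 *= x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        a0 *= x[i];
    return (a0 * a1) * (a2 * a3);
}
```

Both functions compute the same product mathematically, but the second one multiplies in a different order, so in general the floating-point rounding can differ slightly.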

Now let’s look at the instruction movaps xmm0, [rax]. This means: move a vector of four floating-point numbers (16 bytes total) from the memory address stored in the 64-bit register rax into the 128-bit register xmm0. The memory address must be a multiple of 16 (that is, 16-byte aligned). The throughput of this instruction is as high as one instruction per cycle, but the instruction latency varies wildly depending on whether and where the memory data is cached.
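The alignment requirement is easy to check and to satisfy from C. The sketch below is illustrative only: the helper names are hypothetical, and it assumes a C11 library that provides aligned_alloc (compilers that lack it, such as older Microsoft ones, offer their own aligned allocation routines instead).

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p sits on a 16-byte boundary, which is what movaps
   needs for its memory operand. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Allocates room for n_vectors vectors of four floats (16 bytes each),
   starting on a 16-byte boundary. Release the memory with free(). */
float *alloc_vectors(size_t n_vectors) {
    return aligned_alloc(16, n_vectors * 16);
}
```

A pointer returned by alloc_vectors passes is_aligned16, while the same pointer offset by a single float (4 bytes) does not.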

Each core has a 32KB “level 1” data cache with a 3 cycle latency. Most algorithms can get pretty close to filling in the level-1 cache access latency delays with other instructions, so that latency isn’t too much of a problem.

The CPU has 12MB of “level 2” cache. The latency of L2 cache access is approximately 18 cycles. So getting the needed data transferred from L2 to L1 before it is needed, and then working with it for a while before moving on to more data, is important to avoid waiting for access to the L2 cache.

The 16GB of main memory on my computer is much worse. The latency for accessing it is something on the order of 160 cycles. Since most AGI-related algorithms are likely to operate on vast amounts of data, it is crucial to try hard to prefetch data before it is needed, and then work with it for a while — because the throughput of main memory is only something like 10GB/sec.
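One way to ask for data ahead of time is an explicit software prefetch. The sketch below is a minimal illustration under several assumptions: it uses the GCC/Clang builtin __builtin_prefetch (other compilers expose prefetching differently), it assumes 64-byte cache lines, and the prefetch distance is a made-up tuning value rather than a measured one. For a simple sequential scan like this the hardware prefetcher may already do the job, so any benefit would need to be measured.

```c
#include <stddef.h>

/* How far ahead to prefetch, in floats. Hypothetical tuning value:
   64 floats = 256 bytes = four 64-byte cache lines. */
#define PREFETCH_AHEAD 64

/* Sums an array while requesting data a few cache lines ahead of use. */
float sum_with_prefetch(const float *data, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* One request per 16 floats (one 64-byte line); the arguments
           0, 0 mean "read access, no need to keep it cached afterwards". */
        if (i % 16 == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 0);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same as a plain summation loop; only the timing can change.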

Two architectural features — Vector Computation and Memory Latency — are the most important concepts to keep in mind when trying to figure out how to make best use of these amazing CPU chips.

Tags Categories: Computer Hardware Posted By: Derek
Last Edit: 07 Dec 2008 @ 09:45 PM

 02 Nov 2008 @ 2:51 AM 

As I write this, the fastest fairly normal PC CPU is the just-released Intel Xeon X7460, a 2.66 GHz 6-core monster.  Seems like I’ve been stuck in boring quad-core land forever, and six cores would be a nifty upgrade.  Unfortunately, the top-end ultra sexy chips aren’t cheap and in fact it looks like the X7460 at present costs several thousand dollars apiece.  I’m not willing to pay that, so now the question of which CPU to get is a performance/cost tradeoff.  Sparing you the boring details of my bargain hunting and comparison shopping, I settled on a set of two Intel Xeon E5410.  These are 2.33 GHz quad-core processors with 12MB of L2 cache and a 1333 MHz bus speed.

It’s remarkable to me how similar in spirit the architecture of current mainstream processors like this Xeon is to that of the old Thinking Machines CM-5, which I wrote about here a while ago.  In both, the basic idea is to have a number of roughly independent computer processors operating in parallel, and each of those processors has the capability of vector processing, that is, performing computations on several array elements at the same time.  Most commonly that involves floating point arithmetic.

The four cores of each E5410 together run through about 9.3 billion clock cycles per second, which at one instruction per cycle means about 9.3 billion instructions per second.  Instructions that use the SSE vector (SIMD) unit can do several operations at once.  Probably the most useful case for me involves four simultaneous operations (each operating on a 32-bit number), so that’s a max of 37.3 billion operations per chip.  74.6 billion for both CPUs combined.  Nice.
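The arithmetic behind those peak figures can be written out as a small C function. It is only a back-of-the-envelope model: the function name is made up, and it assumes every core retires exactly one vector instruction per cycle.

```c
/* Peak vector operations per second, assuming one vector instruction
   per core per cycle. lanes is the number of values per vector
   (4 for 32-bit numbers in a 128-bit register). */
double peak_ops_per_sec(int chips, int cores_per_chip,
                        double clock_hz, int lanes) {
    return (double)chips * cores_per_chip * clock_hz * lanes;
}
```

With one chip, four cores, 2.33e9 Hz and four lanes this gives 37.28 billion operations per second, and doubling the chip count gives 74.56 billion, which round to the figures above.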

The challenge comes in keeping the computation units fed.  According to the documentation, each core has a small amount of directly-accessible memory, and a few megabytes per core of L2 cache, but algorithms manipulating large amounts of data will need continual access to a lot more memory than that… which means rapidly communicating data from the main memory banks to the CPUs. That path appears to have 10.6 GB/sec of bandwidth for each chip (which would total 21 for both chips).  That’s 2.65 GB/sec per core.  A vector register is 128 bits (16 bytes), so each core can do about 166 million register loads/saves per second, or one every 14 cycles.  Thus the required arithmetic intensity to keep the chips busy is about 14 vector operations per register load or save, in the worst case where the cache is ineffective.  I’ll put together some simple benchmark programs to make sure these numbers are right.
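The same estimate can be expressed as a short calculation. This is a rough model using the figures quoted above; the function and parameter names are invented for illustration, and it assumes memory bandwidth is shared evenly among the cores and that the cache gives no help at all.

```c
/* Clock cycles available to each core per vector load or save when all
   data has to stream from main memory. */
double cycles_per_vector_load(double chip_bandwidth_bytes_per_sec,
                              int cores, double clock_hz,
                              int vector_bytes) {
    double per_core = chip_bandwidth_bytes_per_sec / cores; /* 2.65e9 here */
    double loads_per_sec = per_core / vector_bytes;         /* about 166e6 */
    return clock_hz / loads_per_sec;
}
```

Calling it with 10.6e9 bytes per second, 4 cores, 2.33e9 Hz and 16-byte vectors gives a little over 14 cycles per load, which is the arithmetic intensity an algorithm would need to keep the cores busy.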

So:  it would be best if the core data manipulations of AGI-related algorithms could be expressed as parallel streams of numerical operations on short vectors, with a moderately high (or very high) ratio of computation to memory access.

My new computer will have 16 gigabytes of memory — an arbitrary choice based on budgetary constraints (the memory is $25 per gigabyte).

Peeking a little bit into the future, I will probably replace the machine sometime in 2011.  Unless some unexpected shift occurs in the technology, I can make a pretty good guess as to what the CPUs of that machine will look like:  Each one will have 8 cores and will have 2-way hyperthreading.  The vector registers will increase in size to 256 bits (Intel AVX).  Given a modest improvement in clock speed, this all adds up to maybe 6-8 times the performance for each CPU.  I do not expect memory bandwidth to increase at the same rate, so the required arithmetic intensity will increase.  I hope to be able to afford 64 gigabytes of memory for that machine.

Tags Categories: Computer Hardware Posted By: Derek
Last Edit: 07 Dec 2008 @ 09:46 PM

 31 Oct 2008 @ 12:17 AM 

I freely admit to being a “computer geek” and I’m a big fan of Moore’s Law.  Because of this, I replace my PC every couple of years with reasonably up-to-date technology.  It’s time to do it again, and as I started reading about what the state of the art is these days I realized somewhat to my surprise that I have always bought pre-assembled PCs, and I’ve never actually built a PC from a bunch of parts to suit my own taste.

Besides the fun factor involved in this little project, the computer will be my experimental platform for the next couple years and I’m really interested in figuring out exactly what it’s capable of, and how to put the technology to good use in AGI-related tasks.  What sorts of substrates are good fits for the hardware architecture?  (I’ll post more about what I mean by a computational substrate sometime soon).

So I acquired a motherboard (Supermicro X7DWA-N) and I’m reading the manual.  I think I tried to read a motherboard manual once years ago but it was a rather poor translation.  The X7DWA manual, however, is actually pretty interesting!

Tags Categories: Computer Hardware Posted By: Derek
Last Edit: 07 Dec 2008 @ 09:47 PM
