Two Hypothetical Supercomputers
From douzzer Mon Dec 18 07:51:14 EST 1995
Newsgroups: comp.arch
Subject: Re: Smoking, Hairy Deck of Cards
References: <eeeDJrzqM.J9H@netcom.com>
Organization: MIT Brain and Cognitive Sciences
This spherical architecture is what you see in the cooling tanks of
the movie "Until the End of the World." For a long while I puzzled
over this image of the computer-as-metallic-sphere, but a month or two
ago had a revelation and realized why this makes perfect sense.
Not to overly harp upon a point of limited metaphorical viability, but
this is reminiscent of the architecture of the mammalian brain. The
cortex contains the bulk of the population of computing nodes, and
surrounds an interconnectional infrastructure. This analogy extends
only so far, since the computer I'm about to describe exhibits the
traditional separation of memory, computation, and connection
componentry.
The inside of the sphere is empty except for an optically transparent
gas. The inner face of the sphere consists of a "wallpaper" of optical
modulators and demodulators. Each emitter is capable of emitting at a
variety of wavelengths: one wavelength is reserved for each
emitter/detector node (the "control" band), and a broadband range of
wavelengths is used as a "bearer" band. Each detector detects signals
in its reserved wavelength independent of the angle of incidence, and
detects bearer signals only at a particular angle of incidence
corresponding to exactly one emitter (the peer). The control bands are
used to configure bearer band "sessions."
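To make the control/bearer split concrete, here is a minimal sketch in Python of how a session might be set up. Everything named here (Node, Sphere, open_session, close_session, send) is hypothetical and only for illustration, and the sketch assumes one-way sessions and a detector aperture that admits a single peer at a time. The node id stands in for the node's reserved control wavelength.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    node_id: int                   # stands in for the reserved control wavelength
    peer: Optional[int] = None     # emitter the bearer aperture currently admits
    inbox: List[Tuple[int, bytes]] = field(default_factory=list)

    def on_control(self, sender: int, want_session: bool) -> bool:
        """Handle a control-band message; the return value is the acknowledgement."""
        if want_session:
            if self.peer is not None and self.peer != sender:
                return False       # aperture is already aimed at another emitter
            self.peer = sender     # aim the aperture at the requesting emitter
            return True
        if self.peer == sender:
            self.peer = None       # tear the session down
        return True

    def on_bearer(self, sender: int, payload: bytes) -> bool:
        """Bearer light is detected only when it arrives from the configured peer."""
        if sender != self.peer:
            return False
        self.inbox.append((sender, payload))
        return True


class Sphere:
    def __init__(self, n_nodes: int) -> None:
        self.nodes: Dict[int, Node] = {i: Node(i) for i in range(n_nodes)}

    def open_session(self, src: int, dst: int) -> bool:
        # Control messages reach dst regardless of angle of incidence.
        return self.nodes[dst].on_control(src, want_session=True)

    def close_session(self, src: int, dst: int) -> bool:
        return self.nodes[dst].on_control(src, want_session=False)

    def send(self, src: int, dst: int, payload: bytes) -> bool:
        return self.nodes[dst].on_bearer(src, payload)


sphere = Sphere(4)
sphere.open_session(0, 2)          # node 0 asks node 2 to aim its aperture at it
sphere.send(0, 2, b"hello")        # accepted: 0 is node 2's peer
sphere.send(1, 2, b"ignored")      # rejected: wrong angle of incidence
```

The point of the sketch is that a control message always gets through, while bearer traffic is filtered purely by which emitter the receiving aperture is aimed at.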
The latency and bandwidth between any two nodes are fixed regardless of
the density of signals between the other nodes; no scalability issue
arises, due to the wavelength multiplexing in the control band and the
aperture multiplexing in the bearer band. The latency is a function of
the diameter of the sphere and the speed with which an aperture's
parameters can be altered. With a one-foot sphere this should be <10ns
latency (accounting for request generation and transit, reception and
aperture setup time, and acknowledgement generation, transit, and
reception) and 5x10^13 bits/second (1 bit for every 10 cycles of green
light, as a rough approximation) or about 6 terabytes/second bearer
throughput. If a node wants to communicate with an adjacent node, it
would need to use a relay node, since the glancing angle would be
optically unworkable.
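The figures above can be checked with a back-of-the-envelope sketch in Python. The 550nm wavelength for "green light" and a refractive index of 1 for the gas are assumptions made for illustration, and the function names are hypothetical. The geometry uses the fact that a chord subtending a central angle theta meets the surface at an angle of theta/2, which is why adjacent nodes see each other at a glancing angle.

```python
import math

C = 299_792_458.0           # speed of light in vacuum, m/s (gas index assumed ~1)
DIAMETER = 0.3048           # one foot, in metres
GREEN_WAVELENGTH = 550e-9   # assumed value for "green light", in metres
CYCLES_PER_BIT = 10         # from the text: 1 bit per 10 optical cycles
LATENCY_BUDGET = 10e-9      # the <10ns figure from the text, in seconds


def chord_length(central_angle, diameter=DIAMETER):
    """Straight-line distance between two nodes on the inner surface."""
    return diameter * math.sin(central_angle / 2)


def grazing_angle(central_angle):
    """Angle between the beam and the local surface at either end of the chord."""
    return central_angle / 2


def transit_time(central_angle, diameter=DIAMETER):
    """One-way flight time of light along the chord."""
    return chord_length(central_angle, diameter) / C


def bearer_bits_per_second(wavelength=GREEN_WAVELENGTH,
                           cycles_per_bit=CYCLES_PER_BIT):
    """Optical frequency divided by the number of cycles spent per bit."""
    return C / wavelength / cycles_per_bit


worst_one_way = transit_time(math.pi)        # antipodal nodes: chord = diameter
round_trip = 2 * worst_one_way               # request out, acknowledgement back
setup_budget = LATENCY_BUDGET - round_trip   # left for generation and aperture setup
bits = bearer_bits_per_second()

print(f"worst-case one-way transit: {worst_one_way * 1e9:.2f} ns")
print(f"round trip:                 {round_trip * 1e9:.2f} ns")
print(f"left for logic and setup:   {setup_budget * 1e9:.2f} ns")
print(f"bearer rate:                {bits:.2e} bits/s = {bits / 8:.2e} bytes/s")
print(f"grazing angle, nodes 2 degrees apart: "
      f"{math.degrees(grazing_angle(math.radians(2))):.1f} degrees")
```

Under these assumptions the light itself accounts for about 1 ns each way, leaving roughly 8 ns of the 10 ns budget for request generation and aperture setup, and the bearer rate comes out near 5.5x10^13 bits/second, in line with the round figures above.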
Just beyond the optical modulator layer (with associated control
logic) is the computing layer. This is basically CPU's, maybe with
three-dimensional integration (stacks of two-dimensional chips
reminiscent of today's technology, densely connected with buses), and
definitely with a high degree of internal parallelism (superscalar,
pipelined). They'll have the same sorts of advancements all CPU's of
the future will have, i.e. very intelligent and flexible predictive
branching, bypasses, and instruction issue, plus a substantial
quantity of embedded FPGA real estate with direct access to the
register file(s). Just beyond this computing layer is a
very-high-speed memory, i.e. a cache. The cache will be very smart and
configurable on an application-specific level, perhaps by way of
another FPGA. Somewhere around this level will be the memory
management logic. Beyond the cache and memory management logic is the
medium speed memory, maybe some sort of RAM. Beyond this is lower
speed memory, perhaps holographic memory (holographic memory comes
with the perk that it is inherently capable of certain types of
recognition tasks in a massively parallel fashion). Some of the nodes
would have I/O beyond this level, of the type that would connect to a
visual I/O device, or an interface to mass storage or an external
network.
Multiple spheres could be interconnected in at least two obvious ways:
either through a node with a networking interface attached to it, or
through an optical cable bolted into each sphere which would provide
an ultra-wide-band connection with near-intrasphere throughput.
Spheres connected with the latter technique would clearly need to
avoid control band collisions (their wavelength spaces would be
superimposed). If the emitters could be made capable of precisely
controlling the direction of a focussed bearer beam, then the
intersphere cable would be able to support maximum throughput between
any number of peers in neighboring spheres; perhaps this capability is
overkill, but it's worth pursuing.
This system would be actively cooled. Instant smoking deck of cards!
However, unlike a Cray, failures in this system would be fairly easy
to tolerate, diagnose, and repair, by dint of its modularity and the
mechanical accessibility and extractability of its components.
Notice that the processor interconnection technique exploits the
characteristic of bosons (in this case photons) that they are not
subject to Pauli's exclusion principle, and so can support arbitrarily
many intersecting yet non-interacting pathways. This basic
architecture can probably also be used in a quantum computer, where
the shell would have a dramatically different composition.
-douzzer
From douzzer Wed Sep 17 11:27:26 EDT 1997
From: Daniel Pouzzner <douznews@kill-9.ai.mit.edu>
Newsgroups: comp.sys.super
Subject: Re: QUESTION: TeraFLOPS-in-a-box?
References: <341F6607.8051C880@telstar.com.au> <5vomlg$b96$1@news.rchland.ibm.com>
Sender: <douznews@kill-9.ai.mit.edu>
Organization: (private)
cecchi@signa.rchland.ibm.com more or less described the proverbial
"smoking hairy deck of cards" (do a dejanews search for the thread of
the same name, in comp.arch I think).
If you want a teraflop on your lap, that ain't the way to do it.
Instead, you'd want to go MASSIVELY parallel, with tens of thousands
of processors each cranking at only 50MHz or so, and each using a
milliwatt or less power.
A module would consist of a hundred or so chips stacked vertically,
separated from each other by fully insulating dielectric layers, with
bus lines running vertically to interconnect the chips. Each chip is
maybe 4 cm^2 and has some memory, say 32 Mbytes, some processors, say
32 of them, and control and interface logic. The CPU's implement
threads of control in hardware, so that each CPU can have dozens of
lightweight threads proceeding at once, and an I/O block on one thread
causes a single-cycle switch to another runnable thread - all the
CPU's are almost always getting actual work done.
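The effect of hardware threads on utilization can be illustrated with a toy cycle-level simulation in Python. This is only a sketch: the blocking probability, the I/O latency, and the reading of "single-cycle switch" as one lost cycle per switch are all assumptions, and the function name is hypothetical.

```python
import random


def utilization(n_threads, cycles=20_000, block_prob=0.1,
                block_latency=20, switch_cost=1, seed=0):
    """Fraction of cycles in which one CPU does useful work."""
    rng = random.Random(seed)
    ready_at = [0] * n_threads   # cycle at which each thread is runnable again
    current = 0                  # thread the scheduler tries first
    switch_done = 0              # cycle at which a pending switch completes
    busy = 0
    for cycle in range(cycles):
        if cycle < switch_done:
            continue             # cycle lost to the thread switch
        # Round-robin search for a runnable thread, starting at `current`.
        order = [(current + k) % n_threads for k in range(n_threads)]
        thread = next((t for t in order if ready_at[t] <= cycle), None)
        if thread is None:
            continue             # every thread is blocked: the CPU idles
        busy += 1
        current = thread
        if rng.random() < block_prob:
            # The thread blocks on I/O at the end of this cycle.
            ready_at[thread] = cycle + 1 + block_latency
            switch_done = cycle + 1 + switch_cost
            current = (thread + 1) % n_threads
    return busy / cycles


for n in (1, 2, 4, 32):
    print(f"{n:2d} threads: {utilization(n):.2f} of cycles doing work")
```

With a single thread the CPU sits idle through every I/O wait; with dozens of threads there is nearly always another runnable thread, so the only loss is the switch cycle itself.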
So you get 3200 CPU's per module. Assuming each CPU dissipates a
milliwatt and is capable of 40 actual mflop's (reasonable given the
architecture and parameters) each module gets you a usable 128
gigaflops and dissipates about 3 watts. The module, all packaged up,
is about the size of a matchbox. Put together 8 of them and you've
got over a teraflop, with dissipation of 25 watts or so. That'll fit
nicely inside a present-day laptop with a present-day battery I
believe.
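The arithmetic behind these figures, as a short Python sketch. The constants are the ones given above; the function names are hypothetical, and the power figure counts CPU dissipation only, since no numbers are given for memory or bus power.

```python
CHIPS_PER_MODULE = 100      # "a hundred or so chips"
CPUS_PER_CHIP = 32
MBYTES_PER_CHIP = 32
MFLOPS_PER_CPU = 40
MILLIWATTS_PER_CPU = 1
MODULES = 8


def cpus(modules=1):
    return modules * CHIPS_PER_MODULE * CPUS_PER_CHIP


def gigaflops(modules=1):
    return cpus(modules) * MFLOPS_PER_CPU / 1000


def watts(modules=1):
    # CPU dissipation only; memory and bus power are not counted.
    return cpus(modules) * MILLIWATTS_PER_CPU / 1000


def memory_mbytes(modules=1):
    return modules * CHIPS_PER_MODULE * MBYTES_PER_CHIP


print(f"per module: {cpus()} CPUs, {gigaflops():.0f} gigaflops, {watts():.1f} W")
print(f"{MODULES} modules:  {cpus(MODULES)} CPUs, "
      f"{gigaflops(MODULES) / 1000:.3f} teraflops, {watts(MODULES):.1f} W, "
      f"{memory_mbytes(MODULES)} Mbytes")
```

Eight modules give 25600 CPUs, which is where the "tens of thousands of processors" comes from.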
-Daniel Pouzzner