For educational purposes...
Has anyone implemented some sort of message passing API on a network of Arduinos? Since they cost only about $20 each and have a fairly simple development environment, it seems you could put together a simple demonstration of parallel processing and various message passing techniques.
For instance, you could introduce errors in the message links and do experiments with Byzantine General type algorithms, or with multiple parallel routes, etc.
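As a rough illustration of the kind of experiment meant here, the following is a minimal sketch that simulates unreliable links on a PC rather than on real Arduinos. The function names and the drop probabilities are hypothetical, chosen only for illustration. A message is sent over several parallel routes, each of which may drop it, and delivery succeeds if at least one copy arrives.

```python
import random


def send_over_link(message, drop_probability, rng):
    """Return the message, or None if the simulated link drops it."""
    return None if rng.random() < drop_probability else message


def send_redundant(message, drop_probabilities, rng):
    """Send one copy per route; succeed if at least one copy arrives."""
    arrivals = [send_over_link(message, p, rng) for p in drop_probabilities]
    return any(copy is not None for copy in arrivals)


def delivery_rate(drop_probabilities, trials=10_000, seed=1):
    """Fraction of trials in which the message got through on some route."""
    rng = random.Random(seed)
    delivered = sum(
        send_redundant("ping", drop_probabilities, rng) for _ in range(trials)
    )
    return delivered / trials
```

Comparing delivery_rate([0.3]) with delivery_rate([0.3, 0.3, 0.3]) shows how much three parallel routes help over a single flaky one. On real hardware the "drop" would be a pulled cable rather than a random number.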
I've not actually tried hooking up multiple Arduinos through a USB hub to one PC, but if that works, it gives you a nice "head node, debug console" sort of interface.
Smaller, lighter, cheaper than lashing together MiniITX mobos or building a Wal-Mart Cluster.
I started tinkering with Arduinos a couple of months ago. Got lots of
related goodies for Christmas, so I've been looking like a mad scientist
building Arduino things lately. I'm still a beginner Arduino hacker, but
I'd be game for giving this a try, if anyone else wants to give this a go.
The Arduino Due, which is overdue in the marketplace, will have a
Cortex-M3 ARM processor.
--
Prentice
I think something like the Raspberry Pi might be easier for this sort
of task. They'll also be about $25, but they'll run something like
ARM/Linux. Not out yet, though.
http://www.raspberrypi.org/
--
- - - - - - - - - - - - - - - - - - - - -
Nathan Moore
Associate Professor, Physics
Winona State University
- - - - - - - - - - - - - - - - - - - - -
A completely superior chip, that Cortex-M3.
Though I haven't been able to program much for it so far; it is difficult to get
contract jobs for.
It can do fast 32 x 32 bit multiplication.
You can even implement RSA very fast on that chip.
Runs at 70 MHz or so?
By the way, writing assembler for such CPUs is usually more efficient
than using
a compiler. Compilers are not so efficient, to put it politely, for
embedded CPUs.
Writing assembler for such CPUs is pretty straightforward, whereas
in HPC things are far more complicated
because of vectorization.
AVX is the latest there. Speaking of AVX, is there already lots of
HPC support for AVX?
I see that after years of wrestling, George Woltman released some
prime number
code (GWNUM), of course as always in beta for the remainder of this
century, which uses AVX.
Claims are that it's a tad faster than the existing SIMD codes. I saw
claims of even above 20% faster,
which is really a lot at that level of engineering; usually you work
6 months for 0.5% speedup.
Even if you improve the algorithm, you still lose to this code, as your
C/C++ code will by default be a factor of 10 slower, if not more.
I remember how I found a clever caching trick in 2006 for a Number
Theoretic Transform (that's an FFT but in integers, so without
the rounding errors that floating point FFTs give), yet after
some hard work my C code was still a factor of 8 slower than Woltman's
SIMD assembler.
That's all very expensive considering the cpu's are under $1 i'd guess.
I actually might need some of this stuff some months from now to
build some robots.
Yes.. better the widget that one can whip on down to Radio Shack and buy on my way home from work than the ghostware that may live for Christmas future.
Also, does the Raspberry Pi $25 price point include a power supply? The Arduino runs off the USB 5V power, so it's one less thing to hassle with.
I don't know that performance is all that important in this application. It's more to experiment with message passing in a multiprocessor system. Slow is fine.
(I can't think of a computational application for an ArdWulf (combining Italian and Saxon) that wouldn't be blown away by almost any single computer, including something like a smart phone.)
Realistically, you're looking at bitbanging kinds of serial interfaces.
I can see several network implementations: SPI shared bus, hypercubes, toroidal surfaces, etc.
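For the hypercube and toroidal layouts, the wiring list falls out of a little index arithmetic. The sketch below is only an illustration, with hypothetical function names: in a hypercube, a node's neighbors are the node IDs that differ from it in exactly one bit, and in a 2-D torus the neighbors are the four grid cells reached by stepping one position with wraparound.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbors of `node` in a hypercube: flip each address bit in turn."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def torus_neighbors(row, col, rows, cols):
    """Up, down, left and right neighbors on a 2-D grid with wraparound."""
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]
```

With 8 nodes, a 3-dimensional hypercube gives each node three links, for example hypercube_neighbors(5, 3) lists the three nodes that node 5 would be cabled to.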
A completely superior chip, that Cortex-M3.
Though I haven't been able to program much for it so far; it is difficult to get contract jobs for.
It can do fast 32 x 32 bit multiplication.
You can even implement RSA very fast on that chip.
Runs at 70 MHz or so?
By the way, writing assembler for such CPUs is usually more efficient than using a compiler. Compilers are not so efficient, to put it politely, for embedded CPUs.
Writing assembler for such CPUs is pretty straightforward, whereas in HPC things are far more complicated because of vectorization.
-->> ah, but this is not really an HPC application. It's a cluster computer architecture demonstration platform. The Java-based Arduino environment is pretty simple and multiplatform. Yes, it uses a sort of weird C-like language, but there it is... it's easy to use.
Yes..
And there's been a bunch of "value clusters" over the years (Stone SouperComputer, for instance)..
But that's still $3k.
I could see putting together 8 nodes for a few hundred dollars. Arduino Uno R3 is about $25 each in quantity.
Think in terms of a small class where you want to have, say, 10 mini-clusters, one per student. No sharing, etc.
Jim, your microcontroller cluster is not a very good idea.
Latency didn't keep up with the CPU speeds...
Today's nodes have a dozen or so CPU cores, and soon 16, which together can execute
(to take a simple integer example, my chess program and its IPC)
about 24 instructions per cycle.
So nothing SIMD, just simple integer instructions for most of it; of
course loads that effectively
come from L1 play an overwhelming role there.
Typical latencies for a random memory read from a remote node,
even with the latest networks,
are between 0.85 and 1.9 microseconds. Let's optimistically take 1
microsecond for an RDMA read.
So in that timeframe you can execute 24k+ instructions.
IPC at the cheapo cpu's is far under 1 effectively. Around 0.25 for
most codes.
Cpu's of 70Mhz can execute 1 instruction in each 280 Mhz. Now we are
busy with rough measures here.
Let's call that 1/4 millisecond.
Even USB 1.1 sticks have latencies far under 1 millisecond.
So the actual latency of today's clusters is a factor of 25k worse than this
'cluster'.
In fact your microcontroller cluster here has latencies that you do
not even have core to core
within a single CPU today.
There is still too much 1980s and 1990s software out there,
written by the guys who wrote books about how to parallelize, which
simply
doesn't scale at all on modern hardware.
Let me not quote too many names there as I've done before.
They were just too lazy to throw away their old code and start over,
writing a new parallel concept
that works on today's hardware.
If we involve GPU's now then there is gonna be an even bigger problem
and that's that bandwidth of the network
can't keep up with what a single GPU delivers. Who is to blame for
that is quite a complicated discussion,
if anyone has to be blamed anyway.
We just need more clever algorithms there.
Hah, how easy it is to make a mistake, sorry for that.
I didn't even multiply by the GHz frequency of the CPUs yet.
So if it's 3 GHz or so, it's actually closer to a factor of 75k than
24k.
Furthermore, another problem is that you can't fully load networks, of
course.
So to keep the network functioning well you want to do such
hammering over the network no more than once every 750k instructions.
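The arithmetic behind that correction can be written out. This is just a sketch of the estimate as stated in the post, not a measurement: the inputs (24 instructions per cycle per node, about 3 GHz, about 1 microsecond per remote read) are the figures quoted above, and the function name is made up for illustration. GHz multiplied by nanoseconds gives clock cycles, and cycles multiplied by IPC gives the instructions a node could have issued while waiting.

```python
def instructions_per_remote_read(ipc_per_node, clock_ghz, latency_ns):
    """Instructions a node could issue while one remote read is in flight.

    clock_ghz * latency_ns is the number of clock cycles spent waiting;
    multiplying by instructions-per-cycle converts that to instructions.
    """
    return ipc_per_node * clock_ghz * latency_ns
```

With the quoted figures, instructions_per_remote_read(24, 3, 1000) comes to 72,000, which is the "closer to 75k" number in round terms. The "once every 750k instructions" rule of thumb is then about ten such waits' worth of work between remote reads.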
-----Original Message-----
From: beowulf-bounces*******] On Behalf Of Vincent Diepeveen
Sent: Wednesday, January 11, 2012 2:47 PM
To: Beowulf Mailing List
Subject: Re: [Beowulf] A cluster of Arduinos
Jim, your microcontroller cluster is not a very good idea.
Latency didn't keep up with the CPU speeds...
--- You're missing the point of the cluster. It's not for performance (where I can't imagine that the slowest single CPU PC out there wouldn't blow the figurative doors off). It's to provide a very inexpensive way to experiment/play/demonstrate loosely coupled multiprocessor systems.
--> for example, you could experiment with redundant message routing across a fabric of nodes. The algorithms are fairly simple, and this gives you a testbed which is qualitatively different than just simulating a bunch of nodes on a single PC. There is pedagogical value in a system where you can force a link error by just disconnecting the cable, and your blinky lights on each node show what's going on.
There is still too much 1980s and 1990s software out there, written by the guys who wrote books about how to parallelize, which simply doesn't scale at all on modern hardware.
--> I think that a lot of the theory of parallel processes is speed independent, and while some historical approaches might not be used in a modern system for good implementation reasons, students and others still need to learn about them, if only as the canonical approach. Sure, you could do a simulation on a single PC (and I've seen them, in Simulink, and in other more specialized tools), but there's a lot of appeal to a hands-on-the-cheap-hardware approach to learning.
--> To take an example, if you set a student a problem of lighting an LED on each node in a specified node order at specified intervals, and where the node interconnects are not specified in advance, that's a fairly interesting homework problem. You have to discover the network connectivity graph, then figure out how to pass the message to the appropriate node at the appropriate time. This is a classic "hot plug network discovery" kind of problem, and in the face of intermittent links, it's of great interest.
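One way a student might approach that homework problem is a breadth-first flood from the head node. The sketch below runs on a PC and assumes a hypothetical `links` dictionary standing in for "which neighbors answer on my ports"; on real hardware each node would only learn its own neighbors by probing. It records which node each newly found node was reached through, then walks those records backwards to get a route.

```python
from collections import deque


def discover(links, start):
    """Breadth-first flood from `start`; returns {node: node it was found via}."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(links.get(node, ())):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def route(parent, target):
    """Path from the start node to `target`, or None if it was never reached."""
    if target not in parent:
        return None
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]
```

Re-running discover after removing a link from the dictionary mimics unplugging a cable: the route to the far node either changes to go around the break or comes back as None.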
--> While that particular problem isn't exactly HPC, it DOES relate to HPC in a world where you cannot assume perfect processor nodes and perfect communications links. And that gets right to the whole "scalability" thing in HPC. It wasn't til the implementation of Error Correcting Codes in logic that something like the Q7A computer was even possible, because it was so large that you couldn't guarantee that all the tubes would be working all the time. Likewise with many other aspects of modern computing.
--> And, of course, in the spaceflight world, this kind of thing is even more important. A concept of growing importance is the "fractionated spacecraft" where all of the functions that would have been all in one physical vehicle are now spread across many smaller pieces. And one might reallocate spacecraft fractional pieces between different virtual spacecraft. Maybe right now, you need a lot of processing power to do image compression and analysis, so you want to allocate a lot of "processing pieces" to the job, with an ad hoc network connection among them. Later, you don't need them, so you can release them to other uses. The pieces might be in the immediate vicinity, or they might be some distance away, which affects the data rate in the link and its error rates.
--> You can legitimately ask whether this sort of thing (the fractionated spacecraft) is a Beowulf (defined as a cluster supercomputer built of commodity components) and I would say it shares many of the same properties, especially in the early Beowulf days before multicores and fancy interconnects were fashionable for multi-thousand processor clusters. It's that idea of building a large complex device out of many basically identical subunits, using open source/simple software to manage it.
-->> in summary, it's not about performance.. it's about a teaching tool for networking in the context of cluster computing. You claim we need to cast off the shackles of old programming styles and get some new blood and ideas. Well, you need to get people interested in parallel computing and learning the basics (so at least they don't reinvent the square wheel). One way might be challenges such as parallelization of game play; another might be working with parallelized database; the way I propose is with experimenting with message passing parallelization using dirt cheap hardware.
_______________________________________________
Beowulf mailing list, Beowulf*******sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Yes this was impossible to explain to a bunch of MIT folks as well,
some of whom wrote your book I bet - yet the slower the processor,
the more of a true SMP system it is.
It's obvious that you missed that point.
Writing code for a multicore is tougher, from SMP constraints viewpoint,
than for a bunch of 70Mhz cpu's that have a millisecond latency to
the other cpu's.
So it's far from demonstrating clusterprogramming. Lightyears away.
Emulation at a simple quadcore is in fact better representative than
this.
If you want to get closer to clusterprogramming than this, just buy
yourself off ebay
some barcelona core SMP system with 4 sockets. Say with energy
efficient 1.8Ghz CPU's.
So with one of the first incarnations of hypertransport, as of course
later on it dramatically improved.
Latency from cpu to cpu is some 300+ ns if you lookup randomly.
Even good programmers in game tree search have big problems working
with those latencies.
Clusters are having latencies that are far worse than that. Yet as
cpu speeds no longer increase much
and number of cores doesn't double that quickly, clusters are the way
to go if you're CPU hungry.
Setting up small clusters is cheap as well. If i put in the name
'mellanox' in ebay i see bunches of
cheap cards out there and also switches.
With a single switch you can teach half a dozen students. You can
just connect the machines you already
got there onto a few switches and write MPI code like that.
Average cost per student also will be a couple of hundreds of dollars.
Vincent
Whatever happened to hacking on hardware just for the fun of it?
Just because it's not going to be useful doesn't mean you won't learn
from the experience, even if the lesson is only "don't do it again".
:-)
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel*******Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
I thought the plan was for them to be powered from the HDMI connector,
but it appears I was wrong, it looks like it can use either microUSB
or the GPIO header.
http://elinux.org/RaspberryPiBoard
# The board takes fixed 5V input, (with the 1V2 core voltage generated
# directly from the input using the internal switch-mode supply on the
# BCM2835 die). This permits adoption of the micro USB form factor,
# which, in turn, prevents the user from inadvertently plugging in
# out-of-range power inputs; that would be dangerous, since the 5V
# would go straight to HDMI and output USB ports, even though the
# problem should be mitigated by some protections applied to the input
# power: The board provides a polarity protection diode, a voltage
# clamp, and a self-resetting semiconductor fuse.
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel*******Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
-----Original Message-----
From: beowulf-bounces*******] On Behalf Of Vincent Diepeveen
Sent: Wednesday, January 11, 2012 4:37 PM
To: Beowulf Mailing List
Subject: Re: [Beowulf] A cluster of Arduinos
Yes this was impossible to explain to a bunch of MIT folks as well, some of whom wrote your book I bet - yet the slower the processor, the more of a true SMP system it is.
It's obvious that you missed that point.
Writing code for a multicore is tougher, from SMP constraints viewpoint, than for a bunch of 70Mhz cpu's that have a millisecond latency to the other cpu's.
-> Yes, that's true... but that's also what I would think of as more advanced than understanding basic message passing or non-tightly-coupled multiprocessing systems. And there are lots of applications for the latter. Some might not be as sexy as others, but they exist.
So it's far from demonstrating clusterprogramming. Lightyears away.
Emulation at a simple quadcore is in fact better representative than this.
If you want to get closer to clusterprogramming than this, just buy yourself off ebay some barcelona core SMP system with 4 sockets. Say with energy efficient 1.8Ghz CPU's.
So with one of the first incarnations of hypertransport, as of course later on it dramatically improved.
Latency from cpu to cpu is some 300+ ns if you lookup randomly.
Even good programmers in game tree search have big problems working with those latencies.
-> but that's an entirely different sort of problem space and instructional area.
Clusters are having latencies that are far worse than that. Yet as cpu speeds no longer increase much and number of cores doesn't double that quickly, clusters are the way to go if you're CPU hungry.
Setting up small clusters is cheap as well. If i put in the name 'mellanox' in ebay i see bunches of cheap cards out there and also switches.
-> Oh, I'm sure the surplus market is full of things one could potentially use. But I suspect that by the time you lash together your $40 cards and $20 cables and several hundred $ switch, you're up in the total system price >$1k. And you're using surplus, so there's a support issue. If you're tinkering for yourself in the garage or as a one-off, then surplus is a fine way to go. If you want to be able to give a list of "go buy this" to a teacher, it needs to be off-the-shelf currently being manufactured stuff.
-> Say you want to set up 10 demo systems with 8 nodes each, so that each student in a small class has their own to work with. There's a big difference between $30 Arduinos and $200 netbooks.
With a single switch you can teach half a dozen students. You can just connect the machines you already got there onto a few switches and write MPI code like that.
-> The whole point is to give a student exclusive access to the system, without needing to share. Sure, we've all done the shared "computer lab" resource thing and managed to learn. (In the late 1970s, I would have done quite a lot to have on-demand access to an 029 keypunch.) That's part of what *personal* computers are all about. My program doesn't work right, I just hit the reset button and start over.
-> I confess, too, that there is an aspect of the "mass of boards on the desktop with cables strewn around", which is a learning experience in itself. On the other hand, the Arduino experience is a lot less hassle than, say, a mass of PC mobos, network cards, and power supplies and trying to get them to boot off the net or a USB drive.
Average cost per student also will be a couple of hundreds of dollars.
-> that's the "total cost of several thousand dollars divided by N students who share it" I suspect. We could get into a little BOM battle, and I'd venture that I can keep the off the shelf parts cost under $500, and give each student a dedicated system to play with. The only part that I don't know right off the top of my head is the actual interconnect hardware. I think you'd want to design some sort of board with a bunch of connectors that connects to the Arduinos with ribbon cables. But even there, that could be "here's your PCBExpress file.. order the board and you get 3 for $50"
-> over the years I've been involved in several of these "what can we set up for a demonstration", and I've converged to the realization that what you need is a parts list (preferably preloaded at Newark or DigiKey or Mouser or similar) and an explicit set of instructions. A setup that starts out with:
1) Find 8 motherboards on eBay or newegg with these sorts of specs
2) Find 8 power supplies that match the mother boards
is doomed to failure. You need "buy 3 of those and 6 of these, and hook them up this way".
This is the beauty of the whole Arduino culture. In fact, it's a bit too much of that.. there's not a lot of good overview tutorial material.. but lots of "here's how to do specific task X"... I got started looking at Arduinos because I want to build a multichannel temperature controller to smoke/cure sausage.
But I've used just about every small single board computer out there: Rabbit, Basic Stamp, various PIC boards, etc. not to mention various MiniITX and PC schemes. So far, the Arduino is the winner on dirt cheap and simple combined. Spend $30, plug in USB cable, load java environment, done. Now I know why all those projects at the science fair are using them. You get to focus on what you want to do, rather than getting a computer working.
Vincent
Interesting...
That seems to be a growing trend, then. So, now we just have to wait for them to actually exist. The $35 B style board has Ethernet, and assuming one could netboot and operate "headless", then a stack o' Raspberry Pis and a cheap Ethernet switch might be an alternate approach.
The "per node" cost is comparable to the Arduino, and it's true that Ethernet is probably more congenial in the long run.
Drawing 700mA off the microUSB, though.. That's fairly hefty (although not a big deal in general). You might need to have some better power supply scheme for a basket o' Pi cluster. (Arduino Uno runs around 40-50 mA.)
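To put rough numbers on the power question, here is a small sketch of the budget. It assumes, purely for illustration, that all nodes hang off a single USB 2.0 port (say through an unpowered hub), which is specified to supply at most 500 mA; the per-node currents are the ones quoted above and the function names are hypothetical.

```python
USB2_PORT_LIMIT_MA = 500  # USB 2.0 maximum for one high-power port


def cluster_current_ma(nodes, current_per_node_ma):
    """Total current drawn by `nodes` identical boards, in milliamps."""
    return nodes * current_per_node_ma


def fits_one_usb_port(nodes, current_per_node_ma, limit_ma=USB2_PORT_LIMIT_MA):
    """True if the whole cluster stays within one port's current budget."""
    return cluster_current_ma(nodes, current_per_node_ma) <= limit_ma
```

Eight Arduino Unos at 50 mA each come to 400 mA and fit under that limit, while eight boards at 700 mA each come to 5.6 A and clearly need their own supply.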
The whole purpose of PCs is that they are generic to use. I remember
how in the past decision makers bought low-clocked junk for a big price -
much against the wish of the sysadmins who wanted a PC for every
student exclusively. Outdated slow junk is not interesting
to students. Now you and I might like that CPU as it's under $1, but
to them it's just 70 MHz, a factor of 500 slower than a single core of
their home PC.
What impresses is if you've got something that can beat their own
machine at home.
In the end in science we basically learn a lot easier if we can take
a look into the future - so being faster than a single PC is a good
example of that.
So let them do that. If you take care to launch 1 process on each
machine, then with quadcore machines, not to mention i7s with
hyperthreading, you can have 24 computers on 1 switch that serve 24
students, each using 12 logical cores.
And for demonstration purposes you can also run successful applications
on all 24 computers at the same time.
Hey, there are switches with even more slots.
Average price per student is gonna beat the crap out any junk
solution you show up with - besides how many are you gonna buy?
Those computers are already there, one for each student i suspect.
So they can exclusively toy and toy - for the switch it's not a real
problem except if they really mess up.
But most important, they learn something. By toying with 70 MHz
hardware that's not representative and only interesting to experts like
you and me, who are real good at embedded programming, they don't
learn much.
There is no replacement for the real thing to test upon.
Besides, if you program on embedded processors, my good fast
single-CPU code is probably gonna kick the hell out of your version of
the same program on 8 CPUs. Probably by a factor of 10+ the single
core will be faster than your 8.
p.s. not that it's disturbing Jim but your replies are typed within
my original message always, so tough to read sometimes what you typed
into
the message i posted here - maybe this apple macbookpro's
mailing system doesn't know how to handle it - FYI i want to reformat
it to linux anyway -
getting sick being hacked silly each time by about every other
consultant,
but well this is all off topic - so hence the postscriptum.
Regarding Ethernet switches, I recently had cause to look for a USB
powered switch.
Such things exist; they are promoted for gamers.
http://www.scan.co.uk/products/8-port-eten-pw-108-pocket-siz[..]
ng-10-100-switch-usb-powered-lan-party!
You could imagine a cluster being powered by those USB adapters which
fit into the cigarette
lighter socket of a car.
How about a cluster which fits in the glovebox or under the seat of a
car?
Take this advice in any other area, let's say, Chemical Engineering or
Mechanical Engineering, and the students are going to come out of
the experience with chemical burns at the least, and at the most blow up half
of the building. In the best case all they do is screw up very, very
expensive equipment. So I have to respectfully disagree that learning
is only possible and students will only be interested when working on
the stuff of the "future." I think this is likely the reason why many
introductory engineering classes incorporate use of Lego Mindstorm
robots rather than lunar rovers (or even overstock lunar rovers :D).
Case in point, I got interested in HPC/Beowulfery back in 2006, read
RGB's book and a few other texts on it, and finally found a small group
(4) of unused PIIIs to play on in the attic of one of my college's
buildings. Did I learn how to set up a reasonable cluster? Yes. Was it
slow as dirt compared to then-modern Intel and AMD processors? Of
course. But did the experience get me so completely hooked on
HPC/Cluster research that I went on to pursue a PhD on the topic?
Absolutely.
Granted, I'm just one data point, but I think Jim's idea has all the
right components for a great educational experience.
Best,
ellis
You can get an Ethernet "shield" for Arduino to add Ethernet
capabilities, but at $35-50 each, your cost savings just went out the
window, especially when compared to the Raspberry Pi. You can also buy
the Arduino Ethernet, which is an Arduino board with Ethernet built in,
but at a cost of ~$60, it is no better a value than buying an Arduino and
the Ethernet shield separately.
The arduino can be powered by USB, or a 9V power supply, so if you plan
on using lots of them (as Jim is, theoretically), you don't have to
worry about overloading the USB bus.
--
Prentice
Powering off the cigarette lighter socket (or 12V power socket as they're
now labeled) is probably feasible, but those USB widgets can't source a
lot of power. Certainly not amps.
The average guy is not interested in knowing all the details of
how to
play tennis with a wooden racket from the 1980s, around
the time when McEnroe was playing on the tennis court.
Most people are more interested in whether you can win that Grand Slam
with what you produce.
The nerds, however, are interested in how well you can do with a wooden
racket
from the 1980s; therefore projecting your own interest upon those students
will just
make them lose interest, and you will be judged by them as an
irrelevant person
in their life, whose name they soon forget.
Vincent
That is also the purpose of the Arduino. That's why they open-sourced
its hardware design.
Wrong. What impresses students is teaching something they didn't already
know, or showing them how to do something new. Using baking soda and
vinegar to build a volcano, is very low-tech, but it still impresses
students of all ages (even in this modern Apple i-everything world) and
it's done with ingredients just about everyone already has in their
kitchen.
Show them sodium acetate crystallizing out of a supersaturated solution,
and their heads practically explode. Also very low-tech.
--
Prentice
Vincent, I think the only person projecting here is you. You refer to
the 'average guy'. The word 'average' itself implies that statistics
have been collected and analyzed. Can you please show us your
statistics, and how you collected them, to determine what the average
guy is interested in? And what about the average girl, what is she
interested in? If you are merely citing the work of other researchers,
please include citations.
--
Prentice
Guys, let's just let this one die in its traditional form of Vincent
disagrees with the list and there is nothing more that can be done. I
recently read a blog that suggested (due to similar threads following
these trajectories) that the Wulf list wasn't what it used to be.
Let's save the flames for editors,
ellis
Very simple.
Wooden tennis rackets were dirt cheap in the 90s.
No one bought them.
Instead, everyone bought a light-frame racket with a big head for the
tennis court; in some cases those were pretty expensive.
Why did everyone suddenly stop using those wooden rackets?
How many people will watch the upcoming Australian Grand Slam tournament?
A lot.
How many will watch 1 or 2 dudes toy with a few embedded processors using
a language no one has heard of? Only a handful.
Having spent some time recently in Human Resources meetings about how to
better recruit software people for JPL, I'd say that something that
appeals to nerds and gives them something to do is not all bad. Part of
the educational process is to find and separate the people who are
interested and have a passion. I'm not sure that someone who starts
getting into clusters mostly because they are interested in breaking into
the Top500 is the target audience in any case.
If you look over the hobby clusters out there, the vast majority are "hey,
I heard about this interesting idea, I scrounged up N old/small/slow/easy
to find computers and tried to cluster them and do something. I learned
something about cluster administration, and it was fun, but I don't use it
anymore"
This is exactly the population you want to hit. Bring in 100 advanced
high school (grade 11-12 in US) students. Have them all use cheap
hardware to do a cluster. Some fraction will think, "this is kind of
cool, maybe I should major in CS instead of X" Some fraction will think,
"how lame, why not make the single processor faster", and they can be
CompEng or EE majors looking at how to reduce feature sizes and get the
heat out.
It's just like biology or chemistry classes. In high school biology
(9th/10th grade) most of it is mundane memorization (Krebs cycle, various
descriptive stuff). Other than the use of cheap CMOS cameras, microscopes
used at this level haven't really changed much in the last 100 years (and
the microscopes at my kids' school are probably 10-20 years old). They
also do some more modern molecular biology in a series of labs partly
funded by Amgen: Some recombinant DNA to put fluorescent proteins in a
bacteria, running some gels, etc. The vast majority of the students will
NOT go on to a career in biology, but some fraction do, they get
interested in some aspect, and they wind up majoring in bio, or being a
pre-med, etc.
Not everyone is looking for the world beater. A lot of kids start with
Kart racing, even though even the fastest Karts aren't as fast as F1 (or
even a Smart Car). How many engineers started with dismantling the
lawnmower engine?
For my own work, I'd rather have people who are interested in solving
problems by ganging up multiple failure prone processors, rather than
centralizing it all in one monolithic box (even if the box happens to have
multiple cores).
Ah, no medicine seems to cure you.
Let me recall Jim's original posting:
"it seems you could put together a simple demonstration of parallel
processing and various message passing things."
The insights presented here obviously render this platform no good
for that. It is not inspiring, and the clever students will surely lose
interest entirely; a bunch will probably not even finish the course.
Working with stuff that isn't even within a factor of 500 of the speed of
a normal CPU doesn't motivate, doesn't inspire, and basically teaches a
person very little.
Embedded CPUs are for professionals; leave it like that.
They are too hard for you to program efficiently.
Your example here will just ensure that a large number of students
want nothing to do with those studies, as there are a few lame nerds
there who toy with equipment that's a factor of 50k slower (adding to the
factor of 500 the object-oriented slowdown of a factor of 100) than what
they have at home, and it can do nothing useful.
But in this specific case you'll just scare away students, and the
really clever ones will lose interest entirely as you are busy with
lame-duck-speed CPUs.
If you'd build a small Mars rover with it, that would be something else
of course.
snip
This is going to be an exascale issue, i.e. how to compute on a system
whose parts might be in a constant state of breaking. Another interesting
question is: how do you know you are getting the right answer on a *really*
large system?
Of course I spend much of my time optimizing really small
systems.
--
Doug
--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.
You have made it abundantly clear you aren't interested in enrolling in
such a course. Thanks for your comments.
On a related note, as I was thinking about 'lame duck' education, I
remembered that I took an undergraduate machine learning course in which
we designed players for connect-four, which would compete using recently
learned techniques against other students in the class. Despite that
particular game being a solved one, we all had a blast and got quite
competitive trying to beat each other out using the recently acquired
skills. I would encourage Jim to do something similar once the basics
of cluster administration are done -- perhaps a mini SC Cluster
Competition would be a neat application for the Arduinos?
Best,
ellis
-----Original Message-----
From: Douglas Eadline [mailto:deadline*******]
Sent: Thursday, January 12, 2012 8:49 AM
To: Lux, Jim (337C)
Cc: beowulf*******
Subject: Re: [Beowulf] A cluster of Arduinos
snip
>
>
> For my own work, I'd rather have people who are interested in solving
> problems by ganging up multiple failure prone processors, rather than
> centralizing it all in one monolithic box (even if the box happens to
> have multiple cores).
>
This is going to be an exascale issue, i.e. how to compute on a system whose parts might be in a constant state of breaking. Another interesting question is: how do you know you are getting the right answer on a *really* large system?
Of course I spend much of my time optimizing really small systems.
--
Your point about scaling is well taken. So far, the computing world has largely dealt with things by trying to make the processor perfect and error free. Some limited areas of error correction are popular (RAM). But think in a bigger area: say your arithmetic unit has some infrequent unknown errors (e.g. the FDIV bug on the Pentium). Could clever algorithm design and multiple processors (or multiple cores) mitigate this? E.g. instead of just computing Z = X/Y you also compute Z1 = (X*2)/(Y*2) and compare answers. That exact example's not great because you've added 2 operations, but I can see that there are other clever techniques that might be possible.
What is nice is if you can do things like temporal redundancy (do the calculation twice, and if it's different, do it a third time), or even better some sort of "check calculation" that takes a small time compared to the mainline calculation.
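A minimal sketch of the temporal redundancy idea, in Python purely for illustration (nothing in this thread specifies a language). The names `redundant_call` and `make_flaky_divide` are made up for this example, and the "faulty divider" is a software stand-in for an arithmetic unit that occasionally returns a wrong answer. It assumes errors are transient, so a repeat of the same calculation is likely to come out right.

```python
from collections import Counter
from typing import Callable


def redundant_call(compute: Callable[[], float]) -> float:
    """Run compute twice; on disagreement, run a third time and take the majority."""
    first = compute()
    second = compute()
    if first == second:
        return first
    third = compute()
    value, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("three runs gave three different answers")
    return value


def make_flaky_divide(x: float, y: float, bad_calls: set) -> Callable[[], float]:
    """Hypothetical faulty divider: wrong quotient on the listed call numbers."""
    calls = 0

    def divide() -> float:
        nonlocal calls
        calls += 1
        result = x / y
        # Inject an error only on the selected calls (1-based call count).
        return result + 1.0 if calls in bad_calls else result

    return divide


if __name__ == "__main__":
    divide = make_flaky_divide(355.0, 113.0, bad_calls={2})
    print(redundant_call(divide) == 355.0 / 113.0)
```

The cost is two runs in the common case and three only when something went wrong, which is why a cheap check calculation would be even better where one exists.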
This, I think, is somewhere that even the big iron/cluster folks could be doing some research. What are optimum communication fabrics to support this kind of "side calculation", which may have different communication patterns and data flow than the "mainline"? It has a parallel in things like CRC checks in communications protocols. A lot of hardware has a dedicated little CRC checker that is continuously calculating the CRC as the bits arrive, so that when you get to the end of the frame, the answer is already there.
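The "CRC is already there when the frame ends" behaviour can be imitated in software. This is only a sketch: `RunningCrc32` and `receive_frame` are names invented here, and it works on byte chunks rather than individual bits as real hardware would. It relies on the standard-library `zlib.crc32`, which accepts a previous CRC value as its second argument.

```python
import zlib
from typing import Iterable


class RunningCrc32:
    """Keeps a CRC-32 up to date as chunks of a frame arrive."""

    def __init__(self) -> None:
        self.value = 0

    def feed(self, chunk: bytes) -> None:
        # Continue the CRC from the previous value instead of starting over.
        self.value = zlib.crc32(chunk, self.value)


def receive_frame(chunks: Iterable[bytes], expected_crc: int) -> bool:
    """Return True if the CRC accumulated over all chunks matches expected_crc."""
    crc = RunningCrc32()
    for chunk in chunks:
        crc.feed(chunk)
    # No second pass over the data is needed at the end of the frame.
    return crc.value == expected_crc


if __name__ == "__main__":
    frame = b"a cluster of arduinos"
    pieces = [frame[i:i + 4] for i in range(0, len(frame), 4)]
    print(receive_frame(pieces, zlib.crc32(frame)))
```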
And Doug, your small systems have a lot of the same issues, perhaps because that small Limulus might be operated in environments other than what the underlying hardware was designed for. I know people who have been rudely surprised when they found that the design environment for a laptop is a pretty narrow temperature range (e.g. office desktop) and when they put them in a car, subject to 0C or 40C temperatures, if not wider, that things don't work quite as well as expected.
Very small systems (few nodes) have the same issues in some environments (e.g. a cluster subject to single event upsets or functional interrupts in a high radiation environment with a lot of high energy charged particles; it's not so much a total dose thing as a single event effects (SEE) thing).
For Juno (which will be in polar orbit around Jupiter), we shielded everything in a vault (a 1 meter cube with 1 cm thick titanium walls) and still it's an issue. We don't get very long before everything is cooked.
And I think that with a non-trivially small cluster (e.g. more than 4 nodes) you could do a lot of experimentation on techniques.
(oddly, simulated fault injection is one of the trickier parts)
_______________________________________________
Beowulf mailing list, Beowulf*******sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
-----Original Message-----
From: beowulf-bounces*******] On Behalf Of Ellis H. Wilson III
Sent: Thursday, January 12, 2012 9:26 AM
To: beowulf*******
Subject: Re: [Beowulf] A cluster of Arduinos
On 01/12/2012 10:21 AM, Vincent Diepeveen wrote:
> On Jan 12, 2012, at 4:10 PM, Lux, Jim (337C) wrote:
>> This is exactly the population you want to hit. Bring in 100
>> advanced high school (grade 11-12 in US) students. Have them all use
>> cheap hardware to do a cluster. Some fraction will think, "this is
>> kind of cool, maybe I should major in CS instead of X" Some fraction
>> will think,
>
> Your example here will just take care a big number of students don't
> want to have to do anything with those studies, as there is a few lame
> nerds there who toy with equipment that's factor 50k slower (adding to
> the factor 500 the object oriented slowdown of factor 100) than what
> they have at home, and it can do nothing useful.
>
> But in this specific case you'll just scare away students and the real
> clever ones will get total desinterested as you are busy with lame
> duck speed type cpu's.
You have made it abundantly clear you aren't interested in enrolling in such a course. Thanks for your comments.
On a related note, as I was thinking about 'lame duck' education, I remembered that I took an undergraduate machine learning course in which we designed players for connect-four, which would compete using recently learned techniques against other students in the class. Despite that particular game being a solved one, we all had a blast and got quite competitive trying to beat each other out using the recently acquired skills. I would encourage Jim to do something similar once the basics of cluster administration are done -- perhaps a mini SC Cluster Competition would be a neat application for the Arduinos?
----------------------------------------
Ooohh.. that sounds *very* cool..
A bunch of slow processors.
A simple problem to solve (e.g. 3D tic-tac-toe) for which there might even be published parallel approaches
The challenge is effectively using the limited system, warts and all.
The Raspberry Pi might be a better vehicle, if it hits the price/availability targets: comparable to Arduinos in price, but a bit more sophisticated and less contrived.
We've been talking about what kind of software competitions JPL could run as a recruiting tool at Universities, and that's along those lines. Hmm... I wonder if they'd be willing to spend recruiting funds on that? (probably not.. we're all poor this fiscal year)
And, on the undergrad education thing... At UCLA, I had to write stuff in MIXAL to run on a simulated MIX machine and complained mightily to the TAs, who just pointed to the sacred texts of Knuth, rather than giving an intelligent response as to why we didn't do something like work in PDP-11 ASM or System/360 BAL. (UCLA at the time had a monster 360, but I don't know that they had many 11s, and realistically, BAL is not something I'd inflict on 2nd quarter first year students. We were a PL/I or PL/C shop in the first couple years' classes for the most part, although there were people doing Algol)
OTOH, I suspect I was an atypical incoming student for 1977.
I had, the previous year, done the Pascal courses at UCSD with p-machines running on LSI-11s as well as the Pascal system on the big Burroughs B6700, which uses a form of ALGOL as the machine language and is a stack machine to boot (how cool is that? Burroughs always did have cool machines.. Hey, they built ILLIAC IV). I had also done some ASM stuff on an 11/20 under RT-11. I guess that's characteristic of the differences in philosophy between different CS departments (UCSD was heading more in the direction of Software Engineering being part of the School of Engineering and Applied Sciences, while at UCLA it was part of the Math department). Little did I know, as a cybernetics major, what the difference was: it sure as heck isn't manifested in the course catalog, at least in a form that an incoming student could discern. Going back now, I could probably look at catalogs from the various universities of the era and divine their philosophies, but that's clearly 20/20 hindsight.
Jim,
Have you ever interacted with the "Modeling Instruction" folks over at
ASU? http://modeling.asu.edu/
They've done, for HS Physics, more or less what you're talking about
in terms of making the subject engaging, compelling, and driven by
student, not teacher, interest.
--
- - - - - - - - - - - - - - - - - - - - -
Nathan Moore
Associate Professor, Physics
Winona State University
- - - - - - - - - - - - - - - - - - - - -
I will be curious to see where these things show up since
all you really need is a power plug. (a little nervous actually).
I agree. Four nodes is really small. BTW, the most fun in designing
this system is a set of tighter constraints than are found on the typical
cluster. Noise, power, space, cabling, low cost packaging, etc. I have
been asked about a rack mount version, we'll see.
One thing I find interesting is the core/node efficiency.
(what I call "effective cores") In general *on some codes*, I found that
fewer cores (1P micro-ATX, 4 cores) are more efficient than many
cores (2P server, 12 cores). Seems obvious, but I like to test things.
I would assume, because in a sense, the black swan* is
by definition hard to predict.
(* the book by Nick Taleb, not the movie)
--
Doug
Yes.. That *will* be interesting... And wait til someone has a cluster of
Limuluses (Not sure of the proper alliterative collective noun, nor the
plural form.. A litany of limuli? A school? A murder?)
Yes, because we're using, in general, commodity components/assemblies,
we're subject to the results of optimizations and market/business forces
in other user spaces. Someone designing a media PC for home use might not
care about electrical efficiency (there's no big yellow energy tags on
computers, yet), but would care about noise. Someone designing a rack
mounted server cares not a whit about noise, but really cares about a 10%
change in power consumption.
And, drop on top of that the non-synchronized differences in
development/manufacturing/fabrication generations for the underlying
parts. Consumer stuff comes out for the winter selling season. Commercial
stuff probably is on a different cycle. It's not like everyone uses the
same "model year changeover".
Not so much that, as the actual mechanics of fault injection. Think about
testing error detection and recovery for Flash memory. The underlying
specification error rate is something like 1E-9 or 1E-10/read, and that's
a worst case kind of spec, so errors aren't too common (i.e. you can't
just run and wait for them to occur). So how do you cause errors to occur
(without perturbing the system)?
In the flash case, because we developed our own flash controller logic in
an FPGA, we can add "error injection logic" to the design, but that's not
always the case. How would you simulate upsets in a CPU core, short of
blasting it with radiation? Which is difficult and expensive.. I wish it
was as easy as getting a little Co60 gamma source and putting it on top of
the chip.. We hike to somewhere that has an accelerator (UC Davis,
Brookhaven, etc.) and shoot protons and heavy ions at it.
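For classroom experiments on a small cluster, a purely software stand-in for this kind of error injection might look like the sketch below. It is not how the FPGA flash controller mentioned above works; `flip_random_bit` and `read_with_check` are hypothetical helpers, and a CRC-32 over a byte buffer stands in for whatever error detection the real system uses. Seeding the random generator makes an injected fault repeatable, which helps when debugging the recovery path.

```python
import random
import zlib
from typing import Tuple


def flip_random_bit(buf: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip one randomly chosen bit in place; return (byte index, bit index)."""
    byte_index = rng.randrange(len(buf))
    bit_index = rng.randrange(8)
    buf[byte_index] ^= 1 << bit_index
    return byte_index, bit_index


def read_with_check(buf: bytearray, stored_crc: int) -> bytes:
    """Return the buffer contents, raising if they no longer match stored_crc."""
    if zlib.crc32(buf) != stored_crc:
        raise ValueError("CRC mismatch: data corrupted")
    return bytes(buf)


if __name__ == "__main__":
    rng = random.Random(1234)          # fixed seed so the fault is repeatable
    page = bytearray(b"simulated flash page " * 4)
    crc = zlib.crc32(page)             # stored alongside the data at write time
    where = flip_random_bit(page, rng)
    try:
        read_with_check(page, crc)
    except ValueError as err:
        print("injected fault at", where, "->", err)
```

This only exercises the detection and recovery code; it says nothing about how often real upsets occur or which bits they hit.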
Black swans in this case would be things like the Pentium divide bug.
Yes.. That *would* be a challenge, but hey, we've got folks in our JPL
Laboratory for Reliable Software (LARS) who sit around thinking of how to
do that, among other things. ( http://lars-lab.jpl.nasa.gov/ ) Hmm.. I'll
have to go talk to those guys about clusters of pi or arduinos... They're
big into formal verifications, too, and model based verification. So you
could have a modeled system in SysML or UML and compare its behavior with
that on your prototype.
the "80 Plus" branding is pretty ubiquitous now, and the best part
is that commodity ATX parts are starting to show up at gold levels.
server vendors have offered gold or platinum for a while now, but it's
probably more important in the home, since personal machines spend more
time idling, thus running the PSU at low demand. poor-quality PSUs
are remarkably bad at low utilization.
regards, mark hahn.
Ahem, not around here, they're all black [1]. Now a white swan, that
would be something to see!
[1] http://www.flickr.com/photos/earthinmyeyes/4608041877/
cheers!
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel*******Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/