Comments on "Programming Language Wars: The Movie" (You Infinite Snake)

italian translations (http://www.mondolingua.com), 2010-12-01:

Hacker circles, huh? Pretty nice term. I believe that ever since the internet was invented, the programming language wars have been evolving and becoming more and more serious. Expressing thoughts and arguments is the main ingredient of the programming language wars. Let us just sit back and observe where these wars take us.

Isaac Gouy, 2010-09-18:

@Brighten Godfrey > "I'm not sure what the Benchmarks Game has done, but I wonder if there are very different implementations or algorithms for the two cases."

Select Python 3 and be amazed when the movie shows that running on quad core increases source code length!

(Incidentally, you may think that gzipped source code size has something to do with language "expressiveness", but if you're going to claim that's what it was intended to measure, then I think you need to be able to show where the benchmarks game website says that.)

Isaac Gouy, 2010-09-17 11:31:

@Brighten Godfrey > "I'm not sure what the Benchmarks Game has done, but I wonder if there are very different implementations or algorithms for the two cases."

If you don't know that the programs for the two cases are the same, then you don't know that they are directly comparable.

Here's an example where they are not: the spectral-norm Scala program (http://shootout.alioth.debian.org/u32/program.php?test=spectralnorm&lang=scala&id=1) is faster (elapsed time) on single core but slower on quad core than the spectral-norm Scala #2 program (http://shootout.alioth.debian.org/u32q/program.php?test=spectralnorm&lang=scala&id=2).

Would your visualization lead us to think that Scala source code size somehow increases when the programs are used on quad core?

@Brighten Godfrey > "on chameneos-redux with N=6M, moving from 1 to 4 cores JRuby"

Here are the 1 core sample measurements:

58.872s 58.971s 98,000KB
57.208s 57.326s 96,196KB
57.836s 57.929s 161,480KB
61.132s 61.215s 98,028KB
57.392s 57.465s 64,348KB
58.516s 58.596s 161,292KB

Here are the corresponding 4 core sample measurements:

327.648s 161.779s 50,004KB
315.780s 158.034s 50,248KB
300.635s 148.473s 50,400KB
317.620s 156.989s 50,024KB
312.956s 155.470s 50,140KB
318.024s 156.417s 51,000KB
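The sample measurements above can be summarized with a quick calculation. A minimal sketch, assuming the middle column is elapsed time and the first is CPU time summed across cores (an inference from Brighten's "57 sec -> 167 sec" figures elsewhere in the thread, not something the site states here):

```python
# JRuby chameneos-redux samples quoted above: (cpu_s, elapsed_s, mem_kb).
# The column interpretation is an assumption inferred from the thread.
one_core = [
    (58.872, 58.971, 98000), (57.208, 57.326, 96196),
    (57.836, 57.929, 161480), (61.132, 61.215, 98028),
    (57.392, 57.465, 64348), (58.516, 58.596, 161292),
]
quad_core = [
    (327.648, 161.779, 50004), (315.780, 158.034, 50248),
    (300.635, 148.473, 50400), (317.620, 156.989, 50024),
    (312.956, 155.470, 50140), (318.024, 156.417, 51000),
]

def mean(xs):
    return sum(xs) / len(xs)

elapsed_1 = mean([e for _, e, _ in one_core])
elapsed_4 = mean([e for _, e, _ in quad_core])
mem_1 = mean([m for _, _, m in one_core])
mem_4 = mean([m for _, _, m in quad_core])

# Elapsed time roughly 2.7x WORSE on quad core, memory roughly halved,
# which is exactly the "slower but smaller" movement being debated.
print(f"elapsed: {elapsed_1:.1f}s -> {elapsed_4:.1f}s "
      f"({elapsed_4 / elapsed_1:.1f}x slower)")
print(f"memory:  {mem_1:,.0f}KB -> {mem_4:,.0f}KB")
```

Under that column assumption, the averages match the direction of Brighten's summary: elapsed time goes up by a factor of about 2.7 while memory use drops by more than half.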
Isaac Gouy, 2010-09-17 10:57:

@tgamblin > "Shocking! Who knew?"

The obvious is always obvious once it's been pointed out.

I think you underestimate how ill-informed people can be.

tgamblin, 2010-09-17 09:58:

@Isaac: "You learn that having 4 cores on the machine doesn't necessarily mean your programs will go 4 times faster."

Shocking! Who knew?

Anyway, gripes aside, I do actually like the animation. Nice work, Brighten.

Isaac Gouy, 2010-09-17 09:39:

@tgamblin > "If they're not parallel benchmarks, then why test them on four cores? What do you learn?"

You learn that having 4 cores on the machine doesn't necessarily mean your programs will go 4 times faster.

You learn what kind of changes might be required in different language implementations to make use of those 4 cores.

@tgamblin > "I would start by setting the number of bodies higher than 5."

You'd start by magically creating some more planets between Jupiter and Saturn! :-)

@tgamblin > "If they're not really intended to be parallel benchmarks, none of these points matter anyway."

OK
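The "number of bodies" point quoted above can be made concrete: a direct-sum n-body integrator only has n(n-1)/2 pairwise interactions of work per timestep, so the 5-body task offers almost nothing to parallelize. A small sketch (the thresholds are illustrative, not measured):

```python
def pair_interactions(n):
    # Work available per timestep in a direct-sum n-body integrator:
    # one force evaluation per unordered pair of bodies.
    return n * (n - 1) // 2

# The CLBG task fixes n=5 (sun plus four giant planets): 10 pairs per step,
# far too little work to amortize thread startup and synchronization.
print(pair_interactions(5))
# With thousands of bodies there is ample per-step work to split across cores.
print(pair_interactions(1000))
```

This is why "just run it on four cores" does nothing for the benchmark as specified: any per-timestep parallelism is swamped by coordination overhead at n=5.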
Brighten Godfrey, 2010-09-16 18:30:

Isaac wrote: "Just exclude language implementations from the single core data that are not in the quad-core data - and re-run your analysis, and see what happens."

Since the question you asked above doesn't seem central to my main point, I'll forgo doing that calculation now, but you can figure out the number you're interested in by inspecting the figure above. (Also, when I said "I suspect that the size of the Pareto-optimal set does in general depend..." I was referring to Pareto-optimal sets in general (e.g., a random data set), not this particular data set.)

Isaac wrote: "You took the summary data for x86 single core and x86 quad core, and used the best scores from the single core data to normalize both the single core and quad core data?"

That's right. If they had been normalized by different values, they wouldn't be directly comparable.

Isaac wrote: "So now let's look at tgamblin's question - 'Also, Ruby moves up and to the right and uses *less* memory with more cores… wtf?' JRuby memory measurements for some tasks are a lot less on quad core than single core. Look at the quad-core measurements. Look at the one-core measurements."

I'm not sure what the Benchmarks Game has done, but I wonder if there are very different implementations or algorithms for the two cases. For example, on chameneos-redux with N=6M, moving from 1 to 4 cores JRuby gets substantially slower (57 sec -> 167 sec) but uses substantially less memory (161,480 KB -> 49,580 KB).

tgamblin, 2010-09-16 16:30:

@Isaac Gouy: "Those aren't 'parallel benchmarks' - the task set was established and measured on single core Intel Pentium 4 five years ago."

If they're not parallel benchmarks, then why test them on four cores? What do you learn? This is exactly my point. The parallel benchmarks aren't even things people want to do in parallel.

@Isaac Gouy: "Please design a program that uses multi core for that task and then contribute your program to the benchmarks game."

I might, someday when I have a little free time. But for anyone who tries this before me: I would start by setting the number of bodies higher than 5. If that constant is a requirement, then I imagine no one has succeeded in parallelizing the benchmark because it doesn't actually do much work.

Also, it's not too hard to find implementations that get speedup (http://www.parallelsp.com/results_openmp/nbody_omp.htm).

Caveat: you could argue that parallelizing the 5-body problem is interesting, and that you could perhaps parallelize over time. Maybe. People have looked into it. But that's a research problem, and I would counter that no language is going to help you do that.

@Isaac Gouy: "What OpenMP benchmarks? Which benchmarks have OpenMP as a requirement?"

The ones that use it -- you need to look at the code to see this. I didn't say *any* of them required it.

My point is that maybe the benchmarks, instead of comparing *languages* for parallelism, should look at the *programming model* instead (OpenMP, MPI, threads, Erlang actors, PGAS, etc.). Otherwise you're falsely attributing any speedup you see to the language, when maybe it only sped up because a particular parallel paradigm was available there.

Maybe you've answered my question, though. If they're not really intended to be parallel benchmarks, none of these points matter anyway.
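The normalization scheme Brighten confirms above (both datasets scaled by the best single-core score, so the two movie frames share one axis) can be sketched in a few lines. The numbers here are hypothetical stand-ins, not the actual CLBG summary data:

```python
# Hypothetical per-language times (seconds) on one task; not real CLBG data.
single_core = {"c-gcc": 1.0, "scala": 2.5, "jruby": 58.6}
quad_core   = {"c-gcc": 0.3, "scala": 1.1, "jruby": 156.2}

# Normalize BOTH datasets by the best (smallest) single-core time, so that
# single-core and quad-core results stay directly comparable on one scale.
best = min(single_core.values())
norm_single = {lang: t / best for lang, t in single_core.items()}
norm_quad   = {lang: t / best for lang, t in quad_core.items()}

# A score below 1.0 on quad core now genuinely means "faster than the best
# single-core result", which a per-dataset normalization would hide.
print(norm_single["c-gcc"], norm_quad["c-gcc"])
```

Normalizing each dataset by its own best score would instead pin both frames' leaders at 1.0, destroying exactly the cross-frame comparison the animation is meant to show.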
Isaac Gouy, 2010-09-16 14:13:

@Brighten Godfrey > "If you press the movie button in the bottom left, it will transition to results on a quad-core box. Still normalized by the best single-core score."

I think I got this wrong earlier, so let's be clear -

You took the summary data for x86 single core (http://shootout.alioth.debian.org/u32/summarydata.php) and x86 quad core (http://shootout.alioth.debian.org/u32q/summarydata.php), and used the best scores from the single core data to normalize *both* the single core and quad core data?

So now let's look at tgamblin's question - "Also, Ruby moves up and to the right and uses *less* memory with more cores… wtf?"

*JRuby* memory measurements *for some tasks* are a lot less on quad core than single core.

Look at the quad-core measurements (http://shootout.alioth.debian.org/u32q/measurements.php?lang=jruby).

Look at the one-core measurements (http://shootout.alioth.debian.org/u32/measurements.php?lang=jruby).

Is JVM GC behaviour the same when affinity forces the process onto one core and when four cores are available?

Can memory use measurements vary wildly depending on JVM GC behaviour?

tgamblin, 2010-09-16 13:48: [This comment has been removed by the author.]

Isaac Gouy, 2010-09-16 12:43: [This comment has been removed by the author.]

Isaac Gouy, 2010-09-16 12:18: [This comment has been removed by the author.]

Isaac Gouy, 2010-09-16 12:00:

@tgamblin > "Does it make sense to have parallel benchmarks for everything?"

Those aren't "parallel benchmarks" - the task set was established and measured on single core Intel Pentium 4 five years ago.

@tgamblin > "n-body is not parallelized… despite being one of the more interesting parallelizable problems in there"

Please design a program that uses multi core for that task and then contribute your program to the benchmarks game.

(So far people have tried to come up with a multi-core program that's faster than the sequential implementations but have not succeeded - so they don't bother contributing them.)

@tgamblin > "Another complaint: the OpenMP benchmarks"

What OpenMP benchmarks? Which benchmarks have OpenMP as a requirement?

Isaac Gouy, 2010-09-16 11:43:

@Brighten Godfrey > "I suspect that the size of the Pareto-optimal set does in general depend..."

Just exclude language implementations from the single core data that are not in the quad-core data - and re-run your analysis, and see what happens.
Brighten Godfrey, 2010-09-15 21:32:

There are a lot of interesting comments about how flawed these benchmarks are. Let me just point out that the Benchmarks Game folks are quick to acknowledge that. If you don't like the benchmarks, then perhaps you will at least find that this visualization can reveal some new questionable properties of the data!

Eivind: Yes, this visualization uses the fastest implementation of each language; as you've pointed out, even within a language there can be a significant tradeoff between the metrics. I wouldn't say that means the data is flawed. It's just showing one point in the tradeoff space: an incomplete picture of each language.

Isaac wrote, regarding the fraction of languages that are Pareto-optimal: "Does it make any difference to your statistics that the single core data simply includes more language implementations than the quad core data?"

My main point was that a reasonably large fraction of languages are Pareto-optimal. Beyond that I'm not making any claims, and I wouldn't read too much into the particular numbers 27% and 48%. (But yes, I suspect that the size of the Pareto-optimal set does in general depend in nontrivial ways on the number of points in the set. I would guess that larger sets imply more Pareto-optimal points but a smaller fraction of points being Pareto-optimal. Maybe theorists have analyzed this... In this particular case, the larger set has *fewer* Pareto-optimal points and a smaller fraction of points being Pareto-optimal.)

tgamblin, 2010-09-15 18:33:

I guess I'm happy there are quad-core numbers on here, but are they really useful? I looked through the benchmarks, and support for parallelism is spotty; I'm not really sure they represent things people even want to do in parallel. Does it make sense to have parallel benchmarks for everything?

For example, the binary-trees benchmark "allocates MANY, MANY binary trees". Yeah, ok, that's embarrassingly parallel. Yay! What are you going to do with the trees once you have them allocated, and how much memory bandwidth do you need to do it all on your quad-core machine?

Also disappointing: n-body is not parallelized… despite being one of the more interesting parallelizable problems in there. And the various ports of it (e.g. to OCaml) look like they're really just line-for-line ports of the C version. Wouldn't you write this differently if you were really an OCaml person?

Another complaint: the OpenMP benchmarks in there don't seem to use any inter-thread communication more complicated than a reduction, and they're GNU OpenMP. Nobody uses GNU OpenMP because it sucks compared to all the other implementations. Granted, the alternatives are commercial, but the GNU guys only test their stuff for correctness, not performance.

Finally, is parallelism really rightly expressed as a function of the programming language? You've got a few different parallel programming models represented here: threads, OpenMP, Erlang's actor model (if you can call that different from threads), and essentially one of those is used in a few different languages. Why not compare the programming model instead? You can do threads in pretty much any of these languages, and it'd be interesting to see how that does against models (like OpenMP and, I assume, Erlang) where the compiler helps out with the parallelization to some degree.

Also, Ruby moves up and to the right and uses *less* memory with more cores… wtf?

Ok, I think I've fallen for this sufficiently now :-).
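The Pareto-optimality notion Brighten relies on (a language is on the frontier if no other language beats it on both time and memory) is easy to compute directly. A minimal sketch over hypothetical (time, memory) scores, not the actual summary data:

```python
# Hypothetical (time, memory) scores per language; lower is better on both.
scores = {
    "c-gcc":  (1.0, 3.0),
    "java":   (1.5, 2.0),
    "python": (40.0, 1.5),
    "jruby":  (60.0, 9.0),  # beaten on both metrics by every other entry
}

def dominates(a, b):
    """a dominates b if a is no worse on both metrics and not identical."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

# A point is Pareto-optimal when no other point dominates it.
pareto = {lang for lang, p in scores.items()
          if not any(dominates(q, p) for q in scores.values())}

print(sorted(pareto), f"{len(pareto) / len(scores):.0%}")
# ['c-gcc', 'java', 'python'] 75%
```

Isaac's suggestion upthread amounts to filtering `scores` down to the languages present in both datasets before computing this fraction, so the 27% and 48% figures are taken over the same set of points.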
Isaac Gouy, 2010-09-15 10:01:

"If you have measurements, why not improve them with a visualization?"

"Improve them" is good by definition - unfortunately a visualization can also be a meaningless mush of bogus relationships and misunderstanding :-)

"To better visualize which languages' implementations took advantage of parallelism ... Some stay in the same spot, probably indicating that the Computer Language Benchmarks Game doesn't have the best implementations."

That seems to be a plucked-out-of-thin-air guess. Shouldn't you have figured out what (if anything) it indicates?

"For the single-core case, 27% of the languages are on the list above; for quad-core, 48% made the cut."

Does it make any difference to your statistics that the single core data simply includes more language implementations than the quad core data?

"Guillaume Marceau produced informative plots ... showing the tradeoffs between metrics which the CLBG has now included on their site"

No, Guillaume Marceau's plots are not included on the website. Yes, because Guillaume Marceau showed people were interested in seeing code-used vs time-used scatterplots, I designed a new chart.

Ayveen, 2010-09-15 06:18:

FWIW, there's also been some theoretical work on comparing programming languages on expressiveness. E.g. http://www.uclouvain.be/en-80042.html

Ashish

Anonymous, 2010-09-15 04:54:

The same is somewhat true for the C/C++ implementations: using specialized SIMD instructions to get better performance, or using the GMP library, which reduces all languages to "who can call a library faster".

Eivind, 2010-09-15 01:41:

I think these data are flawed: although writing vanilla programs is encouraged (simple but good code), some of the entries have made an extraordinary effort to make their programs fast. For instance, look at the Haskell entry for k-nucleotide. It rolls its own hash table, claiming that there is no library hash table; but there is one, and using it would cut the code to about a quarter of its size.
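Both Eivind's code-size point and Isaac's earlier remark about gzipped source size concern the same metric: the CLBG scores "code size" as the byte length of the gzip-compressed source. A minimal sketch of that measurement, using two made-up snippets rather than real benchmark entries:

```python
import gzip

# Two made-up "programs": one highly repetitive, one with varied content.
repetitive = "x = x + 1\n" * 30
varied = "".join(f"v{i} = {i * 7} % 13\n" for i in range(30))

def gz_size(src):
    # The CLBG code-size metric: bytes of source after gzip compression.
    # Compressing discounts repeated boilerplate relative to raw length.
    return len(gzip.compress(src.encode("utf-8")))

print(gz_size(repetitive), gz_size(varied))
```

Because compression collapses repetition, a hand-rolled hash table full of near-duplicate code costs less under this metric than its raw line count suggests, though, as Isaac notes, whether gzipped size measures "expressiveness" is a separate claim.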