Programming Language Wars: The Movie

In computer science and hacker circles, the programming language wars have, it seems, been raging since the beginning of time. A little electronic archaeology reveals some amusing exchanges:

  • "By all means create your own dialect of FORTH. While your at it, you can add the best features of PL-I, F77 and CORAL66. Then, look me up when you get out of college and we'll show you how it's done when you have to make a living" [1985 thread]
  • "This debate ... is very much like two engineers engaged in building a three-mile bridge arguing over the brand of blue-print paper they use." [1987 thread]

Passionate arguments can often be improved by actual measurements. How fast, expressive, and efficient is a particular language? That's what The Computer Language Benchmarks Game set out to provide, measuring time, source code length, and memory use of several dozen languages across a set of benchmarks.

If you have measurements, why not improve them with a visualization? And so I present to you an interactive, multi-dimensional, dynamic, whizbang-o-matic rendering of the Programming Language Wars.

Each circle is a language. Its horizontal position represents the gzipped size of the source code used to implement the benchmarks, which is intended as a rough measure of the language's "expressiveness". Its vertical position represents the real time used to execute the benchmarks, and its size and color indicate how much memory was used.

The cluster of languages in the top left consists of slow but expressive scripting languages. At the bottom right you will find C and C++, the fastest languages, but ones that take quite a bit more coding to get the job done. In between lies a tradeoff between speed and expressiveness, where you'll find languages like OCaml (which I happen to use whenever possible).

Actually, each point is only a summary of the language's performance: Consider some metric, like real time, and some particular language L. The Benchmarks Game folks ran implementations of a set of about 12 benchmarks (FASTA, Mandelbrot, ...) in L. L's time for each benchmark is divided by the best time across all languages for that benchmark. This gives us a normalized score for each benchmark; we take the median of these to produce a summary real time score for L. Then we do the same for the other metrics: CPU time, source code length, and memory.
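To make that summarization concrete, here is a minimal sketch in Python. The benchmark names are real, but the timings are invented for illustration; the real inputs would be the Benchmarks Game measurement files:

    from statistics import median

    # Invented timings in seconds (smaller is better); real data would come
    # from the Benchmarks Game measurement files.
    times = {
        "fasta":      {"C gcc": 1.2, "OCaml": 1.9, "Python": 38.0},
        "mandelbrot": {"C gcc": 2.1, "OCaml": 3.5, "Python": 120.0},
        "n-body":     {"C gcc": 1.3, "OCaml": 1.1, "Python": 45.0},
    }

    def summary_score(lang):
        """Median, over benchmarks, of lang's time divided by the best time."""
        ratios = []
        for per_lang in times.values():
            if lang in per_lang:  # some languages lack some implementations
                ratios.append(per_lang[lang] / min(per_lang.values()))
        return median(ratios)

    for lang in ("C gcc", "OCaml", "Python"):
        print(lang, round(summary_score(lang), 2))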

The plot shows data for a single-core x86 box (assuming you haven't yet messed around with the controls). If you press the movie button in the bottom left, it will transition to results on a quad-core box. (Still normalized by the best single-core score. The labels say 1901 and 1904 since Google's API wants dates.) TIP: When you play the animation, select a few languages you're interested in and check the Trails checkbox, so the movement stands out.

To better visualize which languages' implementations took advantage of parallelism, select the languages you're curious about, check Trails, and then click Play. The languages that move downward have improved their real time. Some stay in the same spot, probably indicating that the Computer Language Benchmarks Game doesn't have the best implementations for them.

Fine. Just tell me which language is best.

These benchmarks are almost certainly not representative of what you want to do. There are various flaws in this approach — how we choose to summarize (the median here) will affect the ordering of languages; the implementations are not perfect; some languages are missing implementations for some benchmarks; even for one language there are many possible implementations with different tradeoffs and only the fastest was tested; and so on. Perhaps most significantly, we're completely lacking important metrics like programmer time, maintainability, and bugginess.

Thus, just as someone out there thinks Circus Peanut Gelatin pie is a good idea, so most of these languages are the right tool for some job. We can't use these benchmarks to brand a language as useless. What I think the benchmarks and visualization can do is introduce you to general-purpose languages that may be a better solution for many tasks.

In particular, you might want to take a gander at the Pareto-optimal languages: those not dominated by any other language, meaning that for every other language L, they are better than L in at least one of the metrics under consideration. If we consider source code length and real time as the two metrics, then the Pareto-optimal languages are as follows (a code sketch of the computation appears after the lists):

1 core (from more expressive at top to faster at bottom):

  • Ruby 1.9
  • Ruby JRuby
  • JavaScript TraceMonkey
  • Python PyPy
  • JavaScript V8
  • Lua LuaJIT
  • Haskell GHC
  • Java 6 Steady State
  • C GNU gcc

4 cores (from more expressive at top to faster at bottom):

  • Ruby JRuby
  • Python CPython
  • Erlang HiPE
  • OCaml
  • Haskell GHC
  • F# Mono
  • Scala
  • Java 6 Steady State
  • C GNU gcc
  • C++ GNU g++

From top to bottom, these languages trace the best points in the tradeoff between expressiveness (at top) and speed (at bottom). Perhaps what this does best is to illustrate why it is hard to pick a "best" language. For the single-core case, 27% of the languages are on the list above; for quad-core, 48% made the cut. Even with just two simple metrics, many languages might be considered "optimal"!
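For the curious, here is a minimal sketch of how such a Pareto set can be computed from per-language summary scores. The scores are hypothetical (smaller is better on both axes), not the Benchmarks Game numbers:

    # Hypothetical (gzipped source size, real time) summary scores,
    # both smaller-is-better; not the Benchmarks Game numbers.
    scores = {
        "Python": (1.0, 12.0),
        "OCaml":  (1.4,  1.8),
        "C gcc":  (2.0,  1.0),
        "Java":   (1.8,  2.5),  # dominated by OCaml: worse on both metrics
    }

    def dominates(a, b):
        """True if a is at least as good as b everywhere and better somewhere."""
        return all(x <= y for x, y in zip(a, b)) and a != b

    pareto = [lang for lang, s in scores.items()
              if not any(dominates(t, s) for t in scores.values())]
    print(pareto)  # ['Python', 'OCaml', 'C gcc']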

Coming soon: a visualization of the 3D tradeoff space.

Notes

Last year, Guillaume Marceau produced informative plots based on The Computer Language Benchmarks Game, similarly showing the tradeoffs between metrics. The CLBG now includes similar plots on its site. [Updated: the CLBG didn't use Marceau's plots directly.] The visualization here summarizes more (which can be good and bad), includes the memory metric and the quad-core data, and lets you interactively play with the data thanks to Google Motion Charts. A chat with Rodrigo Fonseca several years ago got me interested in plotting these tradeoffs. Finally, my apologies to those without Flash for the chart.

20 comments:

  1. I think these data are flawed: although writing vanilla programs is encouraged (simple but good code), some of the entries have gone to extraordinary effort to make their programs fast. For instance, look at the Haskell entry for k-nucleotide. It rolls its own hash table. It claims that there is no hash table library, but there is one, and using it would cut the code to about a quarter of its size.

  2. The same is somewhat true for the C/C++ implementations: using specialized SIMD instructions to get better performance, or using the GMP library, which reduces all languages to a contest of who can call a library faster.

  3. FWIW, there's also been some theoretical work on comparing programming languages on expressiveness. E.g. http://www.uclouvain.be/en-80042.html

    Ashish

  4. If you have measurements, why not improve them with a visualization?

    "Improve them" is good by definition - unfortunately a visualization can also be a meaningless mush of bogus relationships and misunderstanding :-)


    To better visualize which languages' implementations took advantage of parallelism ... Some stay in the same spot, probably indicating that the Computer Language Benchmarks Game doesn't have the best implementations.

    That seems to be a plucked-out-of-thin-air guess.

    Shouldn't you have figured out what (if anything) it indicates?


    For the single-core case, 27% of the languages are on the list above; for quad-core, 48% made the cut.

    Does it make any difference to your statistics that the single core data simply includes more language implementations than the quad core data?


    Guillaume Marceau produced informative plots ... showing the tradeoffs between metrics which the CLBG has now included on their site

    No, Guillaume Marceau's plots are not included on the website.

    Yes: because Guillaume Marceau showed people were interested in seeing code-used / time-used scatterplots, I designed a new chart.

  5. I guess I'm happy there are quad-core numbers on here, but are they really useful? I looked through the benchmarks and support for parallelism is spotty, and I'm not really sure they represent things people even want to do in parallel. Does it make sense to have parallel benchmarks for everything?

    For example, the binary-trees benchmark "allocates MANY, MANY binary trees". Yeah, ok, that's embarrassingly parallel. Yay! What are you going to do with the trees once you have them allocated, and how much memory bandwidth do you need to do it all on your quad-core machine?

    Also disappointing: n-body is not parallelized… despite being one of the more interesting parallelizable problems in there. And the various ports of it (e.g. to OCaml) look like they're really just line-for-line ports of the C version. Wouldn't you write this differently if you were really an OCaml person?

    Another complaint: the OpenMP benchmarks in there don't seem to use any inter-thread communication more complicated than a reduction, and they use GNU OpenMP. Nobody uses GNU OpenMP because it sucks compared to all the other implementations. Granted, the others are commercial, but the GNU folks only test their stuff for correctness, not performance.

    Finally, is parallelism really rightly expressed as a function of the programming language? You've got a few different parallel programming models represented here: threads, OpenMP, Erlang's actor model (if you can call that different from threads), and essentially one of those is used in a few different languages. Why not compare the programming model instead? You can do threads in pretty much any of these languages, and it'd be interesting to see how that does against models (like OpenMP and, I assume, Erlang) where the compiler helps out with the parallelization to some degree.

    Also, Ruby moves up and to the right and uses *less* memory with more cores… wtf?

    Ok, I think I've fallen for this sufficiently now :-).

  6. There are a lot of interesting comments about how flawed these benchmarks are. Let me just point out that the Benchmarks Game folks are quick to acknowledge that. If you don't like the benchmarks, then perhaps you will at least find that this visualization can reveal some new questionable properties of the data!

    Elvind: Yes, this visualization uses the fastest implementation of each language; as you've pointed out, even within a language there can be a significant tradeoff between the metrics. I wouldn't say that means the data is flawed. It's just showing one point in the tradeoff space: an incomplete picture of each language.

    Isaac wrote, regarding the fraction of languages that are Pareto-optimal: "Does it make any difference to your statistics that the single core data simply includes more language implementations than the quad core data?"

    My main point was that a reasonably large fraction of languages are Pareto-optimal. Beyond that I'm not making any claims, and I wouldn't read too much into the particular numbers 27% and 48%. (But yes, I suspect that the size of the Pareto-optimal set does in general depend in nontrivial ways on the number of points in the set. I would guess that larger sets imply more Pareto-optimal points but a smaller fraction of points being Pareto-optimal. Maybe theorists have analyzed this... In this particular case, the larger set has fewer Pareto-optimal points and a smaller fraction of points being Pareto-optimal.)
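    For what it's worth, here is a quick back-of-the-envelope simulation of my own (uniformly random two-dimensional points, nothing to do with the Benchmarks Game data) that is consistent with that guess: the count of Pareto-optimal points grows with the size of the set, but the fraction shrinks.

        import random

        def pareto_count(points):
            """Count points not dominated (smaller is better on both axes)."""
            return sum(
                not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
                for p in points
            )

        random.seed(0)
        for n in (10, 100, 1000):
            pts = [(random.random(), random.random()) for _ in range(n)]
            k = pareto_count(pts)
            print(f"n={n:4d}  pareto-optimal={k:2d}  fraction={k/n:.1%}")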

  7. @Brighten Godfrey > I suspect that the size of the Pareto-optimal set does in general depend...

    Just exclude language implementations from the single core data that are not in the quad-core data - and re-run your analysis, and see what happens.

  8. @tgamblin > Does it make sense to have parallel benchmarks for everything?

    Those aren't "parallel benchmarks" - the task set was established and measured on a single-core Intel Pentium 4 five years ago.


    @tgamblin > n-body is not parallelized… despite being one of the more interesting parallelizable problems in there

    Please design a program that uses multi core for that task and then contribute your program to the benchmarks game.

    (So far people have tried to come up with a multi core program that's faster than the sequential implementations but have not succeeded - so they don't bother contributing them.)


    @tgamblin > Another complaint: the OpenMP benchmarks

    What OpenMP benchmarks? Which benchmarks have OpenMP as a requirement?

  9. This comment has been removed by the author.

  10. This comment has been removed by the author.

  11. This comment has been removed by the author.

  12. @Brighten Godfrey > If you press the movie button in the bottom left, it will transition to results on a quad-core box. Still normalized by the best single-core score.

    I think I got this wrong earlier, so let's be clear -

    You took the summary data for x86 single core and x86 quad core, and used the best scores from the single core data to normalize both the single core and quad core data?


    So now let's look at tgamblin's question - Also, Ruby moves up and to the right and uses *less* memory with more cores… wtf?

    JRuby memory measurements for some tasks are a lot less on quad core than single core.

    Look at the quad-core measurements.

    Look at the one-core measurements.


    Is JVM GC behaviour the same when affinity forces the process onto one core and when four cores are available?

    Can memory use measurements vary wildly depending on JVM GC behaviour?

  13. @Isaac Gouy: Those aren't "parallel benchmarks" - the task set was established and measured on a single-core Intel Pentium 4 five years ago.

    If they're not parallel benchmarks, then why test them on four cores? What do you learn? This is exactly my point. The parallel benchmarks aren't even things people want to do in parallel.

    @Isaac Gouy: Please design a program that uses multi core for that task and then contribute your program to the benchmarks game.

    I might, someday when I have a little free time. But for anyone who tries this before me: I would start by setting the number of bodies higher than 5. If that constant is a requirement, then I imagine no one has succeeded in parallelizing the benchmark because it doesn't actually do much work.

    Also, it's not too hard to find implementations that get speedup.

    Caveat: you could argue that parallelizing the 5-body problem is interesting, and that you could perhaps parallelize over time. Maybe. People have looked into it. But that's a research problem, and I would counter that no language is going to help you do that.

    @Isaac Gouy: What OpenMP benchmarks? Which benchmarks have OpenMP as a requirement?

    The ones that use it -- you need to look at the code to see this. I didn't say any of them required it.

    My point is that maybe the benchmarks, instead of comparing languages for parallelism, should look at the programming model instead (OpenMP, MPI, threads, Erlang actors, PGAS, etc.). Otherwise you're falsely attributing any speedup you see to the language, when maybe it only sped up because a particular parallel paradigm was available there.

    Maybe you've answered my question, though. If they're not really intended to be parallel benchmarks, none of these points matter anyway.

  14. Isaac wrote: "Just exclude language implementations from the single core data that are not in the quad-core data - and re-run your analysis, and see what happens."

    Since the question you asked above doesn't seem central to my main point, I'll forgo doing that calculation now, but you can figure out the number you're interested in by inspecting the figure above. (Also, when I said "I suspect that the size of the Pareto-optimal set does in general depend..." I was referring to Pareto-optimal sets in general (e.g., a random data set), not this particular data set.)

    You took the summary data for x86 single core and x86 quad core, and used the best scores from the single core data to normalize both the single core and quad core data?

    That's right. If they had been normalized by different values, they wouldn't be directly comparable.


    So now let's look at tgamblin's question - "Also, Ruby moves up and to the right and uses *less* memory with more cores… wtf?" JRuby memory measurements for some tasks are a lot less on quad core than single core. Look at the quad-core measurements. Look at the one-core measurements.

    I'm not sure what the Benchmarks Game has done, but I wonder if there are very different implementations or algorithms for the two cases. For example, on chameneos-redux with N=6M, moving from 1 to 4 cores JRuby gets substantially slower (57 sec -> 167 sec) but uses substantially less memory (161,480 KB -> 49,580 KB).

  15. @tgamblin > If they're not parallel benchmarks, then why test them on four cores? What do you learn?

    You learn that having 4 cores on the machine doesn't necessarily mean your programs will go 4 times faster.

    You learn what kind of changes might be required in different language implementations to make use of those 4 cores.


    @tgamblin > I would start by setting the number of bodies higher than 5.

    You'd start by magically creating some more planets between Jupiter and Saturn! :-)


    @tgamblin > If they're not really intended to be parallel benchmarks, none of these points matter anyway.

    OK

  16. @Isaac: You learn that having 4 cores on the machine doesn't necessarily mean your programs will go 4 times faster.

    Shocking! Who knew?


    Anyway, gripes aside, I do actually like the animation. Nice work, Brighten.

  17. @tgamblin > Shocking! Who knew?

    The obvious is always obvious once it's been pointed out.

    I think you underestimate how ill informed people can be.

  18. @Brighten Godfrey > I'm not sure what the Benchmarks Game has done, but I wonder if there are very different implementations or algorithms for the two cases.

    If you don't know that the programs for the two cases are the same then you don't know that they are directly comparable.

    Here's an example where they are not - the spectral-norm Scala program is (elapsed time) faster on single-core but slower on quad-core than the spectral-norm Scala #2 program.

    Would your visualization lead us to think that somehow Scala source code size increases when the programs are used on quad-core?


    @Brighten Godfrey > on chameneos-redux with N=6M, moving from 1 to 4 cores JRuby

    Here are the 1 core sample measurements (CPU time, elapsed time, memory):

    58.872s 58.971s 98,000KB
    57.208s 57.326s 96,196KB
    57.836s 57.929s 161,480KB
    61.132s 61.215s 98,028KB
    57.392s 57.465s 64,348KB
    58.516s 58.596s 161,292KB

    Here are the corresponding 4 core sample measurements (CPU time, elapsed time, memory):

    327.648s 161.779s 50,004KB
    315.780s 158.034s 50,248KB
    300.635s 148.473s 50,400KB
    317.620s 156.989s 50,024KB
    312.956s 155.470s 50,140KB
    318.024s 156.417s 51,000KB

  19. @Brighten Godfrey > I'm not sure what the Benchmarks Game has done, but I wonder if there are very different implementations or algorithms for the two cases.

    Select Python 3 and be amazed when the movie shows running on quad core increases source code length!

    (Incidentally, you may think that gzipped source code size has something to do with language "expressiveness" but if you're going to claim that's what it was intended to measure then I think you need to be able to show where the benchmarks game website says that.)

  20. "Hacker circles", huh? That's a pretty nice term. I believe that ever since the internet was invented, the programming language wars have been evolving and becoming more and more serious. Expressing thoughts and arguments is the main ingredient of the programming language wars. Let us just sit back and observe where these programming language wars take us.
