You Infinite Snake

On getting rejected

2014-02-16T13:35:00.000-08:00

One of the most frustrating things when starting a career in academic research is getting your paper rejected.

Even with more experience, no one enjoys a rejection and we all prefer to see good news.

Quite often this happens on the first project you tackle. A new student might work for a year on a project, pouring in effort and passion and producing something that seems to have real merit ... only to be hit with a cold, hard rejection.

And then, with the computer science conference submission schedule, you have no opportunity to respond to the reviewers and you might have to wait six months or more for another appropriate chance to submit. Science is a slow process.

But there's one bit of silver lining: A paper's rejection doesn't mean your research is bad.

In fact, many or most of the papers you see in their final polished form in top venues went through rejections -- even the best papers. Here's a thought experiment. Among a set of published papers, some will have gotten in on the first try, some were rejected first, others were rejected multiple times. Ultimately, how impactful are the papers that got in on the first try, compared to the rejected ones?

To answer that let's (imperfectly) measure impact in terms of citations. Here are my own published research projects, showing the number of times the project received a rejection (X axis), and the number of citations per year (Y axis), as of a few weeks ago:

First you'll note that most of my projects have been rejected, either with a failed workshop, conference, or journal submission, before reaching successful publication later. And furthermore, among these published papers, there is apparently no correlation between whether the project has been rejected, and its eventual impact. In fact, if we were to draw a best fit line:

...then we see that my published projects have received 3.96 more citations per year, per rejection. Not bad considering the average number of citations per year here is 13.4. This is not a very robust method given the small sample and skewed distribution of citations. But a 2012 study by Calcagno et al. of 923 journals similarly showed that rejected papers received "significantly" more citations than those accepted on the first submission.
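For readers who want to try the same best-fit calculation on their own publication record, here is a minimal sketch of an ordinary least-squares line fit. The data points are hypothetical placeholders, not the numbers behind the plot, and `fit_line` is a name made up for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# HYPOTHETICAL data, not the author's: one entry per published project.
rejections = [0, 0, 1, 1, 2, 3]
citations_per_year = [5.0, 9.0, 8.0, 14.0, 15.0, 21.0]

slope, intercept = fit_line(rejections, citations_per_year)
print(f"{slope:.2f} extra citations per year, per rejection")
```

The slope is the "citations per year, per rejection" figure. With so few points and a skewed distribution, it is sensitive to a single outlier, which is the robustness caveat noted above.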

This might be counterintuitive. Doesn't a rejection mean that the project is less exciting or less well executed, which would certainly imply lower impact? Perhaps, but there are at least two other factors:

A rejection can improve the quality of a project's next submission, due to feedback from peer review and time to improve the manuscript.
Authors might decide, based on the rejection, not to bother resubmitting. These dead manuscripts dropped out of my sample.

Here's what I think this says: You should let your best judgement, not the reviewers' decision, guide your research. Certainly you should give careful consideration to how reviewers reacted to your paper, but don't automatically take a rejection as an indication of the quality of a project. If you still believe that this thing is a good thing, then you are probably right. It is a good thing, with at least as much impact potential as an immediately accepted paper.

OK ... now go have some ice cream.

Proposal: CoolNets 2014

2013-11-06T16:51:00.000-08:00

The SIGCOMM workshop proposal deadline is coming up. But the community already has many workshops devoted to popular hot topics and multiple sessions on SDN at every top networking venue. Here's my proposal.

CoolNets 2014: The First Workshop on Cool Topics in Networks

The First Workshop on Cool Topics in Networks (CoolNets 2014) seeks papers describing totally groovy contributions to the field of computer communication networks that are not related to presently anointed hot topics. We invite submissions on a wide range of networking research, including, but not limited to, results which are awesome, neat-o, nifty, keen, swell, rad, da bomb, badass, and slick as a greased herring in a butter factory, while being refreshingly orthogonal to:

software-defined networking
cloud computing
content-centric networks
virtualization of networks, network functions, and machines
big data

Submissions on the above topics will be held without review for five years, and then fast-tracked to a prestigious journal unless IP-based QoS has achieved 80% deployment. Such submissions are considered not cool, dude.

Strong submissions may break ground in new research directions, reveal insights about longstanding problems, develop alternative solutions whose clean design will have lasting value, forge connections across fields, deconstruct misconceptions, contribute solid technical advances in important areas, or build networked systems that are just pretty darned cool.

Multi-part series of totally tubular papers will be considered.

Creating research ideas over time

2013-09-02T20:39:00.000-07:00

Research is built on ideas: identifying questions to investigate, problems to solve, or new techniques to solve them. Before I started as faculty, one of my biggest doubts was whether I would have good enough ideas to lead a research group and help shape five or six years of each student's career. There is no deterministic procedure to sit down and generate an idea.

However, we can think about how to improve the conditions for them to pseudorandomly appear.

Since sometime in my first year of grad school (2003), I've kept a document logging ideas for research projects. The criterion for including an idea was simple and has remained, at a high level, fairly consistent: When I have an idea that I think would have a reasonable chance at leading to a publishable paper, I jot down a description and notes. This is useful to help remember the idea and the document is also a convenient place to record notes over time, for example if I notice a related paper months later.

Having grown over almost exactly ten years to 169 entries, this document is now an interesting data set in its own right.

The data

Of course, this is not a uniform-random sample of ideas. There are various biases; not every idea makes it into the document, my inclusion standards might have changed over time, and so on. And many of these ideas are, in retrospect, terrible. But let's take a look anyway.

Here is the number of ideas in the document per year. (The first and last were half years and so the value shown is twice the actual number of ideas.)

Now let's probe a bit deeper. The number of ideas might not tell the whole story. Their quality matters, too. To investigate that, I annotated each idea with whether (by 2013) it successfully produced a published paper. I also tagged each of the 169 ideas with an estimate of its quality in retrospect (as subjectively judged by my 2013 self), using a scale from 1 to 10 where

5 = dubious: maybe not publishable, or too vague to have much value
6 = potential to result in a second-tier publication
8 = potential to result in a top-tier publication (e.g. SIGCOMM or NSDI in my field)
10 = potential to result in a top-tier publication and have significant additional impact (e.g., forming the basis of a student's thesis, producing a series of papers, a startup, etc.)

The number of reasonably high quality ideas and the number that produced papers both show significant jumps in 2008-2009, though with different behavior later. Also, the plot below shows perhaps a gentle increase in the mean idea quality over time, and a bigger jump in the quality of the top 5 best ideas in 2008. Note that even a one-point quality difference is quite significant.

The most prominent feature of this graph is an enormous spike in number of ideas in the 2008-2009 timeframe, and a corresponding increase in higher-quality ideas. What might have caused this? And what can we conclude more generally?

Ideas need time for creative thought

A significant change in my life during 2008-2009 was my transition from PhD dissertation work to a postdoc year (split between working on post-dissertation projects at my PhD institution of UC Berkeley, and working with Sylvia Ratnasamy at Intel Labs).

This appears to show the value of having time to step back and think -- and also the opportunity to interact with a new set of people. By May 2008, I was largely done with my dissertation work (though I did work on it more later and finally graduated in May 2009). I had accepted a position here at the University of Illinois and deferred for a year. So I was largely free of responsibilities and concerns about employment, and had more time to be creative. While there are reasons to be concerned about the surge of postdocs in computer science, I think this indicates why this particular kind of postdoc can be extremely valuable: providing time and space for creative thought, and new inspiration.

If that is the explanation, then it seems I was not sufficiently proactive about creating time to be creative after the flood of professorial tasks hit in late 2009.

There are alternative explanations. For example, knowing that I was about to enter a faculty position, I might have more proactively recorded ideas for my future students to work on. However, that would not explain another observation -- that my creative expression at this time in other areas of my life outside computer science seemed to increase as well.

John Cleese has argued that creativity takes time, and it's more likely to happen in an "open mode" of playful, even absurd thought, rather than in a "closed mode" of efficiently executing tasks:

His talk makes other points relevant to academic research. In particular, you are less likely to get into an "open mode" of thought if you are interacting with people with whom you're not completely comfortable. This should certainly affect your choice of collaborators.

It's worth noting that having time to enter an "open mode" of creative thought does not mean that one is thinking free of any constraints whatsoever. I personally find that constraints in a problem domain can provide some structure for creative thought, like improvising around a song's fixed chord changes in jazz.

Ideas need time to germinate

In fact, some ideas need years.

You'll note from the second plot above that the number of paper-producing ideas is zero in 2012 and 2013. This is not just random variation: It's actually fairly unlikely to have an idea and immediately turn it around into a paper. In fact, it has happened fairly often that an idea takes a year or two to "germinate". I might write down the seed of an idea, and at that time not recognize whether it is valuable and what it might become. In coming back to it occasionally, and combining it with other ideas, and bouncing the idea off other people, the context and motivation and focus of the idea gradually take shape until it is something much stronger that I can recognize as a worthwhile endeavor.

And that is all before the right opportunity appears to begin the project in earnest -- such as a PhD student who is looking for a new project and is interested in the area -- and the project is actually developed and the paper written and submitted (and resubmitted ...) and finally published. The most extreme example I've been involved with was a 2005 idea-seed that was finally published in a top conference seven years later. In fact, in processing this data I realized there was a second idea from 2005 which lacked sufficient motivation at the time and got somewhat lost until 2011 when it combined with a new take on a similar idea that grew out of a student's work and was published in 2012. The plot below shows 14 lines, each corresponding to a project, with points at the inception year of the seed idea, intermediate ideas if any which combined with it, and finally the year of publication.

Ideas from connections

Reading over the document made it clear that very few if any of the ideas sprang from out of nowhere. They come from connections: with a paper I read, or with a previous project, or in chatting with collaborators. Some of these connections can be quite unexpected. For example, one project on future Internet architecture indirectly inspired a project on network debugging.

Many of the ideas on the list in fact owe at least as much to collaborators as they do to me. This likely is a big part of the rise in number of ideas after becoming faculty. Although I lost some of my open creative time after beginning as faculty, I gained a set of fantastic students.

Conclusions

Generating and selecting among ideas is an art, one of the most important arts to learn over years of grad school. I will never feel that I've truly mastered that art. But studying my own history has suggested some strategies and conditions that seem to help, or at least seem to help me.

Ideas are more likely to appear when I have time or create time to think creatively, rather than simply appearing for free.

They often need to germinate over a period of months or years.

And perhaps most importantly, they are most likely to grow out of connections with other work and other people.

Live-blogging HotNets 2012, Day Two

2012-10-30T09:26:00.002-07:00

This is Day Two. Day One is here.

Mobile and Wireless

Calum Harrison presented work on making rateless codes more power-efficient. Although rateless codes do a great job of approaching the Shannon capacity of the wireless channel, they're computationally expensive, and this can be a problem on mobile devices. This paper tries to also optimize for cost of decoding measured in terms of CPU operations, and gets 10-70% fewer operations with competitive rate. [Calum Harrison, Kyle Jamieson: Power-Aware Rateless Codes in Mobile Wireless Communication]

Shailendra Singh showed that there isn't one single wireless transmission strategy that is always best. DAS, Reuse, Net-mimo — for each there exists a profile of the user (are they moving, how much interference is there, etc.) for which that scheme is better than the others, which this paper experimentally verified. TRINITY is a system they're building to automatically get the best of each scheme in a heterogeneous world. [Shailendra Singh, Karthikeyan Sundaresan, Amir Khojastepour, Sampath Rangarajan, Srikanth Krishnamurthy: One Strategy Does Not Serve All: Tailoring Wireless Transmission Strategies to User Profiles]

Narseo Vallina-Rodriguez argued for something that may be slightly radical: "onloading" traffic from a wired DSL network onto wireless networks. We sometimes think of wireless bandwidth as a scarce resource, but actually your wireless throughput could easily be twice your DSL in some situations. If there is spare wireless capacity, why not use it? 40% of users use less than 10% of their allocated wireless data volume. They tested this idea in a variety of locations at different times and can get order-of-magnitude improvements in video streaming buffering. Apparently the reviewers noted that wireless providers wouldn't be big fans of this, but Narseo noted that his coauthors are all from Telefonica. Interesting question from Brad Karp: How did we get here? Telefonica owns the DSL and wireless; if you need additional capacity is it cheaper to build out wireless capacity or wired? The answer seems to be that wired is way cheaper, but we need to have wireless anyway. Another commenter: this is promising because measurements show congestion on wireless and DSL peaks at different times. Open question: Is this benefit going to be true long term? [Narseo Vallina-Rodriguez, Vijay Erramilli, Yan Grunenberger, Laszlo Gyarmati, Nikolaos Laoutaris, Rade Stanojevic, Konstantina Papagiannaki: When David helps Goliath: The Case for 3G OnLoading.]

Data Center Networks

Mosharaf Chowdhury's work dealt with the fact that the multiple recent projects improving data center flow scheduling deal with just that: flows, with each flow in isolation. Applications, on the other hand, create dependencies: for example, a partition-aggregate workload may need all of its flows to finish, and if one finishes earlier, it's useless. The goal of Coflow is to expose that information to the network to improve scheduling. One question asked was what the tradeoff is with the complexity of the API. [Mosharaf Chowdhury, Ion Stoica: Coflow: An Application Layer Abstraction for Cluster Networking]

Nathan Farrington presented a new approach to build hybrid data center networks, with both a traditional packet-switched network and a circuit-switched (e.g., optical) network. An optical switch provides much higher point-to-point bandwidth but switching is slow, far too slow for packet-level switching. Prior work used hotspot scheduling, where the circuit switch is configured to help the elephant flows. But performance of hotspot scheduling depends on the traffic matrix. Here, Nathan introduced Traffic Matrix Scheduling: the idea is to repeatedly iterate between a series of switch configurations (input-output assignments), such that the collection of all assignments fulfills the entire traffic matrix. Q: Once you reach 100% traffic over optical, is there anything stopping you from eliminating the packet-switched network entirely? Still, there is latency on the order of 1 ms to complete one round of assignments; 1 ms is much higher than electrical DC network RTTs. Q: Where does the traffic matrix come from? Do you have to predict, or wait until you've buffered some traffic? Either way, there's a tradeoff. [Nathan Farrington, George Porter, Yeshaiahu Fainman, George Papen, Amin Vahdat: Hunting Mice with Microsecond Circuit Switches]
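To make "a series of switch configurations that together fulfill the traffic matrix" concrete, here is a toy sketch. It is not the paper's algorithm: it swaps in a greedy Birkhoff-von Neumann-style decomposition, assumes an integer demand matrix whose rows and columns all have the same sum, and finds each assignment by brute force, which is only reasonable for tiny matrices. The name `decompose` and the demand values are hypothetical.

```python
from itertools import permutations

def decompose(matrix):
    """Split a square matrix with equal row and column sums into
    (duration, assignment) pairs, where assignment[i] is the output
    port connected to input port i during that slot."""
    n = len(matrix)
    remaining = [row[:] for row in matrix]
    schedule = []
    while any(any(row) for row in remaining):
        # Brute-force search for an assignment that uses only nonzero entries.
        perm = next(p for p in permutations(range(n))
                    if all(remaining[i][p[i]] > 0 for i in range(n)))
        duration = min(remaining[i][perm[i]] for i in range(n))
        for i in range(n):
            remaining[i][perm[i]] -= duration
        schedule.append((duration, perm))
    return schedule

# HYPOTHETICAL 3x3 demand matrix (rows = input ports, columns = output ports).
demand = [[2, 1, 0],
          [0, 2, 1],
          [1, 0, 2]]
for duration, assignment in decompose(demand):
    print(duration, assignment)
```

Each pair says "hold this circuit configuration for this long". Summing the configurations weighted by their durations reproduces the demand matrix, which is the sense in which the schedule serves all of the traffic.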

Mohammad Alizadeh took another look at finishing flows quickly in data centers. There are a number of recent protocols which are relatively complex. Their design is beautifully simple: each packet has a priority, and routers simply forward high priority packets first. They can have extremely small queues since the dropped packets are likely low priority anyway. End-hosts can set each packet's priority based on flow size, and perform very simple congestion collapse avoidance. Performance is very good, though with some more work to do for elephant flows in high-utilization regimes. [Mohammad Alizadeh, Shuang Yang, Sachin Katti, Nick McKeown, Balaji Prabhakar, Scott Shenker: Deconstructing Datacenter Packet Transport]
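The forwarding rule described here is simple enough to sketch in a few lines. This is a toy model under my own assumptions, not the paper's implementation: a lower priority number means higher priority (for example, the flow's size), the buffer holds only a couple of packets, and `PriorityPort` is a made-up name.

```python
class PriorityPort:
    """Toy output port with a tiny buffer: forwards the highest-priority
    packet first and drops the lowest-priority packet when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []   # entries are (priority, arrival_seq, payload)
        self.seq = 0       # arrival counter, breaks ties between equal priorities

    def enqueue(self, priority, payload):
        """Add a packet; return the payload that was dropped, or None."""
        self.buffer.append((priority, self.seq, payload))
        self.seq += 1
        if len(self.buffer) > self.capacity:
            worst = max(self.buffer)   # largest priority number, newest on ties
            self.buffer.remove(worst)
            return worst[2]
        return None

    def dequeue(self):
        """Forward the packet with the smallest priority number, if any."""
        if not self.buffer:
            return None
        best = min(self.buffer)
        self.buffer.remove(best)
        return best[2]

# HYPOTHETICAL traffic: priority is the flow size in packets.
port = PriorityPort(capacity=2)
port.enqueue(100, "elephant-1")
port.enqueue(3, "mouse-1")
dropped = port.enqueue(50, "medium-1")   # buffer overflows, elephant packet is dropped
print("dropped:", dropped, "forwarded first:", port.dequeue())
```

With flow size as the priority, short flows jump the queue and the packets lost to the small buffer belong to large flows, which the end-hosts can retransmit.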

Lunch!

Routing and Forwarding

Gábor Rétvári tackled a compelling question: How much information is actually contained in a forwarding table? Can we compress the FIB down to a smaller size, making router hardware simpler and longer-lasting? Turns out, there's not so much information in a FIB: with some new techniques, a realistic DFZ FIB compresses down to 60-400 Kbytes, or 2-6 bits per prefix! A 4 million prefix FIB can fit in just 2.1 Mbyte of memory. Now the interesting thing is that this compression can support reasonably fast lookup directly on the compressed FIB, at least asymptotically speaking, based on an interesting new line of theory research on string self-indexing. One problem: They really need more realistic FIBs. The problem is that widely-available looking glass servers obscure the next-hops, which affect compression. "We are inclined to commit crimes to get your FIBs." Before they turn to a life of crime, why not send them FIBs? They have a demo! Question for the future: Can we use compressed forwarding tables at line speed? [Gábor Rétvári, Zoltán Csernátony, Attila Körösi, János Tapolcai, András Császár, Gábor Enyedi, Gergely Pongrácz: Compressing IP Forwarding Tables for Fun and Profit]
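As a quick sanity check on how those figures relate, the 4 million prefix example can be converted to bits per prefix. This sketch assumes "Mbyte" means 10**6 bytes; with 2**20 bytes the result comes out slightly higher.

```python
# Figures quoted in the talk summary above.
fib_bytes = 2.1 * 10**6    # ASSUMPTION: 1 Mbyte = 10**6 bytes
prefixes = 4 * 10**6

bits_per_prefix = fib_bytes * 8 / prefixes
print(f"{bits_per_prefix:.1f} bits per prefix")
```

The result falls inside the quoted 2-6 bits per prefix range.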

Nikola Gvozdiev wins the award for best visualizations with some nice animation of update propagation among iBGP routers. Their work is developing the algorithms and systems necessary to propagate state changes in iBGP, without causing any transient black holes or forwarding loops. [Nikola Gvozdiev, Brad Karp, Mark Handley: LOUP: Who's Afraid of the Big Bad Loop?]

Vasileios Kotronis's work takes SDN-based routing a step further: Don't just centralize within a domain, outsource your routing control to a contractor! One cool thing here, besides reduced management costs, is that you can go beyond what an individual domain can otherwise do — for example, the contractor has interdomain visibility and can perform cross-domain optimization, debug policy conflicts, etc. [Vasileios Kotronis, Bernhard Ager, Xenofontas Dimitropoulos: Outsourcing The Routing Control Logic: Better Internet Routing Based on SDN Principles]

User Behavior and Experience

Rade Stanojevic presented results from a large data set of mobile service plans (roughly a billion each of calls, SMS/MMS messages, and data sessions). The question: Are economic models of how users select bandwidth and service plans realistic? What choices do real people make? In fact, only 20% of customers choose the optimal tariff. The mean overpayment is 37%, and the median is 26%. Another interesting result: use of service peaks immediately after purchase, and then decays steadily over at least a month, even with unlimited service (so it's not just because people are conservative as they near their service limits). Several questions: Do these results really demonstrate irrationality? Users may buy more service than they need, so they don't need to worry about (and pay) comparatively pricey overage fees. Comment from an audience member: One has to imagine the marketing department of Telefonica has that exact same CDF of "irrationality" as their metric of success. [Rade Stanojevic, Vijay Erramilli, Konstantina Papagiannaki: Understanding Rationality: Cognitive Bias in Network Services]

Athula Balachandran presented a study working towards a quantitative metric to score user experience of video delivery (in particular, how long users end up watching the video). The problem here is that predicting user experience based on quantitative observables is hard: it's a complex function of initial startup delay, how often the player buffers, buffering time, bit rate, the type of video, and more. The paper analyzes how well user experience can be predicted using several techniques, based on data from Conviva. [Athula Balachandran, Vyas Sekar, Aditya Akella, Srinivasan Seshan, Ion Stoica, Hui Zhang: A Quest for an Internet Video Quality-of-Experience Metric]

Vijay Erramilli presented a measurement study of how web sites act on information that they know about you. In particular, do sites use price discrimination based on information they collect about your browsing behavior? Starting with clean machines and having them visit sites based on certain high- or low-value browsing profiles, they could subsequently measure how a set of search engines and shopping sites present results and prices to those different user profiles. They uncovered evidence of differences in search results, and some price differences on aggregators such as a mean 15% difference in hotel prices on Cheaptickets. Interestingly, there were also significant price differences based on the client's physical location. Q from Saikat Guha: How can you differentiate the vendor's intentional discrimination from unintentional? For example, in ad listings, having browsed a certain site can cause a Rolex ad to display, which bumps off an ad for a lower priced product. [Jakub Mikians, László Gyarmati, Vijay Erramilli, Nikolaos Laoutaris: Detecting Price and Search Discrimination in the Internet]

That's it! See you all next year...

Live-blogging HotNets 2012

2012-10-29T12:12:00.001-07:00

Note: This blogging might be rather bursty. If you want something more deterministic, here's the HotNets program.

This is Day One. Day Two is here.

Session 1: Architecture and Future Directions

Teemu Koponen spoke about how combining the ideas of edge-core separation (from MPLS), separating control logic from the data plane (from SDN), and general-purpose computation on packets (from software routers) can lead to a more evolvable software defined Internet architecture. [Barath Raghavan, Teemu Koponen, Ali Ghodsi, Martin Casado, Sylvia Ratnasamy, Scott Shenker: Software-Defined Internet Architecture]

Sandeep Gupta discussed rather scary hardware trends, including increasing error rates in memory, and how this may affect networks (potentially increasing loss rates). [Bin Liu, Hsunwei Hsiung, Da Cheng, Ramesh Govindan, Sandeep Gupta: Towards Systematic Roadmaps for Networked Systems]

Raymond Cheng talked about how upcoming capabilities which will be widely deployed in web browsers will enable P2P applications among browsers, so free services can really be free. Imagine databases in browsers, or every browser acting as an onion router. [Raymond Cheng, Will Scott, Arvind Krishnamurthy, Tom Anderson: FreeDOM: a new Baseline for the Web]

Session 2: Security and Privacy

Scott Shenker examined how to build inter-domain routing with secure multi-party computation (SMPC), to preserve privacy of policies. The idea is that interdomain routing really is a multi-party computation of global routes, and participants want it to be secure. The benefits of using SMPC: autonomy, privacy, simple convergence behavior, and a policy model not tied to the computational model. The last item should be emphasized: there's a lot more potential policy flexibility here with a much easier deployment story, just changing software at the set of servers running the computation. For example, do other classes of policies have different or better oscillation properties? Part of this (convergence) seems to connect with Consensus Routing. Jeff Mogul mentioned an interesting point: By adding the layer of privacy it may be very hard to figure out what's going on inside the algorithm and debug why it arrived at a particular result. [Debayan Gupta, Aaron Segal, Gil Segev, Aurojit Panda, Michael Schapira, Joan Feigenbaum, Jennifer Rexford, Scott Shenker: A New Approach to Interdomain Routing Based on Secure Multi-Party Computation]

Katerina Argyraki spoke about how we can change the basic assumption of secure communication: creating a shared secret not based on computational difficulty, but on physical location. The idea is to use different wireless interference across location. Security is more robust than you might think, in that you just need a lower bound on how much information Eve misses, rather than which pieces of message Eve missed. An implementation generated 38 secret Kbps between 8 nodes. However, in a few corner cases Eve learned a substantial amount about the secret. There is some hope to improve this. [Iris Safaka, Christina Fragouli, Katerina Argyraki, Suhas Diggavi: Creating Shared Secrets out of Thin Air]

Saikat Guha linked the problem of data breaches to money and proposed data breach insurance ("Obamacare for data"). In a survey, 77% of users said they would pay, a median of $20. (Saikat thought this may be optimistic.) They're working to develop a browser-based app to monitor user behavior, offer individuals incentives to change to more secure behavior, and see if people actually change. [Saikat Guha, Srikanth Kandula: Act for Affordable Data Care.]

Lunch!

Session 3: Software-Defined Networking

Aaron Gember spoke about designing an architecture for software defined middleboxes, taking the idea of SDN to more complex processing. Distributed state management is one challenge. [Aaron Gember, Prathmesh Prabhu, Zainab Ghadiyali, Aditya Akella: Toward Software-Defined Middlebox Networking]

Monia Ghobadi has rethought end-to-end congestion control in software-defined networks. The work observes that TCP has numerous parameters that operators might want to tune — initial congestion window size, TCP variant, even AIMD parameters, and more — that can have a dramatic effect on performance. But the effects they have depend on current network conditions. The idea of the system they're building, OpenTCP, is to provide an automatic and dynamic network-wide tuning of these parameters to achieve performance goals of the network. This is done in an SDN framework with a central controller that gathers information about the network and makes an educated decision about how end-hosts should react. Experiments show some very nice improvements in flow completion time. Questions: Did you see cases when switching dynamically offered an improvement? And in general, how often do you need to switch to get near the best performance? Some of that remains to be characterized in experiments. [Monia Ghobadi, Soheil Hassas Yeganeh, Yashar Ganjali: Rethinking End-to-End Congestion Control in Software-Defined Networks]

Eric Keller, now at the University of Colorado, spoke about network migration: Moving your virtual enterprise network between cloud providers, or moving within a provider to be able to save power on underutilized servers, for example. Now, doing this while keeping the live network running reliably is not trivial. The solution here involves cloning the network and using tunnels from old to new, and then migrating VMs. But then, you need to update switch state in a consistent way to ensure reliable packet delivery. Some questions: How do you deal with SLAs, how do you deal with networks that span multiple controllers? [Eric Keller, Soudeh Ghorbani, Matthew Caesar, Jennifer Rexford: Live Migration of an Entire Network (and its Hosts)]

Session 4: Performance

Ashish Vulimiri presented our paper on making the Internet faster. The problem: Getting consistent low latency is extremely hard, because it requires eliminating all exceptional conditions. On the other hand, we know how to scale up throughput capacity. We can convert some extra capacity into a way to achieve consistent low latency: execute latency-sensitive operations twice, and use the first answer that finishes. The argument, through a cost-benefit analysis and several experiments, is that this redundancy technique should be used much more pervasively than it is today. For example, speeding up DNS queries by more than 2x is easy. [Ashish Vulimiri, Oliver Michel, P. Brighten Godfrey, Scott Shenker: More is Less: Reducing Latency via Redundancy]
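The core trick, "execute the operation twice and use the first answer that finishes", can be sketched with Python's standard thread pool. This is only an illustration under assumptions: `slow_lookup` is a hypothetical stand-in that simulates variable latency rather than a real DNS query, and the replica names are made up.

```python
import concurrent.futures
import random
import time

def slow_lookup(replica, key):
    """Hypothetical latency-sensitive operation with variable response time."""
    time.sleep(random.uniform(0.01, 0.2))   # simulated network latency
    return f"{key}@{replica}"

def redundant_call(fn, replicas, key):
    """Send the same request to every replica and return the first answer."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(fn, replica, key) for replica in replicas]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)   # do not block on the slower copies
    return next(iter(done)).result()

print(redundant_call(slow_lookup, ["resolver-a", "resolver-b"], "example.org"))
```

The cost is the extra load from the duplicate request. The benefit is that the caller only sees the slow tail when every copy happens to be slow at once, which is the cost-benefit tradeoff the paper analyzes.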

The questions are getting interesting. Where is Martha Raddatz?

Udi Weinsberg went in the other direction: redundancy elimination. This is an interesting scenario where a kind of content-centric networking may be a big help: in a disaster which cuts off high-throughput communication, a DTN can provide a way for emergency response personnel to learn what response is most effective, through delivery of photos taken by people in the disaster area. But in this scenario, as they have verified using real-world data sets, people tend to take many redundant photos. Since the throughput of the network is limited, smart content-aware redundancy elimination can more quickly get the most informative photos into the hands of emergency personnel. [Udi Weinsberg, Qingxi Li, Nina Taft, Athula Balachandran, Gianluca Iannaccone, Vyas Sekar, Srinivasan Seshan: CARE: Content Aware Redundancy Elimination for Disaster Communications on Damaged Networks]

Onward to Day Two...

Notes on ACM, Open Access, and Copyright

2012-05-14T20:46:00.000-07:00

My last post listed the comments on open access and copyright of the candidates in the 2012 ACM Council Election. Since I first posted, several more responses came in, so you might be interested to check it out. Vicki Hanson's note, in particular, provided a concise summary of the rationale for ACM's current policies.

So what did the candidates think? There are at least two important issues:

Not preventing access to papers: This is a question of the copyright or licensing policy. Does it inhibit researchers from distributing their own work?
Actively facilitating greater access to papers: This implies that ACM itself would somehow openly distribute papers.

Not preventing access to papers

The candidates' statements differed fairly significantly on this point — so you have a meaningful choice in your vote!

Many candidates noted that ACM already grants authors many rights. However, it still prevents uses such as posting on arXiv and commercial distribution.

The co-chairs of the ACM Publications Board explained ACM's copyright policy in the October 2011 CACM. Regarding copyright transfer, they write:

One might wonder, given the generous rights retained by authors, why ACM requires authors to transfer copyright to ACM at all. In fact, the transfer of copyright to ACM provides substantial benefit to the computing research community and to authors. By owning exclusive publication rights to articles, ACM is able to develop salable publication products that sustain its top-quality publishing programs and services; ensure access to organized collections by current and future generations of readers; and invest continuously in new titles and in services like referrer-linking, profiling, and metrics, which serve the community. Furthermore, it allows ACM to efficiently clear rights for the creation, dissemination, and translation of collections of articles that benefit the computing community that would be impossible if individual authors or their heirs had to be contacted for permission. Ownership of copyright allows ACM to pursue cases of plagiarism. The number of these handled has been steadily growing; some 20 cases were handled by ACM in the last year. Having ACM investigate and take action removes this burden from our authors, and ACM is more likely to obtain a satisfactory outcome (for example, having the offending material removed from a repository) than an individual.

My summary of this is that ACM gets the following from holding the copyright:

More revenue. Question: how much more?
Easier dissemination without contacting individuals. Question: wouldn't this be fixed with a non-exclusive perpetual license to distribute the work?
Ability to pursue plagiarism. Point of comparison: 20 cases represent a fraction of about 0.000065 of the 307,000 articles in the Digital Library, i.e., one in every 15,350.

Actively facilitating greater access to papers

Exactly zero of the candidates fully endorsed open access in the sense of ACM providing all publications freely online, though Radia Perlman came closest.

Open access does not necessarily mean that all the Digital Library's services would be free, only that papers would be distributed freely somehow (for example, many ACM conferences already distribute their proceedings freely online). Still, full open access certainly could impact revenue, perhaps significantly. Here are some interesting numbers. In 2011, the ACM DL grew by over 31,000 full-text articles, or 11%, to a total of 307,000 (up from 21,000 new articles in 2010). In 2011, from publications, ACM earned $18,275,000 in revenue (28% of its total) and incurred $11,750,000 in expenses. Thus, for each new publication last year, ACM took in about $590 and spent about $379, leaving about $211 to support numerous other activities beneficial to the community.
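For anyone who wants to check the per-publication arithmetic, here is a small sketch that recomputes it from the figures quoted above. The variable names are mine, and treating "over 31,000" as exactly 31,000 is an assumption, so the results are approximate.

```python
# Figures quoted in the post for 2011 (article count rounded down to 31,000).
new_articles = 31_000
total_articles = 307_000
publication_revenue = 18_275_000   # dollars
publication_expenses = 11_750_000  # dollars

# Growth relative to the size of the DL before the new articles were added.
growth_fraction = new_articles / (total_articles - new_articles)

# Revenue, expenses, and surplus attributed to each new publication.
revenue_per_article = publication_revenue / new_articles
expenses_per_article = publication_expenses / new_articles
surplus_per_article = revenue_per_article - expenses_per_article

if __name__ == "__main__":
    print(f"growth: {growth_fraction:.1%}")
    print(f"revenue per new article:  ${revenue_per_article:,.2f}")
    print(f"expenses per new article: ${expenses_per_article:,.2f}")
    print(f"surplus per new article:  ${surplus_per_article:,.2f}")
```

The unrounded surplus comes out slightly under $211, because that figure is the difference of the two rounded values ($590 and $379).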

I assume those numbers include not only digital but also print distribution of some papers and articles. It would be interesting to have ACM's digital-only costs as a comparison to the arXiv. In 2010 arXiv wrote,

The annual budget for arXiv is $400,000. With over 60,000 new submissions per year one may think of this as an effective cost of <$7 per submission. Alternatively, with over 30,000,000 full-text downloads per year this is an effective cost of <1.4 cents per download.

The one-time cost of $7 per submission is as much as three orders of magnitude lower than some other estimates of the cost of providing open access per paper. In 2009 Michel Beaudouin-Lafon wrote in CACM:

But how much are authors ready to pay to publish an article? A few hundred dollars? The most prominent Open Access publisher, the Public Library of Science (PLOS), is a nonprofit organization that has received several million dollars in donations. Yet it charges between $1,350 and $2,900 per paper, depending on the journal. In fact, many in the profession estimate that to be sustainable, the author-pay model will need to charge up to $5,000–$8,000 per publication.

Some of these numbers might include additional services such as editing, but the arXiv numbers and similar numbers from JMLR imply that the cost of archiving and distribution is far lower than the thousand-dollar estimates. Indeed, PLOS ONE publisher Peter Binfield left to found PeerJ, which will apparently charge authors a $99 lifetime membership fee to publish open access papers starting fall 2012.
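To make the "orders of magnitude" comparison concrete, here is a rough sketch using only the numbers quoted above. The dictionary labels are mine, and the arXiv figures are the lower bounds from its statement ("over 60,000" submissions, "over 30,000,000" downloads), so the computed costs are upper bounds.

```python
import math

# arXiv figures as quoted (2010).
arxiv_budget = 400_000         # dollars per year
arxiv_submissions = 60_000     # per year, lower bound
arxiv_downloads = 30_000_000   # per year, lower bound

cost_per_submission = arxiv_budget / arxiv_submissions
cost_per_download_cents = 100 * arxiv_budget / arxiv_downloads

# Per-paper open access charges and estimates quoted from the 2009 CACM article.
per_paper_estimates = {
    "PLOS low": 1_350,
    "PLOS high": 2_900,
    "author-pay low": 5_000,
    "author-pay high": 8_000,
}

# How many orders of magnitude each estimate sits above arXiv's per-submission cost.
orders_of_magnitude = {
    label: math.log10(dollars / cost_per_submission)
    for label, dollars in per_paper_estimates.items()
}

if __name__ == "__main__":
    print(f"arXiv: ${cost_per_submission:.2f} per submission, "
          f"{cost_per_download_cents:.2f} cents per download")
    for label, orders in orders_of_magnitude.items():
        print(f"{label}: {orders:.2f} orders of magnitude above arXiv")
```

Only the highest author-pay estimate reaches a full three orders of magnitude; the PLOS charges are between two and three orders above the arXiv figure, consistent with "as much as three".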

Remember the ACM election runs just a few more days, till May 22.

Statements of ACM candidates on open access and copyright policy

2012-04-24T19:41:00.001-07:00

In the May issue of CACM, Moshe Vardi argues that the interests of authors and commercial publishers have irreconcilably diverged. But "in the case of publishing by a professional association, such as ACM, the authors, as ACM members, are essentially also the publishers", so when choosing a publishing model, "the decision is up to us: ACM members, authors, publishers."

Good point! With the 2012 ACM Council Election happening now through May 22, what are the candidates' positions on progressive copyright policies? I asked the candidates the following:

Do you have a position on the appropriate copyright policy for ACM's publications? Specifically, should the copyright on published research papers be assigned to ACM, or should the authors retain the copyright with ACM holding a non-exclusive license to distribute the work, similar to USENIX's policy? What is your position on moving ACM's publications to an open-access model?

Here are the candidates and their responses, filled in as they come.

Update May 14 2012: Additional notes and thoughts are in the follow-up post, "Notes on ACM, Open Access, and Copyright".

President

Barbara G. Ryder, Virginia Tech: "The ACM Digital Library (DL) has been designed and constructed by ACM, led by the vision of computing researchers in the SIGs. It now has become THE repository to go to for computing publications, having listings for many more than only ACM publications. This effort was undertaken for and supported by the computing research community; more recently, ACM has enhanced the DL with author metrics, additional search capabilities, the Authorizer tool, etc -- all in support of the research community. So the ACM DL is an important resource for computing. But ACM is a membership organization, not a for-profit company which can choose to invest in services for the community, funded by other revenue streams. At this time, the ACM DL generates a significant income stream for ACM and its SIGs, which, in part, supports further DL development as well as other activities. Any discussion of Open Access publication and ACM has to consider the financial consequences of the choices to be made. It is not just a philosophical discussion. ps These comments have already been posted on the Web, after answering similar questions from Matt Welsh: link [Updated: Regarding question about copyright policy:] please look at [ACM copyright policy] ... This allows non-commercial personal use by an author of her/his paper after the copyright has been signed away to ACM [and] the right to post a unique link using the Author-Izer ACM Linking Service on either the author’s homepage or Institutional Repository (wherever the author’s bibliography is maintained) which enables free access from that location to the definitive version of the work permanently maintained in the ACM Digital Library."

Vinton G. Cerf, Google: "I much prefer a kind of creative commons method or licensing method that leaves the authors with copyright and ACM with sufficient privilege to carry out its work."

Vice President

Mathai Joseph, Tata Consultancy Services (excerpt of longer response): "... I am quite happy with the ACM copyright policy because it represents a sensible balance between the rights of the author and the rights of a publisher who has invested time and effort in making the publication available to the community. ACM is competing with commercial publishers with far more restrictive policies and has to protect its rights in a fairly predatory market. ... Thank you for raising this important question. Some time back I talked to people in ACM HQ about it, thought of alternatives and then decided that the ACM policy is actually fairly sane."

Mathai's full response:

Thank you for your message and question about my stand on a copyright policy for ACM.

First, you refer to the USENIX policy as having 'a non-exclusive license'; in fact USENIX asks for exclusive rights for a specified period (12 months) and rights to continue to maintain its copies with public access after that period.

More broadly, I think the important question is the expected period of interest in a publication. I may be wrong, but I would guess that material published by USENIX has immediate interest for a specific community that diminishes over time as the important ideas of the content become part of a more permanent repository for long-term reference. In that context, 12 months is probably the period when there is most interest and it is covered by exclusive rights.

In contrast, journals provide a long term repository for material that has been carefully selected, refereed by the community and published as part of the accepted knowledge of a field (accepting of course that errors may be found at a later time). A paper like the one by Fischer, Lynch and Paterson on 'Impossibility of Distributed Consensus with One Faulty Process' which appeared in J. ACM in April 1985, has now had over 3000 citations, many of which have appeared in the last decade, or over 15 years after original publication. So rights have to be preserved over a very long time.

I am quite happy with the ACM copyright policy because it represents a sensible balance between the rights of the author and the rights of a publisher who has invested time and effort in making the publication available to the community. ACM is competing with commercial publishers with far more restrictive policies and has to protect its rights in a fairly predatory market.

The ACM Digital Library took a very large investment from ACM members to create. It not only holds the final version of a publication, it allows it to be seen along with other similar or related publications by the same author, or on similar topics. So the value of the DL should not be seen in the context of a single publication but over a range of publications that may be of interest at the same time. If the DL did not exist, it would have to be created in order to give us all the facilities that are needed for research. ACM does have consortium agreements for access to the DL and this brings down the cost for access (to zero, in most cases, since it is the institution and not the individual who has to pay the consortium charges).

I would like to turn the question around and ask you what the ACM policy prevents an author from doing: in what important way is the author unable to make use of his or her publication because of ACM's policy?

Thank you for raising this important question. Some time back I talked to people in ACM HQ about it, thought of alternatives and then decided that the ACM policy is actually fairly sane.

Alexander L. Wolf, Imperial College London: "Obviously, this is a very important and timely issue. ... I can tell you that the USENIX licensing model, the IEEE Security and Privacy licensing experiment, and related ideas are all under active study by various ACM volunteer groups. One thing I've learned from these discussions is that open access is a deceptively and desperately complex issue ... Personally, I subscribe to the general principle that outcomes of activities supported through public funds ... should be available for use by all citizens. ... ACM provides a staggeringly rich set of services (not just the management of professional conferences within a restricted intellectual domain, which is the predominant role of USENIX) to its members and to the larger (non-member) community. Those services cost money. ... How do we compensate for the loss of DL revenue, the funds that effectively subsidize many of ACM's other activities? Should we raise member dues? Should we raise conference fees? (BTW: dues and fees would have to be raised substantially, to the point that we would seriously risk the viability of both our organization and our conferences. Have you looked at the fees being charged for NSDI 2012 this week? And that's just to cover the conference costs, a bit of USENIX staff time, and a small share of maintenance cost for the USENIX content servers.) ... For instance, ACM is able to provide substantial financial and organizational aid to CSTA [... which supports CS K-12 school teachers. Alex also mentioned ACM's role in policy, developing nations, inclusion of women, curricula guidance, the 35 ACM SIGs, and student participation.] The overall point here is that we face a difficult trade off. ... 
The aim is to find a balance between the potentially conflicting goals of giving individuals easy access to the information generated by the community at the same time as helping guarantee a revenue stream for an organization that, frankly, plays a key role in sustaining the community. ... I urge you to take a look at several articles that have appeared in CACM related to the open access issue if you haven't already done so: [1, 2, 3]. I largely agree with them, and as such they also represent my position on the topic. Of course, the environment is dynamic, and new ideas are likely to emerge. I think the important thing for an officer of our association to do is maintain an understanding of and appreciation for the full context of the situation."

Alex's full response:

Thanks for getting in touch. Obviously, this is a very important and timely issue. It is one that is discussed and debated regularly by the ACM Council and ACM Publications Board. I take that as a healthy sign: the serious thought and effort that ACM volunteers are putting into consideration of the issue. I can tell you that the USENIX licensing model, the IEEE Security and Privacy licensing experiment, and related ideas are all under active study by various ACM volunteer groups.

One thing I've learned from these discussions is that open access is a deceptively and desperately complex issue, and one for which there is a lot of mis-information floating about. For example, the notion that one needs to "mov[e] ACM's publications to an open-access model". We should begin with the question: by what definition of "open access"? ACM publications are already considered "Green Open Access" as defined by various leading advocates of open access. So, we need to understand in what way GOA might not be sufficient or appropriate for ACM publications. Consider, too, ACM's new Author-izer service, which gives authors a mechanism for granting non-DL subscribers cost-free access to their publications. Access can be granted from a personal web page or from an organizational corpus (e.g., a university's publication repository). And, of course, the standard ACM copyright agreement already permits various forms of free dissemination.

Personally, I subscribe to the general principle that outcomes of activities supported through public funds (whether directly through government research grants, or indirectly through the education, training, and employment of people who carry out research at public institutions no matter the sponsor of that research) should be available for use by all citizens. (As a general principle it leaves aside many thorny issues, of course, such as what about partial support, what about certain specific and potentially harmful dual-use outcomes, how do we best promote industrial innovation, are not-for-profit organizations such as MIT and ACM "public" institutions, etc. Let's accept that we don't have answers to those questions for the moment.)

Now, how does that principle relate to your questions? It could be that this principle is exactly what you had in mind. Or it could be that you believe authors should have exclusive rights to what they produce, which could very well be in conflict with the principle outlined above. (Consider, for example, that if one follows the principle above, then by accepting public funds one has already given up certain rights.) And, then, which perspective is supported by the notion of licensing to which you alluded? I would suggest the latter (exclusive author rights), in which case we may well disagree. You see, some people may think that licensing, as opposed to copyright transfer, better supports public access, when in fact it may instead simply support exclusive author rights, at which point we must then trust each individual author (or the organization that employs the author) to make the works publicly available, and on a continuing basis. So perhaps it is actually the detail of the agreement that is put in place that is important, not so much the vehicle (license or copyright transfer) that is used to carry it. See, for example, this commentary on the IEEE Security and Privacy license experiment:

https://freedom-to-tinker.com/blog/dwallach/ieee-blows-it-security-privacy-copyright-agreement/

There are many, many other issues to consider. Here is a sampling:

ACM is a not-for-profit, volunteer, member organization. The decisions that ACM takes are decisions made by you and me, the members of the organization, not the headquarters staff.
Why is it that libraries and library consortia are willing to pay ACM for DL access? Two simple answers: (1) because ACM content is not only of the highest quality, it is far, far less expensive than the fees charged by commercial publishers -- value for money in an extremely tight economy; and (2) because it is a managed-access corpus supported by a professional organization. We must be very careful to consider this value model.
ACM provides a staggeringly rich set of services (not just the management of professional conferences within a restricted intellectual domain, which is the predominant role of USENIX) to its members and to the larger (non-member) community. Those services cost money. Do we believe that these services are valuable? Then we must find ways to generate the money to fund them. Should we shut down the ACM DL and let authors take full responsibility for making their papers publicly accessible? Should authors be charged a fee for ACM to provide the DL service? How do we compensate for the loss of DL revenue, the funds that effectively subsidize many of ACM's other activities? Should we raise member dues? Should we raise conference fees? (BTW: dues and fees would have to be raised substantially, to the point that we would seriously risk the viability of both our organization and our conferences. Have you looked at the fees being charged for NSDI 2012 this week? And that's just to cover the conference costs, a bit of USENIX staff time, and a small share of maintenance cost for the USENIX content servers.)
We need to consider that there are multiple constituencies involved in this issue. Authors, yes, but also readers, other ACM members, research sponsors, practitioners, governments, companies, teachers, students, the public at large, and libraries and library consortia. Of particularly concern to me, I must admit, are those benefiting from the other services made possible in part by the revenue generated by the ACM DL. For instance, ACM is able to provide substantial financial and organizational aid to CSTA, the Computer Science Teachers Association, which is an activity (started by the ACM) to support K-12 teachers ("school teachers" in the UK) around the world. ACM operates USACM, which provides informed technical opinions to US policy agencies and law makers, whose decisions, like it or not, have huge impact around the world. ACM is helping developing nations, such as India, organize their computer science education and research communities. ACM is promoting the inclusion of women in the profession through ACM-W and related activities. ACM provides curricula guidance used in establishing educational programs and accreditation criteria. The 35 ACM SIGs and their members receive substantial support from the ACM DL revenue, again effectively subsidizing their operations, such as to promote student conference attendance. There are many other examples.
Should we allow this issue to be resolved on a case-by-case basis by individual authors? By that I mean, should authors decide for themselves what rights to assign or not? My feeling is that such an approach is not viable, much in the same way that (health or car) insurance as a concept only works if the society as a whole is compelled to participate. We are in a society of sorts, a computer-professionals society, and as such we must also consider what is required of the individual to maintain the viability of the society. Of course, this is the essence of the debate, and we must resolve opposing viewpoints on that question.

The overall point here is that we face a difficult trade off. Any action we take in one direction with respect to this issue must certainly be taken in consideration of its impact on the others. Facile solutions and proposals must be considered suspect.

The trade off, and the ACM response to it, are well represented by the emerging notion of "fair access", which is obviously an allusion to the related DRM notion of "fair use". The aim is to find a balance between the potentially conflicting goals of giving individuals easy access to the information generated by the community at the same time as helping guarantee a revenue stream for an organization that, frankly, plays a key role in sustaining the community. As ACM volunteers, let's be careful not to let our not-for-profit, professional association get caught up in the swirl surrounding the for-profit, commercial publication companies, such as Elsevier. Yes, the ACM volunteers want to maintain a revenue stream, but to support and sustain good works for the community, not to generate a "profit".

I hope I've answered your questions. I urge you to take a look at several articles that have appeared in CACM related to the open access issue if you haven't already done so:

http://cacm.acm.org/magazines/2009/7/32075-open-closed-or-clopen-access/fulltext

http://cacm.acm.org/magazines/2010/2/69353-open-access-to-scientific-publications/fulltext

http://cacm.acm.org/magazines/2012/5/148564-fair-access/fulltext

I largely agree with them, and as such they also represent my position on the topic. Of course, the environment is dynamic, and new ideas are likely to emerge. I think the important thing for an officer of our association to do is maintain an understanding of and appreciation for the full context of the situation.

Secretary/Treasurer

George V. Neville-Neil, Neville-Neil Consulting: "At the moment this entire question is being gone over by the Publications Board of ACM. They are meeting this June to talk about this issue as well as others. This has not been an area of ACM policy that I have been involved with in the past, but I agree that it's extremely important, not only to authors, but to the organization as a whole. I remain open minded about what the policy ought to be in the future, and am interested in seeing what the publications board comes up with as a recommendation. Having published several articles, and a monthly column, with ACM I have to say that I do not find the current system to impose unnecessary strictures on my ability to share my work or for others to gain access to it. [After a short exchange concerning arguments for open access:] Thanks for the pointers, I've looked them over and they're certainly food for thought. I'll keep these in mind as and when I get to see what the publications committee comes up with. I suspect that if ACM does move to a similar model to USENIX that this will take time as there are actual financial questions to deal with in this area. While the cost of publishing has diminished, there remain costs other than printing and shipping paper that ACM has to deal with. Figuring out a path from the current model to a more open one is certainly something I'd be involved with if I were elected as Secretary/Treasurer."

Vicki L. Hanson, University of Dundee: "I appreciate your thoughtful questions put to ACM candidates for election. The issues you raise have been, and continue to be, extensively discussed within ACM. ACM’s Publications Board regularly considers questions of licensing and open access and strives to continue with its high quality service while providing authors rights to their published work. As you are likely aware, the Pubs Board Chairs published an editorial in the October, 2011 issue of CACM about ACM’s copyright policy. Since that editorial, ACM has made available the Author-Izer service that allows authors to put a link on their personal or institutional web page that will enable anyone to download the definitive version of published papers from ACM’s Digital Library (DL) at no charge. This service also makes available the display on these personal and institutional pages of ACM's up-to-date download and citation statistics for the publications. ACM is exploring the implications of allowing authors to retain copyright, transferring a license to ACM for archiving, indexing, and electronic distribution. It is worth noting that such a change, according to my understanding, would make it somewhat easier for authors to distribute their work but would preclude ACM from protecting those works from plagiarism and unauthorized distribution by other entities including for-profit ones. The current policy must be reviewed, weighing the importance of such protections and other author needs. The fully open access issue is more difficult still and requires a careful consideration of business practices and organizational sustainability. There are substantial costs involved in publishing and maintaining the high quality archival collection of materials provided by ACM’s DL. I agree with the Pubs Board’s resistance to the author pays model of open access in that this does not allow poorly-funded authors to have the same access to publishing as well-funded ones. 
An economic model that places the financial burden on conferences for proceedings publications similarly tends to place financial roadblocks to publication for those less able to pay. This latter model also does not address larger questions of how the DL would be funded to support journals, educational materials, and other non-conference content. The current ACM business model attempts to gives authors flexibility and rights to make their work available to the community while, at the same time, being able to provide the DL service for aggregating articles, collecting bibliometrics, and investing in further development of the DL as a resource for the computing community. I realize that the above answers are not the definitive answers you might have sought in your questions to me. At this point in time, the issues you raise are critical ones for the future of ACM and continuing dialog is needed to consider the best way forward in terms of meeting the needs of authors and readers of DL materials as well as determining a sustainable business model that will allow authors and readers continued access to the DL, an important resource for ACM’s community of researchers and practitioners."

Members at Large

Radia Perlman, Intel: "I'd like to hear arguments on all sides before having a cast-in-stone position. Some companies have worked out an agreement with IEEE and ACM for something like what you said...that ACM has non-exclusive right, but the authors also get to post and distribute. So that implies, I think, that it wouldn't be totally detrimental to ACM to do that for everyone. Some conferences post the papers online, freely accessible. That seems like the right thing to do. Going beyond the rights of authors (and/or the company they worked for at the time they wrote the paper) having the right to post and distribute, I think the model of only letting people see the title and abstract of papers, and then having to pay to download the article, is really bad for facilitating research. When one is doing research, and browses on the web, and finds a 15 year old paper that looks like it might be relevant, but you have to pay $25 to download the paper, only to find it really is not relevant... A lot of companies and most universities have a blanket access to ACM and IEEE publications, so people at those companies probably don't notice the issue. I wonder how much ACM depends on revenue from people downloading papers. Especially really old ones. Perhaps a compromise might be to say that after, say, 3 years, the articles should be free. Anyway, my heart is in having everything easily accessible on the web, for free. I wouldn't care, as an author, whether I could distribute the paper or just a link to the paper, as long as the link allows the person to see the whole paper. For facilitating research, my inclination would also be for anyone to access all the published papers, without having to get a link from the author. [...] But as I said, I'd like to hear other points of view and legal/economic issues that I may not fully appreciate, before getting too entrenched in a position."

Ricardo Baeza-Yates, Yahoo! Research, Barcelona/Santiago: "In general I am in favor of open access models and giving the author more control of their copyrights. On the other hand we need to do this without jeopardizing the financial stability of ACM."

Feng Zhao, Microsoft Research Asia, Beijing: "My platform is primarily around building a sustained and quality engagement between ACM and the regional computing community in China and the rest of Asia, building on the tremendous momentum of the Council's China and India initiatives. As part of that, I felt it is important to lower the cost of access for people from the developing regions. I have not really thought through the copyright issue at any depth. But one thing is clear. The old model of publication, dissemination, and monetization is broken in the online world today. If elected, I will work with the Council to study and innovate on ways that can expand the ACM reach and at the same time ensure the financial sustainability of the society."

Eric Allman, Sendmail: "This is neither my area of expertise nor do I have all the information (particularly about finances), so I do not (as yet) have a strongly held position on this. However, I don't understand why it is necessary for ACM to actually hold copyright as long as it retains the rights to use the materials in the ways that it already does. In particular, as I read the copyright policy, the authors retain the right to privately publish the materials on non-ACM web sites, so the usual financial argument about the Digital Library doesn't seem to fit here. It also seems clear to me that research that was funded with public money should be available to that public with no more than a cost recovery fee. Obviously not all authors are funded by government grants, and the ACM audience transcends any particular government, but trying to sort articles on this basis seems excessively complex. I'm also a supporter of the concept of replication to maintain long-term integrity and retention of archival material, which is antithetical to centralized administration. Note that I'm not saying that the DL is superfluous or needs to be free. The DL provides value through indexing, providing a stable reference copy (URLs are notoriously unstable), and assisting ease of access. Maintaining the DL is not without costs which need to be recovered, and any reductions in revenue resulting from changing the copyright policy must be balanced in some way. Fiscal responsibility is important."

Mary Lou Soffa, University of Virginia: [not yet responded]

PJ Narayanan, IIIT-Hyderabad: "I personally believe the authors should have all rights to distribute their work and hence should hold the copyright. ACM as the publisher and maintainer of the electronics library should have non-exclusive rights to distribute the content."

Eugene H. Spafford, Purdue University: "Well, I'm not expert in ACM's policies, so I am not sure I am the best person to ask right now. However, I'll try to answer. My understanding is that there is a publications board that considers ACM policy for copyright. It is regularly reviewed. I know there have been many changes during the time I've been a member, in response to changing times, needs, and user requests. I haven't heard of any problems with the current policy, and things seem to be working okay. So, I'll assume that the current policy is appropriate until presented with evidence indicating otherwise. [In response to whether authors should retain the copyright:] ACM is not like USENIX -- I know, as I was a member of Usenix for 25 years. ACM publishes journals and maintains a curated digital library that must be supported over a long time to be of real value. The Usenix model is okay for some conferences, and for authors to maintain for a limited period of time, but that is not the same as immutable copies maintained in a curated collection, indefinitely. The current model seems to work fine, so, that gives a proof by example. [In response to whether publications should be open access:] Please define 'open access' and what it provides that the current model does not. Does it provide the necessary support and resources to maintain and enhance the ACM digital library in a global environment for an indefinite time? I'd then want to see a response from someone in ACM about the current model. I'm open to considering changes, but I need complete information to understand the issues and potential effects."

Update: Candidates' positions on open access from two years ago. A couple candidates are running this year as well.

Notes: You can read more recent discussion of open access here and here. Thanks to all the candidates for taking the time to reply. Thanks to George Porter for suggesting this.

Jellyfish: Networking Data Centers Randomly

2012-04-24T11:04:00.000-07:00

People have been designing communication network topologies for more than 150 years. But usually their structure is quite constrained. Building a wide area network, one has to follow the locations of cities or railroads. Building a supercomputer, a regular structure enables simple, deadlock-free routing. Building a traditional data center network, one might use a tree-like design amenable to the Spanning Tree routing protocol. Even state-of-the-art high-capacity data center networks might use a multi-rooted tree like a fat-tree.

3-level fat-tree · 432 servers, 180 switches, degree 12

In our Jellyfish paper appearing this week in NSDI 2012, we're proposing a slightly radical alternative: a completely random network.

Jellyfish random graph · 432 servers, 180 switches, degree 12

This project, the work of Ph.D. students Ankit Singla and Chi-Yao Hong, along with Lucian Popa of HP Labs and myself, has two goals. The first is high bandwidth, which helps servers avoid bottlenecks while streaming big data across the network and gives cloud operators the agility to place virtual machines on any physical host without worrying about bandwidth constraints between hosts.

Second, we want a network that is incrementally expandable. Cloud service providers continually expand and modify their networks to meet increasing demand and to reduce up-front capital expenditure. However, existing high-capacity network interconnects have relatively rigid structure that interferes with incremental modification. The fat-tree, hypercube, butterfly, and other proposed networks can only be built from a limited menu of fixed sizes and are difficult to expand by, for example, adding a rack of servers at a time. Of course, there are some workarounds: one can replace some switches with ones of larger port-count or oversubscribe them, but this can make network bandwidth constrained or uneven across the servers. One could leave ports free for future network connections but this wastes investment while they sit idle.

Our solution is to simply give up on structure, and build a random network among the network routers. This sloppiness yields significantly more flexibility than past designs: a few random link swaps are all it takes to incorporate additional components, making a new random network. It can naturally support varying port-counts, and scales to arbitrary sizes rather than limiting the network to coarse design points. In fact, we show in the paper that Jellyfish reduces the cost of incremental expansion quite substantially over a past expansion heuristic for fat-tree-like (Clos) networks. Intuitively, Jellyfish makes network capacity less like a structured solid and more like a fluid. Coincidentally, it also looks like a jellyfish.

Arctapodema jellyfish · Bill Curtsinger, National Geographic
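To make the "random network plus a few link swaps" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the construction from the paper: the function names (`build_random_network`, `add_switch`) are hypothetical, servers are left out, and each switch simply has a fixed number of ports available for switch-to-switch links.

```python
import random

def edge(a, b):
    """Store an undirected link as a sorted pair."""
    return (a, b) if a < b else (b, a)

def degree(links, switch):
    """Number of links attached to a switch."""
    return sum(1 for link in links if switch in link)

def build_random_network(num_switches, ports, rng):
    """Keep joining random pairs of switches that both have a free port
    and are not yet linked, until no such pair remains."""
    links = set()
    while True:
        open_switches = [s for s in range(num_switches)
                         if degree(links, s) < ports]
        candidates = [(a, b)
                      for i, a in enumerate(open_switches)
                      for b in open_switches[i + 1:]
                      if (a, b) not in links]
        if not candidates:
            return links
        links.add(rng.choice(candidates))

def add_switch(links, new_switch, ports, rng):
    """Incremental expansion (assumed form of a 'link swap'): remove one
    random existing link (a, b), then add (a, new) and (b, new). Each swap
    uses two ports on the new switch and leaves a and b's degrees unchanged."""
    free = ports
    while free >= 2:
        candidates = sorted(
            (a, b) for (a, b) in links
            if new_switch not in (a, b)
            and edge(a, new_switch) not in links
            and edge(b, new_switch) not in links
        )
        if not candidates:
            break
        a, b = rng.choice(candidates)
        links.remove((a, b))
        links.add(edge(a, new_switch))
        links.add(edge(b, new_switch))
        free -= 2
    return links
```

For example, `links = build_random_network(20, 4, random.Random(0))` followed by `add_switch(links, 20, 4, random.Random(1))` grows a 20-switch network by one switch without touching the port usage of any existing switch; the result is just another random network.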

At this point, one natural reaction is that a completely random network must be the product of a half-deranged intellect, somewhere between 'perpetual motion machine' and 'deep-fried butter on a stick'. Won't network capacity decrease, due to the sloppy interconnections? How does one route packets through a completely unstructured network? Isn't it possible to randomly pick a bad network, or for failures to cause problems? How do you physically cable up a network that bears more than a passing resemblance to a bowl of spaghetti? How could it possibly work?

The first surprise is that rather than sacrificing bandwidth, Jellyfish supports roughly 25% higher capacity than a fat-tree of equal cost. That is, a completely random network makes more efficient use of resources than a carefully-structured one. The intuition is that Jellyfish's diverse connections — in theoretical terms, it is a good expander graph — give it low average path length, which in turn means that sending each packet across the network takes less work. In fact, there is reason to believe that the random graph is pretty close to the best possible network for maximizing throughput, perhaps within 10%.

Routing is another interesting question. Chi-Yao and Ankit found that load-balanced routing works well as long as the routing protocol provides sufficiently high path diversity, as can be obtained with OpenFlow, MPLS, or other recent proposals that go beyond STP or simple shortest path routing. In addition, it turns out that the questions of consistent performance, resilience, and cabling have favorable, and we believe reasonably practical, answers. There are some very interesting theoretical questions that come up as well, which we're now looking into.

Jellyfish is sufficiently unlike past designs that implementation challenges will certainly arise. But so far it seems like an unstructured network just might work. And happily, rather than running into a tricky tradeoff, the two design goals of high bandwidth and incremental expansion appear to be satisfied by one network.

Congratulations to Ankit and Chi-Yao who have done (and are continuing to do) great work on this project. Don't miss Chi-Yao's talk in the Thursday afternoon data center networking session! Finally, thanks to NSF for supporting this research.

Like a spider's web

2011-12-06T16:39:00.001-08:00

“It is anticipated that the whole of the populous parts of the United States will, within two or three years, be covered with net-work like a spider's web.”

— Illustration and quote from The London Anecdotes, 1848
Quoted in The Victorian Internet

Google+ vs. Facebook engagement

2011-12-04T22:22:00.000-08:00

Here's a little statistically-insignificant self-experimentation, based on 51 near-simultaneous posts to both Google+ and Facebook, from the beginning of September 2011 to the present.

"Engagement" is the number of unique people (excluding me) who responded, either by commenting, liking or +1'ing the post, or liking or +1'ing a comment on the post. (The scatterplot points are perturbed slightly from their true integral values so they don't completely overlap.) What's remarkable here is how coincidentally similar the engagement is on the two networks — the difference is under 2 percent (!) despite the fact that my social network on Facebook is currently 2.25 times as large as on Google+.

Google+ has very slightly higher engagement on STEM-related posts (science, technology, engineering, and mathematics), while Facebook is slightly higher for other posts, but the differences are well within 95% confidence intervals.
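For anyone who wants to repeat this on their own posts, the engagement metric defined above can be sketched as follows. The record layout (`commenters`, `post_likers`, `comment_likers`) is a hypothetical one chosen for illustration; neither network exports data in exactly this form.

```python
def engagement(post, author):
    """Count unique people, excluding the author, who commented on the post,
    liked or +1'd the post, or liked or +1'd a comment on it.
    Re-shares are deliberately not counted."""
    people = (set(post["commenters"])
              | set(post["post_likers"])
              | set(post["comment_likers"]))
    people.discard(author)
    return len(people)

# Hypothetical example record for one cross-posted item.
example_post = {
    "commenters": ["alice", "me", "bob"],
    "post_likers": ["alice", "carol"],
    "comment_likers": ["bob", "dave"],
}
```

With this record, `engagement(example_post, "me")` counts alice, bob, carol, and dave once each, no matter how many ways each of them responded.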

It's possible my social network has somewhat shifted to Google+. Here is the post set split into five chronological partitions with 10 or 11 posts in each.

Engagement as defined above excludes re-sharing posts, because I wasn't confident the two social networks are reporting these in the same way (e.g., do they both report recursive shares?). But there is an interesting and significant difference in sharing behavior on Google+: STEM posts saw nearly seven times as much sharing as non-STEM posts in this very small data set, an effect which didn't appear on Facebook.

Of course, all of this is specific to my social network, and really, the sample size is too small to draw any conclusions at all. Now, if someone were to compare posts for a large number of people that cross-post publicly to Facebook and Google+, that could start to get interesting...

Matrix multiplication algorithms over time

2011-12-02T15:37:00.001-08:00

The asymptotically fastest algorithm for matrix multiplication takes time O(n^ω) for some value of ω. Here are the best known upper bounds on ω over time.

The latest improvements, the first in over 20 years, are due to Andrew Stothers and Virginia Vassilevska Williams. The latter gave an O(n^2.3727)-time algorithm for multiplying matrices.

When will the sometimes-conjectured ω = 2 be reached? Certainly nothing wrong with taking a linear fit of this data, right?

So that would be around the year 2043. Unfortunately, the pessimist's exponential fit asymptotes to ω = 2.30041...
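The tongue-in-cheek linear extrapolation is easy to reproduce. The sketch below does an ordinary least-squares fit and solves for the year at which the fitted line reaches a target value. The `BOUNDS` list is only a handful of well-known milestones, included as an assumption for illustration; it is not the full data set behind the plots, so its answer will not match the figure exactly.

```python
# Illustrative subset of published upper bounds on omega: (year, bound).
# An assumption for this sketch, not the complete data used in the plots.
BOUNDS = [
    (1969, 2.8074),  # Strassen
    (1990, 2.376),   # Coppersmith and Winograd
    (2010, 2.3737),  # Stothers
    (2011, 2.3727),  # Vassilevska Williams
]

def linear_fit(points):
    """Ordinary least-squares line through (x, y) points.
    Returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def year_reaching(target, points):
    """Year at which the fitted line crosses the target value."""
    slope, intercept = linear_fit(points)
    return (target - intercept) / slope
```

Calling `year_reaching(2.0, BOUNDS)` gives the optimist's estimate for this subset. The usual caveats about extrapolating a straight line through a handful of points apply in full.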

Live-Blogging HotNets 2011 (Day Two)

2011-11-16T00:21:00.001-08:00

(OK, not quite live...)

Day One was over here.

Session 5

Christopher Riederer spoke about auctioning your personal information. Unfortunately I missed the talk, but it must have been a good one since there was quite a bit of discussion.

Next up was Vincent Liu speaking about Tor Instead of IP, which is just what it sounds like: Tor as an Internet architecture. Of course you can't just use Tor as is, so they have proposals for controlling incoming traffic, handling DoS, and getting better efficiency (lower stretch) while keeping enough diversity of jurisdiction and plausible routing-policy compliance. Similar to what Telex does with network-layer steganography, the general approach here is to make Internet connectivity an all-or-nothing proposition: if you can get anywhere outside a censored region, you can get everywhere, so all the censor can do is block the entire Internet.

Last was Gurney et al's Having your Cake and Eating it too: Routing Security with Privacy Protections. The notion of security here is that the neighbors of an AS can verify that the AS in question selected and advertised routes according to its agreed-upon policy. The privacy is that they can verify this without revealing any more information than the verification itself. The paper presents protocols to verify several particular policies (e.g., if the AS was offered some route, then it advertised one). Could be useful for debugging interdomain connectivity issues.

Session 6

Steven Hong presented Picasso, which provides a nice abstraction that hides the complexity of full duplex signal shaping to utilize discontiguous spectrum fragments.

Mohammad Khojastepour spoke about using antenna cancellation to improve full duplex wireless.

Souvik Sen asked, Can physical layer frequency response information improve WiFi localization? Yes. And it involves driving Roombas around cafeterias.

Session 7

Poor old TCP felt snubbed at this workshop until Keith Winstein started roasting it. TCP is designed to work well in steady-state with very particular assumptions. It fails under messy real world conditions — hugely varying RTTs, stochastic rather than congestion-induced loss, very dynamic environments, and so on. Keith's goal is to make transmission control more robust and efficient by modeling network uncertainty. So the end-host has a model of the network (potentially including topology, senders, queues, etc.) but any of these elements can have unknown parameters describing their behavior. Then it maintains a probability distribution over possible parameter values and updates its beliefs as evidence comes in from the network (e.g., ACKs). At any given time, it takes whatever action maximizes expected utility (= some throughput / fairness / latency metric) given the current distribution of possible situations. The action is a delay until sending next packet. It's a beautifully intuitive idea, or as a certain reviewer put it,

"This is the craziest idea I've heard in a very long time."

Keith showed a simulation in one small example where this approach decides to use a slow start behavior — without that being hard-coded in — but then after getting a little info, immediately sends at the "right" rate. But there are big challenges. Some discussion touched on state explosion in the model, the danger of overfitting with an overly-complicated model, and how much the sender needs to know about the network.
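The loop described above (keep a distribution over unknown network parameters, update it as evidence arrives, then pick the send delay with the highest expected utility) can be sketched as a toy. Everything specific here is an assumption for illustration and not Keith's actual model: the only unknown is a bottleneck capacity in packets per second, the evidence is an observed ACK rate with Gaussian noise, and the utility is delivered rate minus a penalty for sending faster than the link can carry.

```python
import math

def update_beliefs(beliefs, send_rate, ack_rate, noise=0.5):
    """Bayesian update over candidate capacities.
    beliefs maps capacity -> probability. A capacity is more likely if the
    observed ACK rate is close to min(send_rate, capacity)."""
    posterior = {}
    for capacity, prior in beliefs.items():
        expected = min(send_rate, capacity)
        likelihood = math.exp(-((ack_rate - expected) ** 2) / (2 * noise ** 2))
        posterior[capacity] = prior * likelihood
    total = sum(posterior.values())
    return {c: p / total for c, p in posterior.items()}

def utility(send_rate, capacity, overload_penalty=2.0):
    """Toy utility: packets delivered, minus a penalty for excess sending."""
    delivered = min(send_rate, capacity)
    excess = max(0.0, send_rate - capacity)
    return delivered - overload_penalty * excess

def best_delay(beliefs, candidate_delays):
    """Pick the inter-packet delay that maximizes expected utility
    under the current beliefs."""
    def expected_utility(delay):
        rate = 1.0 / delay
        return sum(p * utility(rate, c) for c, p in beliefs.items())
    return max(candidate_delays, key=expected_utility)
```

Starting from `{1.0: 0.5, 10.0: 0.5}`, an observation that sending at 10 packets per second produced ACKs at 10 per second moves nearly all the probability onto the fast link, and `best_delay` then chooses the shortest delay on offer. No sending behavior is hard-coded; it falls out of the beliefs and the utility function, which is also why a mismatched model gives bad decisions.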

Q: What would be the first variable you'd add to the network model beyond what TCP captures? A: Stochastic loss.

Q (Barath): Should we add a control plane to disseminate information about the network to assist the model? A: anything that gets more info is good.

Q (Ion): When won't this work? A: Model mismatch — if the truth is something the model didn't consider.

Q: Do the smart senders coexist? If you have 2 versions of the algorithm, do they coexist? A: Good question.

Next, a movie. The BitMate paper, "Bittorrent for the Less Privileged", by Umair Waheed and Umar Saif, was one of the few entirely-non-US papers. Unfortunately visa issues prevented their attendance, but Umar presented via a movie. The problem is that high bandwidth BitTorrent nodes form mutually helpful clusters, but no one wants to upload to low bandwidth nodes. The solution included several interesting mechanisms, but Umar said the one that got 80% of the benefit was something they call "Realistic Optimistic Unchoke", which improved unchoking of low-bandwidth peers. BitMate gets dramatic improvements in efficiency and fairness.

Umar took questions via Skype. It was so 21st century.

BitMate was written up in the NYTimes.

Q: what's your experience with real deployment? A: 40,000 users from 170 countries — a lot from US "for reasons that escape me" (~40% from North America). Many users from Iran, probably to circumvent censorship.

Vyas Sekar told us that a particular real 80,000 user network with tens of sites had 900 routers and 636 middleboxes (firewalls, NIDS, media gateways, load balancers, proxies, VPN gateways, WAN optimizers, voice gateways). Problem with middleboxes: device sprawl. Another problem: dealing with many vendors (Bruce Davie, Cisco: that problem we can fix!). Result: high CapEx and high OpEx. (Network security alone cost $6B in 2010, with a projected $10B in 2016.) Also, middleboxes today are inflexible and difficult to extend.

So most net innovation happens via middleboxes, but it doesn't come easily. And middleboxes have been missing from the research community's discussion of innovation in networks. The Vision: Enable innovation in middlebox deployments. Approach: (1) software-centric implementations, (2) consolidated physical platform, (3) logically centralized open management APIs. There are also opportunities for reduced costs (via multiplexing) and improved efficiency (via reuse of functions like session reconstruction of TCP flows).

Session 8

This was the data center session with three more cool papers, but unfortunately I had to leave early.

See the full program.

Overall I was impressed with the ideas and engaging presentations. Thanks to the chairs Aditya and Ion for a very well-run workshop.

Live-blogging HotNets 2011

2011-11-14T13:41:00.000-08:00

Lots of exciting talks and discussion at HotNets. Here are a few highlights.

Session 1

The first session was on Internet architecture. Ali Ghodsi spoke about three unanswered questions for the burgeoning area of {data/information/content}-{centric/oriented} networking. These were privacy, data-plane efficiency, and whether ubiquitous caching (a key feature of nearly all the proposals) actually provides quantitative improvement. For the latter point, the argument is that work on web caching from the late 1990's indicated that if you have caching near the edge (as already exists in present-day web caches), then adding ubiquitous caching to the architecture does not provide much more benefit due to heavy-tailed access distributions. So, does the caching advantage of information-centric networking warrant such a large-scale architectural change?

In later one-on-one discussions, Dan Massey (who later gave an interesting talk on the IPv4 grey market) argued that at least for the NDN project, ubiquitous caching is not the focus. Something more important is being able to do policy-aware multipath load balancing — in a very dynamic way in the network, by shipping content requests optimistically to multiple locations and seeing what ends up working well. A kind of speculative execution for forwarding. This may not be specific to content-awareness, but Dan argued that if you want to make this work, you end up needing something like NDN's Pending Interest Table mechanism. (The discussion was brief, but hopefully I restated the argument accurately.)

Dave Andersen and Scott Shenker argued that the primary goal of a future Internet architecture should be to accommodate evolution within the architecture itself, rather than just adding new functionality. The XIA approach introduces the notion of data-plane fallbacks so the sender can ask for new functionality and if it isn't supported everywhere, things still work. Scott focused on bringing evolvability to the architecture by applying the principles of extensibility and modularity.

There were several questions about what would be the incentives to deploy either approach. Scott responded that while incentives are important, first we need to understand what technical mechanisms we need to make evolvability feasible — which previously we have not understood. Ion Stoica asked, What would be the first thing that would drive deployment of one of the future Internet architectures? Some of the answers included SCADA networks which need extreme security, content caching (despite the first talk!) where content providers have monetary incentives, and (from Hari) the ability to deploy differential pricing by having more information about applications' intent (though users may not like this!).

Another question was whether these architectures would actually have fixed the processes which led to ossification in practice (e.g., via middlebox problems); and whether they'd aid deployment of protocols like secure BGP which have had problems in practice.

Session 2

Haitao Zheng from UCSB spoke about building wireless data centers, where directional wireless interfaces on racks of servers can be dynamically steered to connect pairs of racks that need to communicate. If you want to connect racks of servers with high-bandwidth wireless links, interference is a big problem. Their approach is 3D beamforming: Rather than aiming the radio directly (in the 2D plane at the top of racks), bounce it off a reflective ceiling and put an absorber around the target. This direction of reception reduces interference. In addition to having many pretty pictures of interference patterns, this is part of a line of work (in wireless and optical) that has a very cool approach — we always think of changing the traffic flow to match the topology; now we can change the physical topology to match the traffic.

Abhinav Pathak talked about finding energy bugs in mobile devices — as he said, that hits three hot keywords.

Jonathan Perry spoke about Rateless Spinal Codes. My main question: why do coding schemes get to have such cool names?

Session 3

Mark Reitblatt spoke on "Consistent Updates for Software-Defined Networks: Change You Can Believe In!". Here is the problem they are solving: As you are reconfiguring your network, how can you be sure your policy (like availability or security) is preserved even during the transition? Traditionally, this is hard because of the inconsistency of having one set of forwarding rules deployed in some places and another set deployed in others. Actually, it might seem impossible. Even if you magically deploy a change everywhere instantly, you can still get policy violations because packets take a non-negligible amount of time to cross the network, so a packet in flight can be handled by old rules at one hop and new rules at the next. Can you solve it? Yes you can!

Junda Liu should get some sort of award for giving the most entertaining talk that also featured a state machine diagram.

Barath Raghavan calculated the energy and emergy of the Internet, which has been getting some press recently and which generated a lot of discussion on the complexities and implications of measuring society's energy use.

Awesome feature of this session: all the talks finished early!

Session 4

Jon Howell spoke on a proposed refactoring (and narrowing) of the API for web applications executing on user machines.

Ethan Katz-Bassett spoke about Machiavellian Routing. The coolness here is a trick by which ISPs can control inbound routing, so if they notice there is a connectivity problem at some AS they can induce other senders to avoid the problem.

Dan Massey noted that a grey market is emerging for IPv4 addresses and argued that, rather than trying to prevent the market from existing outside traditional Internet governance, we need a way to verify what transactions happen. This would make the market more honest and efficient. Most interesting point from discussion (I think from Jon Howell): Why do we want to improve the IPv4 market? This will allow more efficient use of available IPv4 addresses ... but if we let the market be as baroque and inconvenient as possible, it will encourage deployment of IPv6 sooner!

Onward to Day Two...

What's wrong with computer science reviewing?

2011-08-03T01:12:00.000-07:00

There is a sense among some researchers in computer science that many peer reviews in our field are bad — in particular, too often unfairly slanted against papers in various ways that do not encourage good science and engineering. Why might this be happening and what can we do about it?

[Aside: you can now follow me on Google+. Short posts there, long posts here.]

The problem

First of all, let me be clear: (1) I think most reviews, regardless of whether they recommend acceptance or rejection, are well done and reflect care and significant time that the reviewers have invested. (2) Exciting, impactful research still manages to get done, so the system does generally work pretty well. Still, that doesn't mean we can't improve it. (3) Despite the fact that I try to take care with each review, statistically speaking I have probably committed each of the problems discussed here.

With the fine print out of the way ... what is this possible problem? The evidence is almost all anecdotal and biased. But since that is all we have, let me supply some anecdotes, first that reviews are often negative:

This July's edition of Computer Communication Review rejected all its peer-reviewed submissions, thus matching (at least for one issue) the Journal of Universal Rejection as the most prestigious journal as judged by acceptance rate.
Jeffrey Naughton critiqued the state of research in the databases community with bad reviewing ("Reviewers hate EVERYTHING!") as a key problem, giving the anecdote that in SIGMOD 2010, out of 350 submissions, only 1 paper had all reviews rate it "accept" or higher; and only 4 had an average rating of "accept" or higher.
Taking a recent major systems-and-networking-related conference as a representative presumably-normal example, papers received generally 4 or 5 reviews and were scored on a scale of 1 to 5. Out of about 177 submissions, only six accepted papers received any 5's. Only one received more than a single 5, which also says something about variance.
"...there is a pervasive sense of unease within these communities about the quality and fairness of the review process and whether our publication processes truly serve the purposes for which they are intended. ... It was clear from the reaction to the panel that concerns with the reviewing process cut across many, if not all, fields of computer science." (Panel summary from a session at the 2008 CRA Conference at Snowbird)

Is it just a reweighting problem? If reviews were just conservative numerically, any problem could be fixed by "curving" the scores up. That would be nice, but ... another anecdote:

A survey asked authors of SIGCOMM 2009 submissions whether they agreed that "The reviews were technically correct". Roughly one-third of respondents disagreed or strongly disagreed, about a third agreed or strongly agreed, and another third had no opinion.

Even taking into account the fact that the survey included authors whose submissions were rejected and might just be grumpy, those numbers seem undesirable. Reviews are sometimes wrong or emphasize unimportant or subjective problems, even when the paper gets accepted. (And that's not necessarily the reviewer's fault.)

But this is true in every field, right? After all, authors have been complaining about criticisms for centuries. Here, by the way, are a couple favorite criticisms:

"Your manuscript is both good and original. But the part that is good is not original, and the part that is original is not good." (unattributed)

"In one place in Deerslayer, and in the restricted space of two-thirds of a page, Cooper has scored 114 offenses against literary art out of a possible 115. It breaks the record." (Mark Twain, How to Tell a Story and Other Essays.)

Getting back to the point, there is plenty of precedent for reviewers not seeing the light. But there is anecdotal evidence that this is a bigger problem in CS than in certain other areas:

A study of NSF panel reviews found that reviewers in computer science give lower scores on average than in other areas. (Note: I read this in CACM or some other magazine but now I can't find it; if you can, please let me know.) Update: Here's the data: CISE proposals average 0.41 points lower than other directorates.
While it appears to be a common (but not universal) belief in CS that reviewers are too-often wrong and frustrating, I'm told by at least one physicist that that is not the general feeling about reviewers in that field. They are "rarely out to actively find problems with your paper", and while they may often misunderstand parts of the paper, the authors can respond and usually the reviewers or the journal editor will accept the response. Publication is still competitive and often annoying for other reasons, but reviewers are generally reasonable.

So what? Does this have any negative impact? As pointed out by others:

Researchers may be discouraged. (I know of at least one top PhD graduate who went to industry citing weariness with "selling" papers as one cause.)
It puts CS at a disadvantage with other fields, if we are generally more negative in grant proposal reviews. As Naughton wrote, "funding agencies believe us when we say we suck".
More speculative or unusual work (with dozens of potential challenges for the approach that a reviewer could cite) is at a disadvantage compared to work with well-known quantitative metrics for evaluation.
Variance in reviews may make papers more likely to return for another round of reviewing at another conference, increasing time to publication and reviewer workload.

Causes of the problem

Naughton suggested that "Reviewers are trained by receiving bad reviews from other reviewers who have received bad reviews in the past". Keshav suggested human failings and increasing reviewer workloads. Without disagreeing with those possibilities, I'm wondering what about CS in particular might exacerbate the problem? Here are two ideas.

No author response to reviewers. As a consequence of CS's focus on conferences, most venues (in my area) have no opportunity for authors to answer reviewer criticism. The communication is author ---> reviewer ---> author, with no feedback to reviewer. It's a little like putting papers on trial without a defense team. As a result:
- Bad reviews are more likely to happen, because the reviewer typically never learns if they have submitted a bad review, and is not really held accountable.
- Once a bad review does happen, there's no chance to fix it.
Focus on bugs. (This is extremely speculative.) As computer scientists, we are really great at spotting bugs, and that's a good thing when you're writing code. Possibly, some of that carries over into reviewing more than it should. Maybe bug-finding is easier than thinking carefully about contributions of the paper — especially if, once you honestly think there's a bug, you don't have to do any more work even if you're wrong. (Just noticed that someone else had the same idea.)

Fixing the problem

I'm suggesting these as possible directions to discuss, not as solutions I think are guaranteed to work.

Allow authors to respond to reviewers. Just as in TCP's three-way handshake, one would hope that both involved parties get feedback. Responses, at least in theory, (1) create incentive for better reviews and feedback to help improve, (2) allow authors to point out simple misunderstandings in reviews. (Note that some venues, like ASPLOS, have rebuttals. And in fact, CCR has reasonably fast turnaround and allows responses to reviewer comments as they arrive. Apparently that didn't help the July issue, though...)

One could argue that reviewers already have an incentive to do well, because they have their reviews looked at (or even voted upon!) by other program committee members. But other reviewers don't know the paper and its area as well as the authors; and reviewers have at least as much incentive to maintain a friendly relationship with other reviewers as they do to argue the case for a specific paper. Arguing a case after one reviewer has taken a negative stand involves extra effort and to a certain extent puts one's reputation on the line. I suspect the most effective response comes from the authors. They have the needed incentive and knowledge.

That seems like the most obvious approach, but it does require organizational change. There are some smaller steps that might be easier on an individual level.

Avoid Naughton's checklist for bad reviewing. Quoting him directly:
- Is it "difficult"?
- Is it "complete"?
- Can I find any flaw?
- Can I kill it quickly?
Focus on what a paper contributes, not on what it doesn't contribute, which is always an infinitely long list. Focusing on the absent results will inevitably lead any paper to the wastebin of rejection and any author to a pit of misery.

In particular, it seems to me that "This paper didn't show X" is, by itself, not a valid criticism. It is an irrelevant factoid unless it negates or diminishes some other contribution in the paper. If it is fair to argue that particular results are absent, then my first beef with every paper is going to be that it fails to resolve whether P ≠ NP.

Of course, a paper should get more "contribution points" for a better and more thorough evaluation, but perhaps it's OK to leave some questions unanswered. Particularly since it's often hard to predict which particular dimension or potential inefficiency the reviewers will be interested in. Leaving certain questions unanswered is entirely compatible with the paper making other useful contributions.

Submit to arXiv, bypassing the reviewing process entirely and letting other researchers judge what they want to read. Subscribe to arXiv RSS feeds so you find out about other people's work more quickly. Of course, arXiv currently has limited value for CS systems and networking researchers, since other such researchers tend not to look for papers there. More on that later.

Adopt policies that tolerate some reviewer pessimism. As an example of what seems to me like a bad idea, a recent workshop had a reviewing policy that allowed a single reviewer to effectively veto a paper if they strongly disliked it.

Implement feedback yourself. If a conference doesn't provide a means for author feedback to reviewers, the reviewer could implement this herself by including in the review a way to provide feedback, e.g., a link to a Google Docs form that could preserve the anonymity of the reviewer and authors. Disadvantage: This only fixes a piece of the problem and might seem strange to PC chairs and authors.

Other past suggestions include reducing PC workloads, making reviews public, maintaining memory across conferences (so resubmissions are associated with old reviews), and much more; see links above and below.

The open question is, which of these will best improve the quality of reviews and, ultimately, CS research? My guess is that any good solution will include some form of author response to reviewers, but there are several ways to do that.

There's voluminous past discussion on this topic. Related links:

Workshop on Organizing Workshops, Conferences, and Symposia for Computer Systems, in particular, see papers/talks in the 11:00 a.m. session. One became an article in CACM.
Open Issues in Organizing Computer Systems Conferences (Jeffrey Mogul and Tom Anderson, CCR July 2008)
Paper and proposal reviews: is the process flawed? (summary of a panel session at the 2008 CRA Conference at Snowbird)
Peer reviews: make them public (Nature correspondence)
Peer reviews: some are already public (Nature correspondence)
How Should Peer Review Evolve? (Ed Chi, blog@cacm)
Conferences vs. Journals in Computing Research (Moshe Vardi, CACM 2009)

Update: SIGCOMM 2012 will have rebuttals. Also, Bertrand Meyer has something to say about CS reviewing.

Attractive scientific plots with gnuplot

2011-02-12T23:52:00.000-08:00

I use gnuplot for nearly all my graph-drawing for academic publications. On the whole, it's clean and relatively flexible, and that combined with inertia has been enough to keep me from trying interesting alternatives like matplotlib, Plot, ploticus, and R. However, gnuplot's default output is not especially pretty. I often see graphs in papers that look like this...

...or worse, if it's been bitmapped rather than using EPS or PDF. With some tweaking, however, one can produce much more attractive output. I would much rather look at plots like this:

In fact, the real output looks better than the image above suggests: Blogger doesn't seem to support any vector image format, but here are the PDF version and the SVG version. To produce the PDF version, you need gnuplot 4.4's pdfcairo terminal. Below are the gnuplot files for the two plots above, followed by an SVG variant of the second.

# Default gnuplot styling, for comparison.
set terminal postscript eps color

set output "boring_default.eps"
set xlabel "x axis label"
set ylabel "y axis label"

set key bottom right

set xrange [0:1]
set yrange [0:1]

plot "template.dat" \
   index 0 title "Example line" w lp, \
"" index 1 title "Another example" w lp

# Note you need gnuplot 4.4 for the pdfcairo terminal.

set terminal pdfcairo font "Gill Sans,9" linewidth 4 rounded fontscale 1.0

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0  # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
# Remove border on top and right.  These borders are useless and make it
# harder to see plotted lines near the border.  Also, put it in grey; no
# need for so much emphasis on a border.
set border 3 back linestyle 80
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow.  Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 pt 1
set style line 2 lt rgb "#00A000" lw 2 pt 6
set style line 3 lt rgb "#5060D0" lw 2 pt 2
set style line 4 lt rgb "#F25900" lw 2 pt 9

set output "template.pdf"
set xlabel "x axis label"
set ylabel "y axis label"

set key bottom right

set xrange [0:1]
set yrange [0:1]

plot "template.dat" \
   index 0 title "Example line" w lp ls 1, \
"" index 1 title "Another example" w lp ls 2

# SVG version of the same plot.
set terminal svg size 320,240 fname "Gill Sans" fsize 9 rounded dashed

# Line style for axes
set style line 80 lt 0
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 3  # dashed
set style line 81 lt rgb "#808080" lw 0.5  # grey

set grid back linestyle 81
# Remove border on top and right.  These borders are useless and make it
# harder to see plotted lines near the border.  Also, put it in grey; no
# need for so much emphasis on a border.
set border 3 back linestyle 80
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes logscale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow.  Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt 1
set style line 2 lt 1
set style line 3 lt 1
set style line 4 lt 1
set style line 1 lt rgb "#A00000" lw 2 pt 7
set style line 2 lt rgb "#00A000" lw 2 pt 9
set style line 3 lt rgb "#5060D0" lw 2 pt 5
set style line 4 lt rgb "#F25900" lw 2 pt 13

set output "template.svg"
set xlabel "x axis label"
set ylabel "y axis label"

set key bottom right

set xrange [0:1]
set yrange [0:1]

plot "template.dat" \
   index 0 title "Example line" w lp ls 1, \
"" index 1 title "Another example" w lp ls 2

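The scripts above all read template.dat, which isn't shown in this post. Here is a minimal sketch of one way to generate a compatible file, assuming two made-up curves (x squared and the square root of x) that fit the [0:1] ranges. The function name `template_dat` and the curves are my own placeholders, not the data behind the original plots. The one detail that matters is that gnuplot's `index 0` and `index 1` select data sets separated by two blank lines.

```python
import math


def template_dat(n=11):
    """Return the text of a two-data-set file for gnuplot's `index` keyword.

    The curves are placeholders; any "x y" columns within [0, 1] would do.
    """
    xs = [i / (n - 1) for i in range(n)]
    block0 = [f"{x:.2f} {x ** 2:.4f}" for x in xs]        # read by `index 0`
    block1 = [f"{x:.2f} {math.sqrt(x):.4f}" for x in xs]  # read by `index 1`
    # Two blank lines between the blocks mark the start of a new data set.
    return "\n".join(block0) + "\n\n\n" + "\n".join(block1) + "\n"


if __name__ == "__main__":
    with open("template.dat", "w") as f:
        f.write(template_dat())
```
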
Now here's something for which I would pay (some) real money: a gnuplot terminal which outputs directly to Keynote. Then, for example, during a presentation, one could have lines in the plot appear one at a time, explaining each without the distraction of showing irrelevant objects. This should actually be quite doable since Keynote's format is just a zipped XML.

Update 2011.04.09: Mac users of macports may note that the default install of gnuplot for some reason excludes pdfcairo. Abhinav Bhatele writes with instructions for enabling pdfcairo in macports:

$ sudo port edit gnuplot

Add these lines somewhere in the file (I added them before the lua variant):

variant pangocairo description "Enable pdfcairo" {
     depends_lib-append      port:pango
     configure.args-delete   --without-cairo
     configure.args-append   --with-cairo
}

$ sudo port info gnuplot

Just to check that pangocairo variant exists. And then:

$ sudo port uninstall gnuplot
$ sudo port install gnuplot +pangocairo

You'll need to keep in mind that if you do port selfupdate,
the edited version of the portfile might get overwritten.

Update to the update 2011.12.04: Looks like macports now includes the pangocairo variant, but still does not install it by default; so it should work if you run just the last two lines.

Update 2011.04.09: Added SVG version and made it slightly more beautiful.

Update 2013.06.21: Added fontscale 1.0 in PDF version. Also, it seems the SVG output now looks somewhat different in a more recent gnuplot ... will have to fix that sometime.

A peer-reviewing horror story

2010-12-28T13:19:00.000-08:00

A peer-reviewing horror story. Don't let this happen to you.

Also available in various ebook formats.

Google Frequency Plotter

2010-10-23T22:05:00.000-07:00

Here's an app version of this xkcd comic that lets you plot the frequency of phrases according to Google searches.

[Interactive chart: plot a phrase containing _ for _ in a range like 1-10 or a list like monday,tuesday,wednesday. Each chart has a permalink.]

Some examples below the fold.

Sanity checks

How convenient.

Politics

[Thanks: Bryan]

Most frequent birth year: 1982

What you've got

Mind your phone

Procreation

Post permalinks to your favorites in the comments below.

Disclaimers: The number of search results reported by Google's search API is known to be occasionally bogus and is not a reliable indicator of anything in particular. Also, there's some bug here if you do a range query with only a single value. Finally, an exclamation point in the status message indicates an error in one query or the lack of any results.

Ig Nobel candidate

2010-10-19T18:11:00.000-07:00

"University of Ljubljana researcher Borut Povse is conducting experiments in which a robot limb repeatedly hits human volunteers on the arm to evaluate human-robot pain thresholds in order to facilitate adherence to Isaac Asimov's first law of robotics, which prohibits robots from injuring people." [ACM TechNews]

Public reviewing

2010-10-02T23:27:00.000-07:00

The New York Times has a piece on public review, even to the point of crowdsourcing, as a partial alternative to peer review for scholarly publications. The interesting bit is that at least one journal, Shakespeare Quarterly, has tested this open review process. You can view the submitted papers and discussion (including a paper providing an information-theoretic analysis of Shakespeare). The interface seems well designed and allows commenting on individual paragraphs.

There are doubtless situations where this opens the reviewing process to trolls, flamewars, or inerudite remarks. On the other hand, assuming the comments are used to help inform a final judgement by experts, there could be advantages. Reviewing a paper is sometimes like an NP search problem, to find the contributions and weaknesses. Public review could be seen as using crowdsourcing to tackle the search problem. Certain comments would be easily verifiable by an expert, even without relying on the trustworthiness of the anonymous commenter, yet would not necessarily have been noticed by any particular expert. (The same easy-verification property is one reason Wikipedia is useful even when you're looking for a reliable answer to a question.)

In computer networking research, the closest we come to a collaborative, real-time form of reviewing is Computer Communication Review which has just recently started returning reviews to authors as they are submitted, and allowing authors to comment on the reviews.

Unrelated fun ... here's Seaquence, a clever visualization of musical composition.

Programming Language Wars: The Movie

2010-09-14T21:56:00.000-07:00

In computer science and hacker circles, the programming language wars have, it seems, been raging since the beginning of time. A little electronic archaeology reveals some amusing exchanges:

"By all means create your own dialect of FORTH. While your at it, you can add the best features of PL-I, F77 and CORAL66. Then, look me up when you get out of college and we'll show you how it's done when you have to make a living" [1985 thread]
"This debate ... is very much like two engineers engaged in building a three-mile bridge arguing over the brand of blue-print paper they use." [1987 thread]

Passionate arguments can often be improved by actual measurements. How fast, expressive, and efficient is a particular language? That's what The Computer Language Benchmarks Game set out to provide, measuring time, source code length, and memory use of several dozen languages across a set of benchmarks.

If you have measurements, why not improve them with a visualization? And so I present to you an interactive, multi-dimensional, dynamic, whizbang-o-matic rendering of the Programming Language Wars.

Each circle is a language. Its horizontal position represents the gzipped source code size used to implement the benchmarks, which is intended to measure the language's "expressiveness". Its vertical position represents the real time used to execute the benchmarks, and its size (and color) indicate how much memory was used.

The cluster of languages in the top left are slow but expressive scripting languages. At the bottom right you will find C and C++, the fastest languages, but which take quite a bit more coding to get the job done. In between there is a tradeoff between speed and expressiveness, where lie languages like OCaml (which I happen to use whenever possible).

Actually, each point is only a summary of the language's performance: Consider some metric, like real time, and some particular language L. The Benchmarks Game folks ran implementations of a set of about 12 benchmarks (FASTA, Mandelbrot, ...) in L. L's time for each benchmark is divided by the best time across all languages for that benchmark. This gives us a normalized score for each benchmark; we take the median of these to produce a summary real time score for L. Then we do the same for the other metrics: CPU time, source code length, and memory.
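
The summary step can be sketched in a few lines of Python. This is only an illustration of the normalize-then-take-the-median idea described above; the function name `summary_scores` and the timings are hypothetical, not actual Benchmarks Game data.

```python
from statistics import median


def summary_scores(measurements):
    """measurements: {language: {benchmark: value}}, lower is better.

    Each value is divided by the best value for that benchmark across all
    languages; a language's summary is the median of its normalized values.
    """
    benchmarks = {b for per_lang in measurements.values() for b in per_lang}
    best = {
        b: min(per_lang[b] for per_lang in measurements.values() if b in per_lang)
        for b in benchmarks
    }
    return {
        lang: median(value / best[b] for b, value in per_lang.items())
        for lang, per_lang in measurements.items()
    }


# Made-up real times in seconds for two made-up languages.
example = {
    "A": {"fasta": 2.0, "mandelbrot": 4.0, "nbody": 3.0},
    "B": {"fasta": 4.0, "mandelbrot": 4.0, "nbody": 9.0},
}

if __name__ == "__main__":
    print(summary_scores(example))  # {'A': 1.0, 'B': 2.0}
```

A language with no implementation for some benchmark simply contributes fewer values to its median here, which is one of the ways the summary can mislead.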

The plot shows data for a single-core x86 box (assuming you haven't yet messed around with the controls). If you press the movie button in the bottom left, it will transition to results on a quad-core box. (Still normalized by the best single-core score. The labels say 1901 and 1904 since Google's API wants dates.) TIP: When you play the animation, select a few languages you're interested in and check the Trails checkbox, so the movement stands out.

To better visualize which languages' implementations took advantage of parallelism, click Play. The languages that move downward have improved their real time. Some stay in the same spot, probably indicating that the Computer Language Benchmarks Game doesn't have the best implementations.

Fine. Just tell me which language is best.

These benchmarks are almost certainly not representative of what you want to do. There are various flaws in this approach — how we choose to summarize (the median here) will affect the ordering of languages; the implementations are not perfect; some languages are missing implementations for some benchmarks; even for one language there are many possible implementations with different tradeoffs and only the fastest was tested; and so on. Perhaps most significantly, we're completely lacking important metrics like programmer time, maintainability, and bugginess.

Thus, just as someone out there thinks Circus Peanut Gelatin pie is a good idea, so most of these languages are the right tool for some job. We can't use these benchmarks to brand a language as useless. What I think the benchmarks and visualization can do is introduce you to general-purpose languages that may be a better solution for many tasks.

In particular, you might want to take a gander at the Pareto-optimal languages: those which, for every other language L, are better than L in at least one metric under consideration. If we consider source code length and real time as the two metrics, then the Pareto-optimal languages are:

1 core, from more expressive (top) to faster (bottom):

Ruby 1.9
Ruby JRuby
JavaScript TraceMonkey
Python PyPy
JavaScript V8
Lua LuaJIT
Haskell GHC
Java 6 Steady State
C GNU gcc

4 cores, from more expressive (top) to faster (bottom):

Ruby JRuby
Python CPython
Erlang HiPE
OCaml
Haskell GHC
F# Mono
Scala
Java 6 Steady State
C GNU gcc
C++ GNU c++

From top to bottom, these languages trace the best points in the tradeoff between expressiveness (at top) and speed (at bottom). Perhaps what this does best is to illustrate why it is hard to pick a "best" language. For the single-core case, 27% of the languages are on the list above; for quad-core, 48% made the cut. Even with just two simple metrics, many languages might be considered "optimal"!
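
For the curious, here is a minimal sketch of how such a list can be computed. It uses the standard dominance test (a point is dropped if some other point is at least as good in every metric and strictly better in one), which agrees with the definition above whenever there are no exact ties. The function name `pareto_optimal` and the four data points are made up for illustration.

```python
def pareto_optimal(points):
    """points: {name: (code_size, real_time)}; lower is better in both."""

    def dominates(a, b):
        # a dominates b: no worse in every metric, strictly better in one.
        return all(x <= y for x, y in zip(a, b)) and any(
            x < y for x, y in zip(a, b)
        )

    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


# Hypothetical normalized scores, not real benchmark results.
example = {
    "terse_but_slow": (1.0, 50.0),
    "balanced": (2.0, 5.0),
    "verbose_but_fast": (4.0, 1.0),
    "dominated": (3.0, 6.0),  # "balanced" is both shorter and faster
}

if __name__ == "__main__":
    print(pareto_optimal(example))
    # ['terse_but_slow', 'balanced', 'verbose_but_fast']
```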

Coming soon: a visualization of the 3D tradeoff space.

Notes

Last year, Guillaume Marceau produced informative plots based on The Computer Language Benchmarks Game, similarly showing the tradeoffs between metrics. The CLBG has now included similar plots on their site. [Updated: the CLBG didn't use Marceau's plots directly.] The visualization here summarizes more (which can be good and bad), includes the memory metric and the quad-core data, and lets you interactively play with the data thanks to Google Motion Charts. A chat with Rodrigo Fonseca several years ago got me interested in plotting these tradeoffs. Finally, my apologies to those without Flash for the chart.

What happened to the Internet on Friday

2010-08-30T04:54:00.000-07:00

Note to readers: Judging from the past, this blog will have posts related to both computer science and politics. If you like, you can view just CS or just politics posts, or subscribe to feeds for just CS or just politics.

On Friday, a large disruption of Internet traffic made the news as an experiment gone awry. What actually happened? It's a good lesson in how fragile and insecure the Internet's routing protocol can actually be.

There was indeed a major event on Friday. A plot by Earl Zmijewski of Renesys shows that at the moment the experiment started — 8:41 GMT — about 3,000 IP prefixes became unstable. That is, the routes to these prefixes were quickly changing or being advertised and withdrawn. (An IP prefix is a chunk of destination IP addresses, the basic unit on which Internet routing operates.) Since there are roughly 300,000 prefixes announced globally, this is about 1% of the prefixes on the entire Internet.

We can also observe the effects by looking at the total amount of "chatter" in the Internet's global routing protocol, BGP. I created the following graphs based on raw data from the Route Views project.

This plot shows the rate of BGP messages received by one particular router, located at the London Internet Exchange. Routers are continually exchanging messages about new, changing, or unavailable routes to destinations all around the world. However, as you can see, the event in question vastly increased the rate of routing updates, exceeding the "background radiation" of messages by about a factor of 6.

The event was visible globally. Here is the same plot, for a router at Equinix in Ashburn, VA.

How can a disruption of this magnitude happen? Based on a note from RIPE, it went something like this:

Researchers at RIPE and Duke create a BGP announcement message, which advertises the availability of an IP prefix under their control. The message uses an unusual format, but one which complies with the BGP protocol format.
The message begins to propagate from router to router on the Internet, as normal, until...
...it reaches some router running the Cisco IOS XR software. These routers have (or had) buggy software which, upon receipt of the unusual message, corrupted the message before propagating it to other neighboring routers.
A neighboring router (call it N) has now received a malformed message from the Cisco router (call it C). N then follows the BGP protocol specifications which require that N terminates its BGP connection to C. This disrupts traffic to any destination which N reached via C (and vice versa) — not just traffic to the prefix originally announced by RIPE!
It is likely (depending on router configuration, so I'm not sure how common this is) that either C or N then attempts to re-establish the BGP connection. In this case, C re-advertises every route it knows about to N — perhaps all 300,000 of them. And one of these would presumably be the corrupted message, causing the connection to again be terminated, and the process to repeat indefinitely.

It's always a good idea to isolate security problems to contain damage. And many BGP problems can be isolated close to the origin of the bad announcement. This event, on the other hand, apparently caused (brief) widespread damage for two reasons.

It spread geographically because the original announcement message was entirely valid, and was handled correctly by many routers. Thus, the message could reach buggy routers anywhere on the planet.
It spread to many IP prefixes beyond the original announced prefix, because the BGP protocol spec asserts that if a router sends a bad message for one prefix, it's unsafe to communicate with that router for destinations in any prefix.

Similar events have occurred in the past.

Despite the headlines ("Research experiment disrupts Internet") on Slashdot and Network World and the Renesys post's point of view, I find it hard to place blame on the researchers. I assume it was not their intent to stress-test the live Internet. Clearly, one problem was the software bug which Cisco quickly acknowledged and fixed.

But we can also think about the protocol design. One way to better isolate the damage might have been for the router N to discard only the single malformed message from C, logging an error message but not terminating the entire BGP session between N and C. The counter-argument is that receipt of one malformed announcement raises the probability that other announcements are malformed, too. Indeed, in this bug, C apparently declared an incorrect header length; something similar to this could plausibly confuse N's parsing of all subsequent messages from C.
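
To make the difference between the two policies concrete, here is a toy model in Python. It is emphatically not real BGP: the function `receive`, the policy names, and the example prefixes are all hypothetical, and actual session handling involves timers, state machines, and much more. It only captures the loop described above, where C re-advertises everything after each reset.

```python
def receive(updates, policy, max_resets=3):
    """Toy model of router N processing updates from neighbor C.

    updates: list of (prefix, is_malformed) that C sends, in order, each
             time the session is (re-)established.
    policy:  "reset_session" drops the session on any malformed update;
             "discard_message" skips only the malformed update.
    Returns (routes N ends up with, number of session resets).
    """
    routes = set()
    resets = 0
    while True:
        session_survived = True
        for prefix, is_malformed in updates:
            if is_malformed:
                if policy == "reset_session":
                    routes.clear()  # everything learned from C is lost
                    resets += 1
                    session_survived = False
                    break
                continue  # discard_message: ignore just this update
            routes.add(prefix)
        # Give up after max_resets so the toy model terminates.
        if session_survived or resets >= max_resets:
            return routes, resets


# Documentation-range prefixes, chosen arbitrarily for the example.
updates = [
    ("10.0.0.0/8", False),
    ("192.0.2.0/24", True),  # the corrupted announcement
    ("198.51.100.0/24", False),
]

if __name__ == "__main__":
    print(receive(updates, "reset_session"))    # (set(), 3)
    print(receive(updates, "discard_message"))  # both good prefixes, 0 resets
```

The toy model deliberately ignores the counter-argument: it assumes N can always find the boundary of the next message, which is exactly what a bad header length could break.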

Looking into the much more distant future, a very different approach would be to base routing decisions on end-to-end observable behavior (do my packets actually get through to the destination along this path, or not?) rather than on relatively uninformative and attack-prone control plane announcements. This robustness is one potential benefit of designs like our pathlet routing and Xiaowei Yang's NIRA.

SIGCOMM 2009

2009-06-13T14:13:00.000-07:00

For readers interested in networking, I note that the SIGCOMM 2009 program is now available.

Also available is Pathlet Routing, our paper with Igor Ganichev, Scott Shenker, and Ion Stoica. Pathlet routing is a new Internet routing architecture which can improve scalability by enabling very small forwarding tables, and can allow senders to choose between multiple paths for improved reliability and path quality. The idea is basically to do source routing over a virtual topology whose nodes are arbitrary virtual nodes (vnodes) and whose links are sequences of vnodes (pathlets). Intuitively, this architecture is highly flexible because vnodes can represent arbitrary granularities, and because pathlets can represent policy constraints on routing while simultaneously enabling a large number of path choices. This is because sources can stitch together pathlets to form an end-to-end route in potentially exponentially many ways.
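
A small sketch may help with the "exponentially many ways" claim. It models a pathlet as nothing more than a tuple of vnode names and counts the ways a source can chain pathlets end to end. This is a counting toy under my own simplifications, not the forwarding mechanism from the paper; `count_routes` and the ladder topology are invented for the example.

```python
from collections import defaultdict


def count_routes(pathlets, src, dst):
    """Count the pathlet sequences leading from src to dst.

    pathlets: iterable of tuples of vnode names, e.g. ("a", "b", "c").
    Assumes the pathlets contain no cycles.
    """
    by_start = defaultdict(list)
    for pathlet in pathlets:
        by_start[pathlet[0]].append(pathlet)

    memo = {}

    def count(vnode):
        if vnode == dst:
            return 1
        if vnode not in memo:
            # Every pathlet starting here can be followed by any route
            # from its last vnode onward.
            memo[vnode] = sum(count(p[-1]) for p in by_start[vnode])
        return memo[vnode]

    return count(src)


def ladder(stages):
    """Two alternative pathlets per stage: a direct hop and a detour."""
    pathlets = []
    for i in range(stages):
        pathlets.append((f"v{i}", f"v{i + 1}"))
        pathlets.append((f"v{i}", f"detour{i}", f"v{i + 1}"))
    return pathlets


if __name__ == "__main__":
    # 20 pathlets advertised, 2**10 = 1024 end-to-end routes available.
    print(len(ladder(10)), count_routes(ladder(10), "v0", "v10"))
```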

An interesting property of the design is that it doesn't impose a global requirement on what "style" of routing policy is used, but rather allows multiple styles to coexist. One router could choose to have routes like in today's Internet, with a giant forwarding table specifying only a single allowed route to each destination. And the next router could have a tiny forwarding table that still gives the network owner some control, but provides a high degree of path choice for the senders. I think of this as being very much in the spirit of the principle of designing for variation in outcome advocated by Clark et al. in their Tussle in Cyberspace paper.

Recent results in nonlinear peer-to-peer reviewing algorithms

2008-09-11T00:23:00.000-07:00

On Monday I received spam, as many periodically do, from the World Multi-Conference on Systemics, Cybernetics and Informatics, WMSCI '09. Their peer review process having been previously demonstrated by Stribling et al. to be susceptible to random paper generation, they have a new strategy:

Submitted papers or extended abstracts will have three kinds of reviews: double-blind (by at least three reviewers), non-blind, and participative peer-to-peer reviews.

(Emphasis mine.) The conference web site further describes the peer-to-peer review process as

Informal, nonlinear, systemically interactive methods, for the achievement of what is called bottom-up quality[.]

This is great news; as a sometime peer-to-peer researcher myself, I'm eager to see nonlinear peer-to-peer reviewing technology adopted.

I understand that as future work WMSCI is developing an oblivious algorithm for ad-hoc low-power reviewing.

Computer Science Research Trends

2008-08-22T02:23:00.000-07:00

Computer science is a fast-changing field of research. Can we track some of its changes in the past? To answer this or any other question, experimental computer science researchers have essentially three options:

Write a program.
Search the Internet.
Write a program that searches the Internet.

We'll take the last approach in this post. We can pick a term and look at term frequency, that is the fraction of abstracts containing the term, as a function of time. Conveniently the ACM Digital Library and Guide have an advanced search that we can exploit to analyze term frequency across decades of abstracts. We'll start with a generic term, performance:

Let me say now, in case it becomes unclear later, that all the data presented here is actually derived from the ACM's database; it is not made up. (As for the interpretation of the data, well, ...) Moving on, we can plot a couple other terms, which and that:

The naive reader will conclude that conjunctions went out of fashion in the mid 80's, and came back after the dot com crash. However, the inappropriately perspicacious reader realizes that this conclusion is subtly flawed, because the word "that" might be a pronoun, adjective, or adverb: not just a conjunction. Fortunately the astute reader realizes we can look at relative instead of absolute frequencies to help clean up the data set.

The next group of figures is normalized by the term performance. For example, the blue line in the first plot below shows freq(mobile) / freq(performance).
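
In code, the two quantities look roughly like this. The data in this post came from ACM search result counts; the sketch below instead assumes you already have a local list of (year, abstract) pairs, and the function names and sample abstracts are hypothetical.

```python
import re
from collections import defaultdict


def term_frequency(abstracts, term):
    """Fraction of abstracts containing `term`, per year.

    abstracts: iterable of (year, text) pairs.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, text in abstracts:
        total[year] += 1
        if term.lower() in re.findall(r"[a-z]+", text.lower()):
            hits[year] += 1
    return {year: hits[year] / total[year] for year in total}


def relative_frequency(abstracts, x, y):
    """freq(x) / freq(y) per year, skipping years where freq(y) is zero."""
    fx = term_frequency(abstracts, x)
    fy = term_frequency(abstracts, y)
    return {year: fx[year] / fy[year] for year in fx if fy[year] > 0}


# Four invented abstracts, just to exercise the functions.
sample = [
    (2000, "Mobile performance study."),
    (2000, "A performance model."),
    (2001, "Mobile networks."),
    (2001, "Mobile performance."),
]

if __name__ == "__main__":
    print(relative_frequency(sample, "mobile", "performance"))
    # {2000: 0.5, 2001: 2.0}
```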

Click on the image below for a larger version.

Now some "matchups", where x v. y = freq(x) / freq(y). Click on an image to enlarge.

distributed v. centralized
flat v. hierarchical
Berkeley v. Stanford
Us v. Them

They're winning, but we're gaining on Them.

Now our final, and perhaps most important, plot.

Good v. Evil

While computer science is solidly on the side of good, several abstracts in this search were disturbingly megalomaniacal, such as J. R. Landry's "Can computing professionals be the unintentional architects of evil information systems?", in ACM SIGMIS-CPR, 2008. The author "discusses how the technical rational paradigm supports the creation of systems that embody administrative evil" and aims to "determine if information systems can harm or be evil, the frequency of harm, and response to harm by designers and users."

We can be thankful that according to its abstract, the paper is only "a research-in-progress".

This post was originally an Outrageous Opinion at SIGCOMM'08. Here's a PDF of the original presentation.

Graphing English

2008-08-06T02:31:00.000-07:00

Given two words, can we connect them by a chain of synonyms? For example: minuscule — little — short — poor — wretched — ugly — frightful — tremendous — enormous. Try some:

[Note, the tool's offline right now due to a system reinstallation ... should come back shortly ... -- Dec 6, 2011]

In the graphs above, each node is a word and an edge connects each pair of displayed words that can be synonymous. The raw data on semantic relationships is from Princeton's WordNet project.
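
Finding a chain is a shortest-path search on that graph. Here is a minimal breadth-first search sketch; I'm not claiming this is how the tool above is implemented. The tiny graph is built only from the example chain at the top of this post plus one invented unconnected pair, not from WordNet itself.

```python
from collections import deque


def synonym_chain(graph, start, goal):
    """Return a shortest chain of synonyms from start to goal, or None.

    graph: {word: set of synonymous words}, assumed symmetric.
    """
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for neighbor in graph.get(word, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = word
            if neighbor == goal:
                # Walk the parent pointers back to the start.
                chain = [neighbor]
                while parent[chain[-1]] is not None:
                    chain.append(parent[chain[-1]])
                return chain[::-1]
            queue.append(neighbor)
    return None


def graph_from_chains(chains):
    """Build a symmetric graph from lists of consecutive synonyms."""
    graph = {}
    for chain in chains:
        for a, b in zip(chain, chain[1:]):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


example = graph_from_chains([
    ["minuscule", "little", "short", "poor", "wretched",
     "ugly", "frightful", "tremendous", "enormous"],
    ["abulia", "abulic"],  # invented pair, disconnected from the rest
])

if __name__ == "__main__":
    print(synonym_chain(example, "minuscule", "enormous"))
    print(synonym_chain(example, "minuscule", "abulia"))  # None
```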

Many word pairs are not linked. How large is the largest connected component—that is, the largest set of words such that any pair in the set can be connected by a chain of synonyms? An interesting graph theoretic question.

The answer is that the largest component has 25,105 words, or 17% of all words in the database. Meanwhile, the second largest component is over six hundred times smaller, with only 38 words.

Is this structure bizarre? Actually it's roughly what one would expect knowing random graph theory, which I'll now attempt to explain in one paragraph. A classic (Erdős–Rényi) random graph consists of n nodes, connected by a bunch of edges chosen uniform-randomly. Suppose the number of edges is such that on average, each node has d neighbors (for us, d synonyms). Imagine exploring a connected component by starting at one node and expanding outward: first looking at the nodes that are one step away from the starting point, then two steps away, and so on. What happens to the size of this frontier? If d < 1 then the frontier tends to shrink by a certain percentage in each step, so with high probability it dies out before the component gets very large. On the other hand, if d > 1, then the frontier tends to expand by a certain percentage in each step. Chances are pretty good that the frontier just gets bigger and bigger until the component includes a good fraction of all n nodes in the graph. That's the giant component. The second largest component must be very small, specifically O(log n) nodes, since otherwise it's likely to intersect with, and thus be absorbed by, the giant component.
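
You can watch this happen in a quick simulation. The sketch below draws roughly n·d/2 random edges (so the average degree is about d) and measures component sizes with union-find. It is an approximation of the classic model: edges are drawn with replacement, so a few duplicates and self-loops slip in. The function name and the choice of n are mine.

```python
import random
from collections import Counter


def component_sizes(n, d, seed=0):
    """Component sizes, largest first, of a random graph on n nodes
    with about n*d/2 uniformly random edges (average degree about d)."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(int(n * d / 2)):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:
            parent[a] = b

    sizes = Counter(find(v) for v in range(n))
    return sorted(sizes.values(), reverse=True)


if __name__ == "__main__":
    n = 20000
    for d in (0.5, 2.92):
        sizes = component_sizes(n, d)
        print(f"d={d}: largest={sizes[0] / n:.1%} of nodes, "
              f"second largest={sizes[1]} nodes")
```

With d below 1 the largest component should be a vanishing fraction of the graph; with d = 2.92 one component should swallow most of the nodes while the runner-up stays tiny.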

Our graph of English has d = 2.92 synonyms per word. But of course English is not a random graph—which, with the same d, would have a largest component of 93% of its nodes and a second largest of about 5 nodes. Considering that we get within an order of magnitude without modeling any of the structure of the language except the number d, this is not so bad. And WordNet doubtless does not perfectly represent English: for example, it's plausible that common words are better annotated with synonymy relations than, say, abulia or vituperation.
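
The 93% figure can be checked without simulating anything. For a large Erdős–Rényi graph with average degree d > 1, the fraction S of nodes in the giant component satisfies S = 1 - exp(-d·S), a standard result. A minimal sketch that solves it by fixed-point iteration (the function name and iteration count are my own choices):

```python
import math


def giant_component_fraction(d, iterations=200):
    """Solve S = 1 - exp(-d * S) by iterating from S = 1.

    For d > 1 this converges to the positive solution; for d < 1 it
    converges to 0, meaning there is no giant component.
    """
    s = 1.0
    for _ in range(iterations):
        s = 1.0 - math.exp(-d * s)
    return s


if __name__ == "__main__":
    print(f"{giant_component_fraction(2.92):.3f}")  # approximately 0.93
```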

I was led to think of this topic during conversation with some folks from UCL and CMU. But others have built a graph out of WordNet too (after all, it is called WordNet). A paper by Kamps, Marx, Mokken, and de Rijke, Using WordNet to Measure Semantic Orientations of Adjectives, scores words based on their relative distance to "good" vs. "bad".