Blue Bull Data Research

This year's SIGMOD/PODS was quite exciting. With over 800 students, researchers, and members of industry in attendance, the DB community is more vibrant than ever.

The highlight for me was a new event at this year's PODS, a panel discussion on future trends in database research. Many of the speakers discussed specialized forms of data processing where creative ideas are needed. A particularly impassioned plea came from Andrew McCallum, who argued for a tighter coupling between the database, machine learning, and data mining communities. This sentiment was echoed by a number of the panelists, who suggested that database researchers had dropped the ball on the challenge of "Big Data", allowing it to be defined almost exclusively in terms of data-mining and systems challenges. Social graph databases, astronomy (e.g., Skyserver), and similar projects were put forth as areas where peta-scale (or larger) query processing is critical.

Joe Hellerstein made some interesting points that I saw echoed throughout the remainder of the conference. He mentioned the almost obvious parallel between communication and storage, namely that storage is a form of messaging to the future. The primary distinction lies in who is responsible for what. In storage, the sender is responsible for doing the work to put the message/signal someplace where the recipient can easily retrieve it. Conversely, in communication, the recipient is responsible for listening and waiting for the message/signal to show itself. Parallels exist throughout the DB community, with query processing vs. stream processing being the obvious example. Papers like the latest PIQL offering picked up the same idea, suggesting the need to revisit the tradeoffs between pre-computation and online query processing.
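To make the storage-versus-communication split concrete, here is a minimal Python sketch of the same running total computed two ways. The class names (`EagerCounter`, `LazyCounter`) and the toy event stream are my own illustration, not anything presented at the panel or taken from PIQL. The first does its work at write time, the second defers all work to read time.

```python
class EagerCounter:
    """Storage-style: the writer does the work, so reads are cheap."""

    def __init__(self):
        self.totals = {}  # precomputed answers, maintained on every write

    def write(self, key, amount):
        self.totals[key] = self.totals.get(key, 0) + amount

    def read(self, key):
        return self.totals.get(key, 0)  # nothing left for the reader to do


class LazyCounter:
    """Communication-style: the writer just emits; the reader does the work."""

    def __init__(self):
        self.log = []  # raw events, appended in arrival order

    def write(self, key, amount):
        self.log.append((key, amount))

    def read(self, key):
        # The reader scans everything that was ever sent.
        return sum(amount for k, amount in self.log if k == key)


# Hypothetical event stream, fed to both designs.
events = [("a", 3), ("b", 5), ("a", 4)]
eager, lazy = EagerCounter(), LazyCounter()
for key, amount in events:
    eager.write(key, amount)
    lazy.write(key, amount)

# Both designs agree on the answer; they differ only in who pays for it.
assert eager.read("a") == lazy.read("a")
```

Both counters return the same answers. The choice between them is purely about whether writes or reads carry the cost, which is roughly the pre-computation vs. online processing tradeoff.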

A third theme that arose both at the panel discussion and throughout the conference was consistency. Between Joe's CALM conjecture, an excellent tutorial by Phil Bernstein and Sudipto Das, and other chatter throughout the conference, it seems clear that consistency and the CAP theorem are once again rearing their ugly heads. The key takeaway from all of this seems to be that each application has different consistency requirements, and the underlying platform needs to establish a clear, understandable contract with the programmer about what "consistency" means. Through DSLs and other platforms, we are once again talking about how to figure out what kind of consistency an application requires.
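As a rough illustration of why requirements differ, here is a small Python sketch in the spirit of CALM (Consistency As Logical Monotonicity). The idea, loosely stated, is that logic which only ever accumulates facts reaches the same answer regardless of message order, and so needs no coordination. The function names and the three toy updates are hypothetical, and this is a simplified reading of the conjecture rather than anything shown at the conference.

```python
from itertools import permutations

# Three hypothetical updates that may reach a replica in any order.
updates = [{"x"}, {"y"}, {"z"}]


def monotone_merge(deliveries):
    """Grow-only set: state only ever gains elements (set union)."""
    state = set()
    for update in deliveries:
        state |= update
    return state


def last_writer_wins(deliveries):
    """Non-monotone: each delivery overwrites whatever came before."""
    state = set()
    for update in deliveries:
        state = update
    return state


# Collect the distinct final states over every possible delivery order.
monotone_results = {frozenset(monotone_merge(p)) for p in permutations(updates)}
overwrite_results = {frozenset(last_writer_wins(p)) for p in permutations(updates)}

print(len(monotone_results), "distinct outcome(s) for the grow-only set")
print(len(overwrite_results), "distinct outcome(s) for last-writer-wins")
```

The grow-only set converges under every ordering, so replicas can apply updates as they arrive. The overwrite version depends on delivery order, so an application built on it needs the platform to promise something stronger.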


Hardware continues to be a growing trend, and over the past few years, I've been seeing a shift towards (Eric Sedlar's prediction of) specialized hardware for databases. An interesting point in this space is a measurement paper out of EPFL, which observes that instruction cache misses are a major bottleneck in query processing. Pinar's suggested solution is to devote individual cores to specific tasks that fit entirely into an instruction cache.

I've been seeing a lot more effort on crowdsourcing. In particular, the field seems to be shifting towards more specialized forms of crowdsourcing -- focusing the crowdsourcing efforts on domain specialists and data-mining the results of such queries. One paper on crowdmining discussed efforts to infer causal connections and trends in data by querying users for instantiations of these trends.

And that's all... pretty jazzy if I do say so.


(The SIGMOD Jazz Concert)


Friday, July 12, 2013

SIGMOD 2013 Review