Big data languages: the reason for the tests
By joe
In a number of recent articles, I’ve seen or read claims that “Python is displacing R”, and other similar things. Something about this intrigued me, as I had heard many years ago that “Python was displacing Perl”. Only, it wasn’t. And others are questioning the supplantation premise quite strongly. There seems to be little actual evidence for it: mostly hyperbole, guesses, and, dare I say, wishful thinking. This seems to be the modus operandi for some Python advocates, and their latest object of attention is R. I personally find this form of advocacy rather displeasing, if not outright annoying.

First off, R is a fantastic statistical language, and I will freely admit I’ve forgotten more about it than I care to admit. I haven’t used it in years, but I am quite aware of people using it on clusters with MPI for very large scale processing of massive biological data sets. Since R is free, so widely used, and so powerful, we are looking to include it on many of our analytical appliances as an additional package.

Second, “big data” is something of a terra incognita for many people … they’ve heard the term, but aren’t sure what it is, or if and how it applies to them. One person’s massive data set might be minuscule next to another person’s small computing project. Doug Eadline has a great discussion of big data analytics sizes … the results were quite surprising to me, and they provide strong motivation for a Limulus-like solution, which is why we are reselling it. We believe in the data, and customers need to analyze it quickly. What “big” means may be somewhat ill defined, but the analytics require powerful tools.

And this is where the comparison comes in. I took a relatively small computational problem (a large vector of data) and tried a few ways of performing a very simple analysis on it. As expected, the statically compiled code was the fastest. The naive loop-based interpreted codes performed worst. The vectorized/data-language systems (Perl PDL, native Julia, Python NumPy) provided better expressiveness and power, though Python seems to lag them all in performance, often significantly, even with the high performance options in place. This does not bode well for a language with aspirations to be a big data platform.

The surprise to me was Perl PDL. I know parallel Perl is non-trivial; I use it a bit in our work, and it’s not fun to deal with. Pthreads isn’t evil, it just wasn’t written with humans in mind. Python suffers the same fate there. In C, I can use OpenMP/OpenACC with compiler hints, much like the Julia macro. In kdb+, I can use peach for parallel looping, though I am not yet sure how to do a sum reduction with it. But Perl PDL was something I’d never used before. That will likely change now that I’ve played with it.

kdb+ has fantastic support for time series constructs and computing, making it easy to ingest and compute on data sets. It has an integrated database, which none of the others do.

Julia just worked, with very little effort on my part. This is the 0.2.0 code base. It was fast, easy to debug and test, and quite expressive without requiring additional complications.

Python, on the other hand … I tried Python 2.7, Python 3.3.1, and Cython, and finally NumPy came close to reasonable performance. The amount of effort and googling it took, relative to most of the others (save kdb+), was large for me. I had to look up many different things to figure out the “right” way to do this. (Sketches of the sort of code involved appear below.)
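To make the shape of the test concrete, here is a minimal sketch in Python of the kind of naive-loop-versus-vectorized comparison described above. This is not the author’s actual benchmark code; the vector size, the random data, and the use of `time.perf_counter` are my own illustrative choices.

```python
import time
import numpy as np

N = 10**7                        # illustrative size, not the actual problem size
data = np.random.rand(N)         # large vector of doubles
as_list = data.tolist()          # plain Python list for the naive version

# Naive interpreted loop: one bytecode dispatch per element
t0 = time.perf_counter()
total = 0.0
for x in as_list:
    total += x
t1 = time.perf_counter()
print("naive loop:       {:8.4f} s  sum={:.6f}".format(t1 - t0, total))

# Vectorized NumPy sum: the loop runs in compiled code
t0 = time.perf_counter()
total = data.sum()
t1 = time.perf_counter()
print("numpy vectorized: {:8.4f} s  sum={:.6f}".format(t1 - t0, total))
```

On typical hardware the vectorized version wins by a wide margin, which matches the general pattern described in the post: the interpreted loop pays per-element dispatch costs, while the vectorized call amortizes them away.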
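The parallel story touched on above (OpenMP reduction hints in C, the Julia macro, peach in kdb+) has no equally lightweight analogue in Python, which is part of the complaint. A hedged sketch of a process-based sum reduction follows; the worker count, chunking via `np.array_split`, and the `partial_sum` helper are all my own assumptions, not anything from the original tests.

```python
import numpy as np
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker reduces its own slice; the partial results
    # are combined in the parent process below.
    return float(np.sum(chunk))

if __name__ == "__main__":
    N = 10**7
    workers = 4                              # assumed worker count
    data = np.random.rand(N)
    chunks = np.array_split(data, workers)   # one slice per worker
    with Pool(workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print("parallel sum:", total)
```

Note that each chunk is copied (pickled) into its worker process, and for a reduction this cheap the copying usually swamps the arithmetic, which illustrates why parallelism in interpreted languages can feel unrewarding next to a one-line compiler pragma.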
I don’t know how it compares to R performance, but for data ingestion, it’s hard to beat kdb+ and Perl. For performance, Julia, kdb+, and Perl PDL are excellent, based on my limited testing. Basically, I don’t see a strong reason to choose Python, in any of its variants, over one (or more) of the others.

Actually, the most intriguing thing to me is IJulia. The notebook paradigm is reminiscent of the QTMaxima or Mathematica notebooks (actually, of the MathCad notebooks from long ago). In this regard, Python shows its best capabilities as a UI glue language; it does a superb job of tying things together.

I do think it best for programmer productivity to use the languages one is most comfortable with, but at the same time, not to get trapped into using just one language on dubious rationale or evidence for its “ascendancy”, especially considering that the same community has used these arguments before, with other languages, when they were at their peak. So, dear reader, please take language advocacy masquerading as articles on trends for what it is. Use what you like, but be aware of the hype, and know that this isn’t the first time this has happened.

As far as Julia goes, it works very nicely. Once it gets static compilation, it should become very easy to deploy applications. It’s in its very early stages, but it looks to be one of the best of the lot right now: very well designed for a data analyst.