Below you will find pages that utilize the taxonomy term “Big Data”
Posts
#HPC in all the things
I read this announcement this morning. Our friends at Facebook are releasing their reduced precision server side convolution and GEMM operations.
Many years ago, I tried to convince people that HPC moves both down market, into lower cost hardware, as well as more widely into more software toolchains. Basically, the decades of experience building very high performance applications and systems will have value downstream for many users over time.
GEMM (general matrix multiply) is a generalized approach to a matrix multiply, which has been well optimized for HPC applications in various scientific libraries over time.
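For readers who have not met it, the GEMM operation in BLAS computes C ← αAB + βC. The following is only a minimal pure-Python sketch of that definition, not Facebook's reduced precision implementation; the function name `gemm` and the sample matrices are made up for illustration. Optimized libraries block, vectorize, and thread this loop nest heavily.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for row-major lists of lists.

    Illustrative only: a is m x k, b is k x n, c is m x n.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape m x n")

    out = []
    for i in range(m):
        out_row = []
        for j in range(n):
            # Dot product of row i of a with column j of b.
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out_row.append(alpha * acc + beta * c[i][j])
        out.append(out_row)
    return out


# Hypothetical sample data.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, a, b, 0.5, c)
```

Setting `alpha = 1` and `beta = 0` reduces this to a plain matrix multiply, which is why GEMM is the workhorse underneath so many numerical codes.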
Posts
Looking forward to #SC18 next week and a discussion of all things #HPC
I’m attending SC18 next week. It’s been 3 years since I last attended (2015). Then we (@scalableinfo) had a large booth, lots of traffic, and showed off some of the first commercial NVMe high performance storage systems running BeeGFS over 100GbE.
I am looking forward to talking with as many people as I can, to get their perspectives on things. To see what they are thinking, hear what they are doing, and in which direction they are going.
Posts
On technology zealotry
I’ve encountered this in my career, at many places. Sadly, early in my career, I participated in some of this. You are a zealot for a particular form of tech if you can see it do no wrong, and decry reports of issues or problems as “attacks”. You are a zealot against a particular form of tech if you cannot see it as a potentially useful and valuable portion of a solution stack, and (often gleefully) amplify reports of issues or problems.
Posts
Interesting post on mixed integer programming for diets ... that has some hilarious output
I am a fan of the Julia language. Tremendously powerful analytical environment, compiled, high performance, easy to understand and use, strongly typed, … there’s a long list of reasons why I like it. If you are doing analytics, modeling, computation in other languages, it is definitely worth a look. Think of it as python, compiled, with multiple dispatch and strong typing … and no indent-as-structure problem. My Julia fanboi-ism aside, there was an interesting blog post about using JuMP, a linear programming environment for Julia.
Posts
Aria2c for the win!
I’ve not heard of aria2c before today. Sort of a super wget as far as I could tell. Does parallel transfers to reduce data motion time, if possible. So I pulled it down, built it. I have some large data sets to move. And a nice storage area for them. Ok. Fire it up to pull down a 2GB file. Much faster than wget on the same system over the same network.
Posts
M&A and business things
First up, Tegile was acquired by Western Digital (WDC). This is in part due to WDC’s desire to be a one stop shop vertically integrated supplier for storage parts, systems, etc. This is how all of the storage parts OEMs needed to move, though Seagate failed to execute this correctly, selling off their array business in part to Cray. Toshiba … well … they have some existential challenges right now, and are about to sell off their profitable flash and memory systems business, if they can just get everyone to agree … This comes from the fact that spinning disk, while a venerable technology, has been effectively completely commoditized.
Posts
Finally got to use MCE::* in a project
There is a set of modules in the Perl universe that I’ve been looking for an excuse to use for a while. They are the MCE set of modules, which purportedly enable easy concurrency and parallelism, exploiting many core CPUs, and a number of techniques. Sure enough, I had a task to handle recently that required this. I looked at many alternatives, and played with a few, including Parallel::Queue. I thought of writing my own with IPC::Run as I was already using it in the project, but I didn’t want to lose focus on the mission, and re-invent a wheel that already existed elsewhere.
Posts
Cray "acquires" ClusterStor business unit from Seagate
Information at this link. It is being called a “strategic transaction”, though it likely came about vis-a-vis Seagate doing some profound and deep thinking over what business it was in. Seagate has been weathering a storm, and has been working on re-orgs to deal with a declining disk market. They acquired ClusterStor as part of a preceding transaction of Xyratex. Xyratex was the basis for the Cray storage platforms (post Engenio).
Posts
What reduces risk ... a great engineering and support team, or a brand name ?
I’ve written about approved vendors and the “one throat to choke” concept in the past. The short take from my vantage point as a small, not well known, but highly differentiated builder of high performance storage and computing systems … was that this brand specific focus was going to remove real differentiated solutions from market, while simultaneously lowering the quality and support of products in market. The concept of brand and marketing of a brand is about erecting barriers to market entry against the smaller folk who might have something of interest, and the larger folk who might come in with a different ecosystem.
Posts
On hackerrank and Julia
My new day job has me developing considerably less code than my previous endeavor, so I like to work on problems to keep these particular muscles in steady use. Happily, I get to do more analytics than ever before, so this at least is some compensation for the lower amount of coding. When I work on coding for myself, I’ll play with problems from my research days, or small throw-away ones, like on Hackerrank.
Posts
I always love these breathless stories of great speed, and how VCs love them ...
Though, when I look at the “great speed”, it is often on par with or less than Scalable Informatics sustained years before. From 2013 SC13 show, on the show floor, after blasting through a POC at unheard of speed, and setting long standing records in the STAC-M3 benchmarks …
Article in question is in the Register. Some of the speeds and feeds:
* 200 microsecs latency
* 45GBps read bandwidth
* 15GBps write bandwidth
* 7 million IOPS

But then … a fibre connection.
Posts
What is old, is new again
Way back in the pre-history of the internet (really DARPA-net/BITNET days), while dinosaur programming languages frolicked freely on servers with “modern” programming systems and data sets, there was a push to go from statically linking programs to a more modular dynamic linking. The thought process was that it would save precious memory, not having many copies of libc statically linked in to binaries. It would reduce file sizes, as most of your code would be in libraries.
Posts
Brings a smile to my face ... #BioIT #HPC accelerator
Way way back in the early aughts (2000’s), we had built a set of designs for an accelerator system to speed up things like BLAST, HMMer, and other codes. We were told that no one would buy such things, as the software layer was good enough and people didn’t want black boxes. This was part of an overall accelerator strategy that we had put together at the time, and were seeking to raise capital to build.
Posts
#Perl on the rise for #DevOps
Note: I do quite a bit of development in Perl, and have my own biases, so please do take this into consideration. It is one of many languages I use, but it is by and large, my current go-to language. I’ll discuss below. According to TIOBE (yeah, I know), Perl usage is on the rise. The linked article posits that this is for DevOps reasons. The author of the article works at a company that makes money from Perl and Python … they build (actually very good) tools.
Posts
Fully RAMdisk booted CentOS 7.2 based SIOS image for #HPC , #bigdata , #storage etc.
This is something we’ve been working on for a while … a completely clean, as baseline a distro as possible, version of our SIOS RAMdisk image using CentOS (and by extension, Red Hat … just need to point to those repositories). And it’s available to pull down and use as you wish from our download site. Ok, so what does it do? Simple. It boots an entire OS, into RAM. No disks to manage and worry over.
Posts
@scalableinfo 60 bay Unison with these: 3.6PB raw per 4U box
Color me impressed … Seagate and their 60TB 3.5inch SAS drive. Yes, the 60 bay Unison units can handle this. That would be 3.6PB per 4U unit. 10x 4U per 48U rack. 36PB raw per rack. 100PB in 3 racks, 30 racks for an exabyte (EB). The issue would be the storage bandwidth wall height. Doing the math, 60TB/(1GB/s) -> 6 x 10^4 seconds to empty/fill such a single unit. We can drive these about 50GB/s in a box, so a single box would be 3600TB/(50GB/s) or 7.
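The bandwidth wall arithmetic above can be reproduced with a few lines of Python. This is just a sketch using the figures quoted in the post and assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes); the names are made up for illustration.

```python
# Decimal storage units (an assumption; the post does not say TB vs TiB).
GB = 10**9
TB = 10**12
PB = 10**15


def fill_time_seconds(capacity_bytes, bandwidth_bytes_per_s):
    """Time to stream the full capacity once at a sustained bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


drive_capacity = 60 * TB      # one 60TB drive
drives_per_box = 60           # 60 bay chassis
boxes_per_rack = 10           # 10x 4U per rack

box_capacity = drive_capacity * drives_per_box
rack_capacity = box_capacity * boxes_per_rack

# Single drive at 1 GB/s, and a full box at 50 GB/s.
drive_fill = fill_time_seconds(drive_capacity, 1 * GB)
box_fill = fill_time_seconds(box_capacity, 50 * GB)

print(f"box: {box_capacity / PB:.1f} PB, rack: {rack_capacity / PB:.0f} PB")
print(f"drive fill: {drive_fill:.0f} s ({drive_fill / 3600:.1f} h)")
print(f"box fill:   {box_fill:.0f} s ({box_fill / 3600:.1f} h)")
```

The point of the exercise is that capacity grows faster than bandwidth: even at full rate, emptying or filling one of these takes most of a day.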
Posts
Raw Unapologetic Firepower: kdb+ from @Kx
While the day job builds (hyperconverged) appliances for big data analytics and storage, our partners build the tools that enable users to work easily with astounding quantities of data, and do so very rapidly, and without a great deal of code. I’ve always been amazed at the raw power in this tool. Think of a concise functional/vector language, coupled tightly to a SQL database. It’s not quite an exact description; have a look at Kx’s website for a more accurate one.
Posts
Systemd and non-desktop scenarios
So we’ve been using Debian 8 as the basis of our SIOS v2 system. Debian has a number of very strong features that make it a fantastic basis for developing a platform … for one, it doesn’t have significant negative baggage/technical debt associated with poor design decisions early on in the development of the system as others do. But it has systemd. I’ve been generally non-committal about systemd, as it seemed like it should improve some things, at a fairly minor cost in additional complexity.
Posts
Talk from #Kxcon2016 on #HPC #Storage for #BigData analytics is up
See here, which was largely about how to architect high performance analytics platforms, and a specific shout out to our Forte NVMe flash unit, which is currently available in volume starting at $1 USD/GB. Some of the more interesting results from our testing:
* 24GB/s bandwidth largely insensitive to block size.
* 5+ Million IOPs random IO (5+MIOPs) sensitive to block size.
* 4k random read (100%) were well north of 5M IOPs.
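As a rough sanity check on how those numbers relate, IOPs multiplied by block size gives the implied bandwidth. The sketch below assumes "4k" means 4096 bytes and uses decimal GB; it is back-of-envelope arithmetic, not a measured result, and the function name is made up.

```python
def iops_to_bandwidth_gb_s(iops, block_size_bytes):
    """Implied bandwidth in GB/s (decimal) for a given IOPs rate and block size."""
    return iops * block_size_bytes / 1e9


# 5 million 4k random reads per second (block size of 4096 bytes assumed).
bw_4k = iops_to_bandwidth_gb_s(5_000_000, 4096)
print(f"{bw_4k:.2f} GB/s")
```

Under those assumptions, 5M IOPs at 4k works out to roughly 20 GB/s, which sits in the same neighborhood as the quoted 24GB/s streaming figure.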
Posts
Not even breaking a sweat: 10GB/s write to single node Forte unit over 100Gb net #realhyperconverged #HPC #storage
TL;DR version: 10GB/s write, 10GB/s read in a single 2U unit over 100Gb network to a backing file system. This is tremendous. The system and clients are using our default tuning/config. Real hyperconvergence requires hardware that can move bits to/from storage/networking very quickly. This is that. These units are available. Now. In volume. And are very reasonably priced (starting at $1USD/GB). Contact us for more details. This is with a file system …
Posts
Massive unapologetic storage firepower part 4: On the test track with a Forte unit ... vaaaaROOOOOOMMMMMMM!!!!!
I am trying to help people conceptualize the experience. Here is a video depicting very fast, very powerful cars and their sound signatures.
This is a good start. Take one of those awesome machines, and turn off half the engine. So it is literally running with 1/2 of its power turned off. Remember this. There will be a quiz. As we flippantly noted in the video, this is face-melting performance. Had I any hair left, it would have been blown way back.
Posts
When infinite resources aren't, and why software assumes they are infinite
We’ve got customers with very large resource machines. And software that sees all those resources and goes “gimme!!!!”. So people run. And then more people use it. And more runs. Until the resources are exhausted. And hilarity (of the bad kind) ensues. These are firedrills. I get an open ticket that “there must be something wrong with the hardware”, when I see all the messages in console logs being pulled in from ICL saying “zOMG I am out of ram ….
Posts
Nutanix files for IPO
Short story here. I am not going to pore over their S-1 form to find interesting tidbits; others will do that, and are paid to do so. They are the first of several, though I had thought that Dell would acquire them before they hit IPO. I am guessing that the combination of the price for them, plus the EMC acquisition stopped this conversation. So now Nutanix is going to IPO.
Posts
#Perl6 compiler betas are ready
Ok … I am … well … blown away. I had thought Perl6 would be the Duke Nukem Forever of programming languages. Indeed, it has been in active development for more than a decade. But you can download compilers (yes, you heard me right, compilers) for it now. You might say “why perl” or “why perl6” or “why now, because we have #insert(language_x) and it’s wonderful”. Good question, I wasn’t sure why it was relevant, until I started reading some of the code.
Posts
Testing a new @scalableinfo Unison #Ceph appliance node for #hpc #storage
Simple test case, no file system … using raw devices, what can I push out to all 60 drives in 128k chunks. Actually this is part of our burn-in test series, I am looking for failures/performance anomalies.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   1  95   5   0   0| 513M    0 | 480B    0 |   0     0 |  10k   20k
  4   2  94   0   0   0|   0     0 | 480B    0 |   0     0 |5238   721
  0   2  98   0   0   0|   0     0 | 480B    0 |   0     0 |4913   352
  0   2  98   0   0   0|   0     0 | 570B   90B|   0     0 |4966   613
  0   2  98   0   0   0|   0     0 | 480B    0 |   0     0 |4912   413
  0   2  98   0   0   0|   0     0 | 584B   92B|   0     0 |4965   334
  0   2  98   0   0   0|   0     0 | 480B    0 |   0     0 |4914   306
  0   2  98   0   0   0|   0     0 | 636B  147B|   0     0 |4969   483
  0   2  98   0   0   0|   0     0 | 570B    0 |   0     0 |4915   377
  8   8  50  32   0   2|7520k 8382M| 578B    0 |   0     0 |  76k  215k
  9   7  30  52   0   3|8332k   12G| 960B  132B|   0     0 | 109k  279k
 10   5  29  53   0   2|4136k   12G| 240B    0 |   0     0 | 109k  277k
 12   6  29  51   0   2|4208k   12G| 240B    0 |   0     0 | 108k  280k
 11   6  31  50   0   2|2244k   12G| 330B   90B|   0     0 | 109k  281k
 11   6  30  50   0   3|2272k   13G| 240B    0 |   0     0 | 110k  281k

Writes around 12.
Posts
Video interview: face melting performance in #hpc #nvme #storage @scalableinfo
Oh no … we didn’t say “face melting” … did we? Oh. Yes. We. Did. The interview is here at the always wonderful InsideHPC.com. You can see the video itself here on YouTube, but read Rich’s transcript. I was losing my voice, and he captured all of the interview in text. Take home messages: Insane IO/Networking/processing performance, small footprint, tiny price, available for orders now.
Posts
There are no silver bullets, 2015 edition
In Feb 2013, I opined (with some measure of disgust) that people were looking at various software packages as silver bullets, these magical bits of a stack which could suddenly transform massive steaming piles of bits (big … uh … “data” ?) into golden nuggets of actionable data. Many of the “solutions” marketed these days are exactly like that … “add our magic bean software to your pipeline and you will gain insight faster.
Posts
Shiny #HPC #storage things at #SC15
Assuming everything goes as planned (HA!) we should have a number of very cool things at SC15.
* 100Gb [Unison storage system with BeeGFS](https://scalableinformatics.com/unison)
* 100Gb [Unison Ceph](https://scalableinformatics.com/unison) system
* 100Gb connection to a partner/customer booth
* Forte

100Gb is awesome. The first time I ran an iperf bidirectional test, saw 20GB/s … it blew me away. 40/56GbE is old hat now, and 10GbE is in the rapidly receding past.
Posts
Cat peeking out of bag: Schedule of presentations and talks in our booth for SC15 is up
I mentioned previously that we have some new (shiny) things … and it looks like you’ll be able to hear about them at my talk. See the schedule for timing information. This said, please note that we have a terrific line up of people giving talks:
* Fintan Quill from Kx on kdb+ … which is an awesome market leading Big Data Time Series analytics and database tool that runs absolutely balls-out insanely fast on our architecture
* Christian Mohrbacher from Thinkparq on BeeGFS … the primary parallel file system we are leveraging for Unison parallel file system appliances
* Mark Nelson from Inktank/Red Hat on Ceph … the reliable block and object storage system that we’ve built into our Unison Object/Block Storage appliance
* Doug Eadline from Basement Supercomputing on Hadoop, and likely showing a Limulus deskside Hadoop appliance
* Phil Mucci from Minimal Metrics on optimization problems for systems and code.
Posts
M&A: EMC gobbled by Dell
Need to think how this will play out. The Register’s take is here. It seems that this will solve the “shareholder value” problem indicated by Elliott Management (e.g. they wanted more return on their investment). As part of increasing the return to shareholders, EMC had been in a cost cutting mode. Layoffs have been in process, and products have likely been trimmed or refocused. Once this goes through (assuming regulators won’t protest), Dell will have
Posts
As the benchmark cooks
We are involved in a fairly large benchmark for a potential customer. I won’t go into many specifics, though I should note that lots of our Unison units are involved. Current architecture has 5 storage nodes (6th was temporarily removed to handle a customer issue). Each Unison node has a pair of 56GbE NICs, as well as our appliance OS, and bunches of other goodness (quite a bit of flash). Total capacity for test is of order 200TB of flash.
Posts
Been there, done that, even have a patent on it
I just saw this about doing a divide and conquer approach to massive scale genomics calculation. While not specific to the code in question, it looked familiar. Yeah, I think I’ve seen something like this before … and wrote the code to do it. It was called SGI GenomeCluster. It was original and innovative at the time, hiding the massively parallel nature of the computation behind a comfortable interface that end users already knew.
Posts
On storage unicorns and their likely survival or implosion
The Register has a great article on storage unicorns. Unicorns are not necessarily mythical creatures in this context, but very high valuation companies that appear to defy “standard” valuation norms, and hold onto their private status longer than those in the past. That is, they aren’t in a rush to IPO or get acquired.
The article goes on to analyze the “storage” unicorns, those in the “storage” field. They admix storage, nosql, hyperconverged, and storage as a service.
Posts
Imitation and repetition is a sincere form of flattery
A few years ago, we demonstrated some truly awesome capability in single racks and on single machines. We had one of our units (now at a customer site), specifically the unit that set all those STAC M3 records, showing this:
and a rack of our units (now providing high performance cloud service at a customer site)
for 8k random reads across 0.25 PB of storage on a very fast 40GbE backbone.
Posts
M&A [RUMOR]: Cisco grabs Nutanix
[update] TL;DR this appears to be rumor/speculation. One would think that such an acquisition would be prominent on Nutanix’s web site. It’s April Fools, in May. /sigh
Huge in the hyperconverged space (which, not so curiously, is where the day job is), and it’s setting up the battle lines between the major software/hardware players. Cisco was already the number 5 hardware vendor, and was bragging about “beating the white boxes”. The last may be more wishful thinking than reality.
Posts
Booth at BioIT World 15 in Boston
Should be fun, we will have a booth (#461) on the side near the thoroughfare for the talks. Our HPC on Wall Street booth looked like this:
[ ](/images/HPConWS-booth-spring2015.jpg)
The display on the monitor is from our FastPath Cadence machine, and is part of the performance dashboard, built upon InfluxDB, Grafana, sios-metrics, and influxdbcli. Here is a blown up view, note the vertical axes for BW (GB/s) and IOPs.
[ ](/images/cadence-dash-spring2015.jpg)
Posts
The world’s fastest hyper-converged appliance is faster and more affordable than ever
This is a very exciting hyper-converged system, representing our next generation of time series and big data analytical systems. Tremendous internal bandwidths coupled with massive internal parallelism, and minimal latency design on networks. This unit has been designed to focus upon delivering the maximal performance possible in as minimal a footprint … both rack based and cost wise … as possible. You can use these as independent stand alone units, or integrate them into a larger FastPath Unison system. We have our software stack (SIOS) integrated onto each unit, and include our builds of Python + Pandas/SciPy/NumPy, R, and Perl.
Posts
Interesting Q1 so far for day job
Our Q1 is usually quiet, fairly low key. Not this one. Looks like lots of pent up demand. We are deep into record territory, running 200+% of normal, with possibility of more. Another new wrinkle is that our small investment round is mostly complete. This is new territory for us, and you may have noticed I’d backed off posting intensity over the last half year or so while this was going on.
Posts
When the revolution hits in force ...
Our machines will be there, helping power the genomics pipelines to tremendous performance. Performance is an enabling feature. Without it you cannot even begin to hope to perform massive scale analytics. With it, you can dream impossible dreams. This article came out talking about a massive performance analytics pipeline at Nationwide Children’s Hospital in Ohio. This pipeline runs on a cluster attached to Scalable Informatics FastPath Unison storage. This is a very dense, very fast system, interconnected with Mellanox FDR Infiniband, Chelsio 40GbE, and leveraging BeeGFS from thinkparq.
Posts
Inventory reduction @scalableinfo
It’s that time of year, when the inventory fairies come out and begin their counting. Math isn’t hard, but the day job would like a faster and easier count this year. So, the day job is working on selling off existing inventory. We have 4 units ready to go out the door to anyone in need of 70-144TB usable storage at 5-6 GB/s per unit. Specs are as follows:
* 16-24 processor cores
* 128 GB RAM
* 48x {2,3,4} TB top mount drives
* 4x rear mount SSDs (OS/metadata cache)
* Scalable OS (Debian Wheezy based Linux OS)
* 3 year warranty

As this is inventory reduction, the more inventory you take, the happier we are (and the less work that the inventory fairies have to do).
Posts
Starting to come around to the idea that swap in any form, is evil
Here’s the basic theory behind swap space. Memory is expensive, disk is cheap. Only use the faster memory for active things, and aggressively swap out the less used things. This provides a virtual address space larger than physical/logical memory. Great, right? No. Here’s why.
Swap makes the assumption that you can always write/read to persistent memory (disk/swap). It never assumes persistent memory could have a failure. Hence, if some amount of paged data on disk suddenly disappeared, well … Put another way, it increases your failure likelihood, by involving components with higher probability of failure into a pathway which assumes no failure.
Posts
30TB flash disk, Parallel File System, massive network connectivity
This will be fun to watch run …
Scalable Informatics FastPath Unison for the win!
Posts
SC14 T minus 6 and counting
Scalable’s booth is #3053. We’ll have some good stuff, demos, talks, and people there. And coffee. Gotta have the coffee. More soon, come by and visit us!
Posts
massive unapologetic firepower part 2 ... the dashboard ...
For Scalable Informatics Unison product. The whole system:
[ ](/images/dash-2.png)
Watching writes go by:
[ ](/images/dash-3.png)
Note the sustained 40+ GB/s. This is a single rack sinking this data, and no SSDs in the bulk data storage path. This dashboard is part of the day job’s FastPath product.
Posts
Updated boot tech in Scalable OS (SIOS)
This has been an itch we’ve been working on scratching a few different ways, and it’s very much related to forgoing distro based installers. Ok, first the back story. One of the things that has always annoyed me about installing systems has been the fundamental fragility of the OS drive. It doesn’t matter if it’s RAIDed in hardware/software. It’s a pathway that can fail. And when it fails, all hell breaks loose.
Posts
Soon ... 12g goodness in new chassis
This is one of our engineering prototypes that we had to clear space for. A couple of new features I’ll talk about soon, but you should know that these are 12g SAS machines (will do 6g SATA of course as well).
Front of unit:
[ ](/images/IMG_2330.JPG)
Note the new logo/hand bar. The rails are also brand new, and are set to enable easy slide in/out even with 100+ lbs of disk in them.
Posts
Massive, unapologetic, firepower: 2TB write in 73 seconds
A 1.2PB single mount point Scalable Informatics Unison system, running an MPI job (io-bm) that just dumps data as fast as the little Infiniband FDR network will allow. Our test case. Write 2TB (2x overall system memory) to disk, across 48 procs. No SSDs in the primary storage. This is just spinning rust, in a single rack. This is performance pr0n, though safe for work.
usn-01:/mnt/fhgfs/test # df -H /mnt/fhgfs/
Filesystem Size Used Avail Use% Mounted on
fhgfs_nodev 1.
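For a sense of scale, the headline figure (2TB written in 73 seconds across 48 procs) converts to sustained bandwidth as follows. This is only arithmetic on the numbers quoted in the post, assuming decimal units (1 TB = 10^12 bytes); the variable names are made up.

```python
# Figures from the post title and text; decimal units assumed.
bytes_written = 2 * 10**12    # 2 TB
elapsed_seconds = 73
num_procs = 48

aggregate_gb_s = bytes_written / elapsed_seconds / 1e9
per_proc_gb_s = aggregate_gb_s / num_procs

print(f"aggregate: {aggregate_gb_s:.1f} GB/s")
print(f"per proc:  {per_proc_gb_s:.2f} GB/s")
```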
Posts
Doing what we are passionate about
I am lucky. I fully admit this. There are people out there who will tell you that it’s pure skill that they have been in business and been successful for a long time. Others will admit luck is part of it, but will again, pat themselves on the back for their intestinal fortitude. Few will say “I am lucky”. Which is a shame, as luck, timing (which you can never really, truly, control), and any number of other factors really are critical to one being able to have the luxury of doing what we are doing.
Posts
We had a record setting, knock the barn doors down year last year
… and believe it or not, I forgot to mention it. This is the first time in company history that we had a backlog going into Q1. Orders being built and tested on the last work day of the year. We grew, not the amount we had originally forecast, but we understand why (and sadly have little control over that aspect). We are working very hard on our appliances … I am blown away as to how perfect a fit they are for folks.
Posts
Day job at HPC on Wall Street on Monday the 9th
We’ll be showing off 2 appliances, with a change of what we are showing/announcing on one due to something not being ready on the business side. The first one is our little 108 port siRouter box. Think ‘bloody fast NAT’ and SDN in general, you can run other virtual/bare metal apps atop it.
The second will be a massive scale parallel SQL DB appliance. Usable for big data, Hadoop-like workloads, and other similar workloads more commonly used on other well known platforms.