Below you will find pages that utilize the taxonomy term “devops”
Posts
What reduces risk ... a great engineering and support team, or a brand name ?
I’ve written about approved vendors and the “one throat to choke” concept in the past. The short take from my vantage point as a small, not well known, but highly differentiated builder of high performance storage and computing systems … was that this brand specific focus was going to remove real differentiated solutions from market, while simultaneously lowering the quality and support of products in market. The concept of brand and marketing of a brand is about erecting barriers to market entry against the smaller folk who might have something of interest, and the larger folk who might come in with a different ecosystem.
pcilist: because sometimes you really, really need to know how your PCIe devices are configured
If you don’t know what I am talking about here, that’s fine. I’ll assume you don’t do hardware, or you call someone else when there is a hardware problem. If you think “well gee, don’t we have lspci? so why do we need this?” then you probably have not really tried to use lspci to find this information, or didn’t know it was available. Ok … what I am talking about.
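As a rough illustration of the kind of information in question (this is not the pcilist code itself), the negotiated and maximum PCIe link speed and width can be read from Linux sysfs. The sketch below assumes a kernel that exposes `current_link_speed`, `current_link_width`, `max_link_speed` and `max_link_width` under `/sys/bus/pci/devices`; the helper names `read_attr` and `pcie_links` are made up for this example.

```python
from pathlib import Path


def read_attr(dev: Path, name: str) -> str:
    """Return a sysfs attribute as text, or 'n/a' if it is absent or unreadable."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return "n/a"


def pcie_links(root: str = "/sys/bus/pci/devices"):
    """Collect negotiated and maximum link settings for every PCI device under root."""
    rows = []
    base = Path(root)
    if not base.is_dir():
        return rows
    for dev in sorted(base.iterdir()):
        rows.append({
            "address": dev.name,
            "speed": read_attr(dev, "current_link_speed"),
            "width": read_attr(dev, "current_link_width"),
            "max_speed": read_attr(dev, "max_link_speed"),
            "max_width": read_attr(dev, "max_link_width"),
        })
    return rows


if __name__ == "__main__":
    for row in pcie_links():
        # Skip devices that do not report any link information.
        if row["speed"] == "n/a" and row["width"] == "n/a":
            continue
        print(f'{row["address"]}: x{row["width"]} @ {row["speed"]} '
              f'(max x{row["max_width"]} @ {row["max_speed"]})')
```

A device whose current width or speed is below its maximum is usually the first thing worth looking at when a card underperforms.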
That was fun: mysql update nuked remote access
Update your packages, they said. It will be more secure, they said. I guess it was. No network access to the databases. Even after turning the database server instance to listen again on the right port, I had to go in and redo the passwords and privileges. So yeah, this broke my MySQL instance for a few hours. Took longer to debug as it was late at night and I was sleepy, so I put it off until morning with caffeine.
Architecture matters, and yes Virginia, there are no silver bullets for performance
Time and time again, the day job had been asked to discuss how the solutions are differentiated. Time and time again, we showed benchmarks on real workloads that show significant performance deltas. Not 2 or 3 sigma measurements. More often than not, 2x -> 10x better. Yet … yet … we were asked, again and again, how we did it. We pointed to our architecture. But, they complained, isn’t it the same as X (insert your favorite volume vendor here)?
Another itch scratched
So there you are, with many software RAIDs. You’ve been building and rebuilding them. And somewhere along the line, you lost track of which devices were which. So somehow you didn’t clean up the last build right, and you thought you had a hot spare … until you looked at /proc/mdstat … and said … Oh … So. I wanted to do the detailed accounting, in a simple way. I want the tool to tell me if I am missing a physical drive (e.
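To give a feel for the accounting involved, here is a minimal sketch (not the author's tool) that parses `/proc/mdstat` text and sorts each array's members into active, spare and failed devices. It assumes the usual mdstat layout, where members appear as `sdX[n]` with optional `(S)` or `(F)` flags and the following line carries a `[expected/working]` counter; `parse_mdstat` is a name invented for this example.

```python
import re

# A member looks like "sda1[0]", optionally followed by flags such as "(S)" or "(F)".
MEMBER_RE = re.compile(r"^([^\[\s]+)\[(\d+)\]((?:\([A-Z]\))*)$")
# The status line carries "[expected/working] [UU_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
HEAD_RE = re.compile(r"^(md\S*)\s*:\s*(.*)$")


def parse_mdstat(text: str) -> dict:
    """Return {array: {state, active, spares, failed, expected, working}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = HEAD_RE.match(line)
        if head:
            name, rest = head.groups()
            tokens = rest.split()
            current = {"state": tokens[0] if tokens else None,
                       "active": [], "spares": [], "failed": [],
                       "expected": None, "working": None}
            for tok in tokens:
                m = MEMBER_RE.match(tok)
                if not m:
                    continue
                dev, _slot, flags = m.groups()
                if "(F)" in flags:
                    current["failed"].append(dev)
                elif "(S)" in flags:
                    current["spares"].append(dev)
                else:
                    current["active"].append(dev)
            arrays[name] = current
        elif current is not None:
            status = STATUS_RE.search(line)
            if status:
                current["expected"] = int(status.group(1))
                current["working"] = int(status.group(2))
    return arrays


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as fh:
            report = parse_mdstat(fh.read())
    except OSError:
        report = {}
    for name, info in report.items():
        degraded = (info["expected"] or 0) > (info["working"] or 0)
        print(name, info["state"], "DEGRADED" if degraded else "ok",
              "spares:", info["spares"], "failed:", info["failed"])
```

An array where `expected` exceeds `working`, or one that shows no spare where you thought you had one, is exactly the "Oh …" moment described above.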
strace -p is your friend
So there I was, trying to use a serial port on a node which was connected to a serial port on a switch. Which I needed to properly configure the switch. So I light up minicom and get garbage. Great, a baud rate mismatch, easily fixed. Fix it. Connect again. I get the first 10-12 characters … and then garbage. Hmmm. I’d like to pause our story for a moment, and say I had the key insight at this moment … but that would not be true.
I don't agree with everything he wrote about systemd, but he isn't wrong on a fair amount of it
Systemd has taken the Linux world by storm, replacing 20-ish year old init style processing with a more legitimate control plane and a centralized resource to handle this control. There are many things to like within it, such as the granularity of control. But there are any number of things that are badly broken by default. Actually some of these things are specifically geared towards desktop users (which isn’t a bad thing if you are a desktop linux user, as I am).
That was fun ... no wait ... the other thing ... not fun
Long overdue update of the server this blog runs on. It is no longer running an Ubuntu flavor, but instead running SIOSv2 which is the same appliance operating system that powers our products. This isn’t specifically a case of eating our own dog-food, but more a case that Ubuntu, even the LTS versions, has a specific sell by date, and it is often very hard to update to the newer revs.
new SIOS feature: compressed ram image for OS
Most people use squashfs which creates a read-only (immutable) boot environment. Nothing wrong with this, but this forces you to have an overlay file system if you want to write. Which complicates things … not to mention when you overwrite too much, and run out of available inodes on the overlayfs. Then your file system becomes “invalid” and Bad-Things-Happen(™). At the day job, we try to run as many of our systems out of ram disks as we can.
A wonderful read on metrics, profiling, benchmarking
Brendan Gregg’s writings are always interesting and informative. I just saw a link on hacker news to a presentation he gave on “Broken Performance Tools”. It is wonderful, and succinctly explains many things I’ve talked about here and elsewhere, but it goes far beyond what I’ve grumbled over. One of my favorite points in there is slide 83. “Most popular benchmarks are flawed” and a pointer to a paper (easy to google for).
Updated net-tools bits
So far, 3 components, and working to fix a few things in formatting. On github, grab it here. First, lsbond.pl to report about bond details
root@unison-mgr-1:~/net-tools# ./lsbond.pl
bond0: mac 0c:c4:7a:48:69:cb state up mode fault-tolerance (active-backup) xmit_hash layer2 0 active slave eth1 polling 100 ms up_delay 200 ms down_delay 200 ms
slave nics:
eth1: mac 0c:c4:7a:48:69:cb, link 1, state up, speed 1000, driver igb, version 5.3.2.2 firmware version 1.61,0x8000090e
bond1: mac 00:12:c0:80:26:76 state up mode fault-tolerance (active-backup) xmit_hash layer2 0 active slave eth3 polling 100 ms up_delay 200 ms down_delay 200 ms
slave nics:
eth2: mac 00:12:c0:80:26:76, link 1, state up, speed 10000, driver ixgbe, version 4.
SIOS v2.0 running pxe booted
Our SIOS (Linux based OS, usually based upon Debian) has just been updated for jessie (Debian 8). This was necessary to support rkt, docker, etc. in addition to our other bits. It’s been cooking in the background for a while, for, as you might have noticed from my posting frequency, I’ve been busy. But we are up, and running. Base distro version here:
root@usn-ramboot:~# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 8.
Systemd, and the future of Linux init processing
An interesting thing happened over the last few months and years. Systemd, a replacement init process for Linux, gained more adherents, and supplanted the older style init.d/rc scripting in use by many distributions. Ubuntu famously abandoned init.d style processing in favor of upstart and others in the past, and has been rolling over to systemd. Red Hat rolled over to Systemd. As have a number of others. Including, surprisingly, Debian. For those who don’t know what this is, think of it this way.
Mixing programming languages for fun and profit
I’ve been looking for a simple HTML5-ish way to represent our disk drives in our Unison units. I’ve been looking for some simple drawing libraries in javascript to make this higher level, so I don’t have to handle all the low level HTML5 bits. I played with Raphael and a few others (including paper.js). I wound up implementing something in Raphael.
The code that generated this was a little unwieldy … as javascript doesn’t quite have all the constructs one might expect from a modern language.
And the 0.8.3 InfluxDB no longer works with the InfluxDB perl module
I ran into this a few weeks ago, and am just getting around to debugging it now. Traced the code, set up a debugger and followed the path of execution, and … and … Yup, it’s borked. So, I can submit a patch or 3 against the InfluxDB code, or roll a simpler more general Time Series Data Base interface that will talk to InfluxDB. And eventually kdb+. Since I wanted to code for that as well, I am looking more seriously at the second option.
Solved the major socket bug ... and it was a layer 8 problem
I’d like to offer an excuse. But I can’t. It was one single missing newline. Just one. Missing. Newline. I changed my config file to use port 10000. I set up an nc listener on the remote host.
nc -k -l a.b.c.d 10000
Then I invoked the code. And the data showed up. Without a ()&(&%&$%*&(^ newline. That couldn’t possibly be it. Could it? No. It’s way too freaking simple.
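Why one newline matters: a line-oriented receiver treats the newline as the record terminator, so a payload without one is never seen as a complete record. The sketch below shows a defensive sender that always terminates what it writes. It is an illustration of the failure mode, not the original code, and `send_line` is a name invented here.

```python
import socket


def send_line(sock: socket.socket, payload: str) -> None:
    """Send payload as exactly one newline-terminated record."""
    if not payload.endswith("\n"):
        payload += "\n"
    sock.sendall(payload.encode("utf-8"))


if __name__ == "__main__":
    # A connected socket pair stands in for the client and the remote listener.
    client, server = socket.socketpair()
    send_line(client, "hello from the collector")
    client.close()
    with server.makefile("r", encoding="utf-8") as reader:
        print(repr(reader.readline()))
    server.close()
```

Pointing the sender at an `nc` listener, as above, remains the quickest way to see exactly which bytes go out on the wire.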
New monitoring tool, and a very subtle bug
I’ve been working on coding up some additional monitoring capability, and had an idea a long time ago for a very general monitoring concept. Nothing terribly original, not quite nagios, but something easier to use/deploy. Finally I decided to work on it today. The monitoring code talks to a graphite backend. Could talk to statsd, or other things. In this case, we are using the InfluxDB plugin for graphite. I wanted an insanely simple local data collector.
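For readers who have not fed graphite by hand, a minimal local collector can be sketched in a few lines. This assumes the graphite plaintext protocol (one `path value timestamp` record per line, conventionally on port 2003); the function names and the host in the usage note are hypothetical, and this is not the author's monitoring code.

```python
import socket
import time


def graphite_line(path: str, value, timestamp=None) -> str:
    """Format one plaintext-protocol record: 'path value timestamp' plus a newline."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(host: str, port: int, lines) -> None:
    """Open a TCP connection and send all records in one write."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


if __name__ == "__main__":
    # Print instead of sending, so the sketch runs without a graphite server.
    print(graphite_line("lightning.cpuload.avg1", 0.25), end="")
```

With a listener available, usage would look like `send_metrics("graphite.example.com", 2003, [graphite_line("lightning.cpuload.avg1", 0.25)])`, where the host name is a placeholder.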
InfluxDB cli is up on github
I know there is a node version, and I did try it before I wrote my own. Actually, the reason I wrote my own was that I tried it and … well … Link is here. And yes, the readme is borked about 1/2 way through. Doesn’t show the formatting of the output quite right. Will try to fix over the weekend, as I move this to a far more feature complete bit.
Have a nice cli for InfluxDB
I tried the nodejs version and … well … it was horrible. Basic things didn’t work. Made life very annoying. So, being a good engineering type, I wrote my own. It will be up on our site soon. Here’s an example
./influxdb-cli.pl --host 192.168.5.117 --user test --pass test --db metrics
metrics> \list series
.----------------------------------.
| series name                      |
+----------------------------------+
| lightning.cpuload.avg1           |
| lightning.cputotals.idle         |
| lightning.cputotals.irq          |
| lightning.
Be on the lookout for 'pauses' in CentOS/RHEL 6.5 on Sandy Bridge
Probably on Ivy Bridge as well. Short version. The pauses that plagued Nehalem and Westmere are baaaack. In RHEL/CentOS 6.5 anyway. A customer just ran into one. We helped diagnose/work around this a few years ago when a hedge fund customer ran into this … then a post-production shop … then … Basically the problem came in from the C-states. The deeper the sleep state, in some instances, the processor would not come out of it, or get stuck in the lower levels.
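When chasing this sort of pause, it helps to see which idle states the kernel is actually offering a core. The sketch below lists them from sysfs, assuming a Linux kernel with the cpuidle subsystem exposing `name`, `latency` and `disable` under `/sys/devices/system/cpu/cpuN/cpuidle/stateM`. It only inspects state and is not the workaround used for the customer; `idle_states` is a name invented for this example.

```python
from pathlib import Path


def read_attr(path: Path) -> str:
    """Return the file's contents stripped, or 'n/a' if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return "n/a"


def idle_states(cpu: int = 0, root: str = "/sys/devices/system/cpu"):
    """List the idle (C-) states the kernel exposes for one CPU, in index order."""
    base = Path(root) / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []
    dirs = [d for d in base.glob("state*") if d.name[5:].isdigit()]
    states = []
    for d in sorted(dirs, key=lambda p: int(p.name[5:])):
        states.append({
            "index": int(d.name[5:]),
            "name": read_attr(d / "name"),
            "latency_us": read_attr(d / "latency"),  # exit latency in microseconds
            "disabled": read_attr(d / "disable"),
        })
    return states


if __name__ == "__main__":
    for s in idle_states(0):
        print(f'state{s["index"]}: {s["name"]} '
              f'exit latency {s["latency_us"]} us, disable={s["disabled"]}')
```

The deeper states are the ones with the larger exit latencies, which is where the "would not come out of it" behavior described above would show up.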
Don't know if I mentioned it, but the day job has a new website
Take a gander. Some things are missing, and our marketing folks are developing the content where needed, and revising it where we have existing content. It’s quite refreshing to see this. It will get better over time. It’s running in our facility now, and likely we’ll have a few clones in the cloud as well. But that’s for later.
Update on IPMI Console Logger
Config now comes from some nice and simple json, and it handles multiple machines with aplomb. See the git repository for the latest. The config file example is in there, and you can replicate the n01-ipmi section with more nodes trivially. Coming next is getting config from a trusted web server, along with registering the client to the trusted web server. This prevents things like passwords from showing up in the clear, though you can always create a lower privileged user to access the console for monitoring.