Thursday, October 20, 2016

We need more desktop processor branches

Ars Technica is reporting an interesting attack that uses a side-channel exploit in the Intel Haswell branch target buffer, or BTB (kindly ignore all the political crap Ars has been posting lately; I'll probably not read any more articles of theirs until after the election). The idea is to break through ASLR, or address space layout randomization, to find pieces of code one can string together or directly attack for nefarious purposes. ASLR defeats a certain class of attacks that rely on knowing the exact address of code in memory. With ASLR, an attacker can no longer count on code being in a constant location.

Intel processors since at least the Pentium use a relatively simple BTB to speed up finding the target of a branch instruction. The buffer is essentially a dictionary mapping the virtual addresses of recent branch instructions to their predicted targets: if the branch is taken, the chip has the actual new address right away, and time is saved. To save space and complexity, most processors that implement a BTB only do so for part of the address (or they hash the address), which reduces the overhead of maintaining the BTB but also means some addresses will map to the same index into the BTB and cause a collision. If the addresses collide, the processor will recover, but it will take more cycles to do so. This is the key to the side-channel attack.
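
As a toy illustration (a sketch of the concept only: the index width and hash here are made up, and the paper reconstructs Haswell's actual indexing function), consider a BTB indexed by only the low 30 bits of the branch address:

    /* Toy model of a partially-indexed BTB. Bits 30 and up are
       ignored, so two branches that agree in their low 30 bits
       land in the same slot and collide. Widths are illustrative. */
    #include <stdio.h>
    #include <stdint.h>

    static unsigned btb_index(uint64_t vaddr) {
        uint64_t low30 = vaddr & ((1ULL << 30) - 1);
        return (unsigned)((low30 ^ (low30 >> 15)) & 0x7fff); /* toy hash */
    }

    int main(void) {
        uint64_t kernel_branch = 0xffffffff81234abcULL; /* hypothetical */
        uint64_t spy_branch    = 0x0000000001234abcULL; /* same low 30 bits */
        printf("%u == %u\n", btb_index(kernel_branch), btb_index(spy_branch));
        return 0; /* both print the same index: a collision */
    }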

(For the record, the G3 and the G4 use a BTIC instead, or branch target instruction cache, where the table actually caches the first two instructions at the branch target so the processor can be executing them while the rest of the target loads. The G4/7450 ("G4e") extends the BTIC to four instructions. This scheme is highly beneficial because the cached instructions essentially extend the processor's general-purpose caches with needed instructions that are less likely to be evicted, but it is more complex to manage. It is probably for this reason that the BTIC was dropped in the G5, since the idea doesn't work well with the G5's instruction dispatch groups; the G5 uses a three-level hybrid predictor which is unlike either of these schemes. Most PowerPC implementations also have a return address stack for optimizing the blr instruction. With all of these unusual features, Power ISA processors may be vulnerable to a similar timing attack, but certainly not in the same way and probably not as predictably, especially on the G5 and later designs.)

To get around ASLR, an attacker needs to find out where the code block of interest actually got moved to in memory. Certain attributes make kernel ASLR (KASLR) an easier nut to crack: for performance reasons, usually only part of the kernel address is randomized; in open-source operating systems the randomization scheme is often known; and the kernel is always loaded fully into physical memory and never swapped out. While the location it is loaded to is also randomized, the kernel is mapped into the address space of every process, so if you can find its address in any process you've found it in all of them. Haswell makes this even easier because all of the bits the Linux kernel randomizes are covered by the low 30 bits of the virtual address Haswell uses in the BTB index, which covers the entire kernel address range and means any kernel branch address can be determined exactly. The attacker finds a branch instruction in the kernel code that services a particular system call (by disassembling the kernel, for example), computes all the possible locations that branch could occupy (feasible due to the smaller search space), creates a "spy" function with a branch instruction positioned to map to the same BTB index and thus force a collision, executes the system call, and then executes the spy function. If the spy process (which times itself) determines its branch took longer than an average branch, it logs a hit; the delta between ordinary execution and a BTB collision is unambiguously high (see Figure 7 in the paper). Once you have the address of that kernel branch, you can deduce the address of the entire kernel code block (because it's generally in the same page of memory, due to the typical granularity of the randomization scheme) and try to get at it or abuse it. The entire process can take just milliseconds on a current CPU.
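
In skeleton form, the probe half of the attack looks something like this (a sketch under heavy assumptions: the threshold value, the spy/syscall plumbing, and the careful alignment of the spy's branch are all things the real exploit has to engineer):

    /* Sketch of the spy's timing harness (x86_64, GCC/clang).
       After the victim syscall primes the kernel branch's BTB entry,
       the spy times its own colliding branch; a slow run means a
       BTB collision occurred -- a hit. */
    #include <stdint.h>
    #include <x86intrin.h>                 /* __rdtscp */

    #define COLLISION_THRESHOLD 40         /* hypothetical cycle count */

    static uint64_t time_one_run(void (*spy)(void), long (*syscall_stub)(void)) {
        unsigned aux;
        syscall_stub();                    /* primes the BTB via the kernel */
        uint64_t t0 = __rdtscp(&aux);
        spy();                             /* branch positioned to collide */
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }

    int is_hit(void (*spy)(void), long (*syscall_stub)(void)) {
        return time_one_run(spy, syscall_stub) > COLLISION_THRESHOLD;
    }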

The kernel is often specifically hardened against such attacks, however, and there are more tempting targets, though they need more work. If you want to attack a user process (particularly one running as root, since that will have privileges you can subvert), you have to get your "spy" onto the same virtual core as the victim process, or the two won't share a BTB -- in the kernel's case, the system call always executes on the same virtual core via context switch, but that's not the case here. This requires manipulating the OS' process scheduler or running lots of spy processes, which slows the attack but is still feasible. Also, since you won't have a kernel system call to execute, you have to get the victim to do a particular task containing a branch instruction, and that task needs to be repeatable. Once this is done, however, the basic notion is the same. Even though only a limited number of ASLR bits can be recovered this way (remember that in Haswell's case, bits 30 and above are not used in the BTB index, and full Linux ASLR randomizes bits 12 to 40, unlike the kernel), you can dramatically narrow the search space to the point where brute-force guessing may be possible. The whole process is certainly much more streamlined than earlier ASLR attacks, which relied on fragile things like cache timing.
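
Getting the spy onto the victim's virtual core is the easy part, at least on Linux; something like this minimal sketch (the choice of core, and knowing the victim is even there, is the attacker's problem):

    /* Pin the calling process to one virtual core so it shares a
       BTB with a victim already running there (Linux-specific). */
    #define _GNU_SOURCE
    #include <sched.h>

    int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set); /* 0 = self */
    }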

As it happens, software mitigations can blunt or possibly even completely eradicate this exploit. Brute-force guessing addresses in the kernel usually leads to a crash, so anything that forces the attacker to guess the address of a victim routine in the kernel will likely cause the exploit to fail catastrophically. Get a couple of those random address bits outside the 30 bits Haswell uses in the BTB index and bingo: a relatively simple fix. One could also make ASLR more granular, randomizing at the function, basic-block or even single-instruction level rather than merely randomizing the starting address of segments within the address space, though this is much more complicated. However, hardware changes are needed to close the gap completely. A proper hardware solution would be to use most or all of the virtual address in the BTB to reduce the possibility of a collision, to add a per-process random salt to whatever indexing or hashing function is used for BTB entries so a collision becomes less predictable, or both. Either needs a change from Intel.
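
The salting idea, modelled in software (a toy sketch only: the index width, the mixing function and where the salt comes from are all assumptions, since the real indexing function would live in silicon):

    /* Toy model of a salted BTB index. Each process/context gets a
       random salt folded into the hash, so an attacker can no longer
       precompute which of its own branch addresses will collide
       with a victim's. */
    #include <stdint.h>

    static unsigned salted_btb_index(uint64_t vaddr, uint64_t salt) {
        uint64_t x = vaddr ^ salt;          /* salt varies per context */
        x ^= x >> 29;
        x *= 0x9e3779b97f4a7c15ULL;         /* cheap mixing, illustrative */
        return (unsigned)(x >> 49);         /* 15-bit index, assumed width */
    }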

This little fable should serve to remind us that monocultures are bad. The exploit in question is viable and potentially ugly, but it can be mitigated. That's not the point: the point is that the attack, particularly upon the kernel, is made more feasible by particular details of how Haswell chips handle branching. When everything gets funneled through the same design and engineering optics and ends up with the same implementation, if someone comes up with a simple, weapons-grade exploit for a flaw in that implementation that software can't mask, we're all hosed. This is another reason why we need an auditable, powerful alternative to x86/x86_64 on the desktop. And there's only one system in that class right now.

Okay, okay, I'll stop banging you over the head with this stuff. I've got a couple more bugs under investigation that will be fixed in 45.5.0, and if you're having the issue where TenFourFox is not remembering your search engine of choice, please post your country and operating system here.

Saturday, October 15, 2016

It's Talos time (plus: 45.5.0 beta 2 now with more AltiVec IDCT)

It's Talos time. You can now plunk down your money for an open, auditable, non-x86 workstation-class computer that doesn't suck. It's PowerPC. It's modern. It's beefy. It's awesome.

Let's not mince words, however: it's also not cheap, and you're gonna plunk down a lot if you want this machine. The board runs $4100, and that's without the CPU, which is pledged for separately though you can group them in the same order (this is a little clunky and I don't know why Raptor did it this way). To be sure, I think we all suspected this would be the case, but now it's clear the initial prices were underestimates. Although some car repairs and other things have diminished my budget (I was originally going to get two of these), I still ponied up for a board and for one of the 190W octocore POWER8 CPUs, since this appears to be the sweet spot for those of us planning to use it as a workstation (remember, each core has eight threads via SMT, for a grand total of 64, and this part has the fastest turbo clock speed at 3.857GHz). That ran me $5340. I think after the RAM, disks, video card, chassis and PSU I'll probably be all in for around $7000.

Too steep? I don't blame you, but you can still help by donating to the project, enabling those of us who can afford to jump in first to smooth the way for you. Frankly, this is the first machine I consider a meaningful successor to the Quad G5 (the AmigaOne series isn't quite there yet). Non-x86 doesn't have the economies of scale of your typical soulless Chipzilla craptop or beige box, but if we can collectively help Raptor get this project off the ground you'll finally have an option for your next big machine when you need something free, open and unchained -- and there are a lot of chains in modern PCs that you don't control. You can donate as little as $10 and get this party started, or donate $250 and get to play with one remotely for a few months. Call it a rental if you like. No, I don't get a piece of this, I don't have stock in Raptor and I don't owe them a favour. I simply want this project to succeed. And if you're reading this blog, odds are you want that too.

The campaign ends December 15. Donate, buy, whatever. Let's do this.

My plan, even though I confess I'll be running it little-endian (since unfortunately I don't think we have much choice nowadays), is to make it as much a true successor to the last Power Mac as possible. Yes, I'll be sinking time into a JIT for it, which should fully support asm.js to truly run those monster applications we're seeing more and more of; porting over our AltiVec code with an endian shift (since the POWER8 has VMX); and working on a viable and fast way of running legacy Power Mac software on it, whether through KVM or QEMU or whatever turns out to be the best option. If this baby gets off the ground, you have my promise that doing so will be my first priority, because this is what I wanted the project for in the first place. We have a chance to resurrect the Power Mac, folks, and in a form that truly kicks ass. Don't waste the opportunity.

Now, having said all that, I do think Raptor has made a couple of tactical errors. Neither is fatal, but neither is small.

First, there needs to be an intermediate pledge level between the bare board and the $18,000 (!!!!) Warren Buffett edition. I have no doubt the $18,000 machine will be the Cadillac of this line, but like Cadillacs, there isn't $18,000 worth of parts in it (maybe, maybe, $10K), and this project already has a bad case of sticker shock without slapping people around with that particular dead fish. Raptor needs to slot something in the middle that isn't quite as wtf-inducing, and I'll bet it would appeal to those people willing to spend a little more to get a fully configured box. (I might have been one of those people, but I won't have the chance now.)

Second, the pledge threshold of $3.7 million is not ludicrous when you consider what has to happen to manufacture these things, but it sure seems that way. Given that this can only be considered a boutique system at this stage, it's going to take a lot of punters like yours truly to cross that point, which is why your donations, even if you're not willing to buy right now, are critical to get this thing jumpstarted. I don't know Raptor's finances, but they gave themselves a rather high hurdle here, and I hope it doesn't doom the whole damn thing.

On the other hand, it doesn't look like Apple's going to be updating the Mac Pro any time soon, so if you're in the market ...

On to 45.5.0 beta 2 (downloads, hashes). The two major changes in this version are that I did some marginal reduction in the overhead of graphics primitives calls, and that I completed converting all of the VP9 inverse discrete cosine and Hadamard transforms to AltiVec. Feel free to read all 152K of it, patterned largely off the SSE2 version but still mostly written by hand; I also fixed the convolver on G4 systems and made it faster too. These transforms probably account for the biggest single chunk of CPU time while decoding frames. I can do some more by starting on the intraframe predictors, but that will probably not yield speedups as dramatic. My totally unscientific testing yields these recommendations for specific machines:

  • 1.0GHz iMac G4 (note: not technically supported, but a useful comparison): maximum watchable resolution 144p VP9
  • 1.33GHz iBook G4, reduced performance: same
  • 1.33GHz iBook G4, highest performance: good at 144p VP9, max at 240p VP9, but VP8 is better
  • 1.67GHz DLSD PowerBook G4: ditto, VP8 better here too
  • 2.5GHz Quad G5, reduced performance: good at 240p VP9, max at 360p VP9
  • 2.5GHz Quad G5, highest performance: good at 360p VP9, max at 480p VP9

I'd welcome your own assessments, but since VP8 (i.e., MediaSource Extensions off) is "good enough" on the G5 and actually currently better on the G4, I've changed my mind again: I'll continue to ship with MSE turned off so that video still works as people expect. However, you'll still be able to toggle the option in our pref panel, which was also fixed to allow toggling PDF.js (that was a stupid bug caused by a change I forgot to pull forward into the released build). When VP9 is clearly better on all supported configurations, we'll reexamine this.

No issues have been reported regarding little-endian JavaScript typed arrays or our overall new hybrid endian strategy, or with the minimp3 platform decoder, so both of those features are go. Download and try it.

Have you donated yet?

Sunday, October 9, 2016

The Xserves still lurk

Xserves, the last true Apple server systems, apparently still lurk in dark corners in data centres near you. But for me, the quintessential Apple server line will always be the Apple Network Servers.

Saturday, October 8, 2016

A Saturday mystery, or, locatedb considered harmful to old Macs

I've been waist-deep in AltiVec intrinsics for the last week, converting some of those big inverse discrete cosine and Hadamard transforms for TenFourFox's vectorized PowerPC VP9 codec. The little ones caused a noticeable but minor improvement, but when I got the first large transform done there was a big jump in performance on this quad G5. Note that the G5, even though its vector unit is based on the 7400's and therefore weaker than the 7450's, likes long strings of sequential code it can reorder, which is essentially what that huge clot of vector intrinsics is, so I have not yet determined whether I've just optimized it well for the G5 or it generalizes to the G4 too. My theory is that even though the improvement ratio is about the same (somewhere between 4:1 and 8:1, depending on how much data they swallow per cycle), these huge vectorized inverse transforms accelerate code that takes a lot of CPU time ordinarily, so it's a bigger absolute reduction. I'm going to work on a couple more this weekend and see if I can get even more mileage out of it. 720p playback is still out of the question even with the Quad at full tilt, but 360p windowed is pretty smooth, and even 360p fullscreen (upscaled to 1080p) and 480p windowed can keep up, and it buffers a lot quicker.
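
For a taste of what these vectorized transforms look like (a hand-waved sketch: the names and constants here are invented, and the real ones in the tree are hundreds of lines of exactly this kind of thing):

    /* One AltiVec "butterfly" stage of the sort the VP9 inverse
       transforms are built from, processing eight 16-bit lanes at
       once. vec_mradds (multiply, round, add, saturate) is the
       workhorse for the fixed-point cosine multiplies. */
    #include <altivec.h>

    static void butterfly(vector signed short *a, vector signed short *b,
                          vector signed short cospi) {
        vector signed short zero = vec_splat_s16(0);
        vector signed short t0 = vec_mradds(*a, cospi, zero);
        vector signed short t1 = vec_mradds(*b, cospi, zero);
        *a = vec_adds(t0, t1);    /* saturating add */
        *b = vec_subs(t0, t1);    /* saturating subtract */
    }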

The other thing I did was eliminate some inefficiencies in the CoreGraphics glue we use for rendering pretty much everything (there is no Skia support on 10.4) except the residual Cairo backend that handles printing. In particular, I overhauled our blend and composite mode code so that it avoids a function call on every draw operation. This is a marginal speedup, but it makes some types of complex animation much smoother.
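
The trick is nothing exotic, just hoisting redundant state changes out of the hot path; roughly like this sketch (the struct and names are invented for illustration, not TenFourFox's actual code):

    /* Only push the blend mode to CoreGraphics when it changes,
       instead of unconditionally on every draw. Illustrative only. */
    #include <ApplicationServices/ApplicationServices.h>

    typedef struct {
        CGContextRef cg;
        CGBlendMode  lastBlend;
        int          blendValid;
    } DrawState;

    static void EnsureBlendMode(DrawState *ds, CGBlendMode mode) {
        if (ds->blendValid && ds->lastBlend == mode)
            return;                          /* skip the per-draw call */
        CGContextSetBlendMode(ds->cg, mode);
        ds->lastBlend  = mode;
        ds->blendValid = 1;
    }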

Overall I'm pretty happy with this, and no one has reported any issues with the little-endian typed array switchover, so I'll make a second beta release sometime next week, hopefully. MSE will still be off by default in that build, but unless I hear different or some critical showstopper crops up, it will be switched on for the final public release.

When I sat down at my G5 this warm Southern California Saturday morning, however, I noticed that MenuMeters (a great tool to have, if you don't already) showed the Quad was already rather occupied. This wasn't a new thing; I'd seen what I assumed was a stuck cron job or something on the last several Saturday mornings and killed it in Activity Monitor. But this was the sixth week in a row it had happened, and it looked like it had been running for over three hours wasting CPU time, so enough was enough.

The offending process was something running /usr/bin/find to find, well, everything (that wasn't in /tmp or /usr/tmp), essentially iterating over the whole damn filesystem. A couple of ps -wwjp invocations (What Would Jesus Post?) later showed it was being kicked off as part of the update process for an old Unix dragon of yore, locate.

There are no fewer than three ways to find files from the command line in OS X (er, macOS). One is the venerable find command, which is the slowest of the lot (it uses no cache) and whose predicates can be somewhat confusing to novices, but which is guaranteed to be up-to-date because it doesn't rely on a pre-existing database and will find nearly anything. The second is, of course, Spotlight, which is accessible from the Terminal using the mdfind command. There are man pages for both.

The third way is locate, which is easier than find and faster because it uses a database for quick lookups, but less comprehensive than Spotlight/mdfind because it matches only filenames rather than file content as well, and its updater has to run periodically to stay current. (There's a man page for it too.) It would seem that Spotlight could completely supersede locate, and Apple thinks so too, because it was turned into a launchd .plist in 10.6 (look in /System/Library/LaunchDaemons/) and disabled by default. That's not the case for 10.5 and earlier, however, and I have so many files on my G5 by now that the runtime to update the locate database is close to five hours -- on an SSD! And that's why it was still running when I sat down to do work after breakfast.

I don't use locate because Spotlight is more convenient and updates practically on demand instead of weekly. If you don't either, then niced or not it's wasted effort, and you should disable it from running within your Mac's periodic weekly maintenance tasks. (Note: on 10.3 and earlier you don't have Spotlight, so you may not want to do this unless locate's update process is tying up your machine too.) Here's how:

  • On 10.5, the weekly periodic script can be told specifically not to run locate.updatedb. Edit /etc/defaults/periodic.conf as root (such as sudo vi /etc/defaults/periodic.conf -- you did fix the sudo bug, right?) and set weekly_locate_enable to "NO".

  • On 10.4 and before (I checked this on my 10.2.8 strawberry iMac G3 as well, so I'm sure 10.3 is the same), the weekly script doesn't offer this option. However, it does check to see if locate.updatedb is executable before it runs it, so simply make it non-executable: sudo chmod -x /usr/libexec/locate.updatedb

Now for some 8-Bit Weapon ambient (de)programming with a much more sedate G5 into the rest of the weekend.

Friday, September 30, 2016

gdb7 patchlevel 4 available

First, in the "this makes me happy" dept.: a Commodore 64 in a Gdansk, Poland auto repair shop, still punching a clock after a quarter-century. Take that, Phil Schiller, you elitist pig.

Second, as promised, patchlevel 4 of the TenFourFox debugger (our hacked version of gdb) is available from SourceForge. This is a minor bugfix update that wallpapers a crash when doing certain backtraces or other operations requiring complex symbol resolution. However, the minimum patchlevel to debug TenFourFox is still 2, so this upgrade is merely recommended, not required.

Wednesday, September 28, 2016

Firefox OS goes Tier WONTFIX

I suppose it shouldn't be totally unexpected given the end of Firefox OS phone development a few months ago, but a platform going from supported to take-it-completely-out-of-mozilla-central in less than a year is rather startling: not only has commercial development on Firefox OS completely ceased (at version 2.6), but the plan is to remove all B2G code from the source tree entirely. It's not an absolutely clean comparison to us, because some APIs are still relevant to current versions of OS X (er, macOS), but even support for our now-ancient cats was removed from the codebase only gradually, in stages, and some portions of pre-10.4 code persisted until relatively recently. The new state of Firefox OS, where platform support is actually unwelcome in the repository, is beyond even the lowest state of Tier-3, where our own antiquated port lives. Unofficially this would be something like the "tier WONTFIX" BDS referenced a number of years ago.

There may be a plan to fork the repository, but they'd need someone crazy dedicated to keep chugging out builds. We're not anything on that level of nuts around here. Noooo.

Tuesday, September 27, 2016

And now for something completely different: Rehabilitating the Performa 6200?

The Power Macintosh 6200, in its many Performa variants, has one of the worst reputations of any Mac, and its pitifully small 603 L1 caches add insult to injury (its poor 68K emulation performance was part of the reason Apple held up the PowerBook's migration to PowerPC until the 603e, and then screwed it up with the PowerBook 5300, a unit that is IMHO judged overly harshly by history, though not without justification). LowEndMac has a long list of its perceived faults.

But every unloved machine has its defenders, and I noticed that the Wikipedia entry on the 6200 series changed radically recently. The "Dtaylor372" listed in the edit log appears to be this guy, one "Daniel L. Taylor". If so, here's his reasoning why the seething hate for the 6200 series should be revisited.

Daniel does make some cogent points, cites references, and even tries to back some of them up with benchmarks (heh). He helpfully includes a local copy of Apple's tech notes on the series, though let's be fair here -- Apple is not likely to say anything unbecoming in that document. That said, the effort is commendable even if I don't agree with everything he's written. I'll just cite some of what I took as highlights and you can read the rest.

  • The Apple tech note says, "The Power Macintosh 5200 and 6200 computers are electrically similar to the Macintosh Quadra 630 and LC 630." It might be most accurate to say that these computers are Q630 systems with an on-board PowerPC upgrade. It's an understatement to observe that this is not the most favourable environment for these chips, but it would have required much less development investment, to be sure.

  • He's right that the L2 cache, which is on a 64-bit bus and clocked at the actual CPU speed, certainly does mitigate some of the problems with the Q630's 32-bit interface to memory, and 256K of L2 in 1995 would have been a perfectly reasonable amount of cache. (See page 29 for the block diagram.) A 20-25% speed penalty (his numbers), however, is not trivial, and I think he underestimates how it would have made the machines feel comparatively in practice, even on native code.

  • His article claims that both the SCSI bus and the serial ports have DMA, but I don't see this anywhere in the developer notes (and at least one source contradicts him). While the NCR controller that the F108 ASIC incorporates does support it, I don't see where this is hooked up. More to the point, the F108's embedded IDE controller -- because the 6200 actually uses an IDE hard drive -- doesn't have DMA set up either: if the Q630 is any indication, the 6200 is also limited to PIO Mode 3. While this was no great sin when the Q630 was in production, it was verging on unacceptable even for a low-to-midrange system by the time the 6200 hit the market. More on that in the next point.

    Do note that the Q630 design does support bus mastering, but not from the F108. The only entities which can be bus master are the CPU and either the PDS expansion card or the communications card, via the PrimeTime II IC "southbridge."

  • Daniel makes a very well-reasoned assertion that the computer's major problems were due to software rather than hardware design, which is at least partially true, but I think his objections are oversimplified. Certainly the Mac OS (that's with a capital M) was not well suited to the real-time demands of hardware: ADB, for example, requires quite a bit of polling, and the OS could not service the bus often enough to make it effective for large-volume data transfer (condemning it to a largely HID-only capacity, though that's all it was really designed for). Even interrupt-driven device drivers could be problematic: a large number of interrupts pending simultaneously could get dropped (the limit on outstanding secondary interrupt requests prior to MacOS 9.1 was 40; see Apple TN2010), and a badly-coded driver that did not shunt work off to a deferred task (see the sketch after this list) could prevent other drivers from servicing their devices, because those other interrupts were disabled while the bad driver tied up the machine.

    That said, however, these were hardly unknown problems at the time, and the design's lack of DMA where it counts causes an abnormal reliance on software to move data, which for those and other reasons the MacOS was definitely not up to doing, and the speed hit didn't help. Compare this design with the 9500's full PCI bus, 64-bit interface and hardware assist: even though the 9500 was positioned at a very different market segment, and the weak 603 implementation is no comparison to the 604, that doesn't absolve the 6200 of its other deficiencies, and the 9500 ran the same operating system with considerably fewer problems (he does concede that his assertions to the contrary do "not mean that [issues with redraw, typing and audio on the 6200s] never occurred for anyone," though his explanation of why is of course different). Although Daniel states that relaying traffic for an Ethernet card "would not have impacted Internet handling" based on his estimates of actual bandwidth, the real rate-limiting step here is how quickly the CPU, and by extension the OS, can service the controller. While the comm slot at least could bus-master, that only helps when it's actually serviced to initiate the transfer. My personal suspicion is that the changes in OpenTransport 1.3 reduced a lot of the latency issues in earlier versions of OT, which is why MacOS 8.1 was widely noted to smooth out a lot of the 6200's alleged network problems. But even at the time of these systems' design, Copland (the planned successor to System 7) was already showing glimmers of trouble, and no one seriously expected the MacOS to improve explosively within the machines' likely sales life. Against that historical backdrop, the 6200 series could have been much better from the beginning if the component machines had been engineered to deal with what the OS couldn't in the first place.
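
The deferred-task discipline mentioned above, in generic terms (a sketch with invented names, not the actual classic Mac OS Deferred Task Manager calls), looks like this:

    /* Interrupt-time work is kept tiny: stash the status, schedule
       the heavy lifting for later so other devices' interrupts
       aren't starved. All names here are invented for illustration. */
    #define QUEUE_SIZE 64

    extern void schedule_deferred(void (*fn)(void)); /* hypothetical OS hook */
    extern void process_status(unsigned char);       /* hypothetical driver work */

    static volatile unsigned char pending[QUEUE_SIZE];
    static volatile int head, tail;

    /* Runs later, with interrupts re-enabled: the heavy lifting. */
    static void drain(void) {
        while (tail != head) {
            process_status(pending[tail]);
            tail = (tail + 1) % QUEUE_SIZE;
        }
    }

    /* Runs at interrupt time, other interrupts disabled: keep it tiny. */
    void interrupt_handler(unsigned char status) {
        int next = (head + 1) % QUEUE_SIZE;
        if (next != tail) {            /* drop on overflow, never block */
            pending[head] = status;
            head = next;
        }
        schedule_deferred(drain);      /* queue drain() for later */
    }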

In the United States, at least, the Power Macintosh 6200 family was only ever sold under the budget "Performa" line, and you should read that as Michael Spindler being Spindler, i.e., cheap. In that sense, putting as little extra design money into it as possible wasn't ill-conceived, even if it was crummy, and I will freely admit my own personal bias in that I've never much cared for the Quadra 630 or its derivatives, because there were better choices then and later. I do have to take my hat off to Daniel for trying to salvage the machine's bad image, and he goes a long way toward dispelling some of the more egregious misconceptions, but crummy's still as crummy does. I think the best that can be said here is that while it's likely better than its reputation, even with careful reconsideration of its alleged flaws the 6200 family is still notably worse than its peers.