Saturday, February 4, 2017

An experiment: introducing the PopOutPlayer (plus: microbenchmarking PowerPC SPRs)

TenFourFox 45.8 is coming along -- Operation Short Change (my initiative to rework IonMonkey to properly emit single branch instructions instead of entire branch stanzas when the target is guaranteed to be in range) is smoothing out the browser quite a bit, but there are hundreds of places that need to be changed. I've finished the macroassembler and the regular expression engine (and improved both of them in a couple of other minor ways), and now I'm working on the Ion code generator. This mostly benefits systems that cannot fit generated code into cache, so while the Quad G5 with its (comparatively large) 1MB of L2 cache per core has only improved on the order of 5%, I expect more on low-end systems because hot code is now more likely to stay in the i-cache. I am also planning to add user agent support in 45.8, and there are a few other custodial improvements. The idea is to release that to beta testing around Valentine's Day so I don't have to buy you lot any flowers.

A long while back I mentioned a couple of secret projects; one of them was SandboxSafari, and the other was intended as a followup to the now-obsolete MacTubes Enabler. That second project got backburnered for a while, but since I'm noticing changes that may obsolete the work I've done on it as well, I'm going to push it out the door in its current state. Without further ado, let's introduce the PopOutPlayer.

This is not Photoshop. There is, just as you see, a floating video window playing the Vimeo movie that TenFourFox cannot play itself (no built-in H.264 codec). In fact, I think this is the only way you can play Vimeo movies on Tiger now -- neither Safari nor OmniWeb works.

Yes, it works fine with YouTube also:

Drop the application in your Applications folder, drop the PopOutPlayer Enabler addon on TenFourFox, and then on any Vimeo (10.4 only) or YouTube (10.4 or 10.5) page or embedded video, right-click or SHIFT-right-click and select Pop Out Video. The video floats in its own window on top of other windows. You can close the browser tab and go look at something else while the video plays; the playback is a completely independent process. When you open another video, the window pops up in the same location where you left it. On multiprocessor Power Macs it can even be scheduled on another core for smoother playback.

Sounds great, right? Well, there are a few reasons why I hadn't released this earlier.

First, the app itself is, and I'm actually feeling ill writing this, based on these sites' Flash players. Flash was the only reliable way to do playback and was even more performant than native H.264 in WebKit, and it also had substantially fewer user interface problems, albeit at the risk of the occasional crash. Flash is no safer now than it was before I banished it from the browser, but I've learned a few lessons from SandboxSafari, of which the PopOutPlayer is actually a descendant. Remembering its window location means the app has to run with your uid (which eliminates SandboxSafari's primary means of protection), but we also have the advantage of only needing to support at most two specific video player applets, so we can design a very restrictive environment that protects the app from being subverted and rejects running video or Flash applets from other sources.

This different type of sandbox implements other restrictions, most notably preventing the applet from going full-screen. This is necessity reborn as virtue: most of our systems would not do well with full-screen playback, let alone HD (which is also blocked/unsupported), and blocking it also prevents a subverted player from monkeying with the rest of your screen. The sandbox doesn't let the applet open URLs either, and has its own Mach exception handler and CGEventTap to filter other possible avenues of exploit. However, that also means you can't do many of the things you would expect to do if the Flash player were embedded in the browser. That won't change.

The window is fixed-size and floats. Allowing the window to resize caused lots of problems because the applets didn't expect to deal with that possibility. You get one size no matter how big your screen is, and you can only close it or move it.

Although the PopOutPlayer can play more YouTube videos than the QuickTime Enabler, there are still many it cannot play, though at least the Flash player will give you some explanation (the Vevo ones are the most notorious). The PopOutPlayer also isn't intended for generic HTML5 video playback; it doesn't replace the QTE in that sense, and the QTE is still the official video solution in general. The application will reject passing it URLs from other video sites.

The really big limitation, however, is that I could not get the Vimeo applet to run on 10.5 using the hacks I devised for 10.4. Leopard WebKit can play some Vimeo videos, so all is not lost, but no matter what I tried the PopOutPlayer simply wouldn't display any video itself. For the time being, Vimeo URLs on Leopard will generate an "Unsupported video URL" message. It is quite possible this will never be fixable with the current method the PopOutPlayer uses for display, so don't count on it being repaired in the future. For that matter, Vimeo on-demand doesn't even work with 10.4.

I consider the PopOutPlayer to be highly experimental, and when (not if) Vimeo and YouTube decommission their Flash players, it will abruptly cease to work without warning. But because I expect that time is coming sooner rather than later, you are welcome to use the PopOutPlayer for as long as it benefits you, and if I can solve some of these issues I might even make it a supported option in the future -- just don't hold your breath.

Download it here. It is unsupported. Source code is not currently available.

So back to TenFourFox. If you would, permit me now to indulge in some gratuitous nerderosity. Part of Operation Short Change was also to explore whether our branch stanza far calls could be made more efficient. Currently, if the target of a branch exceeds the displacement the branch instruction (i.e., b(l) or bc(l)) can encode, we load the target into a general purpose register (GPR), transfer it to the counter register (a special purpose register, or SPR), and branch to that (i.e., lis/ori/mtctr/b(c)ctr). The PowerPC ISA does not allow branching directly to an address in a GPR or FPR, only through the counter register (CTR) or the link register (LR), which are both SPRs.

This would be all well and good except that the G5 groups instructions together for dispatch, and IBM warns that there is a substantial execution penalty if mtctr and b(c)ctr(l) are in the same dispatch group. Since mtspr instructions like mtctr must always lead a dispatch group, the above stanza is guaranteed to put them both in the same one (recall that dispatch groups are no more than four instructions, or five with a branch, with the branch in the last slot). Is it faster to insert nops and accept the code bloat? What about using the link register instead?

It's time for ... assembly language microbenchmarking!

#define REPS 0x4000
_main:
        .globl _main

        mflr r0
        stwu r0, -4(r1)

        li r3,0
        lis r5,REPS
        bl .+4
        ; LR now holds the address of the following mflr
        mflr r4
        addi r4, r4, 8
        ; now r4 points to the addi below

        addi r3,r3,1
        cmp cr0,r3,r5
        beq done
#if USE_LR
        mtlr r4
#if USE_NOPS
        nop
        nop
        nop
        nop
#endif
        blr
#else
        mtctr r4
#if USE_NOPS
        nop
        nop
        nop
        nop
#endif
        bctr
#endif

done:
        lwz r0, 0(r1)
        mtlr r0
        li r3,0
        addi r1, r1, 4
        blr
This just runs a tight loop 1,073,741,824 times, branching to the loop header with either LR or CTR, and with the mtspr instruction separated from the blr/bctr with sufficient nops to put them in separate dispatch groups or not (there must be four to prevent the branch from getting in the terminal branch slot). That gives us four variations to test with a loop so tight the cost of the branch should substantially weigh on total runtime. Let's see what we get. If you're following along on your own Power Mac, compile these like so:
gcc -o lrctr_ctr lrctr.s
gcc -DUSE_NOPS -o lrctr_ctrn lrctr.s
gcc -DUSE_LR -o lrctr_lr lrctr.s
gcc -DUSE_LR -DUSE_NOPS -o lrctr_lrn lrctr.s
If you want to confirm what was actually assembled, you can look at the result with otool -tV.

Our control will be our trusty 1.0GHz iMac G4 (256K L2 cache). There should be no difference between the SPRs, and it all fits into cache and there are no dispatch groups, so if we did this right the runtimes should be nearly identical. In this case we are only interested in the user CPU time (the first field).

luxojr% time ./lrctr_ctr
7.550u 0.077s 0:08.88 85.8%     0+0k 0+2io 0pf+0w
luxojr% time ./lrctr_ctrn
7.550u 0.071s 0:08.55 89.1%     0+0k 0+0io 0pf+0w
luxojr% time ./lrctr_lr
7.550u 0.070s 0:10.08 75.5%     0+0k 0+2io 0pf+0w
luxojr% time ./lrctr_lrn
7.550u 0.069s 0:08.43 90.2%     0+0k 0+0io 0pf+0w
Excellent. Let's run it on the Quad G5 (numbers in reduced performance mode).
bruce% time ./lrctr_ctr
4.298u 0.028s 0:04.33 99.5%     0+0k 0+2io 0pf+0w
bruce% time ./lrctr_ctrn
4.986u 0.035s 0:05.03 99.6%     0+0k 0+2io 0pf+0w
bruce% time ./lrctr_lr
13.755u 0.050s 0:13.82 99.8%    0+0k 0+1io 0pf+0w
bruce% time ./lrctr_lrn
13.752u 0.048s 0:13.82 99.7%    0+0k 0+2io 0pf+0w
Wait, what? Putting mtctr and bctr in the same dispatch group was actually the fastest of these four variations. Not only was using LR slower, it was over three times slower. Even spacing the two CTR instructions apart was marginally worse. Just to see if it was an artifact of throttling, I ran them again in highest performance. Same thing:
bruce% time ./lrctr_ctr
2.149u 0.012s 0:02.16 99.5%     0+0k 0+1io 0pf+0w
bruce% time ./lrctr_ctrn
2.492u 0.010s 0:02.50 100.0%    0+0k 0+0io 0pf+0w
bruce% time ./lrctr_lr
6.876u 0.013s 0:06.89 99.8%     0+0k 0+0io 0pf+0w
bruce% time ./lrctr_lrn
6.874u 0.017s 0:06.89 99.8%     0+0k 0+1io 0pf+0w

I found this so surprising I rewrote it for AIX and put it on my POWER6, which is also dispatch-group based and uses an evolved version of the same instruction pipeline as the POWER4 (from which the G5 is derived). And, well ...

uppsala% time ./lrctr_ctr
3.752u 0.001s 0:04.41 85.0%     0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_ctrn
4.064u 0.001s 0:04.94 82.1%     0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_lr
13.499u 0.001s 0:16.19 83.3%    0+1k 0+0io 0pf+0w
uppsala% time ./lrctr_lrn
13.215u 0.001s 0:15.66 84.3%    0+1k 0+0io 0pf+0w
When interpreting the raw execution times, bear in mind that I run this POWER6 throttled in ASMI to reduce power consumption, and that its microarchitecture differs in other respects; even so, the same relative runtimes hold. It's actually not faster to space apart the mtctr and its branch (that is, if there's nothing better you could be doing with the interval -- see below), and CTR is the best SPR to use for branching on the G5 regardless of any penalty paid. It may well be that using LR this way fouls the CPU's link stack (its return-address prediction) and thus tanks runtime, but whatever the explanation, using it for far calls is clearly the worst option.

Now, as you may have guessed, I've deliberately presented a false choice here because all four of these options are patently pathological. The optimal instruction sequence would be to schedule some work between the mtctr and the bctr. We don't have much work to, uh, work with here, but here's one way.

#define REPS 0x4000
_main:
        .globl _main

        mflr r0
        stwu r0, -4(r1)

        li r3,0
        lis r5,REPS
        bl .+4
        ; LR now holds the address of the following mflr
        mflr r4
        addi r4, r4, 8
        ; now r4 points to the mtctr below

        mtctr r4
        addi r3,r3,1
        cmp cr0,r3,r5
        beq done
        bctr

done:
        lwz r0, 0(r1)
        mtlr r0
        li r3,0
        addi r1, r1, 4
        blr

bruce% gcc -o ctrop ctrop.s
bruce% time ./ctrop
4.299u 0.028s 0:04.34 99.3%     0+0k 0+1io 0pf+0w
Almost identical runtimes (in reduced mode), but the beq takes the branch slot away from the bctr, guaranteeing the mtctr and the bctr are split into two dispatch groups without having to space them out with nops. But inexplicably, if you coalesce the beq/bctr pair into a single bnectr (which puts the CTR branch right back into the branch slot of the mtctr's dispatch group), it gets even faster:
bruce% time ./ctrop2
3.824u 0.039s 0:04.02 95.7%     0+0k 0+0io 0pf+0w
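The ctrop2 source isn't shown above; its loop body presumably differs from ctrop's only in the final two instructions, roughly like this sketch:

        mtctr r4            ; r4 still points at this mtctr
        addi r3,r3,1
        cmp cr0,r3,r5
        bnectr              ; not finished: branch back through CTR; otherwise fall through to done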
Is this optimal on the G4 and POWER6 as well?
luxojr% time ./ctrop
6.472u 0.064s 0:12.05 54.1%     0+0k 0+2io 0pf+0w
luxojr% time ./ctrop2
6.471u 0.063s 0:09.33 69.9%     0+0k 0+0io 0pf+0w

uppsala% time ./ctrop
3.918u 0.001s 0:04.61 84.8%     0+1k 0+0io 0pf+0w
uppsala% time ./ctrop2
2.612u 0.001s 0:03.09 84.4%     0+1k 0+0io 0pf+0w
Yup, it is, though it's worth noting the G4 did not improve with the bnectr. (This is still pathological, mind you: the best instruction sequence would be simply addi/cmp/bne, which the G5 in reduced mode runs in 2.580u 0.029s and the POWER6 in 1.261u 0.001s, reclaiming its speed crown. But when you have a far call, you don't have a choice.)
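
For completeness, that addi/cmp/bne version needs no SPR at all; the loop is just a direct conditional branch back to a local label, along these lines:

loop:
        addi r3,r3,1
        cmp cr0,r3,r5
        bne loop            ; ordinary short-displacement conditional branch, no CTR or LR involved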

The moral of the story? Don't fix what ain't broke.

Friday, January 20, 2017

45.7.0 available for realsies

Let's try that again. TenFourFox 45.7.0 is now available for testing ... again (same download location, same release notes, new hashes), and as before, will go live late Monday evening if I haven't been flooded out of my house by the torrential rains we've been getting in currently-not-so-Sunny So Cal. You may wish to verify you got the correct version by manually checking the hash on the off-chance the mirrors are serving the old binaries.

Saturday, January 14, 2017

45.7.0 available (also: Talos fails)

TenFourFox 45.7.0 is now available for testing. In addition to reducing the layout paint delay I also made some tweaks to garbage collection by removing some code that isn't relevant to us, including some profile accounting work we don't need to bother computing. If there is a request to reinstate this code in a non-debug build we can talk about a specific profiling build down the road, probably after exiting source parity. As usual the build finalizes Monday evening Pacific time. (Update: I didn't notice that the release had been pushed forward another week, to January 24, so if additional security patches land, there will be a respin. Further update: there will indeed be a respin this weekend, and the download links have been invalidated and cancelled.)

For 45.8 I plan to start work on the built-in user-agent switcher, and I'm also looking into a new initiative I'm calling "Operation Short Change" to wring even more performance out of IonPower. Currently, the JavaScript JIT's platform-agnostic section generates simplistic, unoptimized generic branches. Since these generic branches could call any code at any displacement, and PowerPC conditional branch instructions have only a limited number of displacement bits, we pad the branches with nops (i.e., nop/nop/nop/bc) so they can be patched up later into a full-displacement branch (lis/ori/mtctr/bcctr) if the target turns out to be far away. This technique of "branch stanzas" dates back all the way to the original nanojit we had in TenFourFox 4; Ben Stuhl did a lot of optimization work on it for our JaegerMonkey implementation, and it survived nearly unchanged in PPCBC and, in a somewhat modified form, in IonPower-NVLE today.
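
To make that concrete, here is a sketch (not actual IonPower output; the scratch register and target address are placeholders) of a stanza as emitted and the same stanza after being patched into a far call:

        ; as emitted, when the target fits in the conditional branch's 16-bit displacement
        nop                 ; padding reserved for later patching
        nop
        nop
        beq target          ; the short conditional branch does all the work

        ; as patched, when the target turns out to be out of range
        lis r0,0x1234       ; upper halfword of the absolute target address (placeholder)
        ori r0,r0,0x5678    ; lower halfword (placeholder)
        mtctr r0            ; move it to the counter register
        beqctr              ; conditional branch through CTR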

However, many of the generic branches the Ion code generator creates jump to code that is always just a few instruction words away, and the distance between them never changes. These locations are predictable, and having a full branch stanza in those cases wastes memory and instruction cache space; fortunately we already have machinery to create these fixed "short branches" in our PPC-specific code generator, and now it's time to further modify Ion to generate them in the platform-agnostic segment as well. At the same time, since we don't generally use LR as an actual link register due to a side effect of how we branch, I'm going to investigate whether LR is faster for long branches than CTR (i.e., lis/ori/mtlr/b(c)lr instead of mtctr/b(c)ctr). On the G5 I expect it probably will be, because having mtlr and blr/bclr in the same dispatch group doesn't seem to incur the same penalty that mtctr and bctr/bcctr in the same dispatch group do. (Our bailouts do use LR, but in an indirect form that intentionally clobbers the register anyway, so saving it is unimportant.)

On top of all that there is also the remaining work on AltiVec VP9 and some other stuff, so it's not like I won't have anything to do for the next few weeks.

On a more disappointing note, the Talos crowdfunding campaign for the most truly open, truly kick-*ss POWER8 workstation you can put on your desk has run aground, "only" raising $516,290 of the $3.7m goal. I guess it was just too expensive for enough people to take a chance on, and in fairness I really can't fault folks for having a bad case of sticker shock with a funding requirement as high as they were asking. But you get the computer you're willing to pay for. If you want a system made cheaper by economies of scale, then you're going to get a machine that doesn't really meet your specific needs because it's too busy not meeting everybody else's. Ultimately it's sad that not enough people put their money where their mouths were, because for maybe double-ish the cost of the mythical updated Mac Pro Tim Cook doesn't see fit to make, you could have had a truly unencumbered machine that really could compete on performance with x86. But now we won't. And worst of all, I think this will scare off other companies from even trying.

Tuesday, January 10, 2017

Not dead, didn't perish in an airline crash over the Pacific

Yes, I'm alive, and yes, I'm back at Floodgap orbiting headquarters. Meanwhile, candidate builds for TenFourFox 45.7 are scheduled for this weekend. Since no one has voiced any problems, the change to nglayout.initialpaint.delay mentioned in the prior post (to 100ms) will take effect. If this caused adverse issues for you, speak now, or forever hold your peace right up until you post frantic bug reports.

Saturday, December 10, 2016

45.6.0 final available

TenFourFox 45.6.0 final is available (downloads, hashes, release notes), with everything in the release notes including the crash fix for 45.5.1 and that layout performance patch Mozilla added as part of another fix. Please test on your systems, with the expectation it will go live late Monday Pacific time as usual.

45.7 will not have substantial changes and I don't anticipate doing a beta. However, one change I do intend to make is to mirror Mozilla's work on updating default settings, starting with the layout paint delay. The rationale for specifically delaying layout painting was to wait for sufficient data to come through rather than guessing at an incorrect layout from incomplete data that then has to be invalidated: without the delay, although the screen would be busier, the browser often ended up spending more total wall-clock time on wasted work. Now that data arrives on most people's systems much faster than in the days of dialup and low-speed DSL, it's time for these older settings from another age to be re-examined, and the paint delay is probably the most visible of them.

Stuff like that has long been part of the various unofficial Firefox "optimization guides" that circulate, including Erik's set for TenFourFox. I have generally avoided commenting on his recommendations (except for a couple that I knew would be a net negative for most users) because as far as I'm concerned, it's your computer and you can tune it as you like -- just don't file bug reports if you muck it up, because some of those settings have undesirable side effects in edge cases. At the same time, I have declined to move too far from the Firefox base settings, because the browser out-of-the-box has to work for as many systems as possible in as many situations as possible, and one thing unique to us is that we still have a substantial minority of users running Power Macs on dialup networks. One user who sticks out in my mind is a missionary in the mountains north of Myanmar who is completely reliant on the modem in his G4 mini. We don't want to unnecessarily tank these users with settings that are overly optimistic about bandwidth availability, so whatever setting Mozilla determines for Firefox users at large may not be the best fit for our legacy population.

In bug 1283302, Mozilla settled on 5ms for desktop users (which is actually smaller than a refresh tick, so near as I can determine it might as well be zero) and left Firefox for Android at 250ms. Since we're not in the same processing class as current machines by a long shot, and we still need to support users with limited bandwidth, I think a safer setting will be 100ms, which, as an otherwise arbitrary number, did not seem to regress anything on the local machines. If you want to try this, go into about:config, create a numeric pref nglayout.initialpaint.delay if it does not already exist, and enter a value of 100. Ideally it might be nice to have such settings specific to each architecture build and tuned accordingly, but that's something to consider at a later time. If you have other reasonable recommendations for this setting, do post them in the comments, along with the specifications of the system and network you tried it on. I will consider other changes in future versions as Mozilla re-examines them internally.

Meanwhile, I'll be on a plane to Australia next week on what may be my last Spaceseat flight on Air New Zealand, which I loved. Before I go, however, I'll be stopping by my parents' house to look at the dual 2.5GHz G5 they use for uploading their church videos. My suspicion is the liquid cooling system blew. Currently I am sad.

Saturday, December 3, 2016

45.6.0b1 available, plus sampling processes for fun and profit

Test builds for TenFourFox 45.6.0 are available (downloads, hashes, release notes). The release notes indicate that the definitive crash fix in Mozilla bug 1321357 (i.e., the definitive fix for the issue mitigated in 45.5.1) is in this build; it is not, but it will be in the final release candidate. 45.6.0 includes the removal of HiDPI support, which also allowed some graphical optimizations from which the iMac G4 particularly benefits; the expansion of the JavaScript JIT's non-volatile general purpose register file; an image-heavy scrolling optimization that landed too late for the 45ESR cut, which I pulled down; the removal of telemetry from user-facing chrome JS; and various minor fixes to the file requester code. An additional performance improvement will be landed in 45ESR by Mozilla as a needed prerequisite for another fix; that will also appear in the final release. Look for the release candidate sometime next week, with release to the public late December 12 as usual, but for now, please test the new improvements so far.

There is now apparently a potential workaround for those of you still having trouble getting the default search engine to stick. I still don't have a good theory for what's going on, however, so if you want to try the workaround, please read my information request and post the requested information about your profile from before and after, so we can see what the suggested workaround actually affects.

I will be in Australia for Christmas and New Year's visiting my wife's family, so additional development is likely to slow over the holidays. Higher-priority items coming up will be implementing user agent support in the TenFourFox prefpane, adding some additional HTML5 features, and possibly excising telemetry from garbage and cycle collection, but probably for 45.8 instead of 45.7. I'm also looking at adding some PowerPC-specialized code sections to the platform-independent Ion code generator to see if I can crank up JavaScript performance some more, and possibly some additional work on the AltiVec VP9 codec for VMX-accelerated intraframe prediction. I'm also considering adding AltiVec support to the Theora (VP3) decoder; even though its much lighter processing requirements yield adequate performance on most supported systems, it could be a way to get higher-resolution video workable on lower-spec G4s.

One of the problems with our use of a substantially later toolchain is that (in particular) debugging symbols from later compilers are often gibberish to older profiling and analysis tools. This is why, for example, we have a customized gdb, or debugging at even a basic level wouldn't be possible. If you're really a masochist, go ahead and compile TenFourFox with the debug profile and then try to use a tool like sample or vmmap, or even Shark, to analyze it. If you're lucky, the tool will just freeze. If you're unlucky, your entire computer will freeze or go haywire. I can do performance analysis on a stripped release build, but this yields sample backtraces which are too general to be of any use. We need some way of getting samples off a debug build but not converting the addresses in the backtrace to function names until we can transfer the samples to our own tools that do understand these later debugging symbols.

Apple's open source policy is problematic -- they'll open source the stuff they have to, and you can get at some components like the kernel this way, but many deep dark corners are not documented and one of those is how tools like /usr/bin/sample and Shark get backtraces from other processes. I suspect this is so that they can keep the interfaces unstable and avoid abetting the development of applications that depend on any one particular implementation. But no one said I couldn't disassemble the damn thing. So let's go.

(NB: the below analysis is based on Tiger 10.4.11. It is possible, and even likely, the interface changed in Leopard 10.5.)

With Depeche Mode blaring on the G5, because Dave Gahan is good for debugging, let's look at /usr/bin/sample since it's a much smaller nut to crack than Shark.

% otool -L /usr/bin/sample
/usr/bin/sample:
         /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 567.29.0)
         /System/Library/PrivateFrameworks/vmutils.framework/Versions/A/vmutils (compatibility version 1.0.0, current version 93.1.0)
         /usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
         /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.3.4)

Interesting! A private framework! Let's see what Objective-C calls we might get (which are conveniently text strings).

% strings /usr/bin/sample |& more
__dyld_make_delayed_module_initializer_calls
__dyld_image_count
__dyld_get_image_name
__dyld_get_image_header
__dyld_NSLookupSymbolInImage
__dyld_NSAddressOfSymbol
libobjc
__objcInit
__dyld_mod_term_funcs
release
printStatistics
writeOutput:append:
stopSampling
sampleForDuration:interval:
preloadSymbols
initWithPid:symbolRichBinaries:
alloc
intValue
UTF8String
removeObjectAtIndex:
objectAtIndex:
count
indexOfObject:
arrayWithArray:
arguments
processInfo
forceStop
NSSampler
NSMutableArray
NSProcessInfo
NSAutoreleasePool
Interrupted
Not currently sampling -- exiting immediately.
-wait
-mayDie
-file
Waiting for '%s' to appear...
%s appeared.
%s cannot find a process you have access to which has a name like '%s'
sample
Sampling process %d each %u msecs %u times
syntax: sample <pid/partial name> <duration (secs)> { <msecs between samples> } <options>
options: {-mayDie} {-wait} {-subst <old> <new>}*
-file filename specifies where results should be written
-mayDie reads symbol information right away
-wait wait until the process named (usually by partial name) exists, then start sampling
-subst can be used to replace a stripped executable by another
Note that the program must have been started using a full path, rather than a relative path, for analysis to work, or that the -subst option must be specified
setObject:forKey:
dictionary
autorelease
mutableCopy
NSMutableDictionary
%s cannot examine process %d for unknown reasons, even though it appears to exist.
%s cannot examine process %d because the process does not exist.
%s cannot examine process %d (with name like %s) because it no longer appears to be running.
%s cannot examine process %d because you do not have appropriate privileges to examine it.
%s cannot examine process %d for unknown reasons.
-subst

Most of that looks like fairly straightforward Objective-C stuff, but what's NSSampler? That's not documented anywhere. Magic Hat can't find it either with the default libraries, but it does if we add those private frameworks. If I use class-dump (3.1.2 works with 10.4), I can get a header file with its methods and object layout. (The header file it generates is usually better than Magic Hat's, since Magic Hat sorts things in alphabetical rather than memory order, which would be a problem shortly.) Edited down, it looks like this. (I added the byte offsets, which are only valid for the 32-bit PowerPC OS X ABI.)

@interface NSSampler : NSObject

/*
{
00 BOOL _stop;
04 BOOL _stopped;
08 unsigned int _task;
12 int _pid;
16 double _duration;
24 double _interval;
32 NSMutableArray *_sampleData;
36 NSMutableArray *_sampleTimes;
40 double _previousTime;
48 unsigned int _numberOfDataPoints;
52 double _sigma;
60 double _max;
68 unsigned int _sampleNumberForMax;
72 ImageSymbols *_imageSymbols;
76 NSDictionary *_symbolRichBinaryMappings;
80 BOOL _writeBadAddresses;
84 TaskMemoryCache *_tmc;
88 BOOL _stacksFixed;
92 BOOL _sampleSelf;
96 struct backtraceMagicNumbers _magicNumbers;
}
*/

- (void) _cleanupStacks;
- (void) _initStatistics;
- (void) _makeHighPriority;
- (void) _makeTimeshare;
- (void) _runSampleThread: (id) parameter1;
- (void) dealloc;
- (void) finalize;
- (void) forceStop;
- (void) getStatistics: (void*) parameter1;
- (id) imageSymbols;
- (id) initWithPid: (int) parameter1;
- (id) initWithPid: (int) parameter1 symbolRichBinaries: (id) parameter2;
- (id) initWithSelf;
- (void) preloadSymbols;
- (void) printStatistics;
- (id) rawBacktraces;
- (void) sampleForDuration2: (double) parameter1 interval: (double) parameter2;
- (void) sampleForDuration: (unsigned int) parameter1 interval: (unsigned int) parameter2;
- (int) sampleTask;
- (void) setImageSymbols: (id) parameter1;
- (void) startSamplingWithInterval: (unsigned int) parameter1;
- (void) stopSampling;
- (id) stopSamplingAndReturnCallNode;
- (void) writeBozo;
- (void) writeOutput: (id) parameter1 append: (char) parameter2;

@end

Okay, so now we know what methods are there. How does one call this thing? Let's move to the disassembler. I'll save you my initial trudging through the machine code and get right to the good stuff. I've annotated critical parts below from stepping through the code in the debugger.

% otool -tV /usr/bin/sample
/usr/bin/sample:
(__TEXT,__text) section
00002aa4        or      r26,r1,r1        << enter
00002aa8        addi    r1,r1,0xfffc
00002aac        rlwinm  r1,r1,0,0,26
00002ab0        li      r0,0x0
00002ab4        stw     r0,0x0(r1)
:
:
:
00003260        b       0x3310
00003264        bl      0x3840  ; symbol stub for: _getgid
00003268        bl      0x37d0  ; symbol stub for: _setgid

This looks like something that's trying to get at a process. Let's see what's here.

0000326c        lis     r3,0x0
00003270        or      r4,r30,r30
00003274        addi    r3,r3,0x3b9c
00003278        or      r5,r29,r29
0000327c        or      r6,r26,r26
00003280        bl      0x37c0  ; symbol stub for: _printf$LDBL128 // "Sampling process ..."
00003284        lbz     r0,0x39(r1)
00003288        cmpwi   cr7,r0,0x1
0000328c        bne+    cr7,0x32a0 // jumps to 32a0
:
:
:
000032a0        lis     r4,0x0
000032a4        lwz     r3,0x0(r31)
000032a8        or      r5,r25,r25
000032ac        lwz     r4,0x5010(r4) // 0x399c "sampleForDuration:..."
000032b0        or      r6,r23,r23
000032b4        bl      0x3800  ; symbol stub for: _objc_msgSend
000032b8        lis     r4,0x0
000032bc        lwz     r3,0x0(r31)
000032c0        lwz     r4,0x500c(r4) // 0x946ba288 "stopSampling"
000032c4        bl      0x3800  ; symbol stub for: _objc_msgSend
000032c8        lis     r4,0x0
000032cc        lwz     r3,0x0(r31)
000032d0        lwz     r4,0x5008(r4) // 0x3978 "writeOutput:..."
000032d4        or      r5,r22,r22
000032d8        li      r6,0x0
000032dc        bl      0x3800  ; symbol stub for: _objc_msgSend

That seems simple enough. The tool allocates and initializes an NSSampler object, (we assume) sets it up with [sampler initWithPid], calls [sampler sampleForDuration], then [sampler stopSampling], and finally [sampler writeOutput] to write out the result.

This is not what we want to do, however. What I didn't see in either the disassembly or the class description was an explicit step to convert addresses to symbols, which is what we want to avoid. We might well suspect -(void) writeOutput is doing that, and if we put together a simple-minded program to make these calls as sample does, we indeed get a freeze when we try to write the output. We want to get to the raw addresses instead, but Apple doesn't provide any getter for those tantalizing NSMutableArrays containing the sample data.

Unfortunately for Apple, class-dump gave us the structure of the NSSampler object (recall that Objective-C objects are really just structs with delusions of grandeur), and conveniently those object pointers are right there, so we can pull them out directly! Since they're just NSArrays, hopefully they're smart enough to display themselves. Let's see. (In the below, replace XXX with the process you wish to spy on.)

/* gcc -g -o samplemini samplemini.m \
    -F/System/Library/PrivateFrameworks \
    -framework Cocoa -framework CHUD \
    -framework vmutils -lobjc */

#include <Cocoa/Cocoa.h>
#include "NSSampler.h"

int main(int argc, char **argv) {
    NSSampler *sampler;
    NSMutableArray *sampleData;
    NSMutableArray *sampleTimes;
    uint32_t count, sampleAddr;
    NSAutoreleasePool *shutup = [[NSAutoreleasePool alloc] init];

    sampler = [[NSSampler alloc] init];
    [sampler initWithPid:XXX]; // you provide
    [sampler sampleForDuration:10 interval:10]; // 10 seconds, 10 msec
    [sampler stopSampling];

    // break into the NSSampler struct
    sampleAddr = (uint32_t)sampler;
    count = *(uint32_t *)(sampleAddr + 48);
    fprintf(stdout, "count = %i\n", count);
    sampleData = (NSMutableArray *)*(uint32_t *)(sampleAddr + 32);
    sampleTimes = (NSMutableArray *)*(uint32_t *)(sampleAddr + 36);
    fprintf(stdout, "%s", [[sampleData description] cString]);
    fprintf(stdout, "%s", [[sampleTimes description] cString]);

    [sampler dealloc];
    return 0;
}
Drumroll please.

count = 519
(
    <NSStackBacktrace: Thread 1503: 0x9000af48 0xefffdfd0 0x907de9ac 0x907de2b0 0x932bcb20 0x932bc1b4 0x932bc020 0x937a1734 0x937a13f8 0x06d53d3c 0x9379d93c 0x06d57bc8 0x07800f48 0x0785f004 0x0785f9cc 0x0785fd20 0x00004ed4 0x00001d5c 0x00001a60 0x9000ae9c 0xffffffe1 > ,
    <NSStackBacktrace: Thread 1603: 0x9002ec8c 0x00424b10 0x05069cb4 0x0504638c 0x050490e0 0x05056600 0x050532cc 0x9002b908 0x0506717c 0x0000016b > ,
    <NSStackBacktrace: Thread 1703: 0x9002bfc8 0x90030a7c 0x015a0b84 0x04d4d40c 0x015a1f18 0x9002b908 0x90030aac 0xffffffdb > ,
:
:
:
)(
    0.01796096563339233,
    0.01785099506378174,
    0.01814299821853638,
    0.01780200004577637,
:
:
:
)

We now have the raw backtraces and the timings, in fractions of a second. There is obviously much more we can do with this, and subsequent to my first experiment I improved the process further, but this suffices for explaining the basic notion. In a future post we'll look at how we can turn those addresses into actual useful function names, mostly because I have a very hacky setup to do so right now and I want to refine it a bit more. :) The basic idea is to get the map of where dyld loaded each library in memory and then compute which function is running from the sampled address's offset within that map. /usr/bin/vmmap would normally be the tool we'd employ to do this, but it barfs on TenFourFox too. Fortunately our custom gdb7 can get such a map, at least on a running process. More on that later.

One limitation is that NSSampler doesn't seem able to get samples more frequently than every 15ms or so from a running TenFourFox process, even if you ask. I'm not sure yet why this is, since other processes have substantially less overhead; it could be thread-related. Also, even though NSSampler accepts an interval argument, it will grab samples as fast as it can no matter what that interval is. When run against Magic Hat as a test it grabbed them as fast as every 0.1ms, so stand by for lots of data!

Incidentally, this is apparently not what Shark does; Shark uses the later PerfTool framework and an object called PTSampler to do its work instead of vmutils. Although it has analogous methods, the structure of PTSampler is rather more complex than NSSampler's and I haven't fully explored its depths. Nevertheless, when it works, Shark can get much more granular samples of processor activity than NSSampler, so it might be worth looking into for a future iteration of this tool. For now, I can finally get backtraces I can work with, and as a result, hopefully some very tricky problems in TenFourFox might actually get solved in the near future.

Thursday, December 1, 2016

45.5.1 available, and 32-bit Intel Macs go Tier-3

Test builds for 45.5.1, with the single change being the safety fix for the Firefox 0-day in bug 1321066 (CVE-2016-9079), are now available. Release notes and hashes to follow when I'm back from my business trip late tonight. I will probably go live on this around the same time, so please test as soon as you can.

In other news, the announcement below was inevitable after Mozilla dropped support for 10.6 through 10.8, but for the record (from BDS):

As of Firefox 53, we are intending to switch Firefox on mac from a universal x86/x86-64 build to a single-architecture x86-64 build.

To simplify the build system and enable other optimizations, we are planning on removing support for universal mac build from the Mozilla build system.

The Mozilla build and test infrastructure will only be testing the x86-64 codepaths on mac. However, we are willing to keep the x86 build configuration supported as a community-supported (tier 3) build configuration, if there is somebody willing to step forward and volunteer as the maintainer for the port. The maintainer's responsibility is to periodically build the tree and make sure it continues to run.

Please contact me directly (not on the list) if you are interested in volunteering. If I do not hear from a volunteer by 23-December, the Mozilla project will consider the Mac-x86 build officially unmaintained.

The precipitating event for this is the end of NPAPI plugin support in Firefox 52 (see? TenFourFox was ahead of the curve!) -- for every plugin, annoyingly, except Flash. The only major reason 32-bit Mac Firefox builds weren't ended with the removal of 10.6 support (10.6 being the last version of Mac OS X that could run on a 32-bit Intel Mac) was for those 64-bit Macs that had to run a 32-bit plugin. Since no plugins but Flash are supported anymore, and Flash has been 64-bit for some time, that's the end of that.

Currently we, as OS X/ppc, are also a Tier-3 configuration, at least for as long as we maintain source parity with 45ESR. Mozilla has generally been careful not to intentionally break TenFourFox, and the 32-bit x86 situation would probably be easier than ours. That said, candidly, I can only think of two non-exclusive circumstances where maintaining the 32-bit Intel Mac build would be advantageous, and they're both bigger tasks than simply building the browser for 32 bits:

  • You still have to run a 32-bit plugin like Silverlight. In that case, you'd also need to undo the NPAPI plugin block (see bug 1269807) and everything underlying it.
  • You have to run Firefox on a 32-bit Mac. As a practical matter this would essentially mean maintaining support for 10.6 as well -- roughly option 4 from when we discussed this in a prior blog post, with the added complexity of having to pull the legacy Snow Leopard support forward over a complete ESR cycle. This is non-trivial, but hey, we've done just that over six ESR cycles, although we had the advantage of doing it incrementally.

I'm happy to advise anyone who wants to take this on but it's not something you'll see coming from me. If you decide you'd like to try, contact Benjamin directly (his first name, smedbergs, us).