Friday, January 21, 2011

Their betas go to 11

Mozilla wants to freeze beta 10 this Friday and push it out next week, then freeze beta 11 the Friday of next week, and push beta 11 out the week after that (see this announcement). Besides my moment to say I told you so, two back-to-back betas in as many weeks is not a good use of our more limited resources. If Mozilla pushes out beta 10 on schedule, which may in fact happen based on some of the rumblings I hear around Mountain View, then we will sit out beta 10 and go to beta 11. If beta 10 is delayed longer than next week, we will continue with beta 10. Watch this space or mozilla.dev.planning.

Wednesday, January 19, 2011

Plugins unplugged

For those of you who use Classilla on OS 9, you already know that Classilla allows you to turn plugins off, and in fact encourages it. It makes the browser more secure by not exposing old unmaintained native code, and it makes the browser more stable by not exposing old unmaintained broken code. If you really really really want to use Flash Player 7 to view something, or QT 6.0.3, you can, but you can turn it right back off again. When I am in OS 9, I just keep plugins off entirely. Plugin support for OS 9 has obviously not been updated in years.

The time is now coming for this state of affairs on PowerPC. Adobe has been making noises about ending PPC support for the Flash Player plugin for some time now, and their preliminary system requirements page for Flash Player 10.2, which is now in release candidate stage, finally makes this plain: PowerPC is no longer a supported platform for Flash, although they still support Tiger on Intel. When you download Flash 10.2, it's x86 or bust, similar to their dropping PowerPC from AIR support with 2.0.

Most of us know that Flash was never well-optimized for PowerPC anyway. Even on a fast G5 it chugs, and you can forget about full-screen or HD playback on G4s, let alone G3 systems. Losing Flash is not going to be a massive functionality issue for a while, because much of the advanced functionality 10.2 introduces will not be widely adopted until 10.2 is. However, it is going to be an issue for security. Flash is a large potential and practical security hole, and while many of the exploits fortunately rely on x86 native code, some do not. Moreover, Flash-based privacy intrusions, while not bugs or security holes per se, have long approached infamy in their quantity and ubiquity. While my insiders tell me that Flash 10.2 is light-years ahead of 10.1 in performance, that certainly does not help us on our minority, soon-to-be-unsupported architecture.

While Flashblock (at least until beta 9 ... fix it!) helped to block most Flash applets, and tools like AdBlock and NoScript can also serve in this capacity, the time is coming when TenFourFox will go the way of Classilla and block Flash and other plugins at the layout level. This has obvious benefits for security and performance, at the cost of losing Flash. Steve Jobs, bless him, was right to say that HTML5 supersedes many of the features Flash has long been (ab)used for -- not all, but many -- and I think that if we deal with this pain now, we will be better able to adapt. More to the point, Mozilla is moving towards plugin and graphics NPAPI models that no longer use QuickDraw (deprecated since 10.4, and almost certainly further marginalized in 10.7), and this will be an issue for 10.4, which lacks CoreAnimation. In the future it may not be technically possible to still use Mozilla's code base and maintain compatibility with the aging plugin base on PowerPC and Tiger, to say nothing of the attendant disadvantages of continuing to use those plugins.

This post is more of a warning than an announcement. While I plan to ship TenFourFox 4.0 with plugin support, albeit deprecated, by 10.4Fx "4.1" plugins will likely be pref-ed off, requiring you to re-enable them manually through about:config. At some point in the future, when it is no longer technically feasible to support them, that code will be completely removed. Hopefully by then the impact of iOS will have brought about a world where Flash-free alternatives that are natively supported by TenFourFox and Gecko 2.0 are easily accessible.

Monday, January 17, 2011

Lies, damn lies, statistics, and the G5: a beta 9 postmortem

Beta 9 has been an ... interesting release. Besides a serious blocker that escaped Mozilla's notice, causing large portions of some pages not to render at all (bug 623852), and a flickery title bar, which is also a Mozilla bug (partially fixed for Tiger in issue 16; see also bug 621762), we are also seeing significantly faster screen display times and the first community statistics for the nanojit.

For those who have been sleeping under a bridge and not reading this blog or the last couple entries, the nanojit is a component of Mozilla's JavaScript interpreter that turns execution "traces" into machine language, a limited and specific form of Just-In-Time compilation (hence the name). When the trace is executed again, the compiled code is run instead, which -- at least in theory -- should be significantly faster. The last couple entries go into this in nauseating detail, so read them if you'd like to get up to speed on the concept.
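
If you prefer code to prose, here is a minimal conceptual sketch in C++ of the basic idea. This is emphatically not SpiderMonkey's actual code; the names, the threshold and the structure are all invented for illustration.


// Conceptual sketch only: a tracing JIT counts visits to a loop header,
// records and compiles the opcodes once the loop is "hot", and reuses the
// compiled trace on later visits. None of these names exist in SpiderMonkey.
#include <map>

typedef void (*NativeTrace)();                       // entry to compiled code

static void compiledStandIn() {}                     // stands in for generated machine code
static NativeTrace recordAndCompileTrace(const void*) { return compiledStandIn; }
static void interpretLoop(const void*) {}            // stands in for the interpreter

static std::map<const void*, int>         hits;      // loop header -> times seen
static std::map<const void*, NativeTrace> traces;    // loop header -> compiled trace
static const int HOT_THRESHOLD = 2;                  // hypothetical "hot" cutoff

void onLoopHeader(const void* header) {              // called at each loop header
    std::map<const void*, NativeTrace>::iterator it = traces.find(header);
    if (it != traces.end())
        it->second();                                // already compiled: run native code
    else if (++hits[header] >= HOT_THRESHOLD)
        traces[header] = recordAndCompileTrace(header);  // hot: record and compile
    else
        interpretLoop(header);                       // not hot yet: interpret as usual
}

int main() {
    int dummy;                                       // fake loop header identity
    for (int i = 0; i < 5; i++)
        onLoopHeader(&dummy);                        // interprets, compiles on the second visit, then reuses
    return 0;
}
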

The nanojit was developed on my quad G5, and the timings were done on it in the expectation that the G5's clock speed, out-of-order execution hardware (being a POWER4 disguised as a PowerPC) and fat memory bandwidth would make it the fastest performer. Full-tilt, running at Highest power, the native, non-accelerated JavaScript interpreter in TenFourFox b8 and b9 chugs through SunSpider in around 3500ms. This is about 200ms faster than Firefox 3.6, and is clearly the fastest pure-interpreter score of any Power Mac in Firefox. When the nanojit is turned on for the G5, however, this number actually gets worse -- to around 5000ms. On Dromaeo, the nanojit is slightly slower overall, dragged down by the SunSpider score. Given that the G5 has all this awesome hardware in-core to execute code as fast as possible, I concluded that the nanojit would not be as valuable on PowerPC as it has been on other platforms, since the G3 and G4 don't even have that extra silicon. I left the code in there for people to play with and turn the pref on or off (see the Release Notes for how to do so), since Dromaeo showed it to be valuable in certain cases and perhaps I could deal with them predictively.

Well, the tables have turned. Thanks to user PoLiYa, who wrote in to tell me about his fabulous SunSpider score with his PowerBook G4/1.5GHz and the nanojit turned on: a whopping 2866ms, down from around 6400ms. That's not a typo; his 1.5GHz G4 benched faster than a 2.5GHz G5 on SunSpider with the nanojit. I thought this had to be a mistake, so I got out my iBook/1.33GHz and repeated the test. It was no mistake. On the G4, the nanojit not only worked, it easily cut SunSpider scores in half, and Dromaeo was also about double the speed. User agg23 over at 68KMLA tried it on his G3 and saw a similar speed improvement, from 16000ms down to around 8900ms. Try it yourself and post your stats in the comments.

There is a penalty to pay for this: since a trace only pays off when it is cached, the browser does indeed cache it to take advantage of it as much as possible, and this in turn causes greater memory pressure. Benchmarks really make the increased memory usage noticeable, but even stock Firefox can bloat a little with regular use despite its aggressive garbage collection algorithms. This was never an issue for PowerPC Firefox before, because it never had a nanojit, but now it does. (This does not happen when the nanojit is disabled, which is the default state in b9 right now.)

There is at least one bug I am aware of in the nanojit that can cause crashes or weird behaviour, but it is infrequently triggered and I have already fixed it for beta 10.

So, now that we know it is highly beneficial for all but the G5, the plan for beta 10 is to ship the G3 and G4 versions (7400 testing is pending) with the nanojit turned on for content (web pages). If this works, it may also be enabled for chrome (i.e., the browser's own JavaScript and add-on JavaScript), but we'll start small to make sure there are no unpleasant surprises.

Where did the G5 go wrong? The most reasonable explanation is that the G5 pays a heavier penalty for stalls and/or memory access than the other processors, and that the PowerPC nanojit's built-in optimizer negates the 970's out-of-order advantage because it is already putting the instructions in as optimal an execution order as possible (given the relatively poor optimizability of the code to start with; see the blog entry two back about that). Indeed, there are quite a few stalls when you load the generated code into Shark and analyze it, but these stalls occur for the G3 and G4 too; therefore, they simply must hurt more on the G5, and the longer pipeline is the likely culprit. It is worth noting that the G5's pipeline is even longer than that of the POWER4 from which it descends, purely in the service of higher clock speeds. Coincidentally enough, I recently acquired a POWER6 as a household server, and it will be interesting to see how well it does with this code when I get Firefox 4 compiled on it. As it stands right now, however, the nanojit will be disabled for the G5 by default. In the near future I will experiment with modifications to the JavaScript tracer algorithm to see if I can predict those specific code situations where the nanojit pays off for the G5 too. You can still turn it on, but it will be shipped with the prefs off in that specific build.

Assuming Mozilla does not come out with a beta 11, beta 10 will be the final TenFourFox beta. I'm not planning to release a release candidate; we don't have the space, and I'm not all that certain b10 will be the last Firefox beta either. If it is, we will follow up b10 with a formal release and welcome the masses to our fold. Tell your friends: we're bringing our beloved old Macs into the next browser generation. More innovations to come.

Wednesday, January 12, 2011

Beta 9 now available

Beta 9 is now available (changesets coming, hang tight). Mozilla is doing more work on the graphics and layers code, which is visible in a couple of sites that have broken for me (I will be filing reports), but also in generally better overall drawing performance. This is important because 10.4Fx does not support hardware acceleration, so anything that gives us a win in software is precious.

After working more with the nanojit, I have decided not to make it the default, but it is there, and you can use it. Go into about:config, under javascript.options.tracejit.*, and set chrome, content, or both to true. Restart the browser. You'll find many things are faster, but some things are still slower, and there were also occasional crashes when quitting the browser: more of a nuisance than a serious problem, since you were quitting anyway, but still not very smooth. If you have good knowledge of PowerPC, I'd love your insight into improving it. The nanojit is in the changesets, but I have now submitted it to Mozilla, so you can also read the diff on Bugzilla if you like. Feel free to comment.

Other things in beta 9 include a less flickery title bar (hopefully), which was a very annoying problem that cropped up in beta 8. Although it was an infrequent irritation on my G5, some users got really walloped by it. Mozilla fixed some part of this for Leopard, but that fix doesn't help Tiger, so 10.4Fx b9 has a separate rendering path for 10.4 that uses a naughty kludge. Because of this, the title bar does look ever so slightly different in 10.4 than in 10.5, but only if you're looking for it. The title bar still does not always match the window gradient, but this is partly Mozilla's bug and partly ours. If you are really anal about that, install a Persona; they always work.

The UI was also adjusted to bring the app to the front when a window is clicked on. I am dithering over forcing the window up, or just forcing the app up, so that you can type but keep your window ordering -- I'm leaning towards the second, but will take arguments for the first. You can always click the Dock icon to bring the whole application and all windows forward, and this fixes the bug in beta 8 where sometimes it didn't bring anything forward when you clicked on a background window. It uses Carbon, don't tell Steve! Oh, and find Sluggo! We miss Power Computing! Classilla users will have seen an earlier version of this easter egg before. No fair reading the changesets to find the magic words.

Note that Flashblock, as of this writing, has not yet been upgraded for beta 9, which should be a one-line change for them. If you really need it (and most of us do prefer it), you might want to wait until it is, or use an equivalent.

Thursday, January 6, 2011

How the bench was run, and what it got us*

*apologies to Michael Stipe

First some laundry list stuff. I've heard rumours about a Fink version of 10.4Fx, which sounds great. I see fangism is following this blog and I think this is his work, so ping me in the comments and maybe we can get your info into the project wiki.

Second, beta 9. Mozilla is apparently shrinking the scope for it, which is a good sign in that they are probably finally approaching release candidate state. At this rate, we should have Firefox 4 just in time for the USA presidential election. ;) This coincides nicely with the release of the nanojit, so there will be a beta 9, since it is convenient to do one. Mozilla is shooting for a code freeze, like, tomorrow, though historically their estimates have at times been wildly optimistic. That said, if their forecasted schedule is in any way accurate, our beta 9 may lag theirs slightly instead of being ahead as usual. I'll explain, but first, the nanojit.

I implied in my last post that the nanojit is unfortunately not the dramatic leap I was hoping for (notice that I did not say how the bench[mark] was won), and those of you who know me at 68KMLA have probably read my post on that subject. To understand why this is a mixed bag, I should explain what the nanojit actually does, which is also useful to Mozilla hackers at large since there is precious little MDN documentation on it.

When the tracer thinks it has hot code, it turns on a trace recorder and slurps up the JavaScript opcodes as they come down the "pipeline." These opcodes are not executed as is (as the interpreter would), but are turned into an intermediate language called LIR. LIR is also the intermediate language for Flash ActionScript, as they are both part of Tamarin, though SpiderMonkey (Mozilla's JavaScript implementation) does not use other parts of the ActionScript interpreter.

After an initial spew of LIR opcodes and/or optimization, the opcodes are then handed to the nanojit assembler (in the source tree, this is in js/src/nanojit/Assembler.cpp). This section is more or less architecture-agnostic, which is important later on. It goes through each of the opcodes and, as needed to generate code, calls specific portions of the architecture-specific nanojit (in our case, NativePPC.cpp), which then write instructions to memory, starting with a standard function prologue. This is not done using things like __asm__(), but using a collection of macros and an opcode table to generate 32-bit instructions directly. If there is some concern about the result of an operation, a guard may be inserted as an internal consistency check, and side exits are made available to abort to the interpreter if funny business is detected. Finally, at least the first time through, an epilogue is emitted. As more code is written and more labels are fixed in memory, the assembler will go through and patch the branches in much the way a two-pass assembler would, eventually terminating in the epilogue that was originally emitted.
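
To make "generating 32-bit instructions directly" a little more concrete, here is a simplified sketch of the style of emitter involved. These are not the actual macros in NativePPC.cpp, which differ in detail; the encodings are standard PowerPC, but the function names are invented.


// Simplified illustration of direct instruction emission; the real macros in
// js/src/nanojit/NativePPC.cpp differ in detail, and these names are invented.
#include <stdint.h>
#include <stdio.h>

static uint32_t* pc;                     // next free slot in the code buffer

// "addi rD,rA,SIMM": primary opcode 14, two 5-bit register fields and a
// 16-bit signed immediate packed into a single 32-bit word.
static void EMIT_ADDI(int rd, int ra, int16_t simm) {
    *pc++ = (14u << 26) | ((uint32_t)rd << 21) | ((uint32_t)ra << 16) | (uint16_t)simm;
}

// "stw rS,D(rA)": primary opcode 36, same D-form layout.
static void EMIT_STW(int rs, int ra, int16_t d) {
    *pc++ = (36u << 26) | ((uint32_t)rs << 21) | ((uint32_t)ra << 16) | (uint16_t)d;
}

int main() {
    uint32_t code[2];
    pc = code;
    EMIT_ADDI(3, 3, 8);                  // addi r3,r3,8   -> 0x38630008
    EMIT_STW(3, 1, -4);                  // stw  r3,-4(r1) -> 0x9061fffc
    printf("%08x %08x\n", code[0], code[1]);
    return 0;
}
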

When it comes time to execute, the tracer makes certain statistical decisions about how useful the trace will be relative to simply falling back on the interpreter, and runs one or the other. If a trace is deemed to be detrimental or pathological for performance, it is blacklisted. Otherwise, once it is compiled, it is used as long as it remains valid or until garbage collection destroys it, the rules of which are not yet determined for Firefox 4.

As originally written, the PPC nanojit worked fine for Tamarin, but was not sufficient for Firefox. It did not have an implementation of nFragExit, which is needed to manage linkages with other code fragments and the epilogue; it could not deal with a branch with a 14-bit or 24-bit displacement needing to be patched to a larger displacement (i.e., turning simple branch instructions into sequences such as mtctr b(c)ctr); and it did not implement the overflow arithmetic instructions, which use guards to catch math overflow and are heavily relied upon by Mozilla.
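
For the displacement problem, the fix is the classic PowerPC long-jump idiom: load the full 32-bit target into CTR and branch through it. The sketch below shows the shape of that expansion with standard encodings; the helper itself is invented and is not the nanojit's actual patching code.


// Illustrative only: emit "lis r0,hi16; ori r0,r0,lo16; mtctr r0; bctr",
// which reaches any 32-bit address regardless of branch displacement limits.
// The encodings are standard PowerPC; the helper is not from NativePPC.cpp.
#include <stdint.h>

uint32_t* emitFarBranch(uint32_t* pc, uint32_t target) {
    *pc++ = 0x3C000000 | (target >> 16);      // lis   r0,hi16(target)
    *pc++ = 0x60000000 | (target & 0xFFFF);   // ori   r0,r0,lo16(target)
    *pc++ = 0x7C0903A6;                       // mtctr r0
    *pc++ = 0x4E800420;                       // bctr
    return pc;                                // next free slot in the buffer
}
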

My first pass, then, was simply to add those features (I am grateful for the help of Edwin Smith, the original author, who gave me the critical pointers to get started). This was enough to get the browser up and running, but it didn't seem much faster, and SunSpider numbers were disastrous: on my quad G5 in Reduced power, the benchmark took around 7000ms on the stock interpreter in beta 8, but on "TraceMonkey PPC" it took nearly 10000ms -- almost half again as long! Dromaeo was less discouraging, and it did extremely well on the microbenchmarks, but its aggregate score was seriously hurt by its SunSpider component, leaving it still slightly slower in the final summation. (Please note that no PowerPC browser, with the possible exception of Safari 5 on Leopard, can run Kraken without UI stalls, so I didn't bother.)

The suckage was, fortunately, proportional. There was no specific penalty paid other than that of the tracer and assembler themselves, and using Apple's built-in MakeDataExecutable() function -- which is still part of Carbon, but is only available on PPC -- to flush the instruction cache was actually faster than a hand-rolled routine. Less complex routines ran in proportionately less time, so the problem was the code it was generating.
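
For those curious, the call in question is about as simple as cache flushing gets. Here is a hedged sketch of how it would be used after a fragment is written; the Carbon function and its signature are real, but the surrounding helper is invented.


// Illustrative sketch: after the assembler finishes writing instructions,
// the new code has to be flushed from the data cache and made visible to
// the instruction cache before it can be jumped to. On PowerPC OS X the
// Carbon call MakeDataExecutable() does this for the given range.
#include <CoreServices/CoreServices.h>   // declares MakeDataExecutable()
#include <stdint.h>

static void finishFragment(uint32_t* start, uint32_t* end) {   // invented helper
    MakeDataExecutable(start, (unsigned long)((char*)end - (char*)start));
    // The buffer can now be cast to a function pointer and executed.
}
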

It is instructive to look at the example of a simple loop:

var i, j = 0; for(i=0;i<2999999;i++) { j += 2; } print(j);

This can be written to run extremely fast in PowerPC assembly, especially being a single loop that can easily take advantage of the CTR loop instructions, which decrement the counter and branch in a single operation. Indeed, as a ceiling test I handwrote this in assembly and fed it to gcc, and the result was too fast to time accurately. However, when compiled by the nanojit, it took around 5000ms (!). Here's what it generates, edited for space and clarity. You can see this yourself by building a debug version of TenFourFox and starting it up with the environment variable TMFLAGS set to native (there are loads of other options; set it to help to see them):


/src/mozilla-central/obj-ff-dbg/dist/TenFourFoxDebug.app/Contents/MacOS/% ./js -b -d -j -p ~/empty.js
[ ... ]
0x3565ed0 [prologue]
0x3565ed0 mflr r0
0x3565ed4 stw r0,8(sp)
0x3565ed8 stw r31,4(sp)
0x3565edc mr r31,sp
0x3565ee0 stwu sp,-144(sp)
; single target patch 03565fd4 -> 03565f30 [ I'll explain this in a moment ]
0x3565ee4 stw r13,-72(r31) <= spill r13
0x3565ee8 stw r14,-68(r31) <= spill r14
[ .. rest of the prologue .. ]
0x3565f28 stw r30,-4(r31) <= spill r30
0x3565f2c stw r3,-76(r31) <= spill state
0x3565f30 [label1] ; the actual meat starts here
0x3565f30 lwz r28,8(r3)
0x3565f34 lwz r29,12(r3)
0x3565f34 ; pc now 03565f34

0x3565f38 ; swapped @ 03565f38

0x3565f38 lis r30,801 (0x3210000)
0x3565f3c ori r30,r30,57368 (0xe018)
0x3565f40 lwz r30,0(r30)
0x3565f44 cmpwi cr7,r30,0 (0x0)
0x3565f48 beq+ cr7,0x3565f58
0x3565f4c b 0x3575dec ; this goes to an exit block
0x3565f50 nop
0x3565f54 nop
0x3565f58 lwz r30,1560(r29)
0x3565f5c stw r30,0(r28)
0x3565f5c ; pc now 03565f5c

0x3565f60 ; swapped @ 03565f60

0x3565f60 li r27,2 (0x2)
0x3565f64 stw r27,8(r28)
0x3565f68 addo r30,r30,r27
0x3565f6c mcrxr cr7
0x3565f70 ble+ cr7,0x3565f84
0x3565f74 lis r0,855 (0x3570000)
0x3565f78 ori r0,r0,24140 (0x5e4c)
0x3565f7c mtctr r0
0x3565f80 bctr
0x3565f84 stw r30,1560(r29)
0x3565f88 lwz r30,1552(r29)
0x3565f88 ; pc now 03565f88

0x3565f8c ; swapped @ 03565f8c

0x3565f8c li r27,1 (0x1)
0x3565f90 addo r30,r30,r27
0x3565f94 mcrxr cr7
0x3565f98 ble+ cr7,0x3565fac
0x3565f9c lis r0,855 (0x3570000)
0x3565fa0 ori r0,r0,24236 (0x5eac)
0x3565fa4 mtctr r0
0x3565fa8 bctr ; this also goes to an exit block
0x3565fac stw r30,1552(r29)
0x3565fb0 stw r30,0(r28)
0x3565fb0 ; pc now 03565fb0

0x3565fb4 ; swapped @ 03565fb4

0x3565fb4 lis r29,45 (0x2d0000)
0x3565fb8 ori r29,r29,50879 (0xc6bf)
0x3565fbc stw r29,8(r28)
0x3565fbc ; pc now 03565fbc

0x3565fc0 ; swapped @ 03565fc0

0x3565fc0 cmpw cr7,r30,r29
0x3565fc4 blt+ cr7,0x3565fd4
0x3565fc8 b 0x3575f0c ; and this also goes to an exit block
0x3565fcc nop
0x3565fd0 nop
0x3565fd4 lis r2,0 (0x0) ; this is patched by the single target patch above
0x3565fd8 ori r2,r2,0 (0x0)
0x3565fdc mtctr r2
0x3565fe0 bctr
0x3565fe4 b 0x3575f6c
[ .. skip memory to epilogue .. ]
0x3575f6c lwz r3,-76(r31) <= restore state
0x3575f70 lwz r13,-72(r31) <= restore r13
[ ... ]
0x3575fb0 lwz r29,-8(r31) <= restore r29
0x3575fb4 lwz r30,-4(r31) <= restore r30
0x3575fb8 lis r12,1152 (0x4800000)
0x3575fbc ori r12,r12,31964 (0x7cdc)
0x3575fc0 lis r2,855 (0x3570000)
0x3575fc4 ori r2,r2,24528 (0x5fd0)
0x3575fc8 mtctr r2
0x3575fcc bctr
0x3575fd0 mr r3,r12
0x3575fd4 mr sp,r31
0x3575fd8 lwz r31,4(sp)
0x3575fdc lwz r0,8(sp)
0x3575fe0 mtlr r0
0x3575fe4 blr


Ugh!

For the sake of fairness, I have left the guards out as they have a very minimal impact on runtime, and the prologue and epilogue are fairly standard ABI stuff. But this is a lot of code for a small loop!

As originally written, the branches were not hinted, and my first attempt was to hint the branches with likely bits and allow the nanojit to optimize a patch down to a smaller set of instructions where possible. Notice the code near the end that loads CTR with 0x0 and tries to jump to it -- that is recognized by another nanojit function called nPatchBranch that (surprise!) patches branches: the branch is treated as a long jump, and the correct target is inserted into the lis-ori pair by the patch routine. If the branch later had to be changed at runtime and could be turned into a 14-bit branch with fewer instructions, it would be, and the rest simply nopped out. Likewise, the nops in the code are there to leave room if a branch has to be "demoted" to a slower far call. This didn't help much, so I loaded the code into Shark.
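
To visualize what the patch routine does with that lis-ori pair, here is a hedged sketch. It is not the real nPatchBranch -- the helper and its assumptions are invented -- but the immediate fields it rewrites are the low 16 bits of each instruction word, exactly as in the listing above.


// Hedged sketch of patching an already emitted lis/ori/mtctr/bctr stub so it
// jumps to a newly decided target; not the nanojit's actual nPatchBranch.
#include <stdint.h>

// 'branch' points at the lis instruction of a stub emitted with a 0x0
// placeholder target, like the "lis r2,0 / ori r2,r2,0" near the end above.
void patchFarBranch(uint32_t* branch, uint32_t newTarget) {
    branch[0] = (branch[0] & 0xFFFF0000) | (newTarget >> 16);     // lis  rN,hi16
    branch[1] = (branch[1] & 0xFFFF0000) | (newTarget & 0xFFFF);  // ori  rN,rN,lo16
    // The patched words must then be flushed to the instruction cache
    // (e.g. with MakeDataExecutable) before this code runs again.
}
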

Shark showed several things of note. First, as you scan the code, you will notice lots of potential pipeline stalls (ignore the "swapped" comments; I'll get to those). While the G5 is fairly aggressive with instruction reordering, it is not perfect, and older CPUs are less capable of it. Second, the branches are really hot. Third, there are a lot of store instructions inside the loop body proper, aggravating the stall situation. Fourth, it's handling the loop counter the hard way.

To help with instruction ordering, I rewrote the macro that emits instructions to see if each new instruction could be dynamically reordered with the preceding one (call it "peephole optimization" if you like). You can see where the swapper worked in the instruction stream (just swap in your mind the instructions immediately around the swapper's log note). If there were no register dependencies and the code did not need to be emitted in strict serialized order, the instructions were swapped, which helped to turn things like


li r0,...
ori r0,...
li r1,...
ori r1,...


into a more typical and efficient


li r0,...
li r1,...
ori r0,...
ori r1,...


and cool off the registers. Although the G5 could do some of this in hardware, it helped with the G4s. A special attempt was made with comparison instructions to turn st cmp b into cmp st b so that the comparison would be more likely to be computed by the time the branch came along, and a store instruction could be safely switched with it since it would not affect the comparison. Finally, expensive instructions such as integer multiply were strength-reduced into repeated smaller operations (in mulli's case, into repeated strategic adds). These improvements yielded about 5-10% additional performance.
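
The dependency test behind the swapper looks something like the sketch below. The real check lives inside the emission macros and is not written this way; modeling each instruction with explicit def/use register masks is just the clearest way to show the rule. When the test passes, the two words are simply exchanged in the buffer, which is what turns li/ori/li/ori into li/li/ori/ori above.


// Hedged sketch of the peephole swap rule: two adjacent instructions may be
// exchanged only if neither reads or writes a register the other writes and
// neither is flagged as needing strict serialized order. Invented structure;
// the real test is embedded in the PPC emission macros.
#include <stdint.h>

struct Emitted {
    uint32_t word;      // the encoded 32-bit instruction
    uint32_t defs;      // bitmask of registers this instruction writes
    uint32_t uses;      // bitmask of registers this instruction reads
    bool     ordered;   // must stay put (branches, anything order-sensitive)
};

bool canSwap(const Emitted& prev, const Emitted& next) {
    if (prev.ordered || next.ordered) return false;  // serialized: leave alone
    if (next.uses & prev.defs) return false;         // read-after-write hazard
    if (next.defs & prev.uses) return false;         // write-after-read hazard
    if (next.defs & prev.defs) return false;         // write-after-write hazard
    return true;                                     // independent: safe to swap
}
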

Still, the really ugly instructions such as the branches could not be totally eliminated, and the memory stores could not be moved out of the loop because they stored stuff in the runtime stack that the side exits (if they were taken) might require. This was really painful for a processor like the PPC, which is strongly oriented to load-store (whereas x86 is much better at memory operations), and it had pipelining consequences. There is no way right now for the PPC nanojit to get a bird's-eye view of what it's compiling so that it can more efficiently order larger functional blocks, such as moving the stack store operations into a side exit or the epilogue, or rewriting loops with the special PPC loop instructions when that sort of situation can be recognized. Perhaps that is more the domain of the methodjit, but it would be nice to have such meta-analysis available to the nanojit as well.

It would hurt less if the tracer were able to predict accurately, for the PowerPC, which operations would be more painful at runtime, but tuning these parameters has yielded only small margins to date.

So am I throwing in the towel with the nanojit? No. For one thing, it still is faster at certain operations, and it can cache code it compiles (which is beneficial to L2-fat processors). While code would have to execute a large number of times to be worth it, things like the browser chrome do indeed execute a large number of times, as do many AJAXy applications. For another thing, I want to get more eyes on it to see if we as a group can make it better. While trying to beat a powerful optimizing compiler with a kludgey code generator is generally a losing proposition, there are probably other optimizations I haven't thought of.

The best and final solution would be to change some aspects of the nanojit to allow the assembler to see more of the LIR at once and make bigger-picture decisions (or hint the processor-specific portions about them), or to expand the LIR opcode set to embrace more purpose-oriented opcodes that could tell a code generator like this one the intended purpose of a section of code. Certainly for loops there is a better way of implementing them in assembly than this. I don't know how likely that is to happen, however.

The plan for the nanojit then is this: it will come out as part of beta 9, turned on by default (you can turn it off in about:config). It has been tested mostly on G5 and G4/7450, though I also ran it on a G4/7400 and it seemed okay. I don't have a G3 running Tiger to test this with, so you guys will be on your own. See if it's faster for your workload. See if it's something you can improve upon. It's a start.