It sounded simple enough: make an ‘strace(1) for Chromium’ to log all JavaScript function calls and property accesses that cross the boundary between the JavaScript execution engine (V8 in Chromium’s case) and the web browser itself (for example, Chromium’s Blink DOM rendering engine).
Such a tool could tell us what browser APIs are being used across the web by what scripts, and it could be very useful for identifying new browser fingerprinting techniques and web scraping countermeasures. The ideal system would be complete (that is, all native API functions/properties, automatically), tamper-proof, stealthy (that is, no tell-tale artefacts visible from JavaScript), and fast enough for large-scale web crawling.
Following conventional wisdom from recent related work, we steered away from modifying Chromium itself and instead used JavaScript injected at runtime via browser automation. The conventional wisdom was a lie.
It started easy: you just ‘monkey patch’ APIs to insert logging before/after invocation! But then corner cases started emerging.
Some important DOM properties (for example, window.document) are always read-only in Chromium. Stealth is hard (sometimes impossible) to achieve since JavaScript provides many introspection mechanisms like toString and rich stack traces on exceptions.
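To make these corner cases concrete, here is a minimal sketch of in-band monkey patching and two ways it falls short. It is not the instrumentation we actually used: `monkeyPatch`, `looksNative`, and `fakeWindow` are hypothetical names, `Math.max` stands in for a browser API, and `fakeWindow.document` only imitates a read-only DOM property so that the example runs outside a browser.

```javascript
// Hypothetical helper: replace obj[name] with a wrapper that logs each call.
function monkeyPatch(obj, name, log) {
  const original = obj[name];
  obj[name] = function (...args) {
    log.push({ api: name, args });                // log before invocation
    return Reflect.apply(original, this, args);   // then defer to the real API
  };
  return original;
}

const log = [];
const nativeMax = monkeyPatch(Math, "max", log);
const result = Math.max(1, 5, 3);                 // still works, and is now logged

// Stealth problem: toString exposes the wrapper's source text.
const looksNative = (fn) =>
  Function.prototype.toString.call(fn).includes("[native code]");
const patchedLooksNative = looksNative(Math.max);   // the wrapper
const originalLooksNative = looksNative(nativeMax); // the real built-in

// Read-only problem: a non-writable, non-configurable property (an assumed
// stand-in for window.document) cannot be redefined with a logging getter.
const fakeWindow = {};
Object.defineProperty(fakeWindow, "document", {
  value: {},
  writable: false,
  configurable: false,
});
let redefineFailed = false;
try {
  Object.defineProperty(fakeWindow, "document", { get() { return null; } });
} catch (err) {
  redefineFailed = err instanceof TypeError;
}

Math.max = nativeMax; // undo the patch
```

Any page script can run the same `looksNative` check, which is why a patched API is so easy to spot from inside the page.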
Discovering native APIs on the fly required a full walk and instrumentation of the JavaScript global object namespace for each newly created DOM frame. Even if the resulting system could be made robust and stealthy, the performance was abysmal. It was time to go back to the drawing board.
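A rough sketch of such a walk, under simplifying assumptions, might look like the following. The function name `discoverNativeFunctions`, the depth limit, and the `[native code]` test are choices made for this example, not details of our original tooling. Accessor properties are skipped so that the walk itself triggers no getters.

```javascript
// Walk an object graph and collect paths to functions that look native.
function discoverNativeFunctions(root, maxDepth = 3) {
  const seen = new Set();
  const found = [];

  const isNative = (fn) => {
    try {
      return Function.prototype.toString.call(fn).includes("[native code]");
    } catch {
      return false;
    }
  };

  function walk(obj, path, depth) {
    const walkable =
      obj !== null && (typeof obj === "object" || typeof obj === "function");
    if (!walkable || seen.has(obj) || depth > maxDepth) return;
    seen.add(obj);

    let keys;
    try {
      keys = Reflect.ownKeys(obj);
    } catch {
      return; // for example, a revoked proxy
    }
    for (const key of keys) {
      let desc;
      try {
        desc = Reflect.getOwnPropertyDescriptor(obj, key);
      } catch {
        continue;
      }
      // Only follow plain data properties; never invoke getters.
      if (!desc || !("value" in desc)) continue;
      const childPath = `${path}.${String(key)}`;
      if (typeof desc.value === "function" && isNative(desc.value)) {
        found.push(childPath);
      }
      walk(desc.value, childPath, depth + 1);
    }
  }

  walk(root, "globalThis", 0);
  return found;
}

const nativePaths = discoverNativeFunctions(globalThis);
```

Even this simplified version touches thousands of properties, and an in-band tool has to repeat the walk (and patch every hit) for each new frame.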
Implementing VisibleV8
The alternative was to bake our instrumentation into Chromium’s C++ source code. This has been done before, but the traditional drawback has been maintainability: Browsers are updated constantly, and few researchers can prioritize keeping their patches current.
To save VisibleV8 from such a fate, we pursued a strategy of radical minimalism, confining our patches strictly to V8 rather than the whole browser and hooking a handful of ‘choke-points’ within V8 to produce all-or-nothing instrumentation. The final patches made few invasive changes and added fewer than 600 new source lines of code (SLOC). For comparison: Chromium as a whole comprises millions of SLOC.
Native function calls were straightforward to instrument, as V8 channels all such calls through a single gateway function.
Property accesses on native API objects were trickier, as V8 features a dazzling array of different fast-paths for optimized property access. Our solution avoided these by injecting instrumentation hooks at all property access expressions as V8’s front-end interpreter, Ignition, translates JavaScript source code into bytecode. This bytecode serves as input to both Ignition and the optimizing just-in-time (JIT) compiler, TurboFan, so bytecode injection preserves our hooks from end to end, even under JIT compilation. The logging logic itself filters out non-native objects to reduce log volume and overhead.
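As a loose analogy only, the effect of a hook on every property read, combined with a filter for native objects, can be imitated in plain JavaScript. Nothing below is V8 bytecode or VisibleV8 code: `tracedGet` is a hypothetical hook, and the `nativeObjects` set is an assumed stand-in for V8's internal knowledge of which objects are native.

```javascript
// Assumed stand-in for V8's internal notion of a "native API object".
const nativeObjects = new WeakSet([Math, JSON]);
const trace = [];

// Hypothetical hook: runs on every property read, logs only native receivers.
function tracedGet(receiver, key) {
  if (nativeObjects.has(receiver)) {
    trace.push(`get ${String(key)}`);
  }
  return receiver[key];
}

// The expression `Math.PI + point.x`, rewritten as if a hook had been
// emitted in front of each property access.
const point = { x: 1 };
const sum = tracedGet(Math, "PI") + tracedGet(point, "x");
```

Both reads pass through the hook, but only the read on `Math` reaches the log. That filtering step is what keeps ordinary application objects from flooding the output.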
Of course, hooking all property accesses is expensive, even under JIT. We measured a ~60% slowdown on the Speedometer full-browser benchmark, and a few of Dromaeo’s aggregated microbenchmarks were much worse. However, we observed that VisibleV8 outperformed equivalent in-band instrumentation wherever such comparisons were possible.
Furthermore, VisibleV8 remains fully usable for interactive browsing, even on JavaScript-heavy sites such as Google Maps, and we have had no problems using it at scale (hundreds of thousands of page visits) for automated web crawling.
We visited the Alexa top 50k web sites using VisibleV8 to look for evidence of crawling countermeasures. Specifically, we looked for code probing properties that do not exist in proper browsers but which are artefacts of headless/automated browsers (that is, bots).
We found bot detection activity on 29% of the visited domains (over 73% of it coming from third-party iframes).
This experiment vindicated our choice to go out-of-band, as VisibleV8’s ability to log all access to native objects (especially the global window object) allowed us to discover new artefacts we had no prior knowledge of, while any in-band approach would have needed an already existing list of properties to instrument (since the global object cannot be wrapped behind a JavaScript proxy).
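The kind of post-processing this enables can be sketched as follows. The record format, the script URLs, and the tiny `baseline` set are all invented for illustration and do not reflect VisibleV8's real log format or our actual analysis. The probed names are examples of artefacts commonly associated with PhantomJS and Nightmare.

```javascript
// Hypothetical log records: which script read which global-object property.
const accessLog = [
  { script: "https://a.example/app.js", property: "document" },
  { script: "https://b.example/detect.js", property: "callPhantom" },
  { script: "https://b.example/detect.js", property: "_phantom" },
  { script: "https://b.example/detect.js", property: "navigator" },
  { script: "https://c.example/ads.js", property: "__nightmare" },
];

// Assumed baseline: global property names present in an ordinary browser.
const baseline = new Set(["document", "navigator", "location"]);

// Group reads of properties missing from the baseline by property name.
function candidateArtefacts(log, knownProperties) {
  const byProperty = new Map();
  for (const { script, property } of log) {
    if (knownProperties.has(property)) continue;
    if (!byProperty.has(property)) byProperty.set(property, new Set());
    byProperty.get(property).add(script);
  }
  return byProperty;
}

const candidates = candidateArtefacts(accessLog, baseline);
const candidateNames = [...candidates.keys()].sort();
```

Because every read of the global object is logged, the list of suspicious names falls out of the data rather than having to be supplied up front.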
VisibleV8 is freely available, and under active development and maintenance (up to Chrome 79 as of writing). We hope it will serve as a foundation for other researchers’ tools providing deep insight into dynamic behaviour on the Web.
This work was presented at IMC 2019 as ‘VisibleV8: In-browser Monitoring of JavaScript in the Wild’.
Jordan Jueckstock is a Computer Science PhD student at NCSU where he works in the Wolfpack Security & Privacy Research Lab with his advisor, Alexandros Kapravelos.