RustWPM

This article outlines why writing an OpenWPM clone in Rust would have significant benefits for the Python implementation and other platform implementations.

Context

OpenWPM is a crawler that instruments the Firefox browser to collect data about the behavior of websites and saves that data to persistent storage. The framework gives the user fine-grained control over how Firefox and the extension are configured, and controls the browser using Selenium. It allows running multiple browsers in parallel and supports a variety of storage backends.

History

The current implementation of OpenWPM is in Python and started out as controlling a single Firefox instance using the Selenium XPCOM Extension, capturing data using a MitM Proxy and writing that data out to a SQLite DB. Over time a number of features were added:

  • The ability to run multiple browsers in parallel
  • A WebExtension that captures state from inside the browser
  • The ability to store web content (i.e. the loaded JS files and similar)
  • A number of different storage backends
  • A centralized logging server that captures both Firefox and application logs
  • CommandSequence as a way to chain together multiple different actions e.g. first GET a page, accept cookies, click one link

As the XPCOM extensions were removed and replaced by the more restricted WebExtensions, Selenium switched to using geckodriver to control the browser.

Because the Selenium extension was notoriously unreliable, a major focus of the evolving design was ensuring that the crawl would go on no matter how much that code misbehaved. This led to each WebDriver handle (the control interface of Selenium) being held in a separate process, all of which communicate with the main process over a system of channels. And since reads from those channels are blocking, the DataAggregator and MultiprocessLogger (MPLogger) each needed to live in their own process as well.

Claim

Ever since the switch to geckodriver as an intermediary, Selenium has become a lot more reliable, as there are now two process boundaries between our code and the browser. TODO: Steal graphic from Selenium on how it interacts with the browser now

This means one of our primary design constraints no longer holds and we should reconsider our current design.

Proposal

So far I've only been talking about how our Python code base should change, but this post is called RustWPM, so let me outline the connection. I believe that a radically simpler model is possible for a crawler that lacks none of the functionality that OpenWPM offers.

As a crawler, the program spends most of its time waiting on sites to load, timeouts to expire or data to arrive, and almost no time in the code we write. However, there are many connections to handle at once, such as the connection to the WebExtension or the logs arriving from the Firefox instances.

This pattern of behavior basically calls for async programming, as it allows us to wake up as soon as there is any work to be done and spend the rest of the time idle instead of blocking multiple threads or even processes.
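To make the waiting pattern concrete, here is a minimal sketch using Python's asyncio. The function names, URLs and timings are made up for illustration, and `load_page` only sleeps instead of talking to a browser. It shows several visits waiting on one thread, each with its own timeout.

```python
import asyncio


async def load_page(url: str, load_time: float) -> str:
    """Hypothetical stand-in for a page load; the sleep models network wait."""
    await asyncio.sleep(load_time)
    return f"loaded {url}"


async def visit(url: str, load_time: float, timeout: float) -> str:
    """Wait for the page, but give up once the timeout expires."""
    try:
        return await asyncio.wait_for(load_page(url, load_time), timeout)
    except asyncio.TimeoutError:
        return f"timeout {url}"


async def crawl_with_timeouts() -> list:
    # All three visits wait concurrently on a single thread. The event loop
    # only wakes up when a load finishes or a timeout fires.
    return await asyncio.gather(
        visit("https://a.example", 0.01, 0.2),
        visit("https://b.example", 0.5, 0.05),  # slower than its timeout
        visit("https://c.example", 0.02, 0.2),
    )
```

Calling `asyncio.run(crawl_with_timeouts())` returns one result per visit, in the order the visits were passed to `gather`. The second visit reports a timeout while the other two complete, and no extra thread or process is involved.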

Rust

So far we have only been setting up some general ideas and haven't mentioned Rust at all, so I want to explain why I think Rust is a good fit for the programming model I'm proposing here.

  1. In Rust concurrency is a first-class citizen of the language, and Rust's type system ensures that any safe code you write is free of data races
  2. The async-std crate mirrors much of the standard library's API with async equivalents, enabling developers to write code that reads as close to the synchronous equivalent as possible
  3. Rust's ecosystem has tracing, a library that enables structured logging in async applications and supports the span+event model that OpenTracing promotes

CommandSequences can simply be replaced by functions or lambdas that get the same set of parameters passed as a custom function does today. This drastically simplifies implementing the BrowserManager, as its only responsibility becomes initializing and finalizing the visit before and after the user code.
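A rough sketch of that idea, written in Python for readability: `FakeDriver`, `run_visit` and the selectors are hypothetical names invented for this example and are not part of OpenWPM or Selenium. The manager only wraps the user-supplied function with initialization and finalization.

```python
import asyncio
from typing import Awaitable, Callable, List


class FakeDriver:
    """Hypothetical stand-in for a WebDriver handle; it only records calls."""

    def __init__(self) -> None:
        self.log: List[str] = []

    async def get(self, url: str) -> None:
        await asyncio.sleep(0)  # yield to the event loop, as a real request would
        self.log.append(f"get {url}")

    async def click(self, selector: str) -> None:
        await asyncio.sleep(0)
        self.log.append(f"click {selector}")


VisitFn = Callable[[FakeDriver, str], Awaitable[None]]


async def run_visit(driver: FakeDriver, url: str, user_code: VisitFn) -> None:
    """The manager's whole job: initialize, run the user code, finalize."""
    driver.log.append("init visit")
    try:
        await user_code(driver, url)
    finally:
        # Finalization runs even if the user code raises.
        driver.log.append("finalize visit")


async def accept_cookies_and_follow_link(driver: FakeDriver, url: str) -> None:
    """What a CommandSequence expresses today, written as a plain function."""
    await driver.get(url)
    await driver.click("#accept-cookies")
    await driver.click("a.first-link")
```

A visit is then started with `asyncio.run(run_visit(FakeDriver(), "https://example.org", accept_cookies_and_follow_link))`. Because the user code is an ordinary function, it can use loops, conditionals and error handling directly instead of encoding them as a sequence of commands.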

In this model there is only one process, which greatly reduces the need for interprocess communication, one of the significant drawbacks of the current system.
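As an illustration of what replaces the channels between processes, the following sketch passes records from several browser tasks to a single aggregator task through an in-process queue. The task names, record fields and the `None` sentinel are assumptions made for this example, not OpenWPM's actual schema or API.

```python
import asyncio


async def browser_task(browser_id: int, records: asyncio.Queue) -> None:
    """Hypothetical browser worker that emits two records per visit."""
    for kind in ("http_request", "javascript_call"):
        await asyncio.sleep(0.001)
        # The dict is handed over by reference; nothing is serialized
        # or sent over a socket.
        await records.put({"browser_id": browser_id, "kind": kind})


async def aggregator(records: asyncio.Queue, store: list) -> None:
    """Single consumer playing the role of the DataAggregator."""
    while True:
        record = await records.get()  # suspends this task only, not a process
        if record is None:  # sentinel: all producers are done
            break
        store.append(record)


async def crawl_with_aggregator(num_browsers: int) -> list:
    records: asyncio.Queue = asyncio.Queue()
    store: list = []
    consumer = asyncio.create_task(aggregator(records, store))
    await asyncio.gather(*(browser_task(i, records) for i in range(num_browsers)))
    await records.put(None)  # tell the aggregator to stop
    await consumer
    return store
```

With three browsers, `asyncio.run(crawl_with_aggregator(3))` collects six records. The blocking read that forces a dedicated process in the current design becomes an `await` that only suspends the aggregator task.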

If we are able to build a compiling version of this idea in safe Rust, we can be fairly certain that it is free of data races, which rules out a whole class of concurrency bugs.

The Rust prototype could then inform the rewrite in Python.

TODO:

  • Talk about the fact that the webdriver crate is not stable yet
  • How Rust is less likely to be adopted by researchers and thus the final/leading implementation should be in Python
  • How Python's asyncio library is currently not quite mature but still usable
  • Link to all relevant external pieces
  • Rewrite the entire Rust part

Conclusion

While prototyping in Rust is hard, I think we have come to a point in OpenWPM's development where we are fairly certain which properties we need in an implementation and can now proceed to focus on reliability and correctness. Achieving these properties in Rust is easier than achieving them in Python.