The future of OpenWPM
A rough collection of thoughts and ideas on what can and should happen to OpenWPM.
Context
Mozilla will significantly reduce its investment in OpenWPM.
However, the project remains significant to privacy research by providing a robust crawler that allows for easy data collection and is simple to extend.
In the last two years alone, 25 papers were written that used OpenWPM.
As such, it would feel irresponsible to just let it sit there and rot.
The future
My proposal for OpenWPM is three-part.
1. Prepare it for hands-off maintenance mode
   - Merging any open PRs by me - Done
   - Writing a short script that can be run on every FF release to create a new OpenWPM version. - Done
2. We explicitly stay open to PRs. This is both for people new to programming and for researchers wanting to share their code.
3. I’ll use OpenWPM as a project to build my brand with educational coding streams and try to build a community around that.
Of these, 1. should happen within the next two weeks, while 2. and 3. are things I’m willing to do in my free time.
Proposal 1. allows us to maintain a useful research tool without investing meaningful resources.
Proposal 2. comes mostly from my frustration with unmaintained projects; I don’t want OpenWPM to become one. Also, beginners tend to have a lot of endurance for tedious tasks, which still feel rewarding to them because they are doing them for the first time.
Proposal 3. exists because I still have a lot of ideas that I’d like to try out. They were never reasonable to do on company time, but they might simplify and improve OpenWPM significantly.
Things I want to implement
This is not related to anything above; I just wanted to write down my thoughts.
Structured logging
I dislike how badly our logs currently work in a cloud world, where we used to run most of our crawls: we embed only a fraction of the available context, and we embed it in an unparseable string.
To remedy that and to make our logs easier to parse, I proposed using structlog 6 months ago. This way we can emit JSON when running in the cloud and set all the appropriate tags when sending events to Sentry (and not just set tags per crawl).
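To make the idea concrete, here is a minimal sketch of context-carrying JSON logs. It deliberately uses only the standard library `logging` module as a stand-in, not structlog itself and not OpenWPM's actual logging setup. The names `JsonFormatter`, `ContextAdapter`, `make_logger`, the logger name `openwpm.sketch`, and the context keys `crawl_id` and `browser_id` are all hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # ContextAdapter (below) stores its context on the record as `context`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


class ContextAdapter(logging.LoggerAdapter):
    """Attach a fixed dict of context (hypothetical keys) to every record."""

    def process(self, msg, kwargs):
        extra = kwargs.setdefault("extra", {})
        extra["context"] = dict(self.extra)
        return msg, kwargs


def make_logger(stream, **context):
    """Build a logger that writes JSON lines to `stream` with bound context."""
    logger = logging.getLogger("openwpm.sketch")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return ContextAdapter(logger, context)


stream = io.StringIO()
log = make_logger(stream, crawl_id="crawl-1", browser_id=3)
log.info("page load finished")
line = stream.getvalue().strip()
print(line)
```

Because every field is a separate JSON key, a cloud log viewer or an error tracker could filter on something like `browser_id` instead of parsing a message string. structlog offers bound loggers and a JSON renderer that cover the same ground with less hand-written code.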
Replace processes with threads
A lot of our Python code is just a bunch of queues waiting for messages from another process. I think that by moving all of this into a single process we can debug it more easily, as most tools are set up to handle threads but few handle multiple processes well.
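As a rough illustration of the shape this could take, the sketch below runs a queue consumer in a thread instead of a separate process. It is not taken from the OpenWPM code base: `storage_worker`, the message fields `visit_id` and `table`, and the `SHUTDOWN` sentinel are hypothetical stand-ins.

```python
import queue
import threading

SHUTDOWN = object()  # sentinel telling the worker to stop


def storage_worker(inbox: queue.Queue, stored: list) -> None:
    """Hypothetical stand-in for a component that drains a message queue."""
    while True:
        message = inbox.get()
        if message is SHUTDOWN:
            break
        stored.append(message)


inbox: queue.Queue = queue.Queue()
stored: list = []
worker = threading.Thread(
    target=storage_worker, args=(inbox, stored), name="storage"
)
worker.start()

# The producer side: in this sketch it is simply the main thread.
for visit_id in range(3):
    inbox.put({"visit_id": visit_id, "table": "http_requests"})
inbox.put(SHUTDOWN)
worker.join()
print(stored)
```

Since everything lives in one process, messages are passed by reference rather than pickled, which is why an identity check against the sentinel works here. A single debugger session can also pause and inspect both the producer and the worker.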
Merge the two extension directories and move them to the top level
Having the extension live under openwpm/Extension makes it look like a Python module, which it isn’t.
Also, the split between webext-instrumentation and the Firefox-specific part was made in an effort to allow OpenWPM to work with Chrome.
We have given up on that effort and are instead all in on Firefox.
Merging these two directories will significantly simplify the build process for the extension and maybe even allow us to get rid of webpack and similar tooling, making hacking on the extension even simpler.