I'm excited to share that I'm starting a new role as a research engineer on the language model interpretability team at Google DeepMind!
Deep learning is best thought of as an unfathomably vast search over all possible software programs -- each immensely complicated, with billions of parameters -- for the one that best fits the enormous amount of data we feed it. The resulting program is an inscrutable pile of matrices, and as such, by default it's entirely unclear to us what it is capable of or how it works. This is unlike almost all other kinds of software, where humans understand every component of the stack. In particular, we don't know whether these programs have robustly learned what we intended them to learn, or whether they are producing outputs that look good to us but might be covertly harmful.
At Google DeepMind, my research will focus on advancing the frontier of our ability to understand the internals of neural networks -- to attempt to reverse engineer this pile of matrices into human-understandable concepts and algorithms. My team sits within the broader artificial general intelligence (AGI) safety and alignment team, working under the premise that, by sufficiently advancing our understanding of how our AI models work, we might more robustly notice problems in how we train them that analysis of input-output behaviour alone would miss, and then use this understanding to design and deploy safer AI models.
AI is improving at a rapid pace. This is both exciting and scary. If we are to safely create powerful general artificial intelligence and deploy it widely to automate labour and solve many of the world's most challenging problems, there will be a number of technical and societal challenges and risks to overcome. My work focuses on a subset of these. But there is much work to be done, and perhaps not a whole lot of time to do it.
It seems plausible that we as a species will soon succeed in building machines as intelligent as humans: AI systems that go beyond the chatbots we are all familiar with, to agents that can reason, plan, and autonomously act in the world to accomplish real tasks. If we get to such a point and are insufficiently careful about how we build and deploy such systems, they may end up causing more harm than good -- an outcome I deeply care about avoiding. If we do get this right, however, it's possible that such AI systems might very rapidly enable a future significantly better than the present -- an outcome worth fighting for.