Bag of, not words, but tricks
Research Advocate, Rasa
Many academic articles discuss benchmarks. Benchmarks are fine, but they can also be a distraction. There are a few reasons why.
- Many NLP articles focus on English. If a technique works well in English, there’s no guarantee that it will work for other languages as well.
- Many articles are written to brag about a new technique. This means that they sometimes paint a biased picture. Maybe a simple model will do just fine and maybe the hyperparameters matter less than the quantity of data.
- It might be that the task in the research paper is significantly different from your use-case. Predicting sentiment on the Twitter corpus is fundamentally different from predicting sentiment from transcribed telephone conversations at your company.
- It might be that the benchmark reports on a well-defined task, but that the task itself isn’t meaningful. A CSV file, in the end, often doesn’t capture the complexities of reality.
We need to deal with all of these issues at Rasa. We’re trying to build a software stack that will allow anybody to write a proper virtual assistant in Python. That means that we need tools to run proper benchmarks easily. As part of my work, I’ve been able to open source some tools that should help you. So in this talk, I’d like to show my favorite (scikit-learn compatible!) tricks to get a proper benchmark going for text classification.
In particular, I’ll quickly show:
- whatlies – a tool that exposes word/language embeddings for over 275 languages in scikit-learn
- tokenwiser – a tool that links spaCy with scikit-learn meant to help in situations where data doesn’t fit in memory
- human-learn – a tool that allows you to turn Python functions into benchmarkable scikit-learn components
- hiplot – the best interactive visualisation tool for a grid-search
- how you can apply ipywidgets to do bulk labelling live from your notebook
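To make the idea of benchmarking a plain Python function a bit more concrete, here is a minimal sketch in pure Python. It only mimics the fit/predict convention that scikit-learn estimators follow; it does not use human-learn's actual API. The `FunctionClassifier` class, the `keyword_rule` function, the `accuracy` helper and the four example sentences are all made up for illustration.

```python
from typing import Callable, List, Sequence


class FunctionClassifier:
    """Hypothetical wrapper that puts a plain function behind fit/predict.

    This is an illustration of the idea only, not human-learn's real class.
    """

    def __init__(self, func: Callable[[str], str]):
        self.func = func

    def fit(self, X: Sequence[str], y: Sequence[str]) -> "FunctionClassifier":
        # A hand-written rule has nothing to learn; we only record the labels seen.
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X: Sequence[str]) -> List[str]:
        # Apply the rule to every text, one at a time.
        return [self.func(text) for text in X]


def keyword_rule(text: str) -> str:
    # Toy rule: any of these words marks the text as negative.
    negative_words = {"bad", "terrible", "refund"}
    tokens = set(text.lower().split())
    return "negative" if tokens & negative_words else "positive"


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy data, purely for illustration.
texts = [
    "great service thank you",
    "terrible support I want a refund",
    "this was bad",
    "not what I expected",
]
labels = ["positive", "negative", "negative", "negative"]

baseline = FunctionClassifier(keyword_rule).fit(texts, labels)
predictions = baseline.predict(texts)
score = accuracy(labels, predictions)
print(f"rule-based baseline accuracy: {score:.2f}")
```

Because the rule sits behind the same fit/predict interface as a trained model, it can be scored with the same code. That gives you a cheap baseline to compare a fancier technique against, and the sentences it gets wrong (here, the one without an obvious keyword) are often informative on their own.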
The goal of my talk is to show how to quickly bootstrap a meaningful benchmark while also demonstrating some tools that I’ve been able to open source personally, as well as tools that have been open-sourced by my employer, Rasa.
My name is Vincent, ask me anything. I have been evangelising data and open source for the last 7 years. You might know me from tech talks where I attempt to defend common sense over hype in data science. Currently I work as a Research Advocate at Rasa, where I collaborate with the research team to explain and understand conversational systems better. In my spare time I maintain many open source projects and I host an educational platform over at https://calmcode.io.