Text Annotation Tools: Which One to Pick in 2020?
Eduardo Cerna, March 17, 2020
Following our article on annotation tools for computer vision projects, which you can read here, we’re focusing this time on the best text annotation tools for natural language processing. When working on an NLP project, your success depends on the quality of your model, which in turn depends on the quality of the training data you feed it. Choosing the right annotation tool is a crucial first step.
The tools outlined in this article all fulfill the basic requirements for NER (Named Entity Recognition) and classification, albeit with slightly different approaches. Ultimately, the tool you choose will largely depend on your specific annotation needs and personal preferences.
Here’s our primer on some of the most popular text annotation tools for 2020:
Doccano is a web-based, open-source text annotation tool. It can be self-hosted, which gives you more control as well as the ability to modify the code according to your needs. You can try a demo of the annotation tool on their website.
Pros:
- Simple, easy-to-use interface.
- Ability to assign key shortcuts for faster labeling.
- Team collaboration functionality.
- MIT License.
Cons:
- When self-hosted, Doccano shuffles the order of the annotation pieces.
- Laggy and unresponsive at times (in a self-hosted environment).
- No official API. There is an unofficial attempt at an API client.
Brat offers a rather outdated interface (last updated in 2012) and needs to be installed locally, as it is not available for online use. Its developers claim that the software is designed specifically for structured annotation (where the notes are not freeform text but have a fixed form that can be automatically processed and interpreted by a computer).
Brat uses a familiar drag-to-select or double-click gesture for annotation and relationship functionalities, and working with it is pretty straightforward. You can try an online demo on their site.
Concerning export formats, Brat exports files in brat standoff format, which can be easily converted into other formats. Specifically:
“Annotations created in brat are stored on disk in a standoff format: annotations are stored separately from the annotated document text, which is never modified by the tool.”
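To give a rough idea of what such a conversion can look like, here is a minimal Python sketch that reads entity annotations from standoff content and turns them into character-offset tuples. It assumes the commonly documented layout for text-bound annotations (an ID starting with `T`, then a tab, then `Type start end`, then a tab, then the annotated text). The sample document, the labels and the helper names `parse_standoff_entities` and `to_offset_tuples` are made up for illustration, and discontinuous spans are deliberately skipped.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_standoff_entities(ann_text: str) -> List[Entity]:
    """Collect text-bound ("T") annotations from standoff content."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, attributes and notes are ignored here
        ann_id, type_and_span, surface = line.split("\t", 2)
        if ";" in type_and_span:
            continue  # discontinuous spans are out of scope for this sketch
        label, start, end = type_and_span.split(" ")
        entities.append(Entity(ann_id, label, int(start), int(end), surface))
    return entities


def to_offset_tuples(entities: List[Entity]) -> List[Tuple[int, int, str]]:
    """Convert entities to (start, end, label) tuples, a shape many NER libraries accept."""
    return [(e.start, e.end, e.label) for e in entities]


# Hypothetical document and annotation file contents.
doc_text = "Sony opened an office in Prague."
ann_text = (
    "T1\tOrganization 0 4\tSony\n"
    "T2\tLocation 25 31\tPrague\n"
    "R1\tLocatedIn Arg1:T1 Arg2:T2\n"
)

entities = parse_standoff_entities(ann_text)
offsets = to_offset_tuples(entities)
print(offsets)
```

Because the document text is stored separately and never modified, the offsets can be checked against it directly, for example by comparing `doc_text[e.start:e.end]` with `e.text` for every entity.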
Pros:
- Simple to use.
- Integration with automatic annotation tools.
- Brat standoff export format allows for easy conversion into other formats.
- MIT License.
- Available API for continuous model training.
Cons:
- Needs to be installed locally.
- Outdated UI.
According to its creators, Prodigy is an annotation tool powered by active learning. It boasts a sleek, modern interface and was developed by the creators of the popular spaCy library. The active learning aspect means that you only have to annotate examples the model doesn’t already know the answer to, which considerably speeds up the annotation process.
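The general idea behind active learning can be illustrated with a simple uncertainty-sampling sketch in Python. This is not Prodigy's implementation, only one common way of choosing which examples to show an annotator first: score each example with the current model and prioritise the ones whose predicted probability is closest to 0.5. The example texts, the scores and the function names are hypothetical.

```python
from typing import List, Tuple


def uncertainty(prob: float) -> float:
    """Return 1.0 when the model is maximally unsure (prob = 0.5) and 0.0 when it is certain."""
    return 1.0 - abs(prob - 0.5) * 2.0


def select_for_annotation(scored: List[Tuple[str, float]], k: int) -> List[str]:
    """Pick the k texts whose predicted probability is closest to 0.5."""
    ranked = sorted(scored, key=lambda pair: uncertainty(pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical model scores: probability that the text mentions a company.
scored_examples = [
    ("Apple released a new phone", 0.97),
    ("Jordan scored again", 0.52),
    ("Amazon is burning", 0.45),
    ("The weather is nice", 0.03),
]

queue = select_for_annotation(scored_examples, k=2)
print(queue)
```

In this toy run the two ambiguous sentences are queued for a human decision, while the ones the model is already confident about are left out.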
You can choose from JSONL, JSON and TXT formats for exporting your files.
Pros:
- Sleek, modern interface.
- Includes active learning, considerably speeding up the annotation process.
- Support for image annotation.
- Fully scriptable.
Cons:
- Expensive. At the time of writing, the cheapest license went for USD 390. However, the license is perpetual and comes with 12 months of free upgrades.
Tagtog is a web-based text annotation tool. Its web interface is straightforward and offers plenty of flexibility in terms of the formats you can work with: you can paste plain text, work from a URL or upload your own files, including PDFs (although PDF upload is a paid feature). Otherwise, Tagtog is free to use for its basic functionalities; you only need to create an account to start annotating.
Tagtog provides a wide variety of output formats.
Pros:
- Free to use (added functionalities come at a cost).
- Fast tagging: recognizes all occurrences of an entity once it has been manually labeled and tags them automatically.
- Support for multiple types of input data (PDF, TXT, HTML, CSV, source code files, Markdown, etc.).
- No installation required.
- Machine learning capabilities: learns from previous annotations and automatically generates similar annotations.
- Support for group annotation for teams.
- Own API for continuous model training.
Cons:
- The interface can be a little confusing at first.
- Expensive (approximately USD 135 per month for their cheapest paid tier).
An all-encompassing solution, Dataturks is a free-forever, comprehensive annotation tool that lets you annotate not only documents but also images and video, which we covered in more detail in our image annotation tools article.
Dataturks allows you to annotate plain text as well as PDFs, and you can either use it in the cloud or install it locally. The export format is NER JSON, which they claim is fairly similar to spaCy’s format and which you can convert with a simple function that they provide on their site.
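The sketch below shows roughly what such a conversion can look like in Python. It is not the function Dataturks provides. It assumes an export with one JSON object per line, a `content` field holding the text, and an `annotation` list whose items carry `label` and `points` (each point with `start` and `end`, where `end` is inclusive). These field names and the inclusive end offset are assumptions, so check them against a real export before relying on this.

```python
import json
from typing import Dict, Iterable, List, Tuple

SpacyExample = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def ner_json_to_spacy(lines: Iterable[str]) -> List[SpacyExample]:
    """Convert line-delimited NER JSON (assumed layout) to (text, {"entities": [...]}) tuples."""
    training_data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        entities = []
        for annotation in record.get("annotation") or []:
            labels = annotation["label"]
            if not isinstance(labels, list):
                labels = [labels]
            for point in annotation["points"]:
                for label in labels:
                    # Assumed: the export's end offset is inclusive, spaCy's is exclusive.
                    entities.append((point["start"], point["end"] + 1, label))
        training_data.append((record["content"], {"entities": entities}))
    return training_data


# Hypothetical export line, built here so the example is self-contained.
sample_line = json.dumps({
    "content": "Maria works at Acme.",
    "annotation": [
        {"label": ["PERSON"], "points": [{"start": 0, "end": 4, "text": "Maria"}]},
        {"label": ["ORG"], "points": [{"start": 15, "end": 18, "text": "Acme"}]},
    ],
})

converted = ner_json_to_spacy([sample_line])
print(converted)
```

The only real transformation is the end-offset shift; everything else is a reshaping of the same information.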
Pros:
- Free forever.
- Comprehensive tool for most annotation needs.
- Support for a wide variety of text formats, including PDF.
- Cloud and local installation.
- Own API for continuous model training.
Cons:
- Proprietary export format may add friction.
- As of March 2020, support for Dataturks appears to have ceased, possibly due to the company’s acquisition by Walmart Labs.
We introduced Label Studio’s capabilities for image annotation in our Image Annotation Tools article, which you can read here. Released in August 2019, Label Studio is an open-source, multi-type data annotation tool written completely in Python. The San Francisco-developed tool offers a no-brainer UI that is fully customizable and simple to work with. To get started with labeling any kind of data, the first step is to configure the tool for the desired purpose. Label Studio offers predefined configurations which you can modify according to your needs, or you can set them up from scratch.
In this scenario, we tested the NER tagging configuration and the setup is incredibly straightforward: simply modify the names of the tags you wish to have on their customization panel and you’re done. Getting to work is as simple as importing a file in one of their supported formats (.csv, .json, .tsv).
Alternatively, you can also import files by providing a URL that contains the desired assets.
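For the file-based import, preparing the input can be as simple as writing your raw texts to a one-column CSV. The Python sketch below does this with the standard `csv` module. The column name `text` is an assumption: it has to match whatever variable name your labeling configuration expects, so adjust it accordingly. The sample sentences and the helper name `build_import_csv` are made up.

```python
import csv
import io
from typing import Iterable


def build_import_csv(texts: Iterable[str], column: str = "text") -> str:
    """Return CSV content with a header row and one text per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column])
    for text in texts:
        # The csv module quotes commas, quotes and newlines as needed.
        writer.writerow([text])
    return buffer.getvalue()


# Hypothetical sentences to label.
sentences = [
    "Maria works at Acme.",
    'She said, "Prague is lovely in spring."',
]

csv_content = build_import_csv(sentences)

# Write the content to disk; newline="" avoids doubled line endings on Windows.
with open("tasks.csv", "w", newline="", encoding="utf-8") as handle:
    handle.write(csv_content)
```

Letting the `csv` module handle quoting avoids broken rows when a sentence contains commas or quotation marks.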
As with images, the assets to tag seem to appear in no particular order, and going back to an already tagged asset is not possible unless you code that option into the tool’s UI beforehand.
All in all, Label Studio appears to be an excellent tool for data annotation, especially for small businesses that wish to produce high-quality results on smaller budgets.
Pros:
- Open source.
- Customizable UI.
- Quick, easy setup.
- Mobile friendly.
Cons:
- Assets for labeling appear unordered.
- No simple way of returning to an already labeled asset.
- No out-of-the-box statistics available.
Also worth considering: Labelbox
While we haven’t had a chance to test Labelbox’s functionalities yet, they appear to have a solid platform and seem to have garnered a reputable following among top companies. We have reached out to them to request a demo and will update the article once we have had it. Stay tuned!
Our pick - Prodigy
Initially, our top pick was set to be Dataturks, since it isn’t limited to text annotation but provides a robust platform for images and video as well. However, the company was acquired and support appears to have been discontinued.
Therefore, the runner-up, Prodigy, becomes our pick. Prodigy has the spaCy reputation to back it up, and its active learning capabilities combined with its sleek, easy-to-use interface make it worth your while. While Prodigy is a rather pricey option, the price is a one-time fee for a lifetime license, so it might well be worth the investment.
Is there a tool you would have liked to see here? Let us know!