I was recently exploring spaCy for some NLP work, and found that the default model was not sufficient for tagging entities in the domain I was exploring. The documentation was very helpful in explaining how I could train the statistical model of the named entity recognizer, but I needed training and evaluation data.
While I could tag them manually, I felt that I needed a method/tool to do it in an efficient manner – especially if I was going to be tagging hundreds of lines of data. My key considerations were:
- Free/low-cost (it's a hobby project)
- Simple (no need for complex features that are not needed)
- Data is kept private/confidential
I first looked at existing solutions and found many, but none of them fit my criteria. Some examples are below: (Note: this is not an exhaustive list, nor an endorsement/recommendation; it's just some interesting ones I found during my research)
Eventually, I decided to make my own named entity tagger using Python. If you are new to my blog, I like to keep my solutions/tools simple and basic, avoiding dependencies on other packages so as to reduce the chances of them “breaking” down the road because of changes in external packages/modules/services.
I also tend to avoid using GUIs so that it can be run on a wider range of platforms (e.g. cloud instances, lightweight VMs, Docker containers) that do not have a window manager or desktop interface (reason: occasionally cheaper, and faster).
Thus, I present to you TeBaC-NET, which stands for “Text Based Custom Named Entity Tagger”. It is cross platform, runs on the command line / terminal, and uses Python 3.6 with the os modules that come pre-installed by default (i.e. no need to pip install anything else).
GitHub link: https://github.com/davidloke/tebac-net
Instructions on how to use it can be found in the README.
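To give a feel for the kind of data a tagger like this needs to produce, here is a minimal sketch of turning a tagged phrase into the (text, entities) tuple format used by spaCy v2-style NER training examples. It uses only the standard library. Note that `make_example` is a hypothetical helper written for illustration; it is not taken from TeBaC-NET, whose actual output format may differ (check the README).

```python
def make_example(text, annotations):
    """Build a spaCy v2-style training tuple from (phrase, label) pairs.

    Hypothetical helper for illustration only; not part of spaCy or TeBaC-NET.
    Only the first occurrence of each phrase in the text is tagged.
    """
    entities = []
    for phrase, label in annotations:
        start = text.find(phrase)  # character offset of the first match
        if start == -1:
            raise ValueError("phrase not found in text: %r" % phrase)
        # Entities are (start_char, end_char, label), with end exclusive
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


example = make_example("Uber blew through $1 million a week", [("Uber", "ORG")])
print(example)
# ('Uber blew through $1 million a week', {'entities': [(0, 4, 'ORG')]})
```

Working with character offsets by hand is exactly the error-prone part of manual tagging, which is why having a tool compute them is worthwhile even for a few hundred lines.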
In my next TeBaC-NET post, I will talk more about how it is used, and some of the UX considerations I had in mind when developing it. If you are interested in using it, collaborating, or requesting a feature, please leave a comment, or send me an email via the “contact” page.