When you need to search a text for terms and replace them with others, a standard step in most data cleaning jobs, you usually reach for regular expressions. They do the job, but sometimes the number of terms you need to look for runs into the thousands, and regular expressions become slow. FlashText is a better alternative for this task.
FlashText in a nutshell
FlashText is a very fast library that can cut the time for such jobs from days to minutes. It is a Python library created specifically to search for and replace words in a document. FlashText takes a word or a list of words, which it calls keywords, and a string. The keywords are then searched for or replaced in that string. When keywords are passed to FlashText, they are stored in a trie data structure, which is very efficient for retrieval tasks.
In the author's initial benchmark, FlashText cut the run time of the entire operation by a huge margin: from 5 days to 15 minutes. The beauty of FlashText is that its run time stays almost the same regardless of the number of search terms, whereas the run time of a regular expression grows almost linearly with the number of keywords.
When searching, FlashText returns the keywords that are present in the string. When replacing, it creates a new string with the keywords replaced. Both operations take place in a single pass over the input: the text is read once from left to right, no matter how many keywords there are.
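The single-pass idea can be sketched as follows. This is a simplified illustration under stated assumptions, not FlashText's implementation: the names `build_trie`, `scan`, `find_keywords` and `replace_keywords` are hypothetical, matching is case-sensitive, and a word character is assumed to be a letter, digit or underscore. The text is walked from left to right, and at each word start the trie is followed as far as it goes, keeping the longest keyword that ends at a word boundary.

```python
# Illustrative sketch only; this is not FlashText's source code.
_END = "_end_"  # hypothetical marker key: a keyword ends at this node


def build_trie(keywords):
    """Store each keyword character by character in nested dicts."""
    trie = {}
    for word, clean_name in keywords.items():
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = clean_name
    return trie


def _is_word_char(ch):
    # Assumption: letters, digits and underscore belong to a word.
    return ch.isalnum() or ch == "_"


def scan(text, trie):
    """Walk `text` left to right and yield (start, end, clean_name)
    for the longest keyword match at each word boundary."""
    i, n = 0, len(text)
    while i < n:
        # Only attempt a match where a word begins.
        if _is_word_char(text[i]) and (i == 0 or not _is_word_char(text[i - 1])):
            node, j, best = trie, i, None
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                # Accept only if the keyword ends where the word ends.
                if _END in node and (j == n or not _is_word_char(text[j])):
                    best = (j, node[_END])
            if best is not None:
                yield i, best[0], best[1]
                i = best[0]  # jump past the matched keyword
                continue
        i += 1


def find_keywords(text, trie):
    """Return the clean names of all keywords found in `text`."""
    return [name for _, _, name in scan(text, trie)]


def replace_keywords(text, trie):
    """Build a new string with every keyword replaced by its clean name."""
    parts, last = [], 0
    for start, end, name in scan(text, trie):
        parts.append(text[last:start])
        parts.append(name)
        last = end
    parts.append(text[last:])
    return "".join(parts)


trie = build_trie({"Big Apple": "New York", "Bay Area": "California", "java": "Java"})
text = "I love Big Apple and Bay Area, not javascript."
found = find_keywords(text, trie)
replaced = replace_keywords(text, trie)
print(found)
print(replaced)
```

Note that "javascript" is left alone even though "java" is a keyword, because the match does not end at a word boundary. The cost of the scan depends on the length of the text rather than on the number of keywords, which is the property the benchmark above relies on.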
FlashText is a testament to the importance of algorithm and data structure design, showing that even for simple problems, a better algorithm can easily outperform a naive approach running on the fastest processors. FlashText is an efficient library for searching and replacing keywords in millions of documents. If you work in NLP and text cleaning and modification are part of your everyday work, this library is well worth trying.