Python: Inverted Index for dummies
An Inverted Index is an index data structure storing a mapping from content, such as words or numbers, to its document locations and is generally used to allow fast full text searches.
The first step of Inverted Index creation is Document Processing In our case is word_index() that consist of word_split(), normalization and the deletion of stop words (“the”, “then”, “that”…).
1 | def word_split(text): |
word_split() is quite a long function that does a really simple job split words. You can rewrite it with just one line using something like re.split(‘\W+’, text).
1 | def words_cleanup(words): |
Cleanup and Normalize are just to function filters to apply after word_split(). In this case cleanup remove the stopwords (frozenset of strings) and normalize convert word in lower case. But you can do something more like removing accents, transform to singular or something else.
1 | def word_index(text): |
word_index() is just an helper, take an input text and does all the word splitting/normalization job and the result is a list of tuple that contains position and word. [(1, u’niners’), (13, u’coach’)].1
2
3
4
5
6def inverted_index(text):
inverted = {}
for index, word in word_index(text):
locations = inverted.setdefault(word, [])
locations.append(index)
return inverted
Finally we’ve our invertex_index() method that take a text as input and returns a dictionary with words as keys and locations (position of the words in the document) as values.
1 | def inverted_index_add(inverted, doc_id, doc_index): |
The Previous method, inverted_index(), returns a dictionary with just the information for the specified document. So inverted_index_add() add Document’s inverted index to a Multi-Document Inverted Index. Here we’ve words that are keys of the dictionary and values are dictionary with doc_id as key and document location as values. {‘week’:{‘doc2’: [149], ‘doc1’: [179, 187]}}.
1 | def search(inverted, query): |
Now that we’ve the inverted index, we’re able to do queries on it. The function above takes a Multi-Document Inverted index and a query, and returns the set of documents that contains all the words that you’ve searched.
Obviously to make a serious search function you need to add ranking, phrase matches, stemming and other features of full-text search. And you need to write your dictionary using an on disk btree. But this is a basis example, of how to build an inverted index.
Source code can be found on my GitHub repository under py-inverted-index:
git clone http://github.com/matteobertozzi/blog-code.git
Posted by Matteo Bertozzi at 10:07 PM