Some of the developers at match2blue are creating a text-interest-matcher. Leaving buzzword bingo aside, that means the software calculates whether a text is interesting based on a user's interests. So basically you, as a user, enter some interests and are presented pieces of text in order of their relevance. You can also think of it as text classification into either good or bad.
This software has become quite complicated, because it needs some kind of semantic knowledge about the interests. But there are different methods of text classification, and I was curious how hard it would be to write the simplest possible text classifier that still gives decent results. Well, it turns out to be remarkably simple: creating the classifier took less than two hours. Here is the source code:
import os

j = os.path.join


def train(text):
    """
    Train a dictionary with the given text.

    Returns a dictionary of dictionaries that describes the
    probabilities of all word-following occurrences in the text.
    For example, the string "a test" gives this result:

    >>> train("a test")
    {'': {'a': 1}, 'a': {'test': 1}}

    Meaning that the empty string '' has been followed once by 'a'
    and 'a' has been followed by 'test' as well. A longer example
    leads to a more complex dictionary:

    >>> train("this is a test oh what a test")
    {'': {'this': 1}, 'a': {'test': 2}, 'what': {'a': 1}, 'oh': {'what': 1}, 'this': {'is': 1}, 'is': {'a': 1}, 'test': {'oh': 1}}
    """
    c = {}
    lastword = ""
    for word in text.split():
        word = word.lower()
        if lastword in c:
            # We have seen lastword before: count this follow-up.
            inner = c[lastword]
            if word in inner:
                inner[word] += 1
            else:
                inner[word] = 1
        else:
            c[lastword] = {word: 1}
        lastword = word
    return c


def probability_of(category, lastword, word):
    """
    Helper function for calculating the probability of word
    following lastword in the given category.

    >>> category = train("this is a test")
    >>> probability_of(category, "a", "test")
    1.0
    >>> probability_of(category, "any", "words")
    0
    """
    word = word.lower()
    if lastword in category:
        inner = category[lastword]
        sumvalues = sum(inner.values())
        if word in inner:
            return inner[word] / (sumvalues * 1.0)
    return 0


def classify(text, category):
    """
    Returns the probability that a text is from the given category.

    For every pair of words the probability_of value is calculated;
    the values are summed up and divided by the number of words in
    the text.

    >>> category = train("this is a test")
    >>> classify("a test with some words", category)
    0.2
    >>> classify("just writing test or a doesn't improve the ranking", category)
    0.0
    """
    lastword = ""
    probabilities = 0
    for word in text.split():
        probabilities += probability_of(category, lastword, word)
        lastword = word
    return probabilities / (len(text.split()) * 1.0)


if __name__ == "__main__":
    # Calculate the category that the text in ../test matches best.
    ranking = []
    os.chdir("..")
    for filename in os.listdir("categories"):
        trained = train(open(j("categories", filename)).read())
        value = classify(open("test").read(), trained)
        print "test is", filename, "with", value, "% probability"
        ranking.append((value, filename))
    ranking.sort()
    print
    print "The test text is probably from", ranking[-1][1]
    print "(second guess is", ranking[-2][1] + ")"
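By the way, the docstrings double as tests. Assuming the file is saved as classify.py (as in the checkout described below), you can run all docstring examples with Python's doctest module:

import doctest
import classify

# Execute every docstring example in the module and report failures.
doctest.testmod(classify)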
How does it work? There are two very simple steps. First, the classifier has to be trained with existing text files. The result is a dictionary that consists of many inner dictionaries. Let's feed it some text to see what happens.
In [5]: train("a test a good test a good good test")
Out[5]: {'': {'a': 1}, 'a': {'good': 2, 'test': 1}, 'good': {'good': 1, 'test': 2}, 'test': {'a': 2}}
In the result we can see that 'a' was followed three times: twice by 'good' and once by 'test'. This is all we need for classifying. Now we can apply the classify function. It walks through the text to classify and looks for known word follow-ups. Whenever it finds one, the probability of that follow-up is added. So, in our example the probability of 'a good' is 2/3 and the probability of 'a test' is 1/3.
In [2]: a = train("a test a good test a good good test")

In [3]: classify("is it a good test", a)
Out[3]: 0.26666666666666666

In [4]: classify("text good but different style", a)
Out[4]: 0.0
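To see where the 0.26666... comes from: of the five word pairs in "is it a good test", only 'a good' and 'good test' are known follow-ups, each with probability 2/3. Continuing the session:

In [5]: (2/3.0 + 2/3.0) / 5
Out[5]: 0.26666666666666666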
The first example shares both words and word order with the trained text. The second one also contains a trained word ('good'), but none of its word pairs ever occurred in the training text. Therefore, the probability of this text being from category a is 0.0.
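The effect of word order can be made even clearer: the following text consists only of words from the training text, but no pair of them ever occurred in this order, so it scores 0.0 as well.

In [6]: classify("test good a", a)
Out[6]: 0.0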
If you want to try it with some longer texts, you can download the classifier from Google Code. It is easy to create your own categories: just add a file to the categories folder. Make sure you have a decent amount of data, though; three sentences are not enough for a good classification.
In my tests I got quite good results, even for telling apart authors writing about the same topic.
cd classify/src/
python classify.py
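You can also call the two functions directly instead of going through the categories folder. A minimal sketch, run from the classify directory (the category and file names here are made up for illustration):

from classify import train, classify

# Hypothetical files: one text file per category, plus an unknown text.
hemingway = train(open("categories/hemingway").read())
unknown = open("some_unknown_text.txt").read()

print "similarity to hemingway:", classify(unknown, hemingway)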