Building a very easy text classifier in python

0 Comments

Some of the developers at match2blue are creating a text-interest-matcher. Leaving buzzword bingo aside, that means the software calculates whether a text is interesting based on users’ interests. So basically you, as a user, have to enter some interests and will be presented some pieces of data in order of their relevance. You can also think of it as text classification into either good or bad.

This software has become quite complicated, because it is necessary to have some kind of semantic knowledge about the interests. But there are different methods of text classification. I was curious about how hard it is to write the most simple text classifier that gives you decent results. Well, turns out it is remarkably simple. Creating the classifier took less than two hours. And here is the source code:

import os
j = os.path.join

def train(text):
	"""
	Train a dictionary with the given text. Returns a dictionary of dictionaries,
	that describes the probabilities of all word-folling-ocurrences in the text.

	For example, the string "a test" gives this result:
	>>> train("a test")
	{'': {'a': 1}, 'a': {'test': 1}}
	Meaning that the empty string '' has been followed once by 'a' and 'a' has been
	followed by 'test' as well.

	A longer example leads to a more complex dictionary:
	>>> train("this is a test oh what a test")
	{'': {'this': 1}, 'a': {'test': 2}, 'what': {'a': 1}, 'oh': {'what': 1}, 'this': {'is': 1}, 'is': {'a': 1}, 'test': {'oh': 1}}
	"""
	c = {}
	lastword = ""
	for word in text.split():
		word = word.lower()
		if c.has_key(lastword):
			inner = c[lastword]
			if inner.has_key(word):
				inner[word] += 1
			else:
				inner[word] = 1
		else:
			c[lastword] = {word: 1}
		lastword = word
	return c

def probability_of(dict, lastword, word):
	"""
	Helper function for calculating the probability of word following lastword
	in the category given by dict.

	>>> category = train("this is a test")
	>>> probability_of(category, "a", "test")
	1.0

	>>> probability_of(category, "any", "words")
	0
	"""
	word = word.lower()
	if dict.has_key(lastword):
		inner = dict[lastword]
		sumvalues = sum([v for v in inner.values()])
		if inner.has_key(word):
			return inner[word] / (sumvalues * 1.0)
	return 0

def classify(text, dict):
	"""
	Returns the probability that a text is from the given category. For every pair of
	words the probability_of value is calculated, summarized and divided by the amount
	of words in the text.

	>>> category = train("this is a test")
	>>> classify("a test with some words", category)
	0.2

	>>> classify("just writing test or a doesn't improve the ranking", category)
	0.0
	"""
	lastword = ""
	probabilities = 0
	for word in text.split():
		probabilities += probability_of(dict, lastword, word)
		lastword = word
	return probabilities / (len(text.split()) * 1.0)

if __name__ == "__main__":
	"""
	Calculate the category, that the text in ../test matches best.
	"""
	ranking = []
	os.chdir("..")
	for file in os.listdir("categories"):
		trained = train(open(j("categories", file)).read())
		value = classify(open("test").read(), trained)
		print "test is", file, "with", value, "% probability"
		ranking.append((value, file))

	ranking.sort()
	print
	print "The test text is probably from", ranking[-1][1]
	print "(second guess is", ranking[-2][1] + ")"

How does it work? There are two very simple steps it does. First, the classifier has to be trained with existing textfiles. The result is a dictionary that consists of many inner dictionaries. Let’s feed it with some text to see what happens.

In [5]: train("a test a good test a good good test")

Out[5]:
{'': {'a': 1},
 'a': {'good': 2, 'test': 1},
 'good': {'good': 1, 'test': 2},
 'test': {'a': 2}}

In the result we can see that three words followed ‘a’. Two times it was ‘good’ and one time ‘test’. This is all we need for classifying. Now we can apply the classify function. It goes through the text to classify and looks for known word follow-ups. If there is a known one, the probability of this follow-up is added. So, in our example the probability of ‘a good’ is 2/3 and for ‘a test’ it is 1/3.

In [2]: a = train("a test a good test a good good test")

In [3]: classify("is it a good test", a)

Out[3]: 0.26666666666666666

In [4]: classify("text good but different style", a)

Out[4]: 0.0

The first example has similar words and ordering as the trained text. The second one also has some exact same words, but they are in a different order. Therefore, the probability of this text beeing a is 0.0.

If you want to try it with some longer text, you can download the classifier from Google Code. It is easy to create your own categories by adding a file in the appropriate folder. Just make sure you have a decent amount of data. Three sentences are not enough for a good classification.

In my tests I got quite good results even for classifying authors writing about the same topic.

cd classify/src/
python classify.py

Related Posts

A Pypy Runtime for AWS Lambda

Python is a great language for the on-demand style of Lambda, where startup time matters. In terms of execution speed, there are better choices available. Where computational performance matters one improvement is to use Pypy, the Python interpreter…

Paragliding data gems

Paragliding is my beloved hobby and besides offering stunning views and perfect days outside, it also provides a huge amount of flight data to process and play around with. Sites like xc.dhv.de, XContest contain millions…

Calculate wind from thermals

This post is a followup to the last one about Paragliding data gems. We have collected lots of flights and their GPS location data. From this, several million thermals were extracted and shown on a…