Ever wondered how you can automate the process of figuring out whether a webpage is about a particular topic? I’ve spent some time recently on a side project that involved solving a flavor of that exact problem. I’ve seen several related questions on Stack Overflow and other sites, so I thought I would throw together a quick post to describe bits of my implementation.
For our example, let’s suppose that we need to implement something that exposes an API like the following:
- boolean classifyWebpage(Webpage webpage)
- void trainClassifier(Map < Webpage, boolean > trainingExamples)
We will mandate that consumers call the function to train the classifier once with a training set before we can call the function to evaluate whether a webpage is about our topic. Our train classifier function will take a bunch of webpages and whether or not they are about the given topic, to use as training examples. Our classify webpage method will take a webpage and it returns true if the webpage is about the topic and false if it isn’t. To achieve this, we’ll implement a few helper functions:
- String cleanHtml(String html)
- Map < String, int > breakTextIntoTokens(String text)
- Map < String, float > getTokenValues(Map < String, int > tokenCounts)
Let’s look at how we can implement some of these pieces in detail.
Cleaning up HTML
The first piece of infrastructure that we’ll want to build is something that strips markup from an HTML string and splits it into tokens, because words like “href” and “li” are about formatting and aren’t part of the true document content. A naive but decently effective and low cost way to this is to use regular expressions to strip out everything in the contents between script and style tags, and then everything between < and >. We’ll also want to replace things like non-breaking space characters with literal spaces. Assuming that we’re working with fairly conventional webpage layouts, the blob of text that we’re left with will include body of the webpage plus some noise from things like navigation and ads. That’s good enough for our purposes, so we’ll return that and make a mental note that our classification algorithm needs to be good at ignoring some noise.
Break Text into Tokens
Once we have clean text, we’ll want to break it into tokens by splitting on spaces or punctuation and storing the results in a data structure with the number of occurrences of each token. This gives us a handy representation of the document for doing a particular kind of lexicographical analysis to bubble up the words that matter the most. Again, regular expressions are our friend.
Find the Keywords
Armed with a map of tokens and count of occurrences for each token, we want to build something that can pick the keywords for the document. Words like “the” and “to” don’t provide any clues about what a document is about, so we want to find a way to focus on keywords. The important words in a document are likely to be repeated, and they’re also not likely to be found often in most other documents about different topics. There’s a nifty algorithm called Term Frequency Inverse Document Frequency that is both easy to implement and does a great job find keywords by comparing the frequency of words in a single document with the frequency of words in a corpus of documents.
To make this work we’ll need to start by building a corpus. One option is to bootstrap by crawling a bunch of websites and running the entire set through the our initial function for cleaning and tokenizing. If we’re going to go this route we need to be sure that we’ve got a good mix of pages and not just ones about our topic, otherwise the corpus will be skewed and it will see things that should be keywords as less valuable. A better option in most cases is to use an existing corpus , assuming that one is available for the desired language, and manipulate it into whatever format we want to use for classification.
Classify a Webpage based on Keywords
The next bit is the secret sauce. We know that given any webpage we can extract keywords by doing some prep work and then comparing against a corpus, but given those keywords we need to decide whether a webpage is about a given topic. We need to pick an algorithm that will give us a boolean result that tells us whether a webpage is about our topic. Keep in mind that while we’re setting up our algorithm we have some training examples to work with where we’re given a class, in other words we know whether they are about the topic or not.
The first option that most people think of is to come up with a mathematical formula to tell whether a webpage matches a topic. We could start by boiling the problem down to how well two specific webpages match each other by coming up with a mathematical formula to compare two webpages based on similar keywords. For example we could compute a running similarity total, adding to it the product for the ranking values in each respective page for keywords that match. The result would be a scalar value, but we could convert it to a boolean value by coming up with some arbitrary threshold based on experimentation and saying that pages with similarity over our threshold are indeed about the same topic. In practice, this actually works decently well with some exceptions. With these building blocks we could figure out whether a given webpage is about a topic by finding how similar it is to webpages in our training set that are about that topic versus ones that aren’t, and making a decision based on which group has a higher percentage of matches. While it may be effective, but it has several flaws. First, like Instance Based Learning it requires comparison with the training set during classification which is slow at runtime because we have to consider many permutations. More significantly, we would have applied a human element to the algorithm by defining the threshold for a match, and humans aren’t very good at making these kind of determinations because they can’t process certain kinds of data as quickly as a computer can.
Using machine learning, we can enlist the help of a computer to apply the data to a particular learner that will output a classifier for the domain. Frameworks like Weka offer all kinds of learners that we can try use out of the box with our training data to create classifiers. For example Naive Bayes is an example of one such algorithm that tends to do a great job with text classification. If we use our words with weights as attributes and each website in the training set as an example to train on, a Naive Bayes learner will find probabilistic correlation between the occurrence of words and the topic of a webpage and will output a classifier that is likely to give more accurate results than any algorithm that a human could come up with in a reasonable amount of time.
Wiring it Up
So how do we wire these pieces together, and what does it look like to consume the finished product? Let’s suppose that we want to be able to tell whether a website is about soccer. We start by creating a whitelist of websites that we know produce editorial content about soccer. We’ll also want to create a blacklist of sites that produce content about world news, technology, rock and roll, ponies, and anything that isn’t soccer. We throw together a function that crawls the sites and for each example we infer a class based on the source (we may be wrong in some edge cases, but in general we’re going to assume that the sites in our whitelist/blacklist are producing content that is or isn’t soccer across the board). We run the webpages through our cleaning, tokenizing, and ranking functions and we end up with training examples that look like the following contrived ones:
- foo.com – True. Manchester (.85), Rooney (.75), United (.64), match (.5).
- bar.com – False. Muse (.9), Bellamy (.72), guitar (.72), cool (.48), show (.43).
Getting a Weka to speak our language may require massaging the examples into ARFF or some format that the framework understands, but at this point we can directly apply the training set to the learner. For subsequent webpages we run them through the same functions to get ranked keywords, and then we pass the new example into the classifier and we’re given a boolean result. Magic.
Note that we only used the words in the body of a webpage, but in the real world we would have access to more data. We could potentially look at the hints provided in the page title, meta tags, and other tags like heading/bold or bigger font sizes and make some assumptions about the importance of the words (of course we have to do this before stripping tags out). If we could get our hands on link text for sites that link to the article that could also serve as valuable input, although search engines aren’t making it as easy to access link data from their indexes these days. We could use this additional data to either augment our Naive Bayes learner using arbitrary weights, or we can use more complex learners like a Perceptron or a Support Vector Machine to try to let the computer decide how important we should consider these other inputs to be. It’s certainly possible that for some topics other kinds of learners may produce better results. Or we could investigate ways to use learners in combination (via Bagging or Boosting, for example) to get better accuracy than any single learner.
Classifying webpages by topic is a fairly common example of a problem that can be solved by an algorithm created by either a human or a computer. My aim in this post was to provide a quick look at one way to attack the problem and to touch on some very introductory machine learning concepts. If you’re interested in reading more about machine learning specifically there are countless resources online and some great books available on the subject. Hope you found the post helpful. Classify away, and I’ll certainly look forward hearing back from any readers who have tackled similar challenges or want to provide feedback!