Naive Bayes: Spam or ham?

Hi,

I haven’t posted here before. I have a B.S. in computer science engineering with an interest in natural language processing and machine learning, and I am posting here to ask questions and document my development in those fields.

In my current project I am trying to implement my first spam filter using a naive Bayes classifier. I am using the data provided by the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The data is a table of features corresponding to a few thousand spam and non-spam (ham) messages, so my features are limited to those provided by the table.

My goal is to implement a classifier that can calculate \(P(S \mid M)\), the probability that a message is spam. So far I have been using the following equation to calculate \(P(S \mid F)\), the probability of spam given a single feature:

$$ P(S \mid F) = \frac{P(F \mid S)}{P(F \mid S) + P(F \mid H)} $$

where \(P(F \mid S)\) is the probability of the feature given spam and \(P(F \mid H)\) is the probability of the feature given ham. I am having trouble bridging the gap from the per-feature \(P(S \mid F)\) values to \(P(S \mid M)\), where a message is simply a bag of independent features.
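To make the notation concrete, here is a simplified sketch of how I picture estimating these per-feature probabilities from counts (not my exact code; the Laplace smoothing and the present/absent treatment of each feature are simplifying assumptions on my part):

```python
from collections import Counter

def per_feature_probs(training_data, laplace=1.0):
    """training_data: list of (features, label) pairs, where features is a
    set of feature names and label is 'spam' or 'ham'.
    Returns, for each feature F, estimates of P(F|S), P(F|H), and the
    P(S|F) from the formula above (equal priors assumed)."""
    counts = {'spam': Counter(), 'ham': Counter()}
    totals = {'spam': 0, 'ham': 0}
    for features, label in training_data:
        counts[label].update(features)
        totals[label] += 1
    probs = {}
    for f in set(counts['spam']) | set(counts['ham']):
        p_f_s = (counts['spam'][f] + laplace) / (totals['spam'] + 2 * laplace)
        p_f_h = (counts['ham'][f] + laplace) / (totals['ham'] + 2 * laplace)
        probs[f] = {'P(F|S)': p_f_s,
                    'P(F|H)': p_f_h,
                    'P(S|F)': p_f_s / (p_f_s + p_f_h)}
    return probs
```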

At a glance I want to just multiply the per-feature probabilities together, but that makes most of the numbers very small, and I am not sure whether that is normal.
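Concretely, the multiplication I have in mind looks something like the sketch below, where I work with logarithms so the products do not shrink toward zero (again a simplification, assuming independent features and equal priors; `p_f_given_spam` and `p_f_given_ham` are dictionaries mapping each feature to \(P(F \mid S)\) and \(P(F \mid H)\)):

```python
import math

def spam_probability(message_features, p_f_given_spam, p_f_given_ham):
    """Combine per-feature likelihoods into P(S|M) for one message.
    Summing logs is equivalent to multiplying probabilities but avoids
    underflow when there are many features."""
    log_s = 0.0
    log_h = 0.0
    for f in message_features:
        if f in p_f_given_spam and f in p_f_given_ham:
            log_s += math.log(p_f_given_spam[f])
            log_h += math.log(p_f_given_ham[f])
    # P(S|M) = prod P(F|S) / (prod P(F|S) + prod P(F|H)), computed stably
    m = max(log_s, log_h)
    return math.exp(log_s - m) / (math.exp(log_s - m) + math.exp(log_h - m))
```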

In short, these are the questions I have right now:
1.) How do I go from a set of \(P(S \mid F)\) values to a single \(P(S \mid M)\)?
2.) Once \(P(S \mid M)\) has been calculated, how do I define a threshold for my classifier?
3.) My feature set was fortunately selected for me; how would I go about selecting or finding a feature set of my own?

I would appreciate any input and will continue to post on the development of this project.

"Who am I?"

Who am I?

I’m about to finish my final year of graduate study at Cornell, having decided to quit the PhD program and go looking for life outside the academy. Research has been the center of my identity and activity these many years, and the decision to let it go has left me asking: who am I?

I decided to use topic modeling on my email history to see what insights I could glean into who I’ve been, if not who I am. The results are visualized here.


Photo credit: Joe Robertson

Probabilistic programming with undirected graphical models

“Probabilistic programming,” blogged Rob Zinkov, “has the potential to give machine learning to the masses.” I am living proof that this is true. Thanks to a probabilistic programming language, in spite of my lack of training in probability theory, machine learning, or even college-level math, I have successfully used machine learning techniques to model linguistic data and make predictions.

But the tools that I’ve found most helpful in coming to grips with machine learning are not discussed much in Zinkov’s excellent post. Those tools are undirected graphical models and the Markov Logic templating language.


Project Euler for language?

I recently gave a workshop on basic programming for linguistics students, in which participants with little or no programming experience got a taste of what can be done with a bit of Python. Of course NLTK featured prominently, and in response to a question about how to continue practicing I gestured at the NLTK book, which is a great tutorial and source of exercises for both Python and NLTK. I also mentioned Project Euler, which is a collection of number-theory-related programming problems.

Project Euler is (rather famously) a great place to find toy problems to practice with a new programming language. The first couple dozen are reasonably simple (problem 1: sum all the natural numbers below 1000 that are multiples of 3 or 5) and for each problem there is a forum in which others have posted their solutions. Solving them in a new language is good practice and good fun.
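For example, problem 1 fits in a single line of Python:

```python
# Project Euler problem 1: sum the multiples of 3 or 5 below 1000
print(sum(n for n in range(1000) if n % 3 == 0 or n % 5 == 0))
```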


Northumbrian Gloss of the Bible

Interlinear glossing with JavaScript and CSS

The interlinear gloss is a standard and very handy format for inline presentation of snippets of natural language data. In a gloss of this type, a chunk of object-language text is presented alongside arbitrarily many layers of annotation, with the contents of the annotations aligned to the relevant subparts of the object-language chunk.

Their standardness and ubiquity in print publications notwithstanding, interlinear glosses are not exactly straightforward to make for the web.
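To show the structure in miniature (leaving aside the web-rendering question for the moment), here is a toy Python sketch that pads each annotation layer so corresponding items line up in columns; the Latin example is just for illustration:

```python
def print_gloss(*layers):
    """Print an interlinear gloss: each layer is a sequence of strings, and
    corresponding items across layers are padded to a common column width."""
    widths = [max(len(layer[i]) for layer in layers)
              for i in range(len(layers[0]))]
    for layer in layers:
        print("  ".join(item.ljust(w) for item, w in zip(layer, widths)))

# Toy example (Latin), just to show the aligned-column idea
print_gloss(["puer",    "puellam",  "amat"],
            ["boy.NOM", "girl.ACC", "love.3SG"])
print("'The boy loves the girl.'")
```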


Everton: a meal in Chichimila, 1976

Kaufman & Justeson 2003, part 1

Terrence Kaufman and John Justeson’s 2003 Preliminary Mayan Etymological Dictionary (henceforth K&J03) is a remarkable data set. It collects and systematizes thousands upon thousands of cognate words from across the Mayan language family, gathered over the course of Kaufman’s long and eminent career as a field linguist.

It’s a work that cries out to be explored computationally. But how can we make it explorable?
