Hi,

I haven’t posted here before. I have a B.S. in computer science engineering with an interest in natural language processing and machine learning. I am looking to post here to ask questions and document my development in those fields.

In my current project I am trying to implement my first spam filter using a naive Bayes classifier. I am using the data provided by UCI’s Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The data is a table of features corresponding to a few thousand spam and non-spam (ham) messages. Therefore, my features are limited to those provided by the table.

My goal is to implement a classifier that can calculate \(P (S \mid M)\), the probability of being spam given a message. So far I have been using the following equation to calculate \(P(S \mid F)\), the probability of being spam given a feature.

$$P(S \mid F) = \frac{P(F \mid S)}{P(F \mid S) + P(F \mid H)}$$

where \(P(F \mid S)\) is the probability of the feature given spam and \(P(F \mid H)\) is the probability of the feature given ham. I am having trouble bridging the gap from knowing a set of \(P(S \mid F)\) values to \(P(S \mid M)\), where a message is simply a bag of independent features.
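For a single feature, my current calculation looks like this (a minimal sketch; the counts and function name are made-up placeholders, estimating \(P(F \mid S)\) and \(P(F \mid H)\) as simple frequencies from the training table):

```python
# Per-feature spamminess estimated from training counts.
def p_spam_given_feature(count_f_in_spam, n_spam, count_f_in_ham, n_ham):
    p_f_given_s = count_f_in_spam / n_spam  # P(F|S): how often the feature appears in spam
    p_f_given_h = count_f_in_ham / n_ham    # P(F|H): how often it appears in ham
    return p_f_given_s / (p_f_given_s + p_f_given_h)

# e.g. a feature seen in 300 of 500 spam messages and 20 of 1000 ham messages
print(p_spam_given_feature(300, 500, 20, 1000))  # ~0.968, a strongly spammy feature
```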

At a glance I want to just multiply the per-feature probabilities together, but that would make most numbers very small, and I am not sure if that is normal.
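Concretely, the multiplication I have in mind would look something like this (a sketch with made-up per-feature probabilities; as far as I know, summing logs instead of multiplying raw probabilities is the usual way to keep the numbers from becoming vanishingly small, and the final normalization is just my guess at turning the two products back into a probability):

```python
import math

# Combine per-feature spam probabilities, treating the features as independent.
# Working in log space avoids the very small products I was worried about.
def combine(p_s_given_f):
    log_spam = sum(math.log(p) for p in p_s_given_f)       # log of prod P(S|F_i)
    log_ham = sum(math.log(1 - p) for p in p_s_given_f)    # log of prod (1 - P(S|F_i))
    # Subtract the max before exponentiating for numerical stability,
    # then normalize: P(S|M) = spam / (spam + ham).
    m = max(log_spam, log_ham)
    spam = math.exp(log_spam - m)
    ham = math.exp(log_ham - m)
    return spam / (spam + ham)

print(combine([0.95, 0.8, 0.4]))  # ~0.981: two spammy features outweigh one hammy one
```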

In short, these are the questions I have right now.

1.) How do I get from a set of \(P(S \mid F)\) values to a single \(P(S \mid M)\)?

2.) Once \(P(S \mid M)\) has been calculated, how do I define a threshold for my classifier?

3.) Fortunately, my feature set was selected for me; how would I go about selecting or finding my own feature set?

I would appreciate any input and will continue to post on the development of this project.