Earlier, we discussed the introduction to spam filters and their types. To get an overview, follow this link.
Here, we’ll discuss some commonly used spam filtering algorithms.
Working of Spam Filters
Using Conventional Programming Methodologies:
- To begin, think about what spam looks like in general. You’ll note that certain terms or phrases (such as “Just for U,” “credit card,” “free,” and “wonderful”) appear frequently in the subject line. You might also notice other patterns in the sender’s name, the email body, and other parts of the email.
- You’d create a detection rule for each of the patterns you noticed, and your program would mark emails as spam if several of them were found.
- You’d test your program and repeat steps 1 and 2 until it was ready to go live.
Because the problem is complex, your program will most likely end up with a long list of complex rules that are tough to maintain.
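The rule-based approach above can be sketched in a few lines of Python. The patterns and the two-rule threshold are illustrative assumptions, not a production rule set:

```python
import re

# Each rule is a regex; a message is flagged as spam when several rules fire.
SPAM_PATTERNS = [r"\bfree\b", r"credit card", r"\bwonderful\b", r"just for u"]
THRESHOLD = 2  # flag when at least this many rules match

def is_spam(subject: str) -> bool:
    hits = sum(bool(re.search(p, subject.lower())) for p in SPAM_PATTERNS)
    return hits >= THRESHOLD

print(is_spam("Get your FREE credit card now"))  # two rules fire -> True
print(is_spam("Meeting agenda for Monday"))      # no rules fire -> False
```

Every new spam trick means another hand-written rule, which is exactly why this list becomes hard to maintain.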
By Using Machine Learning Techniques:
Spam filters based on machine learning use methods such as instance-based (memory-based) learning to recognize and classify incoming spam emails by their resemblance to training examples of spam. Academics and researchers have also developed many other email spam classification techniques that have been used effectively to group data. Probabilistic, decision tree, artificial immune system, support vector machine (SVM), artificial neural network (ANN), and case-based techniques are examples of these approaches.
Some algorithms for Email Spam Classification –
kNN Algorithm for Email Spam Filters
Algorithm:
1: Find Email Message class labels.
2: Input k, the number of nearest neighbors.
3: Input D, the set of test Email Messages.
4: Input T, the set of training Email Messages.
5: L, the label set of test Email Messages.
6: Read DataFile (TrainingData)
7: Read DataFile (TestingData)
8: for each d in D and each t in T do
9:   Neighbors(d) = {}
10:   if |Neighbors(d)| < k then
11:     Neighbors(d) = Closest(d, t) ∪ Neighbors(d)
12:   end if
13:   if |Neighbors(d)| ≥ k then
14:     assign d the majority class label among Neighbors(d)
15:   end if
16: end for
17: return Final Email Message Classification (Spam/Valid email)
18: end
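Here is a minimal, pure-Python sketch of the kNN steps above, using bag-of-words distances. The tiny training corpus and k = 3 are illustrative assumptions:

```python
import math
from collections import Counter

# Toy labeled training messages (the set T with class labels).
train = [
    ("win free money now", "spam"),
    ("free credit card offer", "spam"),
    ("cheap meds free shipping", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch tomorrow at noon", "ham"),
    ("project status update", "ham"),
]

def distance(a: str, b: str) -> float:
    """Euclidean distance between the word-count vectors of two messages."""
    ca, cb = Counter(a.split()), Counter(b.split())
    return math.sqrt(sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb)))

def knn_classify(message: str, k: int = 3) -> str:
    # Keep the k training messages closest to the test message ...
    neighbors = sorted(train, key=lambda t: distance(message, t[0]))[:k]
    # ... and assign the majority class label among the neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify("free money offer"))   # -> spam
print(knn_classify("meeting notes update"))
```

A real filter would use a much larger corpus and weighted or normalized features, but the neighbor-collection and majority-vote structure is the same.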
Naïve Bayes Classification Algorithm for Email Spam Filters
Algorithm:
1: Input Email Message dataset
2: Parse each email into its component tokens
3: Compute the probability for each token: S[W] = Cspam(W) / (Cham(W) + Cspam(W))
4: Store spamminess values in a database
5: for each message M do
6:   while (M not end) do
7:     scan message for the next token Ti
8:     query the database for spamminess S(Ti)
9:     compute the combined message probabilities S[M] and H[M]
10:     compute the total message filtering indication: I[M] = f(S[M], H[M])
11:     I[M] = (1 + S[M] − H[M]) / 2
12:     if I[M] > threshold then
13:       msg is labeled as spam
14:     else
15:       msg is labeled as non-spam
16:     end if
17:   end while
18: end for
19: return Final Email Message Classification (Spam/Valid email)
20: end
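The token-spamminess scheme above can be sketched directly. The tiny corpora, the geometric combination of token probabilities into S[M] and H[M], and the 0.5 threshold are illustrative assumptions:

```python
import math

spam_msgs = ["free money now", "free credit card", "win money free"]
ham_msgs = ["meeting notes", "lunch at noon", "project notes update"]

def counts(msgs):
    c = {}
    for m in msgs:
        for w in m.split():
            c[w] = c.get(w, 0) + 1
    return c

c_spam, c_ham = counts(spam_msgs), counts(ham_msgs)

def spamminess(w):
    """S[W] = Cspam(W) / (Cham(W) + Cspam(W)), with 0.5 for unseen tokens."""
    cs, ch = c_spam.get(w, 0), c_ham.get(w, 0)
    return cs / (cs + ch) if cs + ch else 0.5

def indicator(message):
    """I[M] = (1 + S[M] - H[M]) / 2, combining token spamminess geometrically."""
    probs = [spamminess(w) for w in message.split()]
    n = len(probs)
    s = 1 - math.prod(1 - p for p in probs) ** (1 / n)  # spam evidence S[M]
    h = 1 - math.prod(probs) ** (1 / n)                 # ham evidence H[M]
    return (1 + s - h) / 2

def classify(message, threshold=0.5):
    return "spam" if indicator(message) > threshold else "non-spam"

print(classify("free money offer"))      # -> spam
print(classify("project meeting notes")) # -> non-spam
```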
Perceptron Neural Network algorithm for Email Spam Filters
Algorithm:
1: Input sample email message dataset
2: Initialize w and b (to random values or to 0)
3: Find a training sample of messages (x, c) for which sign(wT x + b) ≠ c
4: if there is no such sample then
5:   training is completed;
6:   store the final w and stop
7: else
8:   update (w, b): w = w + cx,
9:   b = b + c
10:   go to step 3
11: end if
12: Determine the email message class by sign(wT x + b)
13: return Final Email Message Classification (Spam/Non-spam email)
14: end
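The perceptron training loop above fits in a few lines of plain Python. The two features (counts of "free" and "meeting") and labels (+1 spam, −1 ham) are illustrative assumptions:

```python
train = [
    ([2, 0], 1), ([1, 0], 1),    # spam-like samples
    ([0, 2], -1), ([0, 1], -1),  # ham-like samples
]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(samples, epochs=10):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        updated = False
        for x, c in samples:
            # Misclassified when sign(w.x + b) disagrees with the label c.
            if c * (dot(w, x) + b) <= 0:
                w = [wi + c * xi for wi, xi in zip(w, x)]  # w = w + c*x
                b += c                                     # b = b + c
                updated = True
        if not updated:  # no misclassified sample left: training is complete
            break
    return w, b

w, b = train_perceptron(train)

def classify(x):
    return "spam" if dot(w, x) + b > 0 else "non-spam"

print(classify([3, 0]))  # -> spam
print(classify([0, 3]))  # -> non-spam
```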
Firefly Algorithm for Email Spam Filters
Algorithm:
1: Input Email corpus with M number of features
2: Set k = 0
3: Get the firefly population size N
4: Get the number of attributes M
5: Initialize the firefly population
6: for each firefly do
7:   choose the firefly with the best fitness
8:   choose the corresponding features from the testing part of the email spam corpus
9:   test the email message
10:   k = k + 1
11:   update each firefly
12:   classify the email message as either spam or non-spam
13: end for
14: return Final Email Message Classification (Spam/Non-spam email)
15: end
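Here is the firefly-optimizer core of the loop above, stripped to its essentials. Each firefly is a candidate feature-weight vector; in a real filter the fitness would be classification accuracy on the training part of the corpus, but here `fitness` is a stand-in that rewards weights near a hypothetical optimum, and the constants N, M, ITERS, BETA0, GAMMA, and ALPHA are illustrative assumptions:

```python
import math
import random

random.seed(0)
N, M, ITERS = 8, 3, 30            # fireflies, features, iterations
BETA0, GAMMA, ALPHA = 1.0, 1.0, 0.1

def fitness(x):
    # Stand-in brightness: prefer weights close to [1, 0, 1].
    target = [1.0, 0.0, 1.0]
    return -sum((a - b) ** 2 for a, b in zip(x, target))

swarm = [[random.random() for _ in range(M)] for _ in range(N)]
for _ in range(ITERS):
    for i in range(N):
        for j in range(N):
            if fitness(swarm[j]) > fitness(swarm[i]):
                # Move firefly i toward the brighter firefly j; attraction
                # decays with squared distance, plus a small random step.
                r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                beta = BETA0 * math.exp(-GAMMA * r2)
                swarm[i] = [a + beta * (b - a) + ALPHA * (random.random() - 0.5)
                            for a, b in zip(swarm[i], swarm[j])]

best = max(swarm, key=fitness)
print([round(v, 2) for v in best])  # drifts toward the optimum [1, 0, 1]
```

For feature selection, the recovered weights would be thresholded to pick which attributes of the corpus to keep.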
Rough Set Algorithm for Email Spam Filters
Algorithm:
1: Input Email Testing Dataset (Dis_TE), Rules (RUL), b
2: for x ∈ Dis_TE do
3:   while RUL(x) = ∅ do
4:     suspicious = suspicious ∪ {x}
5:   end while
6:   let every r ∈ RUL(x) cast a vote in favor of the non-spam class
7:   predict the membership degree based on the decision rules:
8:   R = {r ∈ RUL(x) | r predicts non-spam}
9:   estimate Rel(Dis_TE | x ∈ non-spam):
10:   Rel(Dis_TE | x ∈ non-spam) = Σr∈R Predicts(non-spam)
11:   Certainty_x = (1/cer) × Rel(Dis_TE | x ∈ non-spam)
12:   while Certainty_x ≥ 1 − b do
13:     suspicious = suspicious ∪ {x}
14:   end while
15:   spam = spam ∪ {x}
16: return Final Email Message Classification (Spam/Non-spam/Suspicious email)
17: end
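The rule-voting idea above can be sketched as a three-way classifier. This is a simplified reading: the decision rules, the tolerance b, and the thresholding into spam/non-spam/suspicious are illustrative assumptions, whereas the original algorithm derives its rules from a discretized training set:

```python
B = 0.3  # tolerance: certainty >= 1 - B is needed to call a message non-spam

# Each decision rule: (predicate over the message, predicted class).
RULES = [
    (lambda m: "free" in m, "spam"),
    (lambda m: "credit" in m, "spam"),
    (lambda m: "meeting" in m, "non-spam"),
    (lambda m: "report" in m, "non-spam"),
]

def classify(message):
    fired = [label for cond, label in RULES if cond(message)]
    if not fired:                    # no rule matches: mark as suspicious
        return "suspicious"
    votes = fired.count("non-spam")
    certainty = votes / len(fired)   # fraction of firing rules voting non-spam
    if certainty >= 1 - B:
        return "non-spam"
    return "spam"

print(classify("weekly meeting report"))  # -> non-spam
print(classify("free credit offer"))      # -> spam
print(classify("hello there"))            # -> suspicious
```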
Support Vector Machine (SVM) algorithm for Email Spam Filters
Algorithm:
1: Input sample Email Message x to classify
2: Input a training set S, a kernel function, {C1, C2, …, Cnum} and {γ1, γ2, …, γnum}
3: Input the number of cross-validation folds k
4: for i = 1 to num do
5:   set C = Ci
6:   for j = 1 to num do
7:     set γ = γj
8:     produce a trained SVM classifier f(x) with the current parameter pair (C, γ)
9:     if f(x) is the first produced discriminant function then
10:       keep f(x) as the best SVM classifier f∗(x)
11:     else
12:       compare classifier f(x) and the current best SVM classifier f∗(x) using k-fold cross-validation
13:       keep the classifier with better accuracy
14:     end if
15:   end for
16: end for
17: return Final Email Message Classification (Spam/Non-spam email)
18: end
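The (C, γ) grid search above looks like the following sketch. Training a real SVM is out of scope for a few lines, so `cv_accuracy` is a stand-in for the k-fold cross-validated accuracy of an RBF-kernel SVM; its shape (peaking at C = 10, γ = 0.01) and both grids are illustrative assumptions:

```python
import itertools
import math

C_GRID = [0.1, 1, 10, 100]
GAMMA_GRID = [0.001, 0.01, 0.1, 1]

def cv_accuracy(C, gamma):
    # Hypothetical accuracy surface with a single best point at (10, 0.01).
    return 1.0 - 0.1 * abs(math.log10(C / 10)) - 0.1 * abs(math.log10(gamma / 0.01))

best, best_score = None, -1.0
for C, gamma in itertools.product(C_GRID, GAMMA_GRID):
    score = cv_accuracy(C, gamma)  # stands in for training f(x) with (C, gamma)
    if score > best_score:         # keep the classifier with better accuracy
        best, best_score = (C, gamma), score

print(best)  # -> (10, 0.01)
```

Swapping the stub for an actual trainer (e.g. an SVM library plus k-fold splitting) leaves the search loop unchanged.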
Decision Tree algorithm for Email Spam Filters
Algorithm:
1: Input Email Message dataset
2: Compute the entropy for the dataset
3: while condition do
4:   for every attribute/feature do
5:     calculate the entropy for all categorical values
6:     take the average information entropy for the current attribute
7:     calculate the gain for the current attribute
8:   end for
9:   pick the highest-gain attribute
10: end while
11: return Final Email Message Classification (Spam/Non-spam email)
12: end
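The entropy and information-gain computations in the loop above look like this in plain Python. The toy dataset (one boolean feature: does the message contain "free"?) is an illustrative assumption:

```python
import math
from collections import Counter

data = [(1, "spam"), (1, "spam"), (1, "ham"),
        (0, "ham"), (0, "ham"), (0, "spam")]  # (contains_free, label)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows):
    gain = entropy([label for _, label in rows])       # entropy of the dataset
    for v in set(x for x, _ in rows):                  # each categorical value
        subset = [label for x, label in rows if x == v]
        gain -= len(subset) / len(rows) * entropy(subset)  # weighted average
    return gain

print(round(entropy([label for _, label in data]), 3))  # 1.0 for a 3/3 split
print(round(info_gain(data), 3))
```

A full decision tree repeats this for every attribute, splits on the highest-gain one, and recurses on each branch.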
AdaBoost Algorithm for Email Spam Filters
Algorithm:
1: Input a corpus of email messages M
2: while condition do
3:   use the labeled message corpus M(labeled) to train the classifier
4:   use the classifier to test the M(unlabeled) messages and produce scores with a scoring function
5:   associate each message with its computed score
6:   label the messages with the lowest scores
7:   add the newly labeled messages to the M(labeled) corpus
8:   remove the newly labeled messages from the M(unlabeled) corpus
9: end while
10: train on the message corpus
11: given (x1, y1) … (xn, yn) ∈ St, where yi ∈ {0, 1}
12: initialize weights w1 … wf = 1/f, where f = number of features in an email message
13: for t = 1 to T do
14:   normalize the weights so that Σi wi = 1
15:   error ej = Σi wi |hj(xi) − yi|
16:   select the classifier hj with the least error
17:   update the weights: wt+1,i = wt,i βt^(1−ei), where ei = 0 if xi is classified correctly and 1 otherwise
18:   βt = et / (1 − et)
19:   αt = log(1/βt)
20: end for
21: return Final Email Message Classification (Spam/Non-spam email)
22: end
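The boosting rounds above can be sketched with decision stumps as the weak classifiers hj. The 1-D toy data (x = count of spammy words, y = 1 for spam, 0 for ham), the candidate thresholds, and the three rounds are illustrative assumptions:

```python
import math

data = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 1), (1, 1), (2, 0), (0, 0)]

def stump(threshold):
    return lambda x: 1 if x >= threshold else 0

def adaboost(rounds=3):
    n = len(data)
    w = [1 / n] * n
    classifiers = []
    for _ in range(rounds):
        s = sum(w)
        w = [wi / s for wi in w]               # normalize: sum_i w_i = 1
        best, best_err = None, float("inf")
        for t in range(5):                     # candidate stump thresholds
            h = stump(t)
            err = sum(wi * abs(h(x) - y) for wi, (x, y) in zip(w, data))
            if err < best_err:                 # classifier with the least error
                best, best_err = h, err
        best_err = max(best_err, 1e-10)        # guard against a perfect stump
        beta = best_err / (1 - best_err)       # beta_t = e_t / (1 - e_t)
        alpha = math.log(1 / beta)             # alpha_t = log(1 / beta_t)
        # Shrink the weights of correctly classified samples (e_i = 0).
        w = [wi * (beta if best(x) == y else 1) for wi, (x, y) in zip(w, data)]
        classifiers.append((alpha, best))
    return classifiers

def classify(x, classifiers):
    score = sum(a * h(x) for a, h in classifiers)
    half = sum(a for a, _ in classifiers) / 2
    return "spam" if score >= half else "non-spam"

clfs = adaboost()
print(classify(4, clfs))  # -> spam
print(classify(0, clfs))  # -> non-spam
```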
Random Forests Algorithm for Email Spam Filters
Algorithm:
1: Input X: the maximum number of nodes
2: Input N: the number of features in the Email Message
3: Input Y: the number of trees to be grown
4: while the termination conditions are not true do
5:   select a bootstrap sample S at random from the training corpus
6:   create tree Ri from the selected bootstrap sample S
7:   choose n features at random from the N features, where n ≪ N
8:   compute the optimal dividing point for node d among the n features
9:   divide the parent node into two offspring nodes using the optimal split
10:   execute steps 7–9 until the maximum number of nodes (X) is reached
11:   build the forest by repeating steps 5–10 Y times
12: end while
13: output the result of every created tree {Rt}, t = 1 … Y
14: run a new Email Message through every created tree, beginning at the root node
15: assign the Email Message to the group corresponding to the leaf node it reaches
16: merge the votes or results of every tree
17: return Final Email Message Classification: the group with the highest vote (G)
18: end
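Here is a bootstrap-and-vote skeleton of the loop above. To keep the sketch short, each "tree" is a single one-split stump over one randomly chosen feature; a real random forest grows deeper trees. The toy data (features = counts of "free" and "meeting", plus a label) and the forest size are illustrative assumptions:

```python
import random
from collections import Counter

random.seed(1)
data = [((3, 0), "spam"), ((2, 0), "spam"), ((2, 1), "spam"),
        ((0, 2), "ham"), ((0, 3), "ham"), ((1, 2), "ham")]

def majority(labels, sample):
    pool = labels or [y for _, y in sample]  # empty leaf: fall back to sample
    return Counter(pool).most_common(1)[0][0]

def grow_stump(sample):
    f = random.randrange(2)                  # choose n = 1 of N = 2 features
    def leaves(t):
        hi = majority([y for x, y in sample if x[f] >= t], sample)
        lo = majority([y for x, y in sample if x[f] < t], sample)
        return hi, lo
    def accuracy(t):
        hi, lo = leaves(t)
        return sum((hi if x[f] >= t else lo) == y for x, y in sample)
    t = max([1, 2, 3], key=accuracy)         # optimal dividing point
    return (f, t) + leaves(t)

def grow_forest(n_trees=25):
    trees = []
    for _ in range(n_trees):
        boot = [random.choice(data) for _ in data]  # bootstrap sample S
        trees.append(grow_stump(boot))
    return trees

def classify(x, trees):
    votes = Counter(hi if x[f] >= t else lo for f, t, hi, lo in trees)
    return votes.most_common(1)[0][0]        # group with the highest vote

forest = grow_forest()
print(classify((3, 0), forest))
print(classify((0, 3), forest))
```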
Convolutional Neural Networks for Email Spam Filters
Algorithm:
1: Input preprocessed Email Messages
2: Input parameter N
3: file = getfile()              // load the message corpus
4: label = getlabel(file)        // load the labeled messages
5: test = gettest(file)          // load the test Email Messages
6: vec = getword2vec()           // load the word vectors
7: random = random(label)        // shuffle
8: while condition do
9:   kf = CV(len(xshuffle), nf)  // cross-validation folds
10:   for trindex, teindex in kf do
11:     xtotal, ytotal = xshuffle[trindex], yshuffle[trindex]
12:     xtrain, xdev, ytrain, ydev = split(xtotal, ytotal)  // divide the data set
13:     for i < N do
14:       conv = getconv()       // convolution layer
15:       h = sigmoid(conv)
16:       N = getk()             // get the value of N
17:       tensorr = gettensor()
18:       for x, y in xtrain, ytrain do
19:         value, indice = topk(tensorr)  // get the Email Message feature and location information
20:         tensors = get(value, indice)   // get the corresponding tensor
21:         tensora = append(tensors)
22:       end for
23:     end for
24:     con = con(tensorp)
25:     conn = sigmoid(con)      // sigmoid
26:     getsoftmax(conn)         // softmax
27:   end for
28:   if getdev() then
29:     tr = false
30:   end if
31: end while
32: return Final Email Message Classification (Spam/Non-spam email)
33: end
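The convolution / sigmoid / top-k pooling / softmax core of the network above can be demonstrated in plain Python. Real spam CNNs operate on learned word embeddings with trained filters; the per-word scores, the filter weights, and the output layer here are all illustrative assumptions:

```python
import math

def conv1d(seq, kernel):
    """Slide a 1-D kernel over the sequence (the convolution layer)."""
    k = len(kernel)
    return [sum(kernel[j] * seq[i + j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def topk(values, k):
    """Keep the k strongest activations (the top-k pooling step)."""
    return sorted(values, reverse=True)[:k]

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical 1-D "embedding": a per-word spamminess score for a message,
# e.g. "free money meeting winner lunch".
message = [0.9, 0.8, 0.1, 0.95, 0.2]
feature_map = [sigmoid(v) for v in conv1d(message, [0.5, 0.5])]  # conv + sigmoid
pooled = topk(feature_map, k=2)
# Tiny fully connected output layer: two class scores (spam, non-spam).
scores = [sum(pooled) * 1.0, sum(pooled) * -1.0]
probs = softmax(scores)
print("spam" if probs[0] > probs[1] else "non-spam")
```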