Spam Filters: All you need to know (Part II)

Earlier, we discussed the introduction to spam filters and their types; see Part I for an overview.
Here, we’ll discuss some commonly used spam filtering algorithms.

Working of Spam Filters

Using Conventional Programming Methodologies:-

1. To begin, think about what spam looks like in general. You’ll note that certain terms or phrases (such as “Just for U,” “credit card,” “free,” and “wonderful”) appear frequently in the subject line. You might also notice other patterns in the sender’s name, the email body, and other parts of the email.
2. You’d write a detection algorithm for each of the patterns you noticed, and your program would mark an email as spam if several of them were found.
3. You’d test your program and repeat steps 1 and 2 until it was good enough to go live.

Because the problem is complex, your program will most likely turn into a long list of complex rules that is tough to maintain.
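To make the conventional approach concrete, here is a minimal sketch in Python. The patterns, the sender rule, and the `min_hits` cut-off are all made-up assumptions for illustration; a real filter would need many more rules, which is exactly the maintenance problem described above.

```python
import re

# Hypothetical hand-written patterns; each one encodes a single observation about spam.
SUBJECT_PATTERNS = [
    re.compile(r"\bfree\b", re.IGNORECASE),
    re.compile(r"\bcredit card\b", re.IGNORECASE),
    re.compile(r"\bwonderful\b", re.IGNORECASE),
]
SENDER_PATTERNS = [
    re.compile(r"no-?reply@.*\.(biz|top)$", re.IGNORECASE),
]


def count_rule_hits(subject: str, sender: str) -> int:
    """Count how many hand-written patterns match this email."""
    hits = sum(1 for p in SUBJECT_PATTERNS if p.search(subject))
    hits += sum(1 for p in SENDER_PATTERNS if p.search(sender))
    return hits


def is_spam(subject: str, sender: str, min_hits: int = 2) -> bool:
    """Flag the email as spam when several patterns fire at once."""
    return count_rule_hits(subject, sender) >= min_hits
```

Every new spam trick means another pattern in one of these lists, and every false alarm means another exception.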

By Using Machine Learning Techniques:-

Machine learning based spam filters use methods such as instance-based (memory-based) learning to recognize and classify incoming spam emails based on their resemblance to the spam emails in the training examples. Researchers have also developed many other email spam classification techniques that have been used effectively to classify data. Examples include probabilistic, decision tree, artificial immune system, support vector machine (SVM), artificial neural network (ANN), and case-based techniques.

Flowchart for various ML algorithms that can be used for Spam Filtering

Some algorithms for Email Spam Classification –

kNN Algorithm for Email Spam Filters

Algorithm

1: Find Email Message class labels.
2: Input k, the number of nearest neighbors
3: Input D, the set of test Email Messages;
4: Input T, the set of training Email Messages.
5: L, the label set of test Email Messages.
6: Read DataFile (TrainingData)
7: Read DataFile (TestingData)
8: for each d in D do
9: Neighbors(d) = {}
10: for each t in T do
11: if |Neighbors(d)| < k then
12: Neighbors(d) = {t} ∪ Neighbors(d)
13: else if t is closer to d than the farthest message in Neighbors(d) then
14: replace that farthest message with t
15: end if
16: end for
17: L(d) = majority class label in Neighbors(d)
18: end for
19: return Final Email Message Classification (Spam/Valid email)
20: end
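The kNN idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the two-number feature vectors, the Euclidean distance, and the toy data are invented here, and a real filter would use much richer features.

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(d, training, k):
    """Label d by majority vote among its k nearest training examples.

    training is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(training, key=lambda pair: distance(d, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy features (an assumption): [count of "free", count of "meeting"].
T = [
    ([3, 0], "spam"),
    ([2, 0], "spam"),
    ([4, 1], "spam"),
    ([0, 2], "valid"),
    ([0, 3], "valid"),
    ([1, 2], "valid"),
]
D = [[3, 1], [0, 1]]
L = [knn_classify(d, T, k=3) for d in D]
```

Sorting the whole training set for every test message is the simplest way to find the neighbors; for a large corpus you would likely want a smarter index.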

Naïve Bayes Classification Algorithm for Email Spam Filters

Algorithm

1: Input Email Message dataset
2: Parse each email into its component tokens
3: Compute probability for each token S [W] = C_spam(W)/(C_ham(W) + C_spam(W))
4: Store spamminess values to a database
5: for each message M do
6: while (M not end) do
7: scan message for the next token T_i
8: query the database for spamminess S(T_i)
9: end while
10: compute the accumulated message probabilities S[M] and H[M]
11: compute the total message filtering signal by: I[M] = f(S[M], H[M]),
12: for example I[M] = (1 + S[M] − H[M])/2
13: if I[M] > threshold then
14: msg is labeled as spam
15: else
16: msg is labeled as non-spam
17: end if
18: end for
19: return Final Email Message Classification (Spam/Valid email)
20: end
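Here is a rough Python sketch of the token-spamminess idea above. The per-token formula S[W] = C_spam(W)/(C_ham(W) + C_spam(W)) follows the pseudocode. How S[M] and H[M] are combined from the token values is not spelled out there, so the geometric-mean combination, the clamping constant `eps`, and the toy messages below are assumptions made only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a message into lowercase whitespace-separated tokens."""
    return text.lower().split()


def train_spamminess(spam_msgs, ham_msgs):
    """Compute S[W] = C_spam(W) / (C_ham(W) + C_spam(W)) for every seen token."""
    c_spam = Counter(tok for m in spam_msgs for tok in tokenize(m))
    c_ham = Counter(tok for m in ham_msgs for tok in tokenize(m))
    vocab = set(c_spam) | set(c_ham)
    return {w: c_spam[w] / (c_ham[w] + c_spam[w]) for w in vocab}


def filtering_signal(message, spamminess, eps=0.01):
    """Return I[M] = (1 + S[M] - H[M]) / 2 for one message."""
    # Clamp token values away from 0 and 1 so the logarithms stay finite.
    probs = [
        min(max(spamminess[t], eps), 1 - eps)
        for t in tokenize(message)
        if t in spamminess
    ]
    if not probs:
        return 0.5  # no known tokens, so no evidence either way
    n = len(probs)
    # Assumed combination: S[M] grows with spammy tokens, H[M] with hammy ones.
    s = 1 - math.exp(sum(math.log(1 - p) for p in probs) / n)
    h = 1 - math.exp(sum(math.log(p) for p in probs) / n)
    return (1 + s - h) / 2


spamminess = train_spamminess(
    ["free credit card", "free offer now"],
    ["meeting notes attached", "lunch meeting now"],
)
```

A message would then be labeled spam when `filtering_signal(message, spamminess)` exceeds a chosen threshold, for example 0.5.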

Perceptron Neural Network algorithm for Email Spam Filters

Algorithm

1: Input Sample email message dataset
2: Initialize w and b (to random values or to 0).
3: Find a training sample of messages (x, c) for which sign(w^T x + b) ≠ c.
4: if there is no such sample, then
5: Training is completed
6: Store the final w and b and stop.
7: else
8: update (w, b): w = w + cx,
9: b = b + c
10: go to step 3
11: end if
12: Determine the email message class as sign(w^T x + b)
13: return Final Email Message Classification (Spam/Non-spam email)
14: end
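The perceptron update rule above (w = w + cx, b = b + c on every misclassified sample) can be sketched as follows. The labels +1 for spam and −1 for non-spam, the `max_epochs` safety limit, and the toy feature vectors are assumptions for this example; the loop only stops early when the data is linearly separable.

```python
def sign(value):
    """Return +1 for non-negative values and -1 otherwise."""
    return 1 if value >= 0 else -1


def predict(w, b, x):
    """Classify x as sign(w^T x + b)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_perceptron(samples, max_epochs=100):
    """Repeat the update rule until no training sample is misclassified."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, c in samples:
            if predict(w, b, x) != c:
                # Move the boundary toward the misclassified sample.
                w = [wi + c * xi for wi, xi in zip(w, x)]
                b += c
                mistakes += 1
        if mistakes == 0:
            break  # training is completed
    return w, b


# Toy features (an assumption): [count of "free", count of "meeting"].
samples = [([3, 0], 1), ([2, 1], 1), ([0, 2], -1), ([1, 3], -1)]
w, b = train_perceptron(samples)
```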

Firefly Algorithm for Email Spam Filters

Algorithm

1: Input Email corpus with M number of features
2: Set k = 0
3: Get population of firefly N
4: Get the number of attributes M
5: Initialize the firefly population
6: for each firefly
7: Choose the firefly which has the best fitness
8: Choose corresponding features from the testing part of the email spam corpus
9: Test the email message
10: k = k+1
11: Update each firefly
12: Classify the email message as either spam or Non-spam email
13: end for
14: return Final Email Message Classification (Spam/Non-spam email)
15: end

Email spam Filter algorithm using Rough Set

Algorithm

1: Input Email Testing Dataset (Dis_TE, the testing dataset), Rule (RUL), b
2: for x ∈ Dis_TE do
3: if RUL(x) = ∅ then
4: suspicious = suspicious ∪ {x};
5: end if
6: Let all r ∈ RUL(x) cast a number in favor of the non-spam class.
7: Predict membership degree based on the decision rules;
8: R = {r ∈ RUL(x) | r predicts non-spam};
9: Estimate Rel(Dis_TE | x ∈ non-spam);
10: Rel(Dis_TE | x ∈ non-spam) = ∑_{r ∈ R} Predicts(non-spam);
11: Certainty_x = 1/cer × Rel(Dis_TE | x ∈ non-spam);
12: if Certainty_x ≥ 1 − b then
13: suspicious = suspicious ∪ {x};
14: end if
15: spam = spam ∪ {x};
16: return Final Email Message Classification (Spam/Non-spam/Suspicious email)
17: end

Support Vector Machine (SVM) algorithm for Email Spam Filters

Algorithm

1: Input Sample Email Message x to classify
2: A training set S, a kernel function, {C_1, C_2, …, C_num} and {γ_1, γ_2, …, γ_q}.
3: Number of cross-validation folds k.
4: for i = 1 to num
5: set C = C_i;
6: for j = 1 to q
7: set γ = γ_j;
8: produce a trained SVM classifier f(x) with the current parameter combination (C, γ);
9: if (f (x) is the first produced discriminant function) then
10: keep f (x) as the most ideal SVM classifier f∗(x);
11: else
12: compare classifier f (x) and the current best SVM classifier f∗(x) using k-fold cross-validation
13: keep classifier with better accuracy.
14: end if
15: end for
16: end for
17: return Final Email Message Classification (Spam/Non-spam email)
18: end

Decision Tree algorithm for Email Spam Filters

Algorithm

1: Input Email Message dataset
2: Compute entropy for dataset
3: while the stopping condition is not met do
4: for every attribute/feature
5: calculate entropy for all categorical values
6: take the average information entropy for the current attribute
7: calculate the gain for the current attribute
8: end for
9: pick the highest gain attribute and split the dataset on it
10: end while
11: return Final Email Message Classification (Spam/Non-spam email)
12: end
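The entropy and gain steps above are easy to try out in Python. The sketch below covers only one round of attribute selection, not the full tree-building loop, and the boolean attributes `has_free` and `has_link` are hypothetical names invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy of the dataset minus the weighted entropy after splitting."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def best_attribute(rows, labels):
    """Pick the attribute with the highest information gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


# Toy dataset with two hypothetical boolean attributes.
rows = [
    {"has_free": True, "has_link": True},
    {"has_free": True, "has_link": False},
    {"has_free": False, "has_link": True},
    {"has_free": False, "has_link": False},
]
labels = ["spam", "spam", "non-spam", "non-spam"]
```

On this toy data `has_free` separates the classes perfectly while `has_link` tells us nothing, so `best_attribute(rows, labels)` picks `has_free` as the first split.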

AdaBoost Algorithm for Email Spam Filters

Algorithm

1: Input set of email messages corpus M
2: while condition do
3: use the labeled message corpus M (labeled) to train the classifier.
4: use the classifier to test the M (unlabeled) messages and produce scores using a scoring function.
5: relate each message with the matching score computed above.
6: label the messages with the least scores.
7: add the recently labeled messages into M (labeled) corpus.
8: eliminate the recently labeled message from the M (unlabeled) corpus.
9: end while
10: train the message corpus
11: given (x_1, y_1)… (x_n, y_n) ∈ S_t where y_i ∈ {0, 1}
12: weights w_1…w_f = 1/f, where f = number of features in an email message
13: for t = 1 to T do
14: normalize the weights so that ∑_i w_i = 1
15: error e_j = ∑_i w_i |h_j(x_i) − y_i|
16: Select classifier h_j with the least error e_t
17: β_t = e_t/(1 − e_t)
18: Update weights w_{t+1,i} = w_{t,i} β_t^(1 − e_i), where e_i = 0 if x_i is classified correctly and 1 otherwise
19: α_t = log(1/β_t)
20: end for
21: return Final Email Message Classification (Spam/Non-spam email)
22: end
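The boosting loop in the second half of the pseudocode (normalize the weights, pick the weak classifier with the least weighted error, compute β_t and α_t, shrink the weights of correctly classified messages) might look like this in Python. Several things here are assumptions made for the sketch: the weak classifiers are one-feature threshold "stumps", there is one weight per training message, the error is clamped so the logarithm stays finite, and the final decision compares the α-weighted vote with half of the total α.

```python
import math


def stump_predict(stump, x):
    """A stump (feature, threshold) predicts 1 (spam) when x[feature] > threshold."""
    feature, threshold = stump
    return 1 if x[feature] > threshold else 0


def weighted_error(stump, X, y, weights):
    """e_j = sum_i w_i * |h_j(x_i) - y_i|."""
    return sum(w * abs(stump_predict(stump, x) - t) for w, x, t in zip(weights, X, y))


def train_adaboost(X, y, rounds=3):
    """Return a list of (alpha_t, stump_t) pairs."""
    candidates = sorted({(f, x[f]) for x in X for f in range(len(x))})
    weights = [1.0 / len(X)] * len(X)
    model = []
    for _ in range(rounds):
        total = sum(weights)
        weights = [w / total for w in weights]  # normalize so the weights sum to 1
        best = min(candidates, key=lambda s: weighted_error(s, X, y, weights))
        # Clamp the error (an assumption) so beta and alpha stay finite.
        error = min(max(weighted_error(best, X, y, weights), 1e-10), 1 - 1e-10)
        beta = error / (1 - error)
        # Correctly classified messages get multiplied by beta; the rest stay as is.
        weights = [
            w * (beta if stump_predict(best, x) == t else 1.0)
            for w, x, t in zip(weights, X, y)
        ]
        model.append((math.log(1 / beta), best))
    return model


def adaboost_predict(model, x):
    """Spam (1) when the alpha-weighted vote reaches half of the total alpha."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in model)
    return 1 if score >= 0.5 * sum(alpha for alpha, _ in model) else 0


# Toy features (an assumption): [count of "free", count of links].
X = [[3, 1], [2, 0], [4, 2], [0, 1], [1, 2], [0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = train_adaboost(X, y)
```

On this toy data a single stump already separates the classes, so every round selects the same one; noisier data is where the reweighting starts to matter.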

Random Forests Algorithm for Email Spam Filters

Algorithm

1: Input X: number of nodes
2: Input N: number of features in the Email Message
3: Input Y: number of trees to be grown
4: while termination conditions are not true do
5: Select a bootstrap sample S of Email Messages at random from the training corpus
6: Create tree R_i from the selected bootstrap sample S
7: Choose n features at random from N; where n ≪ N
8: Compute the optimal dividing point for node d among the n features
9: Divide the parent node into two offspring nodes through the optimal divide
10: Repeat steps 7–9 till the maximum number of nodes (X) is created
11: Create your forest by repeating steps 5–10 for Y number of times
12: end while
13: generate a result of every created tree {R_t}, t = 1…Y
14: use a new Email Message for every created tree beginning at the root node
15: designate the Email Message to the group compatible with the leaf node.
16: merge the votes or results of every tree
17: return Final Email Message Classification (Spam/Non-spam email) group having the highest vote (G).
18: end
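Growing the trees themselves takes more code than fits here, but the parts that make a forest "random" and the final vote are small enough to sketch. In the Python snippet below, the three lambda "trees" are hypothetical stand-ins for trained trees, and the helper names and toy corpus are assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_sample(corpus, rng):
    """Draw len(corpus) messages with replacement (a bootstrap sample)."""
    return [rng.choice(corpus) for _ in range(len(corpus))]


def random_feature_subset(num_features, n, rng):
    """Choose n distinct feature indices out of num_features."""
    return sorted(rng.sample(range(num_features), n))


def forest_vote(trees, message):
    """Send the message down every tree and return the label with the most votes."""
    votes = Counter(tree(message) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a feature dict to a label.
trees = [
    lambda m: "spam" if m["free"] > 0 else "non-spam",
    lambda m: "spam" if m["links"] > 2 else "non-spam",
    lambda m: "spam" if m["free"] + m["links"] > 1 else "non-spam",
]

rng = random.Random(0)
corpus = ["msg-a", "msg-b", "msg-c", "msg-d"]
sample = bootstrap_sample(corpus, rng)
features = random_feature_subset(num_features=10, n=3, rng=rng)
label = forest_vote(trees, {"free": 2, "links": 0})
```

Each tree sees a different bootstrap sample and a different feature subset, which is what keeps the trees from all making the same mistakes.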

Convolutional Neural Networks for Email Spam Filters

Algorithm

1: Input preprocessed Email Messages
2: Input parameters N
3: file = getfile () //Find the Message Corpus
4: label = getlabel (file) //Find the labelled Messages
5: test = gettest (file) //Find the Email Message
6: vec = getword2vec () //Load the word vector
7: random = random (label) //Randomized
8: while condition do
9: Nf = CV(len (xshuffle),nf) //Cross-validation
10: for trindex, teindex in kf do
11: xtotal, ytotal = xshuffle [trindex],yshuffle [trindex]
12: xtrain, xdev,ytrain, ydev = split (xtotal, ytotal)
13: //Divide the data set
14: for i < N do
15: conv = getconv () //Convolution layer
16: h = sigmoid (conv)
17: N = getk () //Get the value of N
18: tensorr = gettensor ()
19: for x,y in xtrain, ytrain do
20: value, indice = topk (tensorr)
21: //Get the Email Message feature and location information
22: tensors = get (value, indice)
23: //Get the corresponding tensor
24: tensora = append (tensors)
25: end for
26: end for
27: con = con (tensorp)
28: conn = sigmoid (con) //Sigmoid
29: getsoftmax (conn) //softmax
30: end for
31: if getdev () then
32: tr = false
33: end if
34: end while
35: return Final Email Message Classification (Spam/Non-spam email)
36: end