Detect E-mail Spam Using Python

Noman BD

Mar 11, 2018

CPOL

Power of Naive Bayes Classification

Training and Test data - 1.1 MB

Introduction

Anyone with an e-mail address has received unwanted e-mails, which we call spam. Modern spam filters constantly struggle to detect unwanted e-mails and mark them as spam. It is an ongoing battle between spam filtering software and anonymous spam senders, so spam-filtering algorithms need to be improved from time to time. Behind the scenes, machine-learning algorithms find the unwanted e-mails. More specifically, we use a text classification algorithm such as Naïve Bayes, Support Vector Machine or Neural Network to do the job. In this article, I will show you how to use the Naïve Bayes algorithm to identify spam e-mail, and I will evaluate the results with some statistics. We will use Python to do the job, and I will try to show you the power of Python in the machine-learning world.

Using the Code

Download the training and test data from here. First, split the file into two files, one for training data and another for test data. Name the training data file training.csv and the test data file test.csv. For your convenience, I have uploaded both files in the Training and Test.zip file.
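If you want to do the split yourself, here is a minimal sketch using only the standard library. It assumes the downloaded data is a single CSV with type and email columns. The file name emails.csv, the helper name split_csv and the demo rows are my own illustrative assumptions, not part of the download.

```python
import csv
import os
import random
import tempfile


def split_csv(source_path, train_path, test_path, test_size=100, seed=42):
    """Shuffle the rows of source_path and write them to two CSV files."""
    with open(source_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(rows)
    test_rows, train_rows = rows[:test_size], rows[test_size:]

    for path, subset in ((train_path, train_rows), (test_path, test_rows)):
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
    return len(train_rows), len(test_rows)


# Demo on a throwaway file with 20 made-up rows (hypothetical data).
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "emails.csv")
with open(source, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["type", "email"])
    for i in range(20):
        writer.writerow(["Spam" if i % 2 else "Ham", "<p>message %d</p>" % i])

train_count, test_count = split_csv(
    source,
    os.path.join(workdir, "training.csv"),
    os.path.join(workdir, "test.csv"),
    test_size=5,
)
print(train_count, test_count)
```

Shuffling before splitting matters when the source file is sorted by type, because otherwise the test file could end up containing only one class.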

import pandas as pd 
# read training data & test data 
df_train = pd.read_csv("training.csv")
df_test = pd.read_csv("test.csv")

Let's look at five random rows of our training data.

df_train.sample(5)

Output

	type	email
1779	Ham	<p>Into thereis tapping said that scarce whose...
1646	Ham	<p>Then many take the ghastly and rapping gaun...
534	Spam	<p>Did parting are dear where fountain save ca...
288	Spam	<p>His heart sea he care he sad day there anot...
1768	Ham	<p>With ease explore. See whose swung door and...

In the same way, let's look at five random rows of our test data and see what is given in the email body.

df_test.sample(5)

Output

	type	email
58	Ham	<p>Sitting ghastly me peering more into in the...
80	Spam	<p>A favour what whilome within childe of chil...
56	Spam	<p>From who agen to sacred breast unto will co...
20	Ham	<p>Of to gently flown shrieked ashore such sad...
94	Spam	<p>A charms his of childe him. Lowly one was b...

Each CSV file has two columns. The type column says whether the email is marked as Spam or Ham, and the email column contains the body (main text) of the email. Note that our test data also has the type column. It is given in advance so that we can cross-check the accuracy of our algorithm. Now we will see some descriptive statistics of our training data.

df_train.describe(include = 'all')

Output

	type	email
count	2000	2000
unique	2	2000
top	Spam	<p>Along childe love and the but womans a the ...
freq	1000	1

Here, we can see that there are 2000 records, with two unique types and 2000 unique emails. Let's see more about the type column.

df_train.groupby('type').describe()

	email
	count	unique	top	freq
type
Ham	1000	1000	<p>Broken if still art within lordly or the it...	1
Spam	1000	1000	<p>Along childe love and the but womans a the ...	1

In our training data, we have an equal number (1000 each) of Spam and Ham emails. There is no duplicate data in the email column. Now we will sanitize our data.

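import email_pre as ep  # preprocessing helper module (not part of the standard library)
from gensim.models.phrases import Phrases

def do_process(row):
    """Clean one row's email text and join detected bigrams with '_'."""
    global bigram  # phrase model loaded in the "Bigram Training" step
    temp = ep.preprocess_text(row.email, [ep.lowercase,
                                          ep.remove_html,
                                          ep.remove_esc_chars,
                                          ep.remove_urls,
                                          ep.remove_numbers,
                                          ep.remove_punct,
                                          ep.lemmatize,
                                          ep.keyword_tokenize])

    if not isinstance(temp, str):
        print(temp)

    return ' '.join(bigram[temp.split(" ")])

def phrases_train(sen_list, min_=3):
    """Train, or retrain, the bigram phrase model and save it to disk."""
    if len(sen_list) <= 10:
        print("too small to train! ")
        return

    if isinstance(sen_list, list):
        try:
            # A saved model exists: extend its vocabulary with the new sentences.
            bigram = Phrases.load("email_EN_bigrams_spam")
            bigram.add_vocab(sen_list)
            bigram.save("email_EN_bigrams_spam")
            print("retrain!")

        except Exception as ex:
            # No saved model could be loaded: train a new one from scratch.
            print("first ")
            bigram = Phrases(sen_list, min_count=min_, threshold=2)
            bigram.save("email_EN_bigrams_spam")
            print(ex)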
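If you do not have the email_pre module at hand, the cleaning steps can be approximated with the standard library. This is only a rough sketch: simple_preprocess and the tiny STOP_WORDS set are hypothetical stand-ins I made up for illustration, and lemmatization is left out entirely, so the result will not match the ep.preprocess_text output exactly.

```python
import html
import re

# Tiny illustrative stop-word list (an assumption); a real filter needs a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in",
              "he", "his", "him", "for", "from", "more"}


def simple_preprocess(text):
    """Lowercase, strip HTML, URLs, numbers and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
    text = html.unescape(text)                           # decode entities such as &amp;
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # drop URLs
    text = re.sub(r"\d+", " ", text)                     # drop numbers
    text = re.sub(r"[^a-z\s]", " ", text)                # drop punctuation and symbols
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


sample = "<p>Him ah he more things long from mine for. Visit http://example.com 42 times!</p>"
cleaned = simple_preprocess(sample)
print(cleaned)  # ah things long mine visit times
```

The order of the steps matters: URLs are removed before punctuation, because stripping the punctuation first would break a URL into ordinary-looking words.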

Train the phrase model (run this once and save the model; the saved model can be retrained later).

# Clean every training email and split it into a list of tokens
train_email_list = [ep.preprocess_text(mail, [ep.lowercase,
                                              ep.remove_html,
                                              ep.remove_esc_chars,
                                              ep.remove_urls,
                                              ep.remove_numbers,
                                              ep.remove_punct,
                                              ep.lemmatize,
                                              ep.keyword_tokenize]).split(" ")
                    for mail in df_train.email.values]

print("after pre_process :")
print(" ")
print(len(train_email_list))
# .loc replaces the deprecated .ix indexer
print(df_train.loc[22].email, ">>"*80, train_email_list[22])

Output

After pre_process:

2000
<p>Him ah he more things long from mine for. Unto feel they seek other adieu crime dote. Adversity pangs low. Soon light now time amiss to gild be at but knew of yet bidding he thence made. Will care true and to lyres and and in one this charms hall ancient departed from. Bacchanals to none lay charms in the his most his perchance the in and the uses woe deadly. Save nor to for that that unto he. Thy in thy. Might parasites harold of unto sing at that in for soils within rake knew but. If he shamed breast heralds grace once dares and carnal finds muse none peace like way loved. If long favour or flaunting did me with later will. Not calm labyrinth tear basked little. It talethis calm woe sight time. Rake and to hall. Land the a him uncouth for monks partings fall there below true sighed strength. Nor nor had spoiled condemned glee dome monks him few of sore from aisle shun virtues. Bidding loathed aisle a and if that to it chill shades isle the control at. So knew with one will wight nor feud time sought flatterers earth. Relief a would break at he if break not scape.</p><p>The will heartless sacred visit few. The was from near long grief. 
His caught from flaunting sacred care fame said are such and in but a.</p> ['ah', 'things', 'long', 'mine', 'unto', 'feel', 'seek', 'adieu', 'crime', 'dote', 'adversity', 'pangs', 'low', 'soon', 'light', 'time', 'amiss', 'gild', 'know', 'yet', 'bid', 'thence', 'make', 'care', 'true', 'lyres', 'one', 'charm', 'hall', 'ancient', 'depart', 'bacchanals', 'none', 'lay', 'charm', 'perchance', 'use', 'woe', 'deadly', 'save', 'unto', 'thy', 'thy', 'might', 'parasites', 'harold', 'unto', 'sing', 'soil', 'within', 'rake', 'know', 'sham', 'breast', 'herald', 'grace', 'dare', 'carnal', 'find', 'muse', 'none', 'peace', 'like', 'way', 'love', 'long', 'favour', 'flaunt', 'later', 'calm', 'labyrinth', 'tear', 'bask', 'little', 'talethis', 'calm', 'woe', 'sight', 'time', 'rake', 'hall', 'land', 'uncouth', 'monks', 'part', 'fall', 'true', 'sigh', 'strength', 'spoil', 'condemn', 'glee', 'dome', 'monks', 'sore', 'aisle', 'shun', 'virtues', 'bid', 'loathe', 'aisle', 'chill', 'shade', 'isle', 'control', 'know', 'one', 'wight', 'feud', 'time', 'seek', 'flatterers', 'earth', 'relief', 'would', 'break', 'break', 'scapethe', 'heartless', 'sacred', 'visit', 'near', 'long', 'grief', 'catch', 'flaunt', 'sacred', 'care', 'fame', 'say']

df_train["class"] = df_train.type.replace(["Spam","Ham"],[0,1]) 
df_test["class"] = df_test.type.replace(["Spam","Ham"],[0,1])

Bigram Training

phrases_train(train_email_list,min_=3) 
bigram = Phrases.load("email_EN_bigrams_spam") 
len(bigram.vocab)

Output

retrain!
159158

# Count the vocabulary entries that occur at least 15 times
print(len({key: value for key, value in bigram.vocab.items() if value >= 15}))

Output

4974

df_train["clean_email"] = df_train.apply(do_process, axis=1)
df_test["clean_email"] = df_test.apply(do_process, axis=1)
# df_train.head()
# A joined bigram contains "_", so count the emails with at least one phrase
print("phrase found train:", df_train[df_train['clean_email'].str.contains("_")].shape)
print("phrase found test:", df_test[df_test['clean_email'].str.contains("_")].shape)

Output

phrase found train: (371, 3)
phrase found test: (7, 3)
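To give a feeling for what the phrase model does with min_count and threshold, here is a small standard-library sketch. It scores each adjacent word pair with the formula commonly attributed to Mikolov et al., which, as far as I know, is also the default scoring in gensim's Phrases. Treat that as an assumption about the library, not as its implementation. The function names and the toy sentences are made up for illustration.

```python
from collections import Counter


def find_bigrams(sentences, min_count=3, threshold=2.0):
    """Return {(word_a, word_b): score} for the pairs that pass the threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    phrases = {}
    for (word_a, word_b), pair_count in bigrams.items():
        # Pairs seen min_count times or fewer get a non-positive score.
        score = (pair_count - min_count) * vocab_size / (unigrams[word_a] * unigrams[word_b])
        if score > threshold:
            phrases[(word_a, word_b)] = score
    return phrases


def join_phrases(tokens, phrases):
    """Merge each detected pair into a single 'a_b' token, left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical toy corpus: "free prize" always appears together.
sentences = [
    ["free", "prize", "today"],
    ["free", "prize", "inside"],
    ["free", "prize", "now"],
    ["claim", "now"],
    ["meeting", "today"],
    ["look", "inside"],
]
phrases = find_bigrams(sentences, min_count=1, threshold=1.0)
print(join_phrases(["free", "prize", "now"], phrases))  # ['free_prize', 'now']
```

A pair scores high when the two words occur together much more often than their individual frequencies would suggest. That is why the joined tokens contain an underscore, which is what the str.contains("_") check above counts.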

Training for Spam detection

df_train.head()

Output

	type	email	clean_email	class
0	Spam	<p>But could then once pomp to nor that glee g...	could pomp glee glorious deign vex time childe...	0
1	Spam	<p>His honeyed and land vile are so and native...	honey land vile native ah ah like flash gild b...	0
2	Spam	<p>Tear womans his was by had tis her eremites...	tear womans tis eremites present dear know pro...	0
3	Spam	<p>The that and land. Cell shun blazon passion...	land cell shun blazon passion uncouth paphian ...	0
4	Spam	<p>Sing aught through partings things was sacr...	sing aught part things sacred know passion pro...	0

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Word counts -> TF-IDF weights -> multinomial Naive Bayes classifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])


text_clf.fit(df_train.clean_email, df_train["class"]) 
predicted = text_clf.predict(df_test.clean_email)
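To see what the classifier at the end of the pipeline does, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is a simplification: it works on raw word counts, while the pipeline above feeds TF-IDF weights into MultinomialNB. The class name TinyMultinomialNB and the four toy messages are my own inventions for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes on whitespace-separated tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is add-one smoothing

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.split())
        self.vocabulary = {word for counts in self.word_counts.values() for word in counts}
        return self

    def predict(self, documents):
        return [self._predict_one(text) for text in documents]

    def _predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + self.alpha * len(self.vocabulary)
            # log P(class) + sum of log P(word | class); logs avoid underflow
            score = math.log(doc_count / total_docs)
            for word in text.split():
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, already cleaned.
train_docs = ["win free prize now", "free prize claim",
              "meeting agenda attached", "lunch meeting tomorrow"]
train_labels = ["Spam", "Spam", "Ham", "Ham"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
predictions = model.predict(["claim free prize", "agenda for meeting"])
print(predictions)  # ['Spam', 'Ham']
```

The smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.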

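from sklearn import metrics
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# (the line above is an IPython/Jupyter magic; remove it in a plain script)

# Rows are the true classes, columns the predicted classes (0 = Spam, 1 = Ham)
array = metrics.confusion_matrix(df_test["class"], predicted)
df_cm = pd.DataFrame(array, index=["Spam", "Ham"],
                     columns=["Spam", "Ham"])

sn.set(font_scale=1.4)  # label size
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16})  # annotation font size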

Output: a heatmap of the confusion matrix, with Spam and Ham on both axes.

print(metrics.classification_report(df_test["class"], predicted,
                                    target_names=["Spam", "Ham"]))

Output

             precision    recall  f1-score   support

       Spam       1.00      1.00      1.00        43
        Ham       1.00      1.00      1.00        57

avg / total       1.00      1.00      1.00       100

To test the model, we feed the test data into it and compare its predictions with the labels already given in the test data. The report shows that, out of 43 spam mails, the model correctly identifies all 43. In the same way, out of 57 ham mails, it correctly identifies all 57.
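The numbers in the report can be recomputed by hand from the confusion matrix. The sketch below assumes the usual layout, where rows are the true classes and columns are the predicted classes. The first matrix follows from the report above (43 of 43 spam and 57 of 57 ham correct), and the second one is a made-up example of an imperfect classifier. The helper name per_class_metrics is my own.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of class-i mails predicted as class j."""
    report = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum (support)
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual_total}
    return report


# Matrix implied by the classification report in this article.
report = per_class_metrics([[43, 0], [0, 57]], ["Spam", "Ham"])
print(report["Spam"])

# Hypothetical matrix: 3 spam mails slip through, 2 ham mails are flagged.
imperfect = per_class_metrics([[40, 3], [2, 55]], ["Spam", "Ham"])
print(round(imperfect["Spam"]["precision"], 2), round(imperfect["Spam"]["recall"], 2))
```

Precision answers "of the mails flagged as spam, how many really were spam", while recall answers "of the real spam mails, how many were caught".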

Conclusion

It is very unusual to get 100% success from a model. Most likely, it is due to the small training and test datasets. I have tested my own emails using the model, and it turned out to be less effective than my existing paid spam filter. That makes sense. There are many ways we can improve the model; if it is trained with sufficient data, it should deliver more accurate results.