I have used a diabetes dataset to train my neural network, and my model is overfitting. How can I prevent this? I have tried a few methods, but they didn't work well. I hope I can find a solution to this problem soon. Try running the code below; I created it in Google Colab. The first few commented-out lines are used in Google Colab to import a file from the system, so if you are running the code in any other IDE, don't use those lines. I have also provided a Google Drive link from which you can download the dataset.

Python
#IMPORTING FILE FROM SYSTEM (Google Colab only -- skip this block in any other IDE)
"""
import pandas as pd
from google.colab import files
uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
"""

import numpy as np
import pandas as pd

data = pd.read_csv('DIABETES DATA.csv')

# quick sanity checks (view the output in a notebook cell)
data.isnull().any()
data.shape

#PROCESSING DATA
X = data.iloc[:,1:9].values
y = data.iloc[:,9].values

#SCALING DATA
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

#SPLITTING DATA IN TEST AND TRAIN
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state=1)

#CREATING MODEL
import keras
from keras.models import Sequential
from keras.layers import Dense  # neural network layers
from tensorflow.keras import regularizers
import matplotlib.pyplot as plt

model = Sequential()
model.add(Dense(20, input_dim=8, activation='relu', activity_regularizer=regularizers.l2(1e-4)))
model.add(Dense(15, activation='relu', activity_regularizer=regularizers.l2(1e-4)))
model.add(Dense(10, activation='relu', activity_regularizer=regularizers.l2(1e-4)))
model.add(Dense(3, activation='relu', activity_regularizer=regularizers.l2(1e-4)))
model.add(Dense(1, activation='sigmoid'))
model.summary()

#SELECTING METHODS
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

#TRAINING MODEL WITH TRAINING SET AND CROSS VALIDATION SET
history = model.fit(X_train, y_train, epochs=100, batch_size=64, validation_data=(X_test,y_test))

#PLOTTING ACCURACY AND LOSS
loss_train = history.history['loss']
loss_val = history.history['val_loss']
epochs = range(1,101)
plt.plot(epochs, loss_train, 'g', label='Training loss')
plt.plot(epochs, loss_val, 'b', label='Validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

acc_train = history.history['accuracy']
acc_val = history.history['val_accuracy']
plt.plot(epochs, acc_train, 'g', label='Training accuracy')
plt.plot(epochs, acc_val, 'b', label='Validation accuracy')
plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

LINK FOR DATA SET: https://drive.google.com/file/d/1ECaKuAyniai1m1KD7bs-ffJ3PTNVJR9U/view?usp=sharing

What I have tried:

I have tried a few things but they didn't work well.
Posted
Updated 8-Jul-20 22:48pm
Comments
Kris Lantz 8-Jul-20 12:02pm    
Why are you yelling? We're right here.
Richard MacCutchan 8-Jul-20 12:07pm    
He's probably downstairs. :)
Richard MacCutchan 8-Jul-20 12:09pm    
"I have tried a few things but they didn't work well."
That really is not a very useful comment. Please edit your question, format your code properly, show what you have tried, and explain exactly what happens or does not happen.
Abhay binwal 9-Jul-20 10:19am    
I have made a few comments in the code. I have only tried the dropout method for preventing overfitting, but that didn't work well.

1 solution

There is a huge range of possible reasons why a model overfits. I would like to address a few common issues.

Before getting into the answer, I would like to give a short explanation of what dropout is, taken from the research paper published by its original authors: Dropout

Dropout:
A simple method to prevent overfitting in neural network models by randomly removing, or "dropping", units (also called neurons) during training, so that the model cannot become dependent on any particular neurons.

Since you are facing an overfitting problem, you need to add dropout layers alongside your dense layers, as sketched below.
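
To make the idea concrete, here is a minimal NumPy sketch of what a dropout layer does at training time (an illustration of the technique only, not Keras's exact internals):

Python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.2, training=True):
    # inverted dropout: zero a random fraction of units during training
    # and rescale the survivors so the expected activation is unchanged
    if not training:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

activations = np.array([0.5, 1.2, 0.8, 2.0, 0.1])
print(dropout(activations))                   # some entries zeroed, the rest scaled by 1/0.8
print(dropout(activations, training=False))   # at inference time the input passes through unchanged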

Stochastic Gradient Descent Tricks explains why SGD is a good choice for training on larger datasets.

Also, since your case is a binary classification problem, I would choose binary_crossentropy as the loss function rather than mean squared error.
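
For intuition, binary cross-entropy penalizes confident wrong predictions much more heavily than mean squared error does. Here is a small NumPy sketch of the formula (an illustration only; Keras computes this for you when you pass loss="binary_crossentropy"):

Python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.2])          # the last prediction is confidently wrong
print(binary_crossentropy(y_true, y_pred))  # ~0.607, dominated by the bad prediction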

Okay, now let's get into the code:

Add our imports first:
Python
# SWAMI KARUPPASWAMI THUNNAI
import pandas as pd
import numpy as np
# LabelEncoder: converts categorical labels into integer codes
from sklearn.preprocessing import LabelEncoder
# Feature scaling
from sklearn.preprocessing import StandardScaler
# Our sequentially stacked NN model
from keras.models import Sequential
from keras.layers import Dense, Dropout


Read the dataset:
Python
dataset = pd.read_csv("DIABETES DATA.csv")


Our scalers and encoders:

Python
encoder = LabelEncoder()
scaler = StandardScaler()


Our independent and dependent variables:

Python
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]


Encode the categorical variable and scale X:

Python
X.iloc[:, 0] = encoder.fit_transform(X.iloc[:, 0])  # LabelEncoder expects a 1-D array
X.iloc[:, :] = scaler.fit_transform(X.iloc[:, :])
X = np.array(X)


Our neural network:

Python
classifier = Sequential()
classifier.add(Dense(1000, input_dim=9, activation="relu"))
classifier.add(Dropout(0.2))
classifier.add(Dense(500, activation="relu"))
classifier.add(Dropout(0.2))
classifier.add(Dense(250, activation="relu"))
classifier.add(Dropout(0.2))
classifier.add(Dense(100, activation="relu"))
classifier.add(Dropout(0.2))
classifier.add(Dense(1, activation="sigmoid"))  # sigmoid outputs a probability, as binary_crossentropy expects
classifier.summary()

Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 1000)              10000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 500)               500500    
_________________________________________________________________
dropout_2 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 250)               125250    
_________________________________________________________________
dropout_3 (Dropout)          (None, 250)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 100)               25100     
_________________________________________________________________
dropout_4 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 101       
=================================================================
Total params: 660,951
Trainable params: 660,951
Non-trainable params: 0


Choosing adam, an adaptive variant of SGD, since SGD-style optimizers work well on larger datasets:

Python
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])


Finally:

Python
classifier.fit(X, y, batch_size=20, epochs=100)


Gives:

768/768 [==============================] - 1s 679us/step - loss: 0.4177 - acc: 0.7539
Epoch 99/100
768/768 [==============================] - 1s 677us/step - loss: 0.4609 - acc: 0.7552
Epoch 100/100
768/768 [==============================] - 1s 712us/step - loss: 0.4449 - acc: 0.8008
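
Note that the code above trains on the full dataset, so it reports no validation accuracy, and the gap between training and validation accuracy is exactly what the question is about. A minimal way to monitor that gap, assuming the same classifier and data as above, is to let fit hold out a split:

Python
history = classifier.fit(X, y, batch_size=20, epochs=100, validation_split=0.2)
# a widening gap between history.history["accuracy"] and
# history.history["val_accuracy"] is the signature of overfitting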
 
Comments
Visweswaran N 9-Jul-20 4:49am    
I don't know why it has too many blank lines in my answer.
Abhay binwal 9-Jul-20 10:20am    
What about the cross-validation accuracy?
Abhay binwal 9-Jul-20 10:21am    
I mean, I too am getting good accuracy on the training set, but when I move to the cross-validation set the accuracy drops.
