CodAI - Programming Language Detection AI

4 Mar 2018, CPOL
Programming language Detection AI

Note: You can evaluate CodAI by visiting https://codai.herokuapp.com/

Image 1

Introduction

In this article we will discuss programming language detection using deep neural networks. I used Keras with the TensorFlow backend for this task. CodAI uses a neural network quite similar to the one in my previous article, LSTM Spam Detection: https://www.codeproject.com/Articles/1231788/LSTM-Spam-Detection

This article contains the following topics:

  1. Prepare train and test data
  2. Build the model
  3. Serve the model as REST API

Using the code

1. Prepare train and test data

The first step is to prepare the data. Our data is a text file of HTML made up of PRE blocks, each containing a code sample and tagged with its language. I used BeautifulSoup to extract the contents of all PRE tags.

Python
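from bs4 import BeautifulSoup

# Parse the sample file; each PRE tag carries its language in a "lang" attribute
with open("LanguageSamples.txt") as samples_file:
    soup = BeautifulSoup(samples_file, 'html.parser')

count = 0            # number of snippets found
code_snippets = []
languages = []

# text=True matches only PRE tags that contain a single text string
for pretag in soup.find_all('pre', text=True):
    count = count + 1
    line = str(pretag.contents[0])
    code_snippets.append(line)
    languages.append(pretag["lang"].lower())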
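
To make the expected input concrete, here is a minimal sketch that runs the same kind of extraction on a two-snippet inline sample. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The sample markup and the PreCollector class are assumptions for illustration, not the actual contents of LanguageSamples.txt.

```python
from html.parser import HTMLParser

# Hypothetical sample in the assumed format: one PRE block per snippet,
# with the language name in a "lang" attribute.
SAMPLE = """
<pre lang="Python">print("hello")</pre>
<pre lang="C++">int main() { return 0; }</pre>
"""


class PreCollector(HTMLParser):
    """Collects the text and lang attribute of every PRE block."""

    def __init__(self):
        super().__init__()
        self.code_snippets = []
        self.languages = []
        self._lang = None   # language of the PRE block being read, if any
        self._parts = []    # text chunks seen inside the current block

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._lang = (dict(attrs).get("lang") or "").lower()
            self._parts = []

    def handle_data(self, data):
        if self._lang is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "pre" and self._lang is not None:
            self.code_snippets.append("".join(self._parts))
            self.languages.append(self._lang)
            self._lang = None


collector = PreCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.languages)       # ['python', 'c++']
print(collector.code_snippets)
```

As in the article's code, the language labels are lowercased so that "Python" and "python" end up in the same class.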

Next we need to tokenize the input. The Keras Tokenizer is used for this with a maximum of 10000 features, and the word index is saved to a JSON file.

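Python
import json

import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_fatures = 10000

tokenizer = Tokenizer(num_words=max_fatures)
tokenizer.fit_on_texts(code_snippets)

# Save the word index so the prediction service can reuse it later
dictionary = tokenizer.word_index
with open('wordindex.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)

# Convert snippets to integer sequences, padded or truncated to 100 tokens
X = tokenizer.texts_to_sequences(code_snippets)
X = pad_sequences(X, maxlen=100)
# One-hot encode the language labels (one column per language)
Y = pd.get_dummies(languages)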
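
The following is a simplified, dependency-free sketch of what these steps produce: a frequency-ranked word index, integer sequences, left-padding to a fixed length, and one-hot labels. The helper names (build_word_index, to_sequence, pad_left, one_hot) are hypothetical, and the whitespace tokenizer is only a stand-in: the Keras Tokenizer also lowercases text and strips punctuation by default, which this sketch does not reproduce.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; index 1 is the most common, 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index, num_words):
    """Map words to indexes, dropping unknown words and indexes >= num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]


def pad_left(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros (the pad_sequences defaults)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


def one_hot(labels):
    """One column per distinct label, in sorted order."""
    columns = sorted(set(labels))
    return columns, [[int(label == c) for c in columns] for label in labels]


snippets = ["int x = 0 ;", "x = 1", "print x"]
labels = ["c++", "python", "python"]

word_index = build_word_index(snippets)
X = [pad_left(to_sequence(s, word_index, num_words=10000), maxlen=6) for s in snippets]
columns, Y = one_hot(labels)

print(word_index["x"])   # 1, because "x" is the most frequent word
print(X[1])              # [0, 0, 0, 1, 2, 6]
print(columns, Y)        # ['c++', 'python'] [[1, 0], [0, 1], [0, 1]]
```

Index 0 never appears in the word index, which is why it can double as the padding value.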

2. Build the model

The CodAI neural network consists of convolutional layers, an LSTM layer and a feed-forward network.

Python
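from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dropout, Dense

embed_dim = 128
lstm_out = 64

model = Sequential()
# Map each word index to a 128-dimensional vector
model.add(Embedding(max_fatures, embed_dim, input_length=100))
# Two convolution + pooling stages pick up local patterns and shorten the sequence
model.add(Conv1D(filters=128, kernel_size=3, padding='same', dilation_rate=1, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(Conv1D(filters=64, kernel_size=3, padding='same', dilation_rate=1, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
# The LSTM summarizes the shortened sequence into a single vector
model.add(LSTM(lstm_out))
model.add(Dropout(0.5))
model.add(Dense(64))
# One softmax output per language
model.add(Dense(len(Y.columns), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])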
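
As a rough check on the architecture, the sketch below walks the sequence length and feature size through each layer and counts trainable parameters using the standard formulas for these layer types. The helper functions are hypothetical, and the class count of 25 is an assumption taken from the language list used in the predict route later in this article.

```python
def conv1d_same(length, in_ch, filters, kernel_size):
    """'same' padding keeps the length; params = kernel weights + biases."""
    return length, filters, kernel_size * in_ch * filters + filters


def max_pool(length, pool_size):
    """Default stride equals pool_size with 'valid' padding, so the length floors."""
    return length // pool_size


def lstm_params(in_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((in_dim + units) * units + units)


def dense_params(in_dim, units):
    """Weight matrix plus one bias per unit."""
    return in_dim * units + units


vocab, embed_dim, seq_len, n_classes = 10000, 128, 100, 25  # n_classes is assumed

params = {"embedding": vocab * embed_dim}
length, ch = seq_len, embed_dim                                # (100, 128)
length, ch, params["conv1"] = conv1d_same(length, ch, 128, 3)  # (100, 128)
length = max_pool(length, 4)                                   # (25, 128)
length, ch, params["conv2"] = conv1d_same(length, ch, 64, 3)   # (25, 64)
length = max_pool(length, 2)                                   # (12, 64)
params["lstm"] = lstm_params(ch, 64)                           # output: 64 features
params["dense"] = dense_params(64, 64)
params["softmax"] = dense_params(64, n_classes)

print("LSTM input shape:", (length, ch))
for name, count in params.items():
    print(f"{name:>10}: {count:,}")
print("total:", f"{sum(params.values()):,}")
```

Under these assumptions the embedding layer holds most of the weights (1,280,000 of 1,392,729), and the LSTM only has to process 12 time steps instead of 100.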

The model summary is shown below.

Image 2

This model was trained for 400 epochs and reached 100% accuracy on the validation data.

Image 3

3. Serve the model as REST API

I used Flask and the Heroku cloud platform to serve the Keras model. The convert_text_to_index_array function converts the input code snippet into a list of word indexes, which is then fed into our neural network.

Python
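import keras.preprocessing.text as kpt

def convert_text_to_index_array(text):
    """Convert a code snippet into a list of single-element word-index lists."""
    global dictionary  # word index loaded from wordindex.json
    wordvec = []
    for word in kpt.text_to_word_sequence(text):
        # Unknown words, and indexes the Embedding layer cannot hold
        # (it accepts 0..9999), are mapped to 0
        if word in dictionary and dictionary[word] < 10000:
            wordvec.append([dictionary[word]])
        else:
            wordvec.append([0])
    return wordvec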

The predict route processes the input, computes a score for each class and returns the result as JSON.

Python
import flask
from keras.preprocessing.sequence import pad_sequences

# app is the flask.Flask instance and model the trained Keras model,
# both created elsewhere in the service
@app.route("/predict", methods=["POST"])
def predict():
    global model
    data = {"success": False}
    if flask.request.method == "POST":
        # The request body is the code snippet as a JSON string
        code_snip = flask.request.json
        word_vec = convert_text_to_index_array(code_snip)
        X_test = pad_sequences([word_vec], maxlen=100)
        # Reshape to (1, 100) and take the class scores of the single sample
        y_prob = model.predict(X_test[0].reshape(1, X_test.shape[1]), batch_size=1, verbose=2)[0]
        # Labels must be in the same order as the columns of Y used for training
        languages = ['angular', 'asm', 'asp.net', 'c#', 'c++', 'css', 'delphi', 'html',
                     'java', 'javascript', 'objectivec', 'pascal', 'perl', 'php',
                     'powershell', 'python', 'razor', 'react', 'ruby', 'scala', 'sql',
                     'swift', 'typescript', 'vb.net', 'xml']

        data["predictions"] = []
        for i in range(len(languages)):
            r = {"label": languages[i], "probability": format(y_prob[i] * 100, '.2f')}
            data["predictions"].append(r)
        data["success"] = True
    return flask.jsonify(data)
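
A client for this route could look like the sketch below, which uses only the standard library. It assumes that the request body is a bare JSON string holding the code snippet (which is what the handler above passes to convert_text_to_index_array) and that the service is reachable at the /predict path of the demo URL given at the top of the article. Both the URL and the helper names are assumptions, and the network call itself is left commented out.

```python
import json
import urllib.request

# Assumed endpoint: the demo host from the article plus the /predict route.
PREDICT_URL = "https://codai.herokuapp.com/predict"


def build_request(code_snippet, url=PREDICT_URL):
    """Build a POST request whose JSON body is the snippet as a plain string."""
    body = json.dumps(code_snippet).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def top_prediction(response_data):
    """Return (label, probability) for the highest-scoring class.

    The route formats probabilities as strings, so convert before comparing.
    """
    best = max(response_data["predictions"], key=lambda p: float(p["probability"]))
    return best["label"], float(best["probability"])


# Example with a hand-written response in the shape the route returns.
sample_response = {
    "success": True,
    "predictions": [
        {"label": "java", "probability": "3.10"},
        {"label": "python", "probability": "96.40"},
    ],
}
print(top_prediction(sample_response))   # ('python', 96.4)

# To call a live server (not executed here):
# with urllib.request.urlopen(build_request("print('hi')")) as resp:
#     print(top_prediction(json.load(resp)))
```

Converting the probability strings to floats before comparing matters, because comparing them as text would rank "9.50" above "96.40".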

Conclusion

I learned many new things from this project. Programming language detection was a bit challenging for me. I hope you enjoyed this article.

History

Updated broken image link

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Comments and Discussions

 
Question: Does not detect Java correctly
Member 441012712-Aug-18 11:36
The code snippet below (evidently Java) was detected 95.9% as REACT.

public static  boolean buildChain(X509Certificate certificate, LinkedList<X509Certificate> answer,
                                       Map<String, List<X509Certificate>> knownCerts) {
        java.security.Principal subject = certificate.getSubjectDN();
        java.security.Principal issuer = certificate.getIssuerDN();
        // Check if the certificate is a root certificate (i.e. was issued by the same Principal that
        // is present in the subject)
        if (subject.equals(issuer)) {
            answer.addFirst(certificate);
            return true;
        }
        // Get the list of known certificates of the certificate's issuer
        List<X509Certificate> issuerCerts = knownCerts.get(issuer.getName());
        if (issuerCerts == null || issuerCerts.isEmpty()) {
            // No certificates were found so building of chain failed
            return false;
        }

Praise: Nice article
Member 1392492924-Jul-18 23:52
Bug: Confusion: 'Functional Codes' AND 'String Contained Codes'
Alaa Ben Fatma, 8-Apr-18 3:49
Question: mis-detection of x86 asm
kevinrhoads, 5-Mar-18 9:42
Answer: Re: mis-detection of x86 asm
Rupesh Sreeraman, 6-Mar-18 3:42