Can we trust deep learning to detect duplicate code? GitHub did it

GitHub needs no introduction. Anyone who has been even remotely involved in developing, testing, deploying or marketing software has a clear idea of what GitHub is.

As a software development platform, GitHub lets developers all over the world host, share, and discover software, particularly, but not only, open source. That raises an obvious question: just how much code does GitHub actually store?

Clair Sullivan, Machine Learning Engineer at GitHub, came to the Data Innovation Summit to tell us how much code data GitHub actually has and what exactly it does with it.

Clair Sullivan Machine Learning Engineer at GitHub presenting at the Data Innovation Summit 2019 — *Photo by Hyperight AB® / All rights reserved.*

At the time of her presentation, Clair stated that GitHub stored around 0.8 petabytes of code produced by developers.

Going a step further, Clair explains that all of this data can be represented as a graph of connected data. What does that mean? Simply put, a graph links pieces of data to one another, such as a developer to the code they write, the pull requests they open, the issues they create, the comments they make, and the repositories they clone.

Read Clair Sullivan’s Data and AI Innovators interview

Duplicate code – Why it’s such a problem for GitHub

Clair’s talk focused on a particularly interesting aspect of code: duplicate code. You may wonder why copying a piece of code and posting it in another repository would cause such a fuss.

To give us an idea of why duplicate code is the focus of her talk, Clair draws attention to a paper on duplicate code on GitHub and how much of it resides there.

A map of duplicate code on GitHub — **Image source**: Clair Sullivan’s PDF presentation at the Data Innovation Summit 2019

As the map shows, a staggering 93% of JavaScript on GitHub is duplicated. Given that volume, GitHub is looking for ways to put it to use and help developers by suggesting existing code based on similarity.

Types of duplicate code

Clair describes five types of duplicate code:

Type 0 – completely identical
Type 1 – the only difference is in comments and whitespace
Type 2 – can also include variations in identifier names and literal values
Type 3 – syntactically similar with differences at the statement level (statements can be added, removed, or modified)
Type 4 – syntactically different but semantically similar (they do the same thing)

Detecting Types 0, 1 and 2 is quite easy using hashing, but the real fun starts with Types 3 and 4, which are much harder to detect.
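To see why hashing is enough for the first three types, here is a minimal sketch. It assumes Python source and Python's own `tokenize` module purely for illustration (the work described in the talk used Java), and `fingerprint` is a hypothetical helper name, not part of any GitHub tooling. The idea is to normalize away exactly the differences each clone type allows and then hash what is left, so that clones of that type end up with the same hash.

```python
import hashlib
import io
import keyword
import tokenize

# Layout tokens are kept as markers so block structure still counts,
# while the exact amount of whitespace does not.
LAYOUT = {
    tokenize.NEWLINE: "<NEWLINE>",
    tokenize.INDENT: "<INDENT>",
    tokenize.DEDENT: "<DEDENT>",
}
# Comments, blank-line breaks and the end marker carry no code.
IGNORED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}


def fingerprint(source: str, clone_type: int = 0) -> str:
    """Hash `source` so that clones of the given type get the same digest.

    clone_type 0: hash the raw text.
    clone_type 1: ignore comments and whitespace.
    clone_type 2: additionally mask identifier names and literal values.
    """
    if clone_type == 0:
        normalized = source
    else:
        parts = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type in IGNORED:
                continue
            if tok.type in LAYOUT:
                parts.append(LAYOUT[tok.type])
            elif (clone_type >= 2 and tok.type == tokenize.NAME
                  and not keyword.iskeyword(tok.string)):
                parts.append("<ID>")   # any identifier looks the same
            elif clone_type >= 2 and tok.type in (tokenize.NUMBER, tokenize.STRING):
                parts.append("<LIT>")  # any literal looks the same
            else:
                parts.append(tok.string)
        normalized = "\x00".join(parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two snippets that differ only in comments and spacing then share a `clone_type=1` fingerprint, and two that also differ in variable names share a `clone_type=2` fingerprint. No such normalization exists for Types 3 and 4, because added or removed statements change the token stream itself.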

Graphs of code

Explaining Type 3 duplicate code, Clair gives the example of the two functions below:

Image source: Clair Sullivan’s PDF presentation at the Data Innovation Summit 2019

The functions may look very different at first sight, but both perform a form of addition.

ASTs, or abstract syntax trees, describe how code is put together, and in essence they are nothing other than graphs that are directed and acyclic, meaning there are no cycles in them.
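As a rough illustration of an AST as a directed graph, the sketch below uses Python's built-in `ast` module rather than the Java tooling from the talk. The `ast_edges` helper and its preorder node numbering are assumptions made for this example.

```python
import ast


def ast_edges(source: str):
    """Return (labels, edges) for the AST of `source`.

    labels[i] is the node type of node i, and each edge is a
    (parent_index, child_index) pair pointing from parent to child.
    """
    labels = []
    edges = []

    def visit(node, parent):
        index = len(labels)  # nodes are numbered in visiting order
        labels.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(source), None)
    return labels, edges


labels, edges = ast_edges("def add(a, b):\n    return a + b\n")
for parent, child in edges:
    print(f"{labels[parent]} -> {labels[child]}")
```

Every edge points from a parent to a child that was numbered later, so following edges can never lead back to an earlier node. That is the directed, acyclic property in practice.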

If we represent the ASTs of the two functions above as graphs, we get the two visuals below:

Turning AST functions into graphs of code — **Image source**: Clair Sullivan’s PDF presentation at the Data Innovation Summit 2019

A graph can be expressed numerically through its adjacency matrix, a square matrix in which entry (i, j) records whether node i is connected to node j. That is good news for machine learning people who want to crunch numbers: the adjacency matrix turns the graph into numbers and vectors that they can work with.

Watch the interview with Clair Sullivan

Adjacency matrix for graphs — **Image source**: Clair Sullivan’s PDF presentation at the Data Innovation Summit 2019
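An adjacency matrix can be built from a list of edges in a few lines. This is a sketch under assumptions: the tiny four-node graph is made up, and whether the matrices in the talk were directed or symmetric is not stated, so the hypothetical `adjacency_matrix` helper supports both.

```python
def adjacency_matrix(n_nodes, edges, directed=True):
    """Build an n_nodes x n_nodes matrix of 0s and 1s from (source, target) edges."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for source, target in edges:
        matrix[source][target] = 1
        if not directed:
            matrix[target][source] = 1  # mirror the edge for an undirected view
    return matrix


# A made-up tree: node 0 has children 1 and 2, node 2 has child 3.
edges = [(0, 1), (0, 2), (2, 3)]
for row in adjacency_matrix(4, edges):
    print(row)
```

Row i lists the nodes that node i points to, so leaf nodes show up as all-zero rows.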

But going back to the headline of the article, it promised deep learning, and the kind of deep learning used here works best on images. What if we take these numbers, rows and columns and turn them into an image, Clair suggests. This is what she did in her research: she took the code and its graph and turned them into a picture of the code or, to be more precise, of the AST adjacency matrix of a function:

Graphs of code represented as an image — **Image source**: Clair Sullivan’s PDF presentation at the Data Innovation Summit 2019
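Turning a matrix of 0s and 1s into a picture is mostly a matter of mapping values to pixel intensities. The sketch below assumes that matrices are zero-padded to a fixed side length so every image has the same shape (the source does not say how differently sized graphs were handled), and it writes the plain-text PGM grayscale format. `to_image` and `to_pgm` are hypothetical names.

```python
def to_image(matrix, size):
    """Place a 0/1 matrix in the top-left corner of a size x size grayscale grid."""
    n = len(matrix)
    if n > size:
        raise ValueError(f"graph has {n} nodes, image side is only {size}")
    pixels = [[0] * size for _ in range(size)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value:
                pixels[r][c] = 255  # an edge becomes a white pixel
    return pixels


def to_pgm(pixels):
    """Serialize the pixel grid as a plain (P2) PGM image."""
    height, width = len(pixels), len(pixels[0])
    lines = ["P2", f"{width} {height}", "255"]
    lines += [" ".join(str(v) for v in row) for row in pixels]
    return "\n".join(lines) + "\n"


image = to_image([[0, 1], [0, 0]], size=3)
print(to_pgm(image))
```

Saving the returned text to a `.pgm` file gives an image most viewers can open, with one white pixel per edge in the graph.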

The next step was to apply a convolutional neural network to see whether it could detect duplicate images, that is, duplicate code.

If we can identify developers going down the path of creating duplicate code, can we help them write that code?

Watch the full presentation with Clair Sullivan

Deep learning model for detecting duplicates in graphs of code

Clair then walked through the workflow for running deep learning on the graph images to detect duplicate code.

She relied on a labelled dataset called BigCloneBench, which contains 6.7 million Java files labelled by type of duplicate. The idea was to go through all of them and pick out the Type 3 duplicates, which are written in the same language and are syntactically slightly different but semantically identical.

First, she identified pairs of files in the dataset that are labelled as Type 3 duplicates and compared them. She then built a training set from those labels, marking each example as either a Type 3 clone or a negative clone (not Type 3).

Next, the code is converted into ASTs in JSON form, each AST is converted into an adjacency matrix, and a convolutional neural network is trained to decide whether the inputs are Type 3 duplicates or negative clones. Detection is done at both the file level and the function level.
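The model from the talk is not described in enough detail to reproduce, so the following is only a sketch of the core operation inside any convolutional layer, written in plain Python. The 3x3 image and the 2x2 filter are made up; in a real network the filter weights are learned during training, and a deep learning framework would be used instead of nested loops.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# A made-up adjacency-matrix image and a filter that responds to
# two edges lying on the same diagonal.
image = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
diagonal = [[1, 0],
            [0, 1]]
print(relu(conv2d(image, diagonal)))
```

The filter responds most strongly where the pattern it encodes appears in the image. Stacking many learned filters like this is how a convolutional network picks up on structural patterns in the adjacency-matrix images.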

Results

After running the file-to-file comparison, Clair got an accuracy of almost 93% on the test set. The training was done on 5,000 negative clones and 8,220 Type 3 clones, with 20% loss.

File to file comparison on images of graphs of code — **Image source**: Clair Sullivan’s PDF presentation at the Data Innovation Summit 2019

As for the function-to-function comparison, she got 97.7% accuracy in a test with nearly 28,000 negative clones and 53,000 total training cases, which are great numbers.

To make sure she wasn’t being fooled by the results, which is not uncommon with deep learning algorithms, Clair added 1% random noise to the images. As expected, the test accuracy went down, but as she states, there is no proof that the results are 100% accurate, so they have to be taken with a grain of salt.
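A noise check of this kind could look roughly like the sketch below. How the 1% of noise was actually applied is not stated, so this version makes an assumption: it flips a randomly chosen 1% of the pixels in a black-and-white image. `add_noise` is a hypothetical helper.

```python
import random


def add_noise(pixels, fraction=0.01, seed=0):
    """Return a copy of a 0/255 image with `fraction` of its pixels flipped."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    height, width = len(pixels), len(pixels[0])
    noisy = [row[:] for row in pixels]  # leave the original untouched
    n_flips = round(fraction * (height * width))
    for position in rng.sample(range(height * width), n_flips):
        r, c = divmod(position, width)
        noisy[r][c] = 255 - noisy[r][c]  # 0 becomes 255 and vice versa
    return noisy


clean = [[0] * 10 for _ in range(10)]
noisy = add_noise(clean)
changed = sum(a != b for row_a, row_b in zip(clean, noisy)
              for a, b in zip(row_a, row_b))
print(changed, "of 100 pixels changed")
```

Evaluating the trained model on the noisy copies shows how much its accuracy depends on exact pixel patterns.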

But at the end of the day, Clair presented only one way, one node in the whole graph, for doing graph analytics on code.

Future work

Clair admits that GitHub’s current approach to clone detection runs in O(n²) time, which is not efficient, and that’s why it’s not in production. As she says, GitHub is working on finding a way out of this O(n²) regime.
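The source gives only the complexity class, but a quadratic cost is what you get when every item is compared with every other item. Assuming that is where it comes from here, this small sketch shows how quickly the number of comparisons grows. `pair_count` and `all_pairs` are illustrative names.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def all_pairs(items):
    """Every unordered pair that a brute-force comparison would have to visit."""
    return list(combinations(items, 2))


print(len(all_pairs(["a.java", "b.java", "c.java", "d.java"])))
for n in (1_000, 2_000, 4_000):
    print(n, pair_count(n))
```

Doubling the number of files roughly quadruples the number of comparisons, which is why a brute-force pairwise approach does not scale to a code base the size of GitHub's.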

Most of their effort, however, goes into detecting Type 4 duplicates, which, as Clair states, “is the Holy Grail” for research. For Type 3 detection they worked with Java, but they want to expand to cross-language comparisons such as Java to Haskell.

But code is not the only data on GitHub, as we’ve seen. There’s also data on users, their activity and their interactions on the site, and all of it can potentially be represented as a graph, Clair points out. These are some of GitHub’s exciting areas of work, and hopefully we’ll see results from them soon.

