Data Innovation Summit turns five next March. Along the way, we have had fantastic speakers unselfishly sharing their knowledge on stage with their peers. Without them, this journey would have been impossible.
This interview is part of a series dedicated to humanising Data and AI innovation and celebrating speakers who have presented at the Data Innovation Summit. The emphasis is on the Data/AI practitioners, their professional journeys and their stories.
If all the code on GitHub can be represented as graphs and images, we can imagine the importance of applying machine and deep learning to understand code. This is the topic Clair Sullivan introduced at the Data Innovation Summit 2019.
Clair demonstrated a deep learning method that can be used to detect duplicate code on GitHub, which accounts for more than half of all the code on the platform. Clair’s Data Innovation Summit presentation is also summarised in an article here on Hyperight Read.
Clair’s talk was really compelling, so we talked to her to follow up on the progress with detecting duplicate code at GitHub, as well as her views on the latest developments in the field.
Hyperight: Hi Clair, we are glad to have you with us today and have the chance to catch up. You were a speaker at Data Innovation Summit 2019. To refresh our memories and introduce yourself to our readers, please tell us a bit about yourself and the company you come from.
Clair Sullivan: I am a data scientist for GitHub, which is a platform for developer collaboration and home to the world’s largest community of developers and their code.
Hyperight: Next year we are celebrating our 5th anniversary. A lot has changed with data and advanced analytics during these 5 years. From your point of view, what have been the biggest changes and advancements?
Clair: To me, the biggest change in data science and analytics has been the rise and availability of easy-to-use libraries for deep learning. This has truly taken data science to the next level by democratizing the ability of data scientists to apply GPUs to their problems with best-in-class deep learning models.
Hyperight: Clair, on stage at the Data Innovation Summit, you presented how GitHub uses deep learning on graphs of code to detect duplicate code. You also mentioned that you were working on expanding to Type 4 duplicate code detection and on representing all GitHub data in a graph. How is it going with this so far?
Clair Sullivan: If you think about it, all of GitHub can be represented as a graph. We have users connected to other users, users connected to code, and code connected to code. At the Data Innovation Summit last year, I gave some results illustrating a method that could be used to analyze the latter. However, we are also continuing to look at all of those graphs. The technical challenges I described regarding detecting duplicate code still remain, but I am very excited about the progress we’ve made and where we are going with all types of graphs!
“If you think about it, all of GitHub can be represented as a graph. We have users connected to other users, users connected to code, and code connected to code.”
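As a rough illustration of this idea (a hypothetical sketch, not GitHub’s actual data model), users and code files can be treated as nodes, with relationships such as “follows”, “contributes” and “imports” as labeled edges in a simple adjacency list:

```python
from collections import defaultdict

# Hypothetical GitHub-as-a-graph sketch: node names and edge kinds
# below are illustrative assumptions, not real GitHub entities.
edges = defaultdict(list)

def add_edge(src, dst, kind):
    """Record a directed, labeled edge from src to dst."""
    edges[src].append((dst, kind))

# Users connected to other users
add_edge("alice", "bob", "follows")
# Users connected to code
add_edge("alice", "repo/app.py", "contributes")
add_edge("bob", "repo/utils.py", "contributes")
# Code connected to code
add_edge("repo/app.py", "repo/utils.py", "imports")

def neighbors(node):
    """Everything a node points at, regardless of edge kind."""
    return [dst for dst, _ in edges[node]]

print(neighbors("alice"))  # ['bob', 'repo/app.py']
```

On a structure like this, questions about users, repositories, and duplicated code all become graph traversals or graph-learning problems.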
Hyperight: I’m sure our readers are really interested to know, what are some other use cases where GitHub employs machine learning and deep learning?
Clair Sullivan: GitHub represents the largest dataset of code in existence. So you can imagine that applying machine and deep learning to understand code is of great importance. But we have also released other products recently, such as the ability to identify trending repositories and users that you might want to follow, or to point out good first issues that new open-source developers might want to get involved with.
To me, the biggest change in data science and analytics has been the rise and availability of easy-to-use libraries for deep learning.
Hyperight: And finally, what is your outlook for machine learning and deep learning 10 years from now? Where do you see it all going?
Clair Sullivan: We are starting to see tools coming out that allow people to apply ML methods without having to write any code at all! Deep learning has long been regarded as a field requiring a great deal of expertise. However, when we have tools that allow non-experts to deploy models, we will see improved solutions because of the additional diversity of thought. This opens the door to many more people applying these methods to many more problems, which is very exciting.
I also think that as a community we will need to be involved in determining best practices for data ethics. We will need to develop an understanding of how our models are biased and how those biases impact everyday life. There are some great efforts underway here, led by some of the best minds in the field, but I think this is an area that will really grow in the next 10 years.