Public defence of the doctoral dissertation of Natarajan Chidambaram

Dissertation title: "Improving Bot Identification in Collaborative Software Development on GitHub".

Supervisor: Mr Tom Mens; co-supervisor: Mr Souhaib Ben Taieb

Dissertation abstract: Contemporary social coding platforms such as GitHub facilitate collaborative distributed software development. This enables developers to contribute to software projects from different parts of the world. Developers on these platforms perform various activities such as committing files, creating pull requests, performing code reviews, creating and deleting branches, updating documentation, and deploying and releasing new software versions. As these activities can be effort-intensive, repetitive and error-prone, repository maintainers and developers frequently use automated mechanisms (e.g., bot accounts, apps, automated workflows, and other internal or external automation services) to perform them. Bot accounts and GitHub Apps are widely used in GitHub repositories and are among the most active contributors in certain repositories. Determining whether a contributor is a bot or a human is important in socio-technical studies, for example to assess the positive and negative impact of using bots, analyse their evolution and usage, and identify and accredit top contributors. The main aim of this dissertation is to improve bot identification in GitHub. While multiple bot identification approaches have been proposed in the past, they suffer from limitations that make them difficult to use in practice. By creating multiple datasets, developing new bot identification models, and performing quantitative analysis, we provide several novel contributions. We show that bots are regularly among the most active contributors, even though GitHub does not explicitly acknowledge the presence of several bots. We also show that existing bot identification approaches do not perform well in identifying bot contributors. As a first step, we therefore develop two models to improve bot identification in GitHub.
One model leverages the predictions made by an existing approach across multiple GitHub repositories and provides an overall improved performance. The other is an ensemble model that combines the predictions made by existing approaches to improve bot identification in GitHub. Then, we propose a dataset of contributor activity sequences that can be extracted from low-level events provided by the GitHub REST API. Based on these activities, we identify features that can statistically differentiate bots from human contributors engaged in collaborative software development. We also propose a manually labelled ground-truth dataset of bots and human contributors that can be used to compare the performance, efficiency and limitations of existing bot identification approaches. Using this ground truth and these distinguishing features, we train BIMBAS, a new binary classification model that identifies bots based on their recent activities in GitHub. Through an empirical study, we use BIMBAS to detect the presence of bots among thousands of contributors and identify how bots are being used in a large software ecosystem. We reveal behavioural differences between bots and humans, and between different bot categories. To enable the practical use of BIMBAS, we develop an open-source bot identification tool called RABBIT, which is more efficient than existing bot identification approaches.

Address

Avenue du Champ de Mars, 22
7000 Mons, Belgium
