Emil Milanov:

An alternative confusion matrix visualization for PreCall

Requirements

Web technologies (JavaScript, e.g. React)
Basic knowledge regarding Machine Learning (binary classifiers) and User Interfaces

Academic Advisor

Prof. Dr. Claudia Müller-Birn

Discipline

Interactive Information Visualization, Software Engineering, User Interfaces for Machine Learning Systems

Degree

Bachelor of Science (B.Sc.)

Context

Due to Wikipedia's popularity and the resulting high number of edits per day, manually reviewing every edit is no longer possible. Wikipedians therefore use automated quality control tools, most of which employ Machine Learning classifiers (1).

However, strict quality control policies for automated tools can leave new users feeling discouraged and unwelcome, because edits they made in good faith are automatically reverted without any additional feedback. Consequently, newly developed automated review tools need to strike a balance between effective quality control and the socialization of new users. This is precisely the goal of ORES. It encompasses a set of Machine Learning models that evaluate articles and edits on Wikipedia according to different metrics. For example, the "damaging" model classifies whether a given edit is vandalism or not.

To make ORES more accessible to users who are not experts in Machine Learning, we designed PreCall (2), a visual interface for ORES' "damaging" model. PreCall visualizes some of the hyperparameters one can set for the model, giving the user an intuitive overview of how the different settings affect the model's results.

Problem

The pilot usability study of PreCall revealed difficulties that non-expert users had with the interface. Two different use cases arise:

(a) Using ORES to identify damaging edits for human review (manual vandalism detection): High recall (true positive rate) is key, but it comes at the cost of a low threshold, which leads to a high number of edits that a human would need to review.

(b) Using ORES to optimize an auto-revert bot: Precision needs to be high to ensure a low number of false positives (good edits falsely detected as damaging), which requires a very high threshold. This, in turn, leads to a high number of false negatives (damaging edits that are not detected as damaging).
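The trade-off behind the two use cases can be illustrated with a small sketch. It assumes that the model assigns each edit a "damaging" score between 0 and 1 and that an edit is flagged when its score reaches the chosen threshold. The scores, labels, and function names below are hypothetical and made up for illustration; they are not taken from ORES or PreCall.

```python
from typing import NamedTuple, Sequence


class Confusion(NamedTuple):
    tp: int  # damaging edits flagged as damaging
    fp: int  # good edits flagged as damaging
    fn: int  # damaging edits that were not flagged
    tn: int  # good edits that were not flagged


def confusion_at(scores: Sequence[float], labels: Sequence[bool],
                 threshold: float) -> Confusion:
    """Count outcomes when every edit scoring >= threshold is flagged."""
    tp = fp = fn = tn = 0
    for score, damaging in zip(scores, labels):
        flagged = score >= threshold
        if flagged and damaging:
            tp += 1
        elif flagged:
            fp += 1
        elif damaging:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def precision(c: Confusion) -> float:
    # Share of flagged edits that are really damaging (0.0 if nothing is flagged).
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Confusion) -> float:
    # Share of damaging edits that were flagged (0.0 if there are none).
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


# Hypothetical model scores and ground-truth labels (True = damaging edit).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90, 0.95]
labels = [False, False, False, True, False, False, True, False, True, True]

review = confusion_at(scores, labels, threshold=0.3)  # use case (a)
bot = confusion_at(scores, labels, threshold=0.8)     # use case (b)

print("review:", review, precision(review), recall(review))
print("bot:   ", bot, precision(bot), recall(bot))
```

With this toy data, the low threshold catches all four damaging edits but flags seven of the ten edits for review, while the high threshold produces no false positives but misses two of the four damaging edits.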

Objectives

The goal of this work is to improve and extend the existing PreCall interface using the Classee approach (3). The outcome should be a user interface that allows a user with limited machine learning experience to explore the settings and range of the ORES model.

Procedure

Investigate the literature about PreCall (2,5), Classee (3,4) and ORES (1,6,7)
Derive requirements for a new prototype
Design a paper prototype and test it
Implement a high-fidelity prototype (web-based interactive UI)
Plan, conduct and evaluate a user study

References

(1) Halfaker, Aaron, and R. Stuart Geiger. “ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia.” ArXiv:1909.05189 [Cs], 2019. https://arxiv.org/abs/1909.05189.

(2) Kinkeldey, Christoph, Claudia Müller-Birn, Tom Gülenman, Jesse Josua Benjamin, and Aaron Halfaker. “PreCall: A Visual Interface for Threshold Optimization in ML Model Selection.” ArXiv:1907.05131 [Cs], July 11, 2019. http://arxiv.org/abs/1907.05131.

(3) Beauxis-Aussalet, Emma, Joost van Doorn, and Lynda Hardman. “Supporting End-User Understanding of Classification Errors,” 2018. https://doi.org/10.1145/3232078.3232096.