
Improving the reliability of Saros using Root Cause Analysis

worked on by: Sebastian Starroske


I continued the Root Cause Analysis after handing in my Master Thesis. I will publish the results on this website on Sunday, February 10.
Current Version: RCA

Outline

This thesis aims to uncover some important fundamental weaknesses of the Saros product that affect stability and the avoidance of inconsistencies and, at the same time (as far as possible), to find out which changes to the Saros development process should help prevent similar weaknesses in the future or help eliminate the current weaknesses thoroughly and permanently.

The approach is via defect corrections: a few defects are selected one after another from the defect database, located in the code, and then not simply fixed but additionally analyzed to determine which product properties presumably caused or contributed to their occurrence, and which process properties were either responsible for these product properties or prevented the defect from being avoided despite them. In the course of this analysis, a chain of causes and effects is identified that describes quality-relevant relationships in the software process. The higher-level causes in this chain are also called root causes, and they are a valuable and proven aid for process improvement.

Building on this, the core of this thesis is to identify structural improvements in both the product and the process that eliminate as many of these root causes as possible, and to implement some of them fully or partially.

Thesis Requirements

  • Improve consistency of Saros
  • Name and describe all current inconsistency types occurring in Saros
  • Determine risk and probability of all types
  • Understand the causes of the types which have the highest risk value
  • Perform a Root Cause Analysis (RCA) for the types with the highest risk value
    • analyze the results regarding coverage
    • analyze tools and methods and create a short handbook / documentation on how RCA should be performed in the future of the Saros project
  • Present / implement possible solutions for the found root causes
  • nice-to-have: Analyze how the solved root causes reduce other non-inconsistency problems

Milestones and Planning

A milestone is a scheduled event signifying the completion of a major deliverable or a set of related deliverables. A milestone has zero duration and no effort -- there is no work associated with a milestone. It is a flag in the workplan to signify some other work has completed. Usually a milestone is used as a project checkpoint to validate how the project is progressing and revalidate work. (Source: http://www.mariosalexandrou.com/definition/milestone.asp)

| Milestone no. | Milestone | Goals | Target | Past CW accomplished | Notes |
| 1 | Register Thesis | literature research; working on welcome checklist; getting to know Saros; planning the project | DONE CW27 | not accomplished - CW28 | |
| 2 | Concept Presentation | further literature research; clustering of bugs/events; determining risk value for each bug / cluster (occurrence * consequences); outline of thesis; PSP | DONE CW31 | not accomplished - CW34 | started working on RCA I |
| 3 | Presentation of a detailed schedule | time scheduling | DONE CW33 | accomplished | |
| 4 | RCA I | gathering data through fixing errors; try to identify first root causes | DONE CW39 | in progress - currently performing last steps | |
| 5 | RCA II | identification and fixing of root causes; statistical analysis on coverage | DONE CW45 | in progress | |
| 6 | Hand in thesis | finishing thesis; finishing presentation; finish open tasks | DONE CW50 | | |

Weekly Status

Week 3 (CW 23)

Activities

  • Vacation

Week 4 (CW 24)

Activities

  • worked on clustering the bugs from the bug tracker
  • JarSync
  • Literature

Results

  • possible way of clustering bugs / events / phenomena:
    • Inconsistency and Invitation
    • Inconsistency and Network / Protocol / internal Read-Only
    • Inconsistency and file / directory or SVN operation (OS level)
    • User Read-Only Mode
    • GUI misbehaviour

Next Steps

  • clustering
  • evaluating risk and prioritizing
  • steps for registering thesis

Problems

Week 5 (CW 25)

Activities

  • work on clustering the bugs from the bug tracker
  • JarSync
  • Analyzing reproducibility of the bugs
  • Evaluating risk (occurrence and impact)

Results

  • updated clustering:
    • Inconsistency and Invitation (10 entries)
    • Basic Inconsistency - Recovery (12 entries)
    • Basic Inconsistency - Partial Sharing (2 entries)
    • Basic Inconsistency - Communication (6 entries)
    • Follow Mode (12 entries)
    • Inconsistency and file / directory or SVN operation (OS level)(8 entries)
    • User Read-Only Mode (7 entries)
    • GUI misbehaviour (6 entries)
  • finished risk analysis and prioritization of clusters
    • Inconsistency and Invitation
    • Basic Inconsistency - Recovery
    • Inconsistency and file / directory or SVN operation (OS level)
    • User Read-Only Mode (7 entries)
    • the other clusters don't have many serious open bugs (most of them are already closed)
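
The risk value used for prioritization (occurrence * impact) can be illustrated with a minimal sketch. The class and field names and the 1 to 5 scoring scale are assumptions made for this example, and the scores in `main` are placeholders, not results of the analysis.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: rank bug clusters by risk value = occurrence * impact. */
class RiskRanking {

    /** A bug cluster with hypothetical ordinal scores (1 = low, 5 = high). */
    static final class Cluster {
        final String name;
        final int occurrence;
        final int impact;

        Cluster(String name, int occurrence, int impact) {
            this.name = name;
            this.occurrence = occurrence;
            this.impact = impact;
        }

        /** Risk value as the product of occurrence and impact. */
        int risk() {
            return occurrence * impact;
        }
    }

    /** Returns a new list sorted by descending risk value (ties keep input order). */
    static List<Cluster> prioritize(List<Cluster> clusters) {
        List<Cluster> sorted = new ArrayList<>(clusters);
        sorted.sort((a, b) -> Integer.compare(b.risk(), a.risk()));
        return sorted;
    }

    public static void main(String[] args) {
        // Placeholder scores for illustration only.
        List<Cluster> clusters = new ArrayList<>();
        clusters.add(new Cluster("Cluster A", 2, 3));
        clusters.add(new Cluster("Cluster B", 4, 5));
        clusters.add(new Cluster("Cluster C", 5, 1));
        for (Cluster c : prioritize(clusters)) {
            System.out.println(c.name + ": risk " + c.risk());
        }
    }
}
```

A simple product keeps the ranking easy to explain, but it treats a frequent low-impact cluster the same as a rare high-impact one with equal product, so the ordering should be read as a starting point for discussion.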

Next Steps

  • start with cluster: Inconsistency and Invitation
    • comprehend the causes of already closed bugs and understand how they have been fixed
    • from those bugs, go deeper to find more underlying causes / contributing factors or try to apply the knowledge gained to fix open bugs

Problems

Week 6 (CW 26)

Activities

  • work on bugs 3458952, 3512804 and 3300579

Results

  • bugs were understood and could be reproduced
  • 3300579 was reopened, because it still exists in the current version of Saros

Next Steps

  • RM during the next week
  • Try to reproduce and fix bug 3489409

Problems

Week 8 (CW 28)

Activities

  • Register Thesis

Week 9 - 13 (CW 29 - CW 33)

Activities

  • work on 3541540 Activity queuing is broken during synchronization

Results

  • uploaded first patch in CW 33
  • changed activity queuing: all activities are now queued in project-specific blocking queues and then executed by dispatcher threads
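
The queuing scheme described above can be sketched roughly as follows. This is not the Saros patch itself: the `Activity` class, the method names, and the choice of exactly one dispatcher thread per project are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: one blocking queue and one dispatcher thread per shared project. */
class ActivityQueueSketch {

    /** Hypothetical stand-in for a Saros activity. */
    static final class Activity {
        final String projectId;
        final String description;

        Activity(String projectId, String description) {
            this.projectId = projectId;
            this.description = description;
        }
    }

    /** Sentinel that tells a dispatcher thread to stop. */
    private static final Activity STOP = new Activity("", "stop");

    private final Map<String, BlockingQueue<Activity>> queues = new HashMap<>();
    private final Map<String, Thread> dispatchers = new HashMap<>();
    private final List<String> executed = new CopyOnWriteArrayList<>();

    /** Queues an activity; queue and dispatcher are created on first use. */
    synchronized void enqueue(Activity activity) {
        BlockingQueue<Activity> queue = queues.get(activity.projectId);
        if (queue == null) {
            queue = new LinkedBlockingQueue<>();
            queues.put(activity.projectId, queue);
            final BlockingQueue<Activity> q = queue;
            Thread dispatcher =
                new Thread(() -> dispatch(q), "dispatcher-" + activity.projectId);
            dispatchers.put(activity.projectId, dispatcher);
            dispatcher.start();
        }
        queue.add(activity);
    }

    /** Takes activities in FIFO order and "executes" them by logging. */
    private void dispatch(BlockingQueue<Activity> queue) {
        try {
            while (true) {
                Activity activity = queue.take();
                if (activity == STOP) {
                    return;
                }
                executed.add(activity.projectId + ":" + activity.description);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Lets every dispatcher finish its queue, then waits for the threads. */
    synchronized void shutdown() {
        for (BlockingQueue<Activity> queue : queues.values()) {
            queue.add(STOP);
        }
        try {
            for (Thread dispatcher : dispatchers.values()) {
                dispatcher.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Log of executed activities in execution order. */
    List<String> executedLog() {
        return executed;
    }

    public static void main(String[] args) {
        ActivityQueueSketch sketch = new ActivityQueueSketch();
        sketch.enqueue(new Activity("projectA", "edit 1"));
        sketch.enqueue(new Activity("projectB", "edit 1"));
        sketch.enqueue(new Activity("projectA", "edit 2"));
        sketch.shutdown();
        // Order is guaranteed per project only, not across projects.
        System.out.println(sketch.executedLog());
    }
}
```

With a separate queue per project, activities for one project are executed in the order they were queued, while a slow project does not block the others.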

Next Steps

  • Preparing concept presentation

Problems

Week 14 (CW 34)

Activities

  • work on 3541540 Activity queuing is broken during synchronization
  • preparing concept presentation

Results

  • presentation took place on August 23rd

Next Steps

  • continue working on 3541540

Week 15 (CW 35)

Activities

  • still working on 3541540 Activity queuing is broken during synchronization

Results

  • stable version checked in on September 3rd
  • the patch was also tested successfully against the dispatcher thread problem on Unix systems

Next Steps

  • analyzing how this patch fixes the sub entries (in SF)
  • analyzing when the problem with the wrong activity queuing first occurred

Week 16 (CW 36)

Activities

  • analyzing how this patch fixes the sub entries (in SF)
  • analyzing when the problem with the wrong activity queuing first occurred

Results

  • activity queuing was never flawless, but at the beginning of the project it was really hard to exploit this and cause failures
  • 2 related bugs in SF are partly fixed with this patch, one cannot be tested, and one bug seems to be unrelated to #3541540

Problems

  • a problem was detected in the patch for fixing the activity queuing: Activities sent before, and partly while, the project archive is created cause inconsistencies
  • the watchdog cannot detect those inconsistencies

Next Steps

  • work on the problems described above

Week 17 - 18 (CW 37-38)

Activities

  • working on the activity queuing

Results

  • possible solution was found, but not implemented yet

Problems

Next Steps

Week 19 - 23 (CW 39-43)

Activities

  • working on the activity queuing, specifically on the problem that not all Activities need to be sent to the invited person during the invitation process, since they might already be included in the archive

Results

  • problem could be fixed
  • the OutgoingInvitationProcess now has a Set containing all files that have already been packed into the archive
  • this is used by the SarosSession to determine whether an Activity needs to be sent
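
The decision described above might look roughly like the following sketch. The names `PackedFiles` and `needsToBeSent` are hypothetical and not the actual Saros API. The sketch also assumes one particular reading of the rule: a change to a file that has already been packed is missing from the archive and must be sent, while a change to a file that is still waiting to be packed will be picked up by the archive.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: decide during an invitation whether an activity must be sent. */
class InvitationFilterSketch {

    /** Hypothetical stand-in for the set kept by the invitation process. */
    static final class PackedFiles {
        // Thread-safe set, since packing and editing may run concurrently.
        private final Set<String> packed = ConcurrentHashMap.newKeySet();

        /** Records that the file at this path is now part of the archive. */
        void markPacked(String path) {
            packed.add(path);
        }

        boolean isPacked(String path) {
            return packed.contains(path);
        }
    }

    /**
     * Assumed rule: an activity on an already packed file is not reflected
     * in the archive and has to be sent; otherwise the archive covers it.
     */
    static boolean needsToBeSent(String path, PackedFiles packedFiles) {
        return packedFiles.isPacked(path);
    }

    public static void main(String[] args) {
        PackedFiles packedFiles = new PackedFiles();
        packedFiles.markPacked("src/A.java");
        System.out.println(needsToBeSent("src/A.java", packedFiles));
        System.out.println(needsToBeSent("src/B.java", packedFiles));
    }
}
```

A real implementation would also have to handle an edit that arrives while the same file is being packed, which this sketch does not cover.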

Problems

  • Eclipse and Filesystem were out of sync --> changes caused by Activities were not included in the archive

Next Steps

  • check if ReadOnly can be disabled during invitation
  • make a detailed description of the patches and give hints for reviewers

Week 24 (CW 44)

Activities

  • working on the description of the patches
  • collecting information about the solution process
  • analyzing ReadOnly
  • working on an outline

Results

  • review description was sent to DPP-DEVELOP
  • created outline and collected information for some of the chapters
  • implemented a stress test, where multiple files are edited during an invitation process

Problems

  • Inconsistencies can occur when a non-host invites

Next Steps

  • forbid non-host invitation
  • working on the introduction