

App Store Analysis

The app store analysis area studies information about mobile apps mined from app stores. Martin et al. [116] identified seven key sub-fields of app store analysis: API analysis, feature analysis, release engineering, review analysis, security analysis, store ecosystem comparison, and size and effort prediction. We focus on the related work in the two sub-fields relevant to this thesis: review analysis and API analysis. In particular, within the API analysis sub-field, we focus on approaches that study the permissions protecting APIs.

Review Analysis

In this section, we discuss recent approaches that analyze app reviews posted by users on the Google Play Store. Table 2.2 summarizes prior work in the Review Analysis sub-area. Ha et al. [83] manually analyzed user reviews available on the Google Play Store to understand what users write about apps. Performing this task manually becomes infeasible given the large volume of available reviews. Chen et al. [124] present AR-Miner, a tool for mining reviews from the Google Play Store and extracting user feedback. First, AR-Miner filters the reviews that contain information useful for developers to improve their apps. Then, it prioritizes the most informative reviews before presenting their content, using LDA (Latent Dirichlet Allocation) to group reviews that discuss the same topics. Fu et al. [70] propose WisCom, a system that analyzes user reviews and ratings to identify the reasons why users like or dislike apps. Iacob and Harrison [94] propose MARA, a tool for extracting feature requests from online app reviews. First, MARA uses a set of predefined rules to identify sentences in reviews that refer to feature requests. Then, it applies LDA to identify the most common topics among the feature requests. Maalej et al. [109] propose an approach to automatically classify reviews into four categories: bug reports, feature requests, user experiences, and ratings. Similar to these studies, we analyze reviews available in app stores to extract informative feedback. Nevertheless, our work differs from these studies since we consider reviews as an oracle of error-proneness: reviews constitute a trigger to further analyze apps.
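To illustrate the topic-grouping step that AR-Miner and MARA rely on, the following minimal Python sketch clusters a handful of reviews by their dominant LDA topic using scikit-learn. The sample reviews and the number of topics are illustrative assumptions, not the configuration used by those tools.

```python
# Minimal sketch (not the cited authors' implementation) of grouping app
# reviews by topic with LDA. Reviews and topic count are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "app crashes every time I open the camera",
    "please add a dark mode option",
    "crashes on startup after the last update",
    "would love to export my data as csv",
]

# Bag-of-words representation of the reviews.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Fit LDA with an (assumed) small number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one topic distribution per review

# Group each review under its most probable topic.
groups = {}
for review, dist in zip(reviews, doc_topics):
    groups.setdefault(int(dist.argmax()), []).append(review)

terms = vectorizer.get_feature_names_out()
for topic_id, members in groups.items():
    top_terms = [terms[i] for i in lda.components_[topic_id].argsort()[-3:]]
    print(topic_id, top_terms, members)
```

In a real pipeline, the grouping would be preceded by the filtering and prioritization steps described above, and run over thousands of reviews rather than four.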

Android Permission Analysis

The API analysis sub-field studies the API usage extracted from apps. In Android, sensitive APIs are protected by a permission model. More specifically, we focus on the literature that analyzes how permissions are used.
CHABADA [82] proposes an API-based approach to detect apps that misbehave with respect to their descriptions. CHABADA clusters apps with similar descriptions and identifies API usage outliers in each cluster; these outliers point out potential malware. In contrast, we analyze user reviews (not app descriptions) to automatically identify buggy apps (as opposed to malware). Frank et al. [69] propose a probabilistic model to identify permission patterns in Android and Facebook apps. They find that permission patterns differ between high-reputation and low-reputation apps. Barrera et al. [47] study the permission requests made by apps in the different categories of the Google Play Store by mapping apps to categories based on their sets of requested permissions. They show that a small number of Android permissions are used very frequently, while the rest are only used occasionally. Jeon et al. [96] propose a taxonomy that divides official Android permissions into four groups based on the permission behaviors (e.g., access to sensors, access to structured user information). For each category, they propose new fine-grained variants of the permissions.
Chia et al. [54] study Android permissions to identify privacy risks in apps. They analyze the correlation between the number of permissions requested by apps and several signals, such as app popularity and community rating. Our work differs from these previous studies in that we consider permissions as a proxy for bugginess. Our taxonomy also has a different goal: the aim of our classification is to help identify error-sensitive permissions. Moreover, while previous studies focus on official Android permissions, we also analyze Google-defined, vendor-defined, and developer-defined permissions.
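The distinction between permission origins can be made concrete with a small sketch. Assuming a decoded, plain-text AndroidManifest.xml is available (e.g., from the app's source tree or after apktool decoding; compiled APKs store a binary manifest), the following Python snippet lists the requested permissions and roughly classifies them by prefix. The classification rules are illustrative, not the taxonomy used in this thesis.

```python
# Minimal sketch: list requested permissions from a plain-text
# AndroidManifest.xml and classify them coarsely by their name prefix.
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def requested_permissions(manifest_path):
    root = ET.parse(manifest_path).getroot()
    return [elem.get(f"{ANDROID_NS}name")
            for elem in root.iter("uses-permission")]

def classify(permission):
    # Coarse prefix-based classification; real taxonomies (e.g., Jeon et al.)
    # are behavior-based and finer-grained.
    if permission.startswith("android.permission."):
        return "official Android"
    if permission.startswith("com.google."):
        return "Google-defined"
    return "vendor- or developer-defined"

if __name__ == "__main__":
    for perm in requested_permissions("AndroidManifest.xml"):
        print(f"{perm}: {classify(perm)}")
```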

Monitoring User Interactions to Support Bug Reproduction

Previous research has monitored user interactions for testing and bug reproduction purposes in web [88] and desktop applications (e.g., FastFix [138], [139]). In the mobile domain, MonkeyLab [106] is an approach that mines GUI-based models from recorded executions of Android apps. The extracted models can be used to generate actionable scenarios for both natural and unnatural sequences of events. However, in MonkeyLab, apps are exercised by developers in the lab. This thesis aims to synthesize realistic scenarios from user traces collected in the wild. In addition, our approach deals with context information, since context is crucial to reproduce failures in mobile environments.
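As a rough illustration of the kind of context that can accompany recorded interactions, the following Python sketch captures a device-context snapshot over adb. It is an assumption-laden illustration, not the monitoring infrastructure used in this thesis, and it presumes a device reachable via adb.

```python
# Minimal sketch of capturing a device-context snapshot over adb at the
# moment a user interaction is logged.
import subprocess

def adb_shell(*args):
    """Run an `adb shell` command and return its trimmed output."""
    out = subprocess.run(["adb", "shell", *args],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def context_snapshot():
    battery = adb_shell("dumpsys", "battery")
    level = next((line.split(":")[1].strip()
                  for line in battery.splitlines()
                  if line.strip().startswith("level:")), "unknown")
    return {
        "device_model": adb_shell("getprop", "ro.product.model"),
        "android_version": adb_shell("getprop", "ro.build.version.release"),
        "battery_level": level,
    }

if __name__ == "__main__":
    # In a crowd-monitoring setting, a snapshot like this would be attached
    # to each recorded user-interaction event before uploading the trace.
    print(context_snapshot())
```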

Automated UI Testing Tools

First, there are several frameworks that support automated testing of mobile apps. Android provides the tools Espresso [34] and UI Automator [36]. Other popular app testing frameworks are Calabash [51], Robotium [20], and Selendroid [142]. These frameworks enable testing apps without access to the source code, but they require the developer to manually define the test scenarios to execute; thus, they risk leaving parts of the app code unexplored. In addition, there is a broad body of research on automated input generation for mobile apps. These tools automatically explore apps with the aim of discovering faults and maximizing code coverage. They can be categorized into three main classes according to the exploration strategy they implement [57]:

Random testing: These tools generate random UI events (such as clicks and touches) to stress apps and discover faults. The most popular tool in this category is Monkey [35], provided by Android (a minimal invocation sketch follows this list). Dynodroid [113] is another random tool, which generates both UI and system events.

Model-based testing: The second class of tools extracts a GUI model of the app and generates events to traverse the states of the app. The GUI models are traversed following different strategies (e.g., breadth- or depth-first search) with the goal of maximizing code coverage. Examples of tools in this category are MobiGUITAR [31], PUMA [87], and ORBIT [56].

Systematic testing: The third class of tools uses techniques such as symbolic execution and evolutionary algorithms to guide the exploration of uncovered code. Examples of these tools are Sapienz [115], EvoDroid [114], ACTEve [33], A3E [46], and AppDoctor [91].
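The sketch below shows how such a random-testing run with Monkey can be scripted in Python; the package name is a hypothetical placeholder, and the wrapper is an illustration rather than part of any of the cited tools.

```python
# Minimal sketch of driving Android's Monkey for random UI testing.
import subprocess

def run_monkey(package, events=500, seed=42, throttle_ms=100):
    """Fire `events` pseudo-random UI events at `package` on the connected device."""
    cmd = [
        "adb", "shell", "monkey",
        "-p", package,                   # restrict events to this app
        "-s", str(seed),                 # seed, so a failing run can be replayed
        "--throttle", str(throttle_ms),  # delay between events (ms)
        "-v",                            # verbose output
        str(events),                     # number of events to generate
    ]
    return subprocess.run(cmd, capture_output=True, text=True)

if __name__ == "__main__":
    result = run_monkey("com.example.app")  # hypothetical package name
    # Monkey reports crashes in its output if the app fails during the run.
    print(result.stdout)
```

Fixing the seed is what makes a crashing run replayable, which is the main reason random testing is usable for fault discovery despite its simplicity.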
Despite the rich variety of approaches, previous tools lack support for testing apps under different execution contexts (i.e., networks, sensor states, device diversity, etc.). Therefore, they cannot detect context-related bugs, which are inherent to the mobile domain. There exist cloud services to test an app on hundreds of devices simultaneously under different emulated contexts. Caiipa [104] is a cloud-based testing framework that applies a contextual fuzzing approach to test mobile apps over a full range of mobile operating contexts. In particular, it considers three types of contexts: wireless network conditions, device heterogeneity, and sensor input. Although the concepts are generic to any mobile platform, the current implementation of Caiipa is only available for Windows Phone. There are other cloud-based commercial solutions where developers can upload their apps to test them on hundreds of different devices simultaneously, such as Google Cloud Test Lab [80], Xamarin Test Cloud [156], and testdroid [22]. Despite the prolific research in the automated UI testing sub-area, testing approaches cannot guarantee the absence of unexpected behaviors in the wild. This thesis aims to complement existing testing solutions with a post-release approach that helps developers efficiently detect and debug the failures faced by users.


Table of contents:

List of figures
List of tables
I Preface
1 Introduction 
1.1 Motivation
1.2 Problem Statement
1.3 Thesis Goals
1.4 Proposed Solution
1.5 Publications
1.5.1 Publication Details
1.5.2 Awards
1.6 International Research Visits
1.7 Thesis Outline
2 State of the Art 
2.1 App Store Analysis
2.1.1 Review Analysis
2.1.2 Android Permission Analysis
2.2 Debugging
2.2.1 Crash Reporting Systems
2.2.2 Field Failure Reproduction
2.2.3 Automated Patch Generation
2.3 Crowd Monitoring
2.3.1 Monitoring User Interactions to Support Bug Reproduction
2.4 Mobile App Testing
2.4.1 Automated UI Testing Tools
2.4.2 Record and Replay
2.4.3 Performance Testing
2.5 Conclusion
3 The Vision of App Store 2.0 
3.1 App Store 2.0 Overview
3.2 Main Building Blocks
3.2.1 Crowd Monitoring Block
3.2.2 Crowd Leveraging Block
3.3 Conclusions
II Contributions 
4 Monitoring the Crowd
4.1 Types of Bugs in Mobile Apps
4.1.1 App Crashes
4.1.2 UI Jank
4.2 What Information to Monitor from the Crowd?
4.2.1 Monitoring User Feedback
4.2.2 Monitoring App Context
4.2.3 Monitoring App Executions
4.3 Conclusions
5 Leveraging the Crowd in vitro 
5.1 Reporting Risky Apps a priori
5.1.1 Empirical Study of Google Play Store
5.1.2 Analyzing App Permissions
5.1.3 Generating Risk Reports
5.1.4 Implementation Details
5.2 Reporting on Performance Degradations
5.2.1 Aggregating Performance Logs
5.2.2 Identifying Performance Deviations
5.2.3 Generating Performance Reports
5.2.4 Implementation Details
5.3 Conclusions
6 Leveraging the Crowd in vivo 
6.1 Reproducing Crash Scenarios a posteriori
6.1.1 Aggregating Crowdsourced Crash Logs
6.1.2 Identifying Crash Patterns
6.1.3 Generating Reproducible Scenarios
6.1.4 Implementation Details
6.2 Patching Defective Apps in the Wild
6.2.1 Patch strategy 1: Muting unhandled exceptions
6.2.2 Patch strategy 2: Deactivating UI features
6.2.3 Implementation Details
6.3 Conclusions
III Empirical Evaluations 
7 Evaluation of in vitro Approaches 
7.1 Evaluation of Crowd-based Checkers
7.1.1 Empirical Study Design
7.1.2 Dataset
7.1.3 Empirical Study Results
7.1.4 Threats to Validity
7.2 Evaluation of DUNE
7.2.1 Empirical Study Design
7.2.2 Dataset
7.2.3 Empirical Study Results
7.2.4 Threats to Validity
7.3 Conclusion
8 Evaluation of in vivo Approaches 
8.1 Evaluation of MoTiF
8.1.1 Empirical Study Design
8.1.2 Dataset
8.1.3 Empirical Study Results
8.1.4 Threats to Validity
8.2 Evaluation of CrowdSeer
8.2.1 Empirical Study Design
8.2.2 Dataset
8.2.3 Empirical Study Results
8.2.4 Threats to Validity
8.3 Conclusion
IV Final Remarks 
9 Conclusions and Perspectives 
9.1 Contributions
9.2 Perspectives
9.2.1 Short-term Perspectives
9.2.2 Long-term Perspectives
9.3 Final Conclusion
References
