Software Engineering – Not Entirely Unproblematic


Chapter 2 Choosing the Right Data Model

A data model specifies a particular way of structuring data and the operations through which the data can be used. This chapter describes differences between several data models and database technologies. For my project I had to choose one of them for the repository subsystem, which is depicted at the bottom of the architectural overview in Fig. 1.4 and outlined in Sect. 1.4.2. In order to make a well-founded decision, I examined their functional as well as non-functional characteristics. Many data models, even though widely used, are bound to a particular set of requirements and limitations. Therefore, the popularity of a particular data model could not be the main criterion for such a choice.
Note that the choice of a data model can be independent of the choice of a data representation. The same data can exist in different representations while exhibiting the same structural properties. For example, a concrete data model could be represented as text, as in a programming language, as graphics, as in diagrams, or in other less human-readable forms, e.g. binary code. Dealing with model-based CASE data does not enforce a particular representation, but some kinds of representations are more suitable for certain purposes than others.
Section 2.1 discusses the relational data model. Section 2.2 looks at common properties of object-oriented data modeling. Section 2.3 describes the parsimonious data model, which is less well known. Sections 2.4 and 2.5 discuss the popular UML and XML data models, respectively. Section 2.6 concludes the chapter. Parts of this chapter were published in [143].

The Relational Data Model (RDM)

The RDM was first proposed in 1969 by Edgar Frank Codd [48]. It is a formal model that uses set-theoretical concepts to describe the structure of data. It has since been very successful in the database world and has replaced many older data models used in database systems, such as the hierarchical and the network models. It soon became the predominant model used in industrial-strength database management systems, and remains so today.
The basic notion is very simple: all data is stored in typed, mathematical relations. That is, data is stored in tuples (a1, ..., an) which are elements of a Cartesian product A1 × ... × An. In addition to a type A, each of a tuple’s components has a name a, and together they are called an attribute. The type of a relation is defined by its attributes. A relation itself is defined by its tuples, which contain values for the attributes, and also has a name.
A relation can define a primary key, which is a set of its attributes that is used to identify its tuples. For this to work, all tuples must have unique primary key values. Some relations use natural primary keys, which means that part of the data that is naturally stored in a tuple is used as the primary key. Examples of natural primary keys are the passport number for a relation containing data about persons, or the registration number for cars. Relations can also use artificial primary keys, i.e. artificially generated values that identify each tuple, such as running integer numbers or GUIDs. A third possibility is that of mixed keys, which include artificially generated as well as natural information.
In order to define connections between different tuples, foreign keys are used. A foreign key is a set of attributes of a relation that is used as a reference to a primary key. For each attribute of the primary key, the foreign key contains a corresponding one. This way, tuples can refer to other tuples of a different or the same relation by citing their primary key values. All possible kinds of associations between data elements can be modeled in this manner.
A relational schema is a set of relation types. It forms the basis of a relational database. In general, all foreign keys used in a schema refer to primary keys of relation types that are also part of that schema. That is, a schema usually forms a self-contained data model for a particular problem or domain.

Example

As an illustration and for further discussion I would like to consider the following relational schema S, which models records about persons and bank accounts:
S = {Person(id, name, address), BankAccount(number, ownerid, balance)}.
The first attribute of each relation, i.e. id and number, forms the primary key of the respective relation. While relation Person has an artificial primary key, i.e. one that does not carry any meaning outside the database, relation BankAccount has the account number as a natural primary key. Attribute ownerid of BankAccount forms a foreign key that refers to the Person tuple of the owner of an account. The types of the relations can be given as the Cartesian products that form the supersets of the relations:
Person ⊂ Integer × String × String
BankAccount ⊂ String × Integer × Double
For this example I assume that the attributes have simple primitive data types, such as Integer, Double and String. Relational primitive types are different from the primitive types in most programming languages in that they are flat domains. A flat domain is a very simple complete partial order (CPO) with incomparable elements (i.e. incomparable according to the approximation CPO) and a bottom element ⊥ that approximates all other elements [1], as illustrated in Fig. 2.1. This simply means that in addition to the primitive values, such types also have an element ⊥ that represents the state in which the value of an attribute is unknown, e.g. undefined.
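As a minimal sketch, schema S could be declared in an RDBMS roughly as follows. The example uses Python's built-in sqlite3 module; the choice of SQLite, the column types and the sample rows are assumptions made for illustration and are not taken from the AP1 system. SQL's NULL roughly plays the role of the bottom element ⊥.

```python
import sqlite3

def build_schema_s():
    """Create the example schema S in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # SQLite only checks foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE Person (
            id      INTEGER PRIMARY KEY,            -- artificial primary key
            name    TEXT,
            address TEXT
        );
        CREATE TABLE BankAccount (
            number  TEXT PRIMARY KEY,               -- natural primary key
            ownerid INTEGER REFERENCES Person(id),  -- foreign key, may be NULL
            balance REAL
        );
    """)
    return conn

conn = build_schema_s()
# Sample data (hypothetical).
conn.execute("INSERT INTO Person VALUES (1, 'Ada', '1 Example Road')")
conn.execute("INSERT INTO BankAccount VALUES ('NZ-001', 1, 250.0)")
# An account whose owner is unknown: ownerid is NULL.
conn.execute("INSERT INTO BankAccount VALUES ('NZ-002', NULL, 0.0)")

rows = conn.execute(
    "SELECT number, ownerid FROM BankAccount ORDER BY number"
).fetchall()
print(rows)
```

Python reports the NULL owner of the second account as None.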

Associations and Multiplicities

One of the advantages of the relational data model is that associations, i.e. links between tuples that are defined via foreign key references, can be navigated in both directions. This is because the usual way of formulating queries, relational algebra (e.g. see [84]), establishes connections between foreign and primary keys by joining corresponding tuples together into a new relation. The join operation does not take into account which set of attributes is a primary key and which one is a foreign key. Instead, two relations are combined with a Cartesian product, and then the resulting tuples are filtered with a boolean predicate so that only those tuples remain where the values of primary and foreign key match.
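The join described above can be sketched directly on sets of tuples. The relation contents and the helper name join are hypothetical; the point is only that the product-and-filter construction treats both relations symmetrically, so the same result can be read in either direction.

```python
from itertools import product

# Relations as sets of tuples (sample data is made up).
# Person(id, name, address)
person = {(1, "Ada", "1 Example Road"), (2, "Bob", "2 Sample Street")}
# BankAccount(number, ownerid, balance)
bank_account = {("NZ-001", 1, 250.0), ("NZ-002", 1, 10.0), ("NZ-003", 2, 99.5)}

def join(r, s, predicate):
    """Cartesian product of r and s, filtered by a Boolean predicate."""
    return {a + b for a, b in product(r, s) if predicate(a, b)}

# Keep only combinations where Person.id equals BankAccount.ownerid.
# Resulting tuples: (id, name, address, number, ownerid, balance)
owned = join(person, bank_account, lambda p, a: p[0] == a[1])

# Navigate from a person to the accounts ...
accounts_of_ada = sorted(t[3] for t in owned if t[1] == "Ada")
# ... and from an account back to its owner, using the same joined relation.
owner_of_003 = [t[1] for t in owned if t[3] == "NZ-003"]

print(accounts_of_ada, owner_of_003)
```

A real RDBMS would not materialize the full product; the sketch only mirrors the algebraic definition.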
The RDM has the disadvantage that certain multiplicities of associations between data elements are hard-coded in a relational schema, and cannot be changed without changing the schema’s structure. That is, the topology of a data model is not orthogonal to the concept of multiplicities. This creates a dependency between concerns that should ideally be separated, and results in maintenance problems during the evolution of a database.
For example, consider the association between persons and bank accounts. In the example schema, a bank account is owned by at most one person, i.e. exactly one person if we demand that ownerid must not be ⊥. However, a person can own arbitrarily many bank accounts since many BankAccount tuples can have the same ownerid value. If we wanted to change the multiplicities so that each person can have at most one bank account but a bank account can have arbitrarily many owners, we would have to put a foreign key referencing a BankAccount tuple into relation Person. We would also remove the foreign key ownerid in BankAccount that references a Person. This is shown in the schema S′, where Person now has the foreign key bankaccount:
S′ = {Person(id, name, address, bankaccount), BankAccount(number, balance)}.
If we wanted to change the multiplicities of the association between persons and bank accounts so that a person can have arbitrarily many bank accounts and a bank account arbitrarily many owners (a many-to-many association), then the change would be even more drastic: there would be no foreign keys in the relations Person and BankAccount, but we would have to create a new relation that associates the tuples of the two relations. This is shown in schema S′′:
S′′ = {Person(id, name, address), BankAccount(number, balance), Ownership(personid, accountnumber)}.
Relation Ownership contains a foreign key personid that references a tuple of Person, and a foreign key accountnumber that references a tuple of BankAccount. It is thus possible to join persons with their bank accounts by first joining Person with Ownership, and then joining the result with BankAccount. The primary key of relation Ownership comprises both attributes personid and accountnumber, so that each connection between a person and a bank account can be stored only once.
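The many-to-many schema S′′ and the two-step join can be sketched as follows, again with Python's sqlite3 module. The column types and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE BankAccount (number TEXT PRIMARY KEY, balance REAL);
    CREATE TABLE Ownership (
        personid      INTEGER REFERENCES Person(id),
        accountnumber TEXT    REFERENCES BankAccount(number),
        PRIMARY KEY (personid, accountnumber)  -- each connection stored once
    );
""")
# Sample data (hypothetical): account NZ-001 has two owners,
# and person 1 owns two accounts.
conn.executemany("INSERT INTO Person VALUES (?, ?, ?)",
                 [(1, "Ada", "1 Example Road"), (2, "Bob", "2 Sample Street")])
conn.executemany("INSERT INTO BankAccount VALUES (?, ?)",
                 [("NZ-001", 250.0), ("NZ-002", 10.0)])
conn.executemany("INSERT INTO Ownership VALUES (?, ?)",
                 [(1, "NZ-001"), (2, "NZ-001"), (1, "NZ-002")])

# Join Person with Ownership, then the result with BankAccount.
pairs = conn.execute("""
    SELECT p.name, b.number
    FROM Person p
    JOIN Ownership o   ON o.personid = p.id
    JOIN BankAccount b ON b.number = o.accountnumber
    ORDER BY p.name, b.number
""").fetchall()

# The composite primary key rejects a second copy of the same connection.
try:
    conn.execute("INSERT INTO Ownership VALUES (1, 'NZ-001')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(pairs, duplicate_rejected)
```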


Relational Database Management Systems (RDBMSs)

The terminology used in RDBMSs differs slightly from that used in relational algebra, although the concepts are the same. Relations are called tables, tuples are called rows, and attributes are called fields. Modern RDBMSs extend the basic relational data model with other concepts, such as constraints, triggers, views, stored procedures, and user-defined functions. The standard language for creating, modifying and querying a database is SQL [116], which is more powerful than relational algebra. Furthermore, RDBMSs offer functionality to support the efficient execution of database queries and the prevention of data loss, such as indexes and logs. More about these concepts can be found in [84].
Modern RDBMSs are very mature and offer many advantages. They are very reliable, very efficient and offer advanced features for safety and security, such as transaction processing and role-based access control [190]. They support automatic checking of integrity constraints on the data, and event-based data management with triggers. Most good RDBMSs can be programmed with stored procedures and extended with user-defined functions. They can be accessed over a network, and distributed using database replication techniques. With SQL, access to relational databases is relatively standardized. There are very good free open-source implementations, e.g. Firebird [211].
I implemented the repository of the AP1 system on an RDBMS because of the formal maturity of the relational data model, and the practical maturity of modern RDBMSs. The relational data model reflects essential mathematical concepts, which allow it to define a database in a concise manner. Furthermore, RDBMSs satisfy many of AP1’s requirements, as we will see later on.
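Two of the features mentioned above, integrity checking and transaction processing, can be illustrated with a small sketch. It uses SQLite through Python's sqlite3 module rather than Firebird, purely because it ships with Python; the table layout, the sample rows and the helper try_transaction are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE BankAccount (
        number  TEXT PRIMARY KEY,
        ownerid INTEGER REFERENCES Person(id),
        balance REAL
    );
    INSERT INTO Person VALUES (1, 'Ada');
""")

def try_transaction(statements):
    """Run the statements atomically; return False if a constraint rejects them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for sql in statements:
                conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

ok = try_transaction(["INSERT INTO BankAccount VALUES ('NZ-001', 1, 50.0)"])

# The second statement refers to a non-existent person, so the whole
# transaction is rolled back, including the first (valid) insert.
dangling = try_transaction([
    "INSERT INTO BankAccount VALUES ('NZ-002', 1, 5.0)",
    "INSERT INTO BankAccount VALUES ('NZ-003', 99, 5.0)",
])

count = conn.execute("SELECT COUNT(*) FROM BankAccount").fetchone()[0]
print(ok, dangling, count)
```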

The Object-Oriented Data Model (OODM)

The object-oriented (OO) data model emerged in the context of object-oriented programming (OOP), which itself dates back to the 1960s. Probably the first language to support object-orientation was Simula [168], which was developed by Ole-Johan Dahl and Kristen Nygaard at the Norwegian Computing Center, Oslo. As the name suggests, the Simula language was intended for the simulation of complex systems. It was successfully used, for example, for the simulation of telephone traffic systems, electronic circuits, aircraft surveillance, and neural networks [175]. Unsurprisingly, one of its inventors, Kristen Nygaard, was very involved in the field of operations research.
However, object-orientation was not used much for mainstream software development until the 1980s. Its influence became stronger with languages such as C++ [204], which is an OO extension of the popular C language, and the emergence of GUIs. The use of OO languages for GUI programming was perceived as a good match. In the 1990s OO established itself as the predominant software development paradigm. The Java language, which also makes use of the popular C-style syntax, contributed a lot to its popularity with its virtual machine concept and use on the World Wide Web.
OOP came with considerable hype, as many of its advocates claimed that it “revolutionized” software development. However, there is no clear evidence that object orientation makes software development significantly more efficient, e.g. see [184, 136]. In fact, several studies cast a shadow on its alleged benefits [106, 37, 61].
Basic OODM concepts are classes, objects and inheritance. Classes are product types that contain typed data fields. Objects are values of class types. Objects are identified with object references, which are artificial values that usually simply describe the memory location of an object. Inheritance makes it possible to define a hierarchy on classes: a subclass can be defined as an extension to a superclass, which means that it inherits its fields and can be used in its place – a property also known as the Liskov substitution principle [140]. In that way, common parts of classes can be reused by extracting them into a common superclass.
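The basic concepts can be sketched in a few lines of Python. The class names and fields are hypothetical and only loosely follow the bank account example used earlier; the last two lines show that two objects with equal field values are still distinct objects, since identity comes from the reference rather than from the data.

```python
from dataclasses import dataclass

@dataclass
class Account:
    """A class: a product type with typed data fields."""
    number: str
    balance: float

@dataclass
class SavingsAccount(Account):
    """A subclass: inherits number and balance and adds one field."""
    interest_rate: float

def total_balance(accounts: list[Account]) -> float:
    """Written against the superclass; works for subclass objects as well."""
    return sum(a.balance for a in accounts)

a = Account("NZ-001", 100.0)
s = SavingsAccount("NZ-002", 50.0, 0.03)

# A SavingsAccount can be used wherever an Account is expected.
total = total_balance([a, s])

# Same field values, but a different object with its own reference.
twin = Account("NZ-001", 100.0)
equal_values = (a == twin)
same_object = (a is twin)

print(total, equal_values, same_object)
```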
OOP adds to these rather data-related concepts features for managing executable code, such as methods, method polymorphism and dynamic binding. It should be noted that these concepts are not new. OOP is characterized by the way they are mixed and used, e.g. in the form of OO design patterns [96]. Furthermore, OO languages often support features that are not typically OO, such as features from the functional programming paradigm. Consequently, OOP is not a “pure” concept by itself, i.e. not at all orthogonal to other programming paradigms. It is possible, and not unusual, that programs written in a non-OO language implement OO features by hard-coding them explicitly.
Methods are code routines that are associated with a particular class, and thus are meant to work primarily with the data defined by that class. The information hiding principle in OOP relies primarily on the fact that classes encapsulate data and the code that accesses that data, and that data and code within a class can be shielded from external access. Like fields, methods can also be inherited by subclasses from a superclass, so that functionality common to several classes can be extracted into a common superclass. It is important to note that this characteristic property of OO, the bundling of data and code, is actually not present in all OO languages, e.g. CLOS [95] and Dylan [193] maintain separate hierarchies for methods and classes.
Method polymorphism means that there can be different versions of a method with a particular name. Through overloading it is possible to define several methods with the same name in the same class that are distinguished by their parameter types. Overriding makes it possible to redefine a method of a superclass in one or more of its subclasses. This is used through dynamic binding, also known as dynamic method dispatch: when a method is called on an object, and the method has been defined for different classes through overriding, then the definition that is used is chosen at runtime depending on the actual type of the object.
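Both forms of polymorphism can be sketched in Python, with one caveat: Python has no built-in overloading, so the sketch approximates it with functools.singledispatchmethod, which selects a variant at runtime from the type of the first argument. In statically typed languages such as Java, overloads are instead resolved at compile time. All class and method names are hypothetical.

```python
from functools import singledispatchmethod

class Shape:
    def area(self) -> float:
        return 0.0

    def describe(self) -> str:
        # self.area() is bound dynamically: the definition that runs
        # depends on the actual class of the object.
        return f"{type(self).__name__} with area {self.area():.1f}"

class Square(Shape):
    def __init__(self, side: float):
        self.side = side

    def area(self) -> float:  # overrides Shape.area
        return self.side ** 2

class Formatter:
    """Approximates overloading: one name, variants chosen by parameter type."""

    @singledispatchmethod
    def show(self, value):
        return f"object {value!r}"

    @show.register
    def _(self, value: int):
        return f"integer {value}"

    @show.register
    def _(self, value: str):
        return f"string {value}"

descriptions = [shape.describe() for shape in (Shape(), Square(2))]
formatter = Formatter()
shown = [formatter.show(3), formatter.show("x"), formatter.show(2.5)]

print(descriptions, shown)
```

Note that describe is defined only once, in Shape, yet reports the overridden area for a Square.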

1 Introduction
1.1 Software Engineering – Not Entirely Unproblematic
1.2 Is Software Special?
1.3 CASE-Tools
1.4 A Platform for Model-Based Software Engineering
2 Choosing the Right Data Model
2.1 The Relational Data Model (RDM)
2.2 The Object-Oriented Data Model (OODM)
2.3 The Parsimonious Data Model (PDM)
2.4 The Unified Modeling Language (UML)
2.5 Extensible Markup Language (XML)
2.6 Conclusion
3 The Repository
3.1 Requirements
3.2 Overview
3.3 Using a RDBMS
3.4 Mapping the PDM onto a Static Relational Schema
3.5 Mapping the PDM onto a Dynamic Relational Schema
3.6 Operations
3.7 Reflection
3.8 Data Interchange
3.9 The Repository Client Library
3.10 Related Work
3.11 Conclusion
4 Change Control
4.1 Introduction
4.2 A Fine-Grained Perspective on Relational Data
4.3 The Change Log
4.4 Using the Change Log
4.5 Synchronous Centralized Collaborative Development
4.6 Asynchronous Centralized Collaborative Development
4.7 Decentralized Collaborative Development
4.8 Related Work
4.9 Conclusion
5 Robust Content Creation with Form-Oriented User Interfaces
5.1 Introduction
5.2 The Form-Oriented User Interface Model
5.3 Content Modeling
5.4 Two-Stage Interaction
5.5 Configuration vs. Construction
5.6 Conclusion
6 Reflection as a Principle for Better Usability
6.1 Reflection and HCI
6.2 Approaches for Reflection in User Interfaces
6.3 Examples
6.4 Related Work
6.5 Conclusion
7 The Generic Editor
7.1 Requirements
7.2 Overview
7.3 The Workbench
7.4 Views
7.5 Customizability
7.6 Collaborative Work
7.7 Conclusion
8 Code Generators
8.1 Introduction
8.2 The Genoupe Language
8.3 Generator Type Safety
8.4 The Genoupe Type System
8.5 Integrating Genoupe into the AP1 System
8.6 Related Work
8.7 Conclusion
9 Conclusion
9.1 Achievements
9.2 Future Directions
9.3 Reflections