NMF as a pre-processing tool for clustering


Sociological considerations

A high-quality public transportation system allows citizens to minimize their use of personal motorized vehicles to travel across the city.
[Litman, 2016] showed that it also has a positive impact on users' health: it reduces vehicle accidents and pollution emissions while improving passengers' mental health and fitness, as they walk more to access stations and stops than people using their vehicle. Moreover, an efficient network allows disadvantaged people with no car to access more neighborhoods, and thus more services (such as shops or health services), and to improve their lifestyle.
The authors of [Gendron-Carrier et al., 2018] also showed that a decrease in pollution particles is measured when a new subway line opens. In [Zion and Lerner, 2017], mobility patterns are used to understand the different neighborhoods of the city, and in turn the sociology of the city.
(a) Theoretical diagram of the land value in a monocentric city model. The land value decreases as the distance to the city center increases.
(b) Home value to replacement cost ratio in New York City and its surroundings. This is an example of land values that describe this metropolis as monocentric. Image by Issi Romem on BuildZoom [Romem, 2017].
Figure 2.1: Monocentric city model, theoretical diagram and example

Economical considerations

In the monocentric city model mentioned earlier, commute trips by private modes have often been studied. For example, in [Arnott et al., 1993; Anderson and De Palma, 2007], the authors studied congestion at a city's gates and in its parking facilities during peak periods. American cities, on which most of these studies are based, generally have little public transportation infrastructure, but this is not the case for European cities. Unfortunately, we found only a few references dealing with the impact of a developed public transportation network on the economic health of a city.
Some papers [De Palma et al., 2015, 2017] focus on congestion in public transportation for commuters. Both highlight the need for optimal timetables and train capacities to reduce the congestion experienced by commuters. Trains with larger capacities give peak-period passengers more comfort, while an optimal timetable lets travelers who want to avoid crowds arrive at their destination earlier or later, in less crowded trains. The latter spreads congestion over a longer period.
Beyond congestion, [Litman, 2015] studied the differences between cities with a large rail infrastructure, cities with a small rail infrastructure and cities with no rail infrastructure. The author stated that a larger rail infrastructure, and thus a more efficient network, is associated with higher ridership per capita, fewer traffic fatalities and a smaller share of household budgets allocated to transportation, which leaves more of the budget for other goods or services. Together, these results suggest that an efficient transportation network contributes to a healthier and wealthier local economy. Another example is given by [Pang, 2018], who shows that a dense network facilitates the employability of low-skilled workers.
The economic studies mentioned here show that an optimal public transportation network is not only more attractive to citizens, but also appears to be a driver of economic development for cities. Everything described in this thesis so far motivates the detailed study of transport data.

Smart Cities and Urban Data

Traditional studies rely on yearly data. But with the development of digital miniaturization, almost every conceivable type of object now contains a computer that generates huge quantities of data. As [Batty, 2013] puts it, we are now able to understand how cities function on much shorter time scales than before. As seen previously, traditional studies focus on the location of land use and the long-term functioning of cities. With these new ubiquitous sensors, it is now easier to study movement and mobility. In the same way that we call our phones "smart" because they are small computers, we can now speak of smart cities.
This term first appeared in 1994 and is used in a growing number of papers. A full definition of smart cities can be obtained by combining the works of [Dameri and Cocchia, 2013; Nam and Pardo, 2011; Harrison et al., 2010], to name but a few. The goal of a smart city is to offer the best living conditions to its citizens and visitors.
To achieve this, the city needs to be instrumented, that is, able to collect a large amount of data through sensors, meters, kiosks, personal devices, cameras, appliances, smart phones, implanted medical devices and even social networks. The authorities then need to be interconnected, by implementing a platform where these data can be stored and shared among the various city services. Finally, the city needs to become intelligent, by including analytics, modeling, computing and visualization. All of this can then help solve increasingly recurrent urban issues, such as road congestion, noise and air pollution, energy and water consumption and waste treatment.
A smart city is also a sustainable city, since both concepts address similar, mostly environment-related problems. [Lee et al., 2014] gave the example of San Francisco, among other cities. At that time, the city was still defining its smart strategy, but it had already launched its own open data platform called 'DataSF' and had several intelligent analytical tools, based on real-time and integrated transportation services, for real-time prediction or demand-responsive parking pricing.
As both the EU and the United Nations (UN) have set ambitious climate and energy targets for the years to come, [Ahvenniemi et al., 2017] pointed out that we urgently need to find smarter ways to decrease pollution and improve energy efficiency.
(a) Pollution over Paris (France). Photo by Alberto Hernandez on flickr.
(b) Congestion in Gurgaon (India). Photo by Taresh Bhardwaj on flickr.
Figure 2.2: Some issues encountered by large cities
In fact, part of the solution is to use urban computing and statistics, as pointed out in [Zheng et al., 2014a]. The authors also recall that there are several types of urban data:
Geographical data: In Beijing, several studies have used the GPS trajectories of taxicabs. These data served to evaluate the effectiveness of urban planning, detect traffic anomalies and identify the function of each area of the city [Zheng et al., 2011; Pan et al., 2013; Yuan et al., 2012]. In Australia, Bluetooth detectors have been placed across the city of Brisbane, and the resulting data have been spatiotemporally clustered to describe vehicle dynamics in the city [Laharotte et al., 2015].
Traffic data: The authors of [Zhang et al., 2017] forecast crowd traffic in each region of New York City and Beijing. In Pisa, GPS data are used to detect traffic congestion and incidents and to inform other drivers in the area [D’Andrea and Marcelloni, 2017]. Finally, [Castro et al., 2013] indicate that these data are used to analyze three different dynamics: social dynamics, traffic dynamics and operational dynamics.
Mobile Phone Signals: Mobile phones generate a variety of data that are useful for urban planning. For example, geolocation sensors help recommend new places or events to users [Bothorel et al., 2018]. In Singapore, the authors of [Jiang et al., 2017] report that mobile phone call detail records are used to better understand spatial human mobility patterns across the city.
Environmental Monitoring Data: In France, [Abadi et al., 2017] explain that smart meters are quite new for water, but that they already make it possible to understand and forecast water consumption. For air quality, [Zheng et al., 2014b] showed that, in nine Chinese cities, historical and real-time air pollution data can be used to infer air quality in city areas without a monitoring station.
Social Network Data: Social networks offer a large amount of detailed data about their users. For example, [Zheng, 2011] uses users' location histories to recommend new friends to meet and to create communities with shared interests. In Japan, [Lee and Sumiya, 2010] detect unusual events such as festivals with geotagged tweets, whereas in Beijing [Pan et al., 2013] describe traffic anomalies with the WeiBo microblogging platform. More recently, [Atefeh and Khreich, 2015] proposed a survey of techniques for event detection on tweets.
Economy: The authors of [Di Clemente et al., 2018; Louail et al., 2017] used credit card purchases to cluster the population and reveal urban lifestyles, and to analyze shopping mobility practices in order to counterbalance socioeconomic inequalities between neighborhoods.
Energy: In Ireland, households have been clustered according to their consumption behavior, using smart electric meter data [Melzi et al., 2017]. Socio-economic data were then used to analyze the clusters.
Health Care: In [Dzhambov et al., 2018], the authors use mental health data and urban noise data to establish a link between these two phenomena, whereas [Guarnieri and Balmes, 2014] establish a link between asthma and urban air pollution.
Commuting Data: Finally, the data of most interest to us are commuting data, on which there is a large body of literature. In the French territory of Val d’Amboise, intermodal trips are studied, and these data served to make an economic comparison between a bike-and-ride service and a park-and-ride service [Papon et al., 2017]. Several papers deal with bike sharing systems data [Côme and Oukhellou, 2014; Bouveyron et al., 2015] in order to understand the relationships between neighborhood types and the most common mobility patterns, and to assign a function to each region. In [Briand et al., 2017; El Mahrsi et al., 2017], the authors cluster smart card data in order to create groups of passengers with similar temporal behavior and groups of stations with the same type of usage. To identify pickpocket suspects, [Du et al., 2018] detect unusual daily transit records. In [Toqué et al., 2017], the authors use smart card data to forecast travel demand in the short term (15 to 30 minutes) and the long term (1 year) in the area of La Défense in Paris.
There is a large diversity of urban data, and [De Palma and Dantan, 2017] address the challenges of dealing with such variety. Although this is not the topic of this thesis, the accumulation of such personal data raises several ethical problems. These deal mainly with data security [Hardt et al., 2016], data anonymization [Gadouche and Picard, 2017] and the prohibition of use for discriminatory purposes [Abadi et al., 2016; Ji et al., 2014].
In this thesis, conducted at CREST thanks to Transdev funding, we aimed to propose progress on the analysis of certain types of data described


Transdev

Transdev is a French transportation operator operating internationally. The company was created in 2011, initially under the name Veolia Transdev, by merging Veolia Transport (from Veolia) and Transdev (from Caisse des Dépôts et Consignations). At the time of writing, these companies are still the main shareholders of Transdev, but Veolia has announced its intention to sell its shareholding to the Rethmann Group before the end of 2018 [Trompiz and Mazzilli, 2018]. By operating buses, tramways, ferries, taxis, coach lines, trains, shuttles, medical services, school services and autonomous vehicles across 20 countries, Transdev transports 11 million passengers every day.

Table of contents:

1 Introduction (French)
1.1 Contexte et motivations
1.1.1 Considérations sociologiques
1.1.2 Considérations économiques
1.1.3 Villes intelligentes et données urbaines
1.1.4 Transdev
1.2 Résumé substantiel des chapitres
1.2.1 Segmentation
1.2.2 Régression et Prévision
2 Introduction (English)
2.1 Context and motivations
2.1.1 Sociological considerations
2.1.2 Economical considerations
2.1.3 Smart Cities and Urban Data
2.1.4 Transdev
2.2 Summary of the chapters
2.2.1 Clustering
2.2.2 Regression and Forecasting
3 NMF as a pre-processing tool for clustering
3.1 Introduction
3.2 The data
3.3 Results obtained by EM
3.4 Results obtained by NMF
3.5 Conclusion
4 Dimension Reduction and Clustering with NMF-EM
4.1 Introduction
4.2 Factorization of mixture parameters and the NMF-EM algorithm
4.2.1 Factorization of mixture parameters
4.2.2 The NMF-EM algorithm
4.2.3 The NMF-EM algorithm for mixture of multinomials
4.2.4 Discussion on the choice of H and K
4.3 Simulation study
4.4 Application to ticketing data
4.4.1 Description of the data
4.4.2 Passenger profile clustering
4.4.3 Stations profile clustering
4.4.4 Passengers profile clustering on another network
4.5 Conclusion
5 Forecasting and anomaly detection
5.1 Introduction
5.2 Data presentation
5.3 Modelization
5.3.1 Linear model
5.3.2 Generalized Additive Model
5.3.3 Random Forest
5.4 Confidence intervals
5.5 Application: impact of the 2018 SNCF social strike on one network
5.5.1 Introduction
5.5.2 Model selection
5.5.3 Results
5.6 Conclusion