Monthly Seminar


Tuesday, April 12th, 2016 (Paris, AgroParisTech, room 31)

14h00: Alexandre d'Aspremont (DI, ENS)

Convex relaxations for ordering DNA data

Seriation seeks to reconstruct a linear order over a set of variables, using similarity data between those variables. The problem has direct applications in archaeology and in DNA sequence assembly, for example. We show the equivalence between the seriation problem and a combinatorial quadratic problem over permutations (2-SUM). We propose a convex relaxation of 2-SUM that improves the robustness of solutions when the data are noisy. This relaxation also allows us to include structural constraints on the solution, in order to solve semi-supervised seriation problems.
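As a minimal illustration of the seriation/2-SUM connection described above (a sketch, not the speaker's code): the classical spectral relaxation drops the permutation constraint in 2-SUM down to a unit-norm, zero-mean vector, which yields the Fiedler vector of the graph Laplacian; sorting its entries recovers the hidden order for noiseless similarity data. The synthetic similarity matrix and all variable names below are assumptions for the example.

```python
import numpy as np

# 2-SUM minimizes sum_{ij} A_ij (pi(i) - pi(j))^2 over permutations pi.
# Relaxing pi to a unit-norm vector orthogonal to the all-ones vector gives
# the eigenvector of the second-smallest eigenvalue of the Laplacian
# (the Fiedler vector); its sort order is the spectral seriation estimate.

rng = np.random.default_rng(0)
n = 20
true_pos = rng.permutation(n)            # hidden position of each item

# Similarity decays with distance along the hidden order (a Robinson matrix).
A = np.exp(-np.abs(true_pos[:, None] - true_pos[None, :]) / 3.0)

L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
_, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                  # Fiedler vector

recovered = np.argsort(fiedler)          # items sorted along the relaxed order
# true_pos[recovered] is monotone (up to reversal): the order is recovered.
```

The convex relaxation discussed in the talk replaces this spectral step with one that behaves better under noise and can encode structural (semi-supervised) constraints.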

16h00: Julie Josse (AgroCampus Ouest)

A missing values tour with principal components methods

The problem of missing values has existed since the earliest attempts to exploit data as a source of knowledge, as it lies intrinsically in the process of obtaining, recording, and preparing the data itself. Clearly, (citing Gertrude Mary Cox) ``The best thing to do with missing values is not to have any'', but in a world of ever-growing demand for statistical justification and ever-larger amounts of accessible data, this is rarely the case. Missing values occur for a variety of reasons: machines that fail, survey participants who do not answer certain questions, destroyed or lost data, dead animals, damaged plants, etc. In addition, the problem of missing data is almost ubiquitous for anyone analyzing multi-source data, performing meta-analysis, etc. Missing values are problematic since most statistical methods cannot be applied directly to incomplete data. In this talk, we show how to perform dimensionality reduction methods such as Principal Component Analysis (PCA) with missing values. PCA is a powerful tool to study the similarities between observations and the relationships between variables, and to visualize data. Then, we show how principal component methods can be used to predict (impute) the missing values. These approaches showed excellent performance in recommendation-system problems such as the "Netflix challenge" and consequently caught the attention of the machine learning community. Indeed, the methods can handle large matrices with large amounts of missing entries. We present other popular techniques to impute missing values, discuss the potential pitfalls of the different approaches, and highlight challenges that need to be addressed in the future.
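To make the imputation idea concrete, here is a minimal sketch (an assumption about the general technique, not the speaker's implementation) of iterative PCA imputation: missing entries are initialized with column means, then the matrix is repeatedly approximated by a rank-k truncated SVD and the missing cells are refilled with the low-rank fit until the iterates stabilize, in an EM-style loop. The function name and toy data are invented for the example.

```python
import numpy as np

def iterative_pca_impute(X, rank=2, n_iter=100, tol=1e-6):
    """Impute NaNs in X by alternating rank-k SVD fits and refills."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)   # column-mean start
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # best rank-k fit
        new = np.where(missing, low_rank, X)               # keep observed cells
        done = np.linalg.norm(new - filled) < tol
        filled = new
        if done:
            break
    return filled

# Toy example: a noiseless rank-1 matrix with two entries removed.
rng = np.random.default_rng(1)
u, v = rng.normal(size=8), rng.normal(size=5)
X_full = np.outer(u, v)
X = X_full.copy()
X[0, 0] = np.nan
X[3, 2] = np.nan
X_hat = iterative_pca_impute(X, rank=1)
print(X_hat[0, 0], X_full[0, 0])   # imputed vs. true value
```

Practical implementations (e.g. regularized iterative PCA) also center the data and shrink the singular values to avoid overfitting when the noise level or the amount of missingness is high; this sketch omits both for brevity.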