Distances, Neighborhoods, or Dimensions? Projection Literacy for the Analysis of Multivariate Data

Projections are among the most common methods for presenting high-dimensional datasets on a 2D display. While these techniques provide overviews that highlight relations between observations, their output unavoidably changes with the chosen configuration. Hence, the same projection technique can depict multiple compositions of the same dataset, depending on its parameter settings. Furthermore, projection techniques differ in their underlying assumptions and computation mechanisms, favoring the preservation of either distances, neighborhoods, or dimensions. This article aims to shed light on the similarities and differences of a multitude of projection techniques and on the influence of features and parameters on data representations, and to give a data-driven intuition of how projections relate to each other. We postulate that, depending on the task and data, a different choice of projection technique, or a combination of several, might lead to a more effective view.

1 Introduction

"t-SNE is the best projection technique available"—this might be a typical sentence from someone who understands projection techniques only on a superficial level. State-of-the-art techniques—such as the latest t-SNE—produce "nice" results promptly. However, that so many people think t-SNE is synonymous for the state-of-the-art emphasizes two aspects: (1) people don't know much about projection techniques: linear vs. non-linear, global vs. local, and (2) they do not know how to interpret them. This expectation is in line with the survey of Sedlmair et al., who found that "most users who used DR [dimensionality reduction, i.e., projection techniques,] for data analysis struggled or failed in their attempt (of 22, 13 struggled and 6 failed). These numbers underline the need for further usage-centered DR development and research." For example, sometimes users are not able to distinguish actual patterns from artifacts introduced by a dimensionality reduction technique, or even lack an overall conceptual model of the dimensionality reduction technique they apply.

In this blog post, we attempt to enable you to gain a better understanding of how to interpret projections of high-dimensional datasets. To this end, we guide you through selecting a dataset and some techniques. Both are shown in interactive visualizations that allow you to take a closer look at any time. Do not hesitate to go back and forth to find out how swapping the dataset or technique, or changing a parameter, influences what you see. Grasping how to interpret projections is an interactive "learning by doing" exercise.

This article guides readers through different aspects of "Projection Literacy", based on a selected dataset. We argue that the choice of the appropriate projection technique and its parameters depends on the data and the individual user tasks. Each technique emphasizes different relations in the data and should therefore be interpreted with these considerations in mind. The three most prominent features reflected in projections are distances, neighborhoods, and dimensions.

2 Datasets

For this blog post we selected a number of well-known datasets alongside some very simple synthetic examples. The simple examples will especially help you to better understand how different projection techniques represent structures and distances. Please note that we display only 500 observations per dataset to achieve interactive performance. All datasets are available as part of scikit-learn A, so please help yourself and dig deeper.
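As a rough, minimal sketch of how you might recreate such a setup yourself with scikit-learn, the following snippet loads the S curve and a ten-blob dataset and keeps 500 observations of each; the random seeds and the number of blob dimensions are assumptions rather than the exact settings used here.

```python
from sklearn.datasets import make_blobs, make_s_curve

# Synthetic S curve: 3 dimensions, main task "manifold learning" (M)
X_s, t_s = make_s_curve(n_samples=500, random_state=0)

# Ten Gaussian blobs: the number of dimensions (here 10) is an assumption
X_b, y_b = make_blobs(n_samples=500, centers=10, n_features=10, random_state=0)

print(X_s.shape, X_b.shape)  # (500, 3) (500, 10)
```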

To get started, we selected the well-known S curve dataset from the list below for you. In the list, each entry shows one dataset by its name and a preview based on a projection using Principal Component Analysis (PCA). Below each entry you can find the number of observations as well as the number of dimensions in the full dataset. We also indicate whether the dataset originates from a real-world source Real or was generated synthetically Synt. Further, each dataset is assigned one of three main tasks: regression R, classification C, or manifold learning M. Finally, if available, there are links to external references such as the original academic paper A, the scikit-learn API S, and Wikipedia W.


2.1 Details on

Below you can find a matrix displaying the dataset you selected, with all dimensions as rows and the currently selected observations as columns. The shade of red in each cell denotes the value of one observation for a dimension. Higher values appear in a stronger red; the exact numbers are not of interest for our purposes. If you chose a dataset for classification C, class labels are denoted by pastel colors in the background of the column labels.

You can click on a column label to select one observation and its ten nearest neighbors in the high-dimensional space. If you hover over a column label, the distances from the hovered observation to the other observations are shown in blue in the projections below. The matrix will stick to the top of your screen for easy access once you reach the projections. If you hover over a row label, the values of all observations on the hovered dimension are shown in red. Finally, hovering over the "ID / Class" label shows class labels, values on the dependent dimension, or positions along the main direction of the manifold, depending on whether the dataset's main task is classification C, regression R, or manifold learning M.
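For reference, here is a minimal sketch of the nearest-neighbor lookup behind this selection, assuming Euclidean distances in the high-dimensional space and using the S curve dataset as an example:

```python
from sklearn.datasets import make_s_curve
from sklearn.neighbors import NearestNeighbors

X, _ = make_s_curve(n_samples=500, random_state=0)

# 11 neighbors = the selected observation itself plus its ten nearest neighbors
nn = NearestNeighbors(n_neighbors=11).fit(X)
distances, indices = nn.kneighbors(X[10:11])   # e.g., the observation labeled "o10"
print(indices[0][1:])                          # high-dimensional nearest neighbors of o10
```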

Dimensions (rows) in the matrix are sorted by variance, with the most diverse dimension at the top. Observations (columns) are sorted by . If you sort them by single-linkage clustering, observations are clustered in the high-dimensional space and the cluster dendrogram is shown above the matrix. For some datasets—like the ten Gaussian blobs—you can see that clustering in the high-dimensional space clearly reveals the ten clusters, while some projection techniques—e.g., PCA—may not allow you to distinguish them individually.
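Both orderings can be reproduced with a few lines of NumPy and SciPy; this is a hedged sketch of the idea, not the article's exact implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, leaves_list, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=10, n_features=10, random_state=0)

# Rows: dimensions sorted by variance, most diverse first
dim_order = np.argsort(X.var(axis=0))[::-1]

# Columns: observations ordered by a single-linkage clustering in high-dimensional space
Z = linkage(X, method="single", metric="euclidean")
obs_order = leaves_list(Z)     # leaf order underlying the dendrogram above the matrix
# dendrogram(Z)                # would draw the cluster dendrogram
```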

3 Dimensionality Reduction / Projection Techniques

There are hundreds of techniques, developed by mathematicians, statisticians, visualization researchers, and others. Technically, there are some differences, for example, between projections and embeddings, and the broader set of dimensionality reduction techniques. To keep things simple, we use the term "projection" for all kinds of techniques. Further, we only consider unsupervised techniques—i.e., the projection technique does not know about class labels—which do not include a clustering step. Hence, you will not find techniques like Linear Discriminant Analysis (LDA A A W) and Self-Organizing Maps (SOM A W) here. Our selection focuses on a diverse set of commonly used techniques. We use the implementations of scikit-learn, so you can easily try things out yourself in case you want to take a closer look.

Before diving into projections, you need to understand a central design decision. Projection techniques transform high-dimensional data to a lower-dimensional space while preserving its main structure. Typically, the data is transformed to a two-dimensional space and visualized as a scatter plot as a means to analyze and understand the data. With regard to how data is transformed from the higher-dimensional to the lower-dimensional space, we distinguish between two categories: linear and non-linear projection techniques. Linear projection techniques compute the low-dimensional coordinates as a linear transformation of the original data dimensions. Proximity between data points indicates similarity: the more similar data points are, the closer they are located to each other, and vice versa. This is why linear projection techniques are also known as global techniques. In contrast, non-linear projection techniques, also known as local projection techniques, aim at preserving the local neighborhoods in the data. Here, proximity highlights differences and coherences between observations and should not be equated with similarity.

Linear techniques: PCA, SparsePCA, Classical MDS, Factor Analysis, FastICA. Non-linear techniques: (Hessian / Modified) LLE, t-SNE, Isomap, Kernel PCA, LTSA, Laplacian Eigenmaps.
Figure 1: Projection techniques can be broadly separated into two groups: linear/global and non-linear/local.

For example, the technique t-SNE is commonly used for identifying clusters in the data. Yet, depending on the choice of parameters, the resulting clusters can differ in size as well as in their proximity to each other. Either way, there is a clear separation of clusters, and we can identify observations as belonging to different or to coherent groups. The so-called S curve dataset is often used to showcase the described difference between linear (global) and non-linear (local) projection techniques (see Figure 1). In the case of linear techniques, global characteristics are preserved and the shape of the ‘S’ is represented even in two dimensions. In the case of non-linear techniques, local characteristics—i.e., the local neighborhoods—count, and the ‘S’ is typically rolled out in two dimensions.
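The following sketch contrasts a linear technique with two non-linear ones on the S curve using scikit-learn; the parameter values are assumptions, not the settings used in the interactive figures:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE

X, t = make_s_curve(n_samples=500, random_state=42)

projections = {
    "PCA (linear/global)": PCA(n_components=2),
    "Isomap (non-linear/local)": Isomap(n_neighbors=10, n_components=2),
    "t-SNE (non-linear/local)": TSNE(n_components=2, perplexity=30, random_state=42),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, technique) in zip(axes, projections.items()):
    Y = technique.fit_transform(X)
    ax.scatter(Y[:, 0], Y[:, 1], c=t, s=10)   # color = position along the manifold
    ax.set_title(name)
plt.show()
```

PCA keeps the global ‘S’ silhouette, while the non-linear techniques tend to unroll the manifold so that local neighborhoods stay intact.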

3.1 Choose some projection techniques

Below you find our set of projection techniques. Each technique is presented on a card featuring the technique's name and a projection of the selected dataset. Feel free to brush (hold down the mouse button and drag) some observations in a scatter plot to see how they are distributed by the other techniques. Click on a technique's name to show details in the following section. Below the scatter plot you can select the techniques to be displayed in Section 4.3 (up to 5 techniques). In case a technique has parameters, you may click the P symbol to get a more detailed description and set these parameters. Finally, there are some links to external references, such as the original academic publication A, the scikit-learn API S, and Wikipedia W.

If you want to learn more about available techniques, van der Maaten, Postma and van den Herik as well as Gisbrecht and Hammer provide overviews.

3.2 Details on

3.3 Set parameters of

You can set parameters using the slider(s) below. The table shows you how much changing the parameters affects the projection visually. On the one hand, you can compare the projections directly. On the other hand, brown borders show the calculated difference using a measure inspired by those used by Lehmann and Theisel and by Cutura et al. The more brown the border between two projections, the larger their difference, and hence the higher the sensitivity to changing the parameter. Please note that the calculated difference tries to be invariant to scaling, mirroring, shearing, and rotation—i.e., affine transformations. As a result, especially outliers may drastically change the visual appearance while the measured difference remains low.
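If you want to experiment with such a comparison yourself, a rough numeric stand-in is the Procrustes disparity between projections computed with different parameter settings. Note that SciPy's Procrustes analysis only factors out translation, uniform scaling, rotation, and reflection—not shearing—so treat it as a loose approximation of the measure described above; the parameter range below is likewise an assumption:

```python
from scipy.spatial import procrustes
from sklearn.datasets import make_s_curve
from sklearn.manifold import TSNE

X, _ = make_s_curve(n_samples=500, random_state=42)

perplexities = [5, 15, 30, 50]                 # assumed parameter range
Ys = [TSNE(perplexity=p, random_state=42).fit_transform(X) for p in perplexities]

for p0, p1, Y0, Y1 in zip(perplexities, perplexities[1:], Ys, Ys[1:]):
    _, _, disparity = procrustes(Y0, Y1)       # lower = more similar after alignment
    print(f"perplexity {p0} vs {p1}: disparity {disparity:.3f}")
```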


Evaluating parameter influence: This table shows you projections by one technique with different parameter settings. The brown borders represent calculated differences between adjacent projections. The stronger the brown color, the more the two projections differ, and hence the more sensitive the technique is to choosing parameters in that region.
4 Visual Assessment of Projections

Planar projections map high-dimensional data to a lower-dimensional space and try to preserve the main characteristics of the data. However, depending on the projection technique, different characteristics are preserved.

A nice analogy is the shadow a tree casts in the sunlight. If the sun is directly above the tree, its shadow does not resemble a tree too much. The shadow looks more like a ball of wool than a tree—we can argue that certain tree features were not preserved in this particular projection. However, if the angle of the sun changes, other features, such as the tree trunk or certain branches, may become visible.

This behavior is very similar to planar projections. Depending on which projection technique you choose, the results might look very different. For example, when choosing a kernel function for Kernel PCA, the user has plenty of options. With a linear kernel—i.e., classical PCA—the shape of the S curve is well represented; polynomial kernels introduce distortion to the ‘S’, but still allow for grasping the general shape. With a cosine kernel, on the other hand, the ‘S’ structure is not present in the planar projection at all.
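A minimal sketch of this Kernel PCA comparison with scikit-learn (using default kernel parameters, which are an assumption here):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve
from sklearn.decomposition import KernelPCA

X, t = make_s_curve(n_samples=500, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, kernel in zip(axes, ["linear", "poly", "cosine"]):
    Y = KernelPCA(n_components=2, kernel=kernel).fit_transform(X)
    ax.scatter(Y[:, 0], Y[:, 1], c=t, s=10)   # color = position along the 'S'
    ax.set_title(f"Kernel PCA, {kernel} kernel")
plt.show()
```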

While in these simple cases high-quality representations of the overall structure of the three-dimensional dataset can be achieved, in a more general high-dimensional setting this will not be possible most of the time. In our research, we identified three main characteristics that steer a projection:

  1. Distances between observations
  2. Neighborhoods of observations
  3. Representation of original dimensions

Furthermore, and most importantly, users should ask themselves the following question before deep-diving into any projection result: "Which kind of pattern do I search for / am I interested in?" Depending on this question, one has to find a trade-off between distances, neighborhoods, and the relevance of dimensions across observations. There can be no single projection technique that is superior across all application scenarios and tasks, as techniques have to trade off between the aforementioned three characteristics. For example, in a high-dimensional dataset that consists of n observations, (n^2-n)/2 unique pairwise distances exist, but in a two-dimensional projection only 2n-3 of these distances can be enforced exactly. For the 500 observations we display per dataset, that is 124,750 unique distances, of which at most 997 can be reproduced. Likewise, a planar projection offers only two position dimensions—as in a regular scatter plot—to place all observations, regardless of how many input dimensions there are.

In order to assess whether the chosen projection reflects the distances, the neighborhoods, or the relevance of dimensions, we developed a visual representation called "feature map". Instead of color-encoding the observation itself, we chose to encode the area closest to each observation, as given by a Voronoi tessellation. This area is particularly interesting because it properly reflects which characteristics are encoded and how widely they spread.

4.1 Assessment of distances

For the first characteristic, we encode distances as proposed by Aupetit in 2007. The color of each Voronoi cell encodes the high-dimensional distance of its observation to the selected observation. If a projection perfectly reflected its high-dimensional counterpart, the cells would form a perfectly aligned color gradient across all observations.
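As a hedged sketch of this kind of distance map, the snippet below colors each location in a PCA projection by the high-dimensional distance of the nearest projected observation to one selected focus observation; the Voronoi cells are approximated with a dense grid and a nearest-neighbor lookup rather than computed exactly:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA

X, _ = make_s_curve(n_samples=500, random_state=42)
Y = PCA(n_components=2).fit_transform(X)

focus = 10                                           # e.g., observation "o10"
hd_dist = np.linalg.norm(X - X[focus], axis=1)       # high-dimensional distances to the focus

# Dense grid over the projection; each grid point takes the color of its Voronoi cell
xs = np.linspace(Y[:, 0].min(), Y[:, 0].max(), 300)
ys = np.linspace(Y[:, 1].min(), Y[:, 1].max(), 300)
grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
_, nearest = cKDTree(Y).query(grid)                  # nearest projected observation per grid point

plt.imshow(hd_dist[nearest].reshape(300, 300), origin="lower",
           extent=[xs[0], xs[-1], ys[0], ys[-1]], cmap="viridis")
plt.scatter(Y[:, 0], Y[:, 1], c="white", s=5)        # projected observations
plt.scatter(*Y[focus], c="blue", s=40)               # the selected focus observation
plt.show()
```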

The top of the browser window shows the matrix of observations. Return to Section 3.1 (the overview of projections) and hover over any observation (column) label—e.g., "o10". When trying out different observations, you will see that major differences in quality may occur. With all the aforementioned projection techniques, it is challenging to tell how well distances are preserved before exploring this visual representation.

This is where knowledge about the aim of a projection technique comes into play. Using a linear technique, distances have meaning; using MDS, for example, the visual distance will typically express the Euclidean distance between observations, aggregated over all dimensions. In contrast, a non-linear technique such as t-SNE emphasizes neighborhoods, and visual distances have no specific meaning. This is also why the visual representation of t-SNE results effectively reflects classification results: classes are grouped, but there is no "real", meaningful distance between them.

While the technique introduced by Aupetit that we feature in this blog post is sufficient as a starting point to learn how to interpret distances across projections of high-dimensional data, several other methods exist. For example, Seifert, Sabol and Kienreich introduced Stress Maps. Further techniques for inspecting the quality of projected distances were proposed by Schreck, von Landesberger and Bremm, by Stahnke et al., as well as by Heulot, Fekete and Aupetit.

4.2 Assessment of neighborhoods

Click on an observation label in the matrix, or directly on the observation in one of the scatter plots in Section 3.2, to see where the nearest neighbors of the selected observation are located. The selected observation is highlighted in blue and its ten nearest neighbors in gray. The remaining observations are shown in a very light gray.

While nearest neighbors typically appear as close together as expected, they may sometimes spread all over the projection. Similar to distances, it is not possible to guarantee the preservation of neighborhoods, although preserving neighborhoods is simpler from a mathematical point of view—i.e., if all distances are preserved, then all neighborhoods are preserved as well, but not vice versa.
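If you want a single number alongside the visual inspection, scikit-learn's trustworthiness score quantifies how well local neighborhoods are preserved (1.0 means fully preserved). This is a complementary metric, not part of the visualization described here; the neighborhood size is an assumption:

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X, _ = make_s_curve(n_samples=500, random_state=42)

for name, Y in [("PCA", PCA(n_components=2).fit_transform(X)),
                ("t-SNE", TSNE(random_state=42).fit_transform(X))]:
    # Fraction of projected neighbors that are also neighbors in high-dimensional space
    print(name, trustworthiness(X, Y, n_neighbors=10))
```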

As mentioned before, the preservation of neighborhoods is of great value for tasks that do not require semantically meaningful distances, such as classification. One must be aware of the fact that distances between classes and densities within such groups have no particular meaning. Analogous to the Voronoi-based visualization of distances, we limit the depicted representations to those helping you get an initial idea of which projection technique to choose for which characteristic. Additional techniques for investigating neighborhoods were introduced by Lespinats and Aupetit, by Martins et al., as well as by Martins, Minghim and Telea.

4.3 Assessment of dimensions

The curse of dimensionality typically impairs the ability of a projection to visually carve out significant characteristics of a dimension. The more dimensions are taken into account, the less expressive the results of a planar projection are. Yet, well-represented original dimensions can ease the interpretation of a projection result. Figuring out which dimensions allow for a reasonable interpretation in the projection is key to avoiding misinterpretations. At the same time, the number of original dimensions that may enable approximate interpretations in a projection is not limited to two: correlations and dependencies in the high-dimensional space can lead to more than two interpretably represented input dimensions.

The "feature map" table below enables you to effortlessly compare the representation of the input dimensions. It is generated automatically based on the Voronoi Tessellation, just like the distance maps above. Instead of encoding distances to one focus observation, each cell color encodes the value of its observation along an input dimension. More colorful cells represent higher values within input dimensions. Therefore, the map shows which value along each dimension one might expect at a location based on the nearest neighbor.

Brushing observations in one of the scatter plots above highlights only those cells in the table below that lie within the range spanned by the largest and smallest values of the brushed observations. In other words, if a new observation is "similar" to the selected set of observations, you would expect it to be placed in the colored regions. Caution: this description is purely hypothetical, because some projection techniques produce very different results if a new observation is added to the projection.

Table 1: Values of dimensions (rows) are mapped to shades of red; the more red, the higher the value in the respective dimension. Colors are relative to the minimum and maximum of each dimension. Exact values are not needed—projections do not offer that kind of precision anyway—so just compare the distributions of features.

Again, there are more techniques for assessing the representation of dimensions, for example those introduced by da Silva et al., Faust and Scheidegger, Cavallo and Demiralp (2017), and Cavallo and Demiralp (2018).

4.4 Assessment using additional information

Often the main interest is not in the dimensions that are available, but in an additional dimension to be predicted. In a classification task the main interest is in the class labels of observations; for regression it is in the values of a dependent dimension. As these outcome dimensions are not used by unsupervised projection techniques, looking at them may also provide hints on the quality of projections. Hover over the "ID / Class" label at the top left of the matrix to show—depending on the main task of the selected dataset—the observations' class labels (pastel colors), their values along the dependent dimension for regression (grayscale), or their positions along the main direction of the synthetic manifold (green). Keep in mind, though, that, as Aupetit notes, evaluating projection quality based on class separation alone may not be a good idea.

5 Lessons learned

As you have seen above, it is not straightforward to interpret projections. Be sure about what you expect from the projection—showing patterns such as classes, showing similarities, or supporting subspace analysis. Then, the design decisions taken by projection techniques—e.g., linear vs. non-linear—are a central point to consider: make sure that your task demands align with the properties of your preferred projection technique. It is also always good to start with a simple analysis of the data distribution to learn what to expect from the data. We are well aware that the datasets presented in this blog post are rather simple, so whichever technique worked well here may not be your favorite choice in a practical application. Similarly, a technique that worked well for your colleague—or even for yourself last time—may not work for you now, with different data and other tasks.

Hopefully, you are now better able to interpret the different projections you can generate from your data. The so-called curse of dimensionality is responsible for many issues that come with projection techniques. A thorough analysis of interesting observations prior to projection can lead to better results. So, to date our best advice is: have a look at your data, try multiple projection techniques with various parameter settings, and then choose the one that provides the best results relative to your task demands. However, this blog post is just the tip of the iceberg, or your first step on a long hike towards generating good projections. For example, Bertini, Tatu and Keim looked at quality metrics for the visualization of high-dimensional data. Nonetheless, to the best of our knowledge there are no automated recommendation techniques working across tasks. In the end, it is still up to you to find a projection that provides what you need. There is one catch, though: try not to overfit your projections to your expectations.

For more guidance on choosing projection techniques you may consult the works of Sedlmair, Munzner and Tory, and of Etemadpour et al. Cutura et al. recently presented VisCoDeR for choosing projection techniques and setting parameters. Further, Sacha et al. provide an overview of options for interaction with projections.

6 Conclusion

Taken together: always be aware that you need to know which technique was used to produce a projection in order to arrive at reasonable interpretations, as a projection rarely shows the full picture of a dataset and its interpretation depends on the technique used for its construction. Projecting to two dimensions almost always means losing information. The question is whether or not the relevant features of a dataset are preserved by the projection technique. As a result, you need to choose one or more techniques that suit your task, and you should not expect a one-size-fits-all technique or an algorithm that works out of the box. Much of the power of projection techniques depends on a good match between data and technique as well as—in most cases—a finely tuned set of parameters. Turning your wrap into a pizza does not come for free.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700381.

Credits

The interactive visualizations in this article build on D3.js and Python's Scikit-learn.

Contributions

The authors jointly structured, wrote, and edited this text. Dirk Streeb implemented the interactive visualizations.

Reviewers

Some text with links describing who reviewed the article.

Imprint, Contact, Data protection

This page is hosted by the Data Analysis and Visualization Group at the University of Konstanz: Imprint, Data protection. In case you have any requests, please contact dsgvo@dbvis.inf.uni-konstanz.de. Please also note that this page includes scripts from distill.pub, so their data protection terms may additionally apply.