Improving the Effectiveness and Efficiency of Stochastic Neighbour Embedding with Isolation Kernel

This paper presents a new insight into improving the performance of Stochastic Neighbour Embedding (t-SNE) by using the Isolation kernel instead of the Gaussian kernel. The Isolation kernel outperforms the Gaussian kernel in two aspects. First, the use of the Isolation kernel in t-SNE overcomes the drawback of misrepresenting some structures in the data, which often occurs when the Gaussian kernel is applied in t-SNE. This is because the Gaussian kernel determines each local bandwidth based on one local point only, while the Isolation kernel is derived directly from the data based on space partitioning. Second, the use of the Isolation kernel yields a more efficient similarity computation because the data-dependent Isolation kernel has only one parameter that needs to be tuned. In contrast, the use of the data-independent Gaussian kernel increases the computational cost because n bandwidths must be determined for a dataset of n points. As the root cause of these deficiencies in t-SNE is the Gaussian kernel, we show that simply replacing the Gaussian kernel with the Isolation kernel in t-SNE significantly improves the quality of the final visualisation output (without creating misrepresented structures) and removes one key obstacle that prevents t-SNE from processing large datasets. Moreover, the Isolation kernel enables t-SNE to deal with large-scale datasets in less runtime without trading off accuracy, unlike existing methods for speeding up t-SNE.

1 Introduction and Motivation

t-SNE [18] has been a successful and popular dimensionality reduction method for visualisation. It aims to project high-dimensional datasets into lower-dimensional spaces while preserving the similarities between data points, as measured by the KL divergence. The original SNE [8] employs a Gaussian kernel to measure similarity in both the high- and low-dimensional spaces. t-SNE replaces the Gaussian kernel with the distance-based similarity (1 + d_ij²)⁻¹ (where d_ij is the distance between instances i and j) in the low-dimensional space, while retaining the Gaussian kernel for the high-dimensional space.

When using the Gaussian kernel, t-SNE has to fine-tune the bandwidth of the Gaussian kernel centred at each point in the given dataset because the Gaussian kernel is independent of the data distribution. In other words, t-SNE must determine n bandwidths for a dataset of n points.

If we look into the bandwidth determination process, it relies on a heuristic search with a single global parameter called perplexity, such that the Shannon entropy is fixed for the probability distributions at all points when adapting each bandwidth to the local density of the dataset. As the perplexity can be interpreted as a smooth measure of the effective number of neighbours [18], the method can be interpreted as using a user-specified number of nearest neighbours (aka kNN) in order to determine the n bandwidths (more on this point in the discussion section). While there is a single external parameter, perplexity, a bandwidth setting must be optimised for each data point internally.

This becomes the first obstacle in dealing with large datasets due to the massive computational cost of the bandwidth search process. In addition, the point-based bandwidth is also the cause of misrepresentation in the high-dimensional space under some conditions.
To date, the common practice is still to use the Gaussian kernel in t-SNE on high-dimensional datasets. However, sound and workable solutions to the drawbacks mentioned above have not been proposed. The contributions of this paper are:

(1) Uncovering two deficiencies due to the use of the Gaussian kernel. First, the point-based-bandwidth Gaussian kernel often creates misrepresented structure(s) which do not exist in the high-dimensional space under some conditions. Second, the use of the data-independent kernel requires t-SNE to determine n bandwidths for a dataset of n points, despite the fact that a user needs to set one parameter only. This becomes one key obstacle in dealing with large datasets.

(2) Revealing the advantages of using a partition-based data-dependent kernel in t-SNE. First, this kernel represents the true structure(s) in the high-dimensional space under the same condition mentioned above. Second, the data-dependent similarity is set with a single parameter only; this allows it to be computed more efficiently. This enables t-SNE to deal with large-scale datasets without trading off accuracy for faster runtime and without resorting to approximation methods.

(3) Proposing an improvement to t-SNE by simply replacing the data-independent kernel with a data-dependent kernel, leaving the rest of the procedure unchanged.

(4) Verifying the effectiveness and efficiency of the data-dependent kernel in t-SNE.

The adopted data-dependent kernel is the Isolation kernel [24,20], and the experimental results show that using the Isolation kernel improves the performance of t-SNE and resolves the issues caused by the Gaussian kernel in t-SNE.

The rest of the paper is organised as follows. The current t-SNE and related work are described in Section 2. The deficiencies of using the Gaussian kernel are presented in Section 3. In Section 4, we characterise the selected Isolation kernel, and Section 5 presents the empirical evaluation of using the Isolation kernel in t-SNE. Discussion and conclusions are given in the last two sections.

2 Basics of t-SNE
Given a dataset D = {x_1, ..., x_n} in R^d, t-SNE aims to map D ⊂ R^d to D' ⊂ R^{d'}, where d' ≪ d, such that the similarities between points are preserved as much as possible from the high-dimensional space to the low-dimensional space. As t-SNE is mainly used as a visualisation tool, d' = 2 usually.
The similarity between a pair of points x_i, x_j (resp. x'_i, x'_j) in the high (resp. low)-dimensional space is measured by a probability p_ij (resp. p'_ij) that point x_i picks x_j as its neighbour. The probability distributions are computed based on distance measures between the points in the respective space. The aim of this family of projection methods is to project the points from x to x' in such a way that the probability distributions p_ij and p'_ij are as similar as possible.
The similarity between x_i and x_j in the high-dimensional space is measured using a Gaussian kernel:

K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ_i²)).    (1)

t-SNE computes the conditional probability p_{j|i} that x_i would pick x_j as its neighbour as:

p_{j|i} = K(x_i, x_j) / Σ_{k≠i} K(x_i, x_k).    (2)

The probability p_ij, a symmetric version of p_{j|i}, is computed as:

p_ij = (p_{j|i} + p_{i|j}) / (2n).    (3)

t-SNE performs a binary search for the best value of σ_i such that the perplexity of the conditional distribution equals a fixed perplexity specified by the user. Therefore, the bandwidth is adapted to the density of the data, i.e., small (large) values of σ_i are used in dense (sparse) regions. The perplexity is defined as:

Perp(P_i) = 2^{H(P_i)},

where P_i represents the conditional probability distribution over all other data points given data point x_i, and H(P_i) is the Shannon entropy:

H(P_i) = − Σ_j p_{j|i} log₂ p_{j|i}.

The perplexity is a smooth measure of the effective number of neighbours, similar to the number of nearest neighbours k used in kNN methods [8]. Thus, σ_i is adapted to the density of the data, i.e., it becomes small for dense data since the k-nearest neighbourhood is small, and vice versa. In addition, [18] point out that there is a monotonically increasing relationship between perplexity and the bandwidth σ_i.
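To make the per-point bandwidth search concrete, the following is a minimal sketch (not the authors' implementation) of how each σ_i can be found by binary search so that the perplexity of P_i matches a user-specified value; all function and variable names are illustrative.

```python
import numpy as np

def conditional_probs(sq_dists_i, i, sigma):
    """p_{j|i} for one point, given its squared distances to all points."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits[i] = -np.inf                        # a point never picks itself
    p = np.exp(logits - logits.max())
    return p / p.sum()

def fit_bandwidth(sq_dists_i, i, perplexity, iters=50):
    """Binary-search sigma_i so that 2^H(P_i) matches the target perplexity."""
    target_entropy = np.log2(perplexity)
    lo, hi = 1e-10, np.inf
    sigma = 1.0
    for _ in range(iters):
        p = conditional_probs(sq_dists_i, i, sigma)
        entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))   # Shannon entropy H(P_i)
        if entropy > target_entropy:
            hi = sigma                         # distribution too flat: shrink sigma
        else:
            lo = sigma                         # distribution too peaked: grow sigma
        sigma = (lo + hi) / 2.0 if np.isfinite(hi) else sigma * 2.0
    return sigma, conditional_probs(sq_dists_i, i, sigma)

# Usage: n independent searches, one per data point, on a toy dataset.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
sq_d = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
P_cond = np.stack([fit_bandwidth(sq_d[i], i, perplexity=30)[1] for i in range(len(X))])
P = (P_cond + P_cond.T) / (2 * len(X))         # symmetrised p_ij as in Equation (3)
```

The point to note is that the search is repeated independently for every data point, which is the computational cost examined in Section 3.2.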
The similarity between x'_i and x'_j in the low-dimensional space is measured as:

s(x'_i, x'_j) = (1 + ‖x'_i − x'_j‖²)⁻¹,

and the corresponding probability is defined as:

p'_ij = s(x'_i, x'_j) / Σ_{k≠l} s(x'_k, x'_l).

The distance-based similarity s is used because it has a heavy-tailed distribution, i.e., it approaches an inverse square law for large pairwise distances. This means that far-apart mapped points have p'_ij which are almost invariant to changes in the scale of the low-dimensional space [18].

Note that the probability distributions are defined in such a way that p_ii = 0 and p'_ii = 0, i.e., a point does not pick itself as a neighbour.

The location of each point x' ∈ D' is determined by minimising a cost function based on the (non-symmetric) Kullback-Leibler divergence of the joint probability distribution P' in the low-dimensional space from the joint distribution P in the high-dimensional space:

C = KL(P ‖ P') = Σ_i Σ_{j≠i} p_ij log(p_ij / p'_ij).

The use of the Gaussian kernel K sharpens the cost function in retaining the local structure of the data when mapping from the high-dimensional space to the low-dimensional space. The main computational step in applying t-SNE is to determine the value of the bandwidth σ_i for each data point.
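For completeness, here is a short sketch of the low-dimensional affinities and the KL cost defined above; it is a simplified illustration only (the gradient-descent optimiser that actually moves the mapped points is omitted), and the names are ours.

```python
import numpy as np

def low_dim_affinities(Y):
    """p'_ij from the heavy-tailed similarity s = (1 + ||y_i - y_j||^2)^-1."""
    sq_d = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    s = 1.0 / (1.0 + sq_d)
    np.fill_diagonal(s, 0.0)                   # a point never picks itself
    return s / s.sum()

def kl_cost(P, P_low, eps=1e-12):
    """Kullback-Leibler divergence KL(P || P') summed over all pairs."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (P_low[mask] + eps))))

# Usage with a uniform placeholder P and a random 2-D embedding.
n = 200
P = np.full((n, n), 1.0 / (n * (n - 1)))
np.fill_diagonal(P, 0.0)
Y = np.random.default_rng(0).standard_normal((n, 2))
print(kl_cost(P, low_dim_affinities(Y)))
```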
The procedure of t-SNE is provided in Algorithm 1. Note that m = n for small datasets. For large datasets, m ≪ n; this is discussed in Section 5.4.

t-SNE [18] and its variations have been widely applied in dimensionality reduction and visualisation. In addition to t-SNE [18], which is one of the commonly used visualisation methods, many other variations have been proposed to improve SNE in different aspects.
There are improvements based on some revised Gaussian kernel functions in order to get better similarity measurements.
[5] propose a symmetrised SNE; [29] enable t-SNE to accommodate various heavy-tailed embedding similarity functions; and [26] propose an algorithm based on similarity triplets of the form "A is more similar to B than to C" so that it can model the local structure of the data more effectively.
Based on the concept of information retrieval, NeRV [27] uses a cost function to find a trade-off between precision and recall of "making true similarities visible and avoiding false similarities", when projecting data into 2-dimensional space for visualising similarity relationships. Unlike SNE, which relies on a single Kullback-Leibler divergence, NeRV uses a weighted mixture of two dual Kullback-Leibler divergences in neighbourhood retrieval. Furthermore, JSE [11] enables t-SNE to use a different mixture of Kullback-Leibler divergences, a kind of generalised Jensen-Shannon divergence, to improve the embedding result.
To reduce the runtime of t-SNE, [25] explores tree-based indexing schemes and uses the Barnes-Hut approximation to reduce the time complexity to O(n log n), where n is the data size. This gives a trade-off between speed and mapping quality. To further reduce the time complexity to O(n), [14] utilise a fast Fourier transform to dramatically reduce the time of computing the gradient during each iteration. The method uses vantage-point trees and approximates nearest neighbours in the dissimilarity calculation with rigorous bounds on the approximation error.
Some works focus on analysing the heuristic methods for solving non-convex optimisation problems for the embedding [15,21]. Recently, [1] theoretically analyse this optimisation and provide a framework to make clusterable data visually identifiable in the 2-dimensional embedding space. These works focus on changing the optimisation problem and are not related to similarity measurements.
So far, however, none of these studies has investigated the suitability of the Gaussian kernel in t-SNE. The following two sections will uncover the issues of using the Gaussian kernel in t-SNE and propose to replace it with the Isolation kernel.
3 Deficiencies of Gaussian kernel when used in t-SNE

Here we list two identified deficiencies of the Gaussian kernel that cause poor visualisation outputs and high computational cost in t-SNE.

3.1 The first deficiency

3.1.1 Point-based bandwidth: the cause of misrepresentation in high-dimensional space

As the bandwidth σ_i of the Gaussian kernel is fixed for each point x_i, we make the following observation:

Observation 1. A Gaussian kernel with point-based bandwidths can misrepresent the structure of a data distribution in which some points are significantly denser than the majority of the points in a sample generated from the distribution.

Intuitively, as each point-based bandwidth represents one local density only, the Gaussian kernel can misrepresent the relationship between multiple clusters in the joint distribution of the overlap region. We provide two example cases in which misrepresentation occurs, i.e., there are multiple subspace clusters, each being a Gaussian distribution with the same mean and: (i) different variances; or (ii) the same variance.
Let X_1 and X_2 be two subspace regions in a high-dimensional space, where the points in the two clusters are generated from the Gaussian distributions N[0, v_1] and N[0, v_2], respectively; the two distributions overlap at the origin O only.
In case (i), where v_1 < v_2, let point x_{k1} ∈ X_1 be the point closest to O in the dense cluster, and point x_{k2} ∈ X_2 be the point closest to O in the sparse cluster. Then K(O, x_{k1}) ≫ K(O, x_{k2}), i.e., the origin appears far more similar to its nearest neighbour in the dense cluster than to its nearest neighbour in the sparse cluster.
In case (ii), where v_1 = v_2, using an appropriate setting in the current t-SNE procedure, each point x in either X_1 or X_2 would have learned approximately the same bandwidth σ, except the origin O, because O has at least double the density of any point in either cluster. As a result, ∀x_i, x_j ∈ X_1 (or ∀x_i, x_j ∈ X_2), K(x_i, x_j) ≫ K(O, x_i) ≈ 0. This means that the origin is very dissimilar to any points in either cluster.
Simulations of the two cases are given below.

(i) Five subspace clusters having different variances in a 50-dimensional space (see the simulation details in Footnote 1).

Table 1: Visualisation results of t-SNE using the Gaussian kernel and the Isolation kernel on a 50-dimensional dataset with 5 subspace clusters, each in a different 10-dimensional subspace. The black cross indicates the mapped point of the origin in the high-dimensional space, which is shared by three clusters in different subspaces. Note that in (c), all points of the red cluster (cluster 1) are concentrated and they overlap with the mapped origin. The perplexity and ψ are the key parameters for the Gaussian kernel and the Isolation kernel, respectively.

Using the Gaussian kernel, t-SNE creates a misrepresentation of the structure in the high-dimensional space. The simulation result is shown in the first row in Table 1: t-SNE is unable to identify the joint component of the three clusters in different subspaces, which share the same mean at the origin in the high-dimensional space but nowhere else. Notice that the mapped origin point is misrepresented to be associated with one cluster only; it is totally disassociated from the other two clusters.
In contrast, the same t-SNE algorithm employing the Isolation kernel [24,20], instead of a Gaussian kernel, produces the mapping which truly represents the structure in the high-dimensional space: the three clusters are well separated and yet they share some common points, indicated by the mapped origin point as shown in the second row in Table 1.
(ii) Two subspace clusters in a 200-dimensional dataset, with the two subspace clusters having the same Gaussian distribution N[0, 1] but in different subspaces (see Footnote 2). Table 2 shows the simulation results. When the Gaussian kernel is used, t-SNE with a small perplexity produces a small bandwidth for every point, so that each point has almost the same low similarity to every other point in the dataset, as shown in Figure (a) in Table 2. Note that the two clusters could not be distinguished in the visualisation if the colours, indicating the ground-truth labels, were not used in the plot. Yet, t-SNE with a large perplexity produces large bandwidths for all points, except the origin, which has a significantly smaller bandwidth; note that the origin (denoted as ×) and the rest of the points are at opposite corners in Figure (c) in Table 2. This is because the origin, being the only overlap point between the two clusters, has a significantly higher density than all other points. As both clusters have the same variance, all their points have low density (relative to the origin) and are 'learned' to have approximately the same bandwidth, which is significantly larger than that of the origin. As a result, the origin is very dissimilar to all other points, though all the other points are correctly clustered into two separate groups. In contrast, when the Isolation kernel is used, the origin is always positioned in-between the two clusters, independent of the ψ parameter setting.
Footnote 1: Each of the five clusters occupies a different 10-dimensional subspace of the 50-dimensional space and is sampled from a Gaussian distribution; in other words, no clusters share a single relevant attribute. In addition, all clusters have significantly different variances (the variance of the 5th cluster is 625 times larger than that of the 1st cluster). The first three clusters share the same mean, but the last two have different means. The five clusters have distributions N[0, 1], N[0, 16], N[0, 81], N[400, 256] and N[500, 625] in each dimension of their respective subspaces.

Footnote 2: Each cluster has 500 points, sampled from a 100-dimensional Gaussian distribution N[0, 1] with the other 100 irrelevant attributes having zero values; no clusters share a single relevant attribute.

Note that the above-mentioned deficiency is not restricted to subspace clusters without shared attributes. An example using subspace clusters with shared attributes can be found in Appendix A.

3.1.2 No need for point-based bandwidth in Isolation kernel
The space partitioning mechanism of the Isolation kernel [24,20] determines the size of the partitions in a local region: it produces large partitions in sparse regions and small partitions in dense regions (see Section 4.2 for more details). As it is partition-based, points in a local neighbourhood are most likely to be in the same partition. As such, points in the intersection of clusters (in different subspaces, as shown in Table 1) are almost always captured by the same partition of the Isolation kernel.
An example distribution of similarities based on the dataset shown in Table 1 is given in Figure 1. Let x_{k1} be the origin O's closest point in the dense cluster (i.e., cluster 1), and x_{k2} be O's closest point in a sparse cluster (cluster 2 or 3). Figure 1b shows that K_ψ(O, x_{k1}) ≈ K_ψ(O, x_{k2}) when the Isolation kernel is used. When the Gaussian kernel is used, K(O, x_{k1}) ≫ K(O, x_{k2}), as shown in Figure 1a.
This explains why the points in the intersection are better mapped in the low-dimensional space by using the Isolation kernel than using the Gaussian kernel.
In other words, the Isolation kernel ensures that the local structure is truly reflected in the similarities among local points in the high-dimensional space, unlike the misrepresentation exhibited in Table 1 and Table 2 when the Gaussian kernel is used. As a result, t-SNE using the Isolation kernel produces improved visualisation quality with no misrepresentations.

3.2 The second deficiency

3.2.1 Low computational efficiency problem with Gaussian kernel
The use of a Gaussian kernel necessitates the search for a local bandwidth for each local point. t-SNE utilises a binary search for the value of σ_i that makes the entropy of the distribution over neighbours equal to log K, where K is the effective number of local neighbours or "perplexity" [18]. This search is the key component that determines the success or failure of t-SNE. A gradient descent search has been used successfully to perform the search for the n parameters on small datasets [18]. This formulation has two key limitations for large datasets. First, the need to search for n parameters poses a real limitation in terms of finding appropriate settings for a large number of parameters. Second, it cannot deal with large datasets because of its low computational efficiency, i.e., the time complexity is O(n²).

3.2.2 High computational efficiency with Isolation kernel
The computational complexities of the Gaussian kernel and the Isolation kernel [24,20] used in t-SNE are shown in Table 3. Although the parameter ψ of the Isolation kernel corresponds to the bandwidth parameter of the Gaussian kernel, the Isolation kernel needs no optimisation to determine n bandwidths locally. This is because the partitioning mechanism used by the Isolation kernel produces small partitions in dense regions and large partitions in sparse regions; and the sizes of the partitions are monotonically decreasing with respect to ψ. As the local adaptation has already been done during the process of deriving the kernel, no further adaptation is required after the kernel is derived.
While the derivation of the Isolation kernel from data takes O(tψ) time, which is constant with respect to n, the optimisation required to determine n bandwidths for the Gaussian kernel takes O(n²) time. For a large dataset, when using the Gaussian kernel, it is infeasible to estimate a large number of bandwidths with an appropriate degree of accuracy, and the computational cost is prohibitively high. In contrast, the consequence of using the Isolation kernel is that the runtime of step 1 in the t-SNE algorithm is significantly reduced. Thus, the Isolation kernel enables t-SNE to deal with large datasets. More experimental details are provided in Sections 5.4 and 6.3.
4 The proposed solution: using the Isolation kernel in t-SNE

Since t-SNE needs a data-dependent kernel, we propose to use a recent data-dependent kernel called Isolation kernel [24,20] to replace the data-independent Gaussian kernel in t-SNE.

The Isolation kernel is a perfect match for the task because a data-dependent kernel, by definition, adapts to the local distribution without any additional optimisation. The kernel replacement is applied to the high-dimensional-space component only, leaving the other components of the t-SNE procedure unchanged.

Sections 4.1 and 4.2 are literature reviews of the Isolation kernel [24] and a known fact [20]. Sections 4.3, 4.4 and 4.5 are our original contributions to the Isolation kernel and t-SNE in this paper.

4.1 Isolation kernel

The key idea of the Isolation kernel is to use a space partitioning strategy to split the data space into different cells, e.g., we uniformly sample ψ points from the given dataset and generate ψ Voronoi cells; the similarity between any two points is then the probability that the two points fall into the same cell.
The details of the Isolation kernel [24,20] are provided below.

Let D = {x_1, ..., x_n} be a dataset sampled from an unknown probability density function, x_i ∼ F. Moreover, let H_ψ(D) denote the set of all partitionings H admissible for the given dataset D, where each H covers the entire space of R^d; and each of the ψ isolating partitions θ[z] ∈ H isolates one data point z from the rest of the points in a random subset 𝒟 ⊂ D, with |𝒟| = ψ. In our implementation, H is a Voronoi diagram generated from 𝒟.

Definition 1. For any two points x, y ∈ R^d, the Isolation kernel of x and y wrt D is defined to be the expectation, taken over the probability distribution on all partitionings H ∈ H_ψ(D), that both x and y fall into the same isolating partition θ[z] ∈ H, z ∈ 𝒟:

K_ψ(x, y | D) = E_{H_ψ(D)}[ 1(x, y ∈ θ[z] | θ[z] ∈ H) ].

In practice, the Isolation kernel K_ψ is constructed using a finite number of partitionings H_i, i = 1, ..., t, where each H_i is created using 𝒟_i ⊂ D:

K_ψ(x, y | D) = (1/t) Σ_{i=1..t} 1(x, y ∈ θ | θ ∈ H_i) = (1/t) Σ_{i=1..t} Σ_{θ∈H_i} 1(x ∈ θ) · 1(y ∈ θ),    (7)

where θ is a shorthand for θ[z]; and t can usually be set to a default value. ψ is the sharpness parameter and the only parameter of the Isolation kernel: the larger ψ is, the sharper the kernel distribution is. This corresponds to σ in the Gaussian kernel, i.e., the smaller σ is, the narrower the kernel distribution is. Note that t is the number of partitionings and can be fixed to a large value to ensure the stability of the estimation.
As Equation (7) is in a quadratic form, K_ψ is a valid kernel. For brevity, K_ψ(x, y) is used to denote K_ψ(x, y | D) hereafter.
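The following is a minimal sketch of a Voronoi (nearest-neighbour) implementation of Equation (7), assuming Euclidean distance; it only illustrates the construction, and the function names are ours rather than those of [24,20].

```python
import numpy as np

def build_ik_model(X, psi=64, t=200, seed=None):
    """t random subsamples of psi points; each induces one Voronoi partitioning."""
    rng = np.random.default_rng(seed)
    return [X[rng.choice(len(X), size=psi, replace=False)] for _ in range(t)]

def ik_similarity(model, A, B=None):
    """K_psi(a, b): fraction of partitionings in which a and b share a Voronoi cell."""
    B = A if B is None else B
    sim = np.zeros((len(A), len(B)))
    for centres in model:
        # a point's cell is identified by the index of its nearest subsampled point
        cell_a = np.argmin(((A[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        cell_b = np.argmin(((B[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        sim += cell_a[:, None] == cell_b[None, :]
    return sim / len(model)

# Usage: similarities lie in [0, 1] and adapt to density with no per-point tuning.
rng = np.random.default_rng(0)
X = rng.random((500, 10))
model = build_ik_model(X, psi=32, t=100, seed=0)
K = ik_similarity(model, X)
```

Each partitioning is represented solely by its ψ sampled points, so building the model requires nothing beyond subsampling, which is why no per-point optimisation is needed.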

4.2 How Isolation kernel differs from Gaussian kernel
The key difference is that the Isolation kernel adapts to local density distribution, but the Gaussian kernel is independent of the data distribution.
In addition, the technical differences can be observed in two aspects. First, the Isolation kernel has no closed-form expression. Second, it is derived directly from a dataset, without explicit learning or optimisation. Its adaptation to local density is a direct outcome of the isolation mechanism used to partition the space, i.e., the mechanism produces large partitions in sparse regions and small partitions in dense regions [24,20]. A natural isolation mechanism that has this characteristic is a Voronoi diagram. Given a sample of the underlying distribution, each Voronoi cell isolates a point from the rest of the points in the sample; the cells are small in dense regions and large in sparse regions. Note that the Voronoi diagram is obtained very efficiently: given a sample, nothing else needs to be done in the training stage because the boundaries of the Voronoi diagram can be obtained at the testing stage as the points equidistant from the two nearest points in the given sample.
Figure 2 shows two examples of partitioning H using the nearest neighbour rule, i.e., a Voronoi diagram, on the same dataset with two different subsample sizes ψ. These examples show that, for each ψ, there are more (small) cells in the dense region than (large) cells in the sparse region; and the sizes of the cells usually decrease with respect to ψ. Two points located in the same cell get a similarity score of 1 in a partitioning. The final Isolation kernel similarity between two points is the probability of both points falling into the same cell over a finite number of partitionings, as shown in Equation (7). Examples of kernel distributions due to different ψ values are shown in Appendix B, as are the implementation details.

4.3 The Isolation kernel makes full use of the distributional information in small samples

The Isolation kernel only requires small samples (of size ψ) for the space partitioning, without a computationally expensive process. A small sample of a dataset contains data distributional information which is sufficient to build a data-dependent kernel.
The Isolation kernel extracts this information in the form of a Voronoi diagram, which depicts the relative densities between regions.
In contrast, using a data-independent measure such as the Gaussian kernel, the distributional information in a dataset is ignored and each point in the input space is treated as an independent point. In order to get the distributional information in the form of variable bandwidths that are adaptive to the local distribution, a separate optimisation process is required, as conducted in step 1 of the t-SNE algorithm.
It is important to note that, when they cannot handle a large dataset, most methods use small samples as a mitigation approach, which inevitably trades off accuracy for runtime. This is not the case for the Isolation kernel, where small samples are the key to achieving high accuracy; samples larger than the optimal ψ will degrade the accuracy of the Isolation kernel. See further discussion on this issue in Section 6.
In other words, by using the Gaussian kernel, t-SNE must employ a computationally expensive approach to get the distributional information in a dataset. It does not exploit the same information which is freely available in small samples of the dataset. The Isolation kernel is a direct approach that makes full use of the distributional information freely available in small samples of a dataset.

4.4 The Isolation kernel is well-defined
The Isolation kernel has the following well-defined data-dependent characteristic: two points in a sparse region are more similar than two points of equal inter-point distance in a dense region [24].
Using a specific implementation of the Isolation kernel (see Appendix B), [20] have provided the following lemma (see its proof in their paper):

Lemma 1. [20] ∀x_i, x_j ∈ X_S (sparse region) and ∀x_k, x_ℓ ∈ X_T (dense region) such that ∀_{y∈X_S, z∈X_T} ρ(y) < ρ(z), the nearest neighbour-induced Isolation kernel K_ψ has the characteristic that, for ‖x_i − x_j‖ = ‖x_k − x_ℓ‖,

K_ψ(x_i, x_j) > K_ψ(x_k, x_ℓ),

where ‖x − y‖ is the distance between x and y; and ρ(x) denotes the density at point x.

Let p_{b|a} be the probability that x_a would pick x_b as its neighbour.
We provide two corollaries from Lemma 1 as follows.
Corollary 1. x_i is more likely to pick x_j as a neighbour than x_k is to pick x_ℓ as a neighbour, i.e., p_{j|i} > p_{ℓ|k}. This is because x_k in the dense region is more likely to pick a point closer than x_ℓ as its neighbour, in comparison with x_i picking x_j as a neighbour in the sparse region, given that p_{b|a} ∝ 1/ρ̄(X_A), where x_a, x_b ∈ X_A, a region in X, and ρ̄ is the average density of a region.
Using a data-dependent kernel with a well-defined characteristic as specified in Lemma 1, we can establish that the probability that x_a would pick x_b, p_{b|a}, is inversely proportional to the density of the local region.
This becomes the basis for setting a reference probability in the high-dimensional space.
It is interesting to note that the adaptation of the Gaussian kernel by optimising n bandwidths attempts to achieve a similar outcome, as stipulated in Corollaries 1 and 2. Yet, it is unclear that a similar data-dependent characteristic, as stated in Lemma 1, can be formally stated for the adaptive Gaussian kernel. This is because the similarity cannot be computed for all x ∈ R^d (except those in the given dataset).

4.5 t-SNE with the Isolation kernel

We propose to replace K with K_ψ in defining p_{j|i} in Equation (2), i.e.,

p_{j|i} = K_ψ(x_i, x_j) / Σ_{k≠i} K_ψ(x_i, x_k).    (9)

The rest of the procedure of t-SNE remains unchanged.
The procedure of t-SNE with the Isolation kernel is provided in Algorithm 2.
Note that the only differences between the two algorithms are step 1, and the use of Equation (9) (instead of Equation (2)) in step 2.
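Below is a hedged sketch of how Equation (9), followed by the symmetrisation of Equation (3), can be computed from a precomputed Isolation kernel similarity matrix such as the one produced by the sketch in Section 4.1; the helper name is illustrative, and the mapping step (step 3) is the standard t-SNE optimiser, unchanged.

```python
import numpy as np

def ik_affinities(K):
    """Row-normalise K_psi as in Equation (9), then symmetrise as in Equation (3)."""
    P_cond = np.array(K, dtype=float)
    np.fill_diagonal(P_cond, 0.0)                        # p_{i|i} = 0
    P_cond /= P_cond.sum(axis=1, keepdims=True) + 1e-12
    return (P_cond + P_cond.T) / (2 * len(K))            # symmetric p_ij

# Usage with a toy similarity matrix; in practice K comes from the Isolation kernel.
rng = np.random.default_rng(0)
K = rng.random((100, 100))
K = (K + K.T) / 2
P = ik_affinities(K)
```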

5 Empirical Evaluation
This section presents the three evaluation methods we adopt, evaluation results, runtime comparison and a scalability test.

5.1 Evaluation measures

We used a qualitative assessment R(k) to evaluate the preservation of k-ary neighbourhoods [12,11,10], defined as follows:

R(k) = (1/(kn)) Σ_{x∈D} | ν_k(x) ∩ ν'_k(x') |,

where ν_k(x) is the set of k nearest neighbours of x in the high-dimensional (HD) space, ν'_k(x') is the set of k nearest neighbours of x' in the low-dimensional (LD) space, and x' is the corresponding LD point of the HD point x. R(k) measures the k-ary neighbourhood agreement between the HD and corresponding LD spaces. R(k) ∈ [0, 1]; the higher the score is, the better the neighbourhoods are preserved in the LD space. In our experiments, we recorded the assessment with k ∈ {0.01n, 0.03n, ..., 0.99n} and produced the curve of k versus R(k).
To aggregate the performance over the different k-ary neighbourhoods, we calculate the area under the R(k) curve in the log plot [11] as:

AUC_{RNX} = ( Σ_k R(k)/k ) / ( Σ_k 1/k ).

AUC_{RNX} assesses the average quality weighted by k, i.e., errors in large neighbourhoods (large k) contribute less to the average quality than errors in small neighbourhoods.
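As an illustration of how these measures can be computed, the sketch below implements R(k) as the average k-NN overlap between the HD and LD spaces and aggregates it with 1/k weights; this follows our reading of the definitions above and is not the authors' evaluation code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def r_k(X_hd, X_ld, k):
    """Average overlap of the k-NN sets in the HD and LD spaces, divided by k."""
    idx_hd = NearestNeighbors(n_neighbors=k + 1).fit(X_hd).kneighbors(
        X_hd, return_distance=False)[:, 1:]              # drop the point itself
    idx_ld = NearestNeighbors(n_neighbors=k + 1).fit(X_ld).kneighbors(
        X_ld, return_distance=False)[:, 1:]
    overlap = [len(set(a) & set(b)) for a, b in zip(idx_hd, idx_ld)]
    return float(np.mean(overlap)) / k

def auc_rnx(X_hd, X_ld, ks):
    """1/k-weighted aggregate of R(k) over the evaluated neighbourhood sizes."""
    w = 1.0 / np.asarray(ks, dtype=float)
    scores = np.array([r_k(X_hd, X_ld, k) for k in ks])
    return float((w * scores).sum() / w.sum())

# Usage on a toy projection.
rng = np.random.default_rng(0)
X = rng.random((300, 20))
Y = X[:, :2]                                             # a trivial 2-D "embedding"
ks = [max(1, int(f * len(X))) for f in (0.01, 0.03, 0.05, 0.10)]
print(auc_rnx(X, Y, ks))
```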
In addition, the purpose of many methods of dimensionality reduction is to identify HD clusters in the LD space, such as in a 2-dimensional scatter plot. Since all the datasets we used for evaluation have ground truth (labels), we can use measures for clustering validation to evaluate whether all clusters can be correctly identified after they are projected into the LD space. Here we select two popular indices of cluster validation, i.e., the Davies-Bouldin (DB) index [6] and the Calinski-Harabasz (CH) index [3]. Their details are given as follows.
Table 4: Parameters and their search ranges for each kernel function.
The Davies-Bouldin (DB) index is calculated as

DB = (1/K) Σ_{i=1..K} max_{j≠i} (s_i + s_j) / d(c_i, c_j),

where K is the number of clusters, c_i is the centre of cluster C_i, s_i is the average distance of the points in C_i to c_i, and d(c_i, c_j) is the distance between the two centres. The Calinski-Harabasz (CH) index is calculated as

CH = [ Σ_{i=1..K} n_i ‖c_i − c‖² / (K − 1) ] / [ Σ_{i=1..K} Σ_{x∈C_i} ‖x − c_i‖² / (n − K) ],

where n_i is the number of points in cluster C_i and c is the centre of the dataset.
Both measures take the similarity of points within a cluster and the similarity between clusters into consideration, but in different ways. These measures assign the best score to the algorithm that produces clusters with low intra-cluster distances and high inter-cluster distances. Note that a higher CH score indicates a better cluster distribution, while a lower DB score indicates a better cluster distribution.
All algorithms used in the following experiments were implemented in Matlab 2019b and were run on a machine with 14 cores (Intel Xeon E5-2690 v4 @ 2.59 GHz) and 256GB memory. All datasets were normalised using min-max normalisation so that each attribute is in [0,1] before the experiments began. We also apply min-max normalisation to the t-SNE results before calculating the DB and CH scores.
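For reference, the cluster-validity scoring described above can be reproduced with the standard scikit-learn implementations of the DB and CH indices, as in the following sketch; the embedding and labels are placeholders rather than outputs of our experiments.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def score_embedding(Y, labels):
    """DB (lower is better) and CH (higher is better) on a min-max-normalised embedding."""
    Y = (Y - Y.min(axis=0)) / (Y.max(axis=0) - Y.min(axis=0))
    return davies_bouldin_score(Y, labels), calinski_harabasz_score(Y, labels)

# Usage on toy data with two well-separated groups.
rng = np.random.default_rng(0)
Y = np.vstack([rng.standard_normal((100, 2)), rng.standard_normal((100, 2)) + 5])
labels = np.array([0] * 100 + [1] * 100)
print(score_embedding(Y, labels))
```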

5.2 Evaluation results

This section presents the results of the utility evaluation of the Isolation kernel and the Gaussian kernel in t-SNE using 21 real-world datasets with different data sizes and dimensions. We report the best performance of each algorithm after a systematic parameter search over the ranges shown in Table 4. Note that the Isolation kernel has only one manual parameter, ψ, which controls the partitioning mechanism; the other parameter, t, can be fixed to a default value.
Table 5 shows the results of the two kernels used in t-SNE. The Isolation kernel performs better on 18 out of 21 datasets in terms of AUC_{RNX}, which means that the Isolation kernel enables t-SNE to preserve the local neighbourhoods much better than the Gaussian kernel. With regard to cluster quality, the Isolation kernel performs better than the Gaussian kernel on 18 out of 21 datasets in terms of both DB and CH. Notice that when the Gaussian kernel is better, the performance gaps are usually small in any of the three measures. Overall, the Isolation kernel is better than the Gaussian kernel on 16 out of 21 datasets in all three measures. The reverse is true on one dataset only, i.e., News20.
The visualisation result on News20 indicates that there are significant overlaps between the two clusters in this dataset. This is reflected in the AUC_{RNX} results, which are significantly less than a random assignment (AUC_{RNX} = 0.5).
The visualisation result of News20 is shown in Appendix C.
On the COIL20 dataset, we have identified a structural misrepresentation issue with the Gaussian kernel, similar to the one shown in Table 2. Table 6 shows the five clusters where the Gaussian kernel has misrepresented structures in the high-dimensional space. The 3-dimensional results show that the Isolation kernel depicts a more nuanced structural relationship between the five clusters, whereas the Gaussian kernel depicts them as five disparate clusters, as shown in the second column in Table 6. Also, note that a reference point × is close to all five clusters when the Isolation kernel is used, but it is far from many clusters when a Gaussian kernel is used.

5.3 Runtime comparison

Generally, both the Gaussian kernel and the Isolation kernel have quadratic time and space complexities. However, the Gaussian kernel in the original t-SNE needs a large number of iterations to search for the optimal local bandwidth of each point. As a result, the Gaussian kernel takes a much longer time in step 1 of the algorithm than the Isolation kernel. Figure 3 presents two runtime comparisons of t-SNE with the two kernels on a synthetic dataset. Figure 3(a) shows that the Gaussian kernel is much slower than the Isolation kernel in the similarity calculations. This is mainly due to the search required to tune the n bandwidths in step 1 of the algorithm. It is interesting to note that, though both similarities have n² time complexity, the constant is significantly lower for the Isolation kernel: when the data size is increased 10 times from 10,000 to 100,000, the Gaussian kernel increases its runtime 685 times, whereas the Isolation kernel increases its runtime only 91 times. As a result, with a dataset of 100,000 data points, the Isolation kernel is two orders of magnitude faster than the Gaussian kernel (887 seconds versus 72,196 seconds). Figure 3(b) shows the runtime of the mapping process in step 3 of Algorithms 1 and 2, which is the same for both algorithms. It is not surprising that their runtimes are about the same in this step, regardless of the kernel employed.
Table 7 compares the CPU runtimes of the Gaussian kernel and the Isolation kernel used in t-SNE on four real-world datasets. The t-SNE with the Isolation kernel is up to one order of magnitude faster than the t-SNE with the Gaussian kernel in the first two steps.

5.4 Scalability testing

Here we show that the Isolation kernel enables t-SNE to deal with large datasets because step 1 takes constant time (once the parameters are fixed), rather than O(n²) as when a Gaussian kernel is used. This allows t-SNE to deal with a dataset with millions of data points in step 1, while using a subsample in steps 2 & 3 to visualise the dataset in a low-dimensional space.
To demonstrate this ability, we use the MNIST8M dataset [17] with 8.1 million points in step 1, and then use either the MNIST dataset or a subsample of 10,000 data points from MNIST8M in steps 2 & 3 of t-SNE. The results of t-SNE with the Isolation kernel are shown in the last two columns in Table 8. The results show that IK can get good CH scores with small ψ values. It took 334 seconds (ψ = 2048) in steps 1 and 2, and 972 seconds in step 3. Note that t-SNE with the Gaussian kernel cannot be directly applied to this massive dataset in the same manner because it would take too long to complete step 1, as shown in Figure 3(a).
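The workflow just described can be sketched as follows: step 1 builds the Isolation kernel partitionings on the full dataset, and step 2 computes the m × m affinity matrix only for a subsample; the code is a self-contained illustration on placeholder data, not the experimental pipeline used here, and all names are ours.

```python
import numpy as np

def nearest_cell(points, centres):
    """Voronoi cell of each point: the index of its nearest centre."""
    return np.argmin(((points[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)

def large_scale_steps_1_and_2(X_full, m=10_000, psi=2048, t=200, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: t random subsamples of psi points define the partitionings;
    # the cost depends on psi and t, not on the full data size n.
    model = [X_full[rng.choice(len(X_full), psi, replace=False)] for _ in range(t)]
    # Step 2: the m x m affinity matrix is computed for a subsample only.
    sub = X_full[rng.choice(len(X_full), m, replace=False)]
    cells = [nearest_cell(sub, centres) for centres in model]
    K = sum((c[:, None] == c[None, :]) for c in cells) / t
    np.fill_diagonal(K, 0.0)
    P_cond = K / (K.sum(axis=1, keepdims=True) + 1e-12)
    P = (P_cond + P_cond.T) / (2 * m)
    return P, sub            # Step 3: run the usual t-SNE mapping on `sub` using P

# Usage on synthetic data standing in for a large dataset.
X = np.random.default_rng(1).random((50_000, 20))
P, sub = large_scale_steps_1_and_2(X, m=1_000, psi=256, t=50)
```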
The use of a subsample in steps 2 and 3 was previously suggested by [18]. However, the suggestion was to replace the Gaussian kernel with a graph similarity that employs a random walk method. This graph similarity approach has the same limitation as the Gaussian kernel because of its high time complexity. It requires a neighbourhood graph to be generated before a random walk kernel (or any graph kernel) can be used to measure similarities. While many graph kernels (see e.g., [9]) may be applied here, the key obstacle is the generation of the neighbourhood graph, which has at least O(n²) time complexity.
In summary, employing the Isolation kernel is the only method that takes constant time in step 1. Meanwhile, subsampling in steps 2 and 3 enables t-SNE to process large-scale datasets without compromising the reference probability that needs to be established in step 1.

6 Discussion

6.1 The proposed method can benefit existing variants of t-SNE
The common feature of existing variants of t-SNE is that they all use the Gaussian kernel. The proposed idea can be applied to variants of stochastic neighbour embedding, e.g., NeRV [27] and JSE [11], since they employ the same algorithmic procedure as t-SNE. The only difference is the use of variant cost functions, i.e., type 1 or type 2 mixtures of KL divergences.
In addition, the Isolation kernel can be used in existing methods which aim to speed up t-SNE in step 3 of the algorithm. This is discussed in Section 6.3.

6.2 Isolation kernel performs optimally with small samples

The finding that small samples (as the ψ value) yield better visualisation results than large samples was formally analysed in the context of nearest neighbour anomaly detection [23]. The work is motivated by the previous finding that small samples can produce better detection accuracy for some anomaly detectors than large samples (e.g., [16,22]). The theoretical analysis based on computational geometry reveals that the geometry of the data distribution has a direct impact on the sample size setting which is essential to produce an optimal nearest neighbour anomaly detector [23]. In a simple geometry such as a Gaussian distribution, a sample size of one data point (at the mean of the Gaussian distribution) yields the optimal nearest neighbour anomaly detector; a sample of more data points will produce a worse performing detector. In a more complex geometry of data distribution (e.g., a mixture of multiple Gaussian distributions), while the optimal sample size is more than one data point, a sample size over the optimal one also produces a worse performing detector. See [23] for details.
The above result can explain the effect of small samples in Isolation kernel described in Section 4.3: the optimal sample size is the representative sample for the underlying geometry of data distribution, allowing the Isolation kernel to model relative similarities between different regions most effectively.
In summary, most methods use small samples as a mitigation approach when they fail to handle large datasets; this comes at the cost of accuracy. However, algorithms employing the Isolation kernel can process large datasets without trading off accuracy and efficiency, due to the resultant sample. While ψ of the Isolation kernel serves the primary purpose of a kernel parameter, like the bandwidth parameter of the Gaussian kernel, the resultant sample size enables algorithms that employ the Isolation kernel to deal with large datasets without compromising the accuracy of the task.

6.3 Methods to speed up t-SNE

Scalability is an open issue for applying unsupervised distance metric learning approaches on large datasets [28]. As mentioned before, there are currently two ways to speed up t-SNE: subsampling (which is a mitigation approach discussed in Section 4.3), and using some approximation to reduce the runtime of step 3.
The two approximation methods mentioned in the literature review are (i) the Barnes-Hut algorithm in conjunction with the dual-tree algorithm [25], and (ii) interpolating onto an equispaced grid in order to use the fast Fourier transform to perform the convolution required in step 3 of the t-SNE algorithm [14]. However, these approximation methods sacrifice accuracy for efficiency. For example, opt-SNE [2] utilises Kullback-Leibler divergence evaluation to automatically identify the tailored parameters in the optimisation procedure of t-SNE, in order to reduce the iteration time and improve the embedding quality. Nevertheless, all of these methods are still based on the Gaussian kernel. Therefore, they still have the same deficiency of misrepresented structures as the original t-SNE, as discussed in Section 3.1.1. Appendix E and Appendix F show examples of these outcomes for FIt-SNE [14] and opt-SNE [2], respectively.
In a nutshell, the proposed method of using the Isolation kernel in t-SNE offers (i) the only way to establish the reference probability in step 1 using a large dataset (without parallelisation); and (ii) a way to speed up t-SNE, which is an alternative to existing speedup methods. The use of a subsample, as a mitigation approach, in step 1 compromises the accuracy of the reference probability. The use of an approximation method in step 3 reduces the quality of the dimensionality reduction. These existing methods for speeding up t-SNE still employ the Gaussian kernel, and thus they fail to address the two deficiencies we have identified.

7 Conclusions

This paper identifies two deficiencies in t-SNE due to the use of the Gaussian kernel. First, the point-based-bandwidth Gaussian kernel often creates misrepresented structure(s) which do not exist in the given dataset under some conditions. Second, the data-independent Gaussian kernel largely increases the computational load, resulting from the need to determine n bandwidths for a dataset of n points, and is thus unable to deal with large datasets. Though some methods have been suggested to trade off accuracy for faster running speed, the underlying issue due to the use of the Gaussian kernel remains unresolved.
Since the root cause of these deficiencies is the use of a data-independent kernel, we propose to simply replace the Gaussian kernel with a data-dependent kernel called Isolation kernel.

Figure 7 shows the visualisation results on three datasets using opt-SNE. As expected, opt-SNE produced similar results to t-SNE, having misrepresented structures in Figures 7a and 7b. On MNIST, opt-SNE got a slightly worse result than t-SNE (CH=6129 versus CH=6452) because it split the green cluster into two parts, as shown in Figure 7c.

Gaussian kernel: (a) perplexity = 50, (b) perplexity = 250, (c) perplexity = 500; Isolation kernel: (d) ψ = 50, (e) ψ = 250, (f) ψ = 500. Panel (b) shows the movement of the origin using a perplexity between those used in (a) and (c): the origin moves from in-between the two clusters in (a) to the edge of a cluster in (b), before moving to a location far away from both clusters in (c).

Table 2:
Visualisation results of t-SNE with the Gaussian kernel and the Isolation kernel on a 200-dimensional dataset with two equal-density subspace clusters. Note that in (c), the origin is far away from both clusters, although there is a clear gap between the two clusters. The green box in (c) presents a zoom-in view of the two clusters. Panels: Gaussian kernel: (a) perplexity = 50, (b) perplexity = 210, (c) perplexity = 300; Isolation kernel: (d) ψ = 50, (e) ψ = 210, (f) ψ = 300.

Figure 1:
Isolation kernel versus Gaussian kernel: distributions of similarities of points wrt the origin for three clusters of N[0, 1], N[0, 16] and N[0, 81] in different subspaces, shown in Table 1, where each is a 10-dimensional cluster (see the details in Footnote 1). The similarities are computed in the 50-dimensional space. The left-most point in each cluster is the point closest to the origin O, having the highest similarity: x_{k1} is the red left-most point; x_{k2} is the yellow (or green) left-most point. Panels: (a) Gaussian kernel with perplexity = 250; (b) Isolation kernel with ψ = 250.
Figure 2:
Two examples of partitioning H using the nearest neighbour (a Voronoi diagram) on a dataset having two regions of uniform densities, where the left half has a lower density than the right half: (a) ψ = 16; (b) ψ = 64.

Algorithm 2: t-SNE(D, ψ, m) which employs the Isolation kernel
Require: D, a dataset {x_1, ..., x_n}; ψ, the sharpness parameter of the Isolation kernel
1: Build a space partitioning model using t sets of ψ data points for the Isolation kernel K_ψ
2: Compute the matrix [p_ij]_{m×m} according to Equations (3) & (9)
3: Compute the low-dimensional D' and p'_ij which minimise the KL divergence
4: Output the low-dimensional data representation D' = {x'_1, ..., x'_m}

Figure 3:
CPU runtime comparison of the Gaussian kernel and the Isolation kernel used in t-SNE on a 2-dimensional synthetic dataset: (a) runtime for steps 1 & 2 (m = n); (b) runtime for step 3.
Table 6: (a) and (c) show the t-SNE visualisation results on COIL20 in a two-dimensional space. (b) and (d) show the five clusters and a reference point (indicated as × with the class label "R") on t-SNE visualisation results in a three-dimensional space.

Figure 4:
AUC_{RNX} of the Gaussian kernel and the Isolation kernel on 5 subspace clusters with different dimensionality. The parameters for each algorithm are tuned according to Table 4.

Figure 6:
FIt-SNE visualisation results with the Gaussian kernel on the MNIST and MNIST8M datasets.

Figure 7:
opt-SNE visualisation results with the Gaussian kernel on three datasets.

Table 3:
Time complexities of t-SNE in steps (1) kernel building, (2) computing the similarity, and (3) mapping from high to low dimensions. r is the number of iterations used for the bandwidth search of the Gaussian kernel; s is the number of iterations in the t-SNE mapping; m (≤ n) is the subsample size used for the mapping. For small datasets, m = n.

Table 5:
Evaluation results on real-world datasets. For each dataset, the best performer, GK (Gaussian kernel) or IK (Isolation kernel), w.r.t. each evaluation measure is boldfaced. Note that higher AUC_{RNX} and CH scores indicate a better cluster distribution, while a lower DB score indicates a better cluster distribution.

Table 7:
CPU runtime (seconds) of t-SNE on four real-world datasets.

Table 8:
t-SNE visualisation results on the MNIST and MNIST8M datasets.