Given a new set of n samples of vectors in R^d

histogram of feature types (binary, integer, non-negative, character, string etc.)
NaNs per row? Per column? Infs per row? Per column? "Zero" variance rows? columns?
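
The NaN / Inf / zero-variance checks above could be tallied along these lines. This is a minimal sketch, assuming the data is an (n, d) NumPy float array; the helper name `quality_counts` and the tolerance `tol` are hypothetical, and "zero" variance is taken to mean variance of the finite entries at or below `tol`.

```python
import warnings
import numpy as np

def quality_counts(X, tol=1e-12):
    """Count NaNs, Infs and near-constant rows/columns of an (n, d) array."""
    X = np.asarray(X, dtype=float)
    nan_mask = np.isnan(X)
    inf_mask = np.isinf(X)
    # Mask out non-finite entries so they do not poison the variance.
    finite = np.where(nan_mask | inf_mask, np.nan, X)
    with warnings.catch_warnings():
        # nanvar warns on slices with no finite entries; the result is NaN there.
        warnings.simplefilter("ignore", RuntimeWarning)
        col_var = np.nanvar(finite, axis=0)
        row_var = np.nanvar(finite, axis=1)
    return {
        "nan_per_row": nan_mask.sum(axis=1),
        "nan_per_col": nan_mask.sum(axis=0),
        "inf_per_row": inf_mask.sum(axis=1),
        "inf_per_col": inf_mask.sum(axis=0),
        # NaN variances compare False, so empty slices are not flagged here.
        "zero_var_rows": row_var <= tol,
        "zero_var_cols": col_var <= tol,
    }

X = np.array([[1.0, 2.0, np.nan],
              [1.0, 5.0, np.inf],
              [1.0, 7.0, 3.0]])
counts = quality_counts(X)
```

Note that a column with a single finite value is flagged as zero variance under this convention.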

Heat map of raw data that fits on screen (k-means++ to select 1000 samples, CUR to select 100 dimensions)
1st moment statistics
1. mean & median (line plot + heatmap)
2nd moment statistics
1. correlation matrix (heatmap)
2. matrix of energy distances (heatmap)
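
A rough sketch of the two matrices in this list, assuming both are taken between dimensions (columns), which the notes do not state. The energy distance here is the plain sample (V-statistic) form, sqrt(2 E|X-Y| - E|X-X'| - E|Y-Y'|), written out in NumPy; the function names are hypothetical.

```python
import numpy as np

def energy_distance_1d(x, y):
    """Sample energy distance between two 1D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    exy = np.abs(x[:, None] - y[None, :]).mean()
    exx = np.abs(x[:, None] - x[None, :]).mean()
    eyy = np.abs(y[:, None] - y[None, :]).mean()
    # Clip tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(2.0 * exy - exx - eyy, 0.0)))

def energy_distance_matrix(X):
    """Symmetric (d, d) matrix of energy distances between columns of X."""
    d = X.shape[1]
    D = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            D[i, j] = D[j, i] = energy_distance_1d(X[:, i], X[:, j])
    return D

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] += 3.0                      # shift one dimension so it stands out
C = np.corrcoef(X, rowvar=False)    # (d, d) correlation matrix
D = energy_distance_matrix(X)       # (d, d) energy distance matrix
```

Both `C` and `D` can be passed directly to a heatmap routine. The pairwise differences cost O(n^2) memory per column pair, so this form suits a subsample rather than the full data.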
density estimate
1. 1D marginals (violin + jittered scatter plot of each dimension; if n > 1000 or d > 10, density heatmaps)
2. 2D marginals (pairs plots for top ~8 dimensions; if n*d > 8000, 2D heatmaps)
Outlier plot
cluster analysis (IDT++)
1. BIC curves
2. mean line plot
3. covariance matrix heatmaps
spectral analysis
1. cumulative variance (with elbows) of data matrix
2. eigenvectors (pairs plot + heatmap)
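
The spectral analysis step might look roughly like this. It is a sketch under assumptions: the data matrix is mean-centered before the SVD, and the elbow is picked as the point on the cumulative curve farthest above the chord joining its endpoints, which is one heuristic among several. The names `spectral_summary` and `elbow_index` are hypothetical.

```python
import numpy as np

def spectral_summary(X):
    """Return cumulative explained variance and principal directions (rows)."""
    Xc = X - X.mean(axis=0)                      # assumption: center first
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (X.shape[0] - 1)              # variance per component
    cum = np.cumsum(var) / var.sum()
    return cum, Vt

def elbow_index(cum):
    """0-based index of the point farthest above the endpoint chord."""
    m = len(cum)
    if m < 3:
        return 0
    k = np.arange(m)
    chord = cum[0] + (cum[-1] - cum[0]) * k / (m - 1)
    return int(np.argmax(cum - chord))

# Synthetic data with two dominant directions plus small noise.
rng = np.random.default_rng(0)
X = np.zeros((500, 6))
X[:, :2] = rng.normal(size=(500, 2)) * np.array([3.0, 2.0])
X += 0.01 * rng.normal(size=(500, 6))

cum, Vt = spectral_summary(X)
elbow = elbow_index(cum)     # number of components to keep is elbow + 1
```

The rows of `Vt` are the eigenvectors of the sample covariance, so they can feed the pairs plot and heatmap listed above.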

heatmap, sorted by child node
1st order stats per child + difference between children
2nd order stats per child + difference between children
density estimate per child
1. 1D marginals: violin plot, separated by child node
2. 2D marginals: pairs plots, color coded by cluster, Voronoi diagram overlaid
outlier plot for each child node
cluster analysis per child
spectral analysis per child

raw
linear options
- linear squash between 0 & 1
- mean subtract and standard deviation divide
- median subtract and median absolute deviation divide
- make unit norm
nonlinear
- rank
- sigmoid squash
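
The transformation options above can be sketched in NumPy as follows. Several choices are assumptions not fixed by the notes: the linear and rank transforms act per column, "make unit norm" is applied per row with the L2 norm, and the sigmoid squash is applied to z-scored values. All function names are hypothetical.

```python
import numpy as np

def _safe(scale):
    """Replace zero scales with 1 so constant columns map to 0, not NaN."""
    scale = np.asarray(scale, dtype=float).copy()
    scale[scale == 0] = 1.0
    return scale

def minmax(X):
    """Linear squash of each column into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / _safe(hi - lo)

def zscore(X):
    """Subtract the mean, divide by the standard deviation."""
    return (X - X.mean(axis=0)) / _safe(X.std(axis=0))

def robust_z(X):
    """Subtract the median, divide by the median absolute deviation."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return (X - med) / _safe(mad)

def unit_norm(X):
    """Scale each row to unit L2 norm (assumption: rows, not columns)."""
    return X / _safe(np.linalg.norm(X, axis=1, keepdims=True))

def rank_transform(X):
    """0-based rank of each value within its column (ties broken by order)."""
    return X.argsort(axis=0).argsort(axis=0)

def sigmoid_squash(X):
    """Logistic function of z-scores, written with tanh for stability."""
    return 0.5 * (1.0 + np.tanh(0.5 * zscore(X)))

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
```

Keeping every transform as a function of the whole matrix makes it easy to rerun the same set of plots on each option.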

use Geometric median and robust estimation in Banach spaces to obtain robust estimates of 1st and 2nd moments
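
As a point of reference, the geometric median of a point cloud in Euclidean space can be approximated with Weiszfeld's iteration. The sketch below is only that standard iteration, not the estimation procedure of the work cited above; the function name, starting point, and stopping parameters are assumptions.

```python
import numpy as np

def geometric_median(X, n_iter=200, tol=1e-9, eps=1e-12):
    """Approximate the point minimizing the sum of Euclidean distances to rows of X."""
    X = np.asarray(X, dtype=float)
    y = np.median(X, axis=0)            # start at the coordinate-wise median
    for _ in range(n_iter):
        dist = np.linalg.norm(X - y, axis=1)
        w = 1.0 / np.maximum(dist, eps)  # guard against division by zero
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# 99 well-behaved points plus one gross outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(99, 3)), np.full((1, 3), 1e6)])

robust_center = geometric_median(X)   # stays near the bulk of the data
plain_mean = X.mean(axis=0)           # dragged far away by the outlier
```

A robust 2nd-moment estimate could reuse the same routine on flattened per-block covariance matrices, though that construction is an assumption here rather than something the notes spell out.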

Provide feedback