From 3a5fe0eaa97604f09e03a2f237812faa84f17857 Mon Sep 17 00:00:00 2001
From: Priya <150767072+priyashuu@users.noreply.github.com>
Date: Sun, 10 Nov 2024 11:40:04 +0530
Subject: [PATCH] added tsne-visualization

---
 docs/machine-learning/tsne-visualization.md | 96 +++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 docs/machine-learning/tsne-visualization.md

diff --git a/docs/machine-learning/tsne-visualization.md b/docs/machine-learning/tsne-visualization.md
new file mode 100644
index 000000000..18e3e0a31
--- /dev/null
+++ b/docs/machine-learning/tsne-visualization.md
@@ -0,0 +1,96 @@
---
id: tsne-visualization
title: t-Distributed Stochastic Neighbor Embedding (t-SNE) Algorithm
sidebar_label: t-SNE Visualization
description: "An overview of t-SNE, a popular technique for visualizing high-dimensional data in two or three dimensions."
tags: [machine learning, data visualization, dimensionality reduction, t-SNE, algorithms]
---

### Definition:
**t-Distributed Stochastic Neighbor Embedding (t-SNE)** is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional data. By mapping data points to a lower-dimensional space (typically two or three dimensions), t-SNE preserves the local structure of the data, making patterns and clusters easier to discern.

### Characteristics:
- **Nonlinear Dimensionality Reduction**:
  Unlike linear techniques such as PCA, t-SNE captures complex, nonlinear relationships between data points, making it suitable for data with intricate structure.

- **Focus on Local Structure**:
  t-SNE emphasizes preserving the relative distances of nearby points while de-emphasizing larger pairwise distances. This helps reveal the underlying structure within clusters of data.

### How It Works:
t-SNE minimizes the divergence between two distributions: one that measures pairwise similarities in the original high-dimensional space and another that measures them in the lower-dimensional embedding. The algorithm works as follows:

1. **Pairwise Similarities**:
   Compute pairwise similarities between points in the high-dimensional space using a Gaussian distribution.

2. **Low-Dimensional Mapping**:
   Initialize the data points randomly in the lower-dimensional space and compute their similarities using a Student's t-distribution (hence "t-SNE").

3. **Optimization**:
   Minimize the Kullback–Leibler divergence between the two similarity distributions using gradient descent.

### Problem Statement:
Integrate t-SNE visualization as a feature to help users interpret and analyze high-dimensional datasets by reducing them to 2D or 3D representations that reveal clusters, patterns, and anomalies.

### Key Concepts:
- **Perplexity**:
  A hyperparameter that balances attention between local and global aspects of the data; it can be loosely read as a smooth measure of the effective number of neighbors. Typical values range from 5 to 50.

- **Learning Rate**:
  Affects the speed of convergence. Too low a rate can result in poor convergence, while too high a rate can produce artifacts in the embedding.

- **High-Dimensional Similarities**:
  Defined using conditional probabilities based on Gaussian kernels (a minimal sketch of this computation follows the list).

- **Low-Dimensional Embedding**:
  Uses a Student's t-distribution to prevent "crowding" in the lower-dimensional space, so that distant points stay separated.
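To make the first step concrete, here is a minimal numpy sketch of the Gaussian conditional probabilities described above. The function name `conditional_affinities` and the fixed `sigma` are illustrative assumptions, not scikit-learn API; real implementations tune a per-point bandwidth so that each row's entropy matches the chosen perplexity:

```python
import numpy as np

def conditional_affinities(X, sigma=1.0):
    """Gaussian similarities p(j|i), row-normalized (fixed sigma for brevity).

    t-SNE proper binary-searches a per-point sigma_i so that each row's
    entropy matches log(perplexity); a constant sigma keeps this sketch short.
    """
    # Squared Euclidean distances between all pairs of points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)  # each row sums to 1

P = conditional_affinities(np.random.rand(100, 10))
print(P.shape, P[0].sum())  # (100, 100), ~1.0
```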
### Example Usage:
Consider a dataset containing 1000 samples, each with 50 features:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Example data: a synthetic dataset of uniform noise
# (real data would typically produce visible cluster structure)
X = np.random.rand(1000, 50)  # 1000 samples, 50 features

# Apply t-SNE to reduce the data to 2D
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot the 2D t-SNE embedding
plt.figure(figsize=(10, 6))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c='blue', alpha=0.6)
plt.title('t-SNE Visualization of High-Dimensional Data')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()
```

### Considerations:
- **Computationally Intensive**:
  t-SNE can be slow for large datasets because of the pairwise similarity calculations and the iterative optimization. Optimized implementations such as Barnes-Hut t-SNE (scikit-learn's default method) reduce the runtime considerably.

- **Interpretation**:
  While t-SNE is excellent for revealing clusters, distances between clusters and apparent cluster sizes in the embedding are not reliably meaningful; only the local, intra-cluster structure should be read closely.

- **Preprocessing**:
  It's beneficial to scale the data and, for very wide datasets, apply an initial PCA reduction before running t-SNE to improve both quality and speed (the first sketch after the conclusion shows this pipeline).

### Benefits:
- Reveals hidden structure in data that linear methods may miss.
- Well suited to exploring complex datasets such as images, word embeddings, or genomic data.
- Enhances pattern recognition and exploratory data analysis (EDA).

### Challenges:
- Requires careful tuning of hyperparameters such as perplexity and learning rate (the second sketch after the conclusion compares several perplexity values).
- Sensitive to feature scale; data preprocessing is crucial for good results.
- The visualization can vary between runs because the optimization is non-convex.

### Conclusion:
t-SNE is a powerful tool for visualizing and understanding high-dimensional data, especially where simpler techniques fail to reveal meaningful structure. Integrating t-SNE visualizations into a project gives users an intuitive way to explore complex datasets, spot clusters, and identify underlying relationships.
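Following up on the preprocessing consideration, here is a minimal sketch of a scale, then PCA, then t-SNE pipeline. The input shape and the choice of 50 PCA components are illustrative assumptions, not requirements:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 200)  # illustrative: 1000 samples, 200 features

# Standardize features, then use PCA to denoise and shrink the data
# to 50 dimensions before the (more expensive) t-SNE step
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=50, random_state=42).fit_transform(X_scaled)
X_embedded = TSNE(n_components=2, perplexity=30,
                  random_state=42).fit_transform(X_reduced)

print(X_embedded.shape)  # (1000, 2)
```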
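And a minimal sketch of the kind of perplexity sweep that hyperparameter tuning calls for. `make_blobs` stands in for real data so the effect of the setting is visible; the values 5, 30, and 50 simply span the typical range noted earlier:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic clustered data: 4 well-separated blobs in 50 dimensions
X, y = make_blobs(n_samples=500, n_features=50, centers=4, random_state=42)

# One embedding per perplexity value, plotted side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perplexity in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perplexity,
               random_state=42).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, alpha=0.6)
    ax.set_title(f'perplexity = {perplexity}')
plt.show()
```

---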