added tsne-visualisations #2012

Merged
merged 1 commit on Nov 10, 2024
docs/machine-learning/tsne-visualization.md (96 additions, 0 deletions)
---
id: tsne-visualization
title: t-Distributed Stochastic Neighbor Embedding (t-SNE) Algorithm
sidebar_label: t-SNE Visualization
description: "An overview of t-SNE, a popular technique for visualizing high-dimensional data in two or three dimensions."
tags: [machine learning, data visualization, dimensionality reduction, t-SNE, algorithms]
---

### Definition:
**t-Distributed Stochastic Neighbor Embedding (t-SNE)** is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional data. By mapping data points to a lower-dimensional space (typically two or three dimensions), t-SNE preserves the local structure of the data, making patterns and clusters more discernible.

### Characteristics:
- **Nonlinear Dimensionality Reduction**:
  Unlike linear techniques such as PCA, t-SNE captures complex, nonlinear relationships between data points, making it suitable for data with intricate structure.

- **Focus on Local Structure**:
t-SNE emphasizes preserving the relative distances of nearby points while de-emphasizing larger pairwise distances. This helps reveal the underlying structure in clusters of data.

### How It Works:
t-SNE minimizes the divergence between two distributions: one that measures pairwise similarities in the original high-dimensional space and another that measures them in the lower-dimensional space. The algorithm works as follows (the formulas behind these steps are written out after the list):

1. **Pairwise Similarities**:
Calculate the pairwise similarities between points using a Gaussian distribution in the high-dimensional space.

2. **Low-Dimensional Mapping**:
Initialize the data points randomly in the lower-dimensional space and compute their similarities using a Student’s t-distribution (hence "t-SNE").

3. **Optimization**:
Minimize the Kullback–Leibler divergence between the two similarity distributions using gradient descent.
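
The quantities in these steps can be written out explicitly. In the standard formulation (van der Maaten & Hinton, 2008), with high-dimensional points `x_i`, low-dimensional counterparts `y_i`, and per-point Gaussian bandwidths `sigma_i` chosen to match the target perplexity:

```latex
% Conditional similarity of point x_j given x_i (high-dimensional space)
p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
               {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Similarity in the low-dimensional space (Student's t with one degree of freedom)
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
              {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

% Objective minimized by gradient descent
\mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```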

### Problem Statement:
Integrate t-SNE visualization as a feature to aid users in interpreting and analyzing high-dimensional datasets by reducing them to 2D or 3D representations that can reveal clusters, patterns, and anomalies.

### Key Concepts:
- **Perplexity**:
  A hyperparameter that balances attention between the local and global structure of the data; it can be read as a smooth measure of the effective number of neighbors each point has. Typical values range from 5 to 50 (see the small numeric illustration after this list).

- **Learning Rate**:
Affects the speed of convergence. Too low a rate can result in poor convergence, while too high a rate can lead to data artifacts.

- **High-Dimensional Similarities**:
Defined using conditional probabilities based on Gaussian kernels.

- **Low-Dimensional Embedding**:
  Uses a heavy-tailed Student's t-distribution so that moderately dissimilar points can be placed far apart, mitigating the "crowding problem" in the lower-dimensional space.
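
To make perplexity concrete: it is defined as `2^H(P_i)`, where `H(P_i)` is the Shannon entropy (in bits) of the conditional distribution over the neighbors of point `i`. A minimal illustrative helper (not part of scikit-learn's API) might look like this:

```python
import numpy as np

def perplexity(p):
    """Perplexity of one row of conditional neighbor probabilities P_i."""
    p = p[p > 0]                       # drop zero entries (0 * log 0 -> 0)
    entropy = -np.sum(p * np.log2(p))  # Shannon entropy in bits
    return 2.0 ** entropy              # perplexity = 2^H(P_i)

# A uniform distribution over k neighbors has perplexity exactly k,
# which is why perplexity is often read as an "effective number of neighbors".
print(perplexity(np.full(30, 1 / 30)))  # -> 30.0
```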

### Example Usage:
Consider a dataset containing 1000 samples, each with 50 features:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Example data: synthetic dataset
X = np.random.rand(1000, 50) # 1000 samples, 50 features

# Apply t-SNE to reduce dimensions to 2D
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_embedded = tsne.fit_transform(X)

# Plot the 2D t-SNE visualization
plt.figure(figsize=(10, 6))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c='blue', alpha=0.6)
plt.title('t-SNE Visualization of High-Dimensional Data')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()
```
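
Uniform random data like the above has no intrinsic cluster structure, so the resulting plot is a single diffuse cloud. A variant on synthetic clustered data (using `make_blobs`; the sample and cluster counts are illustrative choices) shows the cluster-revealing behavior more clearly:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# 1000 samples in 50 dimensions, drawn from 5 well-separated Gaussian clusters
X, y = make_blobs(n_samples=1000, n_features=50, centers=5, random_state=42)

X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

plt.figure(figsize=(10, 6))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.title('t-SNE Visualization of Clustered Synthetic Data')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()
```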

### Considerations:
- **Computationally Intensive**:
t-SNE can be slow for large datasets due to the pairwise similarity calculations and optimization process. Various optimized implementations (e.g., Barnes-Hut t-SNE) help reduce the runtime.

- **Interpretation**:
  While t-SNE is excellent for revealing clusters, the distances between clusters and the apparent cluster sizes in the embedding are generally not meaningful and should not be over-interpreted; only the local neighborhood structure is reliably preserved.

- **Preprocessing**:
  It’s beneficial to scale the data and, for very wide datasets, apply PCA as an initial reduction step; both improve the quality and the runtime of t-SNE (see the sketch after this list).
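
A sketch of that preprocessing advice using standard scikit-learn components (the dimension counts are illustrative assumptions, not fixed recommendations):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Standardize features, then let PCA denoise and reduce to ~30 dimensions
X_scaled = StandardScaler().fit_transform(X)        # X as in the example above
X_reduced = PCA(n_components=30).fit_transform(X_scaled)

# scikit-learn defaults to the Barnes-Hut approximation (method='barnes_hut')
# for n_components <= 3, keeping the runtime roughly O(N log N)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_reduced)
```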

### Benefits:
- Reveals hidden structures in data that linear methods may miss.
- Suitable for exploring complex datasets such as images, word embeddings, or genomic data.
- Enhances data analysis, pattern recognition, and exploratory data analysis (EDA).

### Challenges:
- Requires careful tuning of hyperparameters like perplexity and learning rate.
- Sensitive to scale; data preprocessing is crucial for optimal results.
- The embedding can vary between runs because the optimization is non-convex and the initialization is random; fixing `random_state` makes results reproducible.

### Conclusion:
t-SNE has become a powerful tool for visualizing and understanding high-dimensional data, especially in cases where simpler techniques fail to reveal meaningful structures. Integrating t-SNE visualizations into projects provides users with an intuitive way to explore complex datasets, spot clusters, and identify underlying relationships.
