A tool for analyzing privacy policies using AI. This system processes privacy policy text segments and categorizes them according to standard privacy policy categories like data collection, sharing, retention, and user controls.
This repository includes three specialized AI agents:
- Policy Segmenter Agent: Breaks down privacy policies into meaningful segments
- Policy Annotator Agent: Analyzes and categorizes policy segments
- GDPR Compliance Agent: Evaluates policies against regulatory requirements
**Policy Segmentation**
- Breaks down privacy policies into distinct clauses
- Preserves original text and context
- Ensures segments are self-contained and meaningful
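For illustration, a segmenter run might turn a short policy into records like the ones below. This is a sketch only: the field names are assumptions, not the agent's actual output schema.

```python
# Hypothetical segmenter output: each segment keeps the original wording
# and stands on its own as a complete privacy statement.
policy_text = (
    "We collect your email address when you create an account. "
    "We may share aggregated usage data with our analytics partners."
)

segments = [
    {"segment_id": 1, "text": "We collect your email address when you create an account."},
    {"segment_id": 2, "text": "We may share aggregated usage data with our analytics partners."},
]
```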
**Policy Annotation**
- Categorizes segments into standard privacy categories:
  - First Party Collection/Use
  - Third Party Sharing/Collection
  - User Choice/Control
  - User Access, Edit, and Deletion
  - Data Retention
  - Data Security
  - Policy Change
  - Do Not Track
  - International and Specific Audiences
  - Other
- Provides detailed explanations for categorizations
- Compares results with human annotations
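These labels correspond to the `PRIVACY_LABELS` constant referenced in the development notes below. As a rough illustration (the actual definition in `src/agents/privacy_policy_analyzer.py` may use a richer structure with subcategories):

```python
# Illustrative only: the real PRIVACY_LABELS in
# src/agents/privacy_policy_analyzer.py may be structured differently.
PRIVACY_LABELS = [
    "First Party Collection/Use",
    "Third Party Sharing/Collection",
    "User Choice/Control",
    "User Access, Edit, and Deletion",
    "Data Retention",
    "Data Security",
    "Policy Change",
    "Do Not Track",
    "International and Specific Audiences",
    "Other",
]
```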
**Regulatory Compliance**
- Evaluates policies against GDPR requirements
- Checks for required disclosures and practices
- Identifies potential compliance gaps
- Generates compliance reports
- Clone the repository:
git clone <repository-url>
cd policy-analyzer
- Create a `.env` file with your API keys:
OPENAI_API_KEY=your_key_here
- Start the services with Docker Compose:
docker compose watch
- Access the Streamlit app by navigating to http://localhost:8501 in your web browser.
- The agent service API will be available at http://localhost:80. You can also use the OpenAPI docs at http://localhost:80/redoc (a quick way to verify the service is reachable is sketched below).
- Use `docker compose down` to stop the services.
This setup allows you to develop and test your changes in real-time without manually restarting the services.
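If you want to confirm the agent service is reachable before running an experiment, a quick check like the one below can help. It assumes the service exposes an OpenAPI schema at `/openapi.json` (typical for services that serve Redoc docs) and requires the `requests` package; adjust the URL if your setup differs.

```python
# Sanity check that the agent service is up.
# Assumption: the service exposes /openapi.json (common for FastAPI-style apps with /redoc).
import requests

resp = requests.get("http://localhost:80/openapi.json", timeout=5)
resp.raise_for_status()
print("Service is up; API exposes", len(resp.json().get("paths", {})), "paths")
```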
- Run the analysis script:
docker exec policy-analyzer-agent_service-1 python agents/run_experiment.py
- Check results in `data/analysis_results.json`:
cat data/analysis_results.json
The results will contain:
- Original policy segments
- Human annotations (if provided)
- AI model analysis with categories and explanations
- Match status between human and AI annotations
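To inspect the results programmatically, a short script along these lines can summarize agreement between the model and the human annotations. It assumes the file holds a list of segment records shaped like the example shown later in this README.

```python
# Summarize how often the model's category exactly matched a human annotation,
# assuming analysis_results.json contains a list of records like the example below.
import json

with open("data/analysis_results.json") as f:
    results = json.load(f)

exact = sum(1 for r in results if r.get("matching_details", {}).get("exact_match"))
print(f"Exact matches: {exact}/{len(results)}")
```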
.
├── data/ # Data files
│ ├── records_test.json # Example input data
│ └── analysis_results.json # Generated results
├── src/ # Source code
│ ├── agents/ # AI agent implementations
│ │ ├── privacy_policy_analyzer.py # Main analyzer logic
│ │ └── run_experiment.py # Analysis script
│ ├── service/ # API service
│ ├── schema/ # Data models
│ └── client/ # API client
├── docker/ # Docker configuration
│ ├── Dockerfile.app # Streamlit app container
│ └── Dockerfile.service # Agent service container
└── compose.yaml # Docker Compose configuration
The project uses Docker Compose Watch for development. Changes to source files will automatically trigger container rebuilds.
To add new privacy policy categories:
- Update `PRIVACY_LABELS` in `src/agents/privacy_policy_analyzer.py`
- Update the system prompt to include the new categories
- Rebuild the containers:
docker compose build
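As a sketch, adding a hypothetical "Children's Privacy" category might look like the following; the exact shape of `PRIVACY_LABELS` and how the system prompt is built in `src/agents/privacy_policy_analyzer.py` may differ.

```python
# Hypothetical edit to src/agents/privacy_policy_analyzer.py; check the real
# PRIVACY_LABELS definition before changing it.
PRIVACY_LABELS = [
    "First Party Collection/Use",
    "Third Party Sharing/Collection",
    # ... existing categories ...
    "Other",
    "Children's Privacy",  # new category
]

# The system prompt should mention the new category as well, e.g. by rendering
# the label list into the prompt template.
```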
Use the provided `records_test.json` to verify your setup and test new features.
Contributions are welcome! Please feel free to submit a Pull Request.
The `analysis_results.json` file contains detailed results for each policy segment. Here's an example:
{
"document": 3828,
"policyURL": "http://www.kraftrecipes.com/about/privacynotice.aspx",
"segment": "Effective Date: May 7, 2015 Kraft Site Privacy Notice...",
"human_annotations": [
{
"Other": {
"Other Type": "Introductory/Generic"
}
},
{
"Policy Change": {
"Change Type": "Unspecified",
"User Choice": "Unspecified",
"Notification Type": "General notice in privacy policy"
}
}
],
"model_analysis": {
"category": {
"Other": {
"Other Type": "Introductory/Generic"
}
},
"explanation": "This segment serves as an introductory statement..."
},
"matching_details": {
"top_level_match": true,
"exact_match": true,
"matching_subcategories": 1,
"total_subcategories": 1,
"subcategory_match_ratio": 1.0,
"matched_categories": [
{
"Other": {
"Other Type": "Introductory/Generic"
}
}
]
}
}
The `matching_details` field provides metrics on how well the model's categorization matched the human annotations:
- `top_level_match`: True if the main category (e.g., "Other") matches at least one human annotation
- `exact_match`: True if there is a perfect match between the model's category and one of the human annotations (including all subcategories)
- `matching_subcategories`: Number of matching subcategories
- `total_subcategories`: Total number of subcategories to match
- `subcategory_match_ratio`: Ratio of matching subcategories (1.0 = perfect match)
- `matched_categories`: List of human annotation categories that matched
In this example, the model correctly identified the segment as "Other: Introductory/Generic", matching one of the human annotations perfectly. Note that human annotators may identify multiple applicable categories (in this case, both "Other" and "Policy Change"), while the model currently focuses on the primary category.
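The top-level and ratio metrics can be reproduced from a record with a few lines of Python; the sketch below assumes the JSON structure shown in the example above.

```python
# Recompute top_level_match and subcategory_match_ratio for one record,
# assuming the structure of the example record above.
def score_record(record):
    model_cat = record["model_analysis"]["category"]      # e.g. {"Other": {...}}
    model_label = next(iter(model_cat))                    # top-level category name
    human_labels = [next(iter(ann)) for ann in record["human_annotations"]]

    top_level_match = model_label in human_labels

    # Compare subcategory values against the first human annotation with the same label.
    matched, total = 0, len(model_cat[model_label])
    for ann in record["human_annotations"]:
        if model_label in ann:
            matched = sum(1 for k, v in model_cat[model_label].items()
                          if ann[model_label].get(k) == v)
            break

    ratio = matched / total if total else 0.0
    return top_level_match, ratio
```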
**Policy Segmenter Agent**
- Breaks down privacy policies into logical segments
- Uses LangGraph for workflow management
- Ensures each segment contains one complete privacy statement
- Preserves original text exactly as written
**Policy Annotator Agent**
- Analyzes individual policy segments
- Categorizes according to standard privacy categories
- Provides detailed explanations for categorizations
- Compares results with human annotations for accuracy
**GDPR Compliance Agent**
- Evaluates policies against key GDPR requirements (a minimal checklist sketch follows this list):
  - Identity and contact details
  - Processing purposes and legal basis
  - Data recipient information
  - International transfer safeguards
  - Retention periods
  - Data subject rights
  - Consent withdrawal
  - Complaint procedures
  - Automated decision-making details
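To make the evaluation concrete, a compliance check can be thought of as a mapping from each requirement to a finding. The structure below is an illustrative assumption, not the agent's actual report format.

```python
# Illustrative shape of a GDPR compliance report; the agent's real output format may differ.
gdpr_requirements = [
    "Identity and contact details",
    "Processing purposes and legal basis",
    "Data recipient information",
    "International transfer safeguards",
    "Retention periods",
    "Data subject rights",
    "Consent withdrawal",
    "Complaint procedures",
    "Automated decision-making details",
]

compliance_report = {
    requirement: {"addressed": False, "evidence_segments": []}
    for requirement in gdpr_requirements
}
```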