Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global view SafeTrace algorithm #36

Open
cankisagun opened this issue Mar 30, 2020 · 5 comments
Open

Global view SafeTrace algorithm #36

cankisagun opened this issue Mar 30, 2020 · 5 comments
Labels

Comments

@cankisagun
Copy link
Contributor

cankisagun commented Mar 30, 2020

This issue builds on issue #2

In order to achieve a certain level of privacy, we propose to run a lightweight clustering algorithm inside the SafeTrace secure DB and passing results to the maps API. This prevents any specific user data to leak to the map API.

The clustering algorithm should return << lan, lat, weight >> to the maps API for each data point to be represented on the map

Next steps:

  • Determine how to account for high risk symptoms in the global view algorithm (Suggested approach: overlay both positive covid19 and high risk symptoms on the same map with different colors)
  • Determine clustering logic based on sample data size (i.e. create 100 clusters for 10K users vs. 5 clusters for 10 users)
  • Assess Differential Privacy options

UPDATE: The clustering algorithm runs inside the enclave similar to individual reporting algorithm, which can be found here

@theoturner
Copy link

  • I agree on the positive vs high risk in different colors. While there is correlation, I expect that if you drew the curves, you'd see changes in high risk frequency for a given area not to be in the same time step as positive cases. Visualise it, and you may find a leading indicator relationship (I'd consider the first and second derivatives wrt time). You might even be able to develop a predictor using a regression model.
  • The clustering algorithm is simple: duplicate points according to weight, the run k-means (easy with cogset in Rust).
  • Choosing the number of clusters is not an easy problem to solve, and generally is difficult to do without a human present. Silhouetting is the most straightforward way to automate cluster number choice, though you can also run through many numbers of clusters and minimise MSE while not having too many clusters (elbow method). These approaches typically take a lot of effort to program reliably, and a human observing the data based on context tends to be the better option. I would allow the user to input the number of clusters, and maybe display a number selected by yourselves by default.
  • In terms of preserving privacy, it's pretty easy: output cluster centroids, rather than the data points.

@lacabra
Copy link
Contributor

lacabra commented Apr 3, 2020

Thanks @theoturner for your outline and proposed solution. It makes sense at a high level. Given that I sense your familiarity with Rust, would you be in a position to submit a PR to introduce these improvements?

Alternatively, you could post small snippets of code as per your comments to this thread, and someone else could "glue" them together in fully functioning code and submit a PR instead.

Thoughts?

@theoturner
Copy link

K-means in Rust for lat/long:

We assume that lat/long is representative of a flat plane. For most cases this is fine, particularly smaller surface areas (@Akhil325 could potentialy comment further). I've personally never had problems with this abstraction and clustering, and given the ongoing dispute about 2D projections of the globe, It's likely not worth your time to consider projections.

Assume you have locations:

struct Location {
    latitude: f64, // Or whatever numerical type
    longitude: f64,
}

Cluster (assume argument num_clusters e.g user-supplied - I'd advise this before attempting silhouetting):

use cogset::{Euclid, Kmeans};
let mut eucvec: Vec<Euclid<_>> = Vec::new();
for point in &locations {
    eucvec.push(Euclid([point.latitude as f64, point.longitude as f64]));
}
let kmeans = Kmeans::new(&eucvec, num_clusters as usize);
let clusters = kmeans.clusters();

Outputting just the centroids:

let mut clustvec: Vec<(f64, f64)> = Vec::new();
for result in &clusters {
    clustvec.push(((result.0).0[0], (result.0).0[1]));
} 
format!("{:?}", clustvec)

@theoturner
Copy link

Drawing it on Google Maps

I also see you've linked an issue re: google maps. Cluster data is relatively easy to draw on maps, e.g. in React.

First, let's create a drawable var of our points, which you could extract from your props, e.g. say you had prop clusters imported from your Rust:

const Point = ({ text }) => <div>{text}</div>;
var points = [];
for (var i = 0; i < this.props.clusters.length; i++) {
    points.push(<Point
        lat={this.props.clusters[i][0]}
        lng={this.props.clusters[i][1]}
        text="X"
    />);
}

Then, draw a google map:

import GoogleMapReact from 'google-map-react';
<GoogleMapReact
bootstrapURLKeys={{ key: process.env.REACT_APP_API_KEY }}
defaultCenter={{ lat: 51.507221, lng: -0.127600}}
defaultZoom={11}
{points}
</GoogleMapReact>

@theoturner
Copy link

For more details, see my implementation of clustering + maps for Enigma, APEX. In this implementation, the contract only stores ints and uses enigma macros such as eformat!, there's some input sanitation, and the google map is drawn with a custom theme.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants