Revisiting Access Control #839

Open
danielballan opened this issue Jan 10, 2025 · 8 comments

Comments

@danielballan
Member

danielballan commented Jan 10, 2025

Tiled needs first-class support for authorization

We currently support a pluggable Access Control Policy interface that derives access controls from metadata. This pushes a security concern into a space that was intended to be for science. It would be better to have a dedicated field in Tiled for access control information associated with each node, separate from its metadata.

This prompts the question, what should that field contain?

Of course, the security policy itself must still be customizable, to integrate with various facilities' existing systems and policies. Tiled can ship some policies for common use cases.

RBAC fits our use case and scale

Claim: RBAC as it was originally formulated [1] makes frequent operations easier to reason about and lower-risk than "users and groups".

Building up layers of indirection...

The most direct thing we could do is assign a list of users to each dataset, with associated rights (e.g. read-only, read-write). This is sufficiently expressive to describe anything we might want to do! It's more or less how access controls work in SharePoint. But it is difficult to manage correctly at large scales.

A layer of indirection helps make this more manageable. We could assign a list of groups to each data set---again with associated rights. This is how filesystem ACLs work. It is, again, technically sufficient to describe anything we might want to do. In many common cases, it is easier to manage than lists of users.

As an example, suppose we assign these groups with these rights to 1000 datasets: {"proposal-12345": "r", "smi": "r", "csx": "r", "data-admins": "rw"}. Now consider some use cases:

  • Adding and removing users from proposals is easy. Just update the membership of `proposal-12345`. No problem there.
  • If we need to alter the permissions given to any of these groups, we need to touch every dataset.
  • If we want to grant temporary access to data from this proposal, for the purposes of support, we have to touch every dataset. Worse, if new data is being generated, by acquisition or just data processing, new and old data have to be updated consistently. The temporary change must be reverted correctly afterward.
  • When a dataset is created, who applies these permissions? How do they know which ones to apply, and how do we manage who is allowed to apply which permissions?
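
To make the cost concrete, here is a minimal sketch of the groups-per-dataset model. It is an illustration only: the in-memory dict, the `grant` helper, and the `support-staff` group are hypothetical and not part of Tiled.

```python
# Hypothetical model: every dataset carries its own copy of the group ACL.
GROUP_ACL = {"proposal-12345": "r", "smi": "r", "csx": "r", "data-admins": "rw"}
datasets = {f"dataset-{i}": dict(GROUP_ACL) for i in range(1000)}


def grant(datasets, group, rights):
    """Grant `rights` to `group` on every dataset; return how many were touched."""
    touched = 0
    for acl in datasets.values():
        acl[group] = rights
        touched += 1
    return touched


# Temporary support access means one write per dataset,
# and the same number of writes again to revert it.
touched = grant(datasets, "support-staff", "r")
```

Any dataset created between the grant and the revert also has to be given, and later stripped of, the same entry.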

Adding "groups" made the system easier to operate correctly. Adding one more layer of indirection makes it even easier. Instead of assigning groups to datasets, we assign one or more access tags to datasets, and define a security policy that resolves each user's rights from these tags. (This is RBAC as originally proposed.)

Suppose we assign the access tags ["proposal-12345"] to all 1000 datasets and define a security policy that maps proposal-12345 to {"proposal-12345": "r", "smi": "r", "csx": "r", "data-admins": "rw"}. Notice that the groups in this policy have different permissions, so we cannot just make a new group proposal-12345.

  • Adding and removing users from proposals is still easy, same as before.
  • If we need to alter the permissions a given group has on a proposal, we change the policy.
  • If we need to grant temporary permission, we change the policy and then change it back.
  • Tags are assigned "tag owners" (a list of users and/or groups) with authority to apply that tag to datasets that they create. So, for example, users on proposal-12345 can apply the proposal-12345 tag to their data in order to share it with others on their proposal. The policy of what that tag means is centrally controlled by the data admin team.
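
A minimal sketch of that resolution step, assuming a plain dict for the policy and for the dataset-to-tags mapping. The names `POLICY`, `dataset_tags`, `rights_for`, and the `support-staff` group are hypothetical.

```python
# Hypothetical policy: access tag -> {group: rights}. Defined once, centrally.
POLICY = {
    "proposal-12345": {"proposal-12345": "r", "smi": "r", "csx": "r", "data-admins": "rw"},
}
# Each dataset stores only its tags, never the groups or rights themselves.
dataset_tags = {f"dataset-{i}": ["proposal-12345"] for i in range(1000)}


def rights_for(user_groups, dataset, policy=POLICY, tags=dataset_tags):
    """Resolve a user's rights on a dataset from its tags and the policy."""
    rights = set()
    for tag in tags[dataset]:
        for group, granted in policy.get(tag, {}).items():
            if group in user_groups:
                rights.update(granted)  # "rw" contributes both "r" and "w"
    return rights


# Temporary support access is a single policy edit, and a single edit to revert.
POLICY["proposal-12345"]["support-staff"] = "r"
during = rights_for({"support-staff"}, "dataset-999")
del POLICY["proposal-12345"]["support-staff"]
after = rights_for({"support-staff"}, "dataset-999")
```

Here `during` contains read access and `after` is empty, and no dataset record was modified in between.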

Again, all of these could be expressed using lists of users and groups---or just lists of users!---but RBAC makes operations easier to manage at scale and puts responsibility where it belongs. Users tag data using the tags that they "own", and the data admin team manages the details of how that maps onto the organization's groups and permissions.

(Aside: As a thought exercise, should we keep going, adding a layer of indirection beyond RBAC? The next level up from here might be groups of access tags. It's not clear that this makes anything easier to reason about.)

Footnotes

  1. The 10-page 1992 paper that introduced RBAC, from authors at NIST (incidentally, at a conference in Baltimore). See a nice explanation in a blog post by Avery Pennarun, co-founder of Tailscale

@danielballan
Member Author

Example

Only Tiled admins can create "access tags" and define their policies:

c.context.admin.set_policy(
    # name of access tag
    "proposal-12345",
    # Who can apply this access tag to datasets that they create?
    owners=["group:proposal-12345-users"],
    # What access does it confer?
    entitlements=[
        {"groups": ["proposal-12345-users", "smi-beamline-staff", "data-admins"], "scopes": ["read"]},
        {"subjects": ["data-admins"], "scopes": ["write"]}
    ],
)

Users create data and can tag it to share it:

# By default, data is accessible (read/write) only by the user who created it.
x = c.write_array([1,2,3])

# Later, a user can apply tags that they are a 'tag owner' on.
# They might share a processed data result like this:
x.access_tags.add("proposal-12345")

# This could also just be done at creation time:
y = c.write_array([1,2,3], access_tags=["proposal-12345"])

Other tags like public or calibration might be created, with appropriate policies around what entitlements they confer and which "tag owners" can apply them.
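
For illustration, the "tag owner" check could be as simple as the sketch below. This is not an existing Tiled API: `TAG_OWNERS`, `can_apply_tag`, and the `user:` prefix are assumptions, modeled on the `group:` prefix used in `owners` above.

```python
# Hypothetical registry: access tag -> principals allowed to apply it.
TAG_OWNERS = {
    "proposal-12345": {"group:proposal-12345-users"},
    "public": {"group:data-admins"},
}


def can_apply_tag(tag, user, user_groups, tag_owners=TAG_OWNERS):
    """Return True if the user, or any group they belong to, owns the tag."""
    principals = {f"user:{user}"} | {f"group:{g}" for g in user_groups}
    return bool(principals & tag_owners.get(tag, set()))


# A proposal member may tag their own data with the proposal tag,
# but may not make it public.
ok = can_apply_tag("proposal-12345", "alice", ["proposal-12345-users"])
denied = can_apply_tag("public", "alice", ["proposal-12345-users"])
```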

@whs92
Member

whs92 commented Jan 10, 2025

Your proposal matches what we (and I guess you) are currently doing: adding metadata that records which proposal/purpose/tag the data relates to, and then leaving it to some other system to decide later who can do what with data carrying a given tag.

If you remove this from the scientific metadata and put it somewhere else, where will you put it? Will it still be searchable? The tag, which might be the proposal name, a commissioning project name, or some other thing, sits in with the scientific metadata, but it is itself somewhat scientific: it gives useful context to the data. I think I agree with the argument you are making and your conclusion about what should be stored as this tag, but I am not sure I agree with the original premise that

It would be better to have a dedicated field in Tiled for access control information associated with each node, separate from its metadata

Can you elaborate on why?

@danielballan
Member Author

danielballan commented Jan 10, 2025

Administrative information may (likely should) also be stored in metadata, but we should not rely on the contents of metadata for access policy enforcement. I am proposing to store it in a separate table in the same database: a table of access_tags that has a many-to-many relation with Tiled nodes (datasets). Yes, they will be searchable.

  1. We are aware of use cases (one from @edbarnard) where data is being ingested from systems like SciCat that store the access control information outside of the metadata. Mixing it in for Tiled is possible, but perhaps awkward.
  2. Users often have permission to edit metadata. They may or may not have permission to alter access control policies on their data. If the access control tags are placed in a separate location, it will be easier to grant access to one but not the other, and to alter one without accidentally altering the other. The argument works the other way too: operations from admins altering security policies should avoid touching (and may not have permission to touch) the metadata.
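
A rough sketch of what that many-to-many layout could look like, using the standard-library sqlite3 module. The table and column names here (`nodes`, `node_key`, `node_access_tags`, and so on) are assumptions for illustration, not Tiled's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        node_key TEXT NOT NULL,
        metadata TEXT NOT NULL DEFAULT '{}'  -- scientific metadata stays separate
    );
    CREATE TABLE access_tags (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Association table: one row per (node, tag) pair.
    CREATE TABLE node_access_tags (
        node_id INTEGER NOT NULL REFERENCES nodes(id),
        tag_id INTEGER NOT NULL REFERENCES access_tags(id),
        PRIMARY KEY (node_id, tag_id)
    );
    """
)
conn.executemany("INSERT INTO nodes (id, node_key) VALUES (?, ?)", [(1, "x"), (2, "y"), (3, "z")])
conn.executemany("INSERT INTO access_tags (id, name) VALUES (?, ?)", [(1, "proposal-12345"), (2, "public")])
conn.executemany("INSERT INTO node_access_tags (node_id, tag_id) VALUES (?, ?)", [(1, 1), (2, 1), (2, 2)])


def nodes_with_tag(conn, tag):
    """Search nodes by access tag without reading the metadata column."""
    rows = conn.execute(
        """
        SELECT nodes.node_key FROM nodes
        JOIN node_access_tags ON node_access_tags.node_id = nodes.id
        JOIN access_tags ON access_tags.id = node_access_tags.tag_id
        WHERE access_tags.name = ?
        ORDER BY nodes.node_key
        """,
        (tag,),
    )
    return [key for (key,) in rows]
```

Because tags live in their own tables, write permission on `metadata` and write permission on `node_access_tags` can be granted independently.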

@danielballan
Member Author

I'm curious whether @garryod has given thought to this or encountered this interpretation of RBAC before.

@garryod

garryod commented Jan 10, 2025

I'm curious whether @garryod has given thought to this or encountered this interpretation of RBAC before.

This interpretation of RBAC is new to me. It appears to be more powerful than typical RBAC implementations; I would be tempted to say it's more akin to many Relationship-Based Access Control (ReBAC) policies and would lend itself well to being expressed in such a way.

Unfortunately due to some requirements from @keithralphs and @coretl - specifically around use of network locations and times in authorization decisions - we've had to opt for fully fledged Attribute Based Access Control (ABAC). This has taken us down the route of Rego and a central Open Policy Agent instance for our Authorization, with some preliminary policy residing in the policy directory of our AuthZ repo. If you can constrain your authorization requirements such that ReBAC (or even RBAC) are viable then I would strongly recommend doing so - with a suggestion of using an off the shelf ReBAC solution such as those derived from Google's Zanzibar.

@danielballan
Member Author

Thanks, that's useful feedback. I think where @nmaytan and I are landing is this:

  1. Tiled should grow a new column in the nodes table, i.e. next to the node metadata. From Tiled's point of view that column will contain arbitrary JSON used for the purposes of access control. The Tiled HTTP API will enable filtering based on it.
  2. User-defined AccessControlPolicy plugins fully manage what gets stored in this column, and then use it to implement access control filtering.

This model enables:

  • Storing lists of group names and using them for access control (@edbarnard and @dylanmcreynolds want this)
  • Storing "object tags" and implementing the RBAC variant described in my original post (@nmaytan, @tacaswell, and I are sold on trying this)
  • Storing whatever information an external framework (Casbin, Zanzibar, etc.) would want to make an access control determination.

In short, we create a dedicated space for AccessControlPolicies to store and access state per-node, and continue to leave it up to AccessControlPolicies to implement whatever authorization model fits best with facility requirements.
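
As a sketch of how two different policies could interpret the same per-node JSON column, consider the following. The classes and the `allowed_scopes` method are hypothetical and do not reflect Tiled's actual AccessControlPolicy interface; the blob layouts (`"groups"`, `"tags"`) are likewise assumptions.

```python
class GroupListPolicy:
    """Blob stores group names directly, e.g. {"groups": ["smi"]}."""

    def allowed_scopes(self, access_blob, user_groups):
        if set(access_blob.get("groups", [])) & set(user_groups):
            return {"read"}
        return set()


class TagPolicy:
    """Blob stores access tags; entitlements are resolved from a central table."""

    def __init__(self, entitlements):
        self.entitlements = entitlements  # tag -> {group: set of scopes}

    def allowed_scopes(self, access_blob, user_groups):
        scopes = set()
        for tag in access_blob.get("tags", []):
            for group, granted in self.entitlements.get(tag, {}).items():
                if group in user_groups:
                    scopes |= granted
        return scopes


def visible_nodes(nodes, policy, user_groups, scope="read"):
    """Filter nodes (key -> access blob) down to those the policy permits."""
    return [key for key, blob in nodes.items() if scope in policy.allowed_scopes(blob, user_groups)]


tag_policy = TagPolicy({"proposal-12345": {"smi": {"read"}, "data-admins": {"read", "write"}}})
tagged_nodes = {"a": {"tags": ["proposal-12345"]}, "b": {"tags": []}}
group_nodes = {"a": {"groups": ["smi"]}, "b": {"groups": ["csx"]}}
```

The filtering code never inspects the blob itself; only the policy knows what the stored JSON means.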

It may make sense to ship implementations in Tiled itself for two or three popular modes.

@dylanmcreynolds
Contributor

I like this field. For reference, something similar was added to the event_model some time ago: data_groups and data_session. bluesky/event-model#196

@danielballan
Member Author

It's interesting that that was a compromise based on:

  • NSLS-II wanting to store one key, data_session, relying on an external system to resolve groups/users
  • ALS wanting to store a list of groups, data_groups, directly in the document

Years later, we rediscovered that we have the same divergent requirements/preferences. The proposed field may be used for one or the other, as controlled by the respective AccessControlPolicy.
