Question

I am building an app for a company that needs to control who sees which reports, by project and by role. A report belongs to exactly one project and can be visible to many (employee) roles.

When a report is submitted it is tagged with a project and a set of roles, e.g. "project1" and {"manager","seller"}, so that, for example, employees who work on project1 and are managers can see it. The way I do it now depends heavily on arrays; this is what I have:

reports table:
project (string)
roles (array of strings)

employees table:
projects (array of strings) // all the projects the employee is working or has worked on
roles (array of strings) // an employee can have many roles
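
In actual DDL this is roughly the following (a sketch; the id columns are illustrative):

create table employees (
    id       serial primary key,
    projects text[] not null default '{}', -- e.g. '{project1,project2}'
    roles    text[] not null default '{}'  -- e.g. '{manager,seller}'
);

create table reports (
    id      serial primary key,
    project text   not null,
    roles   text[] not null -- roles allowed to see this report
);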

When querying the reports an employee can see, I do something like this:

select *
from reports
where roles && :employee_roles            -- role overlap
  and project = any (:employee_projects); -- project membership

where :employee_roles and :employee_projects are the current employee's arrays.

I use PostgreSQL.

The problem is that I suspect this will not perform well (I am not sure).
The only way I know of to speed this query up is a GIN index on the reports.roles column, to make the overlap test faster.
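
That index would look something like this (assuming the tables sketched above):

create index reports_roles_gin on reports using gin (roles);
create index reports_project_idx on reports (project); -- plain btree, if the project filter is selective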

Besides performance, this tip in the PostgreSQL documentation made me worry:

Tip: Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.

So, is there a much better design for this, or will it just work fine?


Solution

Short answer: what you're doing is reasonably sane, but consider using int arrays rather than strings, as they're faster to compare, and mind the caveats below.

Personally, I'd normalize it: add a roles table, along with role2report and user2role join tables. Performance-wise, the optimal approach in my own experience is to pre-compute the current user's role_ids in your app, and then query with an IN (or = ANY) clause on those ids. This means:

select from reports join role2report ...
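
Fleshed out, that might look like the following (a sketch; the table and column names are illustrative, and $1 is the list of role_ids you pre-computed for the current user):

create table roles (
    id   serial primary key,
    name text not null unique
);

create table user2role (
    employee_id int not null references employees (id),
    role_id     int not null references roles (id),
    primary key (employee_id, role_id)
);

create table role2report (
    role_id   int not null references roles (id),
    report_id int not null references reports (id),
    primary key (role_id, report_id)
);

-- distinct, because a report can match several of the user's roles
select distinct r.*
from reports r
join role2report rr on rr.report_id = r.id
where rr.role_id = any ($1::int[]);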

The same goes for triggers and such: the key is to compute the role_ids (or perm_ids) first, and then query. What you do NOT, under any circumstance, want is:

select from reports join role2report join crazy_user2role_role2role_rec_view

The biggest optimization from there is caching a user's roles, using an int array, memcached, or whatever is convenient. This avoids constantly hitting a crazy user2role joined with a recursive role2role view definition, and whatever other craziness your spec's edge cases lead you to. Mind cache invalidation.
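
One possible shape for that cache, kept inside Postgres (a sketch; the cached_role_ids column and the trigger are hypothetical, and assume the user2role table above):

alter table employees
    add column cached_role_ids int[] not null default '{}';

create or replace function refresh_cached_roles() returns trigger as $$
declare
    emp int;
begin
    -- old is the only row available on DELETE; an UPDATE that moves a
    -- row to another employee would need both ids refreshed.
    if tg_op = 'DELETE' then
        emp := old.employee_id;
    else
        emp := new.employee_id;
    end if;
    update employees e
       set cached_role_ids = coalesce(
           (select array_agg(role_id)
              from user2role
             where employee_id = emp), '{}')
     where e.id = emp;
    return null;
end;
$$ language plpgsql;

-- Postgres 11+; on older versions write "execute procedure"
create trigger user2role_refresh_cache
after insert or update or delete on user2role
for each row execute function refresh_cached_roles();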

Caching the access lists is much trickier in my experience: should you cache who can read? Who can write? Both? Are some objects public? Can non-logged-in guests access them too? It's a deluge of questions.

If you do cache that, use an int array as well. Toss in, e.g., -1 to stand for public/guest access and 0 to stand for registered/user access, and then use array overlaps in your queries (with registered users getting the 0 and -1 rows automatically). Optimize your arrays accordingly to keep them small: if one contains -1, that should be its only value; else the same for 0; else list the role ids with read access.
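
In query form that could look like this (a sketch; read_acl is a hypothetical int[] column on reports, filled by the convention just described):

-- guests only see public rows
select * from reports where read_acl && array[-1];

-- registered users get 0 and -1 for free, plus their cached role ids ($1)
select *
from reports
where read_acl && ($1::int[] || array[0, -1])
order by id desc
limit 10;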

One caveat of using arrays, by the way: until PostgreSQL 9.2 (which added per-element statistics for arrays), no stats were collected on an array's contents. Without them, arrays are sub-optimal for data sets where some role_id grants access to most rows: the planner should ignore the GIN index for such a common role_id, but with no element stats it assumes the overlap is selective. That's a real performance killer, because it means PG will basically fetch the entire table through the GIN index to return the top-10 rows with appropriate perms, instead of walking an ordering index and filtering as it goes.
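
If you want to check what your version collects, the element-level statistics show up in pg_stats (read_acl being the hypothetical column from above):

select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'reports'
  and attname = 'read_acl';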

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow