Question

I'm not sure if this can be achieved in Google Refine at all, but basically I have two tables of data.

The first table lists all the users. The second table shows each user's friends. However, not every id in the "friends" column of the second table exists in the first table, and I want to get rid of those that don't. So, how can I look up each id in the friends column of the second table and remove the ids that don't exist in table 1?

Solution

Put the two tables in different projects (we'll call them Table1 and Table2).

In Table2, on the friends column:

  • use "split multi-valued cells" to get each value on a separate row
  • convert the friends column to numbers (or, conversely, convert user_id in Table1 to strings) so that both columns have the same type
  • use "add a new column based on this column", naming the new column validity, with the expression cross(cell,'Table1','user_id').length()

This returns 0 if there's no match, 1 if there's a match, or N > 1 if there are duplicates in Table1.
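
To make the 0 / 1 / N > 1 behaviour concrete, here is a rough Python sketch of the count computed for each cell. It is only an analogy under assumed data, not how OpenRefine implements cross(); the function name count_matches and the sample ids are made up for illustration.

```python
from collections import Counter


def count_matches(value, table1_user_ids):
    """Return how many rows of Table1 have a user_id equal to value."""
    # Counter returns 0 for keys it has never seen, which gives the
    # "no match" case for free.
    counts = Counter(table1_user_ids)
    return counts[value]


# Hypothetical sample data: user_id 3 appears twice in Table1.
table1_user_ids = [1, 2, 3, 3]

print(count_matches(2, table1_user_ids))    # exactly one matching row
print(count_matches(3, table1_user_ids))    # duplicate user_id in Table1
print(count_matches(9, table1_user_ids))    # id not present in Table1
print(count_matches("2", table1_user_ids))  # string "2" is not the number 2
```

The last call illustrates why the type conversion step matters: in this sketch the string "2" never equals the number 2, so the count stays at 0 until both columns hold the same type.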

If you want the data back in the original format, set up a facet on the validity column, blank out all the bad values (those with a count of 0), and then use "join multi-valued cells" to reverse the split operation you did up front.
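
The whole round trip (split, drop unmatched ids, join) can be sketched in Python as follows. This is a minimal illustration outside of OpenRefine, assuming the friends cell is a comma-separated string and that ids are compared as strings; clean_friends, valid_ids, and the sample values are hypothetical.

```python
def clean_friends(friends_cell, valid_ids, separator=","):
    """Remove ids from a multi-valued cell that are not in valid_ids."""
    # Split the multi-valued cell into individual ids (the "split" step).
    ids = [part.strip() for part in friends_cell.split(separator) if part.strip()]
    # Keep only ids that exist in Table1 (the facet-and-blank step).
    kept = [friend_id for friend_id in ids if friend_id in valid_ids]
    # Join what is left back into a single cell (the "join" step).
    return separator.join(kept)


# Hypothetical user_id values from Table1, stored as strings.
valid_ids = {"1", "2", "3"}

print(clean_friends("1,5,3", valid_ids))  # id 5 is dropped
print(clean_friends("7,8", valid_ids))    # nothing survives, so the cell is empty
```

Using a set for valid_ids keeps each lookup cheap, which matters if the friends column is long.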

I fixed some caching bugs with cross() for OpenRefine 2.6, so if the cross doesn't work, try stopping and restarting the Refine server.

Licensed under: CC-BY-SA with attribution