Question

I have a pandas df pulled from an ODBC connection:

import pandas as pd
import pandas.io.sql as psql
import pyodbc

handle = pyodbc.connect('...')
df1 = psql.frame_query("select * from Table1 where... [some queries on columns]", handle)
# below is a pandas df resulting from the above SQL query
df1 = pd.DataFrame([[1, 'F', 11111, 500, 60], [2, 'M', 22222, 400, 30], [3, 'M', 33333, 5400, 78], [4, 'F', 44444, 5400, 45], [5, 'M', 55555, 8914, 66]], columns = ['ID','Gender','ZipCd','Spend','Age'])

Now I want to run a separate query on a different table in the same database; and as one of the criteria, extract rows that match the IDs from df1 (e.g. below, which does not work).

df2 = psql.frame_query("select * from Table2 where ID = ? and StatusCd in ('104', '106', '112', '115')", df1['ID'])
# The ID variable is a common unique identifier b/n the 2 tables

My question is, how do I assign df1['ID'] as a list of elements to query in df2? e.g. ...where ID in (1,2,3,...), but using df1['ID'] as an object containing the list. This would return records where IDs in df2 matched those of df1 as well as the other query criteria.

I am familiar w/ R syntax, so conceptually, my question very closely resembles this one: Pass R variable to RODBC's sqlQuery?

At the end of the day, I'm interested in paring down table 1 so that it includes only records found in table 2 (i.e. that have one of the requisite StatusCds found in table 2). In this respect, I'm certain there is a more efficient way to call in the data, probably in one query, but I'm not literate enough in Python or SQL yet.

Further comment

I have pyodbc as a tag since I was originally pulling from my SQL servers using that module; maybe pyodbc is the more efficient method to use for this kind of task? But I'm an R/spreadsheet guy and pandas has just been the easiest thing for me to learn so far.

Solution

frame_query accepts an optional params keyword argument, a list or tuple of values to bind to the ? placeholders in the SQL query. To pass a whole list of IDs, use Python's string formatting to build an IN clause with one placeholder per ID. (In recent pandas versions frame_query has been removed in favor of pd.read_sql_query, which takes the same params argument.)

For example:

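# one '?' per ID; len() keeps the count in step with the params list below
placeholders = ','.join(['?'] * len(df1['ID']))
query = ("select * from Table2 where ID in ({}) "
         "and StatusCd in ('104', '106', '112', '115')").format(placeholders)
# the connection is the second argument; tolist() yields plain Python ints
df2 = psql.frame_query(query, handle, params=df1['ID'].tolist())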

If there are three IDs, the query string would be select * from Table2 where ID in (?,?,?) and StatusCd in ('104', '106', '112', '115').

There is a limit on the number of parameters you can send in one query (SQL Server, for example, allows at most 2,100 per request), so if you have a very large number of IDs you might want to run the query in batches and then concatenate the resulting DataFrames.
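
A minimal sketch of the batching idea, under stated assumptions: an in-memory sqlite3 database stands in for the pyodbc connection (both use ? placeholders), the Table2 rows are made up for illustration, and chunked and fetch_by_ids are hypothetical helper names, not part of pandas or pyodbc.

```python
import sqlite3


def chunked(values, size):
    """Yield successive slices of `values` with at most `size` items each."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def fetch_by_ids(conn, ids, status_codes, batch_size=1000):
    """Run one parameterized IN (...) query per batch of IDs and collect the rows."""
    rows = []
    status_marks = ",".join("?" * len(status_codes))
    for batch in chunked(list(ids), batch_size):
        id_marks = ",".join("?" * len(batch))
        query = (
            "select * from Table2 where ID in ({}) "
            "and StatusCd in ({})"
        ).format(id_marks, status_marks)
        # Parameters are bound in placeholder order: IDs first, then status codes.
        params = list(batch) + list(status_codes)
        rows.extend(conn.execute(query, params).fetchall())
    return rows


# Stand-in database with illustrative rows (not data from the question).
conn = sqlite3.connect(":memory:")
conn.execute("create table Table2 (ID integer, StatusCd text)")
conn.executemany(
    "insert into Table2 values (?, ?)",
    [(1, "104"), (2, "999"), (3, "106"), (4, "115"), (6, "112")],
)

ids = [1, 2, 3, 4, 5]  # plays the role of df1['ID'].tolist()
rows = fetch_by_ids(conn, ids, ("104", "106", "112", "115"), batch_size=2)
print(sorted(rows))
```

With pandas, the same loop could call frame_query (or pd.read_sql_query) once per batch and combine the per-batch DataFrames with pd.concat; the batch size should stay below whatever parameter limit the database driver imposes.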

OTHER TIPS

Use an INNER JOIN in your original query to return rows from table1 with matching IDs in table2 for the status codes you need. While you're at it, put the status codes in a variable and parameterize the SQL statement execution. Code would look something like this:

...
codes = ("104", "106", "112", "115")
sql = """select Table1.*
         from Table1
         inner join Table2
            on Table1.ID = Table2.ID
         where Table2.StatusCd in (?, ?, ?, ?)"""
df1 = psql.frame_query(sql, handle, params=codes)
...
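
One caveat worth sketching: the question describes ID as a unique identifier, but if Table2 could ever hold more than one qualifying row per ID (an assumption, not something the question states), an inner join repeats the Table1 row once per match, whereas an EXISTS subquery returns each Table1 row at most once. The example below uses an in-memory sqlite3 database as a stand-in for the real connection, with made-up rows.

```python
import sqlite3

# Stand-in database; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("create table Table1 (ID integer, Gender text)")
conn.execute("create table Table2 (ID integer, StatusCd text)")
conn.executemany("insert into Table1 values (?, ?)",
                 [(1, "F"), (2, "M"), (3, "M")])
# ID 1 has two qualifying rows in Table2; ID 2 has none.
conn.executemany("insert into Table2 values (?, ?)",
                 [(1, "104"), (1, "106"), (2, "999"), (3, "112")])

codes = ("104", "106", "112", "115")
marks = ",".join("?" * len(codes))

join_sql = """select Table1.* from Table1
              inner join Table2 on Table1.ID = Table2.ID
              where Table2.StatusCd in ({})""".format(marks)

exists_sql = """select Table1.* from Table1
                where exists (select 1 from Table2
                              where Table2.ID = Table1.ID
                                and Table2.StatusCd in ({}))""".format(marks)

join_rows = sorted(conn.execute(join_sql, codes).fetchall())
exists_rows = sorted(conn.execute(exists_sql, codes).fetchall())

print(join_rows)    # the row for ID 1 appears once per matching Table2 row
print(exists_rows)  # each qualifying Table1 row appears once
```

If ID really is unique in Table2, both queries return the same rows and the join is fine as written.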

I think the answer to this lies in constructing a better SQL query:

psql.frame_query("""select... from Table1 as t1
                    inner join Table2 t2
                    on t1.ID = t2.ID
                    where [add various queries from both tables]""", handle)

Moving to close this post since it's more appropriately a SQL question & answered with basic documentation.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow