After having endlessly tried to handle large (3-35gb) csv files in R, I have moved over to SQL for handling these datasets. So I am using this code within an R environment (using the using the SQlite based RSQLite package) but it should not detract from my SQL question!

My Question: How do I make a selection of one table based on matching values given in another table?

I would Like to explain by example. I have the following table format:

"Data" Table

Symbol| Value| EX
A  | 1  | N       
A  | 1  | N     
A  | 2  | T  
A  | 3  | N  
A  | 4  | N  
A  | 5  | N  
B  | 1  | P       
B  | 2  | P  
B  | 2  | N  
B  | 2  | N  
B  | 3  | P  
B  | 5  | P  
B  | 6  | T  
... 

I want to select all entries for which the symbol and exchange value matches according to a certain condition given in the example table below.

"Symbolexchange" Table:

Ticker| Exchange
A  | N       
B  | P  
... 

(Note that symbol and ticker refer to the same attribute, also EX and Exchange refer to the same attribute)

So the output I am aiming for is so that it only keeps A entries given exchange is N etc:

Symbol| Value| EX
A  | 1  | N       
A  | 1  | N     
A  | 3  | N  
A  | 4  | N  
A  | 5  | N  
B  | 1  | P       
B  | 2  | P  
B  | 3  | P  
B  | 5  | P  
... 

I was able to do this via two methods, although I am not quite satisfied with them.

This method adds the reference table in columns next to the original table, which is redundant.

SELECT *
FROM Data
INNER JOIN Symbolexchange 
ON Data.EX=Symbolexchange.EXCHANGE
AND Data.SYMBOL=Symbolexchange.TICKER

This Method also gets the job done directly but is slower than the above.

SELECT *
FROM Data
WHERE EX=(SELECT exchange FROM Symbolexchange WHERE ticker = SYMBOL)

Is there a better and faster way way to program this? Speed is quite important because of the size of my data sets. Any other comments on my code welcome!

Thanks

有帮助吗?

解决方案

Two things that you can do to improve performance:

First (and most importantly) add a key or index to your tables. I don't know SQLite, but usually there's a command something like this:

CREATE INDEX DataIX1 ON Data(Symbol,EX)

You'll want one on the other table too:

CREATE INDEX SymbolExchangeIX1 ON Symbolexchange(Ticker,Exchange)

You may need to throw in ".." or '..' on the names...

The second thing is that although your first query is probably your best approach, you should only return the columns that you actually need/want:

SELECT Data.*
FROM Data
INNER JOIN Symbolexchange 
ON Data.EX=Symbolexchange.EXCHANGE
AND Data.SYMBOL=Symbolexchange.TICKER

其他提示

I am not sure if you are using mysql or MS SQL. For MS SQL, you make your query faster by adding a no lock to your queries.

1) WITH (NOLOCK)

Select * from user with (NOLOCK)

OR

2) SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select * from user a, class b where a.userid=b.userid

You can refer to previously discussed topic on this below. WITH (NOLOCK) vs SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top