Julia: Find the intersection of two vectors and their indices to help concatenate two time series

StackOverflow https://stackoverflow.com/questions/22702848

  •  22-06-2023

Question

I'm trying to learn the Julia language, so I'm attempting to port some MATLAB and Octave code I have lying around to help me learn.

Specifically, I'm looking for a way to determine the indices because I'm trying to horizontally concatenate two time series by their dates so I can test for cointegration. In the past I've used the intersect function as one of the intermediate steps in MATLAB/Octave to accomplish this.

Time series 1 example
+------------+--------+
| Date       |  Value |
+------------+--------+
| 2014-03-01 |     11 |
| 2014-03-02 |     12 |
| 2014-03-03 |     13 |
| 2014-03-04 |     14 |
| 2014-03-05 |     15 |
+------------+--------+

Time series 2 example
+------------+--------+
| Date       |  Value |
+------------+--------+
| 2014-03-01 |     21 |
| 2014-03-02 |     22 |
| 2014-03-05 |     25 |
| 2014-03-06 |     26 |
+------------+--------+

Intermediate result desired
+------------+----------------------+---------------------+
| Date       |  Time series 1 value | Time series 2 value |
+------------+----------------------+---------------------+
| 2014-03-01 |                   11 |                  21 |
| 2014-03-02 |                   12 |                  22 |
| 2014-03-03 |                   13 |                 NaN |
| 2014-03-04 |                   14 |                 NaN |
| 2014-03-05 |                   15 |                  25 |
| 2014-03-06 |                  NaN |                  26 |
+------------+----------------------+---------------------+

Final result desired
+------------+----------------------+---------------------+
| Date       |  Time series 1 value | Time series 2 value |
+------------+----------------------+---------------------+
| 2014-03-01 |                   11 |                  21 |
| 2014-03-02 |                   12 |                  22 |
| 2014-03-05 |                   15 |                  25 |
+------------+----------------------+---------------------+

MATLAB's and Octave's intersect function can return the index positions of the intersected set elements, as well as the intersection set, like so (MATLAB reference : Octave reference):

[C a_idx b_idx] = intersect(a_vector, b_vector)

Julia's intersect function, from what I can tell, returns only the equivalent of C above (Julia reference):

C = intersect(a_vector, b_vector)

How can I achieve this concatenation of two time series in Julia, where only those dates for which both series have data are included in the final result?

I've played around with findin() and I can get some indices, but perhaps the way I've written MATLAB/Octave code in the past can't or shouldn't be replicated in Julia. I'm therefore interested in the most accurate and efficient way to get the final time series result using Julia.

(The time series examples above are only meant to clarify what I'm trying to achieve; the real data can have millions to billions of rows.)


Solution

This seems like a task where DataFrames should be the tool for the job. I'm not sure how well it currently works, but the documentation suggests that there is a join() method that can do both inner and outer joins, as you request. I have seen some issues about making DataFrames more like an in-memory database, but I have not followed the discussion closely enough to know where that stands.
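
To make the inner/outer join distinction concrete, here is a minimal sketch in plain Julia, without DataFrames, that reproduces the "intermediate" and "final" tables from the question. The function names outer_join and inner_join are hypothetical helpers written for this illustration, not part of any library, and the sketch assumes each date occurs at most once per series and that dates are ISO-8601 strings (so sorting them as strings is chronological).

```julia
# Example data copied from the question.
dates1 = ["2014-03-01", "2014-03-02", "2014-03-03", "2014-03-04", "2014-03-05"]
vals1  = [11.0, 12.0, 13.0, 14.0, 15.0]
dates2 = ["2014-03-01", "2014-03-02", "2014-03-05", "2014-03-06"]
vals2  = [21.0, 22.0, 25.0, 26.0]

# Hypothetical helper: outer join. Every date from either series is kept,
# with NaN where a series has no value for that date.
function outer_join(d1, v1, d2, v2)
    lookup1 = Dict(zip(d1, v1))
    lookup2 = Dict(zip(d2, v2))
    alldates = sort(union(d1, d2))
    col1 = [get(lookup1, d, NaN) for d in alldates]
    col2 = [get(lookup2, d, NaN) for d in alldates]
    return alldates, col1, col2
end

# Hypothetical helper: inner join. Only dates present in both series are kept.
function inner_join(d1, v1, d2, v2)
    lookup1 = Dict(zip(d1, v1))
    lookup2 = Dict(zip(d2, v2))
    common = sort(intersect(d1, d2))
    return common, [lookup1[d] for d in common], [lookup2[d] for d in common]
end

outer_dates, outer1, outer2 = outer_join(dates1, vals1, dates2, vals2)
inner_dates, inner1, inner2 = inner_join(dates1, vals1, dates2, vals2)
```

Here the outer join yields the six-row table with NaN gaps and the inner join yields the three rows for 2014-03-01, 2014-03-02 and 2014-03-05.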

When the problem size gets really big, I would suggest that you consider using a relational database such as MySQL or SQLite. They are carefully tuned for exactly this kind of operation and give you a simple declarative language in which to express the result you want, letting the system work out the fastest way to compute it.

OTHER TIPS

Concur with ivarne that DataFrames is a good choice:

using DataFrames

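# String dates are the join key; `b` and `c` hold the two value columns.
# (The @data macro used by early DataFrames releases is no longer needed.)
timeseries1 = DataFrame(
    a = ["2014-03-01", "2014-03-02", "2014-03-03", "2014-03-04", "2014-03-05"],
    b = [1, 2, 3, 4, 5]
)
timeseries2 = DataFrame(
    a = ["2014-03-01", "2014-03-02", "2014-03-05", "2014-03-06"],
    c = [21, 22, 23, 24]
)
# Inner join on the date column; older DataFrames releases spelled this
# join(timeseries1, timeseries2, on=:a).
innerjoin(timeseries1, timeseries2, on=:a)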

3x3 DataFrame:
               a b  c
[1,]    "2014-03-01" 1 21
[2,]    "2014-03-02" 2 22
[3,]    "2014-03-05" 5 23
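
To see which row positions such an inner join keeps, the indices can be recovered with Base's indexin. The following is only a sketch on the same example data; the variable names are made up for illustration, it assumes Julia 1.1 or later (for isnothing), and it assumes each date occurs at most once per series.

```julia
dates1 = ["2014-03-01", "2014-03-02", "2014-03-03", "2014-03-04", "2014-03-05"]
b      = [1, 2, 3, 4, 5]
dates2 = ["2014-03-01", "2014-03-02", "2014-03-05", "2014-03-06"]
c      = [21, 22, 23, 24]

# For each date in series 1: its position in series 2, or `nothing` if absent.
pos_in_2 = indexin(dates1, dates2)

# Row positions in each series for the dates they have in common.
rows1 = findall(!isnothing, pos_in_2)
rows2 = Int[pos_in_2[i] for i in rows1]

# Value columns side by side, one row per common date.
joined = hcat(b[rows1], c[rows2])
```

rows1 and rows2 play the role of the a_idx and b_idx outputs of MATLAB's intersect, except that they follow the order of the first series rather than sorted order.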

Here is a solution without DataFrames that keeps track of the original positions of the elements in the intersection. In this respect, it behaves like MATLAB's intersect function:

function intersectalamatlab(a, b)
    # Store in `res` the first position in `v` of each element of `ab`.
    function findindices!(res, ab, v)
        for (i, el) in enumerate(ab)
            res[i] = findfirst(x -> x == el, v)
        end
    end
    ab = intersect(a, b)                      # common elements, in the order of `a`
    resa = Vector{Int64}(undef, length(ab))   # positions of `ab` within `a`
    findindices!(resa, ab, a)
    resb = similar(resa)                      # positions of `ab` within `b`
    findindices!(resb, ab, b)
    return (ab, resa, resb)
end

a = [3, 45, 123, 12]
b = [12, 19, 46, 56, 123]
intersectalamatlab(a, b)
([123, 12], [3, 4], [5, 1])

This solution can surely be made faster, since each findfirst call is a linear scan, but compared with using DataFrames it has the advantage of being lightweight and of keeping the syntax and output of the MATLAB routine.
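
One possible way to address the speed concern is to replace the repeated findfirst scans with a single Dict lookup table, so that each input is traversed only once. This is a sketch under stated assumptions rather than a tuned implementation: the name intersect_with_indices is hypothetical, the result follows the order of a (as intersectalamatlab does, not MATLAB's sorted order), and repeated values report their first occurrence.

```julia
# Hypothetical helper: returns (common elements, indices into a, indices into b).
function intersect_with_indices(a, b)
    # Map each value of `b` to the index of its first occurrence.
    first_in_b = Dict{eltype(b),Int}()
    for (j, x) in enumerate(b)
        get!(first_in_b, x, j)   # keeps the existing entry if `x` was already seen
    end

    common = eltype(a)[]
    a_idx = Int[]
    b_idx = Int[]
    seen = Set{eltype(a)}()      # guards against duplicates in `a`
    for (i, x) in enumerate(a)
        j = get(first_in_b, x, 0)   # 0 marks "not present in b"
        if j != 0 && !(x in seen)
            push!(seen, x)
            push!(common, x)
            push!(a_idx, i)
            push!(b_idx, j)
        end
    end
    return common, a_idx, b_idx
end

a = [3, 45, 123, 12]
b = [12, 19, 46, 56, 123]
common, a_idx, b_idx = intersect_with_indices(a, b)
```

On the example vectors this gives the same result as intersectalamatlab. For time series that are already sorted by date, a single merge-style pass over both vectors would avoid the Dict as well.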

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow