Question

I am using MongoDB to store my stock tick data. I have one document per stock symbol per minute:

{
    "_id" : ObjectId("535fb330f6a03d59077db43c"),
    "symbol" : "AAPL",
    "ts_minute" : ISODate("2014-04-29T14:12:00Z"),
    "ticks" : [
        {
            "mu" : 115864,
            "ae" : true,
            "t" : 2,
            "v" : 571.93
        },
        {
            "mu" : 803378,
            "ae" : true,
            "t" : 2,
            "v" : 571.91
        },
        {
            "mu" : 903378,
            "ae" : false,
            "t" : null,
            "v" : 9000
        }
    ]
}

where mu is the offset in microseconds from ts_minute, t is the tick type (bid, ask, open, close, volume, etc.), and v is the value.
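As a small illustration of how mu relates to ts_minute, the sketch below reconstructs the absolute time of a tick. The helper name tick_timestamp is hypothetical, and the example assumes ts_minute is available as a Python datetime (PyMongo returns naive UTC datetimes by default; a timezone-aware value is used here only to make the output unambiguous).

```python
from datetime import datetime, timedelta, timezone


def tick_timestamp(ts_minute, mu):
    """Absolute time of a tick: the minute bucket plus its microsecond offset."""
    return ts_minute + timedelta(microseconds=mu)


# Values taken from the sample document above.
ts_minute = datetime(2014, 4, 29, 14, 12, tzinfo=timezone.utc)
first_tick_time = tick_timestamp(ts_minute, 115864)
print(first_tick_time.isoformat())  # 2014-04-29T14:12:00.115864+00:00
```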

To aggregate this into one-minute OHLC (open, high, low, close) bars, I use the following (with PyMongo):

query = {'$match': {'symbol': 'AAPL'}}
projection = {
    '$project': {
        'symbol':       1,
        'year':         {'$year':   '$ts_minute'},
        'month':        {'$month':  '$ts_minute'},
        'day':          {'$dayOfMonth': '$ts_minute'},
        'hour':         {'$hour':   '$ts_minute'},
        'minute':       {'$minute': '$ts_minute'},
        'ts_minute':    1,
        'ticks':        1
    }
}
unwind = {'$unwind': '$ticks'}
sort = {'$sort': {'ts_minute': 1}}
group = {
    '$group': {
        '_id': {
            'symbol':   '$symbol',
            'year':     '$year',
            'month':    '$month',
            'day':      '$day',
            'hour':     '$hour',
            'minute':   '$minute'
        },
        'open':     {'$first':  '$ticks.v'},
        'high':     {'$max':    '$ticks.v'},
        'low':      {'$min':    '$ticks.v'},
        'close':    {'$last':   '$ticks.v'},
    }
}
bars = tick_collection.aggregate([query, projection, unwind, sort, group])

The problem is that I store volume ticks and price ticks in the same array. Volume ticks are identified by having t equal to null, so when I group, my price ticks and volume ticks get mixed. I would like to aggregate to OHLCV such that OHLC is based on the ticks where t is not null, and V is the last element of the array where t is null.

Does it make sense? Or is it just poor schema design? ;-)

Solution

For performance, you really need to move the $sort before the $unwind: placed directly after $match, it can be combined with it and use an appropriate index, which would be {symbol: 1, ts_minute: 1} (so I hope you have that index available). The $project should go after the $unwind and create the price and volume fields you need for the aggregation. It also seems that you should just group by ts_minute directly rather than splitting it into year, month, day, hour and minute. The changes to make would be:

query = {'$match': {'symbol': 'AAPL'}}
sort = {'$sort': {'ts_minute': 1}}
unwind = {'$unwind': '$ticks'}
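projection = {
    '$project': {
        'symbol':    1,
        'ts_minute': 1,
        # Volume ticks have t == null; price ticks contribute 0 to the volume.
        'volume': {'$cond': [
            {'$eq': ['$ticks.t', None]},  # Python None is sent to MongoDB as null
            '$ticks.v',
            0
        ]},
        # Price ticks keep their value; volume ticks get a null price.
        'price': {'$cond': [
            {'$eq': ['$ticks.t', None]},
            None,
            '$ticks.v'
        ]}
    }
}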

group = {
    '$group': {
        '_id': {
            'symbol':   '$symbol',
            'minute':   '$ts_minute'
        },
        'open':     {'$first':  '$price'},
        'high':     {'$max':    '$price'},
        'low':      {'$min':    '$price'},
        'close':    {'$last':   '$price'},
        'volume':   {'$sum':    '$volume'}
    }
}
bars = tick_collection.aggregate([query, sort, unwind, projection, group])
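
To cross-check the pipeline output on a handful of documents, the semantics requested in the question can be written in plain Python. This is only a sketch under stated assumptions: the function name ohlcv_for_minute is hypothetical, every tick with a non-null t is treated as a price tick, and ticks are ordered by mu rather than by their stored array order. Two differences from the pipeline above are worth checking against your data: $first and $last return the projected price of the first and last tick in each group, which is null when that tick is a volume tick (as the last tick in the sample document is), and $sum adds up all volume ticks in the minute rather than taking the last one.

```python
def ohlcv_for_minute(doc):
    """Return an OHLCV bar for one per-minute document, or None if it has no price ticks."""
    # Order ticks by their microsecond offset within the minute.
    ticks = sorted(doc['ticks'], key=lambda tick: tick['mu'])
    # Price ticks have a non-null type; volume ticks have t == None.
    prices = [tick['v'] for tick in ticks if tick['t'] is not None]
    volumes = [tick['v'] for tick in ticks if tick['t'] is None]
    if not prices:
        return None
    return {
        'symbol': doc['symbol'],
        'minute': doc['ts_minute'],
        'open': prices[0],
        'high': max(prices),
        'low': min(prices),
        'close': prices[-1],
        # Last volume tick, as the question asks; None if the minute has none.
        'volume': volumes[-1] if volumes else None,
    }


# The sample document from the question, with ts_minute kept as a plain string.
sample = {
    'symbol': 'AAPL',
    'ts_minute': '2014-04-29T14:12:00Z',
    'ticks': [
        {'mu': 115864, 'ae': True, 't': 2, 'v': 571.93},
        {'mu': 803378, 'ae': True, 't': 2, 'v': 571.91},
        {'mu': 903378, 'ae': False, 't': None, 'v': 9000},
    ],
}
bar = ohlcv_for_minute(sample)
print(bar)
```

For the sample document this gives open 571.93, high 571.93, low 571.91, close 571.91 and volume 9000.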
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow