Question

I have a dataset in Excel with 60+ thousand rows and about 20 columns. The ID column sometimes repeats, and I want to add a column that returns 1 only in the row that is the most recent, and only if the ID repeats.

Here is the example. I have…

    ID            DATE       ColumnX
    AS1         Jan-2013      DATA
    AS2         Feb-2013      DATA
    AS3         Jan-2013      DATA
    AS4         Dec-2013      DATA
    AS2         Dec-2013      DATA

I want…

    ID            DATE       ColumnX      New Column
    AS1         Jan-2013      DATA            1
    AS2         Feb-2013      DATA            0
    AS3         Jan-2013      DATA            1
    AS4         Dec-2013      DATA            1
    AS2         Dec-2013      DATA            1 

I've been trying a combination of sorting and nested IFs, but that depends on my data always being in the same order (so that the formula can look up the ID in the previous row).

Bonus points: my dataset is fairly large for Excel, so the most efficient code, one that won't eat up the processor, would be appreciated!

Solution

One approach is to point MSQuery at your table and use SQL to apply the business rules. On the positive side, this runs very quickly (a couple of seconds in my tests against 64k rows). A huge minus is that the query engine does not seem to support Excel tables exceeding 64k rows, though there might be ways to work around this. Regardless, I offer the solution in case it gives you some ideas.

To set up, first give your data set a named range (I called it MYTABLE) and save the workbook. Next, select a cell to the right of your table in row 1 and click through Data | From Other Sources | From Microsoft Query. Choose Excel Files* | OK and browse for your file. The Query Wizard should open, showing MYTABLE as available; add all the columns. Click Cancel (really), then click Yes when asked whether you want to continue editing.

The MSQuery interface should open. Click the SQL button and replace the code with the following. You will need to edit some specifics, such as the file path. (Also, note that I used different column names. This was sheer paranoia on my part: the Jet engine is very finicky, and I wanted to rule out conflicts with reserved words as I built this.)

SELECT 
    MYTABLE.ID_X, 
    MYTABLE.DATE_X, 
    MYTABLE.COLUMN_X, 
    IIF(MAXDATES.ID_X IS NULL,0,1) * IIF(DUPTABLE.ID_X IS NULL,0,1) AS NEW_DATA
FROM ((`C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE 
        LEFT OUTER JOIN (
            SELECT MYTABLE1.ID_X, MAX(MYTABLE1.DATE_X) AS MAXDATE
            FROM `C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE1
            GROUP BY MYTABLE1.ID_X
            ) AS MAXDATES
        ON MYTABLE.ID_X = MAXDATES.ID_X
        AND MYTABLE.DATE_X = MAXDATES.MAXDATE)
    LEFT OUTER JOIN (
        SELECT MYTABLE2.ID_X
        FROM `C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE2
        GROUP BY MYTABLE2.ID_X
        HAVING COUNT(1) > 1
        ) AS DUPTABLE
    ON MYTABLE.ID_X = DUPTABLE.ID_X)

With the code in place, MSQuery will complain that the query can't be represented graphically. That's OK. The query will still execute, although it might take longer than expected at this stage. I'm not sure why, but it should run much faster on subsequent refreshes. Once results return, choose File | Return Data to Excel and accept the defaults on the Import Data dialog.

That's the technique. To refresh the query against new data, simply use Data | Refresh. If you need to tweak the query, you can get back to it through Excel via Data | Connections | Properties | Definition tab.

The code I provided returns your original data plus the NEW_DATA column, which has value 1 if the ID is duplicated and the date is the maximum date for that ID, otherwise 0. This code will not sort out ties if an ID's maximum date is on several rows. All such rows will be tagged 1.
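To make that rule concrete, here is a minimal sketch of the same logic outside Excel, written in Python purely for illustration. The function name `flag_latest_duplicates` and the `(id, date)` tuple layout are my own assumptions, not part of the query; the dates use the `Mon-YYYY` form from the question's sample.

```python
from collections import Counter
from datetime import datetime


def flag_latest_duplicates(rows):
    """Return one 0/1 flag per row, in the original row order.

    rows is a list of (id, date_text) tuples, with date_text in
    'Mon-YYYY' form (an assumption based on the question's sample).
    A row gets 1 only if its ID occurs more than once AND its date
    equals the maximum date for that ID. Ties are all flagged 1.
    """
    parsed = [(row_id, datetime.strptime(text, "%b-%Y")) for row_id, text in rows]

    # Equivalent of the DUPTABLE subquery: how often each ID occurs.
    counts = Counter(row_id for row_id, _ in parsed)

    # Equivalent of the MAXDATES subquery: latest date per ID.
    max_dates = {}
    for row_id, date in parsed:
        if row_id not in max_dates or date > max_dates[row_id]:
            max_dates[row_id] = date

    return [
        1 if counts[row_id] > 1 and date == max_dates[row_id] else 0
        for row_id, date in parsed
    ]


sample = [
    ("AS1", "Jan-2013"),
    ("AS2", "Feb-2013"),
    ("AS3", "Jan-2013"),
    ("AS4", "Dec-2013"),
    ("AS2", "Dec-2013"),
]
print(flag_latest_duplicates(sample))  # [0, 0, 0, 0, 1]
```

Only the later AS2 row is flagged here, because the other IDs are not duplicated. Two rows sharing an ID and the same maximum date would both come back as 1, which is the tie behaviour described above.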

Edit: The code is easily modified to ignore the duplication logic and show most recent row for all IDs. Simply change the last bit of the SELECT clause to read

IIF(MAXDATES.ID_X IS NULL,0,1) AS NEW_DATA

In that case, you could also remove the final LEFT JOIN with alias DUPTABLE.

OTHER TIPS

Sort by ID, then by DATE (ascending). Define entries in new column to be 1 if previous row has the same ID and next row has a different ID or is empty (for last row), 0 otherwise.
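As a rough illustration of this sort-and-compare rule, here is a small Python sketch (not something to paste into Excel). The function name `flag_by_sorted_neighbours` and the `(id, date)` tuple layout are hypothetical, and the dates are assumed to be in the `Mon-YYYY` form shown in the question.

```python
from datetime import datetime


def flag_by_sorted_neighbours(rows):
    """Return one 0/1 flag per row, in the original row order.

    rows is a list of (id, date_text) tuples, date_text in 'Mon-YYYY'
    form (an assumption). Rows are visited in (ID, date) order; a row
    gets 1 if the previous row has the same ID and the next row has a
    different ID or does not exist.
    """
    # Sort row indices instead of the rows, so flags can be written
    # back to the original positions.
    order = sorted(
        range(len(rows)),
        key=lambda i: (rows[i][0], datetime.strptime(rows[i][1], "%b-%Y")),
    )

    flags = [0] * len(rows)
    for pos, i in enumerate(order):
        prev_same = pos > 0 and rows[order[pos - 1]][0] == rows[i][0]
        next_differs = (
            pos == len(order) - 1 or rows[order[pos + 1]][0] != rows[i][0]
        )
        if prev_same and next_differs:
            flags[i] = 1
    return flags


sample = [
    ("AS1", "Jan-2013"),
    ("AS2", "Feb-2013"),
    ("AS3", "Jan-2013"),
    ("AS4", "Dec-2013"),
    ("AS2", "Dec-2013"),
]
print(flag_by_sorted_neighbours(sample))  # [0, 0, 0, 0, 1]
```

Because only the last row of each ID group can qualify, this rule flags exactly one row when an ID's latest date appears on several rows, unlike the SQL approach above, which tags all of them.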

It could be done in VBA. I'd be interested to know whether this is possible using formulas alone, as I had to do something similar once before.

Sub Macro1()

    Dim ws As Worksheet
    Dim rowCount As Long
    Dim counter As Long   ' Long, not Integer: 60k+ rows would overflow an Integer

    Set ws = ActiveWorkbook.Worksheets("Sheet1")
    rowCount = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row

    ' Clear any previous 0/1 flags in column D
    ws.Range("D2:D" & rowCount).ClearContents

    ' Make sure an AutoFilter is on, so that its Sort object is available
    If Not ws.AutoFilterMode Then ws.Columns("A:D").AutoFilter

    ' Sort by ID (column A) first, then by DATE (column B), both ascending
    With ws.AutoFilter.Sort
        .SortFields.Clear
        .SortFields.Add Key:=ws.Range("A1:A" & rowCount), SortOn:=xlSortOnValues
        .SortFields.Add Key:=ws.Range("B1:B" & rowCount), SortOn:=xlSortOnValues
        .Apply
    End With

    ' Flag each row 1, unless the next row has the same ID (a later row exists)
    For counter = 2 To rowCount
        If ws.Cells(counter, 1).Value = ws.Cells(counter + 1, 1).Value Then
            ws.Cells(counter, 4).Value = 0
        Else
            ws.Cells(counter, 4).Value = 1
        End If
    Next counter

End Sub

So the macro works on Sheet1 and gets the count of rows.

It then turns on an AutoFilter for columns A:D and clears out column D, which holds the 0s and 1s. Next it sorts on the columns mbroshi suggested, which you say you're already using. Finally it loops over every record, setting the value to 1, or to 0 if the row after it has the same ID.

Depending on your processor, I don't think this would take more than a minute or two to run. If you do find something using formulas, I would be interested to see it!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow