Question

I have an excel with a few thousand sources arranged in this way:

Example 1:

Abbott KW, Snidal D (2009) The Governance Triangle: Regulatory Standards Institutions and the Shadow of the State. In: Mattli W , Woods N (eds) The Politics of Global Regulation, pp. 44–88. Princeton University Press, Princeton, NJ

Example 2:

Moschella M , Tsingou E (eds) (2013) Great Expectations, Slow Transformations: Incremental Change in Financial Governance. ECPR Press, Colchester

I need to split them into 7 columns with this data:

  1. first author
  2. second author
  3. third to N author
  4. publication year
  5. title of source article
  6. published in (not always included, but always starts with In:)
  7. More info - means everything after the published in/after title of source article (in case it was not part of a larger publication)

I tried using the split into columns tool in excel, but because the data is so varied I couldn't do it efficiently. Does anyone know a solution to this?

Was it helpful?

Solution 2

Try this VBA macro. It uses regular expressions to parse out the different segments; but if data is not how you have presented, it will fail; so if there are failures, you'll need to see how it mismatches either my assumptions or the way you presented the data.

The macro assumes the data starts in A1 and is in column A, with no label in row 1. The results are written into column B and subsequent; with a label row 1 -- but these could be placed anywhere.

This code goes into a regular module.

Option Explicit
Sub ParseBiblio()
    Dim vData As Variant
    Dim vBiblios() As Variant
    Dim rRes As Range
    Dim re As Object, mc As Object
    Dim I As Long

'Assume Data is in column A.
'Might need to start at row 2 if there is a label row
vData = Range("A1", Cells(Rows.Count, "A").End(xlUp))

'Results to start in Column B with labels in row 1
Set rRes = Range("b1")

Set re = CreateObject("vbscript.regexp")
With re
    .MultiLine = True
    .Global = True
    .ignorecase = True
    .Pattern = "(^[^,]+),?\s*([^,]+?)(?:,\s*([^(]+))?\s*\((\d{4})\)\s*(.*?\.)\s*(?:In:\s*(.*)\.)?\s*(.*)"
End With

'Results array and labels
ReDim vBiblios(1 To UBound(vData) + 1, 1 To 7)
    vBiblios(1, 1) = "First Author"
    vBiblios(1, 2) = "Second Author"
    vBiblios(1, 3) = "Other Authors"
    vBiblios(1, 4) = "Publication Year"
    vBiblios(1, 5) = "Title"
    vBiblios(1, 6) = "Published In"
    vBiblios(1, 7) = "More Info"

For I = 1 To UBound(vData)
    Set mc = re.Execute(vData(I, 1))
    If mc.Count > 0 Then
        With mc(0)
            vBiblios(I + 1, 1) = .submatches(0)
            vBiblios(I + 1, 2) = .submatches(1)
            vBiblios(I + 1, 3) = .submatches(2)
            vBiblios(I + 1, 4) = .submatches(3)
            vBiblios(I + 1, 5) = .submatches(4)
            vBiblios(I + 1, 6) = .submatches(5)
            vBiblios(I + 1, 7) = .submatches(6)
        End With
    End If
Next I

Set rRes = rRes.Resize(rowsize:=UBound(vBiblios, 1), columnsize:=UBound(vBiblios, 2))
rRes.EntireColumn.Clear
rRes = vBiblios
With rRes
    With .Rows(1)
        .Font.Bold = True
        .HorizontalAlignment = xlCenter
    End With
    .EntireColumn.AutoFit
End With

End Sub

OTHER TIPS

See How to split Bibiliography MLA string into BibTex using c#? where I linked to several dedicated tools for extracting bibliographic information from formatted text.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top