Can anybody share a simple example of using Mathematica and Google scholar to extract academic research information

StackOverflow https://stackoverflow.com/questions/6109520

Question

How can I use Mathematica and Google scholar to find the number of papers a person published in 2011?

Was it helpful?

Solution

Google Scholar is not very suited for this goal as it doesn't have a formal API AFAIK. It also doesn't provide results in a structured (e.g. XML) format. So, we have to resort to a quick (and very, very fragile!) text pattern matching hack like:

 searchGoogleScholarAuthor[author_String] := 
 First[StringCases[
   Import["http://scholar.google.com/scholar?start=0&num=1&q=" <> 
     StringDrop[
      StringJoin @@ ("author:" <> # <> "+" & /@ 
         StringSplit[author]), -1] <> "&hl=en&as_sdt=1,5"], ___ ~~ 
     "Results" ~~ ___ ~~ "of about" ~~ Shortest[___] ~~ 
     p : Longest[(DigitCharacter | ",") ..] ~~ ___ ~~ "." ~~ ___ ~~ 
     "(" ~~ ___ :> p]]

In[191]:= searchGoogleScholarAuthor["A Einstein"]

Out[191]= "6,400"

In[190]:= searchGoogleScholarAuthor["Einstein"]

Out[190]= "9,400"

In[192]:= searchGoogleScholarAuthor["Wizard"]

Out[192]= "197"

In[193]:= searchGoogleScholarAuthor["Vries"]

Out[193]= "70,700"

Add ToExpression if you don't like the string result. If you want to restrict the publication years you can add &as_ylo=2011&as_yhi=2011& to the search string and change the start and end years appropriately.

Please note that authors with popular names will generate lots of spurious hits as there is no way to uniquely identify a single author. Additionally, Scholar returns a diversity of hits, including citations, books, reprints and more. So, really, this ain't very useful for counting.

A bit of explanation:

Scholar splits the initials and names of authors and co-authors over several author: fields combined with a +. The StringDrop[StringJoin @@ ("author:" <> # <> "+" & /@ StringSplit[author]), -1] part of the code takes care of that. The StringDrop removes the last +.

The Stringcases part contains a large text pattern which basically searches for the text that Scholar places at the top of each results page and which contains the number of hits. This number is then isolated and returned.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top