Article on studies of size of Google Scholar

Meta-Googling

The leading search engine for academic research is almost 10 years old. Scott McLemee considers how it's grown -- and to what size.

Google Scholar will be 10 years old as of next month. Also coming up fast, early in the new year, is the 10th anniversary of the launching of Inside Higher Ed. Just to be totally clear about it, they were completely unrelated developments, though I do seem to remember each being met with skepticism and reservations, in some quarters. Certain senior academics insisted on calling them both “blogs.” (It was a simpler time.)

Suffice it to say that the beta days soon passed. The world of higher learning has grown ever more dependent on the internet’s intricate system of tubes -- with the academic web now serving as publishing platform and archive, and also as a point of access to restricted or proprietary digital collections. In principle, at least, we ought to be able to determine the dimensions of the scholarly web: how many items it contains (papers, dissertations, conference recordings, etc.), in however many formats, and with whatever depth of indexing and degree of retrievability. But actually taking the measurements is another matter. They belong to the epistemic category Donald Rumsfeld so aptly dubbed “the known unknowns.”

How about posing a narrower question, then? Just how big is Google Scholar? As with the company’s search-engine algorithms, that information remains a trade secret. But a couple of recent studies have tried to sound the depths of Google Scholar -- using GS itself.

The earlier of the papers is “The Number of Scholarly Documents on the Public Web” by Madian Khabsa and C. Lee Giles, published in May by PLOS One, the online open access journal. Giles is a professor of information and computer sciences at Pennsylvania State University, University Park, where Khabsa is a Ph.D. candidate in computer science and engineering.

Their study, conducted in early 2013, started with a pool of 10 papers each from 15 categories used by another search engine, Microsoft Academic Search. The fields covered were agriculture science, arts and humanities, biology, chemistry, computer science, economics and business, engineering, environmental sciences, geosciences, material science, mathematics, medicine, physics, and social sciences, plus multidisciplinary. The researchers then requested from both Google Scholar and Microsoft Academic Search a list of the distinct incoming citations to each paper. That is, if paper X cited a target paper seven times, it counted only as one citation.

“Overall,” Khabsa and Giles report, “we obtained 41,778 citations from MAS and 86,870 citations from Google Scholar,” including all the metadata available: “the document's title, list of authors, number of citations, year of publications, and the venue of publication (if available).”

The format and range of metadata are not standardized, but with some effort the results from the two search engines could be compared to determine how many items appeared in both -- as well as the number GS had but MAS didn’t, and vice versa. The researchers also determined the degree of overlap in each of the 15 fields, and the percentage of papers available without payment or subscription.

At the time of the experiment, MAS claimed 48.7 million records. Taking into account the degree of overlap, the researchers “estimated Google Scholar to have 99.3 million documents, which is approximately, 87% of the total number of scholarly documents found on the web,” which they determine to be some 114 million items in English, with about a quarter of them freely available.
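Estimates of this kind rest on capture-recapture reasoning of the sort ecologists use to count wildlife: if two independent samples of a population overlap heavily, the population is probably small, and if they barely overlap, it is probably large. What follows is only a minimal sketch of that idea (the Lincoln-Petersen estimator) in Python. The function name `lincoln_petersen` and the record IDs are made up for illustration and are not data from the study, whose actual procedure also drew on the per-field overlap and on the size MAS reported.

```python
def lincoln_petersen(n_a: int, n_b: int, n_both: int) -> float:
    """Estimate a population's size from two samples and their overlap.

    n_a    -- number of items found by source A
    n_b    -- number of items found by source B
    n_both -- number of items found by both
    """
    if n_both <= 0:
        raise ValueError("the two samples must share at least one item")
    return n_a * n_b / n_both


# Hypothetical toy samples (NOT figures from Khabsa and Giles).
sample_a = {"p1", "p2", "p3", "p4", "p5", "p6"}
sample_b = {"p4", "p5", "p6", "p7", "p8", "p9", "p10", "p11"}

overlap = sample_a & sample_b  # items that both sources returned
estimate = lincoln_petersen(len(sample_a), len(sample_b), len(overlap))
print(estimate)  # 6 * 8 / 3 = 16.0
```

In this toy case the two sources together saw only 11 distinct items, yet the estimator suggests about 16 exist, which is the sense in which overlap lets one reason about documents neither search engine returned.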

The share of open-access material varies considerably between fields, of course. At the low end is agricultural science, with 12 percent. Computer science is in the lead, with half of scholarly publications being freely available. (The percentages are broken down by field in the paper’s second table.)

In July, Enrique Orduña-Malea and three other researchers in Spain published “About the Size of Google Scholar: Playing the Numbers” through the site arXiv. The paper begins with an assessment of the methodology and findings of Khabsa and Giles. In particular, Orduña-Malea et al. stress “the low indexation of institutional repositories on Google Scholar” and the GS policy of not indexing files over 5MB (“a procedure which is especially critical for doctoral theses”). They suggest that the earlier study probably underestimates the size of the academic web -- even of its strictly Anglophone component -- while overestimating how much of it Google Scholar indexes.

The authors go on to describe a battery of tests designed to measure GS “from the inside,” so to speak. Generally this involved performing searches for particular kinds of documents or search terms, using different temporal filters. For example, a search could be run for patents and citations issued in a given century, then repeated decade by decade within that century, and the results compared.

An intriguing variant is what the authors call an “absurd query,” in which the search is for a very common word likely to be found in most documents, with variations on the search parameters, including temporal filters.

The Khabsa-Giles study came up with the estimate that Google Scholar indexed 99.3 million documents in English. Extrapolating from that, with English accounting for 65 percent of GS material, Orduña-Malea et al. determine that the Khabsa-Giles method yields a total Google Scholar database of 152.7 million documents. (Despite their critique, the authors call Khabsa and Giles’s work “novel and brilliant.”)
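The extrapolation is a single division, and it can be checked in a few lines of Python. This is a sketch of the arithmetic as described here, not of the authors' own computation; the variable names are invented for illustration.

```python
# Figures as reported in the column; variable names are hypothetical.
english_docs_millions = 99.3  # estimated English-language documents in GS
english_share = 0.65          # English as a fraction of all GS material

# If 65 percent of the total equals 99.3 million, divide to get the total.
total_millions = english_docs_millions / english_share
print(f"{total_millions:.2f}")  # 152.77
```

The result, about 152.77 million, matches the 152.7 million cited if the figure is truncated rather than rounded.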

The methods that Orduña-Malea and his colleagues tried were not restricted to English-language scholarship. Their findings varied from a low of 126.3 million documents in Google Scholar to a high of 176.8 million.

For their part, the Spanish researchers conclude that a judicious estimate would be in the neighborhood of 160 million items. “However,” they write, “the fact that all methods show great inconsistencies, limitations and uncertainties, makes us wonder why Google does not simply provide this information to the scientific community if the company really knows this figure.”

It’s a fair point, and one the company should answer. Its 10th anniversary might be a good time for Google Scholar to give its users the gift of transparency.