In today's Academic Minute, Tamara Bogdanovic, a professor of physics at Georgia Tech, discusses the use of advanced supercomputers to make predictions about the behavior of black holes.
Professors in the arts and sciences at Walla Walla Community College have voted no confidence in President Steven VanAusdle, criticizing what they say is a lack of support for non-vocational programs and a poor administrative style, The Union-Bulletin reported. After the vote, the board of the college issued a strong statement of support for the president.
Negotiators for California State University and its faculty union said Thursday night that they had reached a tentative agreement on a three-year contract extension. Under the terms of the deal between Cal State and the California Faculty Association, the compensation pool for union members -- who include librarians, coaches, and counselors as well as professors -- would rise by 3 percent this year, resulting in a 1.6 percent across-the-board increase and pay raises of up to 4.6 percent for certain groups of instructors.
Would-be graduate students in philosophy may again apply to the University of Colorado at Boulder, to begin their studies in fall 2015. Graduate admissions to the department were suspended last year after an external study of its climate described systemic sexual harassment and bullying. Andy Cowell, who was appointed interim chair of philosophy following the American Philosophical Association subcommittee's unflattering report, said in a statement that over the last nine months department faculty members “had willingly participated in numerous facilitated department workshops, as well as activities and exercises to build the culture.” Current graduate students were involved in the reform process. Provost Russell L. Moore called the department's efforts “laudable,” saying they could serve as a model for other departments struggling with climate issues.
Google Scholar will be 10 years old as of next month. Also coming up fast, early in the new year, is the 10th anniversary of the launching of Inside Higher Ed. Just to be totally clear about it, they were completely unrelated developments, though I do seem to remember each being met with skepticism and reservations, in some quarters. Certain senior academics insisted on calling them both “blogs.” (It was a simpler time.)
Suffice it to say that the beta days soon passed. The world of higher learning has grown ever more dependent on the internet’s intricate system of tubes -- with the academic web now serving as publishing platform and archive, and also as a point of access to restricted or proprietary digital collections. In principle, at least, we ought to be able to determine the dimensions of the scholarly web: how many items it contains (papers, dissertations, conference recordings, etc.), in however many formats, and with whatever depth of indexing and degree of retrievability. But actually taking the measurements is another matter. They belong to the epistemic category Donald Rumsfeld so aptly dubbed “the known unknowns.”
How about posing a narrower question, then? Just how big is Google Scholar? As with the company’s search-engine algorithms, that information remains a trade secret. But a couple of recent studies have tried to sound the depths of Google Scholar -- using GS itself.
The earlier of the papers is “The Number of Scholarly Documents on the Public Web” by Madian Khabsa and C. Lee Giles, published in May by PLOS One, the online open access journal. Giles is a professor of information and computer sciences at Pennsylvania State University, University Park, where Khabsa is a Ph.D. candidate in computer science and engineering.
Their study, conducted in early 2013, started with a pool of 10 papers each from 15 categories used by another search engine, Microsoft Academic Search. The fields covered were agriculture science, arts and humanities, biology, chemistry, computer science, economics and business, engineering, environmental sciences, geosciences, material science, mathematics, medicine, physics, and social sciences, plus multidisciplinary. The researchers then requested from both Google Scholar and Microsoft Academic Search a list of the distinct incoming citations to each paper. That is, if paper X cited a target paper seven times, it counted only as one citation.
“Overall,” Khabsa and Giles report, “we obtained 41,778 citations from MAS and 86,870 citations from Google Scholar,” including all the metadata available: “the document's title, list of authors, number of citations, year of publications, and the venue of publication (if available).”
The format and range of metadata are not standardized, but with some effort the results from the two search engines could be compared to determine how many items appeared in both -- as well as the number GS had but MAS didn’t, and vice versa. The researchers also determined the degree of overlap in each of the 15 fields, and the percentage of papers available without payment or subscription.
At the time of the experiment, MAS claimed 48.7 million records. Taking into account the degree of overlap, the researchers “estimated Google Scholar to have 99.3 million documents, which is approximately, 87% of the total number of scholarly documents found on the web,” which they determine to be some 114 million items in English, with about a quarter of them freely available.
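The overlap reasoning can be sketched in a few lines of Python. This is only an illustration of the general idea, not the researchers' actual procedure: the function name and the two overlap fractions are hypothetical placeholders, and only the 48.7 million, 99.3 million and 114 million figures come from the study as reported above.

```python
def estimate_size_from_overlap(known_size, frac_known_in_other, frac_other_in_known):
    """Estimate the size of an index of unknown size from its overlap
    with an index whose size is published.

    known_size          -- record count of the index with a published size
    frac_known_in_other -- share of the known index's sample also found in the other index
    frac_other_in_known -- share of the other index's sample also found in the known index
    """
    if not 0 <= frac_known_in_other <= 1:
        raise ValueError("frac_known_in_other must lie in [0, 1]")
    if not 0 < frac_other_in_known <= 1:
        raise ValueError("frac_other_in_known must lie in (0, 1]")
    # Both products describe the same intersection:
    #   known_size * frac_known_in_other == other_size * frac_other_in_known
    intersection = known_size * frac_known_in_other
    return intersection / frac_other_in_known


MAS_RECORDS = 48.7e6  # figure MAS claimed at the time of the experiment

# Hypothetical overlap fractions, chosen only to show the mechanics.
example = estimate_size_from_overlap(MAS_RECORDS, 0.5, 0.4)
print(f"illustrative estimate: {example / 1e6:.2f} million records")

# Sanity check on the figures quoted in the article.
gs_share = 99.3e6 / 114e6
print(f"Google Scholar share of the scholarly web: {gs_share:.0%}")
```

The last two lines simply confirm that 99.3 million out of 114 million works out to roughly 87 percent, as quoted.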
The share of open-access material varies considerably between fields, of course. At the low end is agricultural science, with 12 percent. Computer science is in the lead, with half of scholarly publications being freely available. (The percentages are broken down by field in the paper’s second table.)
In July, Enrique Orduña-Malea and three other researchers in Spain published “About the Size of Google Scholar: Playing the Numbers” through the site arXiv. The paper begins with an assessment of the methodology and findings of Khabsa and Giles. In particular, Orduña-Malea et al. stress “the low indexation of institutional repositories on Google Scholar” and the GS policy of not indexing files over 5MB, “a procedure which is especially critical for doctoral theses.” They suggest that the earlier study probably underestimates the size of the academic web -- even of its strictly Anglophone component -- while overestimating how much of it Google Scholar indexes.
The authors go on to describe a battery of tests designed to measure GS “from the inside,” so to speak. Generally this involved performing searches for particular kinds of documents or search terms, using different temporal filters. For example: the search could be made for patents and citations issued in a century, then by decade within that century, and the result compared.
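A rough sketch of that kind of comparison, in Python, might look like the following. The function name is hypothetical and the hit counts are made up purely for illustration; they are not Google Scholar results, and the paper's own tests may have been set up differently.

```python
def decade_consistency(century_total, decade_totals):
    """Compare a whole-century hit count with the sum of its decade counts.

    Returns (sum of the decade counts, ratio of that sum to the century count).
    A ratio far from 1.0 suggests the reported hit counts are rough estimates.
    """
    if century_total <= 0:
        raise ValueError("century_total must be positive")
    decade_sum = sum(decade_totals)
    return decade_sum, decade_sum / century_total


# Made-up hit counts, NOT real Google Scholar results.
century_hits = 1_000_000
decade_hits = [50_000, 60_000, 70_000, 80_000, 90_000,
               100_000, 110_000, 120_000, 130_000, 140_000]

decade_sum, ratio = decade_consistency(century_hits, decade_hits)
print(f"sum of decades: {decade_sum:,}; ratio to century query: {ratio:.2f}")
```

If the search engine's counts were exact, the ten decade queries would add up to the century query; the size of the gap gives a sense of how far the reported totals can be trusted.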
An intriguing variant is what the authors call an “absurd query,” in which the search is for a very common word likely to be found in most documents, with variations on the search parameters, including temporal filters.
The Khabsa-Giles study came up with the estimate that Google Scholar indexed 99.3 million documents in English. Extrapolating from that, with English accounting for 65 percent of GS material, Orduña-Malea et al. determine that the Khabsa-Giles method yields a total Google Scholar database of 152.7 million documents. (Despite their critique, the authors call Khabsa and Giles’s work “novel and brilliant.”)
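The extrapolation is a single division, sketched here in Python. The function name is a hypothetical label; the two inputs are the figures given above.

```python
def extrapolate_total(english_docs, english_share):
    """Scale an English-only document count up to all languages,
    given the share of the index that is in English."""
    if not 0 < english_share <= 1:
        raise ValueError("english_share must lie in (0, 1]")
    return english_docs / english_share


# 99.3 million English documents, with English taken to be 65 percent of the index.
total = extrapolate_total(99.3e6, 0.65)
print(f"extrapolated total: {total / 1e6:.2f} million documents")
```

The result, a little under 152.8 million, matches the 152.7 million figure cited above to within rounding.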
The methods that Orduña-Malea and his colleagues tried were not restricted to English-language scholarship. Their findings varied from a low of 126.3 million documents in Google Scholar to a high of 176.8 million.
For their part, the Spanish researchers conclude that a judicious estimate would be in the neighborhood of 160 million items. “However,” they write, “the fact that all methods show great inconsistencies, limitations and uncertainties, makes us wonder why Google does not simply provide this information to the scientific community if the company really knows this figure.”
It’s a fair point, and one the company should answer. Its 10th anniversary might be a good time for Google Scholar to give its users the gift of transparency.