techfaculty

Article on studies of size of Google Scholar

Google Scholar will be 10 years old as of next month. Also coming up fast, early in the new year, is the 10th anniversary of the launching of Inside Higher Ed. Just to be totally clear about it, they were completely unrelated developments, though I do seem to remember each being met with skepticism and reservations, in some quarters. Certain senior academics insisted on calling them both “blogs.” (It was a simpler time.)

Suffice it to say that the beta days soon passed. The world of higher learning has grown ever more dependent on the internet’s intricate system of tubes -- with the academic web now serving as publishing platform and archive, and also as a point of access to restricted or proprietary digital collections. In principle, at least, we ought to be able to determine the dimensions of the scholarly web: how many items it contains (papers, dissertations, conference recordings, etc.), in however many formats, and with whatever depth of indexing and degree of retrievability. But actually taking the measurements is another matter. They belong to the epistemic category Donald Rumsfeld so aptly dubbed “the known unknowns.”

How about posing a narrower question, then? Just how big is Google Scholar? As with the company’s search-engine algorithms, that information remains a trade secret. But a couple of recent studies have tried to sound the depths of Google Scholar – using GS itself.

The earlier of the papers is “The Number of Scholarly Documents on the Public Web” by Madian Khabsa and C. Lee Giles, published in May by PLOS One, the online open access journal. Giles is a professor of information and computer sciences at Pennsylvania State University, University Park, where Khabsa is a Ph.D. candidate in computer science and engineering.

Their study, conducted in early 2013, started with a pool of 10 papers each from 15 categories used by another search engine, Microsoft Academic Search. The fields covered were agriculture science, arts and humanities, biology, chemistry, computer science, economics and business, engineering, environmental sciences, geosciences, material science, mathematics, medicine, physics, and social sciences, plus multidisciplinary. The researchers then requested from both Google Scholar and Microsoft Academic Search a list of the distinct incoming citations to each paper. That is, if paper X cited a target paper seven times, it counted only as one citation.
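The counting rule in that last sentence can be illustrated with a short Python sketch. The record layout (pairs of citing and target paper identifiers) and the function name are assumptions made for illustration, not the format either search engine actually returns.

```python
from collections import defaultdict

def distinct_incoming_citations(mentions):
    """Count distinct citing papers per target paper.

    `mentions` is an iterable of (citing_id, target_id) pairs, one per
    in-text mention, so the same pair may appear many times.
    """
    citing = defaultdict(set)
    for citing_id, target_id in mentions:
        citing[target_id].add(citing_id)  # a set ignores repeated mentions
    return {target: len(sources) for target, sources in citing.items()}

# Hypothetical data: paper X mentions target T seven times, paper Y once.
mentions = [("X", "T")] * 7 + [("Y", "T")]
counts = distinct_incoming_citations(mentions)
print(counts)  # {'T': 2}
```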

“Overall,” Khabsa and Giles report, “we obtained 41,778 citations from MAS and 86,870 citations from Google Scholar,” including all the metadata available: “the document's title, list of authors, number of citations, year of publications, and the venue of publication (if available).”

The format and range of metadata are not standardized, but with some effort the results from the two search engines could be compared to determine how many items appeared in both -- as well as the number GS had but MAS didn’t, and vice versa. The researchers also determined the degree of overlap in each of the 15 fields, and the percentage of papers available without payment or subscription.

At the time of the experiment, MAS claimed 48.7 million records. Taking into account the degree of overlap, the researchers “estimated Google Scholar to have 99.3 million documents, which is approximately, 87% of the total number of scholarly documents found on the web,” which they determine to be some 114 million items in English, with about a quarter of them freely available.
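The article does not spell out the arithmetic behind “taking into account the degree of overlap.” One standard way to turn the overlap between two samples into a population estimate is the Lincoln-Petersen (capture-recapture) estimator. The sketch below assumes that approach and uses made-up sample sizes, so it illustrates the idea rather than reproducing the study's calculation. The final lines check the quoted 87 percent share against the two reported totals.

```python
def lincoln_petersen(n_first, n_second, n_both):
    """Estimate total population size from two overlapping samples.

    n_first  -- items found by the first source
    n_second -- items found by the second source
    n_both   -- items found by both sources
    """
    if n_both <= 0:
        raise ValueError("no overlap: the estimate is undefined")
    return n_first * n_second / n_both

# Hypothetical toy numbers, not taken from the study.
estimate = lincoln_petersen(n_first=400, n_second=900, n_both=300)
print(estimate)  # 1200.0

# Share of the estimated scholarly web covered by Google Scholar, using
# the figures reported in the article (in millions of documents).
gs_docs, total_docs = 99.3, 114.0
share = gs_docs / total_docs
print(f"{share:.0%}")  # 87%
```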

The share of open-access material varies considerably between fields, of course. At the low end is agricultural science, with 12 percent. Computer science is in the lead, with half of scholarly publications being freely available. (The percentages are broken down by field in the paper’s second table.)

In July, Enrique Orduña-Malea and three other researchers in Spain published “About the Size of Google Scholar: Playing the Numbers” through the site arXiv. The paper begins with an assessment of the methodology and findings of Khabsa and Giles. In particular, Orduña-Malea et al. stress “the low indexation of institutional repositories on Google Scholar” and the GS policy of not indexing files over 5MB (“a procedure which is especially critical for doctoral theses”). They suggest that the earlier study probably underestimates the size of the academic web -- even of its strictly Anglophone component -- while overestimating how much of it Google Scholar indexes.

The authors go on to describe a battery of tests designed to measure GS “from the inside,” so to speak. Generally this involved performing searches for particular kinds of documents or search terms, using different temporal filters. For example: the search could be made for patents and citations issued in a century, then by decade within that century, and the result compared.
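The consistency check behind this kind of test is simple arithmetic: if hit counts were exact, the counts for the ten decades would add up to the count for the whole century. A minimal sketch, with invented hit counts standing in for real search results (the function name is likewise hypothetical):

```python
def decade_vs_century_gap(century_hits, decade_hits):
    """Return the relative gap between the summed decade counts and the
    century count (0.0 means the two agree exactly)."""
    if century_hits <= 0:
        raise ValueError("century_hits must be positive")
    return (sum(decade_hits) - century_hits) / century_hits

# Invented numbers for illustration only.
century = 1_000_000
decades = [50_000, 60_000, 70_000, 80_000, 90_000,
           100_000, 110_000, 120_000, 130_000, 140_000]
gap = decade_vs_century_gap(century, decades)
print(f"{gap:+.1%}")  # -5.0%
```

A gap far from zero would suggest that the reported hit counts are rough estimates rather than exact tallies.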

An intriguing variant is what the authors call an “absurd query,” in which the search is for a very common word likely to be found in most documents, with variations on the search parameters, including temporal filters.

The Khabsa-Giles study came up with the estimate that Google Scholar indexed 99.3 million documents in English. Extrapolating from that, with English accounting for 65 percent of GS material, Orduña-Malea et al. determine that the Khabsa-Giles method yields a total Google Scholar database of 152.7 million documents. (Despite their critique, the authors call Khabsa and Giles’s work “novel and brilliant.”)
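The extrapolation is a single division. The sketch below uses the two figures quoted above; plain division gives roughly 152.8 million, so the reported 152.7 million presumably reflects rounding or slightly different inputs in the original calculation. The variable names are illustrative.

```python
english_docs_millions = 99.3  # estimate for English-language documents
english_share = 0.65          # English share of GS material, as quoted above

total_millions = english_docs_millions / english_share
print(round(total_millions, 1))  # 152.8
```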

The methods that Orduña-Malea and his colleagues tried were not restricted to English-language scholarship. Their findings varied from a low of 126.3 million documents in Google Scholar to a high of 176.8 million.

For their part, the Spanish researchers conclude that a judicious estimate would be in the neighborhood of 160 million items. “However,” they write, “the fact that all methods show great inconsistencies, limitations and uncertainties, makes us wonder why Google does not simply provide this information to the scientific community if the company really knows this figure.”

It’s a fair point, and one the company should answer. Its 10th anniversary might be a good time for Google Scholar to give its users the gift of transparency.

 

At Educause conference, Kuali leaders attempt to assure college leaders about shifts

Since announcing a for-profit company in August, Kuali has struggled to explain its change in direction. Its leaders came to Educause to change that.

College for America spins off its custom-made learning management system

Southern New Hampshire U.'s College for America spins off the custom-made learning management system it built around competency-based education.

Lynn U., to free itself from its learning management system, creates its own software

Lynn U., halfway through dropping its learning management system for iTunes U, develops its own software where Apple's falls short.

Educause, Gates Foundation to examine history and future of the LMS

Educause launches a Gates Foundation-backed initiative to identify how the learning management system market needs to evolve.

New NEH director welcomes digital humanities grant recipients to the agency's new home

Grants for digital humanities projects serve as an established tradition as the new chairman of the National Endowment for the Humanities welcomes grant recipients to the agency's new home in Washington.

Wesleyan U. fires university librarian after disagreement with provost

After long-running disagreements about how to run its library, Wesleyan U. fires its university librarian.

ACE will create a pool of 100 low-cost courses, some from non-college providers

ACE continues online experimentation with proposed pool of general education courses from colleges and providers like StraighterLine.

The Pulse podcast explores the future of 'flipping'

This month's edition of The Pulse podcast looks at what the future holds for efforts to use technology to "flip" the classroom.

Essay urges professors not to ban student emails

The article about Spring-Serenity Duvall, a communications professor who banned students from emailing her and lived to blog about it, caught my eye on the same day my own inboxes at two colleges spilled over with bewildered messages from students. Some had been told to purchase the wrong edition of our course text, resulting in their plodding through a chapter on meta-commentary instead of one on contributing meaningfully to group discussions; more simply hadn’t received their textbooks and didn’t know when they would; still others, I suspected, were so besieged by first-week information overload that they needed reassurance from a human who had seemed friendly enough on the first day of class.

When I announced to my Critical Reading and Writing classes the next morning that we wouldn’t cover the assigned reading so we could instead talk about “a professor who doesn’t allow students to email her,” many likely assumed I was using this hook as a launching pad for my own ban. Several — the ones who had dared type a few words or even sentences to me at quiet, unobtrusive hours of the night — looked somewhat repentant. We were going to read this article together, I told them, and in addition to identifying its purpose, audience, context, and noteworthy rhetorical moves, they would be invited to interject their opinions.

“I had a strong reaction when I read this,” I admitted, “and I expect you might as well.”

Turns out, the students generally endorsed Duvall’s policy more than I did. One young man remarked that he initially opposed the idea but began to see its merits as we dug further into the reasoning. Both classes and I settled unanimously on a valuable lesson that could be learned from the spirit of such a ban: Students should try to find the answers themselves, several pointed out, before they bother the professor, who they all (charitably) agreed would be busy with other matters. Others said it would be useful to practice reading course documents more carefully and researching answers on their own or with other peers.

As we identified potential audiences for an article championing such a ban, some responses were obvious, such as fellow professors with hectic schedules. Other responses were disconcerting. More than one student claimed their parents were a perhaps-unintended audience. Parents who foot the bill for this whole venture might be interested (disgruntled?) to discover a brick wall separating their children from the people who are paid to teach them important things.

I have no doubt the email embargo worked miracles for Duvall’s time management. Just because I find student correspondence one of the least complicated demands of the teaching profession doesn’t mean I should impose my preferences on others. And since 47 glowing course evaluations suggest that Duvall’s students not only didn’t feel cheated, but actually thought her in-person-or-by-phone-only rule made her more accessible, I won’t belabor my somewhat obvious challenge that such a policy could deter students — those, perhaps, who are at risk of doing poorly and therefore need the most encouragement — from asking questions down the line or even approaching their future professors.

But isn’t there something to be said for letting young adults — especially those enrolled in a communications course — navigate the delicate rules of student-professor etiquette on their own? For letting them fail at it even? Suppose you email about a problem your professor deems trifling. The two worst consequences are (a) no response or (b) a snippy response. In my own college days, I sent emails that at the time seemed vital but that I now recognize as self-absorbed and/or irritatingly Type A. After a few terse one-liners from professors I admired, I became a less zealous emailer.

There need not be an official ban committed in writing on a syllabus for professors to ignore or even confront messages that are petty or unprofessional. Furthermore, today’s students are attending college in the first place so they can land a job that might one day allow them to emerge from — or even to buoy — this faltering economy. Employers prize communication and collaboration skills more highly than ever, and it’s hard to imagine the 21st-century workplace functioning without people who can competently email.

Do we really want to graduate a generation of students who can’t decide for themselves what warrants pressing the send button? Or, to take this issue to its logical extreme, who think their employers should drop everything to schedule in-person conferences for matters that can be handled in one pithy sentence? If our wading through a bunch of syllabus emails can contribute to a larger discourse about the importance of good professional writing, then maybe we are — in the eyes of the public — one step closer to earning our keep as educators.

Danielle DeRise is an adjunct professor of English, literature, and writing at Piedmont Virginia Community College and James Madison University.
