The Limitless Potential of Big Data and Linguistics

Next time you receive a perfectly cheeky answer from Siri after asking her if she loves you, you might want to consider sending a thank-you card to Mark Liberman, professor of linguistics at the University of Pennsylvania.

Liberman is on a quest to make linguistic data more accessible to the public, which could help voice recognition software become even smarter in the future. But the shift he sees in linguistic research could also help those in a range of other disciplines – from teachers to CEOs – due to the speed with which algorithms can now analyse human speech.

At the British Academy and Philological Society’s panel discussion “Language, Linguistics and the Data Explosion,” Liberman illustrated how easy big data makes analysing human speech by performing a five-minute statistical analysis of “ums” and “uhs” in 2,500 hours of recorded and transcribed telephone conversations that are available online and classified by age and gender. Liberman quickly found that “uhs” performed better, especially among women.
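For readers curious what a quick count of this kind might look like, here is a minimal Python sketch. It is not Liberman's actual method or data: the three sample turns, the age bands and the `filler_rates` helper are all made up for illustration, and a real study would read the transcripts and speaker metadata from a published corpus.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy data: (speaker gender, age band, transcribed turn).
# A real analysis would load thousands of hours of transcripts instead.
SAMPLE_TURNS = [
    ("F", "20-39", "Um, I think we should, uh, leave early."),
    ("M", "40-59", "Uh, well, uh, that depends."),
    ("F", "40-59", "Um, yes. Um, probably."),
]

FILLER = re.compile(r"\b(um|uh)\b", re.IGNORECASE)
WORD = re.compile(r"[A-Za-z']+")


def filler_rates(turns):
    """Return {(gender, age): {"um": rate, "uh": rate}} per 100 words."""
    counts = defaultdict(Counter)  # filler counts per speaker group
    words = Counter()              # total word count per speaker group
    for gender, age, text in turns:
        group = (gender, age)
        words[group] += len(WORD.findall(text))
        for match in FILLER.findall(text):
            counts[group][match.lower()] += 1
    # Skip groups with no words to avoid dividing by zero.
    return {
        group: {f: 100 * counts[group][f] / words[group] for f in ("um", "uh")}
        for group in words
        if words[group]
    }


if __name__ == "__main__":
    for group, rates in sorted(filler_rates(SAMPLE_TURNS).items()):
        print(group, rates)
```

Normalising by the number of words spoken, rather than comparing raw counts, keeps a talkative group from looking like it hesitates more simply because it said more.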

Breakneck Speed, Vast Resources

Linguistic data used to take months or years to collect and investigate, but today’s sophisticated algorithms make data sets easily scannable in mere minutes. “Any bright undergraduate with an internet connection can access and interpret the primary data,” Liberman said.

“We can now observe linguistic patterns across time, space and cultural contexts on a scale three to six orders of magnitude greater than in the past,” Liberman told the crowd.

Social media platforms like Twitter offer a wealth of potential data about the distribution of phrases, words and dialects (like “You all” vs. “Y’all”), while online databases allow greater access to raw material.

A New Scientific Instrument

The potential applications are mind-boggling to consider. Linguistic data could be used by programmers working on language software; by teachers who want to find out which parts of their students’ vocabulary are lacking; and by businesses that would like to evaluate crowd sentiment about their products.

SmogFarm, a crowd research company, uses linguistics to analyse the psychological state of users on social media platforms such as Twitter. Being able to quickly and easily chart the feelings of a crowd could help political candidates, journalists, even stockbrokers, according to the company.

“From the perspective of a linguist, today’s vast archives of digital text and speech, along with new analysis techniques and inexpensive computation, look like a wonderful new scientific instrument,” Liberman wrote in an article in the Guardian.

The potential of big data for linguistic research is truly limitless. Not only does linguistic data give researchers a vast window through which to explore language, it also helps those in other disciplines find solutions for better communication and understanding.