Like many other scientific disciplines, psychological science has felt the impact of the big-data revolution. This impact arises from the meeting of three forces: data availability, data heterogeneity, and data analyzability. In terms of data availability, consider that for decades, researchers relied on the Brown Corpus of about one million words (Kučera & Francis, 1969). Modern resources, in contrast, are larger by six orders of magnitude (e.g., Google’s 1T corpus) and are available in a growing number of languages. About 240 billion photos have been uploaded to Facebook,¹ and Instagram receives over 100 million new photos each day.² The large-scale digitization of these data has made it possible in principle to analyze and aggregate these resources on a previously unimagined scale. Heterogeneity refers to the availability of different types of data. For example, recent progress in automatic image recognition is owed not just to improvements in algorithms and hardware, but arguably more to the ability to merge large collections of images with linguistic labels (produced by crowdsourced human taggers) that serve as training data for the algorithms. Making use of heterogeneous data sources often depends on their standardization. For example, the ability to combine demographic and grammatical data about thousands of languages led to the finding that languages spoken by more people have simpler morphologies (Lupyan & Dale, 2010). Combining these data types would have been substantially more difficult without standardized language and country codes that could be used to merge the different data sources. Finally, analyzability must be ensured, for without appropriate tools to process and analyze different types of data, the “data” are merely bytes.
See all of the papers appearing in the Big Data Special Issue of Behavior Research Methods