The Google Web 1T 5-Gram Database is a collection of frequent 5-grams extracted from approximately 1 trillion words of Web text collected by Google Research. This Web interface determines pseudo-collocations of a given node word, ranked according to one of five standard association measures.
Pseudo-collocations are surface collocations in the sense of Firth and Sinclair, i.e. salient co-occurrences within a span of up to 4 words to the left and right of the node word. Since the Web 1T 5-Gram Database does not record all contexts of a node word, co-occurrence counts have to be approximated from co-occurrences within frequent N-grams, which may introduce a certain bias towards fixed expressions. It is therefore more appropriate to speak of “pseudo-collocations” in this case. The precise numerical values of association scores should not be taken too seriously or compared to data from regular corpora. However, we expect the collocate rankings to be comparable to those obtained for full surface collocations. See Evert (2008) for a thorough discussion of surface collocations, appropriate co-occurrence counts and the association measures implemented here.
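As a rough illustration of how an association score is derived from co-occurrence counts, the following minimal sketch computes the expected co-occurrence frequency for a surface span and two widely used measures, MI and t-score. It assumes the usual surface-collocation approximation E = k · f1 · f2 / N, with k the total span size. The function names and all counts are made up for illustration. The sketch does not reproduce the implementation behind this interface, and the two measures shown are not necessarily among the five offered here.

```python
import math

def expected_cooccurrence(f_node, f_collocate, corpus_size, span_size):
    """Expected co-occurrence frequency E = k * f1 * f2 / N (assumed approximation)."""
    return span_size * f_node * f_collocate / corpus_size

def mutual_information(observed, expected):
    """MI score: base-2 logarithm of the ratio of observed to expected frequency."""
    return math.log2(observed / expected)

def t_score(observed, expected):
    """t-score: difference between observed and expected, scaled by sqrt(observed)."""
    return (observed - expected) / math.sqrt(observed)

# Hypothetical counts, NOT taken from the Web 1T data.
corpus_size = 1_000_000_000   # N: total number of tokens
f_node = 5_000                # f1: frequency of the node word
span_size = 8                 # k: 4 words to the left plus 4 to the right

# collocate -> (f2: collocate frequency, O: observed co-occurrence frequency)
candidates = {
    "alpha": (20_000, 120),
    "beta": (2_000_000, 150),
}

scores = {}
for word, (f_collocate, observed) in candidates.items():
    expected = expected_cooccurrence(f_node, f_collocate, corpus_size, span_size)
    scores[word] = {
        "MI": mutual_information(observed, expected),
        "t": t_score(observed, expected),
    }

# Rank the candidates by MI, highest score first.
ranking_by_mi = sorted(scores, key=lambda w: scores[w]["MI"], reverse=True)
print(ranking_by_mi)
```

Because the observed counts here would come from frequent N-grams only, the ranking is the more meaningful result, in line with the caveat above about the absolute score values.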
Note that case-folding and some additional normalization of the N-grams may have been performed, so that frequency counts occasionally differ from those found in the original Google data. Co-occurrence frequency data for all possible node-collocate pairs have been indexed in a SQLite database with a size of 32 gigabytes, from which they are retrieved by this Web interface. For any further questions or bug reports, please contact Stefan Evert.
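The kind of indexed lookup described above can be sketched with Python's built-in sqlite3 module. The table name `cooc`, its columns, the index name and the sample rows are all hypothetical. The actual schema of the 32-gigabyte database is not documented here, so this only illustrates the general idea of retrieving pre-computed co-occurrence counts for a case-folded node word.

```python
import sqlite3

# In-memory stand-in for the real database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cooc (node TEXT, collocate TEXT, freq INTEGER)")
# An index on the node column keeps lookups fast even for very large tables.
conn.execute("CREATE INDEX cooc_node_idx ON cooc (node)")

# Made-up sample rows, stored in lower case to mimic case-folding.
conn.executemany(
    "INSERT INTO cooc VALUES (?, ?, ?)",
    [
        ("tea", "cup", 900),
        ("tea", "green", 700),
        ("tea", "the", 5000),
        ("coffee", "cup", 1200),
    ],
)
conn.commit()

def cooccurrence_counts(connection, node, limit=10):
    """Return (collocate, frequency) pairs for a node word, most frequent first."""
    cursor = connection.execute(
        "SELECT collocate, freq FROM cooc "
        "WHERE node = ? ORDER BY freq DESC LIMIT ?",
        (node.lower(), limit),  # fold case so that 'Tea' and 'tea' match
    )
    return cursor.fetchall()

result = cooccurrence_counts(conn, "Tea", limit=2)
print(result)
```

Raw frequencies retrieved in this way would still have to be converted into association scores before ranking, since high-frequency words such as "the" otherwise dominate the list.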
Type a single Node word into the text field at the top, then push the Search button to display the most salient collocates for this node. Push Help to display this help page or Reset Form to start over. The CSV button returns a CSV table suitable for import into a spreadsheet program or database. The XML button returns the search results in an XML format, allowing this interface to be used as a Web service.
You can customise the calculation of association scores, the size of the collocational span, and the display format with the option menus below the node word: