The Google Web 1T 5-Gram Database is a collection of frequent 5-grams extracted from approximately 1 trillion words of Web text collected by Google Research. This Web interface allows you to search an indexed version of the database for collocational patterns such as carrying * to * (where * marks collocate positions) and rank them by association strength, using one of four standard association measures. Click on the Frequency list tab at the top of this page for simple frequency rankings with more flexible display options.
Association scores are calculated between each set of collocates (e.g. coals, newcastle) and the fixed constraint terms (carrying, to in the example above). Due to the nature of Google's N-gram database, these scores are only rough approximations and their precise numerical values should not be taken too seriously. Case-folding and some additional normalization of the N-grams have also been performed, so the frequency counts reported in the result tables may occasionally be different from those found in the original Google data. The normalized N-grams have been indexed in several SQLite databases with a total size of 180 gigabytes. For any further questions or bug reports, please contact Stefan Evert.
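To make the idea of an association score more concrete, the sketch below computes three measures that are common in the collocation literature. This page does not say which four measures the interface offers or exactly how it derives the frequencies, so the choice of measures, the function name and all argument names are assumptions made for illustration only.

```python
import math


def association_scores(o11, f1, f2, n):
    """Score one set of collocates against the fixed constraint terms.

    All parameter names are hypothetical:
    o11: how often the collocates occur together with the fixed terms
    f1:  overall frequency of the collocates
    f2:  overall frequency of the fixed constraint terms
    n:   sample size
    """
    # Expected co-occurrence frequency if collocates and fixed terms
    # were statistically independent.
    e11 = f1 * f2 / n
    return {
        "MI": math.log2(o11 / e11),               # pointwise mutual information
        "t-score": (o11 - e11) / math.sqrt(o11),  # t-score
        "Dice": 2 * o11 / (f1 + f2),              # Dice coefficient
    }


scores = association_scores(o11=100, f1=1000, f2=2000, n=1_000_000)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Different measures can rank the same collocates quite differently, which is one more reason to treat the scores as rough indicators rather than exact values.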
The search pattern consists of up to 5 terms, which represent the elements of an N-gram and must be separated by blanks. Collocate positions are marked by *.
All other terms in the search pattern specify constraints for the "fixed" part of the pattern. Our database query engine supports four different types of constraint terms:
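a literal word (e.g. carrying)
a list of alternative words in square brackets (e.g. [mouse,mice] → mouse, mice)
a word containing the wildcard % to stand for an arbitrary substring (e.g. %erati → maserati, literati, glitterati, ...)
? indicates a skipped token (i.e. an arbitrary word which is not a collocate)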
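The pattern syntax described above can be illustrated with a small classifier. This is only a sketch of how a client might check a pattern before submitting it, not the query engine's actual parser; the function names and category labels are hypothetical, and treating every remaining term as a literal word is an assumption based on the fixed terms carrying and to in the example pattern.

```python
def classify_term(term):
    """Return a (hypothetical) label for one term of a search pattern."""
    if term == "*":
        return "collocate"   # collocate position
    if term == "?":
        return "skip"        # skipped token, not a collocate
    if term.startswith("[") and term.endswith("]"):
        return "set"         # alternatives such as [mouse,mice]
    if "%" in term:
        return "wildcard"    # % stands for an arbitrary substring
    return "literal"         # assumption: anything else is a literal word


def parse_pattern(pattern):
    """Split a pattern on blanks and label each term."""
    terms = pattern.split()
    if not 1 <= len(terms) <= 5:
        raise ValueError("a search pattern consists of 1 to 5 terms")
    return [(term, classify_term(term)) for term in terms]


for term, kind in parse_pattern("met ? * [man,woman]"):
    print(term, kind)
```

Because terms are separated by blanks, a bracketed set has to be written without spaces, as in [man,woman].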
Push the Search button to execute your query, Help to display this help page, or Reset Form to start over from scratch. The CSV button returns a CSV table suitable for import into a spreadsheet program or database. The XML button returns the search results in an XML format, allowing this interface to be used as a Web service.
You can customise the calculation of association scores and the display format with the option menus below the search pattern:
The examples below include comments starting with //, which must not be entered in the search pattern field.
interesting * // what are people most interested in?
* violin // '*' at the start of a query is much slower
met ? * [man,woman] // use '?' to skip determiner etc.
[enjoy,enjoys] ? * // what do people enjoy?
carrying * to * // which measures find the expected result?
%ization ? * health // use with "grouped" display
from * to * // a classic of Googleology
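Since the // comments must not be entered in the search pattern field, anyone pasting these examples from a script may want to strip them first. A minimal sketch, with a hypothetical helper name:

```python
def strip_comment(example):
    """Drop a trailing // comment and surrounding blanks from an example."""
    return example.split("//", 1)[0].strip()


print(strip_comment("carrying * to * // which measures find the expected result?"))
```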