Online CQP Demos
The DICKENS corpus is a collection of novels by Charles Dickens, including
A Christmas Carol, David Copperfield, Dombey and Son, Great Expectations, Hard Times,
Master Humphrey's Clock, Nicholas Nickleby, Oliver Twist, Our Mutual Friend, Sketches by Boz,
A Tale of Two Cities, The Old Curiosity Shop, The Pickwick Papers, and Three Ghost Stories.
This corpus amounts to a total of 3.4 million running words. It has been part-of-speech tagged
and lemmatised with the TreeTagger,
using its standard English parameter file.
In addition, noun phrases (NP) and prepositional phrases (PP) were annotated with a
parser developed by Helmut Schmid.
The BUNDESTAG corpus contains Hansards of the German Parliament (Bundestag)
from the parliamentary term running from 1994 to 1997.
This corpus amounts to a total of 5.7 million running words. It has been
annotated with a rich variety of linguistic information. The token-level annotations comprise
part-of-speech tags (TreeTagger),
lemmata, and morpho-syntactic information (both IMSLex).
In addition, a partial phrase-structure analysis was performed with a tool
developed by Hannah Kermes.
The EUROPARL parallel corpus contains proceedings of the European Parliament from the years 1996–2003. This Web interface gives access to EUROPARL version 3, which is distributed as part of the OPUS collection of freely available parallel corpora. The Web interface covers six languages (English, German, French, Spanish, Italian and Dutch), with close to 40 million words in each language and full pairwise alignments. The texts have been POS-tagged and lemmatised with the IMS TreeTagger. Some meta-information on dates and speakers is also included. The EUROPARL corpus was originally compiled and sentence-aligned by Philipp Koehn; an improved and extended version (running up to 2009) can be downloaded from the Europarl corpus homepage.
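To make "full pairwise alignments" concrete: with six languages, every unordered pair of distinct languages is aligned, which gives 15 language pairs. The following is only an illustrative sketch of that count; the `LANGUAGES` list and the `language_pairs` helper are hypothetical names introduced here and are not part of CWB, CQP, or the EUROPARL distribution.

```python
from itertools import combinations

# The six languages named in the corpus description above.
# (Hypothetical constant, used only for this illustration.)
LANGUAGES = ["English", "German", "French", "Spanish", "Italian", "Dutch"]


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))  # 15
for first, second in pairs:
    print(f"{first} <-> {second}")
```

Each pair is listed once, so English <-> German appears but German <-> English does not; an alignment can be read in either direction.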