Issues in corpus design for lexicography

Background to the study

This thesis is about corpus linguistics, specifically corpus design for lexicography (the science and art of dictionary compilation) as it relates to the Setswana language. The field of corpus linguistics is broad, covering areas such as grammatical studies, language education, sociolinguistics, phonetics, phonology, stylistic analysis, dialectology and others (Kennedy, 1998). Corpus linguistics, particularly its application to lexicography, is in its infancy in many African languages, and especially so in the language that is the focus of this thesis: Setswana.
This study focuses on Setswana corpus design whose output can serve a lexicographic purpose. It is hoped that its findings and methodologies will inspire similar designs in other African languages. In corpus research in general, the focus has been on what researchers can retrieve from corpora, including frequency information, lemma lists, dictionary example sentences and concordance lines (De Schryver, 2002: 275/6). While there is nothing wrong with such studies, what is lacking in the literature is detailed, in-depth research on corpus design, particularly for African languages. The gap is particularly worrying because the quality of corpus output is dependent on corpus design. Few corpus designs have been documented.

Statement of the research problem

Corpus use is not common in many dictionary projects in African languages, Setswana included. The larger body of corpus research is on corpus usage and rarely on corpus design. There is no research that focuses on the design of Setswana language corpora. At a practical lexicographic level, the production of dictionaries in various African languages has been very low, particularly when compared with dictionary compilation in English by publishing houses such as Oxford University Press, Longman, Webster, COBUILD (The Collins Birmingham University International Language Database) and Chambers.

Table of contents:

  • Summary
  • Declaration
  • Acknowledgements
  • Abbreviations
  • Table of contents
  • List of tables
  • List of figures
  • Chapter 1 Introduction
    • 1.1 Background to the study
    • 1.2 Statement of the research problem
    • 1.3 Clarifying terms: genre, text type and varieties
    • 1.4 Methodology
    • 1.5 Aims of the study
    • 1.6 Research goals
    • 1.7 Exposition of chapters
  • Chapter 2 The Setswana Language
    • 2.1 The Botswana language situation
    • 2.2 The Setswana language
    • 2.3 Setswana dialects
      • 2.3.1 The village, cattlepost, lands and city language
    • 2.4 Domains of Setswana language use
    • 2.5 Text categories
    • 2.6 Challenges of multilingualism and diglossia
    • 2.7 The poverty of data
      • 2.7.1 The Sanitised Data
    • 2.8 Setswana language research
      • 2.8.1 A historical overview
      • 2.8.2 The development of Setswana lexicography
        • 2.8.2.1 Lexicographic tradition
    • 2.9 Conclusion
  • Chapter 3 Corpus Lexicography
    • 3.1 Introduction
    • 3.2 What is a corpus?
    • 3.3 Web as corpus
    • 3.4 Frequency profiling: frequency and type/token
      • 3.4.1 Frequency counts
      • 3.4.2 Type/token and word counts
    • 3.5 Relevance of corpora to lexicography
    • 3.6 Some pre-electronic frequency studies
    • 3.7 Electronic corpora studies
      • 3.7.1 An example of frequency profiling
    • 3.8 Keyword analysis
    • 3.9 Business keywords
    • 3.10 Concordance
  • Chapter 4 Issues in corpus design for lexicography
    • 4.1 Introduction
    • 4.2 Balance and representativeness
      • 4.2.1 Proponents of balance and representativeness
      • 4.2.2 A cautious approach to balance and representativeness
    • 4.3 Corpus annotation
    • 4.4 Sample size
    • 4.5 Brown Corpus and BNC review
      • 4.5.1 The Brown Corpus
      • 4.5.2 The BNC review
        • 4.5.2.1 The BNC design criteria
        • 4.5.2.2 The BNC written component
        • 4.5.2.3 The BNC spoken component
    • 4.6 The exploration of both corpora
    • 4.7 Conclusion
  • Chapter 5 The Setswana corpus compilation
    • 5.1 Introduction
    • 5.2 The design strategy
    • 5.3 Overall corpus statistics
    • 5.4 The Zipfian distribution
    • 5.5 Corpus components
      • 5.5.1 Text types in the corpus
    • 5.6 The compilation of corpus components
      • 5.6.1 Spoken language component compilation
        • i. Sampling
        • ii. Recording
        • iii. Transcription
      • 5.6.2 Compiling the written language component
        • i. Sampling
      • 5.6.3 Spoken language ethical matters
      • 5.6.4 Written language ethical matters
    • 5.7 Conclusion
  • Chapter 6 Measuring text type diversity
    • 6.1 Introduction
    • 6.2 Keyword analysis
      • 6.2.1 Keyword analysis of written components of the Setswana corpus
      • 6.2.2 Keyword analysis of spoken components of the Setswana corpus
    • 6.3 Conclusion to keyword analysis
  • Chapter 7 Type/token measures of corpus chunks
    • 7.1 Type/token measures
      • 7.1.1 The Mean calculation
      • 7.1.2 Confidence Interval (CI) calculation
      • 7.1.3 Standard deviation
    • 7.2 Text divisions for experiments
      • 7.2.1 Newspaper Components type/token
    • 7.3 Conclusion of type/token measurements
    • 7.4 A comparison of the top 100 tokens
  • Chapter 8 Conclusion and future work
    • 8.1 Future research and applications
  • Bibliography