Digital corpus building for humanities research: from data collection to annotation, exploitation and sharing
This is the homepage for the hybrid workshop "Digital corpus building for humanities research: from data collection to annotation, exploitation and sharing", which will take place online (Zoom link) and in person at Kröpeliner Str. 57, Jakobi-Passage, Seminarraum 8.
The workshop will address all the steps needed to create an annotated textual corpus, paying special attention to the use of standards that enhance the sustainability and interoperability of this type of resource. Download the supporting materials (slides, datasets, etc.) here by clicking on the green button labelled Code.
The event will take place from 30.05.2023 to 01.06.2023, from 9:00 am to 1:00 pm each day.
Program outline
Tuesday May 30th
- 9:00-9:15 Welcome and introduction
- 9:15-10:15 Explore your computer. Introduction to the command line
- 10:15-10:30 Data collection: presentation of required software (cURL and OpenRefine)
- 10:30-10:50 Software installation + Break
- 10:50-11:40 Introduction to web scraping: fetching webpages with cURL
- 11:40-11:50 Break
- 11:50-12:45 Introduction to HTML and markup languages
- 12:45-13:00 Wrap-up and Q&A
Wednesday May 31st
- 9:00-9:15 Overview of the content covered in the first session
- 9:15-10:30 Introduction to web scraping with OpenRefine
- 10:30-10:35 Presentation of required software (Visual Studio Code)
- 10:35-10:50 Software installation + Break
- 10:50-11:50 Cleaning data with regular expressions
- 11:50-12:00 Break
- 12:00-12:45 Introduction to data modelling and data standardisation
- 12:45-13:00 Wrap-up and Q&A
Thursday June 1st
- 9:00-9:15 Overview of the content covered in the previous sessions
- 9:15-10:30 Introduction to TEI
- 10:30-10:35 Data analysis: presentation of required software (TXM)
- 10:35-10:50 Software installation + Break
- 10:50-11:40 Creating and annotating corpora with TXM; using the command line for simple textual analysis
- 11:40-11:50 Break
- 11:50-12:45 Querying corpora with TXM
- 12:45-13:00 Wrap-up and Q&A