Corpus development workshop

This is the homepage for the hybrid workshop "Digital corpus building for humanities research: from data collection, to annotation, exploitation and sharing", which will take place online (Zoom link) and in person at Kröpeliner Str. 57, Jakobi-Passage, Seminarraum 8.

The workshop will cover all the steps needed to create an annotated textual corpus, paying special attention to the use of standards that enhance the sustainability and
interoperability of this type of resource. To download the supporting materials (slides, datasets, etc.), click the green button labelled Code.

The event will take place from 30.05.2023 to 01.06.2023, each day from 9:00 am to 1:00 pm.

Program outline

Tuesday May 30th

9:00-9:15 Welcome and introduction
9:15-10:15 Explore your computer. Introduction to the command line
10:15-10:30 Data collection: presentation of required software (cURL and OpenRefine)
10:30-10:50 Software installation + Break
10:50-11:40 Introduction to web scraping: fetching webpages with cURL
11:40-11:50 Break
11:50-12:45 Introduction to HTML and markup languages
12:45-13:00 Wrap-up and Q&A

Wednesday May 31st

9:00-9:15 Overview of the contents covered in the first session
9:15-10:30 Introduction to web scraping with OpenRefine
10:30-10:35 Presentation of required software (Visual Studio Code)
10:35-10:50 Software installation + Break
10:50-11:50 Cleaning data with regular expressions
11:50-12:00 Break
12:00-12:45 Introduction to data modelling and data standardisation
12:45-13:00 Wrap-up and Q&A

Thursday June 1st

9:00-9:15 Overview of the contents covered in the previous sessions
9:15-10:30 Introduction to TEI
10:30-10:35 Data analysis: presentation of required software (TXM)
10:35-10:50 Software installation + Break
10:50-11:40 Creating and annotating corpora with TXM; using the command line for simple textual analysis
11:40-11:50 Break
11:50-12:45 Querying corpora with TXM
12:45-13:00 Wrap-up and Q&A
