Methodology

The project pipeline moves from raw archival text to structured data, computational analysis, and spatial visualization in three linked stages.

Dataset construction transforms raw Chronicling America OCR into a clean, structured, geographically annotated corpus of 1,535 records. This involves geographic metadata standardization, multi-layer OCR cleaning, and passage extraction to isolate China-related content from full newspaper pages.
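As a rough illustration of the passage-extraction idea, the sketch below keeps only the paragraphs of an OCR page that mention China, plus a little surrounding context. It is a minimal sketch under stated assumptions, not the project's actual extraction code: the keyword pattern, the paragraph-splitting rule, and the `context` window are all hypothetical choices made for the example.

```python
import re
from typing import List

# Hypothetical keyword pattern; the project's real term list is not given here.
KEYWORDS = re.compile(r"\b(china|chinese)\b", re.IGNORECASE)


def extract_passages(page_text: str, context: int = 1) -> List[str]:
    """Return runs of paragraphs that mention a keyword, with `context`
    neighbouring paragraphs kept on each side (an assumed window size)."""
    # Treat blank lines as paragraph breaks (an assumption about the OCR layout).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]

    # Collect indices of matching paragraphs and their neighbours.
    keep = set()
    for i, paragraph in enumerate(paragraphs):
        if KEYWORDS.search(paragraph):
            lo = max(0, i - context)
            hi = min(len(paragraphs), i + context + 1)
            keep.update(range(lo, hi))

    # Merge contiguous indices into single passages.
    passages: List[str] = []
    current: List[str] = []
    previous = None
    for i in sorted(keep):
        if previous is not None and i != previous + 1:
            passages.append("\n\n".join(current))
            current = []
        current.append(paragraphs[i])
        previous = i
    if current:
        passages.append("\n\n".join(current))
    return passages


page = (
    "Local news.\n\n"
    "The Chinese question was debated.\n\n"
    "Weather report.\n\n"
    "Crop prices.\n\n"
    "Shipping notes."
)
print(extract_passages(page, context=0))
```

A wider `context` keeps more of the surrounding column at the cost of more unrelated text; the right trade-off depends on how the pages were segmented.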

Analytical methods covers topic modeling with MALLET and the construction of Datawrapper charts. Two parallel LDA models, one trained on the deduplicated corpus of 1,100 documents and one on the full corpus of 1,525 documents, were each fit with K = 25 topics and then compared to trace how the nineteenth-century reprinting network reshaped discourse.
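One simple way to compare two topic models is to match topics by the overlap of their top words. The sketch below is an illustrative assumption, not the comparison method used in the project: it takes each model as a mapping from topic ID to a list of top words (for instance, as read from MALLET's topic-keys output) and pairs every topic in one model with its closest counterpart in the other using Jaccard similarity. The toy topics and word lists are invented for the example.

```python
from typing import Dict, List, Tuple


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity between two word lists, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_matches(
    model_a: Dict[int, List[str]], model_b: Dict[int, List[str]]
) -> Dict[int, Tuple[int, float]]:
    """For each topic in model_a, find the most similar topic in model_b."""
    matches = {}
    for topic_a, words_a in model_a.items():
        topic_b, words_b = max(
            model_b.items(), key=lambda item: jaccard(words_a, item[1])
        )
        matches[topic_a] = (topic_b, jaccard(words_a, words_b))
    return matches


# Toy data standing in for the top words of two trained models (hypothetical).
dedup_model = {
    0: ["chinese", "labor", "railroad", "wages"],
    1: ["treaty", "emperor", "peking", "minister"],
}
full_model = {
    0: ["treaty", "minister", "peking", "court"],
    1: ["chinese", "labor", "wages", "coolie"],
}

matches = best_matches(dedup_model, full_model)
for topic_a, (topic_b, score) in matches.items():
    print(f"dedup topic {topic_a} ~ full topic {topic_b} (Jaccard {score:.2f})")
```

Topics with no close counterpart in the other model are the interesting cases here, since they may indicate themes that reprinting amplified or suppressed.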

Map construction explains how the corpus was connected to historical geography. Newspaper records were plotted by publication city, aligned with 1882 county and state boundaries, and linked to topic categories and selected historical events for interactive exploration.

In This Section

Dataset Construction: corpus assembly, geographic standardization, OCR cleaning, excerpt extraction
Analytical Methods: MALLET topic modeling, Datawrapper chart preparation
Map Construction: historical boundary data, geographic resolution, topic assignment, event overlays
