Dataset Reference

Overview

The corpus contains 1,535 records drawn from 323 distinct newspaper titles published across the United States between January 1880 and December 1885. Each record represents one newspaper page on which relevant material was identified through keyword search. Records are sourced from Chronicling America: Historic American Newspapers, a digital newspaper collection hosted by the Library of Congress and jointly sponsored with the National Endowment for the Humanities through the National Digital Newspaper Program (NDNP). Persistent URLs to the original page images on www.loc.gov are retained for every record.

Corpus composition

Dimension               Value
Total records           1,535
Unique newspapers       323
Date range              1880-01-01 to 1885-12-31
Original articles       786
Reprints                749
Used in topic modeling  1,107

Records by keyword

Keyword            Count
Chinese student    482
Chinese boy        289
Chinese children   265
Chinese girl       227
Chinese school     168
Chinese child      81
Chinese education  23

Records by region

Region     Count
West       469
South      420
Midwest    394
Northeast  251

Records by year

Year  Count
1880  164
1881  379
1882  212
1883  200
1884  261
1885  319
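
The breakdowns above can be checked against the stated corpus total with simple arithmetic. The sketch below copies the counts from the tables as printed; the helper name gap is illustrative and not part of the dataset.

```python
# Counts copied from the composition tables as printed in this reference.
TOTAL_RECORDS = 1535

by_keyword = {
    "Chinese student": 482,
    "Chinese boy": 289,
    "Chinese children": 265,
    "Chinese girl": 227,
    "Chinese school": 168,
    "Chinese child": 81,
    "Chinese education": 23,
}
by_region = {"West": 469, "South": 420, "Midwest": 394, "Northeast": 251}
by_year = {1880: 164, 1881: 379, 1882: 212, 1883: 200, 1884: 261, 1885: 319}


def gap(counts, total=TOTAL_RECORDS):
    """Return total minus the sum of a breakdown (0 means the two agree)."""
    return total - sum(counts.values())


for name, table in [("keyword", by_keyword), ("region", by_region), ("year", by_year)]:
    print(f"{name}: sum={sum(table.values())}, gap={gap(table)}")
```

With the figures as printed, the keyword and year breakdowns each sum to 1,535, while the region breakdown sums to 1,534, one fewer than the corpus total.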

Column Descriptions

Bibliographic metadata

Keyword: The search term used to retrieve the page. One of seven values: Chinese student, Chinese boy, Chinese girl, Chinese child, Chinese children, Chinese school, Chinese education. Tracking the search term supports analysis of how different queries shape what the corpus contains.

Date: Publication date in YYYY-MM-DD format.

Newspaper_Name: Title of the newspaper as recorded in the Chronicling America database.

Pub_City: City of publication.

Pub_State: State (or historical territory) of publication.

Coverage_Region: Additional geographic references associated with the article or newspaper, separated by vertical bars (e.g., Richland | Greenville | Columbia). Captures spatial scope beyond the publication city.

region_bin: Normalized Census region derived from publication location: West, South, Midwest, or Northeast.

Image_Number: The specific page image within the issue (e.g., Image 7).

Page_URL: Persistent link to the digitized page on the Library of Congress Chronicling America platform.
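
Two of the fields above pack structure into a string: Coverage_Region is a vertical-bar-separated list, and Image_Number embeds an integer. A minimal parsing sketch follows; the function names are illustrative, and it assumes that empty cells arrive as empty strings or None.

```python
import re


def split_pipe_list(value):
    """Split a vertical-bar-separated field into trimmed, non-empty items."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def image_number(value):
    """Extract the integer from a string such as 'Image 7'; None if no digits."""
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else None


print(split_pipe_list("Richland | Greenville | Columbia"))
print(image_number("Image 7"))
```

Because items are stripped after splitting, the same helper handles both the spaced style shown for Coverage_Region and a compact style with no spaces around the bars.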

Content fields

OCR_cleaned: A cleaned excerpt containing the passage relevant to Chinese children or schooling. OCR noise has been reduced and line breaks normalized.

model_text: Version of the excerpt used as input to topic modeling. May differ slightly from OCR_cleaned due to additional preprocessing.

token_count: Approximate word count of the cleaned excerpt.

relevance_tier: Manual classification of how directly the page discusses Chinese children, schooling, or related debates. Values: core (1,426 records) or secondary (109 records).

topic_tags: Thematic tags applied during corpus review, separated by vertical bars (e.g., education|school|student). Support qualitative filtering and thematic search.

doc_id: Unique document identifier within the corpus (format: DOC_NNNNNN).
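
The stated formats lend themselves to light validation and filtering. The sketch below assumes that each N in DOC_NNNNNN stands for one decimal digit and that records are loaded as dictionaries keyed by column name; the helper names are illustrative. The sample record reuses values from the example record in this reference.

```python
import re

# Assumption: DOC_NNNNNN means the literal prefix "DOC_" plus six digits.
DOC_ID_PATTERN = re.compile(r"DOC_\d{6}")
RELEVANCE_TIERS = {"core", "secondary"}


def is_valid_doc_id(doc_id):
    """True if doc_id matches the assumed DOC_NNNNNN layout exactly."""
    return DOC_ID_PATTERN.fullmatch(doc_id) is not None


def has_tag(record, tag):
    """True if tag appears in the record's vertical-bar-separated topic_tags."""
    tags = [t.strip() for t in record.get("topic_tags", "").split("|")]
    return tag in tags


record = {
    "doc_id": "DOC_000606",
    "relevance_tier": "core",
    "topic_tags": "child|children|general_chinese_reference|language|school",
}

print(is_valid_doc_id(record["doc_id"]))
print(record["relevance_tier"] in RELEVANCE_TIERS)
print(has_tag(record, "school"))
```

Matching whole tags after splitting, rather than searching for a substring, keeps a query for child from also matching children.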

Deduplication and reprint tracking

Nineteenth-century newspapers frequently reprinted articles from other papers. These fields track duplicated material.

is_reprint: Boolean (true/false). Marks whether this record is a reprinted version of an earlier article.

is_original: Boolean. Marks the earliest known appearance of a text within a reprint chain.

reprint_count: Total number of records in the same duplicate group, including the original.

duplicate_group: Identifier linking records that share substantially similar text (e.g., REP_0374). Empty if no duplicate was detected.

sim_score: Similarity score used during fuzzy deduplication.

chain_position: Position of this record within the reprint chain (1 = earliest).

propagation_chain: Ordered list of doc_id values in the reprint chain, from original to latest reprint.

dedup_text: Normalized lowercase version of the excerpt used during deduplication processing.

model_text_deduped: Version of the text used for topic modeling in the deduplicated corpus.
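
The relationship among these fields can be illustrated by deriving them from duplicate_group and Date. This is a sketch under stated assumptions, not the procedure used to build the corpus: chain order is taken to be publication date with doc_id as a tie-breaker, and records with an empty duplicate_group are treated as one-member chains. The doc_id values and dates in the sample are hypothetical.

```python
from collections import defaultdict


def annotate_reprint_chains(records):
    """Fill chain fields in place, grouping records by duplicate_group."""
    groups = defaultdict(list)
    for rec in records:
        # Assumption: a record with no duplicate_group forms its own chain.
        key = rec.get("duplicate_group") or ("SINGLE", rec["doc_id"])
        groups[key].append(rec)

    for members in groups.values():
        # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
        members.sort(key=lambda r: (r["Date"], r["doc_id"]))
        chain = [r["doc_id"] for r in members]
        for position, rec in enumerate(members, start=1):
            rec["chain_position"] = position
            rec["reprint_count"] = len(members)
            rec["is_original"] = position == 1
            rec["is_reprint"] = position > 1
            rec["propagation_chain"] = chain
    return records


# Hypothetical records; REP_0374 is the example identifier quoted above.
sample = [
    {"doc_id": "DOC_000002", "Date": "1881-07-20", "duplicate_group": "REP_0374"},
    {"doc_id": "DOC_000001", "Date": "1881-07-12", "duplicate_group": "REP_0374"},
    {"doc_id": "DOC_000003", "Date": "1882-03-01", "duplicate_group": ""},
]
annotate_reprint_chains(sample)
for rec in sample:
    print(rec["doc_id"], rec["chain_position"], rec["reprint_count"], rec["is_original"])
```

Under these assumptions every record is either an original or a reprint, which is consistent with the composition table, where original articles (786) and reprints (749) together account for all 1,535 records.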

Topic modeling flags

use_for_mallet: Final selection flag for the MALLET input corpus (yes/no). 1,107 records are marked yes.

mallet_type: Distinguishes full (all-corpus run) from deduped (deduplicated run) corpus membership.

mallet_ready_text: Final cleaned text passed to MALLET.

mallet_rank: Ordering variable used when exporting the MALLET document set.
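
As an illustration of how these flags might be used together, the sketch below selects flagged records, orders them by mallet_rank, and emits one document per line as name, label, and text separated by tabs, a layout that MALLET's import-file command can read by default. Using mallet_type as the label and treating mallet_rank as an integer are assumptions, not documented behaviour of the export, and the sample records are hypothetical.

```python
def mallet_lines(records):
    """Return one 'name<TAB>label<TAB>text' line per record flagged yes."""
    selected = [
        r for r in records
        if str(r.get("use_for_mallet", "")).strip().lower() == "yes"
    ]
    # Assumption: mallet_rank holds integer-like values.
    selected.sort(key=lambda r: int(r["mallet_rank"]))

    lines = []
    for r in selected:
        # Collapse internal whitespace so each document stays on one line.
        text = " ".join(r["mallet_ready_text"].split())
        lines.append(f"{r['doc_id']}\t{r['mallet_type']}\t{text}")
    return lines


# Hypothetical records for illustration only.
sample = [
    {"doc_id": "DOC_000002", "use_for_mallet": "yes", "mallet_type": "full",
     "mallet_rank": "2", "mallet_ready_text": "chinese school\nteacher pupils"},
    {"doc_id": "DOC_000003", "use_for_mallet": "no", "mallet_type": "full",
     "mallet_rank": "3", "mallet_ready_text": "not selected"},
    {"doc_id": "DOC_000001", "use_for_mallet": "yes", "mallet_type": "deduped",
     "mallet_rank": "1", "mallet_ready_text": "students recalled hartford"},
]
lines = mallet_lines(sample)
print("\n".join(lines))
```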

Temporal and geographic bins

month: Integer month of publication (1–12).

year_month: Year and month in YYYY-MM format. Used for time-series aggregation.

time_bin: Four-digit year as a string. Used for coarser temporal grouping.
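
All three bins follow directly from the Date column. A minimal sketch, assuming Date is always a valid YYYY-MM-DD string (the function name is illustrative):

```python
from datetime import date


def temporal_bins(date_str):
    """Derive month, year_month, and time_bin from a YYYY-MM-DD date string."""
    d = date.fromisoformat(date_str)
    return {
        "month": d.month,                            # integer 1-12
        "year_month": f"{d.year:04d}-{d.month:02d}",  # e.g. 1885-05
        "time_bin": f"{d.year:04d}",                  # four-digit year as a string
    }


print(temporal_bins("1885-05-28"))
# {'month': 5, 'year_month': '1885-05', 'time_bin': '1885'}
```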

Topic Taxonomy

The following table lists all non-noise topics identified across both model runs (S1: full corpus K25_S1; S2: deduplicated corpus K25_S2), with their thematic categories and analytic labels. Topic IDs correspond to the topic_id values used in the spatial map and dataset browser.

Chinese Educational Mission

Topic ID          Run  Weight  Analytic Label
deduped_topic_17  S2   0.2963  Chinese Educational Mission
all_topic_24      S1   0.1328  CEM: Government Policy & Institutional Recall
all_topic_14      S1   0.0411  CEM: Student Lives & Personal Narratives
all_topic_0       S1   0.0280  CEM: Students & Cross-Cultural Marriage
all_topic_17      S1   0.0205  CEM: Political Controversy

Education & Schools

Topic ID          Run  Weight  Analytic Label
deduped_topic_7   S2   0.1286  Classroom Instruction
deduped_topic_8   S2   0.1284  Public School Admission
deduped_topic_13  S2   0.1144  Missionary & Church Schools
all_topic_3       S1   0.0778  Classroom Instruction
all_topic_2       S1   0.0703  Public School Admission
all_topic_16      S1   0.0789  Missionary & Church Schools
all_topic_18      S1   0.0331  Mission School Directories & Schedules

Children & Family

Topic ID          Run  Weight  Analytic Label
deduped_topic_18  S2   0.2734  Children & Family Life
all_topic_6       S1   0.1107  Children & Family Life
deduped_topic_6   S2   0.0218  Confucian Family Ethics
all_topic_8       S1   0.0249  Confucian Family Ethics
all_topic_12      S1   0.0228  Childhood Conditions & Moral Commentary

Law, Politics & Exclusion

Topic ID          Run  Weight  Analytic Label
deduped_topic_16  S2   0.0763  Criminal Cases & Court Proceedings
deduped_topic_24  S2   0.0594  Exclusion Legislation
all_topic_1       S1   0.0363  Exclusion Legislation
all_topic_5       S1   0.0238  Criminal Cases & Court Proceedings (Trials)
all_topic_19      S1   0.0564  Criminal Cases & Court Proceedings (Police)

Violence & War

Topic ID          Run  Weight  Analytic Label
deduped_topic_22  S2   0.0426  Anti-Chinese Violence
deduped_topic_20  S2   0.0336  Sino-French War
all_topic_10      S1   0.0266  Anti-Chinese Violence

Commerce & Material Culture

Topic ID          Run  Weight  Analytic Label
deduped_topic_21  S2   0.0201  East Asian Consumer Goods
deduped_topic_14  S2   0.0327  Social Dining & Interior Spaces
deduped_topic_10  S2   0.0385  Clothing & Physical Description
all_topic_23      S1   0.0200  East Asian Consumer Goods
all_topic_21      S1   0.0249  Trade in Chinese Goods & Furnishings

Daily Life & Urban Space

Topic ID          Run  Weight  Analytic Label
deduped_topic_9   S2   0.1926  Routine Press Reporting
deduped_topic_23  S2   0.0375  Chinatown Spatial Narratives
deduped_topic_0   S2   0.0322  Domestic Employment
all_topic_22      S1   0.0172  Domestic Employment

Land, Migration & Labor

Topic ID          Run  Weight  Analytic Label
deduped_topic_15  S2   0.0694  Hawaii & Pacific Migration
deduped_topic_11  S2   0.0151  Agriculture & Land Use
all_topic_11      S1   0.0794  Hawaii & Pacific Migration

Culture, Perception & Acculturation

Topic ID          Run  Weight  Analytic Label
deduped_topic_4   S2   0.0192  Public Gatherings & Lectures
deduped_topic_12  S2   0.0232  Opium & Moral Degradation
all_topic_20      S1   0.0166  Physical Appearance & Curiosity Narratives
all_topic_15      S1   0.0149  Language Learning & Cultural Adjustment

Diplomacy

Topic ID          Run  Weight  Analytic Label
all_topic_9       S1   0.0464  Diplomacy & Ceremonial Events
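
In the taxonomy tables, every topic ID beginning with all_ belongs to run S1 and every ID beginning with deduped_ belongs to run S2. The sketch below relies on that observed pattern to split a topic ID into its run and topic number; the function and constant names are illustrative.

```python
import re

# Mapping inferred from the taxonomy tables: all_* rows are S1, deduped_* rows are S2.
RUN_BY_PREFIX = {"all": "S1", "deduped": "S2"}
TOPIC_ID_PATTERN = re.compile(r"(all|deduped)_topic_(\d+)")


def parse_topic_id(topic_id):
    """Return (run, topic_number) for IDs such as 'deduped_topic_17'."""
    match = TOPIC_ID_PATTERN.fullmatch(topic_id)
    if match is None:
        raise ValueError(f"unrecognized topic ID: {topic_id!r}")
    prefix, number = match.groups()
    return RUN_BY_PREFIX[prefix], int(number)


print(parse_topic_id("deduped_topic_17"))
print(parse_topic_id("all_topic_0"))
```

Keeping the run and number separate makes it straightforward to compare topics that share an analytic label across the two runs, such as Classroom Instruction (deduped_topic_7 and all_topic_3).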

Example Record

Field           Value
doc_id          DOC_000606
Date            1885-05-28
Newspaper_Name  Baptist Courier
Pub_City        Greenville
Pub_State       South Carolina
region_bin      South
Keyword         Chinese girl
relevance_tier  core
topic_tags      child|children|general_chinese_reference|language|school
is_reprint      false
reprint_count   1
token_count     312
Page_URL        View original

Excerpt:

Dear Children: As you have shown your interest in missions in so many ways, and especially in supporting a missionary in China, I thought a letter from one who, just a few years ago, was where you are now — in a South Carolina school — might be interesting to you. My first chapter will describe a Chinese school. As it is a Chinese school, of course they study the Chinese language…