Map Construction

This document describes how the interactive spatial map is built from the cleaned corpus and topic model outputs. The pipeline produces five static data files that are loaded at runtime by the map frontend: counties_1882.geojson, states_1882.geojson, topics.json, records.json, and events.json.

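Scripts are located in scripts/build_map_data/. Run them in order: boundaries first, then topics, then records. build_events.py produces events.json from hardcoded event definitions, so it can run at any point in the sequence.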

Step 1: Historical Boundary Data (`build_boundaries.py`)

The map uses period-appropriate administrative boundaries rather than modern state and county outlines. This is essential for the 1880–1885 period, when several territories had not yet achieved statehood and county divisions differed substantially from today.

Source

Boundaries are derived from the Atlas of Historical County Boundaries, published by the Newberry Library. The atlas provides shapefiles for all U.S. states and territories, with each feature carrying a start date and end date to record when each administrative unit came into existence or was dissolved.

Snapshot date

A snapshot date of 1882-06-30 is used to represent the political geography of the study period. All features whose active date range includes this date are selected; all others are discarded.

INFO

The 1882 midpoint was chosen because it falls in the same year as the passage of the Chinese Exclusion Act, making it both politically central to the project and a reasonable approximation of the corpus's geographic conditions across the full 1880–1885 window. It also aligns with the date of the Rand McNally basemap layer used in the interactive map.

For date-range filtering, the script handles two column formats present in different Newberry shapefiles:

START_DATE / END_DATE (timestamp columns)
START_N / END_N (integer format YYYYMMDD)
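The date-range filter could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names `first_present`, `to_date`, and `is_active` are hypothetical, and the sketch assumes that timestamp columns arrive as ISO strings or date objects and that both ends of the range are inclusive.

```python
from datetime import date, datetime

SNAPSHOT = date(1882, 6, 30)


def first_present(props, *keys):
    """Return the first non-null value among the given property keys."""
    for key in keys:
        if props.get(key) is not None:
            return props[key]
    return None


def to_date(value):
    """Coerce a YYYYMMDD integer, ISO string, date, or datetime to a date."""
    if value is None:
        return None
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    if isinstance(value, int):
        return date(value // 10000, value // 100 % 100, value % 100)
    return datetime.fromisoformat(str(value)).date()


def is_active(props, snapshot=SNAPSHOT):
    """True if the feature's date range contains the snapshot (inclusive bounds assumed)."""
    start = to_date(first_present(props, "START_DATE", "START_N"))
    end = to_date(first_present(props, "END_DATE", "END_N"))
    if start is None or end is None:
        return False
    return start <= snapshot <= end
```

Normalizing both formats to `date` before comparing keeps the selection logic identical regardless of which shapefile a feature came from.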

Dakota Territory handling

The Newberry Atlas assigns pre-statehood North Dakota and South Dakota counties to their respective future-state files (ND and SD shapefiles). These counties predate statehood (November 1889) and would duplicate Dakota Territory features already present in the DT shapefile. To prevent overlap, ND and SD features with a start date before 1889-11-02 are dropped. In practice this removes exactly 40 features.
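As a rough sketch of this rule, assuming each feature is a dict whose properties carry a state abbreviation and an already-parsed start date (the property names `state_abbr` and `start` are hypothetical, not taken from the script):

```python
from datetime import date

STATEHOOD_CUTOFF = date(1889, 11, 2)
FUTURE_STATE_FILES = {"ND", "SD"}


def drop_pre_statehood_dakota(features):
    """Drop ND/SD features that start before the cutoff; return (kept, dropped_count)."""
    kept = []
    dropped = 0
    for feature in features:
        props = feature["properties"]
        is_dakota = props["state_abbr"] in FUTURE_STATE_FILES
        if is_dakota and props["start"] < STATEHOOD_CUTOFF:
            dropped += 1  # already covered by a Dakota Territory (DT) feature
            continue
        kept.append(feature)
    return kept, dropped
```

Returning the dropped count makes it easy to compare a run against the expected number of removed features.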

Geometry simplification

County boundaries are checked against a 5 MB file size threshold after initial export. If the file exceeds this limit, shapely.simplify is applied with a tolerance of 0.003 degrees to reduce vertex count while preserving topology.
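The size check could look something like the sketch below. It is illustrative only: the function names are hypothetical, "5 MB" is read as 5 MiB, and the geometry simplifier is passed in as a callable so the example runs without third-party imports. In the real script that callable would presumably wrap shapely's simplify.

```python
import json

SIZE_LIMIT_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" interpreted as 5 MiB
TOLERANCE_DEG = 0.003


def exported_size(feature_collection):
    """Size in bytes of the collection when serialized as UTF-8 JSON."""
    return len(json.dumps(feature_collection).encode("utf-8"))


def simplify_if_oversized(feature_collection, simplify_geometry, limit=SIZE_LIMIT_BYTES):
    """Apply simplify_geometry(geometry, tolerance) to every feature if the export is too large."""
    if exported_size(feature_collection) <= limit:
        return feature_collection  # under the threshold: leave geometry untouched
    simplified = [
        {**feature, "geometry": simplify_geometry(feature["geometry"], TOLERANCE_DEG)}
        for feature in feature_collection["features"]
    ]
    return {**feature_collection, "features": simplified}
```

The function returns a new collection rather than mutating its input, so the unsimplified features remain available for comparison.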

Outputs

File	Contents
`docs/public/data/states_1882.geojson`	State and territory outlines as of 1882-06-30
`docs/public/data/counties_1882.geojson`	County outlines as of 1882-06-30, with FIPS codes and state abbreviations

Step 2: Topic Taxonomy (`build_topics.py`)

topics.json provides the hierarchical topic structure used by the filter panel in the map frontend. It is derived from data/merged_topic_labels.csv, which consolidates topic assignments from both MALLET model runs.

Aggregation

The merged CSV may contain multiple rows for the same (category, analytic label) pair due to partial matches between the two model runs. These are collapsed by grouping on (category, analytic label) and taking the maximum weight and the union of topic IDs from both runs.
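A minimal sketch of this collapse, assuming each CSV row has been read into a dict. The column names `category`, `weight`, `deduped_topic_id`, and `all_topic_id` are assumptions modeled on the output structure; only `analytic_label` is named in this document.

```python
def collapse_rows(rows):
    """Group rows by (category, analytic_label); keep max weight and the union of topic IDs."""
    grouped = {}
    for row in rows:
        key = (row["category"], row["analytic_label"])
        weight = float(row["weight"])
        entry = grouped.setdefault(key, {"weight": weight, "topic_ids": set()})
        entry["weight"] = max(entry["weight"], weight)
        # Collect whichever model-run IDs this row carries, skipping blanks.
        for column in ("deduped_topic_id", "all_topic_id"):
            if row.get(column):
                entry["topic_ids"].add(row[column])
    return grouped
```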

ID generation

Each topic is assigned a unique string ID for use in the frontend. The ID is constructed from:

A category prefix: initials of significant words in the category name (e.g., Chinese Educational Mission becomes cem)
A snake-cased version of the analytic label with any leading ABBR: prefix stripped

Example: category "Education & Schools", label "Public School Admission" produces ID es_public_school_admission.
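The ID rules above can be approximated as follows. This is a sketch under stated assumptions, not the script's implementation: the stopword list that decides which words count as "significant" and the pattern used to detect a leading ABBR: prefix are both guesses, and the function names are hypothetical.

```python
import re

# Assumption: words skipped when building the category prefix.
STOPWORDS = {"and", "of", "the", "in", "for"}


def category_prefix(category):
    """Lowercase initials of the significant words in a category name."""
    words = re.findall(r"[A-Za-z0-9]+", category)
    return "".join(word[0].lower() for word in words if word.lower() not in STOPWORDS)


def snake_case_label(label):
    """Strip a leading all-caps 'ABBR:' prefix, then snake-case the remainder."""
    label = re.sub(r"^[A-Z]{2,}:\s*", "", label)
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def make_topic_id(category, label):
    return f"{category_prefix(category)}_{snake_case_label(label)}"
```

With these assumptions, the "Education & Schools" example yields es_public_school_admission, and a label such as "CEM: Government Policy & Institutional Recall" under "Chinese Educational Mission" yields cem_government_policy_institutional_recall.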

Exclusion

Topics in the NOISE category are unconditionally excluded. Other topics marked exclude: yes in the CSV are also excluded. Excluded topics appear in the output but are hidden from the map interface.

If a topic index is missing from the CSV, check its top words in keys_K25_S2.txt and either add an analytic_label or mark it exclude: yes before rerunning.

Output structure

```json
{
  "categories": [
    {
      "name": "Chinese Educational Mission",
      "color": "#fe640b",
      "hue": "Peach",
      "topics": [
        {
          "id": "cem_government_policy_institutional_recall",
          "label": "CEM: Government Policy & Institutional Recall",
          "color": "#ef9f76",
          "deduped_topic_id": null,
          "all_topic_id": "topic_24",
          "weight": 0.1328,
          "exclude": false
        }
      ]
    }
  ]
}
```

The script validates that the output contains 11 total categories, including the separate NOISE category, and 38 topic entries after the two model runs are merged. Warnings are emitted if these counts are not met.
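A count check along these lines might look like the sketch below. It assumes the 38 topic entries are counted across all categories regardless of their exclude flag, which this document does not state explicitly; the function name is hypothetical.

```python
import warnings

EXPECTED_CATEGORIES = 11  # includes the NOISE category
EXPECTED_TOPICS = 38


def validate_topics(data):
    """Warn about unexpected category or topic counts; return the warning messages."""
    n_categories = len(data["categories"])
    n_topics = sum(len(category["topics"]) for category in data["categories"])
    problems = []
    if n_categories != EXPECTED_CATEGORIES:
        problems.append(f"expected {EXPECTED_CATEGORIES} categories, found {n_categories}")
    if n_topics != EXPECTED_TOPICS:
        problems.append(f"expected {EXPECTED_TOPICS} topics, found {n_topics}")
    for message in problems:
        warnings.warn(message)
    return problems
```

Emitting warnings rather than raising lets the build finish so the output can be inspected while the counts are being corrected.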

Step 3: Document Records (`build_records.py`)

records.json is the main data layer for the map. Each of the 1,535 corpus documents becomes a GeoJSON feature with geographic coordinates, topic assignment, and display metadata.

Geographic resolution

Publication cities in the 1880s are not always present in modern geocoding databases, and many small towns no longer exist under the same name. A three-tier fallback system resolves coordinates for each record:

Tier 1: City lookup (110 major cities). A hand-verified dictionary maps 110 major cities to their county centroid. Coordinates are deterministically jittered by 2–8 km (seed derived from doc_id) to prevent point stacking when many records originate from the same city.

Tier 2: Coverage region county. If the publication city is not in the Tier 1 dictionary, the script parses the Coverage_Region field (pipe-separated county/place names) and attempts to match against the counties GeoJSON. The matched county centroid is used with 2–8 km jitter.

Tier 3: State centroid. If neither Tier 1 nor Tier 2 resolves, the state centroid is used as a fallback with 20–50 km jitter to indicate lower location confidence. The location_tier field in the output records which tier was used, allowing map users to assess coordinate precision.

Hawaii special case. All records from Hawaiian newspapers are assigned fixed Honolulu coordinates (21.3099°N, 157.8581°W) without jitter, since the Kingdom of Hawaii is outside the U.S. boundary GeoJSON and county-level resolution is not available.
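The fallback chain and seeded jitter could be sketched as below. Several details are assumptions rather than facts about build_records.py: the seed is derived from a SHA-256 hash of doc_id, the lookups are plain dicts keyed by (name, state), Hawaii is identified by the state code "HI", and the tier recorded for Hawaii (not stated in this document) is left as None.

```python
import hashlib
import math
import random

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude
HONOLULU = (21.3099, -157.8581)  # fixed coordinates given in the text


def jitter(lat, lon, doc_id, min_km, max_km):
    """Offset a point by a distance in [min_km, max_km], reproducibly per doc_id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    distance = rng.uniform(min_km, max_km)
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    dlat = distance * math.cos(bearing) / KM_PER_DEG_LAT
    dlon = distance * math.sin(bearing) / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon


def resolve_location(record, city_lookup, county_centroids, state_centroids):
    """Return (lat, lon, tier); tier is None for the Hawaii special case."""
    doc_id, state = record["doc_id"], record["pub_state"]
    if state == "HI":
        return HONOLULU[0], HONOLULU[1], None  # no jitter for Hawaii
    # Tier 1: hand-verified city dictionary.
    centroid = city_lookup.get((record["pub_city"], state))
    if centroid:
        return (*jitter(*centroid, doc_id, 2, 8), 1)
    # Tier 2: first pipe-separated coverage name that matches a county.
    for name in record.get("Coverage_Region", "").split("|"):
        centroid = county_centroids.get((name.strip(), state))
        if centroid:
            return (*jitter(*centroid, doc_id, 2, 8), 2)
    # Tier 3: state centroid with a wider jitter radius.
    return (*jitter(*state_centroids[state], doc_id, 20, 50), 3)
```

Seeding from a hash of doc_id rather than from Python's built-in hash() keeps the offsets stable across runs, so rebuilding the data does not move points on the map.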

If Tier 3 geocoding exceeds 30% of records, the script prints the top 20 unresolved (city, state) pairs. Add them to the city_to_county_1882 dictionary at the top of build_records.py and rerun. Current measured rates: Tier 1 69.6% / Tier 2 19.3% / Tier 3 11.1%.
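A check of this kind might be written as follows. This is a hedged sketch: the function name is hypothetical, and records are assumed to be dicts exposing the location_tier, pub_city, and pub_state properties listed under Record structure.

```python
from collections import Counter

TIER3_WARN_SHARE = 0.30


def tier3_report(records, top_n=20):
    """Return (tier-3 share, most common unresolved (city, state) pairs if over the limit)."""
    total = len(records)
    tier3 = [record for record in records if record["location_tier"] == 3]
    share = len(tier3) / total if total else 0.0
    if share <= TIER3_WARN_SHARE:
        return share, []
    pairs = Counter((record["pub_city"], record["pub_state"]) for record in tier3)
    return share, pairs.most_common(top_n)
```

Ranking pairs by frequency puts the cities that would resolve the most records at the top of the list.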

Topic assignment

Each document may have topic weights from both MALLET model runs (deduped model K25_S2 and full-corpus model K25_S1). A single canonical topic is assigned for map display using a five-priority resolution:

Priority	Rule
1	Full-corpus model fine-grained label for CEM and Diplomacy sub-topics
2	Deduped model direct assignment (highest-weight topic)
3	Full-corpus model direct assignment for documents absent from the deduped corpus
4	Inheritance: adopt the topic of the highest-confidence original in the same reprint group
5	Fallback: score the `topic_tags` field by keyword counting

This ordering preserves the full-corpus model's finer sub-topic distinctions for CEM and Diplomacy (priority 1) and otherwise prefers the deduped model's assignments, while ensuring that reprint-only records and unmodeled documents still receive a displayable category.
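The priority chain can be illustrated with the sketch below. All field names (`all_topic`, `deduped_topic`, `reprint_group`), the category name "Diplomacy", the reading of "highest-confidence" as highest weight, and the substring-based keyword scoring are assumptions made for the example; only `topic_tags` is named in this document.

```python
# Assumption: categories whose full-corpus labels take precedence (priority 1).
FINE_GRAINED_CATEGORIES = {"Chinese Educational Mission", "Diplomacy"}


def resolve_topic(doc, group_topics, tag_keywords):
    """Return (topic_id, priority) for a document, or (None, None) if nothing applies."""
    full = doc.get("all_topic")         # highest-weight topic from the full-corpus model
    deduped = doc.get("deduped_topic")  # highest-weight topic from the deduped model
    if full and full["category"] in FINE_GRAINED_CATEGORIES:
        return full["id"], 1
    if deduped:
        return deduped["id"], 2
    if full:
        return full["id"], 3
    # Priority 4: inherit from the strongest original in the same reprint group.
    candidates = group_topics.get(doc.get("reprint_group"), [])
    if candidates:
        best = max(candidates, key=lambda candidate: candidate["weight"])
        return best["id"], 4
    # Priority 5: count keyword occurrences in the topic_tags field.
    tags = doc.get("topic_tags", "").lower()
    scores = {
        topic_id: sum(tags.count(keyword.lower()) for keyword in keywords)
        for topic_id, keywords in tag_keywords.items()
    }
    if scores:
        best_id = max(scores, key=scores.get)
        if scores[best_id] > 0:
            return best_id, 5
    return None, None
```

Returning the priority alongside the topic ID makes it possible to audit how many records were assigned by each rule.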

Record structure

Each GeoJSON feature includes the following properties:

Property	Description
`doc_id`	Unique document identifier
`date`	Publication date (YYYY-MM-DD)
`year_month`	Publication year-month (YYYY-MM)
`topic_id`	Canonical topic ID (matches `topics.json`)
`topic_label`	Analytic topic label
`topic_category`	Parent category name
`topic_color`	Topic hex color
`category_color`	Category hex color
`newspaper`	Newspaper title
`pub_city`	Publication city
`pub_state`	Publication state
`region`	U.S. Census region
`page_url`	Persistent Chronicling America link
`is_reprint`	Boolean reprint flag
`reprint_count`	Total size of reprint chain
`chain_position`	Position within reprint chain
`excerpt`	First 300 characters of cleaned text
`location_tier`	Geographic resolution tier (1, 2, or 3)
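For orientation, a feature could be assembled roughly like this. The sketch fills in only the properties whose derivation is described in the table; the input key `clean_text` and the function name are hypothetical.

```python
EXCERPT_LENGTH = 300


def make_feature(doc, lat, lon, tier):
    """Build a GeoJSON point feature carrying a subset of the record properties."""
    return {
        "type": "Feature",
        # GeoJSON positions are ordered [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {
            "doc_id": doc["doc_id"],
            "date": doc["date"],
            "year_month": doc["date"][:7],  # YYYY-MM-DD -> YYYY-MM
            "excerpt": doc["clean_text"][:EXCERPT_LENGTH],
            "location_tier": tier,
        },
    }
```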

Step 4: Historical Event Overlays (`build_events.py`)

events.json contains nine hardcoded historical events that serve as navigational anchors in the map interface. Each event can filter the record layer to its associated time window and highlight relevant counties on the boundary layer.

Event structure

```json
{
  "id": "rock_springs_1885",
  "title": "Rock Springs Massacre",
  "date": "1885-09-02",
  "month_range": ["1885-08", "1885-11"],
  "description": "...",
  "highlight_level": "county",
  "highlight_fips": ["56037"],
  "related_topic_ids": ["violence_anti_chinese_violence"]
}
```

Field	Description
`date`	Exact event date used to anchor the timeline
`month_range`	Start and end months for filtering records
`highlight_level`	`"county"` or `"state"`
`highlight_fips`	Array of FIPS codes to shade on the boundary layer
`related_topic_ids`	Topic IDs to pre-select in the topic filter panel
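The frontend logic lives outside these scripts, but the effect of month_range on the record layer can be illustrated in Python. The sketch assumes both ends of the range are inclusive; the function name is hypothetical.

```python
def records_in_event_window(features, event):
    """Keep features whose year_month falls within the event's month_range (inclusive)."""
    start, end = event["month_range"]
    # Zero-padded YYYY-MM strings sort chronologically, so string comparison is sufficient.
    return [
        feature for feature in features
        if start <= feature["properties"]["year_month"] <= end
    ]
```

Storing year_month on each record is what makes this comparison possible without parsing dates at runtime.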

Events included

Event	Date	Highlight
Angell Treaty Signed	1880-11-17	(global)
Chinese New Year Press Coverage Wave	1881-01-30	San Francisco, New York, Boston
Chinese Educational Mission Recall Begins	1881-06-08	Hartford and Springfield area
Chinese Exclusion Act Signed	1882-05-06	San Francisco
CEM Student Completes Yale Course	1883-06-01	New Haven, CT
Sino-French War: Battle of Fuzhou	1884-08-23	(global)
Tape v. Hurley Public School Case	1884-09-01	San Francisco
Rock Springs Massacre	1885-09-02	Sweetwater County, WY
West Coast Anti-Chinese Expulsions	1885-11-03	Pierce County / King County, WA

FIPS codes for county highlights were assigned manually after inspecting counties_1882.geojson.

Build Order and Dependencies

The build process has two independent branches that feed the final record layer: historical boundaries and topic labels. Either branch can be rebuilt without the other, and topic labels can also be rebuilt separately from event overlays. Historical boundaries must be prepared before document records, because each record depends on the boundary files for geographic matching. The record layer is generated last, after both counties_1882.geojson and topics.json are available.
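These dependencies can be written down as a small graph and sorted into a valid run order. The sketch below uses the standard-library graphlib module (Python 3.9+). The dependency mapping is an interpretation of the description above, with build_events.py treated as having no script-level inputs.

```python
from graphlib import TopologicalSorter

# Each script maps to the set of scripts that must run before it.
DEPENDENCIES = {
    "build_boundaries.py": set(),
    "build_topics.py": set(),
    "build_events.py": set(),
    "build_records.py": {"build_boundaries.py", "build_topics.py"},
}


def build_order(dependencies=DEPENDENCIES):
    """Return one valid execution order for the build scripts."""
    return list(TopologicalSorter(dependencies).static_order())
```

Any order returned this way places build_records.py after both build_boundaries.py and build_topics.py.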

TIP

When updating topic labels, rerun only build_topics.py and build_records.py. The boundary data usually stays the same, and build_events.py only needs to change when events or FIPS codes change.

Script	Output	Size
`build_topics.py`	`docs/public/data/topics.json`	12.3 KB
`build_events.py`	`docs/public/data/events.json`	7.8 KB
`build_boundaries.py`	`docs/public/data/counties_1882.geojson`	7.26 MB
`build_boundaries.py`	`docs/public/data/states_1882.geojson`	897.4 KB
`build_records.py`	`docs/public/data/records.json`	1.37 MB

All output files are served as static assets by the VitePress build and fetched at runtime by the map frontend.
