Engineering a Real-Time Lassa Fever Surveillance Model

In public health surveillance, latency delays response. Conventional disease reporting systems, including those used for Lassa fever, often lag behind events on the ground. Digital signals, particularly those from social media, tend to emerge sooner and can therefore serve as an early warning during an outbreak. This post describes the Lassa Fever Surveillance Model, an end-to-end machine-learning application that scrapes social media sources, labels posts for epidemiological utility, and serves geospatial information through a low-latency API.
What follows is a technical description of how the system converts unstructured text into actionable health intelligence.
Data Ingestion: The Scraper Architecture
The first problem in social media mining is access. Official APIs are usually rate-limited or too costly for research use. Instead, a custom ingestion module (scraper.py) queries the front end directly through third-party OSINT tools. This approach avoids API rate limits and gives the scraper access to historical data. It searches on the keywords “Lassa fever,” “outbreak,” and “virus” to build a focused set of results.
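The internals of scraper.py are not shown in this post, so the following is only a minimal sketch of the keyword step, assuming fetched posts arrive as dictionaries with a "text" field. The names matches_keywords and filter_posts, and the post shape, are hypothetical; only the three keywords come from the description above.

```python
import re

# Keywords named in the post. Word boundaries keep "virus" from
# matching inside words such as "antivirus".
KEYWORDS = ("lassa fever", "outbreak", "virus")
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches_keywords(text: str) -> bool:
    """Return True if the text mentions at least one keyword."""
    return KEYWORD_RE.search(text) is not None


def filter_posts(posts):
    """Keep posts that match a keyword, dropping exact duplicate texts.

    `posts` is assumed to be an iterable of dicts with a "text" key;
    this shape is an illustration, not the project's actual schema.
    """
    seen = set()
    kept = []
    for post in posts:
        text = post.get("text", "").strip()
        if not text or text in seen:
            continue
        if matches_keywords(text):
            seen.add(text)
            kept.append(post)
    return kept
```

Deduplicating at this stage is a design choice of the sketch: reposted text would otherwise inflate the labeled set with identical examples.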

The Labeling Pipeline: Weak Supervision & Human-in-the-Loop
Gathering data is easy; labeling it for supervised learning is expensive. This project applies weak supervision first and manual review second.
Step A: Heuristic Auto-Labeling
The script automaticlabelling.py applies deterministic rules to the raw data. It searches for high-signal tokens (e.g., confirmed, death, isolation, hospital). When a tweet contains these tokens and is on the subject of Lassa fever, it is automatically labeled as positive (1).
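A rule of this kind might look like the sketch below. It is not the contents of automaticlabelling.py: the function name auto_label is hypothetical, the topic check is reduced to the word "lassa", and it assumes that any one signal token is enough (the description does not say whether one or all are required).

```python
import re

# High-signal tokens listed in the post.
SIGNAL_TOKENS = {"confirmed", "death", "isolation", "hospital"}

# Assumption: a tweet is "on topic" if it mentions the word "lassa".
TOPIC_RE = re.compile(r"\blassa\b", re.IGNORECASE)
WORD_RE = re.compile(r"[a-z]+")


def auto_label(text: str) -> int:
    """Return 1 if the text is on topic and has a signal token, else 0."""
    tokens = set(WORD_RE.findall(text.lower()))
    on_topic = TOPIC_RE.search(text) is not None
    has_signal = bool(tokens & SIGNAL_TOKENS)  # any single token suffices
    return 1 if (on_topic and has_signal) else 0
```

Exact token matching misses inflected forms such as "deaths" or "hospitalized", which is one source of the false negatives that the manual review is meant to catch.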

Step B: Human-in-the-Loop Validation
Heuristics are quick but noisy. To maintain data quality, we manually read a stratified sample of the auto-labeled data. Human reviewers correct false positives and false negatives, for example tweets that use outbreak vocabulary figuratively. This hybrid approach improves the quality of the ground truth before training.
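The review step can be sketched as drawing a fixed number of rows per heuristic label and then merging the reviewers' decisions back in. The row shape (dicts with "id", "text", and "label"), the per_label parameter, and both function names are assumptions made for illustration; the post does not state the sample size or the storage format.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_label, seed=0):
    """Draw up to `per_label` rows from each heuristic label group."""
    rng = random.Random(seed)  # seeded so the review batch is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_label, len(group))))
    return sample


def apply_corrections(rows, corrections):
    """Return new rows where reviewer labels override heuristic labels.

    `corrections` maps a row id to the label chosen by the reviewer.
    The input rows are left unmodified.
    """
    return [
        {**row, "label": corrections.get(row["id"], row["label"])}
        for row in rows
    ]
```

Sampling per label rather than uniformly ensures that reviewers see both auto-labeled positives and negatives even when one class is rare.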
Model Architecture: Efficiency over Complexity
For the classification engine, we prioritized inference speed and ease of deployment. Large language models can offer high accuracy, but they incur latency and infrastructure costs.
The system uses a TF-IDF vectorizer with a random forest classifier.
TF-IDF transforms text into a sparse matrix of numerical weights.
Random Forest handles high-dimensional input and also provides probability scores (predict_proba), which can serve as confidence values for alerts.
This combination is lightweight: the model stays under 500 MB and is highly portable.
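With scikit-learn, a TF-IDF plus random forest classifier can be assembled roughly as follows. This is a sketch under stated assumptions, not the project's training script: the six training texts are invented toy data, the hyperparameters (ngram_range, n_estimators) are arbitrary, and the classify helper is a hypothetical name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; the real training set is not shown.
texts = [
    "confirmed lassa fever death at the teaching hospital",
    "lassa fever patient moved to isolation ward",
    "three confirmed lassa cases reported by health officials",
    "this traffic is spreading like a virus",
    "watching a documentary about rats tonight",
    "my phone caught a virus again",
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline keeps vectorizer and classifier together, so the same
# preprocessing is applied at training and at inference time.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)


def classify(text):
    """Return (predicted_label, confidence) for a single text."""
    proba = model.predict_proba([text])[0]
    best = proba.argmax()
    return int(model.classes_[best]), float(proba[best])
```

The confidence here is the fraction of tree votes for the winning class, so an alerting threshold on it (for example, only raising alerts above a chosen cutoff) is a tunable policy rather than a calibrated probability.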
Feature Extraction: NER with Fallback Logic
Identifying a relevant tweet is half the task; the other half is locating the outbreak. We use spaCy's named entity recognition to extract geopolitical entities (GPE). Generic NER models often miss local African place names. To deal with this, we introduced a fallback mechanism: when the NER model finds no location, the system cross-references the text against a hard-coded list of Nigerian states and major cities.
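The fallback logic can be sketched independently of any particular NER model by passing the recognizer in as a callable. Everything named here is illustrative: NIGERIA_PLACES is a short partial list rather than the project's actual gazetteer, and extract_locations and gazetteer_lookup are hypothetical names.

```python
import re

# Partial, illustrative gazetteer; the real list covers all states
# and major cities.
NIGERIA_PLACES = (
    "Lagos", "Abuja", "Kano", "Ondo", "Edo",
    "Ebonyi", "Bauchi", "Taraba", "Owo", "Abakaliki",
)
CANONICAL = {p.lower(): p for p in NIGERIA_PLACES}
PLACE_RE = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in NIGERIA_PLACES) + r")\b",
    re.IGNORECASE,
)


def gazetteer_lookup(text):
    """Return known place names found in the text, in order, no repeats."""
    found = []
    for match in PLACE_RE.finditer(text):
        name = CANONICAL[match.group(1).lower()]
        if name not in found:
            found.append(name)
    return found


def extract_locations(text, ner=None):
    """Use NER results when there are any, else fall back to the gazetteer.

    `ner` is an optional callable mapping text to a list of GPE strings.
    With spaCy installed and a model downloaded, it could be, for example:
        nlp = spacy.load("en_core_web_sm")  # model choice is an assumption
        ner = lambda t: [e.text for e in nlp(t).ents if e.label_ == "GPE"]
    """
    found = list(ner(text)) if ner is not None else []
    return found if found else gazetteer_lookup(text)
```

Because the gazetteer match is case-insensitive and returns canonical spellings, a tweet that writes "owo" in lowercase still resolves to "Owo".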


Deployment: Containerized FastAPI Service
The last stage of the pipeline is deployment. The model artifacts and preprocessing logic are packaged into a FastAPI application. FastAPI supports asynchronous request handling and auto-generates Swagger documentation. The service exposes a /predict endpoint that accepts a JSON payload, processes the text, and returns the classification, confidence, and location. Dockerizing the service keeps the development and production environments consistent.

Conclusion
The Lassa Fever Surveillance Model demonstrates that the best AI solution does not necessarily require the most expensive architecture. Through creative data scraping, human-in-the-loop validation, and lightweight modeling, we can build surveillance tools that are both accurate and computationally efficient.
You can explore the full source code and methodology on GitHub.
