Approach & Methodology

World Compendium of Healthcare Facilities and Nonprofits Organizations is an outgrowth of the Virtue Foundation Actionable Data Initiative. Harnessing advancements in technology and machine learning, the Foundation has created a first-of-its-kind mapping-and-matching global health platform for local nonprofits and healthcare organizations.


This compendium represents the first product of this initiative, providing a curated view of demand-side data and enabling volunteer medical professionals, governments, and stakeholders to better identify where healthcare services are available and where additional resources are needed. Each of the 72 chapters presents a brief country overview, a map depicting the locations of healthcare facilities, and a curated list of nonprofit organizations and healthcare facilities. Unique URLs associated with each listing link back to the web platform, providing access to further information about the organizations as well as the ability to interact with the data in a customizable manner.

Nonprofit Data Collection and Curation

Using specific keywords and medical specialty descriptors, a pipeline for querying and identifying nonprofit websites was created for targeted regions. Forty-three separate medical specialties, 7 generic terms, and 4 nonprofit keywords were applied to produce a total of 14,288 unique query combinations, which were then executed on various search engines and social media platforms. This resulted in 2,734,535 candidate nonprofit web pages, which were subsequently indexed using custom crawlers built with open-source Python libraries and run as a distributed Spark application on parallel workers on Amazon Web Services. This list was complemented with known public resources, such as the United Nations database for NGOs.
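The query-generation step can be illustrated with a minimal sketch. The vocabularies below are hypothetical samples, since the actual specialty, generic, and nonprofit terms are not listed here, and the regional qualifier is an assumption about how queries were targeted. The sketch shows only how term lists combine into unique search strings; it does not reproduce the 14,288 figure.

```python
from itertools import product

# Hypothetical sample vocabularies; the real lists (43 specialties,
# 7 generic terms, 4 nonprofit keywords) are not given in the text.
SPECIALTIES = ["cardiology", "ophthalmology", "pediatrics"]
GENERIC_TERMS = ["clinic", "medical mission"]
NONPROFIT_KEYWORDS = ["nonprofit", "NGO"]
REGIONS = ["Ghana", "Nepal"]  # assumed regional qualifier


def build_queries(specialties, generic_terms, keywords, regions):
    """Return the sorted list of unique search strings, one per combination."""
    queries = set()
    for specialty, term, keyword, region in product(
        specialties, generic_terms, keywords, regions
    ):
        queries.add(f"{specialty} {term} {keyword} {region}")
    return sorted(queries)


queries = build_queries(SPECIALTIES, GENERIC_TERMS, NONPROFIT_KEYWORDS, REGIONS)
print(len(queries))  # 3 * 2 * 2 * 2 = 24
```

Collecting the strings in a set before sorting guarantees uniqueness even if two term lists happen to share an entry.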


A Decision-Tree Script for Extracted Nonprofit Website Domains

A decision-tree script extracted the domains, deduplicated web pages, and created a recursive multilevel indexing tree, identifying 167,457 unique candidate nonprofit websites. Further data, including contact information, donation links, and other metadata, were captured using regular expressions and pattern-matching techniques. To reduce noise and the likelihood that a collected website did not represent an actual healthcare nonprofit organization, machine learning methods were employed to filter the data.
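A simplified sketch of domain extraction, page deduplication, and regex-based metadata capture follows. It is not the Foundation's script: the function names and both regular expressions are illustrative assumptions, and the grouping shown is a flat site-to-paths mapping rather than the recursive multilevel tree described above.

```python
import re
from urllib.parse import urlsplit

# Illustrative patterns only; the patterns actually used are not documented.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DONATE_RE = re.compile(r'href="([^"]*donat[^"]*)"', re.IGNORECASE)


def site_key(url):
    """Reduce a page URL to its lowercase host without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def group_pages_by_site(urls):
    """Map each unique site to the sorted, deduplicated paths found under it."""
    tree = {}
    for url in urls:
        key = site_key(url)
        if key:
            tree.setdefault(key, set()).add(urlsplit(url).path or "/")
    return {site: sorted(paths) for site, paths in tree.items()}


def extract_metadata(html):
    """Pull email addresses and donation links out of raw HTML."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(html))),
        "donation_links": sorted(set(DONATE_RE.findall(html))),
    }


pages = [
    "https://www.example.org/about",
    "http://example.org/about",
    "https://example.org/programs/surgery",
    "https://clinic.example.net/",
]
sites = group_pages_by_site(pages)
meta = extract_metadata('<a href="/donate-now">Give</a> Contact: info@example.org')
```

In this example the two `/about` URLs collapse into one entry under `example.org`, because the scheme and the `www.` prefix are ignored when keying a site.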


Pipeline for Filtering Candidate Nonprofit Web Pages

A training set of 11,877 websites was manually labeled by the Virtue Foundation volunteer team. An auto-tuned word N-gram text modeler, using token occurrences and optimized for sensitivity over precision, achieved the best performance on this training set. In addition to predicting whether or not a website represents a nonprofit, the classifier was also able to determine whether the organization’s activities were concentrated on healthcare. The inference process applied to the 167,457 candidate websites returned 11,119 organizations as healthcare nonprofits. Predicting whether a nonprofit was involved in healthcare proved challenging, as numerous healthcare-related websites belonging to educational organizations, publications, and for-profit entities have a high likelihood of being incorrectly classified as providing healthcare. Therefore, all 11,119 organizations underwent further manual review to establish legitimacy, identify healthcare services provided, confirm countries of activity, and find additional relevant information. At completion, the total number of nonprofit organizations was narrowed down to 3,174. Due to space constraints, only 2,292 nonprofit organizations were ultimately included in the book, based on their quality and relevance. The companion online platform provides a more comprehensive and regularly updated dataset.
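Two ideas from this step can be sketched without reference to any particular modeling library, which the text does not name. The first function builds word n-gram features from token occurrences (treated here as presence rather than counts, which is an assumption). The second picks a decision cutoff that favors sensitivity: it returns the highest score threshold at which a target share of labeled positives is still retained. Both function names and the sample scores are hypothetical.

```python
import math
import re


def word_ngrams(text, n_max=2):
    """Return the set of word n-grams (lengths 1..n_max) present in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def threshold_for_sensitivity(scores, labels, target_recall):
    """Return the highest cutoff whose recall on positives meets the target."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no positive examples")
    # Number of positives that must score at or above the cutoff.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


features = word_ngrams("Free eye surgery camps")

# Hypothetical classifier scores with manual labels (1 = healthcare nonprofit).
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
labels = [1, 1, 1, 1, 0, 0]
cutoff = threshold_for_sensitivity(scores, labels, target_recall=1.0)
flagged = [s for s in scores if s >= cutoff]
```

Lowering the cutoff to keep every true positive also admits a negative example (the site scored 0.7), which mirrors the trade-off described above: favoring sensitivity produces more false positives, so a manual review pass remains necessary.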

Healthcare Facility Data Collection and Curation

Healthcare facility data was primarily sourced from the OpenStreetMap humanitarian data layer. Given the abundant, and at times outdated, hospital listings in the OpenStreetMap dataset, uniform filtering based on building footprint, facility name, and online presence was applied to limit the data to hospitals and facilities with the highest impact and capacity. Area-based filtering was employed to exclude buildings too small to be a hospital based on square footage. Keyword filtering was then used to exclude non-hospitals by name (e.g., “health post”), accounting for linguistic differences.
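The three filters can be sketched as a single predicate. The area cutoff, the excluded name terms, and the `Facility` fields below are assumptions chosen for illustration; the actual thresholds and multilingual term lists are not stated.

```python
from dataclasses import dataclass

# Assumed values for illustration; the real cutoff and term list are not published.
MIN_FOOTPRINT_M2 = 500.0
EXCLUDED_NAME_TERMS = ("health post", "dispensary", "puesto de salud")


@dataclass
class Facility:
    name: str
    footprint_m2: float
    has_online_presence: bool


def is_candidate_hospital(facility):
    """Apply the area, name-keyword, and online-presence filters in order."""
    if facility.footprint_m2 < MIN_FOOTPRINT_M2:
        return False  # building too small to be a hospital
    name = facility.name.casefold()
    if any(term in name for term in EXCLUDED_NAME_TERMS):
        return False  # name marks it as a non-hospital facility
    return facility.has_online_presence


facilities = [
    Facility("Regional Teaching Hospital", 12000.0, True),
    Facility("Village Health Post", 900.0, True),
    Facility("Puesto de Salud Norte", 1500.0, True),
    Facility("Small Private Hospital", 120.0, True),
    Facility("District Hospital", 4000.0, False),
]
kept = [f.name for f in facilities if is_candidate_hospital(f)]
```

Only the first facility survives all three checks in this sample. Case-folding the name before matching lets one term list cover differently capitalized listings.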


Pipeline for Finding and Filtering Hospitals

Lastly, to establish activity, a score was derived for each candidate facility by searching for related websites, local directories, government reports, social media posts, and more. Public APIs, including Bing Maps, OpenStreetMap Nominatim, and Geonames, were called to capture and externally validate additional details. The purpose of these integrations was to (1) reverse-geocode hospital coordinates to return missing addresses, and (2) validate the location of hospitals in close proximity to country borders. This approach was premised on the assumption that principal hospitals are more likely to be referenced online, whether by individuals, governments, or nonprofits. Filtering was complemented by several rounds of manual curation and review.
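A minimal sketch of the scoring idea and of a reverse-geocoding request is given below. The evidence weights are hypothetical, since the scoring rules are not published. The URL builder targets the public Nominatim `reverse` endpoint with its `lat`, `lon`, and `format` parameters; it only constructs the request and makes no network call, and any real use would need to follow the service's usage policy.

```python
from urllib.parse import urlencode

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"

# Hypothetical weights per evidence type; not the Foundation's actual values.
EVIDENCE_WEIGHTS = {
    "website": 3,
    "government_report": 3,
    "local_directory": 2,
    "social_media": 1,
}


def activity_score(evidence_counts):
    """Sum the weights of evidence types seen, counting each type at most once."""
    return sum(
        weight
        for kind, weight in EVIDENCE_WEIGHTS.items()
        if evidence_counts.get(kind, 0) > 0
    )


def reverse_geocode_url(lat, lon):
    """Build a Nominatim reverse-geocoding request URL for a coordinate pair."""
    params = {"lat": f"{lat:.6f}", "lon": f"{lon:.6f}", "format": "jsonv2"}
    return f"{NOMINATIM_REVERSE}?{urlencode(params)}"


score = activity_score({"website": 2, "social_media": 5})
url = reverse_geocode_url(5.6037, -0.187)
```

Counting each evidence type once keeps a facility with many social media mentions from outscoring one that is referenced by several independent kinds of sources.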

Future Directions

Data contained in this compendium presents only the first steps in improving the nonprofit and healthcare facility landscape in low- and lower-middle-income countries. Much work remains to be done to better our understanding of the granularity specific to each region and healthcare system. The Foundation’s development of a vulnerability index based on macro-level health statistics, bed capacity, and population mobility in targeted regions is a step in this direction. The digital tool will continue to be developed, adding new geographies along with additional insights from multiple, disparate sources of data. Additionally, data from social media activity can help identify acute medical conditions in real time and facilitate rapid assistance where needed. Information sources such as public satellite data and ground images obtained from online user activity can be further used in conjunction with machine-learning algorithms to validate the location of hospitals, estimate facility area, and even predict the number of beds needed. Together, these and other features will enhance the global marketplace for the exchange of healthcare services.


Book Availability

World Compendium of Healthcare Facilities and Nonprofits Organizations is currently available in paperback, hardcover and Kindle editions on