iKLEWS (Infrastructure for Knowledge Linkages from Ethnography of World Societies) is a HRAF project funded by the National Science Foundation. iKLEWS will create semantic infrastructure and associated computer services for a growing textual database (eHRAF World Cultures), presently with roughly 750,000 pages from 6,500 ethnographic documents covering 330 world societies over time. The basic goal is to greatly expand the value of eHRAF World Cultures to users who seek to understand the range of possibilities for human understanding, knowledge, belief and behavior with respect to real-world problems we face today, such as climate change, violence, disasters, epidemics, hunger, and war. Understanding how and why cultures vary in the range of possible outcomes in similar circumstances is critical to improving policy, applied science, and basic scientific understandings of the human condition. Seeing how others have addressed issues can help us find solutions we might not find otherwise. Such comparison is extremely valuable in understanding an increasingly globalized world, and it can also be used to explore the relationship between human evolution and human behavior. Although the current web version of eHRAF World Cultures is very fast at retrieving relevant ethnography, fundamentally it uses the same method as the original paper files founded in 1949, just a lot faster. There are no aids to analyzing the material once found; the user has to read the results of their search and apply their own methods. This project will begin to fill this gap, so that modern methods of working with text can be applied, by developing an extensible framework that deploys tools for analysis as well as greatly improving search capability. This will be available as a services framework, with interfaces for researchers ranging from beginner to advanced.
New semantic and data mining infrastructure developed by this project will assist in determining universal and cross-cultural aspects of a wide range of user-selected topics, such as social emotion and empathy, economics, politics, use of space and time, morality, or music and songs, to use examples that have been investigated using prototypic tools preceding this project. Some of the methods used can be applied in areas as far afield as AI and robotics, such as forming a basis for a bridge between rather opaque deep learning outcomes and more transparent logic-driven narratives, making AI solutions more human. We will apply pattern extraction and linguistic analysis through deep learning and other tools to define a flexible logic for the contents of the documents. However, the goal is to create new metadata and infrastructure based on the outcomes of these procedures that can operate in real time and to scale using less processor-intensive algorithms.
The project will result in improved relevance of search results through identifying finer-grained topics in each paragraph in addition to those in HRAF’s present Outline of Cultural Materials (OCM), and through establishing semantic representations of the paragraphs in the texts, with semantic links between the paragraphs so that a researcher can follow topic trails more effectively. It will also provide tools for management, analysis, visualization, and summarization of results, and for user-initiated data mining and pattern identification, based largely on precomputed data. These will assist researchers in identifying and testing hypotheses about the societies they investigate. In addition to working on HRAF’s eHRAF World Cultures database, we will provide services that any researcher can use to process and analyze their own material.
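As a rough illustration of what precomputed semantic links between paragraphs could look like, the sketch below links paragraphs whose word-count vectors are similar. It is only a minimal stand-in: plain bag-of-words cosine similarity replaces whatever semantic representation the project actually adopts, and the paragraph ids, texts, function names and the 0.3 threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a paragraph and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_links(paragraphs, threshold=0.3):
    """Precompute, for each paragraph id, the ids of sufficiently similar paragraphs."""
    vectors = {pid: Counter(tokenize(text)) for pid, text in paragraphs.items()}
    links = {pid: [] for pid in paragraphs}
    ids = sorted(paragraphs)
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = cosine(vectors[left], vectors[right])
            if score >= threshold:
                # Links are symmetric, so store them in both directions.
                links[left].append((right, score))
                links[right].append((left, score))
    for pid in links:
        # Strongest links first, so a topic trail follows the closest match.
        links[pid].sort(key=lambda pair: pair[1], reverse=True)
    return links


# Hypothetical paragraphs; ids and text are invented for illustration only.
paragraphs = {
    "p1": "Rain ceremonies are held when the harvest is threatened by drought.",
    "p2": "During drought the elders organise rain ceremonies to protect the harvest.",
    "p3": "Marriage payments are negotiated between the two families.",
}
links = build_links(paragraphs)
print(links["p1"])
```

Because the links are computed once and stored, following a topic trail at query time reduces to a table lookup, which is in the spirit of the reliance on precomputed data described above.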
We will a) expand the metadata for the eHRAF database using data mining, deep learning and a range of textual analysis, b) expose the contents of the database to data mining by researchers through an API, c) develop services that leverage the API to process results, and d) provide a range of approaches to leveraging the API to suit researchers with different levels of technical skill, including web applications, JupyterLab workflow templates, and direct API access. We will also include a means for researchers to analyze their own ethnographic and other materials that are not in the HRAF corpus.
By exposing data mining, computer-assisted text analysis methods and data management tools in this project, with guided means to leverage these through interactive web applications and JupyterLab templates together with interactive exemplars and training materials, we will expand capacity to advance secondary comparative, cross-cultural, and other ethnographic research. We will also extend this capacity to a much wider constituency of researchers at a time when many of the problems of globalization could be better addressed by incorporating more information about the past and present of the different parties involved. These services will make it practical to include cross-cultural analysis in research whose primary purpose is quite removed from this track, such as consideration of cross-cultural approaches to common or new problems.
The plan is to: 1) expand the metadata for the present HRAF corpus in two important ways, using existing OCM subject categories, already assigned at the paragraph level by HRAF analysts, in conjunction with topic extraction, pattern identification and other text-mining algorithms applied to the HRAF corpus, to enable finer topical distinctions and to add much more semantic structure, assertions and pragmatic equivalences to the corpus metadata per entry; 2) improve the research capacity of the database by adding bulk user-accessible data mining based on precomputed tables, with an initial set of new analytic and reporting services to be developed for the proposed project to expand processing of results by researchers and students; 3) improve workflow management, based on JupyterLab workflow templates and a new HRAF Jupyter kernel and libraries that are interoperable with Python and R directly or through JupyterLab kernels, to be expanded and made more secure and scalable by incremental additions to our services that facilitate search, aggregation, visualization, summarization, management and further processing and analysis of result sets, including subsequent transformations and analysis history, so that researchers can create and document complex workflows with relatively little need to understand the services framework; 4) address the high labor cost of subject-indexing, which is largely why HRAF cannot currently process as many relevant materials for a culture or archaeological tradition as we would like, by implementing new analytic tools that auto-assign topic-level metadata at the paragraph level, and other new metadata, to new documents, as added to documents in the existing database under 1), which will greatly increase the productivity of our analysts; and 5) implement these auto-assignment processes as open-access services to index and normalize a researcher’s own documents with respect to HRAF’s new management and analytic services, so that researchers can process their own private ethnographic materials.
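To make the idea of paragraph-level auto-assignment concrete, the following sketch suggests subject labels for a new paragraph from paragraphs that analysts have already indexed. It is a deliberately simple vocabulary-overlap scorer, not the project's method, which is expected to draw on deep learning and other text-mining tools. The labels "ritual" and "marriage" are hypothetical placeholders rather than real OCM categories, and the function names, example texts and the 0.2 cut-off are likewise assumptions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a paragraph and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled):
    """Count how often each term occurs in paragraphs carrying each label."""
    term_counts = defaultdict(Counter)
    for text, labels in labelled:
        for label in labels:
            term_counts[label].update(tokenize(text))
    return term_counts


def suggest_labels(text, term_counts, min_score=0.2):
    """Suggest labels whose vocabulary covers at least min_score of the paragraph's distinct terms."""
    tokens = set(tokenize(text))
    if not tokens:
        return []
    scored = []
    for label, counts in term_counts.items():
        overlap = sum(1 for term in tokens if term in counts)
        score = overlap / len(tokens)
        if score >= min_score:
            scored.append((label, score))
    # Best-covered label first, for review by an analyst.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical analyst-indexed paragraphs; labels are placeholders, not OCM codes.
labelled = [
    ("The rain ceremony is led by elders during drought.", ["ritual"]),
    ("Offerings are made at the shrine before the ceremony.", ["ritual"]),
    ("Marriage payments are negotiated between families.", ["marriage"]),
    ("The bride moves to the household of her husband after marriage.", ["marriage"]),
]
model = train(labelled)
suggestions = suggest_labels("Elders prepare offerings for the rain ceremony.", model)
print(suggestions)
```

A real service would at least filter common function words and weight rare terms more heavily. The point here is only the overall shape: existing paragraph-level indexing acts as training data, and suggestions for new documents are ranked so that an analyst or researcher can review them.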