In the era of data-intensive scientific discovery, Big Data scientists in all communities spend the majority of their time and effort collecting, integrating, curating, transforming, and assessing the quality of data before actually performing discovery analysis. Some endeavors even start from information that is not available or accessible in digital form; when it is available, it is often unstructured and therefore incompatible with analytics tools that require structured, uniformly formatted data. The two main approaches to handling the volume and variety of data, and to accelerating the rate of digitization, have been crowdsourcing and machine learning. However, very little has been done to take advantage of both types of solutions simultaneously, or to make it easier for different efforts to share and reuse the software elements they develop. The vision of the Human- and Machine-Intelligent Network (HuMaIN) project is to accelerate scientific data digitization through fundamental advances in the integration of, and mutual cooperation between, human and machine processing, in order to overcome the practical hurdles and bottlenecks present in scientific data digitization. Even though HuMaIN concentrates on digitization tasks faced by the biodiversity community, the software elements being developed are generic in nature and are expected to be applicable to other scientific domains (e.g., exploring the surface of the moon for craters requires the same type of crowdsourcing tool as finding words in text, and the same question of whether machine-learning tools could provide similar results can be tested).
The HuMaIN project proposes to conduct research and develop the following software elements: (a) configurable machine-learning applications for scientific data digitization (e.g., Optical Character Recognition and Natural Language Processing), which will be made automatically available as RESTful services via a new application specification language, increasing the ability of HuMaIN software elements to interoperate with other elements while decreasing software development time; (b) workflows leading to a cyber-human coordination system that will take advantage of feedback loops (e.g., based on the consensus of crowdsourced data and its quality) for self-adaptation to changes and increased sustainability of the overall system; (c) new crowdsourcing micro-tasks that can be reused in a variety of scenarios and that contain user activity sensors for studying time-effective user interfaces; and (d) services to support the automated, on-demand creation and configuration of crowdsourcing workflows to fit the needs of individual groups. A cloud-based system will be deployed to provide the necessary execution environment, with traceability of the service executions involved in cyber-human workflows. A cost-effectiveness analysis of all the software elements developed in this project will provide an assessment and evaluation of long-standing what-if scenarios pertaining to human- and machine-intelligent tasks. Crowdsourcing activities will attract a wide range of users with tasks that require little expertise and, at the same time, will expose volunteers to applied science and engineering, potentially attracting the interest of K-12 teachers and students.
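To make the idea of a consensus-based feedback loop in item (b) more concrete, the following is a minimal illustrative sketch, not HuMaIN's actual implementation. It assumes that several volunteers transcribe the same field, that answers are compared after simple whitespace and case normalization, and that a majority-agreement threshold decides whether a value is accepted or sent back to the crowd. The function names `consensus` and `next_step` and the `min_agreement` parameter are hypothetical and chosen only for this example.

```python
from collections import Counter


def consensus(transcriptions, min_agreement=0.6):
    """Return (value, agreement, accepted) for one crowdsourced field.

    Hypothetical helper: agreement is the share of answers that match
    the most common answer after normalization.
    """
    if not transcriptions:
        return None, 0.0, False
    # Normalize whitespace and case so trivially different answers match.
    normalized = [" ".join(t.split()).lower() for t in transcriptions]
    value, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)
    return value, agreement, agreement >= min_agreement


def next_step(transcriptions, min_agreement=0.6):
    """Feedback step (assumed policy): accept the value or ask for more input."""
    value, agreement, accepted = consensus(transcriptions, min_agreement)
    if accepted:
        return "accept", value
    # Low agreement feeds back into the workflow as a request for more answers.
    return "request_more", None


if __name__ == "__main__":
    answers = ["Quercus alba", "quercus  alba", "Quercus rubra"]
    print(consensus(answers))
    print(next_step(["Quercus alba", "Quercus rubra"]))
```

In a fuller workflow, the output of a machine-learning service such as OCR could be treated as one more answer, or the threshold could be adjusted based on observed data quality; both are design choices that this sketch leaves open.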