Most of the datasets available in chemistry and chemical engineering research do not qualify as "big data". However, we still want to apply the powerful tools of machine learning to small datasets of dozens to hundreds of data points.
We are pursuing several avenues for training models on small chemical datasets more effectively. One example is boosting, in which multiple small models are used in sequence in place of one large model, each iteratively drawing additional information from the graph representation.
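To make the idea of boosting concrete, here is a minimal sketch of generic residual boosting, not our actual method. It assumes each molecule has already been reduced to a single numeric descriptor, so the graph representation itself is not shown. The data values, function names, and hyperparameters are all hypothetical. Each small model (a one-split "stump") is fit to the errors left by the models before it.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to 1-D inputs by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lv, rv)
    if best is None:
        # All inputs identical: fall back to a constant prediction.
        m = mean(residuals)
        return xs[0], m, m
    return best[1], best[2], best[3]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit stumps in sequence, each one to the residuals of the ensemble so far."""
    base = mean(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(preds, ys):
    return mean((p - y) ** 2 for p, y in zip(preds, ys))


# Hypothetical toy data: one descriptor value and one property value per molecule.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 1.9, 3.1, 3.0, 4.2, 5.1, 5.0]

model = fit_boosted(xs, ys)
baseline_preds = [mean(ys)] * len(xs)
model_preds = [predict(model, x) for x in xs]
```

On the training data, the boosted ensemble fits better than the constant baseline (`mse(model_preds, ys)` is lower than `mse(baseline_preds, ys)`). With so few data points, a real application would also need held-out validation to guard against overfitting.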
Machine learning models for chemical properties are improving rapidly. For some problems, they are becoming good enough that their in-domain predictions are nearly as accurate as collecting new data. However, the impact of these very good models is limited by accessibility. Very few people outside of machine learning specialists know that they exist or have the capability to use them.
We will change that by building a repository of significant prediction models from a variety of sources, handling them with a unified software framework, and publicly hosting prediction services for them.
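As a rough illustration of what a unified framework could look like, the sketch below puts models from different sources behind one calling convention. Everything in it is an assumption made for illustration: the `ModelRegistry` class, the use of SMILES strings as input, and the toy stand-in model are hypothetical and do not describe the framework we are building.

```python
from typing import Callable, Dict, List

# Assumed convention: a predictor maps a list of SMILES strings to one number each.
Predictor = Callable[[List[str]], List[float]]


class ModelRegistry:
    """Hypothetical registry that exposes many models through one interface."""

    def __init__(self) -> None:
        self._models: Dict[str, Predictor] = {}

    def register(self, name: str, predictor: Predictor) -> None:
        if name in self._models:
            raise ValueError(f"model {name!r} is already registered")
        self._models[name] = predictor

    def names(self) -> List[str]:
        return sorted(self._models)

    def predict(self, name: str, smiles: List[str]) -> List[float]:
        return self._models[name](smiles)


def toy_c_count(smiles: List[str]) -> List[float]:
    # Stand-in for a real model: counts the letter "c"/"C" in each string.
    # It is not a chemical property predictor.
    return [float(s.upper().count("C")) for s in smiles]


registry = ModelRegistry()
registry.register("toy_c_count", toy_c_count)
result = registry.predict("toy_c_count", ["CCO", "c1ccccc1"])
```

With this kind of design, a hosted service would only need to know the registry interface, and each model from an outside source would be wrapped once to match it.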