ARC-MOF: A Diverse Database of Over 280,000 MOFs with DFT-Derived Partial Atomic Charges and Descriptors for Machine Learning

Jake Burner, Ohmin Kwon, Marco Gibaldi, Scott Simrod, and Tom K. Woo

University of Ottawa

Metal-organic frameworks (MOFs) are a class of crystalline materials composed of metal nodes or clusters connected by semi-rigid organic linkers. Owing to their high surface area, porosity, and tunability, MOFs have received attention for a variety of applications, including gas separation and storage. Computational methods have been used to expedite the design and discovery of MOFs. For example, atomistic grand canonical Monte Carlo (GCMC) simulations have been shown to provide relatively accurate estimates of the gas adsorption properties of a MOF when its structure is known. Notably, these simulations require partial atomic charges for the framework and guest atoms. Ideally, these charges are derived from a density functional theory (DFT) calculation. However, such calculations can be computationally expensive for large-scale screening, so empirical partial charge assignment methods are often employed instead.
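To illustrate why the partial charges matter, the following minimal sketch evaluates the electrostatic part of a framework-guest interaction as a direct sum of pairwise Coulomb terms. It is an illustration only, not the method used in this work: the function name and the data layout are hypothetical, and periodic boundary conditions, Ewald summation, and the dispersion terms that a production GCMC code would include are all omitted.

```python
import math

# Coulomb constant in kJ mol^-1 Angstrom e^-2, so that charges are given in
# units of the elementary charge and distances in Angstrom.
COULOMB_CONSTANT = 1389.35458


def coulomb_energy(framework_atoms, guest_atoms):
    """Direct-sum electrostatic energy between framework and guest atoms.

    Each atom is a (charge, (x, y, z)) tuple. This hypothetical helper
    ignores periodic images and long-range corrections.
    """
    energy = 0.0
    for q_f, pos_f in framework_atoms:
        for q_g, pos_g in guest_atoms:
            r = math.dist(pos_f, pos_g)  # interatomic distance in Angstrom
            energy += COULOMB_CONSTANT * q_f * q_g / r
    return energy  # kJ/mol
```

Because each term is proportional to the product of a framework charge and a guest charge, any error in the assigned framework charges carries directly into the computed interaction energy, and from there into the simulated uptake.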

In addition to brute-force GCMC screening, machine learning (ML) and other data-driven methods have been used to screen large databases and to develop new MOFs for CO2 capture that were subsequently synthesized and validated experimentally. To enable data-driven materials discovery for any application, the first (and perhaps most crucial) step is the curation of a database. This work introduces the ab initio REPEAT charge MOF (ARC-MOF) database, a diverse collection of over 280,000 experimentally characterized and hypothetical MOFs spanning all publicly available MOF databases. ARC-MOF contains MOFs with DFT-derived partial atomic charges for GCMC screening, as well as pre-computed descriptors for out-of-the-box ML applications. An in-depth analysis of the diversity of ARC-MOF with respect to the currently mapped design space of MOFs was performed, a critical yet commonly overlooked aspect of past MOF database publications. Using this analysis, balanced subsets of ARC-MOF were identified for various ML purposes. Other chemical and geometric diversity analyses are also presented, along with an analysis of the effect of the charge assignment method on GCMC-simulated gas uptakes in MOFs.
