Processing and exploiting such large amounts of data is not feasible without
fully automated algorithms for finding relevant relationships between amino
acids. The challenge lies not only in handling this volume of data, but also in
determining what to look for in the first place. The main goal of the
overall analysis must be to identify strong correlations in changes in amino
acid parameters (such as coordinated movements, simultaneous changes in the
structural alphabet, coincident formation of hydrogen bonds, etc.), but also to
find similarities in amino acid environments between different enzymes. This is
the task of the third pillar on which the proposed project rests:
advanced data analysis and machine learning (ML). Machine learning is a very
rich field of data science that uses vast amounts of data to derive new
insights and uncover hidden patterns in data. As such, it is well suited
for processing the information stored in the relational database built
in the XD and MD parts of the project. First, ML will be used to find different
recurring structural motifs in XD data, such as a 3D environment that is
structurally conserved across a number of related enzymes, which would point to
the importance of such motifs. ML methods such as clustering will be used
for this. They will group similar domains and rank them according to some
suitably designed measure of similarity. This step alone should already indicate
some significant parts of the enzyme (most likely the active sites, but also
previously unknown and unexpected ones). However, these procedures only provide a
basis for including the dynamic MD data that were previously linked to the static
data in the database. Doing so is likely to further highlight the importance of
certain structural motifs, assign them a dynamic role, and provide insight into
the interrelationships between these similar regions. Furthermore,
machine learning methods must be applied in order to extract the dominant modes
of amino acid communication from the forest of interactions that take place
during MD, while carefully selecting those that stand out above the noise
level. Here again, because the database already holds abundant information on
the relationships between amino acids, ML methods will have a wealth of data
from which to extract these modes. To facilitate the ML part of the
project, special attention will be paid to developing a database schema
suited to automated ML algorithms that extract communication networks in this
class of oligomeric enzymes and that can uncover hidden laws applicable to
oligomeric enzymes in general. The Python ecosystem already provides mature,
widely used components for the ML part of the project (scikit-learn, NumPy,
pandas).
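
As a minimal sketch of how two of the steps described above might look with
these libraries, the example below groups residue environments by clustering
and keeps only those residue-residue correlations that rise above a noise
level. Everything specific in it is an assumption made for illustration: the
function names, the descriptor columns (d1, d2, d3), the residue labels
(R1 to R6), the choice of Ward agglomerative clustering, and the
permutation-based noise threshold are hypothetical and are not prescribed by
the project. The data are synthetic stand-ins, not XD or MD results.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler


def cluster_environments(features, n_clusters):
    """Group residue environments by similarity of their descriptor vectors."""
    # Standardize so that no single descriptor dominates the distance.
    scaled = StandardScaler().fit_transform(features.to_numpy())
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(scaled)
    return pd.Series(labels, index=features.index, name="cluster")


def correlated_pairs(traj, n_shuffles=200, quantile=0.99, seed=0):
    """Return residue pairs whose |Pearson r| exceeds a shuffle-based noise level."""
    rng = np.random.default_rng(seed)
    data = traj.to_numpy()
    corr = np.corrcoef(data, rowvar=False)  # columns are residues

    # Noise level: shuffle every column independently in time, which destroys
    # any real coupling, and record the largest |r| that chance alone produces.
    # Caveat: shuffling also removes autocorrelation, so for real MD frames
    # this threshold is optimistic and a block-based null may be preferable.
    null_max = []
    for _ in range(n_shuffles):
        shuffled = rng.permuted(data, axis=0)
        c = np.corrcoef(shuffled, rowvar=False)
        null_max.append(np.abs(c[np.triu_indices_from(c, k=1)]).max())
    threshold = np.quantile(null_max, quantile)

    # Keep each unordered pair once (upper triangle, diagonal excluded).
    i, j = np.triu_indices_from(corr, k=1)
    keep = np.abs(corr[i, j]) > threshold
    return pd.DataFrame(
        {
            "res_a": traj.columns[i[keep]],
            "res_b": traj.columns[j[keep]],
            "r": corr[i[keep], j[keep]],
        }
    )


# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(42)

# 20 residue environments with 3 descriptors, drawn from two distinct groups.
env = pd.DataFrame(
    np.vstack([rng.normal(0.0, 0.1, (10, 3)), rng.normal(10.0, 0.1, (10, 3))]),
    index=[f"env{k}" for k in range(20)],
    columns=["d1", "d2", "d3"],
)
clusters = cluster_environments(env, n_clusters=2)

# 500 frames of one parameter for 6 residues; R2 is made to follow R1.
frames = pd.DataFrame(
    rng.normal(size=(500, 6)), columns=[f"R{k}" for k in range(1, 7)]
)
frames["R2"] = frames["R1"] + 0.1 * rng.normal(size=500)
pairs = correlated_pairs(frames)

print(clusters.value_counts())
print(pairs)
```

In this sketch the two functions consume plain pandas tables, so they could be
fed directly from queries against the relational database once its schema is
fixed. The similarity measure (here, Euclidean distance on standardized
descriptors) and the noise threshold are the parts most likely to need
redesign for real data.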