One of the probabilistic/data science projects inside SciCrop, conducted by our Head of Research, Brett Drury, aims to build Dynamic Bayesian Networks (DBNs) from text that represent a specific crop in Brazil. These networks can be used to estimate the overall production of that crop in Brazil. In this post we will explain the main concepts of how we do that.
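To make the DBN idea concrete, here is a minimal sketch of forward inference in a dynamic model with two hypothetical states (the states and probabilities are illustrative, not a real crop model):

```python
def forward_step(belief, transition):
    """One DBN time slice: propagate a belief vector through a
    transition matrix (rows: current state, cols: next state)."""
    return [sum(belief[i] * transition[i][j] for i in range(len(belief)))
            for j in range(len(transition[0]))]

# Hypothetical states: 0 = poor harvest, 1 = good harvest.
transition = [[0.7, 0.3],
              [0.2, 0.8]]
belief = [0.5, 0.5]        # prior over the first season
for _ in range(3):         # propagate the belief three seasons ahead
    belief = forward_step(belief, transition)
```

A real crop network has many interacting variables per time slice, but each slice is updated by the same kind of propagation.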
Text is a good medium from which to extract knowledge, as it contains the opinions and knowledge of a large number of people. The information that a large body of text holds exceeds the understanding of any individual or group of people in the area. The network is built by extracting causal relations from text, creating a directed graph, removing inconsistencies and irrelevant information, and then converting the graph into a DBN.
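A minimal sketch of the graph-building step, assuming hypothetical (cause, effect) pairs and treating mutual causation (both A → B and B → A) as the inconsistency to remove; the real pipeline applies richer filtering:

```python
from collections import defaultdict

def build_graph(relations):
    """Build a directed graph (node -> set of successors)
    from extracted (cause, effect) pairs."""
    graph = defaultdict(set)
    for cause, effect in relations:
        graph[cause].add(effect)
    return graph

def remove_inconsistencies(graph):
    """Drop contradictory edge pairs asserting both A -> B and B -> A."""
    for a in list(graph):
        for b in list(graph[a]):
            if a in graph.get(b, set()):
                graph[a].discard(b)
                graph[b].discard(a)
    return graph
```

The cleaned directed graph is then the skeleton over which the DBN's conditional probabilities are defined.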
The extraction of causal relations and reasoning with large DBNs are very computationally intensive tasks, and there are no tricks to speed up the processing. Consequently, we rely upon High Performance Computing (HPC) techniques to ensure that the process runs in the shortest possible time without losing any of the expressiveness of the extracted data.
Causal relations are relatively sparse, and we extract them with an LSTM+CRF (Long Short-Term Memory with a Conditional Random Field) combination using Word2Vec word vectors with 600 classes. This outperforms the previous benchmark of using only a CRF with hand-crafted features. However, to extract a sufficient number of causal relations, a large amount of text has to be parsed. In our current experiments we are parsing 36K documents. In production we will parse a much larger corpus, on the order of a million documents.
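The extractor's job is sequence labeling: each token gets a cause/effect tag. As an illustration of the output format only (this toy uses a keyword lookup, whereas the real model is the LSTM+CRF learned from data), a standard BIO encoding looks like this:

```python
# Hypothetical cue words standing in for the trained model's decisions.
CAUSE_CUES = {"drought", "frost"}
EFFECT_CUES = {"losses", "yield"}

def bio_tag(tokens):
    """Tag each token as B-CAUSE, B-EFFECT, or O (outside)."""
    tags = []
    for tok in tokens:
        low = tok.lower()
        if low in CAUSE_CUES:
            tags.append("B-CAUSE")
        elif low in EFFECT_CUES:
            tags.append("B-EFFECT")
        else:
            tags.append("O")
    return tags
```

Tagged cause/effect spans from each sentence become the (cause, effect) pairs fed into the graph construction step.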
The problem is embarrassingly parallel: a causal relation extractor can run on an individual node, so with multiple nodes each extractor processes only a fraction of the whole corpus. In an extreme example, with 36K nodes we could process the aforementioned corpus almost instantaneously. Using cloud services such as Google Cloud (in our case) allows an almost unlimited expansion of computing resources. An overnight task becomes computable in the time it takes to make a coffee.
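The fan-out pattern can be sketched with Python's standard library, where a process pool stands in for the cluster nodes and `extract_relations` is a hypothetical per-document extractor:

```python
from multiprocessing import Pool

def extract_relations(doc):
    """Stand-in for the per-document causal relation extractor."""
    return [(doc, "relation")]  # hypothetical output

def process_corpus(docs, workers=4):
    # Each worker handles documents independently; results are
    # merged at the end, just as on a real cluster.
    with Pool(workers) as pool:
        per_doc = pool.map(extract_relations, docs)
    return [rel for rels in per_doc for rel in rels]
```

On a real cluster the pool is replaced by machines, but the shape is identical: no worker needs output from any other, which is what makes near-linear scaling possible.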
The construction and editing of graphs can be achieved on a single machine; performing inference on large graphs cannot. Larger graphs have more expressive power and therefore make better predictions, but require more computational power. On a single machine we typically reduce the number of nodes by merging them with others until the graph is small enough to process. An alternative is to use HPC and spread the reasoning workload across multiple nodes. At SciCrop we use the following stack:

- AMIDST Toolbox
- Apache Spark
- Apache Hadoop
This stack allows us to reason with very large graphs. We can retain the expressive power of large graphs, but with the short inference times of smaller graphs.
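The single-machine fallback mentioned earlier, merging nodes until the graph is tractable, can be sketched on a simple successor-set representation (node names are hypothetical):

```python
def merge_nodes(graph, keep, absorb):
    """Merge node `absorb` into `keep`, redirecting all edges.
    graph: dict mapping node -> set of successor nodes."""
    merged = {}
    for node, succs in graph.items():
        if node == absorb:
            continue
        new_succs = {keep if s == absorb else s for s in succs}
        new_succs.discard(node)  # drop self-loops created by merging
        merged[node] = new_succs
    # The absorbed node's outgoing edges move to `keep`.
    absorbed = {keep if s == absorb else s for s in graph.get(absorb, set())}
    merged[keep] = merged.get(keep, set()) | absorbed
    merged[keep].discard(keep)
    return merged
```

Each merge shrinks the inference problem but also blurs a distinction the graph used to make, which is exactly the expressiveness the HPC route lets us keep.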
This project allows us to create expert models for any crop and encode this information into a directed graph. We can use this graph to reason about the market for that crop, and, with data from a farm, we can reason about that farm’s crop production. The graph’s predictive ability improves as high-quality textual information about crops is collected daily in high volume.
The causal relation extractor is an LSTM neural network with a Conditional Random Field, built with Theano and Keras. The reasoning uses AMIDST, Apache Spark, and Apache Hadoop to distribute the workload across multiple nodes. This allows the construction of massive and detailed graphs. The larger and more detailed a graph is, the better its predictive ability.