Predictive Analytics and Big Data Analysis on Trood Core
Below we describe the core concepts of Trood Core architecture for predictive analytics, data science, and big data analytics.
This is a high-level overview of the key concepts; developers can find the technical details in the corresponding low-level documents. Please read on.
The reader is invited to get familiar with the general <Trood Core Architecture concepts> to stay on the same wavelength with us.
The key concepts of Trood Core apply to its Data Research Engine: technological diversity and integrability with maximum reliability, scalability, and maintainability.
By technological diversity, we mean an engineer’s ability to use almost any modern integrable technology if she needs it for any particular reason. Meanwhile, we always provide best practices and default solutions.
Depending on the particular problem being solved, you can decide to use the plain Trood CEP engine <Complex Events Processing (CEP)>, use <Trood Exchange (ex. Crossover)> to plug in a high-performance analytics database such as Druid, or even build distributed flow analytics with Spark or MapReduce.
Predictive Analytics on Trood Core
Predictive analytics is a comprehensive discipline that employs diverse facets of data processing and enrichment to help business users generate ideas and gain insights. The entire process of data transformation depends on what we need to have as a result, and on how we need to handle the data to add value to our business decisions.
Data-related problems are solved on several layers:
- Data collection
- Data cleansing
- Event processing
- Model training and update
- Model check and testing
Data collection and handling
To collect and handle data appropriately, we provide a series of tools, services, and approaches. The overall purpose of any data handling is to collect what we have and produce what we need: we retrieve data from diverse sources, integrate it accordingly, combine and redistribute it, analyze and convert events (which are also data) on the fly, and build historical models and outline cases for further reuse. All of this helps us deliver notable business value.
It is very important that data for analysis is collected from anywhere it could be useful. Real insights are gained when information is merged from different sources, such as an accounting system, operational reports, CRM/CX reports, etc. Having collected all the dispersed data, we are ready to build an integrated solution and enrich the initial data.
Of course, all the data stored for Trood Business Entities can be seamlessly used in the Trood Core’s Data Research Engine via out-of-the-box converting mechanisms. Those mechanisms provide integrability features that enable joint operation of diverse legacy systems with newly built tools and options. The Data Research Engine maintains the reliability of the joint solution and makes the system scalable and maintainable, handling data of almost any complexity while preserving its integrity.
CEP and Crossover
The fastest way to implement on-the-fly analysis is to deploy CEP services, which have access to all the information handled by Trood. They can either implement analysis algorithms on the fly or convert data into analytical databases for further analysis.
The data model, the data collection mechanisms for training and analyzed data, and the flow-analyzing services together form a CaseBase entity instance. Working with CaseBase is described in detail in the corresponding documentation.
To balance the system load and improve its performance, we allow you to enhance the data on the fly, eliminating noise and insignificant fluctuations.
Flow data cleansing can be organized via the <DataLogic> service. A DataLogic engineer can choose from a vast variety of mechanisms to develop a DataLogic service, but the most common practice is a service written in Scala/Java/Python/R. A good guideline to generic R algorithms for data cleansing is given in <this material https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf>
Manual (engineer-driven) data cleansing mechanisms are provided via the Trood InsightBox engine (an integration of Trood Dashboard into Enterprise Manager), described later in this document.
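As an illustration, a flow-cleansing pass of the kind a DataLogic service might run can be sketched in plain Python. The despiking approach below (replace points that deviate far from the local median) is an assumption for the sketch, not the actual DataLogic API; the data is invented.

```python
from statistics import median

def despike(values, window=5, threshold=3.0):
    """Replace points that deviate strongly from the local median.

    A crude noise-elimination pass: the local median gives a robust
    center, the median absolute deviation (MAD) a robust scale.
    """
    half = window // 2
    cleaned = []
    for i, v in enumerate(values):
        hood = values[max(0, i - half): i + half + 1]
        m = median(hood)
        mad = median(abs(x - m) for x in hood)  # robust local scale
        # A point farther than threshold * MAD from the local median
        # is treated as noise and replaced by that median.
        cleaned.append(m if abs(v - m) > threshold * max(mad, 1e-9) else v)
    return cleaned

raw = [1.0, 1.1, 0.9, 50.0, 1.0, 1.2, 0.95]  # 50.0 is a sensor spike
clean = despike(raw)                          # spike replaced by local median
```

A real DataLogic service would apply such a function to each incoming message batch rather than to an in-memory list.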
Model training and update
Data-based models aim at automating data transformation procedures and enabling their optimization. The system builds a model and trains it to perform as expected. If not relevant, the model is re-considered and updated until it provides expected results.
Within a unified architecture, many big data engines can be used to train models and implement big data analytics on Trood Core, including Sklearn, Vowpal Wabbit, TensorFlow, PyTorch, SparkML, etc. Services that use these frameworks and libraries are deployed as CaseBase or DataLogic algorithms, or as InsightBox data providers.
These services implement the powerful and popular data analysis algorithms, such as Bayesian methods, SVM, Boosting, Random Forest, Association Rules, Neural Networks, etc., which solve problems of Classification, Clustering, Time Series Analysis, Text Analysis, and more.
A data model engineer has to make her choice based on business tasks and applicable data techniques.
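The train / check / update cycle described above can be illustrated with a deliberately tiny model: a one-feature threshold classifier that is retrained when its accuracy on fresh data drops. The model, data, and drift scenario are all invented for the sketch; a real CaseBase service would wrap one of the frameworks listed above.

```python
import random

random.seed(0)

def make_batch(pos_mean, n=200):
    """Labelled one-feature samples: class 1 around pos_mean, class 0 around 0."""
    return [(random.gauss(pos_mean if y else 0.0, 0.5), y)
            for y in (random.randint(0, 1) for _ in range(n))]

def train(samples):
    """Fit a threshold halfway between the two class means."""
    pos = [x for x, y in samples if y]
    neg = [x for x, y in samples if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, samples):
    return sum((x > threshold) == bool(y) for x, y in samples) / len(samples)

model = train(make_batch(pos_mean=3.0))  # initial training
shifted = make_batch(pos_mean=1.0)       # the underlying process drifted
if accuracy(model, shifted) < 0.8:       # model no longer relevant...
    model = train(shifted)               # ...so it is updated on current data
```

The same loop scales up unchanged in spirit: evaluate the deployed model on fresh data, and retrain when it stops providing the expected results.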
Modeling and Model testing
While some engineers prefer to test their models directly in their IDEs, it is good practice to have automated model check reports available online. This is where <Trood Dashboard>, integrated within Enterprise Manager, is suitable for internal usage. This integration is called InsightBox and can be used for visual exploration of the data and for model checking. The tool is employed in Enterprise Manager (EM) to visualize data; once you set up a data refinement and cleansing option, you can visualize and compare the data before and after the cleansing operation.
The most common uses of InsightBox visualizations are as follows:
- Researched data diagrams
- Model accuracy estimation
- Data insights, (e.g. text analysis insights)
- Data cleansing
We will consider two scenarios for Trood Oil here: (1) a direct check of flow parameters for minimax (oil transportation leakage monitoring), and (2) online flow forecasting (oil-well drilling monitoring).
- For the minimax problem: direct minimax calculation over Rabbit messages (CEP), using C/C++, Go, Java, Scala, R, or Python
- For oil-well drilling: sklearn logistic regression or SVM with the kernel trick (Python)
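The minimax check in scenario (1) can be sketched as a CEP-style consumer that compares the min/max spread of flow readings over a sliding window against a tolerance. The window size, tolerance, and readings are invented; in production the values would arrive as Rabbit messages rather than a plain list.

```python
from collections import deque

def leakage_alerts(readings, window=10, max_spread=5.0):
    """Yield indices where the min/max spread in the window breaks tolerance.

    A sudden drop in flow rate widens the spread between the window's
    minimum and maximum, which is the signature of a possible leak.
    """
    recent = deque(maxlen=window)  # sliding window over the flow stream
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window and max(recent) - min(recent) > max_spread:
            yield i

flow = [100.0 + (i % 3) * 0.5 for i in range(30)]  # steady pipeline flow
flow += [92.0, 91.5, 91.0]                         # sudden drop: possible leak
alerts = list(leakage_alerts(flow))                # fires once the drop arrives
```

A service like this fits the CEP model directly: each incoming message updates the window and either passes silently or emits an alert event.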
After getting the real data, we check the model hypotheses with pre-defined InsightBox scripts (developed in R) for attribute scatterplot matrices, the Log-Likelihood Ratio Test, the Receiver Operating Characteristic curve, and monitoring of the models' Accuracy, Precision, and Recall.
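For reference, the monitoring metrics named above (Accuracy, Precision, Recall) reduce to simple counts over a confusion matrix. The sketch below shows the arithmetic in plain Python with invented labels; the pre-defined InsightBox scripts themselves are in R and are not reproduced here.

```python
def confusion(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),            # all correct / all
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # flagged that are real
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # real that are flagged
    }

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # invented ground truth
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # invented model output
report = metrics(y_true, y_pred)    # accuracy, precision, recall all 0.75
```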
And one scenario for Trood Legal: (3) clause analysis for contracts.
3. Trood Legal uses topic modeling to analyze uploaded documents and internal clause topics, and TF-IDF to extract the most valuable words within topics and clauses. The flow engine determines whether a popular and important clause has been significantly changed within the document and notifies a responsible person.
For example, the system sees that the organization's account details have been changed within the document, distinguishing them from the account details of third parties; it also sees when the wording of the standard IP-regulating clause has been changed, etc. All of these insights go directly to lawyers or partners who want to double-check such changes.
This analysis mechanism is also implemented as a CaseBase service.
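Setting the topic-modeling step aside, the TF-IDF part of the clause analysis can be sketched in a few lines: score each word by how frequent it is within a clause and how rare it is across clauses, so that clause-specific terms surface. The clause texts are invented, and this is a textbook TF-IDF formula, not Trood Legal's actual pipeline.

```python
import math
from collections import Counter

clauses = [  # invented contract clause fragments
    "licensee retains all intellectual property rights",
    "payment shall be made to the account details below",
    "the account details of third parties are excluded",
]

def tf_idf(docs):
    """Per-document word scores: term frequency * inverse document frequency."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many clauses each word appears.
    df = Counter(word for doc in tokenized for word in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return scores

scores = tf_idf(clauses)
# Clause-specific terms ("licensee", "third") outrank shared ones ("the").
```

On this base, a CaseBase service would compare the high-scoring terms of a clause across document versions to detect significant changes.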
Technical details of CaseBase and DataLogic services development, deployment and maintenance will be available within the corresponding separate documents.
The Trood Core platform offers a pack of applicable data processing and handling techniques that help users build complex data transformation paradigms. We help developers create complex and unique solutions as required.
The raw data that companies usually get from their legacy systems does not only require collection and transformation; it often needs added calculations and higher mathematics to generate complex business ideas. This is exactly where Trood Core can be employed: developing distributed systems to retrieve, collect, and compile data, build data models, add calculations, apply trends and forecasts, and produce highly valuable business information. And here distributed computing comes into play.
As stated before, Trood Core supports the use of distributed computing frameworks such as Hadoop, Spark, and MapReduce, keeping the unified infrastructure management opportunities provided by Enterprise Manager.
Trood distributed computing concepts will be described in detail in the corresponding document soon.
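To make the MapReduce model concrete, here is a single-process word-count sketch of its three phases (map, shuffle, reduce) in plain Python. On a cluster, Hadoop or Spark distribute exactly these phases across machines; the records here are invented.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map: emit a (key, value) pair per word in the record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all values by key (the cluster-wide exchange step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values independently."""
    return {key: sum(values) for key, values in groups.items()}

records = ["trood core", "core analytics", "core"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, records))))
```

Because each phase only touches local data (map, reduce) or performs a single grouped exchange (shuffle), the same program parallelizes naturally, which is what makes the model attractive for the distributed analytics described above.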