Apache Spark

Apache Spark is a unified data analytics engine, available as open-source software or as a hosted service from vendors such as Microsoft. Spark makes it easier to integrate data from many varied sources and provides the architecture for data analytics and data science at large scale. Because Spark supports multiple programming languages (Scala, Java, Python, R, and SQL), developers and analysts from many backgrounds can work together on the same application. Spark distributes processing workloads across multiple machines running as a cluster, which allows our applications to scale with demand and perform consistently under both light and heavy load.

Our platform relies heavily on applications we develop for Spark. These applications contain the functions that first ensure received data is valid and properly formatted for downstream analysis and computation. Some data need additional preparation, including analysis to generate meaningful, reportable results from Active Tasks. These steps are performed by specialized Spark functions that we have written and applied to incoming data streams. Spark also stores the transformed data in Delta Lake, a storage layer optimized for machine learning and other computational procedures.
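
The validation step described above can be illustrated with a minimal plain-Python sketch of a per-record check. The field names (participant_id, task, timestamp, value) and the function name validate_record are hypothetical and are not taken from our actual schema. In a Spark application, logic of this kind would more likely be expressed as DataFrame filters and casts, so this shows only the shape of the check.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical schema: these field names are assumptions for illustration only.
REQUIRED_FIELDS = ("participant_id", "task", "timestamp", "value")


def validate_record(record: dict) -> Optional[dict]:
    """Return a normalized copy of the record, or None if it is invalid."""
    # Reject records that are missing any required field.
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return None
    try:
        # Parse the ISO 8601 timestamp and coerce the measurement to a float.
        ts = datetime.fromisoformat(record["timestamp"])
        value = float(record["value"])
    except (TypeError, ValueError):
        return None
    # Treat timestamps without an offset as UTC (an assumption of this sketch).
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return {
        "participant_id": str(record["participant_id"]),
        "task": str(record["task"]),
        "timestamp": ts.astimezone(timezone.utc),
        "value": value,
    }
```

Records that fail the check return None, so a caller can drop them or route them to a separate location for inspection before any downstream computation runs.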

Finally, we have developed a Spark application that transforms collected data and stores it in an InfluxDB database. InfluxDB is optimized for the time-series data that dominates our platform, and it is the backend database that drives the user and care provider visualization dashboards of presents/with.
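
To make the transformation step more concrete, the sketch below converts one record into InfluxDB line protocol, the text format InfluxDB accepts for writes. The function name to_line_protocol and the measurement, tag, and field names in the usage example are hypothetical. How our application actually delivers data to the database is not shown here.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` wherever it occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value) -> str:
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a double-quoted string.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    # Tags are sorted by key, which InfluxDB recommends for write performance.
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(str(v), ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{head}{tag_part} {field_part} {timestamp_ns}"


# Usage with made-up values:
print(to_line_protocol("heart_rate", {"participant": "p7"}, {"bpm": 72.0}, 1700000000000000000))
```

Tags are indexed by InfluxDB and fields are not, so identifiers that dashboards filter on (such as a participant) fit naturally as tags, while measured values fit as fields.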

We will take advantage of MLlib, Spark’s machine learning library, in testing and deploying machine learning methods for anomaly detection.
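
As a point of reference for what anomaly detection on time-series values can look like, the sketch below flags points that lie far from the mean using a simple z-score rule. This is a plain-Python stand-in chosen for brevity. It is not an MLlib method and not necessarily one of the methods we will deploy, and the function name and default threshold are assumptions.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
```

A rule like this is a useful baseline against which learned models can be compared, although it assumes a roughly stable mean and is itself sensitive to extreme outliers.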