Databricks as a platform has enabled developers, data engineers, and data scientists to gain valuable insights from big data. This blog describes three of its out-of-the-box capabilities that are critical for this purpose:
- Infrastructure management, secure deployment, high availability, and multi-tenant Spark clusters
- Exploring data, writing code, and debugging Spark applications
- Scheduling jobs and monitoring their execution
The Apache Spark-based platform runs a distributed system behind the scenes: the workload is automatically split across multiple processors and scaled up or down as needed. This increases efficiency, which in turn saves time and cost, particularly for massive tasks.
Additionally, as Databricks is HITRUST CSF certified, it has the required security and risk controls in place. This makes it especially desirable for clients in the Life Sciences industry.
Azure Databricks Benefits
- Supports multiple languages and environments
- High Productivity and Collaboration
- Production Deployments
- Version Control
- Integration with other Microsoft stacks
  - Data Lake Storage
  - Data Warehouse
  - Blob Storage
- Extensive list of data sources
  - File formats: CSV, JSON, ZIP, binary, Parquet, images, SAS
  - Storage data sources: Azure Blob, Cosmos DB, MongoDB, SQL databases, Cassandra
- High Cost-to-Performance efficiency
- Suitable for Small & Massive Jobs
- Availability of exhaustive documentation
- Excellent community support
Fig: Apache Spark Integration in Databricks
Rich Notebooks and Dashboard
Recent years have seen an increase in the use of notebooks within the Data Science and Data Engineering community (Ref 2). Databricks provides an on-par notebook experience with rich visualization embedded into it. This extra feature is particularly appealing to data scientists, as visuals for data analysis can be created without writing code. A combination of notebooks and rich visuals can be used to display critical KPIs that are essential for decision-making during the clinical trial management of a study.
In February 2020, Gartner (Refs 4 & 5) released its Magic Quadrant for Data Science and Machine Learning Platforms. Databricks was placed in the Leaders quadrant due to its strong execution and growth, with a partner ecosystem of over 500 companies. Databricks has also emerged as a leader in enabling machine learning at scale owing to its Unified Analytics Platform.
Challenges and Use-Cases
One of the requirements was to read the MarketScan data (155 files) from an Azure Blob container and load it into Azure Synapse for analytics. The files were in gzipped sas7bdat format (e.g., file1.sas7bdat.gz).
A Databricks R notebook proved to be the most cost-effective solution for this requirement, which was completed in just three steps. First, the storage container was mounted to the DBFS (Databricks File System). Second, the desired file data was read and stored in a data frame. Third, the data frame was written to a table in the Synapse Data Warehouse.
– Cost-effectiveness: Mounting the data container to DBFS only creates a logical pointer to the Azure storage rather than physically copying the files, so no storage space was consumed on DBFS.
– Enabling Exploratory Data Analysis: Data frames created in the second step can directly be utilized for EDA (Exploratory Data Analysis) in the Notebook without the support of any additional platform.
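The three steps above can be sketched in a Databricks notebook. This is a hedged illustration only: the container, account, path, and table names are made up, the `spark-sas7bdat` reader and the Synapse connector options are assumptions about the cluster's configuration, and the code runs only inside a Databricks workspace (it relies on `dbutils` and the workspace-provided `spark` session).

```python
# Step 1 (names are placeholders): mount the Azure Blob container to DBFS.
# This only creates a logical pointer -- no data is copied.
dbutils.fs.mount(
    source="wasbs://marketscan@examplestore.blob.core.windows.net",
    mount_point="/mnt/marketscan",
    extra_configs={
        "fs.azure.account.key.examplestore.blob.core.windows.net":
            dbutils.secrets.get(scope="example-scope", key="storage-key")})

# Step 2: read one gzipped sas7bdat file into a Spark DataFrame
# (assumes the spark-sas7bdat package is attached to the cluster).
df = (spark.read
      .format("com.github.saurfang.sas.spark")
      .load("/mnt/marketscan/file1.sas7bdat.gz"))

# Step 3: write the DataFrame to a Synapse table via the built-in
# SQL DW connector; JDBC URL and staging directory are placeholders.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://example.database.windows.net;database=dw")
   .option("tempDir", "wasbs://staging@examplestore.blob.core.windows.net/tmp")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.marketscan_file1")
   .mode("append")
   .save())
```

The same data frame produced in step 2 is what the EDA work described below operates on directly, with no intermediate copy.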
As an extension to the above requirement, SAS files were to be made available in DBFS. These files were to be read and processed in the RStudio interface for analysis by another team.
Azure Databricks provides an interface to run RStudio Server on the cluster. Using this setup, the requirement was again achieved in three steps. First, the storage container was mounted to the DBFS (Databricks File System). Second, the desired file data was read and data sets were created in RStudio's local storage. Third, SQL operations and graphical visualizations were performed for data analysis.
– Cost-effectiveness: The RStudio Server daemon runs on the driver (master) node of an Azure Databricks cluster, so no additional infrastructure was needed.
– The Azure Databricks web app proxies the RStudio web UI, so there is no need to modify the cluster network configuration.
One of the generic challenges was to clean and format the source file, apply transformations, and load the data into the Synapse Data Warehouse. This included, but was not limited to, editing the headers, converting columns to the proper data types, and mapping the source file columns to their counterparts in a database table. Given the size of the file and the number of records, this task was time-consuming and error-prone when done manually.
Databricks provided an efficient way to pre-process the source data, perform ETL operations, and apply transformations (e.g., SCD Type 2). One can choose any programming language to accomplish the task, and a few lines of code were enough to complete this requirement.
– A potentially tedious manual process that was prone to errors was replaced with a far more accurate and time-efficient process.
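As a minimal illustration of this kind of clean-and-format step, the sketch below uses only the Python standard library. The header names and type mapping are hypothetical stand-ins for the real source-to-table mapping, not the client's actual schema.

```python
import csv
import io

# Hypothetical mapping from messy source headers to target column
# names, and per-column type casts -- stand-ins for the real mapping.
HEADER_MAP = {"pat id": "patient_id",
              "enrol dt": "enrollment_date",
              "amt": "claim_amount"}
CASTS = {"claim_amount": float}

def clean_rows(raw_csv: str) -> list[dict]:
    """Normalize headers, rename columns, and cast values per column."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    cleaned = []
    for row in reader:
        out = {}
        for col, value in row.items():
            # Normalize the header, then rename and cast if mapped.
            name = HEADER_MAP.get(col.strip().lower(), col.strip().lower())
            cast = CASTS.get(name, str)
            out[name] = cast(value.strip())
        cleaned.append(out)
    return cleaned

sample = "Pat ID,Enrol Dt,AMT\n001,2020-01-15,42.50\n"
print(clean_rows(sample))
# -> [{'patient_id': '001', 'enrollment_date': '2020-01-15', 'claim_amount': 42.5}]
```

In Databricks the same renaming and casting would typically be expressed over a Spark DataFrame, but the logic is identical; codifying it once replaces the manual editing described above.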
Our client is a pharmaceutical company running multiple trials in response to COVID-19, a global pandemic. Given the severity and urgency of the situation, one of the requirements was to use data analytics to provide real-time insights at a global level.
Databricks notebook magic commands were used to perform SQL operations on the retrieved data. Magic commands allow SQL cells to be mixed into a notebook irrespective of its default language.
– A one-stop shop for all SQL operations and graphical visualizations, enabling the data scientists to derive meaningful outcomes
Key Performance Indicators
- Top 10 Countries conducting Interventional Trials
- Interventions Administered to Patients
- Lead Sponsors
- Distribution of Trial Phases (as percentages)
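One of the KPIs above, the top countries conducting interventional trials, can be sketched as a simple aggregation. The records below are hypothetical stand-ins for the real trial registry feed; in the notebook itself this would be a one-line SQL `GROUP BY` over the retrieved data.

```python
from collections import Counter

# Hypothetical trial records -- stand-ins for the real registry data.
trials = [
    {"country": "US", "type": "Interventional"},
    {"country": "US", "type": "Interventional"},
    {"country": "DE", "type": "Interventional"},
    {"country": "FR", "type": "Observational"},
]

def top_countries(records: list[dict], n: int = 10) -> list[tuple]:
    """Count interventional trials per country, most active first."""
    counts = Counter(r["country"] for r in records
                     if r["type"] == "Interventional")
    return counts.most_common(n)

print(top_countries(trials))  # -> [('US', 2), ('DE', 1)]
```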
Azure Databricks is a powerful and relatively inexpensive tool. With continuing strides in the digital revolution, big data technology will become indispensable for most organizations. Azure Databricks is extremely flexible and therefore an attractive option for such organizations. Additionally, Databricks lowers the barrier to entry for distributed analytics.
Apache Spark is a fast and powerful execution engine for processing large-scale data workloads, but reaping its full benefits requires deploying and maintaining a number of operational processes: deploying containers and VMs and securing the environment; building out highly available, multi-tenant clusters that give multiple users concurrent, uninterrupted access to data and analysis; providing easy-to-use interfaces that let data practitioners code and debug in a language of their choice; and integrating with GitHub and popular IDEs to manage, version-control, and review the code and models being created. Databricks lets these operational tasks be completed with relative ease, so the focus can remain on leveraging tools like Apache Spark to derive insights from data.