The POV presented here is specific to data lake enhancements or data ponds being created in response to the current pandemic; however, the thought process would apply to any purpose-built short- to mid-term ponds.
A mid-term purpose-built pond differs from a long-term data lake in that its data sets are created for a specific purpose only, and the usage, composition, and need for the data are expected to change dramatically after a certain period, say, one to two years.
The dynamic aspects of a purpose-built data pond (the type of data, specific data sets, usage patterns, and utility) change at a different velocity than those of long-term enterprise data assets, hence the distinction.
There are also financial and political reasons. Calling out the pond and drawing a clear boundary around it helps define its purpose, so that related approvals and funding can be funneled directly towards it rather than going into general distribution.
The specific data needs related to COVID-19 will vary, depending on what the organization is trying to do, but at a high level, we can assume the following characteristics:
- Structured data
  - Case statistics (number and type of cases classified by demography, geography, and other similar parameters)
  - Weather or any other affiliated data
- Semi-structured/sparse data
  - Logs from, say, different sensors measuring activities in public/private places
  - Other sensor data
The following key considerations differentiate the data described above from a standard addition to the data lake:
- Highly fragmented data: Data is going to be highly fragmented as the approach to data collection is to ‘get whatever you can get your hands on’.
- Lower quality of data: Data may be of lower quality than the rest of the data in the lake because it is not vetted and its sources are not standardized or defined.
- Uncertain application/interpretation of data: Application of this data and even the interpretation is not entirely understood as the schema is not defined for consumption.
- Unclear data harmonization process: It is not clear how this data can be joined or harmonized.
- Limited utilization of available data: Some of the data may have exceptional governance needs. For example, it may be shared for a specific purpose only, and there may be restrictions on its use for any non-COVID-19 purposes.
- Uncertain data frequency: It is not clear if we will receive incremental data in the future or not.
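The purpose restriction mentioned in the list above (data shared for COVID-19 use only) could be enforced with a simple tag check at access time. The following is a minimal sketch, assuming a hypothetical catalog in which each data set carries a set of allowed purposes and each request declares its purpose. `DatasetPolicy`, `is_use_permitted`, and the data set names are illustrative assumptions, not part of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetPolicy:
    """Hypothetical catalog entry: the purposes a data set may be used for."""
    name: str
    # An empty set is treated as "no purpose restriction" (an assumption of this sketch).
    allowed_purposes: frozenset = frozenset()


def is_use_permitted(policy: DatasetPolicy, declared_purpose: str) -> bool:
    """Return True when the declared purpose is allowed for this data set."""
    if not policy.allowed_purposes:
        return True
    return declared_purpose.strip().lower() in policy.allowed_purposes


# Illustrative data sets: one restricted to COVID-19 use, one unrestricted.
mobility = DatasetPolicy("mobility_sensor_logs", frozenset({"covid-19"}))
weather = DatasetPolicy("regional_weather")

print(is_use_permitted(mobility, "COVID-19"))   # True
print(is_use_permitted(mobility, "marketing"))  # False
print(is_use_permitted(weather, "marketing"))   # True
```

A declared purpose is only a starting point; in practice each decision would likely also be logged so that usage can be monitored and reported.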
In short, we have higher uncertainty and, potentially, fragments of poorly connected data.
Now, let us look at what potential consumers of this data would be looking for:
- Correlation of ready-to-use data sets: If data is well-curated and ready for consumption, such as data from the Johns Hopkins Coronavirus Resource Center, the user may want to correlate this data with enterprise data sets.
- Exploration and other usages of data: Most of the other data will be for exploration and data science types of usage. For these and other analytics users, the standard three pillars of a data lake apply, with some variation:
  - Discoverability: We need to provide an exploration interface that lets users discover which data sets are available, look at sample data, view basic profiling/histograms (potential variation), identify the source of the data, and see the time period it covers, among other actions. Most of these points apply to both structured and unstructured data, but unstructured data may need more, such as basic ML vectorization that enables relevance search over the most closely associated sections, making it easier to find the right unstructured data.
  - Joinability: We may need to provide additional capabilities such as auto-discovery of primary and foreign keys within the data sets and across this data and enterprise data assets. We could also offer heuristic matching to suggest potentially similar data fields among the new data sets as well as with enterprise data.
  - Governance: As mentioned above, beyond standard governance, the usage of the data may need to be monitored and reported and/or restricted to COVID-19 purposes. Such restrictions require “intent-based governance,” in which we determine the intent of the question being asked of the data lake and validate that it is for the right purpose.
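The key auto-discovery and heuristic matching described under joinability can be illustrated with a value-overlap check. This is a rough sketch under stated assumptions: tables are held in memory as column-name-to-values dictionaries, a candidate key is any fully populated column with distinct values, and a join candidate is a column pair whose distinct values mostly overlap. The function names, the 0.8 threshold, and the sample tables are hypothetical.

```python
from itertools import product


def candidate_keys(table: dict[str, list]) -> list[str]:
    """Columns with no nulls and all-distinct values (possible primary keys)."""
    keys = []
    for col, values in table.items():
        present = [v for v in values if v is not None]
        if present and len(present) == len(values) and len(set(present)) == len(present):
            keys.append(col)
    return keys


def join_candidates(left: dict[str, list], right: dict[str, list], min_overlap: float = 0.8):
    """Column pairs where most distinct left values also occur in the right column."""
    results = []
    for (lcol, lvals), (rcol, rvals) in product(left.items(), right.items()):
        lset = {v for v in lvals if v is not None}
        rset = {v for v in rvals if v is not None}
        if not lset or not rset:
            continue
        overlap = len(lset & rset) / len(lset)
        if overlap >= min_overlap:
            results.append((lcol, rcol, overlap))
    # Strongest matches first.
    return sorted(results, key=lambda r: -r[2])


# Illustrative pond data set and enterprise data set.
cases = {"county_fips": ["06037", "36061", "17031", "06037"],
         "case_count": [120, 340, 95, 130]}
stores = {"store_id": [1, 2, 3],
          "fips": ["06037", "36061", "17031"],
          "headcount": [40, 40, 95]}

print(candidate_keys(stores))          # ['store_id', 'fips']
print(join_candidates(cases, stores))  # [('county_fips', 'fips', 1.0)]
```

Exact-value overlap is deliberately simple. It can flag numeric columns that merely happen to be unique, and it misses keys that differ in formatting, so the output is best treated as suggestions for a human to confirm.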
Although there are many potential things we can do in each of the categories mentioned above, we believe the following three things should be at the top of the priority list:
- Enhanced data explorer: Stronger metadata and profiling so that users can quickly understand which data sets are available and what type of data they are looking at, including quality, sparseness, time frame, and other parameters, most of which can be auto-derived. This enhanced data explorer should also allow the user to search for pre-digested/curated data sets.
- Clear interface for users to upload their results: Users should be able to upload their conclusions or the outcome of processing this incoming data. This means that if they have segmented, prioritized, or cleansed certain data, they should be able to write that back to the pond. This ability is critical for quickly curating and understanding the data (crowdsourcing). Such a capability requires an easy way for people who are working on this data to re-publish the modified/enhanced/curated datasets back into the data pond.
- ML-enhanced unstructured data search: Given the potential volume of unstructured data, keyword search alone could underserve users of this data. ML-based vector maps, comprehension algorithms, and ML search based on questions rather than keywords can genuinely allow users to discover and start using this content.
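The auto-derived profiling described for the enhanced data explorer can be sketched in a few lines. This is a minimal illustration, assuming a column is available as an in-memory list of values with `None` or an empty string marking missing entries. `profile_column` and the metric names are hypothetical, and a real explorer would likely add histograms and type inference.

```python
from datetime import date


def profile_column(values: list) -> dict:
    """Auto-derive simple profile metrics for one column of a data set."""
    total = len(values)
    present = [v for v in values if v is not None and v != ""]
    profile = {
        "rows": total,
        # Fraction of missing entries; None when the column is empty.
        "sparseness": round(1 - len(present) / total, 3) if total else None,
        "distinct": len(set(present)),
    }
    # Report a time frame only when every populated value is a date.
    if present and all(isinstance(v, date) for v in present):
        profile["time_frame"] = (min(present), max(present))
    return profile


# Illustrative column with one missing report date.
report_dates = [date(2020, 3, 1), date(2020, 3, 2), None, date(2020, 3, 4)]
print(profile_column(report_dates))
```

For the sample column this reports 4 rows, a sparseness of 0.25, 3 distinct values, and a time frame from 2020-03-01 to 2020-03-04.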
In the long term, some data/content will find its way into the enterprise data set because of its long-term use, while the rest may fade away as the need for it diminishes or changes. We can expect that at some point this data pond can be archived, with the appropriate long-term datasets migrated to the enterprise data lake under a well-governed process.