
In a previous note, also published on Ness’s Blog, I wrote about Platform Modernization and Artificial Intelligence, two topics we regularly discuss with our customers as they consider how to further extend the productivity of their digital platforms. “Data” and “Cloud” are also frequently part of our discussions. Below are some thoughts on two specific issues: using metadata for data lineage and avoiding cloud vendor lock-in.
Metadata for Data Lineage
Data quality is an issue for most organizations that are trying to monetize their data; e.g., by extracting insights that can move the needle for their business. If the data that is input to a predictive algorithm is “dirty” (e.g., missing or invalid values), then any insights produced by that algorithm cannot be trusted.
To achieve data quality, each data value must have a clearly defined lineage; i.e., an enterprise user should be able to determine where it came from, what transformations it underwent along the way, and where it is going (that is, what other data items it affects). Data lineage provides an enterprise with many benefits; e.g., the ability to perform root-cause analysis by tracing lineage backwards from a given data item (to find all data that influenced it) and impact analysis by tracing forwards (to identify all other data that it affects).
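The backward and forward tracing described above amounts to a traversal of a directed graph of data flows. The following is a minimal sketch of that idea, not a description of any particular lineage tool; the `LineageGraph` class and the data set names (`crm.customers`, `reports.churn_score`, and so on) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph: a flow source -> target means target is derived from source."""

    def __init__(self):
        self._downstream = defaultdict(set)  # source -> items derived from it
        self._upstream = defaultdict(set)    # target -> items it was derived from

    def add_flow(self, source, target):
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _reach(start, neighbors):
        """Breadth-first search returning every item reachable from start."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # only relevant if the flows contain a cycle
        return seen

    def trace_backward(self, item):
        """Root-cause analysis: all data that influenced item."""
        return self._reach(item, self._upstream)

    def trace_forward(self, item):
        """Impact analysis: all data that item affects."""
        return self._reach(item, self._downstream)


# Hypothetical data sets, for illustration only.
graph = LineageGraph()
graph.add_flow("crm.customers", "staging.customers")
graph.add_flow("staging.customers", "warehouse.dim_customer")
graph.add_flow("erp.orders", "warehouse.fact_sales")
graph.add_flow("warehouse.dim_customer", "reports.churn_score")
graph.add_flow("warehouse.fact_sales", "reports.churn_score")

print(sorted(graph.trace_backward("reports.churn_score")))
print(sorted(graph.trace_forward("staging.customers")))
```

In this sketch, a bad value in `reports.churn_score` can be traced back to every upstream source, and a planned change to `staging.customers` can be checked against everything downstream of it.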
It sounds great, but where does data lineage information come from? There are several competing techniques to collect lineage metadata, each of which has its strengths and weaknesses:
- Data Similarity Lineage: This approach builds lineage information by examining data and schemas without actually accessing your code. On the one hand, this approach will always work regardless of your coding technology, because it analyzes the resulting data, no matter which technology generated it. On the other hand, it has several glaring weaknesses; e.g., it can miss data flows produced by logic that runs only rarely, such as end-of-year processing.
- Decoded Lineage: This approach focuses exclusively on the code that manipulates the data, providing the most accurate, complete and detailed lineage metadata, as every single piece of logic is processed. But, it has some weaknesses; e.g.:
  - Code versions change over time, so your analysis of the current code’s data flow may miss an important flow that has since been superseded.
  - The code may be doing the wrong thing to the data. For example, suppose your code stores personal identification information in violation of GDPR and despite clear requirements to the contrary from the product manager. A Decoded Lineage tool will faithfully capture what the code does without raising a red flag.
- Manual Lineage Mapping: This approach builds lineage metadata by mapping and documenting the business knowledge in people’s heads; i.e., talking to application owners, data stewards and data integration specialists. The advantage of this approach is that it provides prescriptive data lineage: how data should flow, as opposed to how it actually flows after implementation bugs. But because the metadata is based on human knowledge, it is likely to be contradictory (because two people disagree about the desired data flow) or partial (if you do not know about the existence of a data set, you will not ask anyone about it).
As you can see, there is no magic bullet – each approach has its strengths and weaknesses. In Ness’s experience, the best solution combines all three approaches:
- Start with Decoded Lineage.
- Augment with Data Similarity Lineage to discover patterns in the database.
- Augment with Manual Lineage Mapping to capture how the data flows were supposed to be implemented.
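One way to picture the combination is to record, for each data flow, which techniques reported it, and then compare what was observed against what was intended. The sketch below assumes each technique yields a list of (source, target) pairs; the `CombinedLineage` class, its method names and the sample flows are all hypothetical and do not reflect any specific product.

```python
DECODED, SIMILARITY, MANUAL = "decoded", "similarity", "manual"


class CombinedLineage:
    """Merges flows from several techniques, remembering who reported each one."""

    def __init__(self):
        self.flows = {}  # (source, target) -> set of techniques that reported it

    def add(self, technique, edges):
        for edge in edges:
            self.flows.setdefault(edge, set()).add(technique)

    def observed_but_not_intended(self):
        """Flows seen in code or data that nobody described as intended."""
        return sorted(e for e, found_by in self.flows.items() if MANUAL not in found_by)

    def intended_but_not_observed(self):
        """Flows people expect that neither code nor data analysis confirmed."""
        return sorted(e for e, found_by in self.flows.items() if found_by == {MANUAL})


# Hypothetical sample flows, for illustration only.
lineage = CombinedLineage()
# 1. Start with Decoded Lineage.
lineage.add(DECODED, [
    ("crm.customers", "warehouse.dim_customer"),
    ("crm.customers", "analytics.raw_pii"),
])
# 2. Augment with Data Similarity Lineage.
lineage.add(SIMILARITY, [
    ("crm.customers", "warehouse.dim_customer"),
    ("legacy.accounts", "warehouse.dim_customer"),
])
# 3. Augment with Manual Lineage Mapping.
lineage.add(MANUAL, [
    ("crm.customers", "warehouse.dim_customer"),
    ("erp.orders", "warehouse.fact_sales"),
])

print(lineage.observed_but_not_intended())
print(lineage.intended_but_not_observed())
```

Flows in the first list are candidates for review (for example, personal data copied somewhere it should not be), while flows in the second list may point to missing implementations or to gaps in the automated analysis.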
Avoiding Cloud Vendor Lock-In
Many of Ness’s customers are in the process of transitioning some parts of their business to the Cloud. Technology has overcome early concerns about the Cloud’s privacy and reliability, and the Cloud vendors provide a very tempting tool stack with capabilities that may be hard for you to implement on your own; e.g., serverless processing, elasticity.
At the same time, Ness reminds our customers that vendor lock-in to a specific Cloud provider is still vendor lock-in, no matter how financially or technically attractive it is. Down the road, you may find that you need to switch Cloud providers; e.g.:
- Suppose you run a supermarket chain using Amazon Cloud, and Amazon one day decides to compete in your space. You may not be comfortable at that point with Amazon storing your sensitive customer data.
- Some Cloud products have a “hockey stick” pricing model, where the price rises precipitously once you cross a certain size or performance threshold.
- A Cloud provider could one day decide to discontinue a product you find essential, forcing you to look for an equivalent product on another vendor’s Cloud. Remember Google Search Appliance, a mainstay in many companies for enterprise search, which Google discontinued after 9 years.
- Privacy regulations could force you to move personal data from a Cloud vendor to your own on-premises private cloud.
What can you do to avoid Cloud vendor lock-in? Here is some advice Ness gives its customers:
- Wherever possible, use the Cloud as infrastructure rather than as a platform. At the infrastructure level, you are using the Cloud to provide you with hardware, so you can deploy and run your own containers. In that case, your dependence on a specific Cloud vendor is minimal. On the other hand, using the Cloud as a platform means you are using higher-level functionality like databases and elasticity frameworks that are far less standard across Cloud vendors.
- Wherever possible, stick to standard SQL, which is fairly uniform across all Cloud vendors, and avoid non-standard SQL extensions or stored procedures, which can be hard to port to other databases.
- Use features that are supported by all Cloud providers; e.g., serverless computing. The API may differ from vendor to vendor, but converting from one to another is a relatively painless effort. Avoid features that exist only in one vendor’s platform; e.g., a proprietary machine learning development platform.
- Choose commercial products that support all the major Cloud providers, so you can easily move from one vendor to another.
- Where feasible, consider using multi-Cloud APIs that hide the differences between Cloud platforms. If you are using the Cloud as infrastructure, there is OpenStack, a free and open source software platform for cloud computing, which consists of interrelated components that control hardware pools of processing, storage and networking resources throughout a data center. If you are using the Cloud as a platform, consider Cloud Foundry or OpenShift, each of which provides a uniform Cloud platform on any Cloud vendor’s infrastructure.
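At the application level, the same principle can be applied by coding against a small interface of your own rather than directly against one vendor’s SDK. The sketch below is an illustration under stated assumptions: `BlobStore`, `InMemoryBlobStore`, `LocalDiskBlobStore` and `archive_report` are hypothetical names, and the two implementations shown use only the Python standard library. An adapter for a specific Cloud vendor’s object storage would wrap that vendor’s SDK behind the same two methods; it is deliberately not shown here.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    """The only storage operations the application is allowed to depend on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Stand-in implementation, useful for tests."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class LocalDiskBlobStore:
    """Stores each blob as a file under a root directory."""

    def __init__(self, root) -> None:
        self._root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def archive_report(store: BlobStore, report_id: str, body: str) -> str:
    """Application code: knows about BlobStore, not about any vendor."""
    key = f"reports/{report_id}.txt"
    store.put(key, body.encode("utf-8"))
    return key


# The same application code runs unchanged against either backend.
memory_store = InMemoryBlobStore()
key = archive_report(memory_store, "2024-q1", "revenue up")

with tempfile.TemporaryDirectory() as tmp:
    disk_store = LocalDiskBlobStore(tmp)
    disk_key = archive_report(disk_store, "2024-q1", "revenue up")
    disk_copy = disk_store.get(disk_key)
```

With this design, switching providers means writing one new adapter class instead of touching every place the application reads or writes data. The trade-off is that the interface can only expose features that every backend supports.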
So, when should you use a Cloud provider’s vendor-specific technology? When it improves scalability, elasticity, availability, resilience or reliability. These are features of the underlying infrastructure rather than the application, and you should take advantage of your Cloud vendor’s capabilities and managed services rather than trying to achieve them yourself. While this may add some degree of vendor lock-in, the benefits far outweigh the risks, and porting to a different Cloud vendor’s infrastructure is a manageable risk.