As the appetite for data grows, DataOps professionals have largely come to grips with the myriad challenges of integrating internal data from multiple applications.
They have invested in the tools and technologies to aid in this task, implemented standard processes, and staffed up internally so they can deliver analytics-ready information to data consumers in their organizations.
That said, it’s a whole new ballgame when external data is added to the mix to provide additional context and insight. Integrating external data is quickly becoming mission-critical for every business, including investment and asset management. In fact, Forrester found that nearly 80% of data leaders want a faster way to onboard external data sources.
The same research found that data teams spend 70% of their time prepping new external datasets for analysis rather than on analytics, which isn’t a good use of their time. Here’s an overview of the complexities involved with external data integration and how applying three DataOps principles can help overcome those challenges and accelerate time to value.
Why External Data Integration Is a Different Kind of Difficult
DataOps teams aren’t new to the integration rodeo, and they’re familiar with tools that can streamline tasks like bringing an acquired company’s datasets onboard or integrating datasets from the enterprise CRM instance. But external data ups the ante on complexity. External sources may offer decades of historical data that can be incredibly useful for analytics, but because that data was recorded and produced on archaic legacy systems, you’ll often find yourself dealing with outdated delivery mechanisms such as FTP.
Data shapes also change over time as sources move between legacy formats, so DataOps teams find themselves doing the integration work twice for the same data yield. Schema changes are another common issue, and you won’t necessarily get a heads-up from the source before they happen. Cases like this have happened recently: a central bank we ended up working with dropped a currency in response to geopolitical events and didn’t warn its users, breaking multiple data pipelines.
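To make the schema-drift problem concrete, here is a minimal sketch, in Python with pandas, of the kind of pre-load check a DataOps team might run against a vendor delivery. The expected column set, the file name, and the idea of modeling the feed as currency columns are illustrative assumptions, not a description of any particular source or product.

```python
import pandas as pd

# Hypothetical expectation: the columns a currency-rates feed is supposed to deliver.
EXPECTED_COLUMNS = {"date", "usd", "eur", "gbp", "jpy"}

def check_schema(df: pd.DataFrame) -> None:
    """Fail fast if the source silently adds or drops columns (for example, a dropped currency)."""
    actual = set(df.columns)
    missing = sorted(EXPECTED_COLUMNS - actual)
    unexpected = sorted(actual - EXPECTED_COLUMNS)
    if missing or unexpected:
        raise ValueError(f"Schema drift detected: missing={missing}, unexpected={unexpected}")

# Usage sketch: validate each new delivery before it reaches downstream pipelines.
df = pd.read_csv("fx_rates_latest.csv")  # hypothetical file name
check_schema(df)
```

A check like this surfaces a silently dropped column at ingestion time rather than deep inside a downstream pipeline.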
Similar challenges can arise with internal data, but DataOps teams have a level of control internally that they don’t have with external sources. That same dynamic is in play with missing or late data and issues like quality drift, which is why data observability is so critical. On the transformation side, DataOps teams working with internal data have more insight into formats and sources, and a smaller universe of permutations to deal with, than teams transforming external data into analytics-ready information.
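On the observability point, a basic freshness check is one of the simplest guards against missing or late data. The sketch below, again in Python, assumes deliveries land as files and uses an arbitrary 24-hour threshold; both are illustrative assumptions rather than recommendations.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Hypothetical freshness threshold: flag the delivery if it is more than a day old.
MAX_AGE = timedelta(hours=24)

def check_freshness(path: str) -> None:
    """Raise if the latest delivered file has not been updated within MAX_AGE."""
    modified = datetime.fromtimestamp(Path(path).stat().st_mtime, tz=timezone.utc)
    age = datetime.now(timezone.utc) - modified
    if age > MAX_AGE:
        raise RuntimeError(f"Data is late: {path} was last updated {age} ago")

check_freshness("deliveries/fx_rates_latest.csv")  # hypothetical path
```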
Three DataOps Principles to Apply Now
The overwhelming demand for more data, combined with challenges that are either unique to external datasets or magnified by them, requires a new approach to providing access to analytics-ready data from third-party sources. These three DataOps principles can serve as tentpoles for a new strategy.
- Align teams on data project goals upfront: It’s critical to set clear objectives for data projects so that data scientists and engineers are on the same page from the beginning. Better communication can alleviate the challenges data scientists face navigating APIs and big data systems, which might otherwise delay important insights. Closer collaboration also helps companies avoid delays caused by a lack of transparency and a limited understanding of internal data use cases when data engineering work is outsourced, as it often is. Defining objectives at the start leaves far less room for miscommunication.
- Remove data silos and transition to a centralized data repository: A centralized data repository with a single catalog is the holy grail for many enterprises, yet few actually achieve it. Obstacles to centralization include data stored on multiple platforms with different architectures in a variety of formats that must be standardized and reshaped for use in analytics. Other barriers include a lack of automation tools and internal resources to achieve centralization. But there are numerous benefits to a centralized data repository, including streamlined data licensing management that can result in tremendous cost savings, and the creation of a single source of truth that underpins analytics, which improves data science quality. A central location also enables companies to standardize data and reduce pipeline maintenance costs.
- Automate integration, transformation, and monitoring of fresh external data: Handling these tasks without automation is prohibitively labor intensive, which is why data backlogs are insurmountable at many organizations. It also explains why DataOps teams have so little time for value-add work and why nearly 60% of business leaders say the time to value for third-party data is too long, according to Forrester. Automation is a game-changer. Automated validation and transformation tools can reshape data in real time, improve data quality, and reduce the time organizations currently spend preparing, monitoring, and remediating data. Custom validations can ensure that ingested data meets format and content expectations, and automated notifications can improve data supply chain visibility (see the sketch after this list). By automating ingestion, transformation, and monitoring, data engineers can focus on the value-add work of analytics.
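As a rough illustration of the last principle, the sketch below pairs a custom validation with a notification hook. The required columns, the negative-price rule, the file name, and the webhook URL are all hypothetical assumptions made for the example; they are not features of any specific platform.

```python
import json
import urllib.request

import pandas as pd

# Hypothetical content expectations for an ingested dataset.
REQUIRED_COLUMNS = {"ticker", "price", "as_of_date"}

def notify(message: str, webhook_url: str = "https://example.com/hooks/data-ops") -> None:
    """Send a simple alert to a (hypothetical) ops webhook so issues are visible immediately."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(webhook_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable validation failures."""
    problems = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    elif df["price"].lt(0).any():
        problems.append("negative prices found")
    return problems

# Usage sketch: validate a fresh delivery and alert on failures instead of letting
# bad data flow silently into downstream analytics.
df = pd.read_csv("prices_latest.csv")  # hypothetical file name
issues = validate(df)
if issues:
    notify("Validation failed for prices_latest.csv: " + "; ".join(issues))
```

The point of the pattern is that failures surface as alerts at ingestion time rather than as silent gaps in downstream analytics.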
Some business leaders at companies with an increasing appetite for external data view the lag time between data acquisition and time to value as an internal headcount problem. But in many cases, they don’t need more data engineers. Instead, the data engineers need the right tools in place, so they can focus on analytics instead of spending their talent on mundane, never-ending tasks related to external data integration, transformation, and monitoring.
When you apply these DataOps principles to external data challenges — collaborating with external partners and investing in the tools to make your data engineers’ jobs easier, creating a central repository, and automating core processes — you can get onboarding backlogs under control and accelerate time to value.
About the Author
Dan Lynn is SVP of Product at Crux. External data is a special kind of hard, and Crux is here to make integration, transformation, and observability easy. We’re on a mission to make all information data science and analytics ready. Crux is an external data automation platform focused on the integration, transformation, and observability of third-party data. Companies use Crux for data delivery and operations, bridging the data engineering gap between data suppliers and data consumers. Crux’s state-of-the-art platform automates the development of data pipelines, the ongoing validation of data quality, and the management of operations between suppliers, consumers, and cloud platforms at scale. Backed by state-of-the-art technology and a team of seasoned data engineers, the sky is the limit when it comes to accelerating your external data integration workflows.