But following the example, why would a pen and paper be the choice for writing stuff down? Couldn’t we just write on our phones or on a computer? Do we even need to write anything at all? Couldn’t we just record an audio message to remind us later of what we intended to write? The answer to all of the above is… it depends. It depends on the purpose of your writing (you wouldn’t handwrite an executive paper), on your ability to use the writing tools, and on the accessibility of what you wrote (for instance, whether you need to read it anywhere or only in one specific place).
If you have been reading our series of articles (and if not… you should, and you can do so here), you might guess where we’re going: purpose and context before action. Technical choices are hard when it comes to infrastructure and tools. We get it: they are expensive, upper management wants to see results, the market is flooded with options, and there is no obvious Ferrari among them. So let’s walk through some key concepts to make those choices easier (or, at least, better informed).
Mind your people and processes
Our information framework considers processes and people just as important as technology. A robust set of data is nothing without someone to read it, and a data analyst with a useful insight is nothing if there’s no way for them to communicate it. At this point we should be asking ourselves: who is going to access the data? How much technical knowledge do they have? The solutions and tools you give them should depend on where you stand today, on how you will be able to leverage those tools, and on what you will need in the future to unleash their potential.
It’s good to aim high when designing a data strategy, but whatever capabilities you wish to enable with a set of technological solutions, be aware that you’ll need people with the technical skills to manage them and processes to make the fruits of their efforts useful for the rest of the company. These are as much a part of your plan as the software licenses you are considering.
Technology from necessity
Following the information from our previous articles, by now you probably know enough about your data and your organization’s needs to evaluate the situation. How close to raw data do users need the information to be in order to get the most value from it? How busy are the servers, and how would adding new requests to access their data for analytical purposes affect them? Does our technical team need time to catch up? What volume of data do I wish to analyze? What type of data do I have, or expect to have in the near future? Is it structured tables in a relational database, semi-structured JSON files, or images and audio?
Answering these questions and overlapping them with your understanding of your existing capabilities and processes will help you know what SOLUTIONS you need to enable, and what gaps stand in the path to them.
Here, in broad strokes, is what you will need: a place to store your data for analytical purposes. Based on the types and volumes of data you manage, you will also know whether you’ll need simple file storage, different types of databases to handle each format, or both.
Hint #1: Most users say “We need all data from everywhere, so think large.”
Reality: Closely analyze the sources of transactional data you have in your organization and the actual volumes. Questions like “How many clients does my organization have today? How many clients are we expected to have in 3-5 years? What is their level of interaction with my company, and over which channels? How much of that data serves an analytical purpose?” are simple yet key starting points. Planning storage for a telco (which typically produces billions of records a minute) is not the same as planning it for a mid-size retailer with a couple of thousand purchases a day. There are two things you may want to know in advance: complex databases are expensive over time and, perhaps most critically, the most cutting-edge technologies make it difficult to find talent. All I’m saying here is: look closely at what you really need.
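To make that difference in scale concrete, here is a minimal back-of-envelope sketch. Every figure in it (records per day, average record size) is an illustrative assumption of our own, not a benchmark for any real platform:

```python
# Back-of-envelope daily storage growth for two hypothetical sources.
# Every figure below is an illustrative assumption, not a measured value.

def daily_storage_gb(records_per_day: float, avg_record_bytes: float) -> float:
    """Rough daily growth, in GB, for a single transactional source."""
    return records_per_day * avg_record_bytes / 1e9

# Hypothetical telco-scale source: ~1 billion records/day at ~500 bytes each
telco = daily_storage_gb(1_000_000_000, 500)

# Hypothetical mid-size retailer: ~5,000 purchases/day at ~2 KB each
retailer = daily_storage_gb(5_000, 2_000)

for name, gb_per_day in [("telco", telco), ("retailer", retailer)]:
    print(f"{name}: ~{gb_per_day:,.2f} GB/day (~{gb_per_day * 365:,.0f} GB/year)")
```

Swap in your own sources and record sizes; the point is that the same simple formula can return answers several orders of magnitude apart, and that gap is exactly what your storage plan has to respect.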
Hint #2: Beware of the trap of not really knowing how many active clients you have!
Although you will almost certainly want to store all your clients, even the ones who made a one-time purchase 5 years ago, we see many companies falling into the trap of estimating transactional volume (i.e., how many interactions clients perform) on the basis of the total number of clients, rather than the average number of “active clients” over the last year (or whatever time frame your business mandates). If you have, say, 10 million clients and estimate their potential sales, calls to the call center, and so on from that figure, you will get a very different volume estimate once you find out that only 20% of those clients are really active (and God forbid there are duplicate clients in that database!). So, although this looks like a technical task, you need to involve the business departments to get a more accurate picture of your present situation, as well as of your company’s future, and an accurate plan for what’s coming.
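As a hedged illustration of how much that correction matters (the figures below are invented for the example, not drawn from any client), compare the two estimates:

```python
# Illustrative only: why "total clients" and "active clients" give very different plans.
total_clients = 10_000_000            # every client ever registered (hypothetical)
active_share = 0.20                   # assume only 20% interacted in the last year
interactions_per_active = 30          # hypothetical yearly purchases + calls + web events

naive = total_clients * interactions_per_active
realistic = total_clients * active_share * interactions_per_active

print(f"Naive estimate:     {naive:,.0f} interactions/year")
print(f"Realistic estimate: {realistic:,.0f} interactions/year")
print(f"Overprovisioning factor: {naive / realistic:.0f}x")
```

Five times the volume is roughly five times the cost, which is why the business departments need to validate that “active” share before anyone signs a contract.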
Data integration and orchestration tools: to extract the information from your transactional storages into your analytical ones. Based on how close to real time you need the information refreshed, you will also know what type of integration you want and can focus on tools tailored to that need.
Hint #3: Most users (if not all!) say they need all data in real time.
Reality: Real-time data requirements are very specific cases that only need a specific subset of data truly in real time. Most use cases will do more than fine with a daily update. While planning an architecture for the future is absolutely correct, we find too often that a pressing need for real-time data (which, by the way, is prohibitively expensive for some budgets) is not such a need after all; it brings confusion to the decision table and, even worse, might lead to overly complex architectures that the organization is not ready for in the first place.
Make data available to users. Based on the people and processes you have come to understand, you will also know whether you need a reporting tool to graphically convey insights, an easily accessible structured database that lets semi-technical business users work with the data, a sandbox for code wizards to do their magic, or (likely) all of the above.
Hint #4: Beware of users wanting that magical thing called “machine learning” right at day 1.
Reality: According to Gartner, analytical maturity starts with descriptive analytics. This means that, during the initial period in which your users need to adapt to using data in their decision-making process and to trust that data, reporting and visualization tools will be more than enough. Don’t get me wrong: we love machine learning and its potential, but only if you have a large set of curated and trusted data to feed it (“good data”, as Andrew Ng calls it). If you are just getting familiar with data and building your company’s first data solution, starting with a complex use case will actually jeopardize your long-term goal.
If you can answer all of the questions above with confidence, you can pretty much say that you have a SOLUTION in mind. Only then does it make sense to start the search for the ideal tool to bring it to life.
Tools: a simpler equation
Now, with all of the above taken care of, the actual tools (as in brands) have a more precise boundary based on what you expect. What comes into play now are more measurable topics such as cost, performance, and interoperability. The balance of those variables is up to you, because all of them involve trade-offs. Sure, competing platforms such as Azure, AWS, or Google Cloud Platform will have specific perks for some component of their ecosystems, but if you’re already working in one of them, that will probably tilt the balance towards staying rather than setting up a new environment. Familiarity, and minding the talent available for the tool, will probably save you whatever time you might gain by choosing another one. And if a difference in performance is heavy enough to tip that decision… there’s always a small startup willing to give you a Plan C, likely with more personalized customer service.
This can be a bit disappointing, I know. You expected (yet another?) article with “5 reasons this tool ROCKS” or “Data Lake vs Data Warehouse vs Data Lakehouse: Ultimate showdown” and, instead, you’re left with more questions. But all of our experience at Baufest suggests that a hard question is worth more than an easy answer. Our approach here is to help you highlight the features you value most, understand how close a couple of tools really are, and make a data-driven decision on these topics.
Much like the tools we use to convey messages, it’s really not about the tools themselves; it’s about understanding that each of them serves a useful and different purpose. Every so often you see announcements that pen and paper have been rendered useless by word processors, but you know that’s far from the truth. At the end of the day, what matters most is the message and the fact that we all speak a language that allows understanding.