
High-quality data will be the fuel of your AI-First enterprise

Learn which data are necessary and how to get the right data.


Getting the right data.

All companies function to generate a return on assets, and AI-First companies are no different: they generate income from data. Getting data is the first step toward building data learning effects (DLEs).

 

AI-First companies put capital, time, and effort behind getting data, ahead of other strategic priorities. But how do you procure unique data? Where does a data strategy begin?

In this lesson you will learn methods to value data so you can smartly scope data acquisition projects, where to look for data, and how to create it from scratch.

"The data quality will separate the wheat from the chaff in the future."

Evaluate the data

 

Before investing in data acquisition, establish a framework for valuing data.

DISCRIMINATION - Was it hard to get?

Accessibility

Availability

Cost

Time

Fungibility

DETERMINATION - Is it useful?

Perishability

Veracity

Dimensionality

Breadth

Self-reinforcement


Discrimination

There are five ways to value data based on how difficult it is for others to get it (as opposed to its ultimate utility to the acquirer of that data):

  •  Accessibility

  •  Availability

  •  Cost

  •  Time

  •  Fungibility

Accessibility

Data that is hard to obtain might be hard for others to get, too. For example, it might require traveling to a special site, such as a local council office, manually collecting paper files, photocopying them, and then running them through optical character recognition (OCR) software that turns the image from the photocopier into text that a computer can read. It's also important to assess whether data may become hard to obtain in the future, based on contracts or policies related to a dataset; typically, this involves the data owner restricting access to it. For example, both government agencies and private vendors often make data publicly available at no cost for a set period of time before beginning to charge for access or removing it from circulation.
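
As a rough illustration of the last step in that collection process, here is a minimal Python sketch that runs scanned pages through OCR using the open-source pytesseract and Pillow libraries. The folder name and file format are hypothetical, and the Tesseract binary is assumed to be installed.

```python
# Minimal OCR sketch: turn scanned paper records into machine-readable text.
# Folder name and file format are hypothetical placeholders.
from pathlib import Path

from PIL import Image
import pytesseract


def scanned_pages_to_text(scan_dir: str) -> dict:
    """Convert every scanned page image in a folder into text."""
    texts = {}
    for page in sorted(Path(scan_dir).glob("*.png")):
        texts[page.name] = pytesseract.image_to_string(Image.open(page))
    return texts


# Example: digitize photocopies collected from a local council office.
# records = scanned_pages_to_text("council_office_scans/")
```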

Availability

The time it takes to pull data from a given system presents a barrier to others. Some systems only allow harvesting data at a very slow rate, because it is costly to run the system from which the data is pulled, or because the vendor stratifies its data products based on different rate limits. One example is financial market data: market data providers allow access to stock price quotations only at certain, relatively long intervals (usually microseconds) unless the customer pays more to pull it at shorter intervals. Paying more for the more frequently available data can give a customer a competitive advantage if they use that data to make important decisions; for example, a difference of a few microseconds may allow an AI-First trading system to make a profitable trade ahead of its competitors.
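
To make that rate-limit trade-off concrete, here is a hedged sketch of a polling loop that respects a vendor-imposed minimum interval between requests. The endpoint, tiers, and interval values are made up for illustration, not taken from any real provider.

```python
# Sketch: pull quotes no faster than a vendor's rate limit allows.
# The URL, tiers, and intervals are hypothetical placeholders.
import time

import requests

RATE_LIMIT_SECONDS = 1.0     # assumed cheaper tier: one request per second
# RATE_LIMIT_SECONDS = 0.001 # assumed premium tier: fresher data, higher cost


def poll_quotes(symbol: str, n_requests: int) -> list:
    """Collect quotes while respecting the vendor-imposed minimum interval."""
    quotes = []
    for _ in range(n_requests):
        resp = requests.get(
            "https://example-market-data.com/v1/quote",  # hypothetical endpoint
            params={"symbol": symbol},
            timeout=5,
        )
        resp.raise_for_status()
        quotes.append(resp.json())
        time.sleep(RATE_LIMIT_SECONDS)  # the availability barrier, in code form
    return quotes
```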

Cost

The price charged by data vendors is an obvious barrier. That price is sometimes clearly stated by the vendor in dollar terms, but sometimes it is less clearly stated, taking the form of revenue-sharing arrangements with the data vendor or a requirement to purchase expensive software to access the data. For example, a terminal provided by Bloomberg, the New York-based financial data and media company, costs thousands of dollars per month to lease, even though it is just a relatively basic computer made of commodity components; however, it provides access to high-quality, real-time data on financial markets. The price of data can sometimes be paid in nonmonetary forms, such as submitting some internal data before getting access. Because contributed data is not of equal value to everyone, it can be hard to assess the exact cost of that contribution. However, whether a data contribution requirement presents a barrier to competitors can be assessed based on the minimum amount of data required, its format, the completeness of each row, and so on. The stricter the requirements, the more difficult it is for competitors to contribute and thus access data from the vendor.

Time

The time it takes to collect data can give a head start over competitors that may want to collect that same data. Some data can be amassed only at a certain rate: for instance, weather data is dictated by the revolutions of the sun, and employment data by the rate at which the relevant government bureaus assemble it and make it available. Therefore, anyone wanting to obtain a critical mass of data released in such a way simply needs to collect it for a long time.

Fungibility

The fungibility, or interchangeability, of data affects whether it presents a barrier to others trying to achieve the same goal with that data. Data that is fungible can be swapped out for different data (that may be cheaper) without negatively affecting the quality of the decision made based on that data. For example, you could use text from any news website to train a model to understand the general meaning of a sentence.

Determination

There are also five ways to value data based on its ultimate utility (as opposed to how hard it is for others to acquire): perishability, veracity, dimensionality, breadth, and self-reinforcement. This is less straightforward because it depends on the intended use of the data.

  •  Perishability

  •  Veracity

  •  Dimensionality

  •  Breadth

  •  Self-reinforcement

Perishability

The rate at which data perishes determines its relevance. Old data may no longer represent reality and thus can cause models to generate invalid predictions. For example, the price of stocks changes almost instantaneously, so the price from even just a second ago isn't as relevant as the price from a microsecond ago. Other data types have long shelf lives: mapping data that includes the height of Mount Everest will not change (much) and thus will remain relevant for many years. Then there are cases in between, such as consumer preference data taken from surveys. Sometimes preferences are long lasting, such as clothing sizes, and sometimes they are not, like styles that may be in fashion for just a season.

Perishability can be mitigated by updating. The rate at which the vendor or source updates the data affects its perishability, and data vendors will sometimes stratify their pricing on the updating rate, charging more for more recent data. Perishable data is generally less valuable because it needs to be updated constantly, and this costs money, either for computing operations on that data or for fetching more of it.
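
One way to operationalise perishability is to drop rows older than a freshness threshold before training or prediction. Below is a minimal pandas sketch; the column name and the 30-day window are assumptions chosen for illustration.

```python
# Sketch: filter out stale rows before feeding data to a model.
# The column name and freshness window are assumptions, not a standard.
import pandas as pd

FRESHNESS = pd.Timedelta(days=30)  # assumed shelf life, e.g. survey-based preference data


def drop_stale_rows(df: pd.DataFrame, timestamp_col: str = "observed_at") -> pd.DataFrame:
    """Keep only rows whose observation timestamp falls within the freshness window."""
    age = pd.Timestamp.now(tz="UTC") - pd.to_datetime(df[timestamp_col], utc=True)
    return df[age <= FRESHNESS]
```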

Veracity

The veracity of data determines its reliability in the context of making a decision. Often, determining veracity requires manually validating data points: for example, taking a sample of product specification data, such as the voltage of a power supply - just a few rows - and checking it against data manually collected from the manufacturer. Sometimes data vendors help to determine the veracity of data by adding a guarantee that it's accurate. Finally, there are third parties that verify data accuracy across a suite of vendors, by industry or on a bespoke basis.
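
That spot-checking can be scripted once the hand-collected reference rows exist. Here is a sketch, assuming hypothetical tables keyed by a part number and a voltage field; the names are illustrative only.

```python
# Sketch: spot-check a sample of vendor rows against manually collected values.
# Table layout, key, and field names are assumptions for illustration.
import pandas as pd


def spot_check(vendor_df: pd.DataFrame, manual_reference: pd.DataFrame,
               key: str = "part_number", field: str = "voltage",
               sample_size: int = 10) -> float:
    """Return the share of sampled vendor rows that match the hand-collected reference."""
    sample = vendor_df.sample(n=min(sample_size, len(vendor_df)), random_state=0)
    merged = sample.merge(manual_reference, on=key, suffixes=("_vendor", "_manual"))
    return (merged[f"{field}_vendor"] == merged[f"{field}_manual"]).mean()
```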

Dimensionality

The number of dimensions in the data determines whether it's relevant to making a decision. Dimensions are attributes of a given entity; typically, they manifest as the number of columns in a table of data. For example, demographic data can include age, gender, income, and more.

Dimensionality is a particularly powerful determinant of value when the intended use is training an ML model, because each dimension informs the model as it tries to learn shapes and patterns in the data.
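
In tabular terms, dimensionality is simply the column count, and each added attribute gives a model more signal to learn from. A toy pandas sketch with made-up demographic fields:

```python
# Sketch: dimensionality as the number of columns describing each entity.
# The demographic values below are made up for illustration.
import pandas as pd

low_dim = pd.DataFrame({"age": [34, 51], "gender": ["F", "M"]})
high_dim = low_dim.assign(income=[72000, 55000], household_size=[2, 4])

print(low_dim.shape[1])   # 2 dimensions (columns)
print(high_dim.shape[1])  # 4 dimensions: more attributes for a model to learn patterns from
```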

Breadth

The breadth of data determines how closely it represents reality. Breadth is the number of entities or points in a distribution; this is typically manifested in the number of rows in a table of data. More breadth means more examples of the same type, more variation in the attributes of entities, and more edge cases. Sometimes more breadth comes from joining datasets from different sources or vendors, but this requires that they share the same attributes. Often, combining different datasets by lining up the attributes and filling in any missing data is what delivers more breadth.
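
Gaining breadth by combining sources usually comes down to aligning column names and filling in the gaps. A small hedged pandas sketch, with made-up vendor tables and a simple median imputation chosen purely for illustration:

```python
# Sketch: increase breadth (rows) by stacking two vendors' tables after
# aligning their attribute names. All values and column names are made up.
import pandas as pd

vendor_a = pd.DataFrame({"age": [34, 51], "income": [72000, 55000]})
vendor_b = pd.DataFrame({"customer_age": [29], "annual_income": [61000]})

# Line up the attributes, then stack the rows to gain breadth.
vendor_b = vendor_b.rename(columns={"customer_age": "age", "annual_income": "income"})
combined = pd.concat([vendor_a, vendor_b], ignore_index=True)

# Fill in any attributes one source lacks (simple median imputation as an example).
combined = combined.fillna(combined.median(numeric_only=True))
```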

Self-reinforcement

Self-reinforcing data becomes more valuable over time. Self-reinforcement manifests in attributes of the same entity that change over time but are measured in the same way. For example, performance feedback on an employee is represented as one value at one point in time and another at a later date; if the two data points are the same or trending the same way, they reinforce each other's value.
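
A toy illustration of the idea: repeated measurements of the same entities, taken the same way, that can be checked for agreement over time. The table below is entirely made up.

```python
# Sketch: later measurements of the same entity reinforce earlier ones
# when they agree or trend the same way. The data below is made up.
import pandas as pd

reviews = pd.DataFrame({
    "employee": ["A", "A", "B", "B"],
    "period":   ["2023-H1", "2023-H2", "2023-H1", "2023-H2"],
    "score":    [4.1, 4.3, 2.9, 3.0],
})

# Positive or near-zero period-over-period differences suggest the
# later value confirms the earlier one.
trend = reviews.sort_values("period").groupby("employee")["score"].diff().dropna()
print(trend)
```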

Finally, some hints

Finally, here are some tips on how to handle data (and what not to do), especially if you are still in the experimentation phase.

The experimental phase is not the time to build a “fat” data pipeline but rather the time to stay lean, so it's worth making a quick diversion here to talk about getting “just enough” data to build an AI product.

Hopefully, this means just one set of data that's located in one database and can be retrieved with a single query. The customer's first guess at which data might be predictive is probably the best starting point, because the customer knows their domain better than anyone and may have tried to solve the problem at hand in other ways.

Data preparation comes next and, with any luck, it will be minimal at this stage because the data comes from one source. The next step is formatting, so that everything uses the same units of measurement and file types. Cleaning the data to fill in missing values, delete duplicates, and remove errant values is generally necessary but, again, minimized by obtaining data from a single source. The final step is making sure the data is efficiently computable by the models. Most of the time, this isn't a major consideration with small-scale experiments.
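
Those preparation steps fit in a few lines of pandas when the data really does come from a single source. A minimal sketch of “just enough” preparation, with the column names, units, and thresholds assumed for illustration:

```python
# Sketch of "just enough" data preparation for a small experiment.
# Column names, units, and thresholds are assumptions for illustration.
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Formatting: one unit of measurement (here, grams instead of kilograms).
    out["weight_g"] = out.pop("weight_kg") * 1000
    # Cleaning: fill missing values, delete duplicates, remove errant values.
    out["weight_g"] = out["weight_g"].fillna(out["weight_g"].median())
    out = out.drop_duplicates()
    out = out[out["weight_g"].between(0, 1_000_000)]
    return out
```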

Here is what not to do at this stage:

  •  Do not label extensively 

Determining at the outset which data customers already have that might be predictive can spare the time-consuming and costly task of data labeling.

  •  Do not harvest data from multiple sources 

Doing so requires obtaining extra permissions, building more integrations, and doing more formatting. Instead, pick one dataset in one data store, run an experiment, then get another only if that dataset doesn't have any predictive power.

  •  Do not work with sensitive data

Anonymizing data is costly and may obfuscate results. However, it may be necessary to avoid being held responsible for a data breach.

  •  Do not build a separate data store 

Instead, just download the data somewhere secure with low latency, such as a local machine.

  •  Do not build a data platform 

Deciding on all the tools that the entire team will use to explore and manage data (mostly, you can store it at zerocode.ai) can wait. Needs are very likely to change, so consider delaying this choice beyond the initial phase of a project.

Take a look at how an AI-First approach can be implemented, using a financial company as an example.
