Setting up Data Collection (ML4Devs, Issue 8)

Machine Learning starts with collecting data.

Cliché: Without data, there can be no data science.

But it is true.

While learning data science, we mostly use public data sets or data scraped off the web. But in ML-assisted products, most of the data is generated by and collected through business applications.

The first step in any data pipeline is instrumenting your application to:

  • Capture needed data when an interesting event happens in the application

  • Ingest the captured data into your data storage (typically an event queue like Kafka)

This sequence of events is commonly known as an event-stream or click-stream. The quality of your data depends on the accuracy and completeness of what you capture and ingest.
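To make the capture and ingest steps concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Event` fields, the `capture` helper, and the `InMemoryQueue` class, which only stands in for a real event queue such as Kafka. It shows one way to reject incomplete events at capture time, so that they never reach storage.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class Event:
    """One captured application event (all field names are illustrative)."""
    name: str
    user_id: str
    properties: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

    def validate(self):
        # Check completeness at capture time, before the event is ingested.
        if not self.name:
            raise ValueError("event name is required")
        if not self.user_id:
            raise ValueError("user_id is required")

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)


class InMemoryQueue:
    """Hypothetical stand-in for an event queue; it only appends to a list."""

    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)


def capture(queue, name, user_id, **properties):
    """Build an event, validate it, and publish it to the queue."""
    event = Event(name=name, user_id=user_id, properties=properties)
    event.validate()
    queue.publish(event.to_json())
    return event


queue = InMemoryQueue()
capture(queue, "checkout_clicked", "user-42", cart_value=19.99)
```

In a real application, `publish` would hand the serialized event to your queue's producer client instead of a list.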

There are three broad alternatives for capturing and ingesting a click-stream.

Do It Yourself (DIY)

Write a small library in the language of your application that captures the event and sends it to a microservice endpoint or a cloud function for further processing and storage on a platform such as AWS, Google Cloud, Azure, Snowflake, or Databricks.

This is the most flexible alternative: you can optimize it for the needs of your application and your data requirements.

It also takes the most development effort. You need to write code to process the data and store it in a data lake or data warehouse. If you use a 3rd-party analytics or ML application, it can most likely consume data from a popular lake or warehouse.
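As a rough idea of how small such a library can start out, here is a sketch in Python. The `EventClient` class, its `track` and `flush` methods, and the `http_transport` helper are hypothetical names of my own, and the collector URL is a placeholder you would replace with your own endpoint. The client buffers events and sends them in batches, so the application does not make one network call per event.

```python
import json
import urllib.request


def http_transport(url):
    """Return a sender that POSTs a JSON batch to `url` (a placeholder endpoint)."""
    def send(batch):
        request = urllib.request.Request(
            url,
            data=json.dumps(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()
    return send


class EventClient:
    """Buffers events and hands them to a sender function in batches."""

    def __init__(self, send, batch_size=20):
        self._send = send
        self._batch_size = batch_size
        self._buffer = []

    def track(self, name, **properties):
        self._buffer.append({"name": name, "properties": properties})
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # Swap the buffer out first so new events are not mixed into this batch.
        batch, self._buffer = self._buffer, []
        self._send(batch)


# Usage with a fake sender, so the sketch runs without a network.
sent_batches = []
client = EventClient(sent_batches.append, batch_size=2)
client.track("page_view", path="/pricing")
client.track("signup_clicked")
client.flush()
```

To send to a real endpoint, you would pass `http_transport("https://collector.example.com/events")` as the sender. This sketch drops a batch if sending fails; a production library would also need retries, a flush on shutdown, and probably a background thread.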

Fully Outsource It

If you are doing analytics or business intelligence, you may use a tool like Google Analytics, Mixpanel, Amplitude, or Heap.

This is the quickest and easiest approach to get started. These tools provide an SDK with simple APIs to dump the data, and can compute and show the common analytics charts.

This approach is also the least flexible. I recommend it for analytics, but not for collecting data for data science or machine learning.

Also, carefully examine the cost structure to decide if it is a good approach for your data load.

The Middle Path

There are a number of tools that provide a library to send events, and also a rich list of connectors to filter, lightly process, and route that data to multiple destinations (e.g. data lakes, warehouses, and popular 3rd-party tools).
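The idea of filtering, lightly processing, and routing can be sketched in a few lines of Python. This is not how any particular tool implements it; the `Destination` class and the `route`, `only_events`, and `drop_fields` helpers are hypothetical, and the destinations are plain in-memory lists. Each destination gets its own chain of transforms, and a transform that returns `None` drops the event for that destination only.

```python
class Destination:
    """A named sink with its own filter/transform chain (in-memory stand-in)."""

    def __init__(self, name, transforms=None):
        self.name = name
        self.transforms = list(transforms or [])
        self.received = []

    def deliver(self, event):
        current = dict(event)  # copy so one destination cannot affect another
        for transform in self.transforms:
            current = transform(current)
            if current is None:  # the event was filtered out for this sink
                return
        self.received.append(current)


def route(event, destinations):
    """Offer the same event to every destination."""
    for destination in destinations:
        destination.deliver(event)


def only_events(*names):
    """Filter: keep only events whose name is in `names`."""
    allowed = set(names)
    return lambda event: event if event["name"] in allowed else None


def drop_fields(*fields):
    """Light processing: remove the given fields from the event."""
    return lambda event: {k: v for k, v in event.items() if k not in fields}


warehouse = Destination("warehouse")
analytics = Destination("analytics", [only_events("purchase"), drop_fields("email")])

route({"name": "purchase", "email": "a@example.com", "amount": 5}, [warehouse, analytics])
route({"name": "page_view", "email": "a@example.com"}, [warehouse, analytics])
```

Here the warehouse receives both events unchanged, while the analytics destination receives only the purchase, with the email field removed.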

What is the best solution for you?

The convenience and rich connectors offered by tools like Fivetran and RudderStack are valuable. But whether they are worth it depends on how diverse your needs are and how deep your pockets are.

I recommend Do It Yourself if:

  • you have a high volume of events/data (convenience will most likely be expensive), or

  • your data processing is limited to a single cloud provider.

Consider Fully Outsource It only if you are collecting a moderate amount of data with a typical schema and are mostly doing analytics.

For the rest of the use cases, the tradeoffs depend on your in-house data engineering expertise and the diversity of your data sources and processors. I suggest checking out the Snowplow and RudderStack GitHub repos.