I would say it mostly depends on the nature of the data as well as the kind of queries you intend to run most often. Ultimately, as Sean said, you'll end up having to read the data from disk anyway, unless you are rich enough to live in memory. And before I start answering: from what I have seen around me, flat files are the most common solution by far.
Let me give you two concrete examples I faced this summer, for which I ended up with two different solutions:
- an RDBMS for MOOC data from platforms like edX or Coursera
- flat files for storing blood pressure waveforms of patients in the ICU
Example 1: dealing with MOOC data.
I received the data of a bunch of courses from edX, basically everything that was recorded about the students' activities: which pages they visited, what they submitted, the forum posts, and so forth. Overall the data size wasn't that huge: around 50 GB per course in the raw format I received.
In addition to the heterogeneity of the data, we intended to run a large number of different queries on it, to investigate many aspects of the course: from miscellaneous descriptive statistics (e.g. Massive Open Online Courses (MOOC): Where can I find data and statistics on MOOCs?) to machine learning and educational research, like predicting dropouts or which resources are useful to answer a given problem.
As a result, we thought that using a DBMS would be handy to achieve those tasks efficiently. We were further motivated by watching some other labs having a hard time handling CSVs scattered all over the place. Also, the largest table fit in RAM, so we didn't have any I/O trouble.
Here is one example of a query we needed to answer. As you can see in the slide, answering a simple question was a pain using flat files (the blue-grey boxes), but it could be taken care of with a single SQL query once we switched to a DBMS (namely MySQL).
(more details on this work here if interested: http://francky.me/publications.php#MOOCdb-edX-TechReport-2013)
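To give a feel for the contrast, here is a minimal, hypothetical sketch of the kind of question that used to require a chain of scripts over CSVs but becomes a single SELECT against a relational schema, e.g. "how much time does each student spend on the course per week?". The table and column names are made up for illustration (not the actual MOOCdb schema), and I use Python's built-in SQLite module just to keep the snippet self-contained; we actually used MySQL.

```python
import sqlite3  # SQLite only to keep the example self-contained; swap in a MySQL driver in practice

conn = sqlite3.connect("mooc.db")  # hypothetical database file

# Hypothetical schema: one row per observed page-view/click event,
# with a timestamp and a duration in seconds.
query = """
SELECT user_id,
       STRFTIME('%W', observed_event_timestamp) AS week,
       SUM(observed_event_duration)             AS seconds_on_course
FROM   observed_events
GROUP  BY user_id, week
ORDER  BY user_id, week;
"""

for user_id, week, seconds in conn.execute(query):
    print(user_id, week, seconds)

conn.close()
```

With flat files, the same question means locating the right CSVs, parsing them, joining them by hand and aggregating in your own code, every single time.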
Example 2: dealing with ICU data.
In this project, the data was much simpler, as we focused on a waveform dataset of patients in the ICU. There were only 22 different waveforms.
Here was the motivation (copied from a paper my lab and I have just submitted):
For typical physiological waveform studies, researchers define a study group within which they designate cases and controls. They extract the group's waveforms, filter the signals, preprocess them and extract features before iteratively executing, evaluating and interpreting a pre-selected machine learning algorithm with metrics such as area under the curve and analyses such as variable sensitivity. Recognizing that a typical study, even with modest quantities of patients, can take 6 to 12 months, we have asked how this duration can be shrunk and how multiple studies can share development effort. In response we are designing a large-scale machine learning and analytics framework, named beatDB, for mining knowledge from high-resolution physiological waveforms. beatDB is our first cut at creating very large, open-access repositories of feature-level data derived from continuous periodic physiological signal waveforms such as electrocardiograms (ECG) or arterial blood pressure. We have presently processed close to a billion arterial blood pressure beats from the MIMIC 2 version 3 waveform database and developed a strategy for feature extraction and discovery which supports efficient studies.
Thus, beatDB radically shrinks the time of large-scale investigations by judiciously pre-computing beat features which are likely to be frequently used. It supports agile investigation by offering parameterizations that allow task-specific compute and storage tradeoff decisions to be made when computing additional features and preparing data for machine learning or visualization.
To summarize, the data structure is straightforward: we have some waveforms and the features we computed on them in order to make machine learning easier.
In addition to the simplicity of the data, the queries we intend to run are always the same: take these features or these waveforms for these patients. That's it.
Importing the data into some DBMS wouldn't be that useful. In fact, it would even make retrieving the data a burden, as the dataset was pretty big (over 1 TB) and keeps growing as we add more features.
So in this second case we decided to use flat files, with some folders to group by patients/features.
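To illustrate, here is a sketch of what that boils down to: a directory convention plus a small loader, since the only access pattern is "these features for these patients". The paths, file names and formats below are made up for illustration, not the actual beatDB layout.

```python
from pathlib import Path
import numpy as np

# Hypothetical layout:
#   <root>/features/<patient_id>/<feature_name>.npy   one file per pre-computed feature
#   <root>/waveforms/<patient_id>/abp.npy             raw arterial blood pressure beats
DATA_ROOT = Path("/data/beatdb")  # made-up location

def load_feature(patient_id: str, feature: str) -> np.ndarray:
    """Fetch one pre-computed feature vector for one patient."""
    return np.load(DATA_ROOT / "features" / patient_id / f"{feature}.npy")

def load_features(patient_ids, features):
    """The only 'query' we really need: these features for these patients."""
    return {pid: {f: load_feature(pid, f) for f in features} for pid in patient_ids}

# Example: gather two features for a small study group (IDs are fictitious).
study = load_features(["p000020", "p000033"], ["mean_pressure", "beat_duration"])
```

Adding a new feature is just writing one more file per patient: no schema migration, no bulk import, and the data stays easy to move around or process in parallel.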
But we are now thinking of incorporating more intricate patient data, like demographics or what happens to them during the ICU stay, and we would like to see how the waveforms behave depending on those conditions. As the data we have becomes more complex, we are considering a more structured format.