Is pandas now faster than data.table?
https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping
The data.table benchmarks haven't been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before, but I would consider switching if pandas can beat data.table.
Tags: python r pandas data data.table
asked Oct 25 '17 at 2:43 by xiaodai
edited Nov 1 '18 at 15:11 by oW_
That's a really bad reason to switch to Python.
– Matthew Drury, Oct 25 '17 at 3:47
@MatthewDrury How so? Manipulating data is 80% of my job; only 20% is fitting models and presentation. Why shouldn't I choose the tool that gives me results the quickest?
– xiaodai, Oct 25 '17 at 4:31
Both Python and R are established languages with huge ecosystems and communities. To reduce the choice to a single library is worshiping a single tree in a vast forest. Even so, efficiency is just one concern among many, even for a single library (how expressive is the interface, how does it connect to other libraries, how extensible is the codebase, how open are its developers). I would argue that the choice itself is a false dichotomy; the two communities have different focuses, which lends the languages different strengths.
– Matthew Drury, Oct 25 '17 at 4:52
You have a huge forest that is good for 20% of the work, so I shouldn't make a choice that affects 80% of my work? Nothing is stopping me from using pandas for data prep and then modeling in R, Python, or Julia. I think my thinking is sound: if pandas is faster, then I should choose it as my main tool.
– xiaodai, Oct 25 '17 at 6:46
You might find the reticulate package in R of interest. Also, a lot of effort has increasingly been put into getting R to work with databases (see efforts such as dbplyr).
– slackline, Apr 25 '18 at 13:04
4 Answers
A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our blog (you can find part two here).
We found that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.
EDIT:
If you don't want to read the blog posts in detail, here is a short summary of our setup and our findings:
Setup
We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.
- Data retrieval with a select-like operation
- Data filtering with a conditional select operation
- Data sort operations
- Data aggregation operations
The computations were performed on a machine with an Intel i7 2.2GHz with 4 physical cores, 16GB RAM and an SSD. Software versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.
Results in a nutshell
- data.table seems to be faster when selecting columns (pandas on average takes 50% more time)
- pandas is faster at filtering rows (roughly 50% on average)
- data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)
- adding a new column appears faster with pandas
- aggregation results are completely mixed
Please note that I tried to simplify the results as much as possible so as not to bore you to death. For a more complete visualization, read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas on how to improve our study, please shoot us an e-mail. You can find our contacts on GitHub.
answered Apr 25 '18 at 12:41 by Tobias Krabel
edited Apr 26 '18 at 7:45
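For readers who want to try something similar on their own machine, here is a minimal sketch of how the scenarios listed above (plus adding a column) might be timed on the pandas side. It is not the code from the study: the simulated data set, the column names (group, x, y) and the helper time_call are all hypothetical, and a matching R script would be needed for the data.table side.

```python
import time

import numpy as np
import pandas as pd


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical simulated data set (not one of the 12 data sets from the study).
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "group": rng.integers(0, 100, size=n_rows),
    "x": rng.normal(size=n_rows),
    "y": rng.normal(size=n_rows),
})

# One callable per scenario; each returns a new object and leaves df unchanged.
scenarios = {
    "select columns": lambda: df[["x", "y"]],
    "filter rows": lambda: df[df["x"] > 0],
    "sort": lambda: df.sort_values("x"),
    "aggregate": lambda: df.groupby("group")["x"].mean(),
    "add column": lambda: df.assign(z=df["x"] + df["y"]),
}

timings = {name: time_call(func) for name, func in scenarios.items()}
for name, seconds in timings.items():
    print(f"{name:15s} {seconds:.4f} s")
```

Taking the best of a few repeats is only one possible choice; it reduces noise from other processes but hides variance, so a serious comparison would probably report more than a single number per scenario.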
A link to an external blog is not considered an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
– Stephen Rauch, Apr 25 '18 at 13:30
As you may have read in my answer, I already say that the results are mixed. Please clarify whether I should be more specific in my answer, potentially elaborating on some numbers.
– Tobias Krabel, Apr 25 '18 at 18:23
"Your access to this site has been limited." I can't seem to access the site on my phone or on my work computer.
– xiaodai, Apr 25 '18 at 22:18
I am sorry to read that. I have checked it myself on my phone and had no issues. Could it have something to do with the country you are trying to connect from?
– Tobias Krabel, Apr 26 '18 at 7:29
"4 physical cores" = 8 logical cores. Also, it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
– smci, Aug 2 '18 at 18:15
"Has anyone done any benchmarks?"
Yes, the benchmark you linked in your question has recently been updated for recent versions of data.table and pandas. Additionally, other software has been added. You can find the updated benchmark at https://h2oai.github.io/db-benchmark
Unfortunately, it is scheduled on a machine with 125GB of memory (not 244GB as the original one). As a result, pandas and dask are unable to attempt groupby on the 1e9 rows (50GB csv) data because they run out of memory when reading the data. So for pandas vs data.table you have to look at the 1e8 rows (5GB) data.
So as not to just link the content you are asking for, I am pasting recent timings for those solutions.
| in_rows|question | data.table| pandas|
|-------:|:---------------------|----------:|------:|
| 1e+07|sum v1 by id1 | 0.140| 0.414|
| 1e+07|sum v1 by id1:id2 | 0.411| 1.171|
| 1e+07|sum v1 mean v3 by id3 | 0.574| 1.327|
| 1e+07|mean v1:v3 by id4 | 0.252| 0.189|
| 1e+07|sum v1:v3 by id6 | 0.595| 0.893|
| 1e+08|sum v1 by id1 | 1.551| 4.091|
| 1e+08|sum v1 by id1:id2 | 4.200| 11.557|
| 1e+08|sum v1 mean v3 by id3 | 10.634| 24.590|
| 1e+08|mean v1:v3 by id4 | 2.683| 2.133|
| 1e+08|sum v1:v3 by id6 | 6.963| 16.451|
| 1e+09|sum v1 by id1 | 15.063| NA|
| 1e+09|sum v1 by id1:id2 | 44.240| NA|
| 1e+09|sum v1 mean v3 by id3 | 157.430| NA|
| 1e+09|mean v1:v3 by id4 | 26.855| NA|
| 1e+09|sum v1:v3 by id6 | 120.376| NA|
In 4 out of 5 questions data.table is faster, and we can see it scales better.
Just note that these timings are as of now, where id1, id2 and id3 are character fields. Those will be changed soon to categorical. Besides, there are other factors that are likely to impact those timings in the near future (like grouping in parallel). We are also going to add separate benchmarks for data having NAs, and for various cardinalities.
Other tasks are coming to this continuous benchmarking project, so if you are interested in join, sort, read and others, be sure to check it later.
And of course you are welcome to provide feedback in the project repo!
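To make the five grouping questions in the table concrete, here is a rough sketch of how they might be written in pandas. This is not the benchmark's own script: the tiny random data set and its cardinalities are made up for illustration, and only the column names (id1, id2, id3, id4, id6, v1, v2, v3) follow the question labels in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical small data set; the real benchmark uses 1e7 to 1e9 rows.
rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "id1": rng.choice(["a", "b", "c"], size=n),
    "id2": rng.choice(["x", "y"], size=n),
    "id3": rng.choice([f"k{i}" for i in range(50)], size=n),
    "id4": rng.integers(1, 11, size=n),
    "id6": rng.integers(1, 101, size=n),
    "v1": rng.integers(1, 6, size=n),
    "v2": rng.integers(1, 6, size=n),
    "v3": rng.random(size=n),
})

# sum v1 by id1
q1 = df.groupby("id1", as_index=False)["v1"].sum()
# sum v1 by id1:id2
q2 = df.groupby(["id1", "id2"], as_index=False)["v1"].sum()
# sum v1 mean v3 by id3 (named aggregation)
q3 = df.groupby("id3", as_index=False).agg(v1=("v1", "sum"), v3=("v3", "mean"))
# mean v1:v3 by id4
q4 = df.groupby("id4", as_index=False)[["v1", "v2", "v3"]].mean()
# sum v1:v3 by id6
q5 = df.groupby("id6", as_index=False)[["v1", "v2", "v3"]].sum()

print(q1)
```

Wrapping each of these in a timer at 1e7 or 1e8 rows would give numbers of the same kind as in the table, although results will differ with hardware and library versions.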
What about JuliaDB?
– skan, Dec 16 '18 at 0:09
@skan You can track the status of that in github.com/h2oai/db-benchmark/issues/63
– jangorecki, Dec 17 '18 at 5:17
I know this is an older post, but I figured it may be worth mentioning: using feather (in R and in Python) allows operating on data frames / data tables and sharing those results through feather.
See feather's GitHub page.
Nope. In fact, if the dataset is so large that pandas crashes, you are basically stuck with dask, which sucks, and you can't even do a simple groupby-sum. dplyr may not be fast, but it doesn't mess up.
I'm currently working on a little 2 GB dataset, and a simple print(df.groupby(['INCLEVEL1'])["r"].sum()) crashes dask.
I didn't experience this error with dplyr.
So, if pandas can handle the dataset, I use pandas; if not, I stick to R data.table.
And yes, you can convert a dask dataframe back to a pandas dataframe with a simple df.compute(). But it takes a fairly long time, so you might as well just wait patiently for pandas to load or data.table to read.
1
$begingroup$
A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
$endgroup$
– Stephen Rauch
Apr 25 '18 at 13:30
1
$begingroup$
As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
$endgroup$
– Tobias Krabel
Apr 25 '18 at 18:23
1
$begingroup$
"Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
$endgroup$
– xiaodai
Apr 25 '18 at 22:18
1
$begingroup$
I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?
$endgroup$
– Tobias Krabel
Apr 26 '18 at 7:29
1
$begingroup$
"4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
$endgroup$
– smci
Aug 2 '18 at 18:15
$begingroup$
Has anyone done any benchmarks?
Yes, the benchmark you linked in your question has recently been updated for recent versions of data.table and pandas, and other software has been added. You can find the updated benchmark at https://h2oai.github.io/db-benchmark
Unfortunately it now runs on a machine with 125GB of memory (not 244GB as in the original one). As a result, pandas and dask cannot attempt the groupby on the 1e9 rows (50GB csv) data, because they run out of memory when reading it. So for pandas vs data.table you have to look at the 1e8 rows (5GB) data.
So as not to just link to the content you are asking for, I am pasting recent timings (in seconds) for those two solutions.
| in_rows|question | data.table| pandas|
|-------:|:---------------------|----------:|------:|
| 1e+07|sum v1 by id1 | 0.140| 0.414|
| 1e+07|sum v1 by id1:id2 | 0.411| 1.171|
| 1e+07|sum v1 mean v3 by id3 | 0.574| 1.327|
| 1e+07|mean v1:v3 by id4 | 0.252| 0.189|
| 1e+07|sum v1:v3 by id6 | 0.595| 0.893|
| 1e+08|sum v1 by id1 | 1.551| 4.091|
| 1e+08|sum v1 by id1:id2 | 4.200| 11.557|
| 1e+08|sum v1 mean v3 by id3 | 10.634| 24.590|
| 1e+08|mean v1:v3 by id4 | 2.683| 2.133|
| 1e+08|sum v1:v3 by id6 | 6.963| 16.451|
| 1e+09|sum v1 by id1 | 15.063| NA|
| 1e+09|sum v1 by id1:id2 | 44.240| NA|
| 1e+09|sum v1 mean v3 by id3 | 157.430| NA|
| 1e+09|mean v1:v3 by id4 | 26.855| NA|
| 1e+09|sum v1:v3 by id6 | 120.376| NA|
In 4 out of 5 questions data.table is faster, and we can see that it scales better.
Note that these timings are as of now, where id1, id2 and id3 are character fields. Those will be changed to categorical soon. Besides that, other factors are likely to affect these timings in the near future (like grouping in parallel). We are also going to add separate benchmarks for data with NAs and for various cardinalities.
Other tasks are coming to this continuous benchmarking project, so if you are interested in join, sort, read and others, be sure to check back later.
And of course you are welcome to provide feedback in the project repo!
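For readers who want to try one of these questions locally, here is a minimal sketch, not the benchmark's actual script. It times `sum v1 by id1` in pandas on synthetic data, once with `id1` as strings and once as categorical. The helper names (`make_data`, `time_sum_v1_by_id1`), the row and group counts, and the data generator are assumptions made for illustration; absolute timings depend on hardware and library versions.

```python
import time

import numpy as np
import pandas as pd


def make_data(n_rows=1_000_000, n_groups=100, seed=0):
    """Build a synthetic frame with a string key id1 and an integer value v1.

    The layout is hypothetical; it only mimics the column names in the table.
    """
    rng = np.random.default_rng(seed)
    labels = np.array([f"id{i:03d}" for i in range(n_groups)])
    return pd.DataFrame({
        "id1": labels[rng.integers(0, n_groups, size=n_rows)],
        "v1": rng.integers(1, 6, size=n_rows),
    })


def time_sum_v1_by_id1(df):
    """Run 'sum v1 by id1' once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = df.groupby("id1", observed=True)["v1"].sum()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    df = make_data()
    _, t_str = time_sum_v1_by_id1(df)
    # Same data, but with the grouping key stored as a categorical.
    df_cat = df.assign(id1=df["id1"].astype("category"))
    _, t_cat = time_sum_v1_by_id1(df_cat)
    print(f"string key: {t_str:.3f}s, categorical key: {t_cat:.3f}s")
```

A single run like this is noisy; repeating the measurement and taking the minimum or median gives a steadier number.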
$endgroup$
1
$begingroup$
What about JuliaDB?
$endgroup$
– skan
Dec 16 '18 at 0:09
1
$begingroup$
@skan you can track status of that in github.com/h2oai/db-benchmark/issues/63
$endgroup$
– jangorecki
Dec 17 '18 at 5:17
edited Nov 1 '18 at 14:37
answered Oct 31 '18 at 21:53
jangorecki
$begingroup$
I know this is an older post, but it may be worth mentioning: feather (available in both R and Python) lets you operate on data frames / data tables in either language and share the results through feather files.
See feather's GitHub page.
$endgroup$
answered Mar 6 at 1:39
DonQuixote
$begingroup$
Nope. In fact, if the dataset is so large that pandas crashes, you are basically stuck with dask, which sucks: you can't even do a simple groupby-sum. dplyr may not be fast, but it doesn't mess up.
I'm currently working on a small 2GB dataset, and a simple print(df.groupby(['INCLEVEL1'])["r"].sum()) crashes dask.
I didn't experience this error with dplyr.
So if pandas can handle the dataset, I use pandas; if not, I stick to R's data.table.
And yes, you can convert a dask dataframe back to a pandas dataframe with a simple df.compute(), but it takes a fairly long time, so you might as well just wait patiently for pandas to load or data.table to read.
$endgroup$
edited 2 hours ago
answered 2 hours ago
Chenying Gao
5
$begingroup$
That's a really bad reason to switch to python.
$endgroup$
– Matthew Drury
Oct 25 '17 at 3:47
1
$begingroup$
@MatthewDrury how so? Data and the manipulation of it is 80% of my job. Only 20% is fitting models and presentation. Why shouldn't I choose the one that gives me the results the quickest?
$endgroup$
– xiaodai
Oct 25 '17 at 4:31
1
$begingroup$
Both Python and R are established languages with huge ecosystems and communities. To reduce the choice to a single library is worshiping a single tree in a vast forest. Even so, efficiency is just a single concern among many even for a single library (how expressive is the interface, how does it connect to other libraries, how extensible is the codebase, how open are its developers). I would argue that the choice itself is a false dichotomy; both communities have a different focus, which lends the languages different strengths.
$endgroup$
– Matthew Drury
Oct 25 '17 at 4:52
1
$begingroup$
You have a huge forest that is good for 20% of the work? So don't make a choice that affects 80% of your work? Nothing is stopping me from using pandas to do data prep and then modelling in R, Python or Julia. I think my thinking is sound: if pandas is faster, then I should choose it as my main tool.
$endgroup$
– xiaodai
Oct 25 '17 at 6:46
$begingroup$
You might find the reticulate package in R of interest/use. Also, increasingly a lot of effort has been put into getting R to work/play with databases (see efforts such as dbplyr).
$endgroup$
– slackline
Apr 25 '18 at 13:04