Is pandas now faster than data.table?

https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping

The data.table benchmarks haven't been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before but would consider switching if pandas can beat data.table.

edited Nov 1 '18 at 15:11 by oW_

asked Oct 25 '17 at 2:43 by xiaodai

That's a really bad reason to switch to Python.
– Matthew Drury
Oct 25 '17 at 3:47

@MatthewDrury How so? Data and the manipulation of it is 80% of my job. Only 20% is fitting models and presentation. Why shouldn't I choose the one that gives me the results the quickest?
– xiaodai
Oct 25 '17 at 4:31

Both Python and R are established languages with huge ecosystems and communities. To reduce the choice to a single library is worshiping a single tree in a vast forest. Even so, efficiency is just a single concern among many even for a single library (how expressive is the interface, how does it connect to other libraries, how extensible is the codebase, how open are its developers). I would argue that the choice itself is a false dichotomy; both communities have a different focus, which lends the languages different strengths.
– Matthew Drury
Oct 25 '17 at 4:52

You have a huge forest that is good for 20% of the work? So don't make a choice that affects 80% of your work? Nothing is stopping me from using pandas to do data prep and then modelling in R, Python or Julia. I think my thinking is sound. If pandas is faster, then I should choose it as my main tool.
– xiaodai
Oct 25 '17 at 6:46

You might find the reticulate package in R of interest/use. Also, increasingly a lot of effort has been put into getting R to work/play with databases (see efforts such as dbplyr).
– slackline
Apr 25 '18 at 13:04

python r pandas data data.table

4 Answers

A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which is split into two parts) on our blog.

We found that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.

EDIT:

If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:

Setup

We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.

Data retrieval with a select-like operation

Data filtering with a conditional select operation

Data sort operations

Data aggregation operations

The computations were performed on a machine with a 2.2GHz Intel i7 with 4 physical cores, 16GB RAM and an SSD. Software versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.
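To make the four scenarios concrete, here is a minimal pandas sketch of what each operation might look like. It is not the code from the study: the frame, the column names (group, x, y) and the sizes are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic frame; column names and sizes are illustrative assumptions,
# not the data sets used in the study.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=1_000),
    "x": rng.normal(size=1_000),
    "y": rng.integers(0, 100, size=1_000),
})

selected = df[["group", "x"]]                 # select-like retrieval of columns
filtered = df[df["y"] > 50]                   # conditional select (row filter)
ordered = df.sort_values("x")                 # sort by one column
aggregated = df.groupby("group")["x"].mean()  # aggregation per group
```

Wrapping each line in a timer such as time.perf_counter() and repeating it several times would give a rough local comparison, though results will depend on data size, dtypes and library versions.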

Results in a nutshell

data.table seems to be faster when selecting columns (pandas on average takes 50% more time)

pandas is faster at filtering rows (roughly 50% on average)

data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

adding a new column appears faster with pandas

aggregation results are completely mixed

Please note that I tried to simplify the results as much as possible to not bore you to death. For a more complete visualization read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas how to improve our study, please shoot us an e-mail. You can find our contacts on GitHub.

edited Apr 26 '18 at 7:45

answered Apr 25 '18 at 12:41 by Tobias Krabel

A link to an external blog is not considered an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
– Stephen Rauch
Apr 25 '18 at 13:30

As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
– Tobias Krabel
Apr 25 '18 at 18:23

"Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
– xiaodai
Apr 25 '18 at 22:18

I am sorry to read that. I have checked it myself on my phone and had no issues. Could it have something to do with the country you are trying to connect from?
– Tobias Krabel
Apr 26 '18 at 7:29

"4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
– smci
Aug 2 '18 at 18:15

Has anyone done any benchmarks?

Yes, the benchmark you linked in your question has recently been updated for recent versions of data.table and pandas. Additionally, other software has been added. You can find the updated benchmark at https://h2oai.github.io/db-benchmark

Unfortunately it is scheduled on a machine with 125GB of memory (not 244GB as in the original one). As a result, pandas and dask are unable to attempt the groupby on the 1e9 rows (50GB csv) data because they run out of memory when reading it. So for pandas vs data.table you have to look at the 1e8 rows (5GB) data.

So as not to just link to the content you are asking for, I am pasting recent timings for those solutions.

| in_rows|question | data.table| pandas|
|-------:|:---------------------|----------:|------:|
| 1e+07|sum v1 by id1 | 0.140| 0.414|
| 1e+07|sum v1 by id1:id2 | 0.411| 1.171|
| 1e+07|sum v1 mean v3 by id3 | 0.574| 1.327|
| 1e+07|mean v1:v3 by id4 | 0.252| 0.189|
| 1e+07|sum v1:v3 by id6 | 0.595| 0.893|
| 1e+08|sum v1 by id1 | 1.551| 4.091|
| 1e+08|sum v1 by id1:id2 | 4.200| 11.557|
| 1e+08|sum v1 mean v3 by id3 | 10.634| 24.590|
| 1e+08|mean v1:v3 by id4 | 2.683| 2.133|
| 1e+08|sum v1:v3 by id6 | 6.963| 16.451|
| 1e+09|sum v1 by id1 | 15.063| NA|
| 1e+09|sum v1 by id1:id2 | 44.240| NA|
| 1e+09|sum v1 mean v3 by id3 | 157.430| NA|
| 1e+09|mean v1:v3 by id4 | 26.855| NA|
| 1e+09|sum v1:v3 by id6 | 120.376| NA|

In 4 out of 5 questions data.table is faster, and we can see it scales better.
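As a quick check of that summary, the snippet below computes the pandas/data.table timing ratio for the 1e8-row runs. The numbers are copied from the table above; the variable names are only illustrative.

```python
# (data.table, pandas) timings for the 1e8-row runs, copied from the table above.
timings = {
    "sum v1 by id1": (1.551, 4.091),
    "sum v1 by id1:id2": (4.200, 11.557),
    "sum v1 mean v3 by id3": (10.634, 24.590),
    "mean v1:v3 by id4": (2.683, 2.133),
    "sum v1:v3 by id6": (6.963, 16.451),
}

# A ratio above 1 means pandas took longer than data.table on that question.
ratios = {q: pandas_t / dt_t for q, (dt_t, pandas_t) in timings.items()}
faster_dt = [q for q, r in ratios.items() if r > 1]

for q, r in ratios.items():
    print(f"{q:<22} pandas/data.table = {r:.2f}")
print(len(faster_dt), "of", len(ratios), "questions are faster in data.table")
```

The one exception is "mean v1:v3 by id4", where pandas is ahead at both 1e7 and 1e8 rows.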

Just note that these timings are as of now, where id1, id2 and id3 are character fields. Those will be changed to categorical soon. Besides, there are other factors that are likely to impact those timings in the near future (like grouping in parallel). We are also going to add separate benchmarks for data having NAs, and for various cardinalities.
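For readers new to pandas, here is a rough sketch of what two of the grouping questions look like, and of the character-versus-categorical distinction mentioned above. It is not the benchmark's own script: the data is a small synthetic stand-in, and reading "id1:id2" as "group by id1 and id2" is an assumption based on the question labels.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the benchmark data; sizes and values are assumptions.
rng = np.random.default_rng(1)
n = 10_000
keys = [f"id{i:03d}" for i in range(100)]
df = pd.DataFrame({
    "id1": rng.choice(keys, size=n),
    "id2": rng.choice(keys, size=n),
    "v1": rng.integers(1, 6, size=n),
    "v3": rng.random(size=n),
})

# "sum v1 by id1" with the id column stored as strings (character field).
by_str = df.groupby("id1")["v1"].sum()

# The same question after converting the id columns to categorical.
df_cat = df.astype({"id1": "category", "id2": "category"})
by_cat = df_cat.groupby("id1", observed=True)["v1"].sum()

# "sum v1 by id1:id2", read here as grouping by both columns.
by_two = df_cat.groupby(["id1", "id2"], observed=True)["v1"].sum()

# Categorical keys store small integer codes instead of repeated strings.
mem_str = df["id1"].memory_usage(deep=True)
mem_cat = df_cat["id1"].memory_usage(deep=True)
print(f"id1 as strings: {mem_str} bytes, as category: {mem_cat} bytes")
```

The results are the same either way; only the storage of the key column changes, which is why a switch to categorical can shift the timings.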

Other tasks are coming to this continuous benchmarking project, so if you are interested in join, sort, read and others, be sure to check back later.

And of course you are welcome to provide feedback in project repo!

edited Nov 1 '18 at 14:37

answered Oct 31 '18 at 21:53 by jangorecki

What about JuliaDB?
– skan
Dec 16 '18 at 0:09

@skan You can track the status of that in github.com/h2oai/db-benchmark/issues/63
– jangorecki
Dec 17 '18 at 5:17

I know this is an older post, but figured it may be worth mentioning - using feather (in R and in Python) allows operating on data frames / data tables and sharing those results through feather.

See feather's github page

answered Mar 6 at 1:39 by DonQuixote

Nope. In fact, if the dataset is so large that pandas crashes, you are basically stuck with dask, which sucks, and you can't even do a simple groupby-sum. dplyr may not be fast, but it doesn't mess up.

I'm currently working on a little 2G dataset and a simple print(df.groupby(['INCLEVEL1'])["r"].sum()) crashes dask.

I didn't experience this error with dplyr.

So, if pandas can handle the dataset, I use pandas; if not, I stick to R's data.table.

And yes, you can convert a dask dataframe back to a pandas dataframe with a simple df.compute(). But it takes a fairly long time, so you might as well just wait patiently for pandas to load or data.table to read.
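One alternative, which is not what this answer does, is to stay in plain pandas and read the file in chunks, summing each chunk and then combining the partial sums. The sketch below assumes the aggregation is a sum (so partial results can simply be added) and uses a tiny in-memory CSV as a stand-in for the real file; only the column names INCLEVEL1 and r come from the answer.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a large file on disk (made-up values).
csv_text = "INCLEVEL1,r\nlow,1.0\nhigh,2.0\nlow,3.0\nmid,4.0\nhigh,5.0\n"

# Read a few rows at a time and keep only the per-chunk group sums.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partials.append(chunk.groupby("INCLEVEL1")["r"].sum())

# Combine the partial sums: the same key may appear in several chunks.
total = pd.concat(partials).groupby(level=0).sum()
print(total)
```

With a real file, the path would replace io.StringIO(csv_text) and chunksize would be much larger. This only works directly for aggregations that can be combined across chunks, such as sums and counts.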

edited 2 hours ago

answered 2 hours ago by Chenying Gao


4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

EDIT:

If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:

Setup

We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.

Data retrieval with a select-like operation

Data filtering with a conditional select operation

Data sort operations

Data aggregation operations

Results in a nutshell

data.tableseems to be faster when selecting columns (pandason average takes 50% more time)

pandas is faster at filtering rows (roughly 50% on average)

data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

adding a new column appears faster with pandas

aggregating results are completely mixed

edited Apr 26 '18 at 7:45

answered Apr 25 '18 at 12:41

Tobias Krabel

19113

1

$begingroup$
A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
$endgroup$
– Stephen Rauch
Apr 25 '18 at 13:30

1

$begingroup$
As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
$endgroup$
– Tobias Krabel
Apr 25 '18 at 18:23

1

$begingroup$
"Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
$endgroup$
– xiaodai
Apr 25 '18 at 22:18

1

$begingroup$
I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?
$endgroup$
– Tobias Krabel
Apr 26 '18 at 7:29

1

$begingroup$
"4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
$endgroup$
– smci
Aug 2 '18 at 18:15

|
show 1 more comment

EDIT:

If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:

Setup

We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.

Data retrieval with a select-like operation

Data filtering with a conditional select operation

Data sort operations

Data aggregation operations

Results in a nutshell

data.tableseems to be faster when selecting columns (pandason average takes 50% more time)

pandas is faster at filtering rows (roughly 50% on average)

data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

adding a new column appears faster with pandas

aggregating results are completely mixed

edited Apr 26 '18 at 7:45

answered Apr 25 '18 at 12:41

Tobias Krabel

19113

1

$begingroup$
A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
$endgroup$
– Stephen Rauch
Apr 25 '18 at 13:30

1

$begingroup$
As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
$endgroup$
– Tobias Krabel
Apr 25 '18 at 18:23

1

$begingroup$
"Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
$endgroup$
– xiaodai
Apr 25 '18 at 22:18

1

$begingroup$
I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?
$endgroup$
– Tobias Krabel
Apr 26 '18 at 7:29

1

$begingroup$
"4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
$endgroup$
– smci
Aug 2 '18 at 18:15

|
show 1 more comment

EDIT:

If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:

Setup

We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.

Data retrieval with a select-like operation

Data filtering with a conditional select operation

Data sort operations

Data aggregation operations

Results in a nutshell

data.tableseems to be faster when selecting columns (pandason average takes 50% more time)

pandas is faster at filtering rows (roughly 50% on average)

data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

adding a new column appears faster with pandas

aggregating results are completely mixed

edited Apr 26 '18 at 7:45

answered Apr 25 '18 at 12:41

Tobias Krabel

19113

EDIT:

If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:

Setup

We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.

Data retrieval with a select-like operation

Data filtering with a conditional select operation

Data sort operations

Data aggregation operations

Results in a nutshell

data.tableseems to be faster when selecting columns (pandason average takes 50% more time)

pandas is faster at filtering rows (roughly 50% on average)

data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

adding a new column appears faster with pandas

aggregating results are completely mixed

edited Apr 26 '18 at 7:45

answered Apr 25 '18 at 12:41

Tobias Krabel

19113

edited Apr 26 '18 at 7:45

answered Apr 25 '18 at 12:41

Tobias Krabel

19113

answered Apr 25 '18 at 12:41

Tobias Krabel

19113

answered Apr 25 '18 at 12:41

Tobias Krabel

19113

1

$begingroup$
A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
$endgroup$
– Stephen Rauch
Apr 25 '18 at 13:30

1

$begingroup$
As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
$endgroup$
– Tobias Krabel
Apr 25 '18 at 18:23

1

$begingroup$
"Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
$endgroup$
– xiaodai
Apr 25 '18 at 22:18

1

$begingroup$
I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?
$endgroup$
– Tobias Krabel
Apr 26 '18 at 7:29

1

$begingroup$
"4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
$endgroup$
– smci
Aug 2 '18 at 18:15

|
show 1 more comment

1

$begingroup$
A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
$endgroup$
– Stephen Rauch
Apr 25 '18 at 13:30

1

$begingroup$
As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
$endgroup$
– Tobias Krabel
Apr 25 '18 at 18:23

1

$begingroup$
"Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
$endgroup$
– xiaodai
Apr 25 '18 at 22:18

1

$begingroup$
I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?
$endgroup$
– Tobias Krabel
Apr 26 '18 at 7:29

1

$begingroup$
"4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
$endgroup$
– smci
Aug 2 '18 at 18:15

A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.

– Stephen Rauch
Apr 25 '18 at 13:30

As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.

– Tobias Krabel
Apr 25 '18 at 18:23

"Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.

– xiaodai
Apr 25 '18 at 22:18

I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?

– Tobias Krabel
Apr 26 '18 at 7:29

"4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?

– smci
Aug 2 '18 at 18:15

|
show 1 more comment

Has anyone done any benchmarks?

To not just link the content you are asking for I am pasting recent timings for those solutions.

| in_rows|question | data.table| pandas|
|-------:|:---------------------|----------:|------:|
| 1e+07|sum v1 by id1 | 0.140| 0.414|
| 1e+07|sum v1 by id1:id2 | 0.411| 1.171|
| 1e+07|sum v1 mean v3 by id3 | 0.574| 1.327|
| 1e+07|mean v1:v3 by id4 | 0.252| 0.189|
| 1e+07|sum v1:v3 by id6 | 0.595| 0.893|
| 1e+08|sum v1 by id1 | 1.551| 4.091|
| 1e+08|sum v1 by id1:id2 | 4.200| 11.557|
| 1e+08|sum v1 mean v3 by id3 | 10.634| 24.590|
| 1e+08|mean v1:v3 by id4 | 2.683| 2.133|
| 1e+08|sum v1:v3 by id6 | 6.963| 16.451|
| 1e+09|sum v1 by id1 | 15.063| NA|
| 1e+09|sum v1 by id1:id2 | 44.240| NA|
| 1e+09|sum v1 mean v3 by id3 | 157.430| NA|
| 1e+09|mean v1:v3 by id4 | 26.855| NA|
| 1e+09|sum v1:v3 by id6 | 120.376| NA|

edited Nov 1 '18 at 14:37

answered Oct 31 '18 at 21:53

jangorecki

15113

1

$begingroup$
What about JuliaDB?
$endgroup$
– skan
Dec 16 '18 at 0:09

1

$begingroup$
@skan you can track status of that in github.com/h2oai/db-benchmark/issues/63
$endgroup$
– jangorecki
Dec 17 '18 at 5:17

add a comment |

Has anyone done any benchmarks?

To not just link the content you are asking for I am pasting recent timings for those solutions.

| in_rows|question | data.table| pandas|
|-------:|:---------------------|----------:|------:|
| 1e+07|sum v1 by id1 | 0.140| 0.414|
| 1e+07|sum v1 by id1:id2 | 0.411| 1.171|
| 1e+07|sum v1 mean v3 by id3 | 0.574| 1.327|
| 1e+07|mean v1:v3 by id4 | 0.252| 0.189|
| 1e+07|sum v1:v3 by id6 | 0.595| 0.893|
| 1e+08|sum v1 by id1 | 1.551| 4.091|
| 1e+08|sum v1 by id1:id2 | 4.200| 11.557|
| 1e+08|sum v1 mean v3 by id3 | 10.634| 24.590|
| 1e+08|mean v1:v3 by id4 | 2.683| 2.133|
| 1e+08|sum v1:v3 by id6 | 6.963| 16.451|
| 1e+09|sum v1 by id1 | 15.063| NA|
| 1e+09|sum v1 by id1:id2 | 44.240| NA|
| 1e+09|sum v1 mean v3 by id3 | 157.430| NA|
| 1e+09|mean v1:v3 by id4 | 26.855| NA|
| 1e+09|sum v1:v3 by id6 | 120.376| NA|

edited Nov 1 '18 at 14:37

answered Oct 31 '18 at 21:53

jangorecki

15113

1

$begingroup$
What about JuliaDB?
$endgroup$
– skan
Dec 16 '18 at 0:09

1

$begingroup$
@skan you can track status of that in github.com/h2oai/db-benchmark/issues/63
$endgroup$
– jangorecki
Dec 17 '18 at 5:17

add a comment |

I know this is an older post, but it may be worth mentioning: Feather (available in both R and Python) lets you work on data frames or data tables in either language and share the results between them as Feather files.

See Feather's GitHub page.

answered Mar 6 at 1:39 by DonQuixote

I'm currently working on a small 2 GB dataset, and a simple print(df.groupby(['INCLEVEL1'])["r"].sum()) crashes Dask.

I didn't experience this error with dplyr.

So, if pandas can handle the dataset, I use pandas; if not, I stick to R's data.table.

And yes, you can convert a Dask dataframe back to a pandas dataframe with a simple df.compute().
But that takes a fairly long time, so you might as well just wait patiently for pandas to load or datatable to read.

answered 2 hours ago, edited 2 hours ago, by Chenying Gao (new contributor)

