Is pandas now faster than data.table?


https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping



The data.table benchmarks haven't been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before, but I would consider switching if pandas can beat data.table.










  • That's a really bad reason to switch to Python. – Matthew Drury, Oct 25 '17 at 3:47

  • @MatthewDrury How so? Data and the manipulation of it is 80% of my job; only 20% is fitting models and presentation. Why shouldn't I choose the tool that gives me results the quickest? – xiaodai, Oct 25 '17 at 4:31

  • Both Python and R are established languages with huge ecosystems and communities. To reduce the choice to a single library is worshiping a single tree in a vast forest. Even so, efficiency is just one concern among many, even for a single library (how expressive is the interface, how does it connect to other libraries, how extensible is the codebase, how open are its developers). I would argue that the choice itself is a false dichotomy; the two communities have different focuses, which lends the languages different strengths. – Matthew Drury, Oct 25 '17 at 4:52

  • You have a huge forest that is good for 20% of the work, so I shouldn't make a choice that affects 80% of my work? Nothing stops me from using pandas to do data prep and then modeling in R, Python, or Julia. I think my reasoning is sound: if pandas is faster, then I should choose it as my main tool. – xiaodai, Oct 25 '17 at 6:46

  • You might find the reticulate package in R of interest/use. Also, increasingly a lot of effort has been put into getting R to work/play with databases (see efforts such as dbplyr). – slackline, Apr 25 '18 at 13:04















python r pandas data data.table






edited Nov 1 '18 at 15:11 by oW_

asked Oct 25 '17 at 2:43 by xiaodai





4 Answers












$begingroup$

A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our blog (you can find part two here).



We found that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.



EDIT:

If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:



Setup



We compared pandas and data.table on 12 different simulated data sets for the following operations (so far), which we call scenarios.



  • Data retrieval with a select-like operation

  • Data filtering with a conditional select operation

  • Data sort operations

  • Data aggregation operations

The computations were performed on a machine with an Intel i7 2.2GHz CPU with 4 physical cores, 16GB RAM, and an SSD hard drive. Software versions were OS X 10.13.3, Python 3.6.4, and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.



Results in a nutshell




  • data.table seems to be faster when selecting columns (pandas on average takes 50% more time)


  • pandas is faster at filtering rows (roughly 50% on average)


  • data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

  • adding a new column appears faster with pandas

  • results for aggregation are completely mixed

Please note that I tried to simplify the results as much as possible so as not to bore you to death. For a more complete visualization, read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas on how to improve our study, please shoot us an e-mail. You can find our contacts on GitHub.
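For a feel of what the four scenario types look like, here is a minimal sketch of timing them in pandas on a small simulated data set. This is not the authors' actual benchmark code; the column names and sizes are made up for illustration.

```python
# Minimal timing sketch (not the study's benchmark code) of the four
# scenario types: select, filter, sort, and aggregate.
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "id": rng.integers(0, 100, size=n),
    "v1": rng.random(n),
    "v2": rng.random(n),
})

scenarios = {
    "select": lambda: df[["v1"]],                      # data retrieval
    "filter": lambda: df[df["v1"] > 0.5],              # conditional select
    "sort": lambda: df.sort_values("v1"),              # sorting
    "aggregate": lambda: df.groupby("id")["v1"].mean(),  # aggregation
}

for name, op in scenarios.items():
    t = timeit.timeit(op, number=10) / 10
    print(f"{name:>9}: {t * 1e3:.2f} ms per run")
```

The equivalent data.table calls (column selection with `DT[, .(v1)]`, keyed sorts, `DT[, mean(v1), by = id]`) would be timed the same way on the R side.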







  • A link to an external blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer. – Stephen Rauch, Apr 25 '18 at 13:30

  • As you may have read from my answer, I already say that the results are mixed. Please clarify if I should be more specific in my answer, potentially elaborating on some numbers. – Tobias Krabel, Apr 25 '18 at 18:23

  • "Your access to this site has been limited." I can't seem to access the site on my phone or on my work computer. – xiaodai, Apr 25 '18 at 22:18

  • I am sorry to read that. I have checked it myself on my phone and had no issues. Could it have something to do with the country you are connecting from? – Tobias Krabel, Apr 26 '18 at 7:29

  • "4 physical cores" = 8 logical cores. It also helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size, and for the SSD, what read and write speeds. – smci, Aug 2 '18 at 18:15





















Has anyone done any benchmarks?




Yes, the benchmark you linked in your question has recently been updated for recent versions of data.table and pandas, and other software has been added. You can find the updated benchmark at https://h2oai.github.io/db-benchmark

Unfortunately, it is scheduled on a 125GB-memory machine (not 244GB as the original one). As a result, pandas and dask cannot attempt a groupby on the 1e9-row (50GB CSV) data because they run out of memory when reading it. So for pandas vs data.table you have to look at the 1e8-row (5GB) data.



So as not to just link the content you are asking for, I am pasting recent timings for those solutions.



| in_rows|question | data.table| pandas|
|-------:|:---------------------|----------:|------:|
| 1e+07|sum v1 by id1 | 0.140| 0.414|
| 1e+07|sum v1 by id1:id2 | 0.411| 1.171|
| 1e+07|sum v1 mean v3 by id3 | 0.574| 1.327|
| 1e+07|mean v1:v3 by id4 | 0.252| 0.189|
| 1e+07|sum v1:v3 by id6 | 0.595| 0.893|
| 1e+08|sum v1 by id1 | 1.551| 4.091|
| 1e+08|sum v1 by id1:id2 | 4.200| 11.557|
| 1e+08|sum v1 mean v3 by id3 | 10.634| 24.590|
| 1e+08|mean v1:v3 by id4 | 2.683| 2.133|
| 1e+08|sum v1:v3 by id6 | 6.963| 16.451|
| 1e+09|sum v1 by id1 | 15.063| NA|
| 1e+09|sum v1 by id1:id2 | 44.240| NA|
| 1e+09|sum v1 mean v3 by id3 | 157.430| NA|
| 1e+09|mean v1:v3 by id4 | 26.855| NA|
| 1e+09|sum v1:v3 by id6 | 120.376| NA|


In 4 out of 5 questions data.table is faster, and we can see it scales better.

Just note that these timings are as of now, where id1, id2 and id3 are character fields; those will soon be changed to categorical. Besides that, there are other factors likely to impact these timings in the near future (such as grouping in parallel). We are also going to add separate benchmarks for data with NAs and for various cardinalities.



Other tasks are coming to this continuous benchmarking project, so if you are interested in join, sort, read, and others, be sure to check it later.

And of course you are welcome to provide feedback in project repo!
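For readers unfamiliar with the benchmark's question labels, here is a rough sketch of what queries like "sum v1 by id1" and "sum v1 mean v3 by id3" look like on the pandas side. The column names follow the benchmark's naming convention; the tiny data frame here is made up for illustration.

```python
# Hedged sketch of two of the benchmark's grouping questions in pandas.
import pandas as pd

df = pd.DataFrame({
    "id1": ["a", "a", "b", "b"],
    "id3": ["x", "y", "x", "y"],
    "v1": [1, 2, 3, 4],
    "v3": [0.5, 1.5, 2.5, 3.5],
})

# "sum v1 by id1"
q1 = df.groupby("id1", as_index=False)["v1"].sum()

# "sum v1 mean v3 by id3" (named aggregation, pandas >= 0.25)
q3 = df.groupby("id3", as_index=False).agg(v1=("v1", "sum"), v3=("v3", "mean"))

print(q1)
print(q3)
```

The data.table equivalents would be `DT[, sum(v1), by = id1]` and `DT[, .(v1 = sum(v1), v3 = mean(v3)), by = id3]`.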








  • What about JuliaDB? – skan, Dec 16 '18 at 0:09

  • @skan You can track the status of that in github.com/h2oai/db-benchmark/issues/63 – jangorecki, Dec 17 '18 at 5:17



















I know this is an older post, but I figured it may be worth mentioning: using feather (in R and in Python) allows operating on data frames / data tables in either language and sharing those results through feather files.



See feather's GitHub page.







Nope. In fact, if the dataset is so large that pandas crashes, you are basically stuck with dask, which sucks: you can't even do a simple groupby-sum. dplyr may not be fast, but it doesn't mess up.



I'm currently working on a little 2GB dataset, and a simple print(df.groupby(['INCLEVEL1'])["r"].sum()) crashes dask.



I didn't experience this error with dplyr.



So, if pandas can handle the dataset I use pandas; if not, I stick to R's data.table.



And yes, you can convert a dask dataframe back to a pandas dataframe with a simple df.compute(), but it takes a fairly long time, so you might as well just wait patiently for pandas to load or for data.table to read the data.






      Your Answer





      StackExchange.ifUsing("editor", function ()
      return StackExchange.using("mathjaxEditing", function ()
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
      );
      );
      , "mathjax-editing");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "557"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f24052%2fis-pandas-now-faster-than-data-table%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      4 Answers
      4






      active

      oldest

      votes








      4 Answers
      4






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      9












      $begingroup$

      A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our Blog (You can find part two here).



      We figured that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.



      EDIT:

      If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:



      Setup



      We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.



      • Data retrieval with a select-like operation

      • Data filtering with a conditional select operation

      • Data sort operations

      • Data aggregation operations

      The computations were performed on a machine with an Intel i7 2.2GHz with 4 physical cores, 16GB RAM and a SSD hard drive. Software Versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table



      Results in a nutshell




      • data.tableseems to be faster when selecting columns (pandason average takes 50% more time)


      • pandas is faster at filtering rows (roughly 50% on average)


      • data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

      • adding a new column appears faster with pandas

      • aggregating results are completely mixed

      Please note that I tried to simplify the results as much as possible to not bore you to death. For a more complete visualization read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas how to improve our study, please shoot us an e-mail. You can find our contacts on GitHub.






      share|improve this answer











      $endgroup$








      • 1




        $begingroup$
        A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
        $endgroup$
        – Stephen Rauch
        Apr 25 '18 at 13:30






      • 1




        $begingroup$
        As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
        $endgroup$
        – Tobias Krabel
        Apr 25 '18 at 18:23






      • 1




        $begingroup$
        "Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
        $endgroup$
        – xiaodai
        Apr 25 '18 at 22:18






      • 1




        $begingroup$
        I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?
        $endgroup$
        – Tobias Krabel
        Apr 26 '18 at 7:29






      • 1




        $begingroup$
        "4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
        $endgroup$
        – smci
        Aug 2 '18 at 18:15
















      9












      $begingroup$

      A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our Blog (You can find part two here).



      We figured that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.



      EDIT:

      If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:



      Setup



      We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.



      • Data retrieval with a select-like operation

      • Data filtering with a conditional select operation

      • Data sort operations

      • Data aggregation operations

      The computations were performed on a machine with an Intel i7 2.2GHz with 4 physical cores, 16GB RAM and a SSD hard drive. Software Versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table



      Results in a nutshell




      • data.tableseems to be faster when selecting columns (pandason average takes 50% more time)


      • pandas is faster at filtering rows (roughly 50% on average)


      • data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

      • adding a new column appears faster with pandas

      • aggregating results are completely mixed

      Please note that I tried to simplify the results as much as possible to not bore you to death. For a more complete visualization read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas how to improve our study, please shoot us an e-mail. You can find our contacts on GitHub.






      share|improve this answer











      $endgroup$








      • 1




        $begingroup$
        A link to an external Blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
        $endgroup$
        – Stephen Rauch
        Apr 25 '18 at 13:30






      • 1




        $begingroup$
        As you may have read from my answer, I already say that the results are mixed. Please clarify if I shall be more specific in my answer, potentially elaborating on some numbers.
        $endgroup$
        – Tobias Krabel
        Apr 25 '18 at 18:23






      • 1




        $begingroup$
        "Your access to this site has been limited." I can't seem to access the site on my phone nor on my work computer.
        $endgroup$
        – xiaodai
        Apr 25 '18 at 22:18






      • 1




        $begingroup$
        I am sorry to read that. I have checked it myself on my phone and had no issues. Could have something to do with the country you try to connect from?
        $endgroup$
        – Tobias Krabel
        Apr 26 '18 at 7:29






      • 1




        $begingroup$
        "4 physical cores" = 8 logical cores. Also it helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
        $endgroup$
        – smci
        Aug 2 '18 at 18:15














      9












      9








      9





      $begingroup$

      A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our Blog (You can find part two here).



      We figured that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.



      EDIT:

      If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:



      Setup



      We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.



      • Data retrieval with a select-like operation

      • Data filtering with a conditional select operation

      • Data sort operations

      • Data aggregation operations

      The computations were performed on a machine with an Intel i7 2.2GHz with 4 physical cores, 16GB RAM and a SSD hard drive. Software Versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table



      Results in a nutshell




      • data.table seems to be faster when selecting columns (pandas on average takes 50% more time)


      • pandas is faster at filtering rows (roughly 50% on average)


      • data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)

      • adding a new column appears faster with pandas

      • aggregating results are completely mixed

      Please note that I tried to simplify the results as much as possible so as not to bore you to death. For a more complete visualization, read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas on how to improve our study, please shoot us an e-mail. You can find our contact details on GitHub.
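      The four scenario types above map onto short pandas one-liners. Here is a minimal sketch with made-up column names (this is not the study's actual benchmark code):

```python
import pandas as pd

# Tiny stand-in data set; the study used 12 larger simulated data sets.
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "grp": ["a", "a", "b", "b"],
                   "val": [10.0, 20.0, 30.0, 40.0]})

selected = df[["id", "val"]]                             # select-like retrieval
filtered = df[df["val"] > 15]                            # conditional select
ordered = df.sort_values("val", ascending=False)         # sort
summed = df.groupby("grp", as_index=False)["val"].sum()  # aggregation
```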






      edited Apr 26 '18 at 7:45

























      answered Apr 25 '18 at 12:41









      Tobias Krabel







      • A link to an external blog is considered to be not an answer and is subject to deletion. Please consider summarizing the main points from the external link here to make this a viable answer.
        – Stephen Rauch, Apr 25 '18 at 13:30

      • As you may have read from my answer, I already say that the results are mixed. Please clarify whether I should be more specific in my answer, potentially elaborating on some numbers.
        – Tobias Krabel, Apr 25 '18 at 18:23

      • "Your access to this site has been limited." I can't seem to access the site on my phone or on my work computer.
        – xiaodai, Apr 25 '18 at 22:18

      • I am sorry to read that. I have checked it myself on my phone and had no issues. Could it have something to do with the country you are connecting from?
        – Tobias Krabel, Apr 26 '18 at 7:29

      • "4 physical cores" = 8 logical cores. It also helps to say which specific Intel i7 2.2GHz (which generation? which variant? -HQ?) and what cache size. And for the SSD, what read and write speeds?
        – smci, Aug 2 '18 at 18:15



























      Has anyone done any benchmarks?




      Yes, the benchmark you linked in your question has recently been updated for current versions of data.table and pandas, and other software has been added. You can find the updated benchmark at https://h2oai.github.io/db-benchmark

      Unfortunately it is now scheduled on a 125GB-memory machine (not 244GB as the original one). As a result, pandas and dask cannot even attempt the groupby on the 1e9-row (50GB csv) data because they run out of memory while reading it. So for pandas vs data.table you have to look at the 1e8-row (5GB) data.



      So as not to just link to the content you are asking for, I am pasting recent timings (in seconds) for those solutions.



      | in_rows | question              | data.table |  pandas |
      |--------:|:----------------------|-----------:|--------:|
      |   1e+07 | sum v1 by id1         |      0.140 |   0.414 |
      |   1e+07 | sum v1 by id1:id2     |      0.411 |   1.171 |
      |   1e+07 | sum v1 mean v3 by id3 |      0.574 |   1.327 |
      |   1e+07 | mean v1:v3 by id4     |      0.252 |   0.189 |
      |   1e+07 | sum v1:v3 by id6      |      0.595 |   0.893 |
      |   1e+08 | sum v1 by id1         |      1.551 |   4.091 |
      |   1e+08 | sum v1 by id1:id2     |      4.200 |  11.557 |
      |   1e+08 | sum v1 mean v3 by id3 |     10.634 |  24.590 |
      |   1e+08 | mean v1:v3 by id4     |      2.683 |   2.133 |
      |   1e+08 | sum v1:v3 by id6      |      6.963 |  16.451 |
      |   1e+09 | sum v1 by id1         |     15.063 |      NA |
      |   1e+09 | sum v1 by id1:id2     |     44.240 |      NA |
      |   1e+09 | sum v1 mean v3 by id3 |    157.430 |      NA |
      |   1e+09 | mean v1:v3 by id4     |     26.855 |      NA |
      |   1e+09 | sum v1:v3 by id6      |    120.376 |      NA |


      In 4 out of 5 questions data.table is faster, and we can see it scales better.
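      For orientation, each "question" in the table is a grouped aggregation. For example, "sum v1 by id1" corresponds to something like the following sketch with made-up data (not the benchmark harness itself; the data.table form is shown as a comment):

```python
import pandas as pd

# Made-up miniature of the benchmark data: id1 is a grouping key, v1 a value.
df = pd.DataFrame({"id1": ["id001", "id002", "id001", "id002"],
                   "v1": [1, 2, 3, 4]})

# "sum v1 by id1" in pandas...
result = df.groupby("id1", as_index=False)["v1"].sum()

# ...and the data.table equivalent in R:
# DT[, .(v1 = sum(v1)), by = id1]
```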

      Just note that these timings are as of now, with id1, id2 and id3 as character fields. Those will soon be changed to categorical. There are also other factors that are likely to impact these timings in the near future (like grouping in parallel). We are also going to add separate benchmarks for data having NAs, and for various cardinalities.



      Other tasks are coming to this continuous benchmarking project, so if you are interested in join, sort, read and others, be sure to check it later.

      And of course you are welcome to provide feedback in the project repo!














      • What about JuliaDB?
        – skan, Dec 16 '18 at 0:09

      • @skan you can track status of that in github.com/h2oai/db-benchmark/issues/63
        – jangorecki, Dec 17 '18 at 5:17















      edited Nov 1 '18 at 14:37

























      answered Oct 31 '18 at 21:53









      jangorecki



















      I know this is an older post, but figured it may be worth mentioning: using feather (in R and in Python) allows operating on data frames / data tables and sharing those results through feather files.



      See feather's GitHub page.






          answered Mar 6 at 1:39









          DonQuixote






















              Nope. In fact, if the dataset is so large that pandas crashes, you are basically stuck with dask, which sucks and you can't even do a simple groupby-sum. dplyr may not be fast, but it doesn't mess up.

              I'm currently working on a little 2GB dataset, and a simple print(df.groupby(['INCLEVEL1'])["r"].sum()) crashes dask.

              I didn't experience this error with dplyr.

              So, if pandas can handle the dataset, I use pandas; if not, I stick to R data.table.

              And yes, you can convert a dask dataframe back to a pandas dataframe with a simple df.compute(), but it takes a fairly long time, so you might as well just wait patiently for pandas to load or for data.table to read.
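              For comparison, the groupby-sum described above is a one-liner in plain pandas once the data fits in memory (INCLEVEL1 and r are the commenter's column names; the data here is made up):

```python
import pandas as pd

# Made-up stand-in for the commenter's 2GB dataset.
df = pd.DataFrame({"INCLEVEL1": ["low", "high", "low", "high"],
                   "r": [1.0, 2.0, 3.0, 4.0]})

# The same groupby-sum that crashed under dask, run directly in pandas.
print(df.groupby(["INCLEVEL1"])["r"].sum())
```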






                  edited 2 hours ago






























                  answered 2 hours ago









                  Chenying Gao
































                      ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6 (SMOTE) The 2019 Stack Overflow Developer Survey Results Are InCan SMOTE be applied over sequence of words (sentences)?ValueError when doing validation with random forestsSMOTE and multi class oversamplingLogic behind SMOTE-NC?ValueError: Error when checking target: expected dense_1 to have shape (7,) but got array with shape (1,)SmoteBoost: Should SMOTE be ran individually for each iteration/tree in the boosting?solving multi-class imbalance classification using smote and OSSUsing SMOTE for Synthetic Data generation to improve performance on unbalanced dataproblem of entry format for a simple model in KerasSVM SMOTE fit_resample() function runs forever with no result