In this episode I speak about the data transformation frameworks available to the data scientist who writes Python code.
The usual suspect is clearly Pandas: it is the most widely used library and the de facto standard. However, when data volumes grow and computation becomes distributed (following a map-reduce paradigm), Pandas no longer performs as expected, and other frameworks come into play.
In this episode I explain which frameworks are the closest equivalents to Pandas in big-data contexts.
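To make "equivalent to Pandas" concrete, here is a minimal sketch (the CSV path and column names are invented for illustration) of the same aggregation written with plain Pandas, with Modin's one-line import swap, and with Dask's lazy dataframe API:

```python
# Plain Pandas: the whole CSV is loaded into memory on a single core.
import pandas as pd
df = pd.read_csv("events.csv")  # hypothetical file
result = df.groupby("user_id")["amount"].sum()

# Modin: the same code, only the import changes; the work is dispatched
# to a Ray or Dask backend across all available cores.
import modin.pandas as mpd
mdf = mpd.read_csv("events.csv")
mresult = mdf.groupby("user_id")["amount"].sum()

# Dask: the dataframe is split into partitions and evaluated lazily;
# .compute() triggers the actual (possibly distributed) execution.
import dask.dataframe as dd
ddf = dd.read_csv("events.csv")
dresult = ddf.groupby("user_id")["amount"].sum().compute()
```

The point of the sketch is that all three expose (almost) the same dataframe API, so the migration cost is mostly in how the data is partitioned and where the computation runs, not in rewriting the analysis code.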
References
- Pandas – a fast, powerful, flexible and easy to use open source data analysis and manipulation tool – https://pandas.pydata.org/
- Modin – scale your Pandas workflows by changing one line of code – https://github.com/modin-project/modin
- Dask – advanced parallelism for analytics – https://dask.org/
- Ray – a fast and simple framework for building and running distributed applications (see the sketch after this list) – https://github.com/ray-project/ray
- RAPIDS – GPU data science – https://rapids.ai/
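As a companion to the Ray reference above, a minimal sketch of its core idea, turning ordinary Python functions into parallel tasks (the function and its inputs are invented for illustration):

```python
import ray

ray.init()  # start a local Ray runtime; on a cluster this connects to it

@ray.remote
def square(x):
    # Any ordinary Python function decorated this way becomes a distributed task.
    return x * x

# Launch the tasks in parallel and collect the results.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```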