Espoir Murhabazi | Lifelong learner
Hello, I'm Espoir!
Lifelong Learner | Software Engineer | Aspiring Data Scientist | Meetups Organiser

Posts

  • My First Step In The Bayesian World

    Here is the outline of this blog post:

  • Remaining Steps Datascience

    Remaining steps in my journey to become a data-scientist

    A simple list of what I need to learn to become a full data scientist

  • Dive Into Pandas

    Click here to view the notebook

  • Dive Into Pandas

    { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### [Quick dive into Pandas for Data Science](https://towardsdatascience.com/quick-dive-into-pandas-for-data-science-cc1c1a80d9c4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is an open-source Python library built on top of NumPy. \n", "It allows you to do fast analysis as well as data cleaning and preparation \n", "**(it is often said that data scientists spend up to 80% of their time cleaning and preparing data)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A. Installing Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **Colab users:** Pandas comes preinstalled in Google Colab\n", "- **Conda users:** run `conda install pandas` from the command line\n", "- **Non-conda users:** run `pip install pandas` from the command line; if you use a venv, make sure it is activated first" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Imports: run this cell first so you can follow along with the tutorial\n", "\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## B. Pandas Data Structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### B.1 Series " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Series is a one-dimensional array which is very similar to a NumPy array. As a matter of fact, Series are built on top of NumPy array objects. What differentiates a Series from a NumPy array is that a Series can have axis labels by which it can be indexed." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the basic syntax for creating a Series:\n", "\n", "`my_series = pd.Series(data, index)`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the syntax above, data can be any object type such as a dictionary, a list, or even a NumPy array, while index holds the axis labels with which the Series will be indexed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "countries = ['Kenya', 'Rwanda', 'Tanzania', 'Uganda', 'DRC']\n", "country_codes = ['+254', '+250', '+255', '+256', '+243']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "countries_serie = pd.Series(country_codes, countries)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Kenya +254\n", "Rwanda +250\n", "Tanzania +255\n", "Uganda +256\n", "DRC +243\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries_serie" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: the index is optional. When it is omitted, pandas uses a default integer index (or the keys, when data is a dictionary)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create a Series from a dict or a NumPy array." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> What differentiates a Pandas Series from a NumPy array is that Pandas Series can hold a variety of object types." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Grabbing information from Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We grab information from a Series the same way we do from a dictionary" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries_serie.get('Ethiopia', '0') # safest way: returns the default when the label is missing" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'+243'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries_serie['DRC']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Performing Arithmetic operations on Series\n", "\n", "Operations on Series are aligned on the index. When we use an arithmetic operator such as -, +, / or *, pandas matches the values by index label before computing. A label that is present in only one of the two Series yields NaN, and in that case an integer result is converted to float so that the missing value can be represented." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "prices1 = pd.Series([10, 23, 34, 35], ['tomato', 'banana', 'avocados', 'beans'])\n", "prices2 = pd.Series([12, 13, 54, 65], ['tomato', 'banana', 'avocados', 'beans'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "tomato 22\n", "banana 36\n", "avocados 88\n", "beans 100\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prices1 + prices2" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tomato -2\n", "banana 10\n", "avocados -20\n", "beans -30\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prices1 - prices2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### B.2. DataFrames\n", "\n", "A DataFrame is a two-dimensional data structure in which the data is aligned in a tabular form, i.e. in rows and columns. Pandas DataFrames make manipulating your data easy. You can select and replace columns and rows, and even reshape your data." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> A DataFrame is the core data structure of pandas.\n", "> You can view it as a list of Series sharing the same index, an Excel sheet, a SQL table or a matrix with labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the basic syntax for creating a DataFrame:\n", "\n", "`pd.DataFrame(data,index)`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "data can be any structured data type:\n", " - a dictionary where the keys are column names and the values are lists of values\n", " - a list of Series or a list of NumPy arrays\n", " - a 2D NumPy array \n", " - etc." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['+254', '+250', '+255', '+256', '+243']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries\n", "country_codes" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "capitals = ['Nairobi', 'Kigali', 'Dodoma', 'Kampala', 'Kinshasa']" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { 
    "text/plain": [ " capital codes\n", "Kenya Nairobi +254\n", "Rwanda Kigali +250\n", "Tanzania Dodoma +255\n", "Uganda Kampala +256\n", "DRC Kinshasa +243" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "country_df = pd.DataFrame(data={'capital': capitals, 'codes':country_codes}, index=countries)\n", "country_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Most of the time in your data science projects you will not create DataFrames by hand, but read them from different data sources." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### C. Useful functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Input and Output\n", "Using the `pd.read_*` functions, Pandas lets you access data from a wide variety of sources such as Excel sheets, CSV files, SQL databases, Google Sheets, HTML, etc. (For some formats you need to install additional libraries.)\n", "\n", "To read any of these files, you need to pass the path of the file you are reading\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us do some data science work.\n", "I have created a form for you to fill in with some data, and we are going to work with the responses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the data from the sheet that was just created (the sheet must be shared so that anyone with the link can access it)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "doc_id = '1bIhLt6BO4byo2VnqdIgzdEWWdfQU-eD2vsZfXeHGHjk'\n", "sheet_id = 809226885\n", "path = 'https://docs.google.com/spreadsheets/d/{}/export?gid={}&format=csv'.format(doc_id, sheet_id)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(path,\n", " # Set first column as rownames in data frame\n", " index_col=0,\n", " )" ] }, { "cell_type": "code", "execution_count": 65, 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "
    " ], "text/plain": [ " Email address First Name Last Name \\\n", "Timestamp \n", "07/09/2018 20:31:18 emurhabazi@exuus.com Espoir Mur \n", "08/09/2018 13:42:39 ganzamick@gmail.com Mick Ganza \n", "08/09/2018 13:42:45 mjos12002@gmail.com Joseph Manzi \n", "08/09/2018 13:42:45 sauvebrade@gmail.com Mizerere Jean \n", "08/09/2018 13:46:16 abigailnet@anonymous.com Anonymous Metasploit \n", "\n", " Country of Origin Python proficiency \\\n", "Timestamp \n", "07/09/2018 20:31:18 Democratic Republic of the Congo 7 \n", "08/09/2018 13:42:39 Rwanda 1 \n", "08/09/2018 13:42:45 Rwanda 7 \n", "08/09/2018 13:42:45 Rwanda 5 \n", "08/09/2018 13:46:16 South Sudan 10 \n", "\n", " Numpy Proficiency Gender Date of inscription \n", "Timestamp \n", "07/09/2018 20:31:18 7 Male 09/12/2018 \n", "08/09/2018 13:42:39 1 Male 09/04/2018 \n", "08/09/2018 13:42:45 7 Male 30/08/1983 \n", "08/09/2018 13:42:45 3 Male 08/09/2018 \n", "08/09/2018 13:46:16 10 Female 06/05/2005 " ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "you can either download the document and read it from your laptop or read " ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "data.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    " ], "text/plain": [ " Timestamp Email address First Name Last Name \\\n", "0 07/09/2018 20:31:18 emurhabazi@exuus.com Espoir Mur \n", "1 08/09/2018 13:42:39 ganzamick@gmail.com Mick Ganza \n", "2 08/09/2018 13:42:45 mjos12002@gmail.com Joseph Manzi \n", "3 08/09/2018 13:42:45 sauvebrade@gmail.com Mizerere Jean \n", "4 08/09/2018 13:46:16 abigailnet@anonymous.com Anonymous Metasploit \n", "\n", " Country of Origin Python proficiency Numpy Proficiency \\\n", "0 Democratic Republic of the Congo 7 7 \n", "1 Rwanda 1 1 \n", "2 Rwanda 7 7 \n", "3 Rwanda 5 3 \n", "4 South Sudan 10 10 \n", "\n", " Gender Date of inscription \n", "0 Male 09/12/2018 \n", "1 Male 09/04/2018 \n", "2 Male 30/08/1983 \n", "3 Male 08/09/2018 \n", "4 Female 06/05/2005 " ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "data.set_index('First Name', inplace=True)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
    " ], "text/plain": [ " Timestamp Email address Last Name \\\n", "First Name \n", "Espoir 07/09/2018 20:31:18 emurhabazi@exuus.com Mur \n", "Mick 08/09/2018 13:42:39 ganzamick@gmail.com Ganza \n", "Joseph 08/09/2018 13:42:45 mjos12002@gmail.com Manzi \n", "Mizerere 08/09/2018 13:42:45 sauvebrade@gmail.com Jean \n", "Anonymous 08/09/2018 13:46:16 abigailnet@anonymous.com Metasploit \n", "\n", " Country of Origin Python proficiency \\\n", "First Name \n", "Espoir Democratic Republic of the Congo 7 \n", "Mick Rwanda 1 \n", "Joseph Rwanda 7 \n", "Mizerere Rwanda 5 \n", "Anonymous South Sudan 10 \n", "\n", " Numpy Proficiency Gender Date of inscription \n", "First Name \n", "Espoir 7 Male 09/12/2018 \n", "Mick 1 Male 09/04/2018 \n", "Joseph 7 Male 30/08/1983 \n", "Mizerere 3 Male 08/09/2018 \n", "Anonymous 10 Female 06/05/2005 " ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have our DataFrame, we can:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Selecting Columns from DataFrames" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# eg: select the 'Last Name' column from our df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using bracket notation [], we can easily grab objects from a DataFrame the same way it’s done with Series. Let’s grab a column by name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we grabbed a single column, it returns a Series. 
Go ahead and confirm the data type returned using `type(data['Last Name'])`." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "First Name\n", "Espoir Mur\n", "Mick Ganza\n", "Joseph Manzi\n", "Mizerere Jean\n", "Anonymous Metasploit\n", "Kenneth Kamurali\n", "Aimable GAKIRE\n", "John Doe\n", "Spider Man\n", "pac bob\n", "kiki Paul\n", "Jay Brown\n", "Dushime Aimable\n", "Sylvia Asiimwe\n", "testa atset\n", "Tunmise Raji\n", "stef kijo\n", "David Butera\n", "Kenneth Kamurali\n", "Musafiri ildephonse\n", "Reponse Jean\n", "Jessica Ingabire\n", "Jean uwumukiza\n", "Karamba Gaston\n", "Name: Last Name, dtype: object" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['Last Name']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Renaming columns\n", "\n", "We can rename columns by passing `df.rename` a dictionary that maps old column names to new ones, together with the axis" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "## rename the column for proficiency in python to python \n", "### and proficiency in numpy to numpy" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Timestamp', 'Email address', 'Last Name', 'Country of Origin',\n", " 'Python proficiency', 'Numpy Proficiency', 'Gender',\n", " 'Date of inscription'],\n", " dtype='object')" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "data.rename({'Python proficiency': 'python', \n", " 'Numpy Proficiency': 'numpy'}, axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. 
Adding Columns to a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can add a brand-new column or derive one from existing columns:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "# eg: add the proficiency in python and numpy to create a new column." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "data['data_proficiency'] = data['python'] + data['numpy']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. Removing rows/columns from a DataFrame\n", "\n", "We can remove a row or a column using the [.drop()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) function. 
In doing this, we have to specify **axis=0 for a row, and axis=1 for a column.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint: [click here for the documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "very important to know" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "## eg: remove the new column recently created " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5. Selecting Rows in a DataFrame\n", "\n", "To select rows, we use .loc[], which takes the index label, or .iloc[], which takes the integer position of the row.\n", "\n", "**hint : [click me](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc)**" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "# eg: get the row that has your name as its index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6. Conditional selection\n", "\n", "Pandas allows you to perform conditional selection using bracket notation []. 
For instance, on a DataFrame `df` with a numeric column 'W', `df[df['W'] > 0]` returns the rows where 'W' > 0.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# get the rows for ladies with a python proficiency greater than 6" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# get just their name and country of origin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**hint : [click me](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) and [me](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)** " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.a: Go deeper: find the difference between loc, iloc, at, ix, iat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "hint : \n", "**loc:** selects by index label\n", "\n", "**iloc:** selects by integer position\n", "\n", "**ix:** supported both label-based and\n", "position-based retrieval, but it was deprecated\n", "in pandas 0.20 and removed in pandas 1.0;\n", "use loc or iloc instead\n", "\n", "**at:** gets a single scalar value by label; it's a very fast\n", "loc\n", "\n", "**iat:** gets a single scalar value by position; it's a very fast\n", "iloc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7. query method\n", "\n", "You can also use the query method, which takes a boolean\n", "expression on columns written as a quoted string.\n", "Example: `df.query('one > 0')`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 8. Missing Data\n", "\n", "A lot of times, when you’re using Pandas to read in data and there are missing points, Pandas will automatically fill in those missing points with a NaN or Null value. 
Hence, we can either drop those auto-filled values using .dropna() or fill them using .fillna().\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find missing data in the non-required columns and either fill it or drop the corresponding rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say you have a large dataset: Pandas makes it very easy to locate null values using .isnull():" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "# fill the empty values in the python and numpy columns\n", "# drop the rows with NaN in Last Name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**hint : [click me](http://pandas.pydata.org/pandas-docs/stable/missing_data.html#cleaning-filling-missing-data)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 9. GroupBy\n", "\n", "GroupBy allows you to group rows together based on a column so that you can perform aggregate functions (such as sum, mean, median, standard deviation, etc.) on them.\n", "\n", "Using the .groupby() method, we can group rows based on the 'Country of Origin' column and call the aggregate function .mean() on it to get the mean proficiency in python and numpy:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply other functions such as count or describe (for a statistical description)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "## group by country and get the mean score in python\n", "## group by gender and get the lady with the max score in python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint : [click me](http://pandas.pydata.org/pandas-docs/stable/groupby.html)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 10. The apply() Method\n", "\n", "The .apply() method is used to call custom functions on a DataFrame. 
Imagine we have a function that squares its input:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "## get the square of the proficiency in python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**hint : [click me](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 11. Map method\n", "\n", "We can use map to change the values of a column:\n", " " ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "## map gender and return M for Male and F for Female" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**hint: find it yourself**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 12. Sorting and Ordering DataFrame\n", "\n", "Imagine we wanted to display the DataFrame sorted by a certain column in ascending order; we could easily do it using .sort_values():" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "### let's sort our data by country" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**hint: Google is your friend**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Advanced topics " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 13. Concatenating, Merging, and Joining DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenation basically glues DataFrames together. When concatenating DataFrames, keep in mind that dimensions should match along the axis you are concatenating on. 
Given a list of DataFrames:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "## let's work with the following DataFrames" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3'],\n", " 'C': ['C0', 'C1', 'C2', 'C3'],\n", " 'D': ['D0', 'D1', 'D2', 'D3']},\n", " index=[0, 1, 2, 3])\n", "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n", " 'B': ['B4', 'B5', 'B6', 'B7'],\n", " 'C': ['C4', 'C5', 'C6', 'C7'],\n", " 'D': ['D4', 'D5', 'D6', 'D7']},\n", " index=[4, 5, 6, 7]) \n", "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n", " 'B': ['B8', 'B9', 'B10', 'B11'],\n", " 'C': ['C8', 'C9', 'C10', 'C11'],\n", " 'D': ['D8', 'D9', 'D10', 'D11']},\n", " index=[8, 9, 10, 11])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**hint: use Google**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 14. Pivot tables " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 15. Take Advantage of Accessor Methods (str, dt, cat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "more info [here](https://realpython.com/python-pandas-tricks/#3-take-advantage-of-accessor-methods)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 16. Working with dates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 17. Plotting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }
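Most of the exercise cells in the notebook above contain only a comment. The sketch below shows one possible way to work through several of them. It does not use the real form responses: the `data` frame is a small invented stand-in that only borrows the notebook's column names (`python` and `numpy` are the names produced by the renaming step), and all variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the survey sheet: the column names follow the
# notebook (after the renaming step), but every row is invented.
data = pd.DataFrame(
    {
        "Last Name": ["Doe", "Roe", None, "Poe"],
        "Country of Origin": ["Rwanda", "Kenya", "Rwanda", "Kenya"],
        "python": [7.0, 9.0, None, 5.0],
        "numpy": [6.0, 8.0, 4.0, 3.0],
        "Gender": ["Female", "Female", "Male", "Male"],
    },
    index=pd.Index(["Ann", "Bea", "Cal", "Dan"], name="First Name"),
)

# Missing data: fill the numeric gap, then drop rows without a last name.
data["python"] = data["python"].fillna(0)
data = data.dropna(subset=["Last Name"])

# Add a column derived from existing ones, then remove it (axis=1 = column).
with_total = data.assign(data_proficiency=data["python"] + data["numpy"])
without_total = with_total.drop("data_proficiency", axis=1)

# Select a row by index label (.loc) and by integer position (.iloc).
bea_by_label = data.loc["Bea"]
bea_by_position = data.iloc[1]

# Conditional selection, with a boolean mask and with .query().
mask = (data["Gender"] == "Female") & (data["python"] > 6)
strong_women = data.loc[mask, ["Last Name", "Country of Origin"]]
strong_women_q = data.query("Gender == 'Female' and python > 6")

# GroupBy: mean python score per country.
mean_by_country = data.groupby("Country of Origin")["python"].mean()

# apply() with a custom function, and map() with a lookup dictionary.
python_squared = data["python"].apply(lambda score: score ** 2)
gender_code = data["Gender"].map({"Male": "M", "Female": "F"})

# Sort by a column (ascending by default).
by_country = data.sort_values("Country of Origin")

# Concatenation glues frames with matching columns along the row axis.
stacked = pd.concat([data.iloc[:1], data.iloc[1:]])
```

`assign` is used for the new column instead of `data[...] = ...` only to avoid modifying a filtered frame in place; the bracket assignment shown in the notebook works as well on a frame you own.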
  • Regex Match Emails

    Using regular expressions to extract emails from documents

  • Test Driven Development

    Test-Driven development

  • Find And Replace

    Find and replace in the Atom text editor

  • Managing Import Errors

    Import error, import errors, Python … how to manage this?

  • Fpl Analysis

    FPL Team Analysis Season 2018-2019:

  • Using Flask Signals

    Learning how to send signals via flask:

  • Bayes Classifier

    The first assumption we make for the Bayes classifier is that the data is IID (independent and identically distributed)

  • Export_data_from_local_database_to_dockerdatabase

    Migrating data from a local database to a Docker database.

  • Docker Flask Angular Postgres

    I spent more than 24 hours trying to figure out why a Flask app was unable to communicate with a PostgreSQL database while both were running in Docker containers.

  • Adding Nginx To Serve Applications In Production

    Configure nginx to serve apps in production

  • Datascienceafrica_day1

    Fundamentals of IoT session

  • Intro

    Espoir Murhabazi

  • What my ideal day should look like

    In this blog post I’m going to describe my ideal day.

  • Blogs_ideas

    This is a draft of what I need to talk about, but what?

  • How to solve an 8 Puzzle game using search algorithms

    How to solve an 8-puzzle game using search algorithms

subscribe via RSS