pandas filter: filter vs query

Pandas Filter by Column Value

1. Quick Examples of pandas Filter by Column Value

If you are in a hurry, below are quick examples of how to filter by column value in a pandas DataFrame.

df[ ] special cases: df[['colNm1', 'colNm2']] extends df['colNm1'] (a single Series) to multiple columns, while DataFrame.filter() selects columns by label (see the filter-vs-query sketch after these quick examples).

df[['colNm1', 'colNm2']]

df.filter(items=['colNm1','colNm2'])   # exact column names
df.filter(like='colNm', axis=1)        # names containing 'colNm'
df.filter(regex='b$', axis=1)          # names ending in 'b'
# Filter Rows using DataFrame.query()
df2=df.query("Courses == 'Spark'")

# Using a variable
value='Spark'
df2=df.query("Courses == @value")

# inplace
df.query("Courses == 'Spark'",inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("Fee >= 23000")
df.query("Fee >= 23000 and Fee <= 24000")
# Backtick-quote column names that contain spaces, e.g. `Courses Fee`
df.query("`Courses Fee` >= 23000")

# Other ways to Filter Rows
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]

df[df["Courses"] == 'Spark'] 
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]

df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
df.dropna()

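To see how filter() differs from query() at a glance, here is a minimal self-contained sketch (the column names colA, colB, and extra are made up for illustration): filter() matches labels along an axis and never looks at cell values, while query() evaluates a boolean expression against the values.

import pandas as pd

df = pd.DataFrame({'colA': [1, 2], 'colB': [3, 4], 'extra': [5, 6]})

# filter() selects by label
print(df.filter(items=['colA', 'colB']))   # exact names
print(df.filter(like='col', axis=1))       # names containing 'col'
print(df.filter(regex='B$', axis=1))       # names ending in 'B'

# query() selects by value
print(df.query("colA >= 2"))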

Let's run through these examples with sample data and explore the output to understand them better. First, create a pandas DataFrame from a dictionary.

import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
df = pd.DataFrame(technologies)
print(df)

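For reference, printing the DataFrame yields:

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500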

Note that the DataFrame above contains None and NaN values in the Duration column; the examples below use them to select rows that have None/NaN values, or to ignore such rows.

2. Using query() to Filter by Column Value in pandas

The DataFrame.query() function filters rows based on column values. It evaluates the expression and returns a new DataFrame; to update the existing DataFrame instead, use the inplace=True parameter.

# Filter all rows where Courses equals 'Spark'
df2=df.query("Courses == 'Spark'")
print(df2)


Yields below output.

  Courses    Fee Duration  Discount
0   Spark  22000   30days      1000


To use a Python variable in the expression, prefix it with the @ character.

# Filter Rows by using Python variable
value='Spark'
df2=df.query("Courses == @value")
print(df2)

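The @ prefix works with any Python variable in scope, not just strings. A minimal sketch with a numeric variable (fee_cap is a made-up name):

# Filter with a numeric variable
fee_cap = 24000
print(df.query("Fee <= @fee_cap"))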

Notice that the above examples return a new DataFrame after filtering the rows. To update the existing DataFrame in place, use inplace=True.

# Replace the existing DataFrame in place
df.query("Courses == 'Spark'",inplace=True)
print(df)

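Keep in mind that inplace=True overwrites df, which now holds only the 'Spark' row. Re-create the sample DataFrame before running the remaining examples:

# Restore the sample DataFrame for the examples that follow
df = pd.DataFrame(technologies)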

To select rows where a column value does not equal something, use the != operator.

# not equals condition
df2=df.query("Courses != 'Spark'")


Yields below output.

   Courses    Fee Duration  Discount
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500


3. Filter Rows Based on List of Column Values

If you have values in a Python list and want to select rows based on that list, use the in operator.

# Filter Rows by list of values
print(df.query("Courses in ('Spark','PySpark')"))


Yields below output.

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300


You can also put the values in a list and reference it as a Python variable.

# Filter Rows by list of values
values=['Spark','PySpark']
print(df.query("Courses in @values"))


Use the not in operator to select rows whose column value is not in the list.

# Filter Rows not in list of values
values=['Spark','PySpark']
print(df.query("Courses not in @values"))

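For comparison, the same two filters can be written without query(), using isin() and the ~ negation operator:

# Equivalent filters using isin()
values=['Spark','PySpark']
print(df[df['Courses'].isin(values)])    # Courses in the list
print(df[~df['Courses'].isin(values)])   # Courses not in the list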

If a column name contains special characters such as spaces, surround it with backticks (`) inside the expression. Our sample DataFrame names this column Fee, so the example below first renames it to Courses Fee to demonstrate.

# Using columns with special characters
df2 = df.rename(columns={'Fee': 'Courses Fee'})
print(df2.query("`Courses Fee` >= 23000"))


4. pandas Filter by Multiple Columns

In pandas, as in any table-like structure, you often need to filter rows based on multiple conditions across multiple columns. You can do that in a pandas DataFrame as below.

# Filter by multiple conditions
print(df.query("Fee >= 23000 and Fee <= 24000"))


Yields below output. Alternatively, you can also use pandas loc with multiple conditions.

  Courses    Fee Duration  Discount
2  Hadoop  23000   30days      1000
3  Python  24000     None      1200

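Here is the same multiple-condition filter written with loc[] and boolean masks, covered further in section 7:

# Equivalent filter using loc[]
print(df.loc[(df['Fee'] >= 23000) & (df['Fee'] <= 24000)])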

5. Using DataFrame.apply() & Lambda Function

The pandas.DataFrame.apply() method applies a function along an axis of the DataFrame and can be combined with a boolean mask to return only the matching rows. Note that with the default axis=0 the lambda below receives each column in turn (despite its parameter being named row).

# By using lambda function
print(df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])]))


Yields below output.

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300

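A more explicit row-by-row alternative passes axis=1, so the lambda really does receive one row at a time and builds a boolean mask:

# Row-wise boolean mask with axis=1
mask = df.apply(lambda row: row['Courses'] in ['Spark', 'PySpark'], axis=1)
print(df[mask])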

6. Filter Rows with NaN Using dropna() Method

To filter out rows that have None or NaN in any column value, use the DataFrame.dropna() method.

# Filter out rows that have None or NaN values
print(df.dropna())


Yields below output.

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000


Similarly, to drop entire columns that contain None or NaN values, pass axis='columns' to dropna().

# Filter all column that have None or NaN
print(df.dropna(axis='columns'))


Yields below output.

   Courses    Fee  Discount
0    Spark  22000      1000
1  PySpark  25000      2300
2   Hadoop  23000      1000
3   Python  24000      1200
4   Pandas  26000      2500

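dropna() also accepts a subset parameter if you only want missing values in specific columns to count; a minimal sketch:

# Drop rows with missing values in the Duration column only
print(df.dropna(subset=['Duration']))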

7. Using DataFrame.loc[] and df[]

# Other examples you can try to filter rows
df[df["Courses"] == 'Spark']                                  # boolean mask with df[]
df.loc[df['Courses'] == value]                                # equals a variable
df.loc[df['Courses'] != 'Spark']                              # not equals
df.loc[df['Courses'].isin(values)]                            # in a list
df.loc[~df['Courses'].isin(values)]                           # not in a list
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]   # value range
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000)]       # multiple columns

# Select based on value contains
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select by startswith
print(df[df['Courses'].str.startswith("P")])

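As a side note, str.contains() takes case=False to match case-insensitively without the separate str.lower() step, and na=False to treat missing values as non-matches:

# Case-insensitive contains, treating NaN as no match
print(df[df['Courses'].str.contains("spark", case=False, na=False)])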

Conclusion

In this article, I have explained multiple ways to filter a pandas DataFrame by column value (both rows and columns). Remember that querying DataFrame rows returns a new DataFrame with the selected rows; to update the existing df, use inplace=True or assign the result of df[] / df.loc[] back. I hope this article helps you learn pandas.

Happy Learning !!
