pandas filter: filter vs query
Pandas Filter by Column Value
Related:
- pandas.DataFrame.filter() – To filter rows by index and columns by name.
- pandas.DataFrame.loc[] – To select rows by index label and columns by name.
- pandas.DataFrame.iloc[] – To select rows and columns by integer position.
- pandas.DataFrame.apply() – To select rows using a custom lambda function.
1. Quick Examples of pandas Filter by Column Value
If you are in a hurry, below are quick examples of how to filter by column value in a pandas DataFrame.
[ ] A special case: extending df['colNm1'] (a single-Series selection) to multiple columns.

df[['colNm1', 'colNm2']]                 # select multiple columns by name
df.filter(items=['colNm1', 'colNm2'])    # select columns by exact name
df.filter(like='colNm', axis=1)          # column names containing 'colNm'
df.filter(regex='b$', axis=1)            # column names ending with 'b'
# Filter rows using DataFrame.query()
df2 = df.query("Courses == 'Spark'")

# Using a Python variable
value = 'Spark'
df2 = df.query("Courses == @value")

# Update the existing DataFrame in place
df.query("Courses == 'Spark'", inplace=True)

# Not equals, in & multiple conditions
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("Fee >= 23000")
df.query("Fee >= 23000 and Fee <= 24000")
# Backtick-quote column names that contain spaces, e.g. a column named 'Courses Fee'
# df.query("`Courses Fee` >= 23000")

# Other ways to filter rows
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000)]
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
df.dropna()
If you are a learner, let's create sample data, run through these examples, and explore the output to understand them better. First, create a pandas DataFrame from a dictionary.
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
}
df = pd.DataFrame(technologies)
print(df)
Note that the above DataFrame contains None and NaN values in the Duration column. I will use these in the examples below, both to select rows that have missing values and to select rows while ignoring them.
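For example, to select only the rows where Duration is missing, you can build a boolean mask with isna(); this is a minimal sketch against the sample DataFrame above:

# Select rows where Duration is None/NaN
print(df[df['Duration'].isna()])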
2. Using query() to Filter by Column Value in pandas
DataFrame.query() filters rows based on an expression over column values and returns a new DataFrame. If you want to update the existing DataFrame instead, use the inplace=True param.
# Filter all rows where Courses equals 'Spark'
df2 = df.query("Courses == 'Spark'")
print(df2)
Yields below output.
  Courses    Fee Duration  Discount
0   Spark  22000   30days      1000
In case you want to use a variable in the expression, prefix it with the @ character.
# Filter rows by using a Python variable
value = 'Spark'
df2 = df.query("Courses == @value")
print(df2)
Notice that the above examples return a new DataFrame after filtering the rows. If you want to update the existing DataFrame, use inplace=True.
# Replace the current existing DataFrame
df.query("Courses == 'Spark'", inplace=True)
print(df)
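Alternatively, instead of mutating with inplace=True, a common idiom is to assign the filtered result back to the variable; a minimal sketch:

# Assign the filtered result back instead of using inplace=True
df = df.query("Courses == 'Spark'")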
To select rows where a column value does not equal something, use the != operator.
# Not equals condition
df2 = df.query("Courses != 'Spark'")
print(df2)
Yields below output.
   Courses    Fee Duration  Discount
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500
3. Filter Rows Based on List of Column Values
If you have values in a Python list and want to select rows based on that list of values, use the in operator.
# Filter rows by a list of values
print(df.query("Courses in ('Spark','PySpark')"))
Yields below output.
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
You can also put the values in a Python list and reference it in the expression as a variable.
# Filter rows by a list of values in a variable
values = ['Spark','PySpark']
print(df.query("Courses in @values"))
Use the not in operator to select rows whose column value is not in the list.
# Filter rows not in a list of values
values = ['Spark','PySpark']
print(df.query("Courses not in @values"))
If a column name contains spaces or other special characters, surround it with backtick (`) characters in the query expression. For example, if the fee column were named Courses Fee:
# Using columns with special characters
# (assumes the fee column is named 'Courses Fee'; the sample DataFrame names it 'Fee')
print(df.query("`Courses Fee` >= 23000"))
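Since the sample DataFrame actually names this column Fee, here is a minimal runnable sketch that renames it first, purely for illustration; df_spaced is a hypothetical name introduced for this demo:

# Rename 'Fee' to 'Courses Fee' only to demonstrate the backtick syntax
df_spaced = df.rename(columns={'Fee': 'Courses Fee'})
print(df_spaced.query("`Courses Fee` >= 23000"))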
4. pandas Filter by Multiple Columns
In pandas, as in any table-like structure, you often need to filter rows on multiple conditions across multiple columns. You can do that in a pandas DataFrame as below.
# Filter by multiple conditions
print(df.query("Fee >= 23000 and Fee <= 24000"))
Yields below output. Alternatively, you can also use pandas loc[] with multiple conditions, as shown in the sketch after this output.
  Courses    Fee Duration  Discount
2  Hadoop  23000   30days      1000
3  Python  24000     None      1200
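As a minimal sketch of the loc[] alternative mentioned above, the same filter can be written with boolean masks combined by & (each condition needs its own parentheses):

# Equivalent filter using DataFrame.loc[] and boolean masks
print(df.loc[(df['Fee'] >= 23000) & (df['Fee'] <= 24000)])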
5. Using DataFrame.apply() & Lambda Function
pandas.DataFrame.apply() applies a function along an axis of the DataFrame. In the example below, the lambda is applied to each column (the default axis=0) and keeps only the entries whose Courses value is in the given list, which effectively returns the matching rows.
# By using a lambda function with apply()
print(df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])]))
Yields below output. For a true row-by-row apply, pass axis=1, as shown in the sketch after this output.
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
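For a genuine row-by-row version, you can pass axis=1 so the lambda receives one row at a time; a minimal sketch:

# Row-wise apply: build a boolean mask, then index with it
mask = df.apply(lambda row: row['Courses'] in ['Spark', 'PySpark'], axis=1)
print(df[mask])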
6. Filter Rows with NaN Using dropna() Method
In case you want to filter out rows that have None or NaN values in any column, use the DataFrame.dropna() method.
# Filter out rows that have None or NaN values in any column
print(df.dropna())
Yields below output.
   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
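If you only care about missing values in particular columns, dropna() also accepts a subset parameter; a minimal sketch using the sample data:

# Drop rows only when 'Duration' is missing, ignoring other columns
print(df.dropna(subset=['Duration']))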
In case you want to drop columns that contain None or NaN values, pass axis='columns'. For deleting columns more generally, I have covered some examples of how to drop pandas DataFrame columns.
# Drop all columns that have None or NaN values
print(df.dropna(axis='columns'))
Yields below output.
   Courses    Fee  Discount
0    Spark  22000      1000
1  PySpark  25000      2300
2   Hadoop  23000      1000
3   Python  24000      1200
4   Pandas  26000      2500
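dropna() also takes a thresh parameter if you would rather keep columns with at least a given number of non-null values; a small sketch, with the threshold of 4 chosen arbitrarily for the five-row sample:

# Keep only columns that have at least 4 non-null values
print(df.dropna(axis='columns', thresh=4))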
7. Using DataFrame.loc[] and df[]
You can achieve the same row filtering without query() by using boolean indexing with df[] or DataFrame.loc[].
# Other examples you can try to filter rows
df[df["Courses"] == 'Spark']
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000)]

# Select rows where the value contains a string
print(df[df['Courses'].str.contains("Spark")])

# Select after converting values to lowercase
print(df[df['Courses'].str.lower().str.contains("spark")])

# Select rows where the value starts with a string
print(df[df['Courses'].str.startswith("P")])
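As an aside, str.contains() also accepts a case parameter, so the lowercase conversion above can be collapsed into a single call; a minimal sketch:

# Case-insensitive match without converting the column
print(df[df['Courses'].str.contains("spark", case=False)])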
Conclusion
In this article, I have explained multiple examples of how to filter a pandas DataFrame by column value (rows and columns). Remember that querying DataFrame rows always returns a new DataFrame with the selected rows; to update the existing df you have to use inplace=True, assign the result back, or use df[] with df.loc[]. I hope this article helps you learn pandas.
Happy Learning !!