Generalize Names & Duplicate Checking
A function that converts a name into a general format <last_name><separator><firstname letter(s)> (all lowercase) in a pandas DataFrame while avoiding duplicate entries.
from mlxtend.text import generalize_names_duplcheck
Note that using mlxtend.text.generalize_names with few firstname_output_letters can result in duplicate entries. For example, if your dataset contains the names "Adam Johnson" and "Andrew Johnson", the default setting (i.e., 1 first name letter) will produce the generalized name "johnson a" in both cases.
One solution is to increase the number of first name letters in the output by setting the parameter firstname_output_letters to a value larger than 1.
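The collision described above, and the effect of using more first name letters, can be illustrated with a minimal sketch. Note that simple_generalize is a hypothetical, simplified stand-in written for this illustration (it only handles plain "First Last" names, and its sep parameter is an assumption); it is not the mlxtend implementation.

```python
def simple_generalize(name, firstname_output_letters=1, sep=" "):
    """Hypothetical simplified stand-in; expects a plain 'First Last' name."""
    first, last = name.lower().split()
    return f"{last}{sep}{first[:firstname_output_letters]}"


names = ["Adam Johnson", "Andrew Johnson"]

# One first name letter: both names collapse to the same string.
one_letter = [simple_generalize(n) for n in names]

# Two first name letters: the entries become distinguishable.
two_letters = [simple_generalize(n, firstname_output_letters=2) for n in names]

print(one_letter)   # ['johnson a', 'johnson a']
print(two_letters)  # ['johnson ad', 'johnson an']
```

A fixed larger value lengthens every generalized name, including those that were already unique with a single letter.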
An alternative solution is to use the generalize_names_duplcheck function if you are working with pandas DataFrames.
generalize_names_duplcheck will apply generalize_names to a pandas DataFrame column with the minimum number of first name letters and append as many first name letters as necessary until no duplicates are present in the given DataFrame column. The example below uses a dataset column that contains three names, two of which share a last name and a first initial.
Example 1 - Defaults
Reading in a CSV file that has a column "name" whose entries we want to generalize:
- Samuel Eto'o
- Adam Johnson
- Andrew Johnson
import pandas as pd
from io import StringIO

# Simulate a small CSV file with a "name" column
simulated_csv = (
    "name,some_value\n"
    "Samuel Eto'o,1\n"
    "Adam Johnson,1\n"
    "Andrew Johnson,1\n"
)

df = pd.read_csv(StringIO(simulated_csv))
df
Applying generalize_names_duplcheck to generate a new DataFrame with the generalized names without duplicates:
from mlxtend.text import generalize_names_duplcheck

df_new = generalize_names_duplcheck(df=df, col_name='name')
df_new
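To make the duplicate handling concrete, the idea of starting with one first name letter and appending more only where entries clash can be sketched in plain Python. This is a rough illustration under stated assumptions, not the library's code: split_name and generalize_without_duplicates are hypothetical helpers, names are assumed to have the form "First Last", and stripping apostrophes is a choice made for this sketch.

```python
from collections import Counter


def split_name(name):
    # Hypothetical helper: lowercase, drop apostrophes, expect "First Last".
    first, last = name.lower().replace("'", "").split()
    return first, last


def generalize_without_duplicates(names):
    parts = [split_name(n) for n in names]
    letters = [1] * len(parts)  # first name letters currently used per entry
    while True:
        result = [f"{last} {first[:k]}"
                  for (first, last), k in zip(parts, letters)]
        counts = Counter(result)
        # Entries that clash and still have unused first name letters.
        to_extend = [
            i for i, r in enumerate(result)
            if counts[r] > 1 and letters[i] < len(parts[i][0])
        ]
        if not to_extend:
            return result  # all unique, or only identical names remain
        for i in to_extend:
            letters[i] += 1


names = ["Samuel Eto'o", "Adam Johnson", "Andrew Johnson"]
print(generalize_without_duplicates(names))
# ['etoo s', 'johnson ad', 'johnson an']
```

In this sketch only the clashing entries receive extra letters, so names that are already unique keep the shortest form.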
Generalizes names and removes duplicates.
Applies mlxtend.text.generalize_names to a DataFrame with 1 first name letter by default and uses more first name letters if duplicates are detected.
Parameters
- df: DataFrame that contains a column where generalize_names should be applied.
- col_name: Name of the DataFrame column to which the generalize_names function should be applied.
Returns
- New DataFrame object in which the generalize_names function has been applied without duplicates.