[Profanity Filter] Efficiently filtering combinations of bad words out of string inputs

Date: May 28, 2016
Author: pdwitte

Today we ran into a problem: we want to filter player input in our games, but players are smart and know they can misspell words to get around filters. So we wrote a system that doesn’t let them do this. It’s managed through a Google spreadsheet that you can maintain yourself, and it also supports ignoring a match when the bad word turns out to be part of a good one (e.g. the word “bass” would otherwise trigger the filter for “*ss”).

Then we ran into the next problem: it is expensive to compare parts of a string against a list of strings. For example, finding the word *ss in the word “bass” would require you to iterate over every single word in your list (10k+ words in our case) and check each one with a .contains call. I figured there had to be a better way, so I wrote a function whose runtime grows with the message size, not the list size, which should make filtering more efficient and easier for everyone. Let’s make sure we keep the kids from picking up bad words in our games!
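To make the difference concrete, here is a minimal sketch of the two approaches side by side. It is not the code from this post: the class name, both method names and the placeholder word list are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; names and the placeholder words are assumptions.
class LookupComparison {

    // Naive approach: one contains() scan of the message per list entry,
    // so the work grows with the size of the list.
    static List<String> naiveScan(String message, List<String> blocked) {
        List<String> hits = new ArrayList<>();
        for (String word : blocked) {
            if (message.contains(word)) {
                hits.add(word);
            }
        }
        return hits;
    }

    // Hashed approach: enumerate substrings of the message (at most maxLen
    // characters long) and look each one up in a hash set, so the work
    // grows with the message length instead.
    static List<String> hashedScan(String message, Set<String> blocked, int maxLen) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < message.length(); start++) {
            for (int len = 1; len <= maxLen && start + len <= message.length(); len++) {
                String candidate = message.substring(start, start + len);
                if (blocked.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> list = List.of("darn", "heck");
        System.out.println(naiveScan("ohdarnit", list));                  // [darn]
        System.out.println(hashedScan("ohdarnit", new HashSet<>(list), 4)); // [darn]
    }
}
```

With a two-word list the difference is invisible, but the naive loop runs once per list entry, while the hashed loop runs at most message length times maxLen lookups no matter how long the list gets.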

Here’s how it works:

1) Have a Google spreadsheet with all the words that I want to filter out

2) Download the Google spreadsheet directly into my code with the loadConfigs method (see below)

3) Replace all l33tsp33k characters with their respective alphabet letters

4) Remove every character that is not a letter from the sentence

5) Run an algorithm that efficiently checks all the possible substrings of the input against the list. This part is key: you don’t want to loop over your ENTIRE list every time to see if your word is in the list. In my case, I generate every substring of the input and check it against a hashmap (O(1) expected lookup). This way the runtime grows with the length of the input string, not the size of the list. It also caps the search space: no substring longer than the largest word in your filter needs to be checked.

6) Check that the word is not used in combination with a good word (e.g. bass contains *ss). The good words are also loaded through the spreadsheet

7) In our case we are also posting the filtered words to Slack, but you can remove that line obviously.
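Steps 3 and 4 can be sketched as a single pass over the input. This is only an illustration: the substitution table is the one used in the gist, but the class name, the method name and the use of a lookup map are my own assumptions rather than the code from this post.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; the gist does the same job with chained replaceAll calls.
class Normalizer {

    // l33tsp33k substitutions, copied from the replacements used in the gist.
    static final Map<Character, Character> LEET = Map.of(
            '1', 'i', '!', 'i', '3', 'e', '4', 'a', '@', 'a',
            '5', 's', '7', 't', '0', 'o', '9', 'g');

    static String normalize(String input) {
        StringBuilder out = new StringBuilder();
        for (char c : input.toLowerCase(Locale.ROOT).toCharArray()) {
            char mapped = LEET.getOrDefault(c, c); // step 3: undo l33tsp33k
            if (mapped >= 'a' && mapped <= 'z') {  // step 4: keep letters only
                out.append(mapped);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("H3ll0, W0rld!")); // helloworldi
        System.out.println(normalize("b@5s"));          // bass
    }
}
```

Note that "!" maps to "i" before the letters-only check, so trailing punctuation can add a stray letter. That is also how the gist behaves.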

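Use this structure in your Google Sheet: the first column holds the word to filter, and the optional second column holds the good words that exempt it, separated by underscores (this is the layout the loadConfigs method reads).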

Then use the functions in this gist to load the sheet, and call the badWordsFound function to get a list of all bad words inside a string input.
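As a rough sketch of what loading a sheet in that layout involves, the snippet below parses CSV text held in memory instead of downloading it. The class name, method name and sample rows are placeholders I made up; only the comma and underscore conventions come from the gist.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser for the "word,good1_good2" row layout read by loadConfigs.
class SheetRows {

    static Map<String, String[]> parse(String csv) {
        Map<String, String[]> words = new HashMap<>();
        for (String line : csv.split("\n")) {
            String[] columns = line.trim().split(",");
            String word = columns[0].replace(" ", "");
            if (word.isEmpty()) {
                continue; // skip blank rows
            }
            // Second column is optional: good words separated by underscores.
            String[] goodWords = (columns.length > 1 && !columns[1].isEmpty())
                    ? columns[1].split("_")
                    : new String[0];
            words.put(word, goodWords);
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, String[]> words = parse("darn\nheck,check_heckle\n");
        System.out.println(words.get("darn").length); // 0
        System.out.println(words.get("heck").length); // 2
    }
}
```

A plain split on commas is enough for this two-column layout, but it would break on cells that themselves contain commas or quotes.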

Good luck! Feel free to reply with questions.

Here’s the code running with the word “abcdef”:

	checking: 0,1
	word: a
	checking: 0,2
	word: ab
	checking: 0,3
	word: abc
	checking: 0,4
	word: abcd
	checking: 0,5
	word: abcde
	checking: 0,6
	word: abcdef
	checking: 1,1
	word: b
	checking: 1,2
	word: bc
	checking: 1,3
	word: bcd
	checking: 1,4
	word: bcde
	checking: 1,5
	word: bcdef
	checking: 2,1
	word: c
	checking: 2,2
	word: cd
	checking: 2,3
	word: cde
	checking: 2,4
	word: cdef
	checking: 3,1
	word: d
	checking: 3,2
	word: de
	checking: 3,3
	word: def
	checking: 4,1
	word: e
	checking: 4,2
	word: ef
	checking: 5,1
	word: f
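The trace above can be reproduced with a short loop. This is a sketch, not the gist itself: the class and method names are hypothetical, and maxLen stands in for largestWordLength, assumed here to be at least six so that the cap never kicks in for “abcdef”.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tracer that lists every substring the filter would look up.
class SubstringTrace {

    static List<String> trace(String input, int maxLen) {
        List<String> lines = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            // offset is the substring length, bounded by the end of the input and by maxLen
            for (int offset = 1; offset <= input.length() - start && offset <= maxLen; offset++) {
                lines.add("checking: " + start + "," + offset);
                lines.add("word: " + input.substring(start, start + offset));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : trace("abcdef", 6)) {
            System.out.println(line);
        }
    }
}
```

For a six-letter input that is 21 substrings. With a smaller cap, say trace("abcdef", 3), only 15 substrings are checked, which is the search-space cap described in step 5.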

	import java.io.BufferedReader;
	import java.io.IOException;
	import java.io.InputStreamReader;
	import java.net.URL;
	import java.util.ArrayList;
	import java.util.HashMap;
	import java.util.Map;

	// The class name is arbitrary; the gist only shows the members.
	public class BadWordFilter {

		// bad word -> good words that exempt it (e.g. "bass" for "*ss")
		static Map<String, String[]> words = new HashMap<>();

		// length of the longest bad word; caps the substring search
		static int largestWordLength = 0;

		public static void loadConfigs() {
			String sheetUrl = "https://docs.google.com/spreadsheets/d/1hIEi2YG3ydav1E06Bzf2mQbGZ12kh2fe4ISgLg_UBuM/export?format=csv";
			// try-with-resources closes the reader even if reading fails
			try (BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(sheetUrl).openConnection().getInputStream()))) {
				String line;
				int counter = 0;
				while ((line = reader.readLine()) != null) {
					String[] content = line.split(",");
					if (content.length == 0) {
						continue;
					}
					// first column: the bad word, with spaces removed
					String word = content[0].replaceAll(" ", "");
					if (word.isEmpty()) {
						continue;
					}
					// optional second column: good words separated by underscores
					String[] ignoreInCombinationWithWords = new String[]{};
					if (content.length > 1) {
						ignoreInCombinationWithWords = content[1].split("_");
					}
					if (word.length() > largestWordLength) {
						largestWordLength = word.length();
					}
					words.put(word, ignoreInCombinationWithWords);
					counter++;
				}
				System.out.println("Loaded " + counter + " words to filter out");
			} catch (IOException e) {
				e.printStackTrace();
			}
		}

		/**
		 * Iterates over a String input and checks whether a cuss word was found in a list, then checks if the word should be ignored (e.g. bass contains the word *ss).
		 * @param input the raw text to check; may be null
		 * @return every bad word found in the input (empty if none)
		 */
		public static ArrayList<String> badWordsFound(String input) {
			if (input == null) {
				return new ArrayList<>();
			}

			// don't forget to remove leetspeak, probably want to move this to its own function if you want to use this
			input = input.replace('1', 'i').replace('!', 'i').replace('3', 'e')
					.replace('4', 'a').replace('@', 'a').replace('5', 's')
					.replace('7', 't').replace('0', 'o').replace('9', 'g');

			ArrayList<String> badWords = new ArrayList<>();
			// lowercase, then drop everything that is not a letter
			input = input.toLowerCase().replaceAll("[^a-z]", "");

			// iterate over each letter in the input
			for (int start = 0; start < input.length(); start++) {
				// from each letter, keep going to find bad words until either the end of the sentence is reached, or the max word length is reached.
				// <= (not <) so that the longest word in the list can still be matched
				for (int offset = 1; offset <= input.length() - start && offset <= largestWordLength; offset++) {
					String wordToCheck = input.substring(start, start + offset);
					if (words.containsKey(wordToCheck)) {
						// for example, if you want to say the word bass, that should be possible.
						String[] ignoreCheck = words.get(wordToCheck);
						boolean ignore = false;
						for (String goodWord : ignoreCheck) {
							if (input.contains(goodWord)) {
								ignore = true;
								break;
							}
						}
						if (!ignore) {
							badWords.add(wordToCheck);
						}
					}
				}
			}

			for (String s : badWords) {
				System.out.println(s + " qualified as a bad word in a username");
			}
			return badWords;
		}

		// username is currently unused
		public static String filterText(String input, String username) {
			ArrayList<String> badWords = badWordsFound(input);
			if (!badWords.isEmpty()) {
				return "This message was blocked because a bad word was found. If you believe this word should not be blocked, please message support.";
			}
			return input;
		}
	}
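To see how the good-word exemption behaves, here is a small self-contained sketch with a placeholder entry (“heck”, exempted by “check”). The class name, method name and sample words are assumptions; the matching logic mirrors the check in badWordsFound.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo of the exemption step; expects already-normalized input.
class ExemptionDemo {

    static List<String> find(String normalized, Map<String, String[]> words, int maxLen) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < normalized.length(); start++) {
            for (int len = 1; len <= maxLen && start + len <= normalized.length(); len++) {
                String candidate = normalized.substring(start, start + len);
                String[] goodWords = words.get(candidate);
                if (goodWords == null) {
                    continue; // not a filtered word
                }
                boolean exempt = false;
                for (String good : goodWords) {
                    // the good word is searched for in the whole message
                    if (normalized.contains(good)) {
                        exempt = true;
                        break;
                    }
                }
                if (!exempt) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String[]> words = new HashMap<>();
        words.put("heck", new String[]{"check"});
        System.out.println(find("whattheheck", words, 4));      // [heck]
        System.out.println(find("pleasecheckthis", words, 4));  // []
        System.out.println(find("checkwhattheheck", words, 4)); // []
    }
}
```

The last call shows a trade-off of this design: because the good word is matched anywhere in the message, a single “check” also exempts a separate “heck” elsewhere in the same input.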

One thought on “[Profanity Filter] Efficiently filtering combinations of bad words out of string inputs”

James says:

July 12, 2016 at 4:48 pm

Mind sharing your list of bad words? I’m working on a hackathon project for web-browsers and need a decent starting list prototype.
