Friday, 9 August 2013

Removing characters from a String in Python

[Hoping a mod can help make this question clearer]
I have a huge CSV file of tweets. I read the tweets in and stored them in
two separate dictionaries, one for negative tweets and one for positive.
While parsing each tweet I also want to remove any punctuation marks. This
is the code I've used:
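import string

tweets = []
# pos_tweets and neg_tweets map tweet text to sentiment (Python 2)
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    # Chain lower() and translate(); joining them with `and` would
    # discard the lowercased string and translate the original word.
    shortenedText = [e.lower().translate(string.maketrans("", ""),
                                         string.punctuation)
                     for e in text.split()
                     if len(e) >= 3 and not e.startswith('http')]
    print shortenedText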
It has all worked well barring one minor problem. The huge CSV file I
downloaded has unfortunately mangled some of the punctuation. I'm not sure
what this is called, so I can't really google it, but effectively some
tweets begin like this:
"ampampFightin"
"&quot;The truth is out there"
"&altThis is the way I feel"
Is there a way to get rid of all of these? I notice that the latter two
begin with an ampersand. Will a simple search for that get rid of them?
(The only reason I'm asking rather than just trying it is that there are
too many tweets for me to check manually.)
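
The fragments above look like HTML entities (for example, "ampamp" is what "&amp;amp;" would leave behind once the punctuation is stripped), although that is an assumption about how the file was produced. If so, one option is to decode the entities before removing punctuation, since afterwards the leftover letters cannot be told apart from real words. Below is a minimal sketch in Python 3 using the standard library's html.unescape; the function names decode_entities and clean_tweet, and the max_rounds limit, are made up for illustration.

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def decode_entities(text, max_rounds=5):
    """Decode HTML entities, repeating in case they were encoded more than once."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text


def clean_tweet(text):
    """Decode entities first, then filter, lowercase and strip punctuation per word."""
    words = decode_entities(text).split()
    cleaned = [w.lower().translate(_STRIP_PUNCT)
               for w in words
               if len(w) >= 3 and not w.startswith("http")]
    return [w for w in cleaned if w]  # drop words that were only punctuation


print(clean_tweet("&amp;amp;Fightin for it http://t.co/x"))
```

Decoding is repeated because a string such as "&amp;amp;" needs two passes to become a plain "&". Anything that is not a recognised entity is left alone by html.unescape and then loses its ampersand in the punctuation step, so this only helps if the raw file still contains the "&" and ";" characters.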
