I am using spaCy's nlp.pipe() to get Doc objects for the text data in a pandas DataFrame column, but the parsed text returned as "text" in the code has a length of only 32, even though the shape of the DataFrame is (14640, 16).
Here is the data link if someone wants to read the data.
nlp = spacy.load("en_core_web_sm")
for text in nlp.pipe(iter(df['text']), batch_size = 1000, n_threads=-1):
    print(text)
len(text)
Can someone help me understand what is going on? What am I doing wrong?
According to the spaCy documentation for the Doc object, the __len__ operator gets "the number of tokens in the document."
The last text in your data is:
>>> df['text'].values[-1]
@AmericanAir we have 8 ppl so we need 2 know how many seats are on the next flight. Plz put us on standby for 4 people on the next flight?
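After running the nlp.pipe() method, this sentence is tokenized into 32 tokens, which is the number you are seeing: once the loop finishes, text refers to the last doc only, so len(text) is that doc's token count, not the number of rows. To verify that, try running the following code after len(text) and you will get the same result: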
>>> last_tokens = [token for token in text]
>>> last_tokens
[@AmericanAir, we, have, 8, ppl, so, we, need, 2, know, how, many, seats, are, on, the, next, flight, ., Plz, put, us, on, standby, for, 4, people, on, the, next, flight, ?]
>>> len(last_tokens)
32
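To make the difference between "number of docs" and "number of tokens in one doc" concrete, here is a minimal sketch. It assumes a blank English pipeline (tokenizer only) in place of en_core_web_sm so it runs without a model download, and it uses a small hypothetical list of strings as a stand-in for df['text']. n_threads is left out, since it is not needed for the illustration.

```python
import spacy

# Assumption: spacy.blank("en") (tokenizer only) stands in for
# spacy.load("en_core_web_sm") so no model download is required.
nlp = spacy.blank("en")

# Hypothetical stand-in for df['text'].
texts = [
    "Hello world!",
    "Plz put us on standby for 4 people on the next flight?",
]

doc = None
for doc in nlp.pipe(texts, batch_size=1000):
    pass  # when the loop ends, `doc` still refers to the last Doc only

# Collect every Doc so the number of documents can be counted.
docs = list(nlp.pipe(texts, batch_size=1000))

n_docs = len(docs)                       # one Doc per input text
n_tokens_last = len(doc)                 # token count of the last Doc
tokens_per_doc = [len(d) for d in docs]  # token count of every Doc

print("docs:", n_docs)
print("tokens in last doc:", n_tokens_last)
print("tokens per doc:", tokens_per_doc)
```

With the real DataFrame, len(docs) should match the number of rows, while len() of any single doc is only that row's token count.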
You can iterate over the tokens of each doc returned from the pipeline like so:
import spacy

nlp = spacy.load("en_core_web_sm")
for text in nlp.pipe(iter(df['text']), batch_size = 1000, n_threads=-1):
    # each `text` is a Doc; iterating over it yields its tokens
    for token in text:
        print(token)
    print('\n')
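If you would rather keep the tokens than print them, a sketch along these lines collects the token strings of each doc into a list. As an assumption for illustration, it again uses a blank English pipeline and a hypothetical list of strings rather than the actual model and df['text'].

```python
import spacy

# Assumption: tokenizer-only pipeline instead of en_core_web_sm.
nlp = spacy.blank("en")

# Hypothetical stand-in for df['text'].
texts = ["we need 2 know", "put us on standby"]

# One inner list of token strings per input text.
tokens_by_doc = [[token.text for token in doc] for doc in nlp.pipe(texts)]

for tokens in tokens_by_doc:
    print(tokens)
```

The outer list has one entry per input text, so it lines up with the rows of the original column.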