Converting a text document with special format to Pandas DataFrame











10















I have a text file with the following format:



1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345


I need to convert this text to a DataFrame with the following format:



Id Term weight
1 frack 0.733
1 shale 0.700
10 space 0.645
10 station 0.327
10 nasa 0.258
4 celebr 0.262
4 bahar 0.345


How can I do it?


































  • I can only think of regex helping here.

    – amanb
    4 hours ago







  • 1





    Depending on how large/long your file is, you can loop through the file without pandas to format it properly first.

    – Quang Hoang
    4 hours ago











  • It can be done with explode and split

    – Wen-Ben
    4 hours ago











  • Also, when you read the text into pandas, what is the format of the df?

    – Wen-Ben
    4 hours ago












  • The data is in text format.

    – Mary
    4 hours ago
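The "explode and split" idea from the comments could look roughly like the sketch below. It is only one possible reading of that comment, not code from any commenter. It assumes pandas 0.25 or later (where DataFrame.explode exists) and uses an in-memory copy of the sample data; the names raw, parts and long are made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for the text file from the question.
raw = io.StringIO(
    "1: frack 0.733, shale 0.700, \n"
    "10: space 0.645, station 0.327, nasa 0.258,\n"
    "4: celebr 0.262, bahar 0.345\n"
)

# One row per input line, split once at the first colon.
lines = pd.Series(raw.read().splitlines())
parts = lines.str.split(":", n=1, expand=True)
parts.columns = ["Id", "pairs"]

# Split the remainder on commas, then give every piece its own row.
parts["pairs"] = parts["pairs"].str.split(",")
long = parts.explode("pairs").reset_index(drop=True)

# Drop the empty pieces left behind by trailing commas.
long["pairs"] = long["pairs"].str.strip()
long = long[long["pairs"] != ""]

# Split "term weight" on whitespace and assemble the typed result.
split = long["pairs"].str.split(expand=True)
df = pd.DataFrame(
    {
        "Id": long["Id"].astype(int),
        "Term": split[0],
        "weight": split[1].astype(float),
    }
).reset_index(drop=True)
print(df)
```

To read from a real file instead, the lines Series could be built from open(path).read().splitlines().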

















python pandas














edited 2 hours ago by Brad Solomon

asked 4 hours ago by Mary















8 Answers


















9














Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



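import re
import pandas as pd

# A literal colon followed by one or more whitespace characters.
SEP_RE = re.compile(r":\s+")
# A term made of ASCII letters, whitespace, then a decimal weight such as 0.733.
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)


def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                # Split off the ID once; everything after it holds the pairs.
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))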


Example:



>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
...                   columns=["Id", "Term", "weight"])
>>>
>>> df
   Id     Term  weight
0   1    frack   0.733
1   1    shale   0.700
2  10    space   0.645
3  10  station   0.327
4  10     nasa   0.258
5   4   celebr   0.262
6   4    bahar   0.345

>>> df.dtypes
Id          int64
Term       object
weight    float64
dtype: object



Walkthrough



SEP_RE looks for an initial separator: a literal : followed by one or more whitespace characters. The call to SEP_RE.split() passes maxsplit=1 so that it stops after the first split. Granted, this assumes your data is strictly formatted: that your entire dataset consistently follows the example format laid out in your question.



After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).



An easy way to visualize this is to use an example line from your file as a string:



>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']


Now you have the initial ID and rest of the components, which you can unpack into two identifiers.



>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'


The better way to visualize it is with pdb. Give it a try if you dare ;)



Disclaimer



This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.



For instance, it assumes that each Term can only take upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re character classes such as \w.
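As a rough illustration of that last point, here is a variant of the data pattern that accepts non-ASCII letters. It is a sketch under the assumption that a term is made of letters only (no digits or underscores); the name UNICODE_DATA_RE and the sample string are made up for this example.

```python
import re

# Assumption: a term is one or more letters in any script.
# [^\W\d_] reads as "a \w character that is neither a digit nor an underscore".
UNICODE_DATA_RE = re.compile(r"(?P<term>[^\W\d_]+)\s+(?P<weight>\d+\.\d+)")

# Hypothetical remainder of a line, after the ID and separator are split off.
rest = "нефть 0.512, straße 0.301, shale 0.250,\n"

pairs = [(m["term"], float(m["weight"])) for m in UNICODE_DATA_RE.finditer(rest)]
print(pairs)
```

In Python 3, str patterns are Unicode-aware by default, so no extra flag is needed. Plain \w+ would also work if digits and underscores are acceptable inside a term.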


























  • 2





    Brilliant answer, I must say.

    – amanb
    4 hours ago











  • @amanb Thank you!

    – Brad Solomon
    4 hours ago


















3














You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:



import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ),
    columns=["Id", "Term", "weight"]
)

print(df)
#   Id     Term weight
#0   1    frack  0.733
#1   1    shale  0.700
#2  10    space  0.645
#3  10  station  0.327
#4  10     nasa  0.258
#5   4   celebr  0.262
#6   4    bahar  0.345


Explanation



I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :



print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'],
# ['10', ' space 0.645, station 0.327, nasa 0.258'],
# ['4', ' celebr 0.262, bahar 0.345']]


The next step is to split on the comma to separate the values, and assign the Id to each set of values:



print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
#  ('10', 'station', '0.327'),
#  ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]


Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.



Note: The * unpacking inside a tuple literal requires Python 3.5 or later.
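For readers who find the nested map/lambda calls hard to follow, the same split-and-flatten steps can also be sketched as a single nested comprehension. This is an alternative formulation, not the code from the answer above; the name rows is made up, and as in the answer all values stay strings.

```python
import pandas as pd

text = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

rows = [
    (id_, *pair.split())                      # ('1', 'frack', '0.733')
    for id_, rest in (
        # strip stray commas/spaces, then split once at the colon
        line.strip(" ,").split(":", 1) for line in text.splitlines()
    )
    for pair in rest.split(",")               # one piece per "term weight"
    if pair.strip()                           # ignore empty pieces
]

df = pd.DataFrame(rows, columns=["Id", "Term", "weight"])
print(df)
```

The outer loop walks the lines and the inner loop walks the pairs within each line, so the result is already flat and chain.from_iterable is not needed.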






































    3














Assuming your data file looks like the sample in the question:

import pandas as pd

# a multi-character separator is treated as a regex, which needs the Python engine
df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')
df.set_index(0, inplace=True)

# split the `,`
df = df[1].str.strip().str.split(',', expand=True)

#    0             1              2           3
#--  ------------  -------------  ----------  ---
# 1  frack 0.733   shale 0.700
#10  space 0.645   station 0.327  nasa 0.258
# 4  celebr 0.262  bahar 0.345

# stack and drop empty
df = df.stack()
df = df[~df.eq('')]

# split ' '
df = df.str.strip().str.split(' ', expand=True)

# edit to give final expected output:

# rename index and columns for reset_index
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']

# final df
final_df = df.reset_index().drop('to_drop', axis=1)






























    • How do you avoid an error with sep=': ', which is a two-character separator?

      – Rebin
      4 hours ago






    • 1





      @Rebin add engine='python'

      – pault
      4 hours ago











    • @pault weird, 'cause I already split by ' '. It yields correct data on my computer.

      – Quang Hoang
      4 hours ago











    • I don't know how to add engine='python'. What is the command?

      – Rebin
      4 hours ago






    • 1





      @Rebin add it as a param to pd.read_csv - df = pd.read_csv(..., engine='python')

      – pault
      4 hours ago
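To make the engine='python' suggestion from these comments concrete, here is a small self-contained sketch. It reads an in-memory copy of the sample data instead of a file; the variable name sample is made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for the text file from the question.
sample = io.StringIO(
    "1: frack 0.733, shale 0.700,\n"
    "10: space 0.645, station 0.327, nasa 0.258,\n"
    "4: celebr 0.262, bahar 0.345\n"
)

# A separator longer than one character is interpreted as a regular
# expression, which only the Python parsing engine supports.
df = pd.read_csv(sample, sep=": ", header=None, engine="python")
print(df)
```

Without engine="python", pandas falls back to the Python engine on its own but emits a ParserWarning, so passing it explicitly mainly silences the warning.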


















    1














Just to put my two cents in: you could write yourself a parser and feed the result into pandas:

import pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

file = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""

grammar = Grammar(
    r"""
    expr    = line+

    line    = id colon pair*
    pair    = term ws weight sep? ws?

    id      = ~"\d+"
    colon   = ws? ":" ws?
    sep     = ws? "," ws?

    term    = ~"[a-zA-Z]+"
    weight  = ~"\d+(?:\.\d+)?"

    ws      = ~"\s+"
    """
)

tree = grammar.parse(file)

class PandasVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        term, _, weight, *_ = visited_children
        return (term.text, weight.text)

    def visit_line(self, node, visited_children):
        id, _, pairs = visited_children
        return [(id.text, *pair) for pair in pairs]

    def visit_expr(self, node, visited_children):
        return [item for lst in visited_children for item in lst]

pv = PandasVisitor()
result = pv.visit(tree)

df = pd.DataFrame(result, columns=["Id", "Term", "weight"])
print(df)

This yields

   Id     Term weight
0   1    frack  0.733
1   1    shale  0.700
2  10    space  0.645
3  10  station  0.327
4  10     nasa  0.258
5   4   celebr  0.262
6   4    bahar  0.345



































      0














Here is another take on your question: build a list that contains one [Id, Term, weight] list for every term, then produce the DataFrame from it.

import pandas as pd
file=r"give_your_path".replace('\\', '/')
my_list_of_lists=[]#creating an empty list which will contain lists of [Id Term Weight]
with open(file,"r") as f:
    for line in f:#looping every line
        my_id=[line.split(":")[0]]#storing the Id in order to use it in every term
        for term in [s.split() for s in line[line.find(":")+1:].split(",")]:
            if term:#skipping the empty piece left by a trailing comma
                my_list_of_lists.append(my_id+term)
df=pd.DataFrame.from_records(my_list_of_lists)#turning the lists to dataframe
df.columns=["Id","Term","weight"]#giving columns their names



































        0














It is possible to do this entirely in pandas:

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """), sep=":", header=None)

#df:
#    0  1
#0   1  frack 0.733, shale 0.700,
#1  10  space 0.645, station 0.327, nasa 0.258,
#2   4  celebr 0.262, bahar 0.345

Turn column 1 into a list and then expand:

df[1] = df[1].str.split(",", expand=False)

dfs = []
for idx, rows in df.iterrows():
    # one small frame per input line: the Id repeated for each of its terms
    dfslice = pd.DataFrame({"Id": [rows[0]]*len(rows[1]), "terms": rows[1]})
    dfs.append(dfslice)
newdf = pd.concat(dfs, ignore_index=True)

# this creates newdf:
#   Id          terms
#0   1    frack 0.733
#1   1    shale 0.700
#2   1
#3  10    space 0.645
#4  10  station 0.327
#5  10     nasa 0.258
#6  10
#7   4   celebr 0.262
#8   4    bahar 0.345

Now we need to split the terms column on the space and drop the empty rows:

newdf["terms"] = newdf["terms"].str.strip()
newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
newdf.columns = ["Id", "terms", "Term", "Weights"]
newdf = newdf.drop("terms", axis=1).dropna()

Resulting newdf:

   Id     Term Weights
0   1    frack   0.733
1   1    shale   0.700
3  10    space   0.645
4  10  station   0.327
5  10     nasa   0.258
7   4   celebr   0.262
8   4    bahar   0.345
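Note that in the result above the Weights column still holds strings and the index has gaps where the empty rows were dropped. If numeric weights and a clean index are wanted, a follow-up step along these lines might help. This is a sketch on a hand-built three-row stand-in for newdf, not part of the answer's code.

```python
import pandas as pd

# Hand-built stand-in for the first rows of newdf (weights are still strings).
newdf = pd.DataFrame(
    {
        "Id": [1, 1, 10],
        "Term": ["frack", "shale", "space"],
        "Weights": ["0.733", "0.700", "0.645"],
    },
    index=[0, 1, 3],
)

# Convert the string weights to floats and renumber the rows from 0.
newdf["Weights"] = pd.to_numeric(newdf["Weights"])
newdf = newdf.reset_index(drop=True)
print(newdf.dtypes)
```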



































          0














Can I assume that there is exactly one space before each term? The version below does not depend on it, because it splits each item on any run of whitespace:

import pandas as pd

df=pd.DataFrame(columns=['ID','Term','Weight'])
with open('C:/random/d1','r') as readObject:
    for line in readObject:
        line=line.rstrip('\n')
        tempList1=line.split(':')
        tempList2=tempList1[1]
        tempList2=tempList2.split(',')
        for item in tempList2:
            e=item.split()  # splits on whitespace and ignores leading spaces
            if len(e)==2:   # skips the empty item left by a trailing comma
                tempRow=[tempList1[0], e[0],e[1]]
                df.loc[len(df)]=tempRow
print(df)



































            -3














1) You can read the file row by row.

2) Then you can split on ':' to get the index and on ',' to get the values.

1)

with open('path/filename.txt','r') as filename:
    content = filename.readlines()

2)

# strip each part so that leading spaces and newlines are not kept
content = [[part.strip() for part in x.split(':')] for x in content]

This will give you the following result:

content =[
    ['1','frack 0.733, shale 0.700,'],
    ['10', 'space 0.645, station 0.327, nasa 0.258,'],
    ['4','celebr 0.262, bahar 0.345']]























            • 3





              Your result is not the result asked for in the question.

              – GiraffeMan91
              4 hours ago
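One way the missing second step (splitting on ',' and building the DataFrame) could be finished is sketched below. It is not the answerer's code. The content list is a hypothetical stand-in for what readlines() would return for the question's file, and rows is a made-up name.

```python
import pandas as pd

# Hypothetical stand-in for the result of readlines() on the question's file.
content = [
    "1: frack 0.733, shale 0.700, \n",
    "10: space 0.645, station 0.327, nasa 0.258,\n",
    "4: celebr 0.262, bahar 0.345\n",
]

rows = []
for line in content:
    # Everything before the first ':' is the Id, the rest holds the pairs.
    id_, _, rest = line.partition(":")
    for pair in rest.split(","):
        fields = pair.split()          # ['frack', '0.733'], or [] for empties
        if len(fields) == 2:
            term, weight = fields
            rows.append((int(id_), term, float(weight)))

df = pd.DataFrame(rows, columns=["Id", "Term", "weight"])
print(df)
```

Collecting plain tuples first and calling the DataFrame constructor once at the end avoids growing a DataFrame row by row.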











            Your Answer






            StackExchange.ifUsing("editor", function ()
            StackExchange.using("externalEditor", function ()
            StackExchange.using("snippets", function ()
            StackExchange.snippets.init();
            );
            );
            , "code-snippets");

            StackExchange.ready(function()
            var channelOptions =
            tags: "".split(" "),
            id: "1"
            ;
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function()
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled)
            StackExchange.using("snippets", function()
            createEditor();
            );

            else
            createEditor();

            );

            function createEditor()
            StackExchange.prepareEditor(
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader:
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            ,
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            );



            );













            draft saved

            draft discarded


















            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55799784%2fconverting-a-text-document-with-special-format-to-pandas-dataframe%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown

























            8 Answers
            8






            active

            oldest

            votes








            8 Answers
            8






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            9














            Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



            import re
            import pandas as pd

            SEP_RE = re.compile(r":s+")
            DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


            def parse(filepath: str):
            def _parse(filepath):
            with open(filepath) as f:
            for line in f:
            id, rest = SEP_RE.split(line, maxsplit=1)
            for match in DATA_RE.finditer(rest):
            yield [int(id), match["term"], float(match["weight"])]
            return list(_parse(filepath))


            Example:



            >>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
            ... columns=["Id", "Term", "weight"])
            >>>
            >>> df
            Id Term weight
            0 1 frack 0.733
            1 1 shale 0.700
            2 10 space 0.645
            3 10 station 0.327
            4 10 nasa 0.258
            5 4 celebr 0.262
            6 4 bahar 0.345

            >>> df.dtypes
            Id int64
            Term object
            weight float64
            dtype: object



            Walkthrough



            SEP_RE looks for an initial separator: a literal : followed by one or more spaces. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.



            After that, DATA_RE.finditer() deals with each (term, weight) pair extraxted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).



            An easy way to visualize this is to use an example line from your file as a string:



            >>> line = "1: frack 0.733, shale 0.700,n"
            >>> SEP_RE.split(line, maxsplit=1)
            ['1', 'frack 0.733, shale 0.700,n']


            Now you have the initial ID and rest of the components, which you can unpack into two identifiers.



            >>> id, rest = SEP_RE.split(line, maxsplit=1)
            >>> it = DATA_RE.finditer(rest)
            >>> match = next(it)
            >>> match
            <re.Match object; span=(0, 11), match='frack 0.733'>
            >>> match["term"]
            'frack'
            >>> match["weight"]
            '0.733'


            The better way to visualize it is with pdb. Give it a try if you dare ;)



            Disclaimer



            This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.



            For instance, it assumes that each each Term can only take upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re characters such as w.






            share|improve this answer




















            • 2





              Brilliant answer, I must say.

              – amanb
              4 hours ago











            • @amanb Thank you!

              – Brad Solomon
              4 hours ago















            9














            Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



            import re
            import pandas as pd

            SEP_RE = re.compile(r":s+")
            DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


            def parse(filepath: str):
            def _parse(filepath):
            with open(filepath) as f:
            for line in f:
            id, rest = SEP_RE.split(line, maxsplit=1)
            for match in DATA_RE.finditer(rest):
            yield [int(id), match["term"], float(match["weight"])]
            return list(_parse(filepath))


            Example:



            >>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
            ... columns=["Id", "Term", "weight"])
            >>>
            >>> df
            Id Term weight
            0 1 frack 0.733
            1 1 shale 0.700
            2 10 space 0.645
            3 10 station 0.327
            4 10 nasa 0.258
            5 4 celebr 0.262
            6 4 bahar 0.345

            >>> df.dtypes
            Id int64
            Term object
            weight float64
            dtype: object



            Walkthrough



            SEP_RE looks for an initial separator: a literal : followed by one or more spaces. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.



            After that, DATA_RE.finditer() deals with each (term, weight) pair extraxted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).



            An easy way to visualize this is to use an example line from your file as a string:



            >>> line = "1: frack 0.733, shale 0.700,n"
            >>> SEP_RE.split(line, maxsplit=1)
            ['1', 'frack 0.733, shale 0.700,n']


            Now you have the initial ID and rest of the components, which you can unpack into two identifiers.



            >>> id, rest = SEP_RE.split(line, maxsplit=1)
            >>> it = DATA_RE.finditer(rest)
            >>> match = next(it)
            >>> match
            <re.Match object; span=(0, 11), match='frack 0.733'>
            >>> match["term"]
            'frack'
            >>> match["weight"]
            '0.733'


            The better way to visualize it is with pdb. Give it a try if you dare ;)



            Disclaimer



            This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.



            For instance, it assumes that each each Term can only take upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re characters such as w.






            share|improve this answer




















            • 2





              Brilliant answer, I must say.

              – amanb
              4 hours ago











            • @amanb Thank you!

              – Brad Solomon
              4 hours ago













            9












            9








            9







            Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



            import re
            import pandas as pd

            SEP_RE = re.compile(r":s+")
            DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


            def parse(filepath: str):
            def _parse(filepath):
            with open(filepath) as f:
            for line in f:
            id, rest = SEP_RE.split(line, maxsplit=1)
            for match in DATA_RE.finditer(rest):
            yield [int(id), match["term"], float(match["weight"])]
            return list(_parse(filepath))


            Example:



            >>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
            ... columns=["Id", "Term", "weight"])
            >>>
            >>> df
            Id Term weight
            0 1 frack 0.733
            1 1 shale 0.700
            2 10 space 0.645
            3 10 station 0.327
            4 10 nasa 0.258
            5 4 celebr 0.262
            6 4 bahar 0.345

            >>> df.dtypes
            Id int64
            Term object
            weight float64
            dtype: object



            Walkthrough



            SEP_RE looks for an initial separator: a literal : followed by one or more spaces. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.



            After that, DATA_RE.finditer() deals with each (term, weight) pair extraxted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).



            An easy way to visualize this is to use an example line from your file as a string:



            >>> line = "1: frack 0.733, shale 0.700,\n"
            >>> SEP_RE.split(line, maxsplit=1)
            ['1', 'frack 0.733, shale 0.700,\n']


            Now you have the initial ID and rest of the components, which you can unpack into two identifiers.



            >>> id, rest = SEP_RE.split(line, maxsplit=1)
            >>> it = DATA_RE.finditer(rest)
            >>> match = next(it)
            >>> match
            <re.Match object; span=(0, 11), match='frack 0.733'>
            >>> match["term"]
            'frack'
            >>> match["weight"]
            '0.733'
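To see every pair on a line rather than only the first match, here is a small sketch that reuses the two patterns from above on a sample line. The variable names id_ and rows are just for illustration and are not part of the answer's code.

```python
import re

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)

# One sample line in the format shown in the question
line = "10: space 0.645, station 0.327, nasa 0.258,\n"

# Split off the ID once, then collect one row per (term, weight) match
id_, rest = SEP_RE.split(line, maxsplit=1)
rows = [[int(id_), m["term"], float(m["weight"])] for m in DATA_RE.finditer(rest)]

print(rows)
# [[10, 'space', 0.645], [10, 'station', 0.327], [10, 'nasa', 0.258]]
```

Each inner list has the same shape as the rows that parse() yields, so a list like this can go straight into pd.DataFrame.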


            The better way to visualize it is with pdb. Give it a try if you dare ;)



            Disclaimer



            This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.



            For instance, it assumes that each Term contains only upper- or lowercase ASCII letters, nothing else. If your terms contain other Unicode characters, you would want to look into other re character classes such as \w.
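As a rough sketch of that last point: one possible variant swaps [a-z] for [^\W\d_], which matches any word character that is neither a digit nor an underscore, so letters from any script are accepted. The name UNICODE_DATA_RE and the sample string are made up for illustration and do not come from the question.

```python
import re

# Hypothetical variant of DATA_RE: [^\W\d_] matches any Unicode letter
# (a word character that is not a digit and not an underscore).
UNICODE_DATA_RE = re.compile(r"(?P<term>[^\W\d_]+)\s+(?P<weight>\d+\.\d+)")

# Made-up sample with non-ASCII terms
rest = "café 0.512, naïve 0.301,\n"

pairs = [(m["term"], float(m["weight"])) for m in UNICODE_DATA_RE.finditer(rest)]
# pairs now holds one (term, weight) tuple per match
```

Plain \w+ would also work, but it additionally accepts digits and underscores inside a term, which may or may not be what you want.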














            answered 4 hours ago (edited 4 hours ago)
            Brad Solomon







            • 2  Brilliant answer, I must say.
              – amanb 4 hours ago

            • @amanb Thank you!
              – Brad Solomon 4 hours ago


























            You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:



            import pandas as pd
            from itertools import chain

            text = """1: frack 0.733, shale 0.700,
            10: space 0.645, station 0.327, nasa 0.258,
            4: celebr 0.262, bahar 0.345 """

            df = pd.DataFrame(
                list(
                    chain.from_iterable(
                        map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
                        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
                    )
                ),
                columns=["Id", "Term", "weight"]
            )

            print(df)
            #   Id     Term weight
            #0   1    frack  0.733
            #1   1    shale  0.700
            #2  10    space  0.645
            #3  10  station  0.327
            #4  10     nasa  0.258
            #5   4   celebr  0.262
            #6   4    bahar  0.345


            Explanation



            I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :



            print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
            #[['1', ' frack 0.733, shale 0.700'],
            # ['10', ' space 0.645, station 0.327, nasa 0.258'],
            # ['4', ' celebr 0.262, bahar 0.345']]


            The next step is to split on the comma to separate the values, and assign the Id to each set of values:



            print(
                [
                    list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
                    map(lambda x: x.strip(" ,").split(":"), text.splitlines())
                ]
            )
            #[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
            # [('10', 'space', '0.645'),
            #  ('10', 'station', '0.327'),
            #  ('10', 'nasa', '0.258')],
            # [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]


            Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.
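For anyone unfamiliar with the flattening step, here is a minimal stand-alone sketch on a hand-written nested list. The type conversion at the end is an extra step that is not part of the answer above, which leaves every field as a string.

```python
from itertools import chain

# Hand-written stand-in for the per-line lists of (Id, Term, weight) tuples
nested = [
    [("1", "frack", "0.733"), ("1", "shale", "0.700")],
    [("4", "celebr", "0.262"), ("4", "bahar", "0.345")],
]

# chain.from_iterable walks each inner list in turn, yielding its tuples
flat = list(chain.from_iterable(nested))

# Optional extra step (an addition): turn Id and weight into numbers
typed = [(int(i), term, float(w)) for i, term, w in flat]

print(typed)
# [(1, 'frack', 0.733), (1, 'shale', 0.7), (4, 'celebr', 0.262), (4, 'bahar', 0.345)]
```

If you skip the conversion, something like df.astype({"Id": int, "weight": float}) after building the DataFrame should give numeric columns as well.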



            Note: The * unpacking inside a tuple literal requires Python 3.5 or later.














                answered 4 hours ago (edited 4 hours ago)
                pault



































                    Assuming your data file looks like the sample in the question:



                    import pandas as pd

                    # a multi-character separator needs the Python parsing engine
                    df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')
                    df.set_index(0, inplace=True)

                    # split the `,`
                    df = df[1].str.strip().str.split(',', expand=True)

                    #     0             1              2           3
                    # --  ------------  -------------  ----------  ---
                    #  1  frack 0.733   shale 0.700
                    # 10  space 0.645   station 0.327  nasa 0.258
                    #  4  celebr 0.262  bahar 0.345

                    # stack and drop empty
                    df = df.stack()
                    df = df[~df.eq('')]

                    # split ' '
                    df = df.str.strip().str.split(' ', expand=True)

                    # edit to give final expected output:

                    # rename index and columns for reset_index
                    df.index.names = ['Id', 'to_drop']
                    df.columns = ['Term', 'weight']

                    # final df
                    final_df = df.reset_index().drop('to_drop', axis=1)
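On the two-character separator: pandas treats a separator longer than one character as a regular expression, which only the Python parsing engine supports, so passing engine='python' explicitly avoids the fallback warning. A small sketch of just that first step, using an in-memory buffer in place of the file (the sample variable is a stand-in for untitled.txt):

```python
import io
import pandas as pd

# In-memory stand-in for the text file from the question
sample = io.StringIO(
    "1: frack 0.733, shale 0.700,\n"
    "10: space 0.645, station 0.327, nasa 0.258,\n"
    "4: celebr 0.262, bahar 0.345\n"
)

# sep=": " is longer than one character, so request the Python engine explicitly
df = pd.read_csv(sample, sep=": ", header=None, engine="python")

print(df.shape)
# (3, 2)
```

Column 0 holds the Ids and column 1 holds the rest of each line, ready for the splitting steps shown above.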













                    answered 4 hours ago (edited 4 hours ago)
                    Quang Hoang












                    • How do you not get an error with sep=': ', which is a two-character separator?
                      – Rebin 4 hours ago

                    • 1  @Rebin add engine='python'
                      – pault 4 hours ago

                    • @pault weird, 'cause I already split by ' '. It yields correct data on my computer.
                      – Quang Hoang 4 hours ago

                    • I don't know how to add engine='python'. What is the command?
                      – Rebin 4 hours ago

                    • 1  @Rebin add it as a param to pd.read_csv - df = pd.read_csv(..., engine='python')
                      – pault 4 hours ago































                    Just to put my two cents in: you could write yourself a parser and feed the result into pandas:



                    import pandas as pd
                    from parsimonious.grammar import Grammar
                    from parsimonious.nodes import NodeVisitor

                    file = """1: frack 0.733, shale 0.700,
                    10: space 0.645, station 0.327, nasa 0.258,
                    4: celebr 0.262, bahar 0.345
                    """

                    grammar = Grammar(
                        r"""
                        expr    = line+

                        line    = id colon pair*
                        pair    = term ws weight sep? ws?

                        id      = ~"\d+"
                        colon   = ws? ":" ws?
                        sep     = ws? "," ws?

                        term    = ~"[a-zA-Z]+"
                        weight  = ~"\d+(?:\.\d+)?"

                        ws      = ~"\s+"
                        """
                    )

                    tree = grammar.parse(file)

                    class PandasVisitor(NodeVisitor):
                        def generic_visit(self, node, visited_children):
                            return visited_children or node

                        def visit_pair(self, node, visited_children):
                            # keep only the term and weight; drop whitespace and separators
                            term, _, weight, *_ = visited_children
                            return (term.text, weight.text)

                        def visit_line(self, node, visited_children):
                            # prepend the line's id to each of its pairs
                            id, _, pairs = visited_children
                            return [(id.text, *pair) for pair in pairs]

                        def visit_expr(self, node, visited_children):
                            # flatten the per-line lists into one list of rows
                            return [item for lst in visited_children for item in lst]

                    pv = PandasVisitor()
                    result = pv.visit(tree)

                    df = pd.DataFrame(result, columns=["Id", "Term", "weight"])
                    print(df)


                    This yields



                       Id     Term weight
                    0   1    frack  0.733
                    1   1    shale  0.700
                    2  10    space  0.645
                    3  10  station  0.327
                    4  10     nasa  0.258
                    5   4   celebr  0.262
                    6   4    bahar  0.345





                    share|improve this answer



























                      1














                      Just to put my two cents in: you could write yourself a parser and feed the result into pandas:



                      import pandas as pd
                      from parsimonious.grammar import Grammar
                      from parsimonious.nodes import NodeVisitor

                      file = """1: frack 0.733, shale 0.700,
                      10: space 0.645, station 0.327, nasa 0.258,
                      4: celebr 0.262, bahar 0.345
                      """

                      grammar = Grammar(
                      r"""
                      expr = line+

                      line = id colon pair*
                      pair = term ws weight sep? ws?

                      id = ~"d+"
                      colon = ws? ":" ws?
                      sep = ws? "," ws?

                      term = ~"[a-zA-Z]+"
                      weight = ~"d+(?:.d+)?"

                      ws = ~"s+"
                      """
                      )

                      tree = grammar.parse(file)

                      class PandasVisitor(NodeVisitor):
                      def generic_visit(self, node, visited_children):
                      return visited_children or node

                      def visit_pair(self, node, visited_children):
                      term, _, weight, *_ = visited_children
                      return (term.text, weight.text)

                      def visit_line(self, node, visited_children):
                      id, _, pairs = visited_children
                      return [(id.text, *pair) for pair in pairs]

                      def visit_expr(self, node, visited_children):
                      return [item for lst in visited_children for item in lst]

                      pv = PandasVisitor()
                      result = pv.visit(tree)

                      df = pd.DataFrame(result, columns=["Id", "Term", "weight"])
                      print(df)


                      This yields



                       Id Term weight
                      0 1 frack 0.733
                      1 1 shale 0.700
                      2 10 space 0.645
                      3 10 station 0.327
                      4 10 nasa 0.258
                      5 4 celebr 0.262
                      6 4 bahar 0.345





                      share|improve this answer

























                        1












                        1








                        1







Just to put in my two cents: you could write yourself a parser and feed the result into pandas:

    import pandas as pd
    from parsimonious.grammar import Grammar
    from parsimonious.nodes import NodeVisitor

    file = """1: frack 0.733, shale 0.700,
    10: space 0.645, station 0.327, nasa 0.258,
    4: celebr 0.262, bahar 0.345
    """

    grammar = Grammar(
        r"""
        expr    = line+

        line    = id colon pair*
        pair    = term ws weight sep? ws?

        id      = ~"\d+"
        colon   = ws? ":" ws?
        sep     = ws? "," ws?

        term    = ~"[a-zA-Z]+"
        weight  = ~"\d+(?:\.\d+)?"

        ws      = ~"\s+"
        """
    )

    tree = grammar.parse(file)

    class PandasVisitor(NodeVisitor):
        def generic_visit(self, node, visited_children):
            # fall back to the raw node for leaves without their own visit_* method
            return visited_children or node

        def visit_pair(self, node, visited_children):
            # pair = term ws weight sep? ws?  -> keep only term and weight
            term, _, weight, *_ = visited_children
            return (term.text, weight.text)

        def visit_line(self, node, visited_children):
            # prepend the line's id to every (term, weight) pair
            id, _, pairs = visited_children
            return [(id.text, *pair) for pair in pairs]

        def visit_expr(self, node, visited_children):
            # flatten the per-line lists into one list of rows
            return [item for lst in visited_children for item in lst]

    pv = PandasVisitor()
    result = pv.visit(tree)

    df = pd.DataFrame(result, columns=["Id", "Term", "weight"])
    print(df)

This yields:

       Id     Term weight
    0   1    frack  0.733
    1   1    shale  0.700
    2  10    space  0.645
    3  10  station  0.327
    4  10     nasa  0.258
    5   4   celebr  0.262
    6   4    bahar  0.345
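For comparison, the same token rules (an integer id, an alphabetic term, a decimal weight) can probably be expressed with the standard library's `re` module instead of a grammar. This is only a sketch under that assumption: the function name `parse_records` and the two patterns are illustrative and not part of the answer above, and the values are converted to `int` and `float` here rather than kept as strings.

```python
import re

# Hypothetical patterns mirroring the grammar's id, term and weight rules.
LINE_RE = re.compile(r"^\s*(\d+)\s*:(.*)$")
PAIR_RE = re.compile(r"([a-zA-Z]+)\s+(\d+(?:\.\d+)?)")


def parse_records(text):
    """Return a list of (id, term, weight) tuples with typed values."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines and lines without an "id:" prefix
        line_id, rest = int(match.group(1)), match.group(2)
        # Every "term weight" occurrence after the colon becomes one record.
        for term, weight in PAIR_RE.findall(rest):
            records.append((line_id, term, float(weight)))
    return records


sample = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""
records = parse_records(sample)
print(records[0])
```

The resulting list can be passed to `pd.DataFrame(records, columns=["Id", "Term", "weight"])` in the same way as `result`. One trade-off: unlike the grammar, which fails on input it cannot parse, this version silently skips text that does not match.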





answered 3 hours ago
Jan





















                            0














Here is another take on your question: build a list that holds one [Id, Term, weight] list per term, then produce the DataFrame from it.

    import pandas as pd

    file = r"give_your_path".replace('\\', '/')
    my_list_of_lists = []  # will hold one [Id, Term, weight] list per term
    with open(file, "r") as f:
        for line in f:  # loop over every line
            my_id, _, rest = line.partition(":")  # keep the Id to reuse it for every term
            for pair in rest.split(","):
                if pair.strip():  # skip the empty piece left by a trailing comma
                    my_list_of_lists.append([my_id.strip()] + pair.split())
    df = pd.DataFrame.from_records(my_list_of_lists)  # turn the lists into a DataFrame
    df.columns = ["Id", "Term", "weight"]  # name the columns





answered 4 hours ago
JoPapou13





















                                    0














It is possible to do this entirely in pandas:

    from io import StringIO
    import pandas as pd

    df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700,
    10: space 0.645, station 0.327, nasa 0.258,
    4: celebr 0.262, bahar 0.345 """), sep=":", header=None)

    # df:
    #     0  1
    # 0   1  frack 0.733, shale 0.700,
    # 1  10  space 0.645, station 0.327, nasa 0.258,
    # 2   4  celebr 0.262, bahar 0.345

Turn column 1 into a list and then expand it into one row per element:

    df[1] = df[1].str.split(",", expand=False)

    dfs = []
    for idx, rows in df.iterrows():
        # one small frame per input row, repeating the Id for every term
        dfslice = pd.DataFrame({"Id": [rows[0]] * len(rows[1]), "terms": rows[1]})
        dfs.append(dfslice)
    newdf = pd.concat(dfs, ignore_index=True)

    # this creates newdf:
    #    Id  terms
    # 0   1  frack 0.733
    # 1   1  shale 0.700
    # 2   1
    # 3  10  space 0.645
    # 4  10  station 0.327
    # 5  10  nasa 0.258
    # 6  10
    # 7   4  celebr 0.262
    # 8   4  bahar 0.345

Now we need to split the terms column on the space and drop the empty rows:

    newdf["terms"] = newdf["terms"].str.strip()
    newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
    newdf.columns = ["Id", "terms", "Term", "Weights"]
    newdf = newdf.drop("terms", axis=1).dropna()

Resulting newdf:

       Id     Term Weights
    0   1    frack   0.733
    1   1    shale   0.700
    3  10    space   0.645
    4  10  station   0.327
    5  10     nasa   0.258
    7   4   celebr   0.262
    8   4    bahar   0.345
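On pandas 0.25 or later, the explicit loop over rows can likely be replaced by `DataFrame.explode`, which gives every list element its own row and repeats the other columns. The following is a sketch under that version assumption, not the answer's own code; the column names passed to `names=` and the conversion of the weights to `float` are my choices.

```python
from io import StringIO

import pandas as pd

raw = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""

# Read "Id: rest-of-line" into two columns, splitting on the colon only.
df = pd.read_csv(StringIO(raw), sep=":", header=None, names=["Id", "terms"])

# Trim whitespace and the trailing comma, then split into one list per row.
df["terms"] = df["terms"].str.strip().str.rstrip(",").str.split(",")

# explode() turns each list element into its own row and repeats the Id.
newdf = df.explode("terms").reset_index(drop=True)

# Split "term weight" on whitespace into two new columns.
parts = newdf["terms"].str.split(expand=True)
newdf["Term"] = parts[0]
newdf["Weights"] = parts[1].astype(float)
newdf = newdf.drop(columns="terms")
print(newdf)
```

Because the trailing comma is removed before splitting, no empty rows are created and the final `dropna()` step is not needed in this variant.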





answered 4 hours ago
Rocky Li





















                                            0














Can I assume that there is exactly one space before each term?

    import pandas as pd

    df = pd.DataFrame(columns=['ID', 'Term', 'Weight'])
    with open('C:/random/d1', 'r') as readObject:
        for line in readObject:
            line = line.rstrip('\n')
            tempList1 = line.split(':')
            # drop surrounding whitespace and the trailing comma before splitting
            tempList2 = tempList1[1].strip().rstrip(',')
            tempList2 = tempList2.split(',')
            for item in tempList2:
                e = item.split()  # split() without an argument tolerates extra spaces
                tempRow = [tempList1[0], e[0], e[1]]
                df.loc[len(df)] = tempRow  # append the row at the end of df
    print(df)
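Whether exactly one space precedes each term matters mainly because of how the string is split. Here is a small sketch of the difference between `str.split(' ')` and `str.split()` with no argument; the sample string is an assumption, chosen to resemble one piece of a line after it has been split on commas.

```python
# Hypothetical piece of a line: a leading space and several spaces in the middle.
item = " frack   0.733"

# Splitting on a single literal space yields an empty string for every extra space.
by_space = item.split(" ")

# Splitting with no argument collapses runs of whitespace and ignores both ends.
by_whitespace = item.split()

print(by_space)       # ['', 'frack', '', '', '0.733']
print(by_whitespace)  # ['frack', '0.733']
```

With the no-argument form, the first element is the term and the second is the weight regardless of how many spaces surround them.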





answered 4 hours ago
Rebin





















                                                    -3














1) You can read the file row by row:

    with open('path/filename.txt', 'r') as f:
        content = f.readlines()

2) Then you can split each row on ':' to separate the index from the values, which are themselves separated by ',':

    content = [[part.strip() for part in x.split(':')] for x in content]

This will give you the following result:

    content = [
        ['1', 'frack 0.733, shale 0.700,'],
        ['10', 'space 0.645, station 0.327, nasa 0.258,'],
        ['4', 'celebr 0.262, bahar 0.345']]
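The nested list still has to be flattened into one row per term to reach the Id / Term / weight layout. A minimal sketch of that remaining step follows, starting from a `content` list like the one shown (hard-coded here so the example is self-contained); the name `rows` and the dictionary keys are illustrative assumptions.

```python
# `content` as produced by the split on ':' (hard-coded here for illustration).
content = [
    ["1", "frack 0.733, shale 0.700,"],
    ["10", "space 0.645, station 0.327, nasa 0.258,"],
    ["4", "celebr 0.262, bahar 0.345 "],
]

rows = []
for row_id, values in content:
    for pair in values.split(","):
        fields = pair.split()
        if len(fields) == 2:  # skips the empty piece after a trailing comma
            term, weight = fields
            rows.append({"Id": row_id, "Term": term, "weight": float(weight)})

print(rows[0])
```

Passing the list of dictionaries to `pd.DataFrame(rows)` would then build the frame, with one row per dictionary and the keys as column names.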





                                                    answered 4 hours ago









                                                    CedricLy







                                                    • 3





                                                      Your result is not the result asked for in the question.

                                                      – GiraffeMan91
                                                      4 hours ago



































