Splitting a huge word file into individual documents

rockymtnhigh

Hardly Normal
Original poster
Supporting Founder
Apr 14, 2006
Normal, IL
OK, here is my problem. I need to download more than 160 Supreme Court opinions from Lexis Nexis. I had the citations and managed to write the query syntax to generate all the files I need in one results list. Lexis then lets me select all the files and download them. But when it downloads them, it combines them all into one Word (or RTF or TXT) file.

But to do the content analysis for my research I need to have each case in a separate RTF file (so I can use it with both DevonThink Pro Office and HyperResearch). In Microsoft Word, the giant file (it's a few megabytes and 762 pages) has section breaks between each case.

I did some digging today and saw that if I switched to Outline View, I could go to Master Document and, at the beginning of each section, add a style. Then I could highlight that section and create a subdocument. I did this -- 135 times. I then saved the file as a new file, and it created 135 smaller files. I thought, great! This will be perfect. I'll bring the Word files into DevonThink Pro Office, and just use its Convert to RTF feature...

BUT... and you knew there was a but. DT can read the files, but it can't convert them. Something about the subdocument process has added code to them, and the only way it looks like I can get them into RTFs is to open each file in Word, highlight, copy, and paste into a new RTF in DevonThink. A royal pain.

Please tell me, somebody, there is an easier way that I am just missing. Any suggestions would be most appreciated.
 
I found a Word add-in, which only works on Windows, of course, that will split files by page breaks and/or section breaks. It worked. I'll have to do the file splitting on my old PC, but the goal is accomplished.
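For anyone stuck on a Mac, a rough sketch of the same split done on the plain-text (TXT) download instead of the Word file. This is only an illustration: it assumes each case in the export is separated by a form feed character, which you would need to confirm in your own file, and the names split_cases, write_cases, and the case_001.txt naming scheme are made up for this example.

```python
from pathlib import Path


def split_cases(text: str, delimiter: str = "\f") -> list[str]:
    """Split one combined export into per-case chunks, dropping empty ones.

    The form feed default is an assumption about how the breaks between
    cases show up in the TXT export; pass a different delimiter if needed.
    """
    return [chunk.strip() for chunk in text.split(delimiter) if chunk.strip()]


def write_cases(chunks: list[str], out_dir: Path, stem: str = "case") -> list[Path]:
    """Write each chunk to its own numbered file and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        path = out_dir / f"{stem}_{number:03d}.txt"
        path.write_text(chunk + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Usage would be along the lines of write_cases(split_cases(Path("export.txt").read_text(encoding="utf-8")), Path("cases")), where export.txt stands in for whatever the download is called.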
 
OK. I was going to say that this was a job for a macro, or possibly VBA (Visual Basic for Applications). But yes, all that only works on PCs. Glad you found a canned one.

I ran into a similar problem last week when an Excel VBA project got locked by someone. I took it home thinking I could break the lock on my Mac, only to run into a "VBA only works on Windows" message. I had to use the history manager to find an old version and then create a new file with the current data and the old VBA code and macros.
 
Yes, very frustrating. You would think there is a tool to extract text easily for Macs (as the files can be saved as RTF), but my searching did not find anything. I tried BBEdit, but could not make heads or tails of it. But I managed to get the data out of Lexis and into individual files; now I just have to rename them. Hundreds of them :)
 
I feel your pain. Kind of like when Airbus delivers a PDF copy of a spec, and I need to extract the several hundred actual requirements and get them into our requirements tracking and traceability tool (DOORS, for those who care). The last one they sent was formatted as bitmap images, and I was forced to use OCR to convert it to text and then hand-correct all the typos.
 
Do you - or anyone else reading - know how to formulate grep searches?

My documents are filled with things like this:


[***LEdHR1A] [1A] [***LEdHR2A] [2A] [***LEdHR3A] [3A] [***LEdHR4A] [4A] [***LEdHR5A] [5A] [***LEdHR6A] [6A] [***LEdHR7A] [7A] [***LEdHR8A] [8A] [***LEdHR9A] [9A] [***LEdHR10A] [10A] [***LEdHR11A] [11A] [***LEdHR12A] [12A]

which occur throughout the text. AND page numbers, that look like [**3003]

I want to get rid of all of that text. I do not need the headnotes that they refer to, and the pagination is also not needed, as one of my research tasks is to do word counts of how long various Supreme Court opinions are. BBEdit can do the search, but reading the manual and making heads or tails of it might take me years. :)
 
If you want to get rid of everything inside the brackets, along with the brackets themselves, you can simply use Word's Find and Replace (Ctrl+H) as follows:
Click the MORE button.
Select 'use wildcards'
Type into the Find What field the following: \[?*\]
Leave the Replace with field blank.
Hit 'Replace All'
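
For the plain-text copies, roughly the same cleanup can be sketched in Python. This is a minimal illustration, not a tested tool for Lexis output: strip_markers and word_count are names made up here, and the pattern removes every bracketed span, so it would also delete legitimate bracketed text in an opinion (a "[sic]", for example).

```python
import re

# Matches one bracketed marker such as [***LEdHR1A], [1A], or [**3003].
# [^\]]* stops at the first closing bracket, so adjacent markers are
# removed one at a time rather than as a single long span.
BRACKETED = re.compile(r"\[[^\]]*\]")


def strip_markers(text: str) -> str:
    """Delete bracketed markers, then collapse the runs of spaces left behind."""
    cleaned = BRACKETED.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", cleaned)


def word_count(text: str) -> int:
    """Count whitespace-separated words after the markers are removed."""
    return len(strip_markers(text).split())


sample = "[***LEdHR1A] [1A] The judgment [**3003] is affirmed."
print(strip_markers(sample).strip())  # The judgment is affirmed.
print(word_count(sample))             # 4
```

Since the goal mentioned earlier in the thread is word counts, counting after stripping keeps the headnote markers and page numbers from inflating the totals.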
 
Yes, but the [ makes it tough so I need to do it from my Mac instead of my phone.

 
Confused now. Mike was the one originally asking, and now John mentions the solution. Were you collaborating?

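The answer I gave was in Microsoft-speak because I thought that was the tool available. grep and sed are Unixisms, and a regular expression there is different from the MS wildcard expression (for a start, substitute . for ? in the pattern above). Note that .* is greedy in sed and would swallow everything from the first [ to the last ] on a line, so a negated class such as [^]]* is safer between the brackets.

grep is a search tool. sed is an editor where you can actually change stuff. vi, emacs and most other Unix editors support regular expressions as well.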
 
Helping a net friend out :) It took me about 20 minutes.

Rocky runs a Mac, which runs a BSD-derived Unix underneath, so he can do stuff that way too.

grep (the name comes from the ed command g/re/p: globally search for a regular expression and print) can be used as a filter, but it is too coarse a tool here. grep prints the entire line that contains a match, and the -v flag inverts that so only the non-matching lines are printed. That is useful in some instances. In this case it would throw away text that we still want.

sed (Stream EDitor) edits each line and changes only the matching strings within the line. It's better suited to this task.

This isn't my first text processing rodeo ;)
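
To make the grep versus sed difference above concrete, here is a small Python sketch of the two behaviours. It is an analogy only, not the script that was actually used: drop_matching_lines imitates what grep -v does and edit_within_lines imitates a sed substitution, and both function names are invented for this example.

```python
import re

# One bracketed marker, e.g. [**3003] or [***LEdHR1A].
MARKER = re.compile(r"\[[^\]]*\]")


def drop_matching_lines(lines: list[str]) -> list[str]:
    """grep -v style: discard every line that contains a marker."""
    return [line for line in lines if not MARKER.search(line)]


def edit_within_lines(lines: list[str]) -> list[str]:
    """sed style: delete only the markers and keep the rest of each line."""
    return [MARKER.sub("", line) for line in lines]


lines = ["[**3003] The Court held otherwise.", "No markers on this line."]
print(drop_matching_lines(lines))  # the first line is lost entirely
print(edit_within_lines(lines))    # both lines survive, minus the marker
```

The line filter loses "The Court held otherwise." along with the page marker, which is exactly the problem for word counts; the in-line substitution keeps it.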
 
Me neither. I go back to the days of troff on HP-UX. I actually was on the team that did the first port of SCI Unix to the HP desktops.

Also not upset or knee jerking. Just curious where it went as the data received was somewhat vague.

Glad it all worked out.
 
I appreciate the help from both of you. In the future, the Word solution will be easiest, as the files come in from Lexis in Word format, but the script is valuable too, as I now have a bunch of text files and it can clean them up.

Muchas gracias!
 
